From: Kan Liang <[email protected]>
This patchkit intends to support Intel core misc PMUs.
There are miscellaneous free running (read-only) counters in the core.
New PMUs, called core misc PMUs, are introduced to expose these
counters. The counters include TSC, IA32_APERF, IA32_MPERF,
IA32_PPERF, SMI_COUNT, CORE_C*_RESIDENCY and PKG_C*_RESIDENCY.
More may be added on future platforms.
Although these counters may be used simultaneously by other tools,
it still makes sense to implement them in perf, because they can then
be conveniently collected together with other events.
Furthermore, the handling of the free running counters is very different,
so it makes sense to put them into separate PMUs.
Here are some useful examples.
1. The ASTATE/MSTATE/TSC events can be used to calculate the frequency
during each sampling period.
$ perf record -e \
    '{ref-cycles,core_misc/tsc/,core_misc/power-mperf/,core_misc/power-aperf/}:S' \
    --running-time -a ~/tchain_edit
$ perf report --stdio --group --show-freq
# Samples: 71K of event 'anon group { ref-cycles, core_misc/tsc/, core_misc/power-mperf/, core_misc/power-aperf/ }'
# Event count (approx.): 215265868412
#
#                        Overhead  TSC MHz  AVG MHz  BZY MHz  Command       Shared Object     Symbol
# ...............................  .......  .......  .......  ............  ................  ..................................
#
  98.85%   5.41%  98.89%  98.95%     2293     1474     2302   tchain_edit   tchain_edit       [.] f3
   0.39%   1.64%   0.39%   0.37%     2295        1     3053   kworker/25:1  [kernel.vmlinux]  [k] delay_tsc
   0.08%  24.20%   0.07%   0.06%     2295       82     2746   swapper       [kernel.vmlinux]  [k] acpi_idle_do_entry
   0.05%   0.00%   0.05%   0.05%     2295     2289     2295   tchain_edit   tchain_edit       [.] f2
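The TSC MHz/AVG MHz/BZY MHz columns are derived from the group read
values and the per-sample running time, roughly as the later tools
patches compute them (time_running is in ns):
  TSC MHz = 1000 * tsc / time_running
  AVG MHz = 1000 * aperf / time_running
  BZY MHz = TSC MHz * mperf / aperf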
2. Calculate the CPU%
CPU_Utilization = CPU_CLK_UNHALTED.REF_TSC / TSC
$ perf stat -x, -e "ref-cycles,core_misc/tsc/" -C0 taskset -c 0 sleep 1
3481579,,ref-cycles
2301685567,,core_misc/tsc/
The CPU% for sleep is 0.15%.
$ perf stat -x, -e "ref-cycles,core_misc/tsc/" -C0 taskset -c 0 busyloop
11924042536,,ref-cycles
11929411840,,core_misc/tsc/
The CPU% for busyloop is 99.95%.
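Working through the numbers above: 3481579 / 2301685567 ~= 0.0015,
i.e. ~0.15% for sleep, and 11924042536 / 11929411840 ~= 0.9995,
i.e. ~99.95% for busyloop.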
3. Calculate the fraction of time the core spends in the C6 state
CORE_C6_time% = CORE_C6_RESIDENCY / TSC
$ perf stat -x, -e"power_core/c6-residency/,core_misc/tsc/" -C0
-- taskset -c 0 sleep 1
2287199396,,power_core/c6-residency/
2297755875,,core_misc/tsc/
For sleep, the core is in the C6 state 99.5% of the time.
$ perf stat -x, -e"power_core/c6-residency/,core_misc/tsc/" -C0
-- taskset -c 0 busyloop
1330044,,power_core/c6-residency/
9932928928,,core_misc/tsc/
For busyloop, the core is in the C6 state 0.01% of the time.
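Working through the numbers above: 2287199396 / 2297755875 ~= 0.995
for sleep, and 1330044 / 9932928928 ~= 0.0001 for busyloop.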
Kan Liang (9):
perf/x86: Add Intel core misc PMUs support
perf/x86: core_misc PMU disable and enable support
perf/x86: Add is_hardware_event
perf/x86: special case per-cpu core misc PMU events
perf,tools: open event with its own cpus and threads
perf,tools: Dump per-sample freq in report -D
perf,tools: save APERF/MPERF/TSC in struct perf_sample
perf,tools: calculate and save tsc/avg/bzy freq in he_stat
perf,tools: Show freq in perf report --stdio
arch/x86/include/asm/perf_event.h | 2 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/perf_event.h | 10 +
arch/x86/kernel/cpu/perf_event_intel.c | 4 +
arch/x86/kernel/cpu/perf_event_intel_core_misc.c | 933 +++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_core_misc.h | 96 +++
include/linux/perf_event.h | 17 +-
include/linux/sched.h | 1 +
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 15 +-
tools/perf/Documentation/perf-report.txt | 10 +
tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-diff.c | 2 +-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-report.c | 17 +
tools/perf/perf.h | 1 +
tools/perf/tests/hists_link.c | 4 +-
tools/perf/ui/hist.c | 69 +-
tools/perf/util/event.h | 3 +
tools/perf/util/hist.c | 52 +-
tools/perf/util/hist.h | 5 +
tools/perf/util/session.c | 60 +-
tools/perf/util/session.h | 4 +
tools/perf/util/sort.c | 3 +
tools/perf/util/sort.h | 3 +
tools/perf/util/symbol.h | 9 +-
tools/perf/util/util.c | 4 +
27 files changed, 1304 insertions(+), 26 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_core_misc.c
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_core_misc.h
--
1.8.3.1
From: Kan Liang <[email protected]>
There are miscellaneous free running (read-only) counters in the core.
These counters may be used simultaneously by other tools, such as
turbostat. However, it still makes sense to implement them in perf,
because they can then be conveniently collected together with other
events, and tools can use them without special MSR access code.
Furthermore, the handling of the free running counters is very
different, so it makes sense to put them into separate PMUs.
These counters include TSC, IA32_APERF, IA32_MPERF, IA32_PPERF,
SMI_COUNT, CORE_C*_RESIDENCY and PKG_C*_RESIDENCY.
This patch adds new PMUs to support these counters, including helper
functions that add/delete events.
According to counters' scope and category, three PMUs are registered
with the perf_event core subsystem.
- 'core_misc': The counters are available for each logical processor.
  They include TSC, IA32_APERF, IA32_MPERF, IA32_PPERF and
  SMI_COUNT.
- 'power_core': The counters are available for each processor core.
  They include CORE_C*_RESIDENCY, which is power related.
- 'power_pkg': The counters are available for each physical package.
  They include PKG_C*_RESIDENCY, which is power related.
The events are exposed in sysfs for use by perf stat and other tools.
The files are:
/sys/devices/core_misc/events/power-aperf
/sys/devices/core_misc/events/power-mperf
/sys/devices/core_misc/events/power-pperf
/sys/devices/core_misc/events/smi-count
/sys/devices/core_misc/events/tsc
/sys/devices/power_core/events/c*-residency
/sys/devices/power_pkg/events/c*-residency
These events only support system-wide mode counting. For
power_core/power_pkg, measuring only one CPU per core/socket is
sufficient. The /sys/devices/power_*/cpumask file can be used by tools
to figure out which CPUs to monitor by default.
The PMU type (attr->type) is dynamically allocated and is available from
/sys/devices/core_misc/type and /sys/devices/power_*/type.
Sampling is not supported.
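For reference only (not part of this patch), a minimal userspace
sketch of how a tool could consume these sysfs files: read the
dynamically allocated PMU type and count the TSC event (defined as
event=0x05 in this patch) system-wide on CPU 0 via perf_event_open():

/* Minimal sketch, not part of this patch. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        uint64_t count;
        FILE *f;
        int type, fd;

        /* PMU type is dynamically allocated; read it from sysfs */
        f = fopen("/sys/devices/core_misc/type", "r");
        if (!f || fscanf(f, "%d", &type) != 1)
                return 1;
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = 0x05;     /* core_misc/tsc/ per events/tsc above */

        /* pid == -1, cpu == 0: system-wide counting on CPU 0 */
        fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        if (fd < 0)
                return 1;

        sleep(1);
        read(fd, &count, sizeof(count));
        printf("tsc: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
}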
Here are some examples.
1. To calculate the CPU%
CPU_Utilization = CPU_CLK_UNHALTED.REF_TSC / TSC
$ perf stat -x, -e "ref-cycles,core_misc/tsc/" -C0 taskset -c 0 sleep 1
3481579,,ref-cycles
2301685567,,core_misc/tsc/
The CPU% for sleep is 0.15%.
$ perf stat -x, -e "ref-cycles,core_misc/tsc/" -C0 taskset -c 0
busyloop
11924042536,,ref-cycles
11929411840,,core_misc/tsc/
The CPU% for busyloop is 99.95%.
2. To calculate the fraction of time the core spends in the C6 state
CORE_C6_time% = CORE_C6_RESIDENCY / TSC
$ perf stat -x, -e"power_core/c6-residency/,core_misc/tsc/" -C0
-- taskset -c 0 sleep 1
2287199396,,power_core/c6-residency/
2297755875,,core_misc/tsc/
For sleep, the core is in the C6 state 99.5% of the time.
$ perf stat -x, -e"power_core/c6-residency/,core_misc/tsc/" -C0
-- taskset -c 0 busyloop
1330044,,power_core/c6-residency/
9932928928,,core_misc/tsc/
For busyloop, the core is in the C6 state 0.01% of the time.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/perf_event_intel_core_misc.c | 890 +++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_core_misc.h | 96 +++
3 files changed, 987 insertions(+)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_core_misc.c
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_core_misc.h
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..a516820 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o perf_event_intel_cqm.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_core_misc.o
obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
new file mode 100644
index 0000000..c6c82ac
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
@@ -0,0 +1,890 @@
+/*
+ * perf_event_intel_core_misc.c: support miscellaneous core counters
+ *
+ * Copyright (C) 2015, Intel Corp.
+ * Author: Kan Liang ([email protected])
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Library General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Library General Public License for more details.
+ *
+ */
+
+/*
+ * This file exports miscellaneous free running (read-only) counters
+ * in the core for perf. These counters may be used simultaneously by
+ * other tools, such as turbostat. However, it still makes sense to
+ * implement them in perf, because they can then be conveniently
+ * collected together with other events, and tools can use them
+ * without special MSR access code.
+ *
+ * The events only support system-wide mode counting. There is no
+ * sampling support because it is not supported by the hardware.
+ *
+ * All of these counters are specified in the Intel® 64 and IA-32
+ * Architectures Software Developer's Manual, Vol. 3B.
+ *
+ * Architectural counters:
+ * TSC: time-stamp counter (Section 17.13)
+ * perf code: 0x05
+ * APERF: Actual Performance Clock Counter (Section 14.1)
+ * perf code: 0x01
+ * MPERF: TSC Frequency Clock Counter (Section 14.1)
+ * perf code: 0x02
+ *
+ * Model specific counters:
+ * PPERF: Productive Performance Count. (See Section 14.4.5.1)
+ * perf code: 0x03
+ * Available model: SLM server
+ * SMI_COUNT: SMI Counter
+ * perf code: 0x04
+ * Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW
+ * MSR_CORE_C1_RES: CORE C1 Residency Counter
+ * perf code: 0x06
+ * Available model: SLM,AMT
+ * Scope: Core (each processor core has a MSR)
+ * MSR_CORE_C3_RESIDENCY: CORE C3 Residency Counter
+ * perf code: 0x07
+ * Available model: NHM,WSM,SNB,IVB,HSW,BDW
+ * Scope: Core
+ * MSR_CORE_C6_RESIDENCY: CORE C6 Residency Counter
+ * perf code: 0x08
+ * Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW
+ * Scope: Core
+ * MSR_CORE_C7_RESIDENCY: CORE C7 Residency Counter
+ * perf code: 0x09
+ * Available model: SNB,IVB,HSW,BDW
+ * Scope: Core
+ * MSR_PKG_C2_RESIDENCY: Package C2 Residency Counter.
+ * perf code: 0x0a
+ * Available model: SNB,IVB,HSW,BDW
+ * Scope: Package (physical package)
+ * MSR_PKG_C3_RESIDENCY: Package C3 Residency Counter.
+ * perf code: 0x0b
+ * Available model: NHM,WSM,SNB,IVB,HSW,BDW
+ * Scope: Package (physical package)
+ * MSR_PKG_C6_RESIDENCY: Package C6 Residency Counter.
+ * perf code: 0x0c
+ * Available model: NHM,WSM,SNB,IVB,HSW,BDW
+ * Scope: Package (physical package)
+ * MSR_PKG_C7_RESIDENCY: Package C7 Residency Counter.
+ * perf code: 0x0d
+ * Available model: NHM,WSM,SNB,IVB,HSW,BDW
+ * Scope: Package (physical package)
+ * MSR_PKG_C8_RESIDENCY: Package C8 Residency Counter.
+ * perf code: 0x0e
+ * Available model: HSW ULT only
+ * Scope: Package (physical package)
+ * MSR_PKG_C9_RESIDENCY: Package C9 Residency Counter.
+ * perf code: 0x0f
+ * Available model: HSW ULT only
+ * Scope: Package (physical package)
+ * MSR_PKG_C10_RESIDENCY: Package C10 Residency Counter.
+ * perf code: 0x10
+ * Available model: HSW ULT only
+ * Scope: Package (physical package)
+ * MSR_SLM_PKG_C6_RESIDENCY: Package C6 Residency Counter for SLM.
+ * perf code: 0x11
+ * Available model: SLM,AMT
+ * Scope: Package (physical package)
+ *
+ */
+
+#include "perf_event_intel_core_misc.h"
+
+static struct intel_core_misc_type *empty_core_misc[] = { NULL, };
+struct intel_core_misc_type **core_misc = empty_core_misc;
+
+static struct perf_core_misc_event_msr core_misc_events[] = {
+ { PERF_POWER_APERF, MSR_IA32_APERF },
+ { PERF_POWER_MPERF, MSR_IA32_MPERF },
+ { PERF_POWER_PPERF, MSR_PPERF },
+ { PERF_SMI_COUNT, MSR_SMI_COUNT},
+ { PERF_TSC, 0 },
+ { PERF_POWER_CORE_C1_RES, MSR_CORE_C1_RES },
+ { PERF_POWER_CORE_C3_RES, MSR_CORE_C3_RESIDENCY },
+ { PERF_POWER_CORE_C6_RES, MSR_CORE_C6_RESIDENCY },
+ { PERF_POWER_CORE_C7_RES, MSR_CORE_C7_RESIDENCY },
+ { PERF_POWER_PKG_C2_RES, MSR_PKG_C2_RESIDENCY },
+ { PERF_POWER_PKG_C3_RES, MSR_PKG_C3_RESIDENCY },
+ { PERF_POWER_PKG_C6_RES, MSR_PKG_C6_RESIDENCY },
+ { PERF_POWER_PKG_C7_RES, MSR_PKG_C7_RESIDENCY },
+ { PERF_POWER_PKG_C8_RES, MSR_PKG_C8_RESIDENCY },
+ { PERF_POWER_PKG_C9_RES, MSR_PKG_C9_RESIDENCY },
+ { PERF_POWER_PKG_C10_RES, MSR_PKG_C10_RESIDENCY },
+ { PERF_POWER_SLM_PKG_C6_RES, MSR_PKG_C7_RESIDENCY },
+};
+
+EVENT_ATTR_STR(power-aperf, power_aperf, "event=0x01");
+EVENT_ATTR_STR(power-mperf, power_mperf, "event=0x02");
+EVENT_ATTR_STR(power-pperf, power_pperf, "event=0x03");
+EVENT_ATTR_STR(smi-count, smi_count, "event=0x04");
+EVENT_ATTR_STR(tsc, clock_tsc, "event=0x05");
+EVENT_ATTR_STR(c1-residency, power_core_c1_res, "event=0x06");
+EVENT_ATTR_STR(c3-residency, power_core_c3_res, "event=0x07");
+EVENT_ATTR_STR(c6-residency, power_core_c6_res, "event=0x08");
+EVENT_ATTR_STR(c7-residency, power_core_c7_res, "event=0x09");
+EVENT_ATTR_STR(c2-residency, power_pkg_c2_res, "event=0x0a");
+EVENT_ATTR_STR(c3-residency, power_pkg_c3_res, "event=0x0b");
+EVENT_ATTR_STR(c6-residency, power_pkg_c6_res, "event=0x0c");
+EVENT_ATTR_STR(c7-residency, power_pkg_c7_res, "event=0x0d");
+EVENT_ATTR_STR(c8-residency, power_pkg_c8_res, "event=0x0e");
+EVENT_ATTR_STR(c9-residency, power_pkg_c9_res, "event=0x0f");
+EVENT_ATTR_STR(c10-residency, power_pkg_c10_res, "event=0x10");
+EVENT_ATTR_STR(c6-residency, power_slm_pkg_c6_res, "event=0x11");
+
+static cpumask_t core_misc_core_cpu_mask;
+static cpumask_t core_misc_pkg_cpu_mask;
+
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_pmu);
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_pmu_to_free);
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_core_pmu);
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_core_pmu_to_free);
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_pkg_pmu);
+static DEFINE_PER_CPU(struct core_misc_pmu *, core_misc_pkg_pmu_to_free);
+
+#define __GET_CORE_MISC_PMU_RETURN(core_misc_pmu) \
+{ \
+ pmu = per_cpu(core_misc_pmu, event->cpu); \
+ if (pmu && (pmu->pmu->type == event->pmu->type)) \
+ return pmu; \
+}
+static struct core_misc_pmu *get_core_misc_pmu(struct perf_event *event)
+{
+ struct core_misc_pmu *pmu;
+
+ __GET_CORE_MISC_PMU_RETURN(core_misc_pmu);
+ __GET_CORE_MISC_PMU_RETURN(core_misc_core_pmu);
+ __GET_CORE_MISC_PMU_RETURN(core_misc_pkg_pmu);
+
+ return NULL;
+}
+
+static int core_misc_pmu_event_init(struct perf_event *event)
+{
+ u64 cfg = event->attr.config & CORE_MISC_EVENT_MASK;
+ int ret = 0;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /*
+ * check event is known (determines counter)
+ */
+ if (!cfg || (cfg >= PERF_CORE_MISC_EVENT_MAX))
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ /* must be done before validate_group */
+ event->hw.event_base = core_misc_events[cfg-1].msr;
+ event->hw.config = cfg;
+ event->hw.idx = core_misc_events[cfg-1].id;
+
+ return ret;
+}
+
+static u64 core_misc_pmu_read_counter(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 val;
+
+ if (hwc->idx == PERF_TSC)
+ val = rdtsc();
+ else
+ rdmsrl_safe(event->hw.event_base, &val);
+ return val;
+}
+
+static void core_misc_pmu_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev_raw_count, new_raw_count;
+ s64 delta;
+ int shift = 0;
+
+ if (hwc->idx == PERF_SMI_COUNT)
+ shift = 32;
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ new_raw_count = core_misc_pmu_read_counter(event);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count) {
+ cpu_relax();
+ goto again;
+ }
+
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+
+ local64_add(delta, &event->count);
+}
+
+static void __core_misc_pmu_event_start(struct core_misc_pmu *pmu,
+ struct perf_event *event)
+{
+ if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+ return;
+
+ event->hw.state = 0;
+ list_add_tail(&event->active_entry, &pmu->active_list);
+ local64_set(&event->hw.prev_count, core_misc_pmu_read_counter(event));
+ pmu->n_active++;
+}
+
+static void core_misc_pmu_event_start(struct perf_event *event, int mode)
+{
+ struct core_misc_pmu *pmu = get_core_misc_pmu(event);
+ unsigned long flags;
+
+ if (pmu == NULL)
+ return;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+ __core_misc_pmu_event_start(pmu, event);
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void core_misc_pmu_event_stop(struct perf_event *event, int mode)
+{
+ struct core_misc_pmu *pmu = get_core_misc_pmu(event);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ if (pmu == NULL)
+ return;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+
+ /* mark event as deactivated and stopped */
+ if (!(hwc->state & PERF_HES_STOPPED)) {
+ WARN_ON_ONCE(pmu->n_active <= 0);
+ pmu->n_active--;
+
+ list_del(&event->active_entry);
+
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* check if update of sw counter is necessary */
+ if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ core_misc_pmu_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void core_misc_pmu_event_del(struct perf_event *event, int mode)
+{
+ core_misc_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int core_misc_pmu_event_add(struct perf_event *event, int mode)
+{
+ struct core_misc_pmu *pmu = get_core_misc_pmu(event);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ if (pmu == NULL)
+ return -EINVAL;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_START)
+ __core_misc_pmu_event_start(pmu, event);
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+
+ return 0;
+}
+
+static ssize_t core_misc_get_attr_cpumask(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+ struct intel_core_misc_type *type;
+ int i;
+
+ for (i = 0; core_misc[i]; i++) {
+ type = core_misc[i];
+ if (type->pmu.type == pmu->type) {
+ switch (type->type) {
+ case perf_intel_core_misc_core:
+ return cpumap_print_to_pagebuf(true, buf, &core_misc_core_cpu_mask);
+ case perf_intel_core_misc_pkg:
+ return cpumap_print_to_pagebuf(true, buf, &core_misc_pkg_cpu_mask);
+ default:
+ return 0;
+ }
+ }
+ }
+
+ return 0;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, core_misc_get_attr_cpumask, NULL);
+
+static struct attribute *core_misc_pmu_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL,
+};
+
+static struct attribute_group core_misc_pmu_attr_group = {
+ .attrs = core_misc_pmu_attrs,
+};
+
+DEFINE_CORE_MISC_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *core_misc_formats_attr[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+
+static struct attribute_group core_misc_pmu_format_group = {
+ .name = "format",
+ .attrs = core_misc_formats_attr,
+};
+
+static struct attribute *nhm_core_misc_events_attr[] = {
+ EVENT_PTR(power_aperf),
+ EVENT_PTR(power_mperf),
+ EVENT_PTR(smi_count),
+ EVENT_PTR(clock_tsc),
+ NULL,
+};
+
+static struct attribute_group nhm_core_misc_pmu_events_group = {
+ .name = "events",
+ .attrs = nhm_core_misc_events_attr,
+};
+
+const struct attribute_group *nhm_core_misc_attr_groups[] = {
+ &core_misc_pmu_format_group,
+ &nhm_core_misc_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type nhm_core_misc = {
+ .name = "core_misc",
+ .type = perf_intel_core_misc_thread,
+ .pmu_group = nhm_core_misc_attr_groups,
+};
+
+static struct attribute *nhm_power_core_events_attr[] = {
+ EVENT_PTR(power_core_c3_res),
+ EVENT_PTR(power_core_c6_res),
+ NULL,
+};
+
+static struct attribute_group nhm_power_core_pmu_events_group = {
+ .name = "events",
+ .attrs = nhm_power_core_events_attr,
+};
+
+const struct attribute_group *nhm_power_core_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &nhm_power_core_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type nhm_power_core = {
+ .name = "power_core",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = nhm_power_core_attr_groups,
+};
+
+static struct attribute *nhm_power_pkg_events_attr[] = {
+ EVENT_PTR(power_pkg_c3_res),
+ EVENT_PTR(power_pkg_c6_res),
+ EVENT_PTR(power_pkg_c7_res),
+ NULL,
+};
+
+static struct attribute_group nhm_power_pkg_pmu_events_group = {
+ .name = "events",
+ .attrs = nhm_power_pkg_events_attr,
+};
+
+const struct attribute_group *nhm_power_pkg_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &nhm_power_pkg_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type nhm_power_pkg = {
+ .name = "power_pkg",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = nhm_power_pkg_attr_groups,
+};
+
+static struct intel_core_misc_type *nhm_core_misc_types[] = {
+ &nhm_core_misc,
+ &nhm_power_core,
+ &nhm_power_pkg,
+};
+
+static struct attribute *slm_power_core_events_attr[] = {
+ EVENT_PTR(power_core_c1_res),
+ EVENT_PTR(power_core_c6_res),
+ NULL,
+};
+
+static struct attribute_group slm_power_core_pmu_events_group = {
+ .name = "events",
+ .attrs = slm_power_core_events_attr,
+};
+
+const struct attribute_group *slm_power_core_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &slm_power_core_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type slm_power_core = {
+ .name = "power_core",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = slm_power_core_attr_groups,
+};
+
+static struct attribute *slm_power_pkg_events_attr[] = {
+ EVENT_PTR(power_slm_pkg_c6_res),
+ NULL,
+};
+
+static struct attribute_group slm_power_pkg_pmu_events_group = {
+ .name = "events",
+ .attrs = slm_power_pkg_events_attr,
+};
+
+const struct attribute_group *slm_power_pkg_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &slm_power_pkg_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type slm_power_pkg = {
+ .name = "power_pkg",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = slm_power_pkg_attr_groups,
+};
+
+static struct intel_core_misc_type *slm_core_misc_types[] = {
+ &nhm_core_misc,
+ &slm_power_core,
+ &slm_power_pkg,
+};
+
+static struct attribute *slm_s_core_misc_events_attr[] = {
+ EVENT_PTR(power_aperf),
+ EVENT_PTR(power_mperf),
+ EVENT_PTR(power_pperf),
+ EVENT_PTR(smi_count),
+ EVENT_PTR(clock_tsc),
+ NULL,
+};
+
+static struct attribute_group slm_s_core_misc_pmu_events_group = {
+ .name = "events",
+ .attrs = slm_s_core_misc_events_attr,
+};
+
+const struct attribute_group *slm_s_core_misc_attr_groups[] = {
+ &core_misc_pmu_format_group,
+ &slm_s_core_misc_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type slm_s_core_misc = {
+ .name = "core_misc",
+ .type = perf_intel_core_misc_thread,
+ .pmu_group = slm_s_core_misc_attr_groups,
+};
+
+static struct intel_core_misc_type *slm_s_core_misc_types[] = {
+ &slm_s_core_misc,
+ &slm_power_core,
+ &slm_power_pkg,
+};
+
+static struct attribute *snb_power_core_events_attr[] = {
+ EVENT_PTR(power_core_c3_res),
+ EVENT_PTR(power_core_c6_res),
+ EVENT_PTR(power_core_c7_res),
+ NULL,
+};
+
+static struct attribute_group snb_power_core_pmu_events_group = {
+ .name = "events",
+ .attrs = snb_power_core_events_attr,
+};
+
+const struct attribute_group *snb_power_core_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &snb_power_core_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type snb_power_core = {
+ .name = "power_core",
+ .type = perf_intel_core_misc_core,
+ .pmu_group = snb_power_core_attr_groups,
+};
+
+static struct attribute *snb_power_pkg_events_attr[] = {
+ EVENT_PTR(power_pkg_c2_res),
+ EVENT_PTR(power_pkg_c3_res),
+ EVENT_PTR(power_pkg_c6_res),
+ EVENT_PTR(power_pkg_c7_res),
+ NULL,
+};
+
+static struct attribute_group snb_power_pkg_pmu_events_group = {
+ .name = "events",
+ .attrs = snb_power_pkg_events_attr,
+};
+
+const struct attribute_group *snb_power_pkg_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &snb_power_pkg_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type snb_power_pkg = {
+ .name = "power_pkg",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = snb_power_pkg_attr_groups,
+};
+
+static struct intel_core_misc_type *snb_core_misc_types[] = {
+ &nhm_core_misc,
+ &snb_power_core,
+ &snb_power_pkg,
+ NULL,
+};
+
+static struct attribute *hsw_ult_power_pkg_events_attr[] = {
+ EVENT_PTR(power_pkg_c2_res),
+ EVENT_PTR(power_pkg_c3_res),
+ EVENT_PTR(power_pkg_c6_res),
+ EVENT_PTR(power_pkg_c7_res),
+ EVENT_PTR(power_pkg_c8_res),
+ EVENT_PTR(power_pkg_c9_res),
+ EVENT_PTR(power_pkg_c10_res),
+ NULL,
+};
+
+static struct attribute_group hsw_ult_power_pkg_pmu_events_group = {
+ .name = "events",
+ .attrs = hsw_ult_power_pkg_events_attr,
+};
+
+const struct attribute_group *hsw_ult_power_pkg_attr_groups[] = {
+ &core_misc_pmu_attr_group,
+ &core_misc_pmu_format_group,
+ &hsw_ult_power_pkg_pmu_events_group,
+ NULL,
+};
+
+static struct intel_core_misc_type hsw_ult_power_pkg = {
+ .name = "power_pkg",
+ .type = perf_intel_core_misc_pkg,
+ .pmu_group = hsw_ult_power_pkg_attr_groups,
+};
+
+static struct intel_core_misc_type *hsw_ult_core_misc_types[] = {
+ &nhm_core_misc,
+ &snb_power_core,
+ &hsw_ult_power_pkg,
+};
+
+#define __CORE_MISC_CPU_EXIT(_type, _cpu_mask, fn) \
+{ \
+ pmu = per_cpu(core_misc_ ## _type, cpu); \
+ if (pmu) { \
+ id = fn(cpu); \
+ target = -1; \
+ for_each_online_cpu(i) { \
+ if (i == cpu) \
+ continue; \
+ if (id == fn(i)) { \
+ target = i; \
+ break; \
+ } \
+ } \
+ if (cpumask_test_and_clear_cpu(cpu, &core_misc_ ## _cpu_mask) && target >= 0) \
+ cpumask_set_cpu(target, &core_misc_ ## _cpu_mask); \
+ WARN_ON(cpumask_empty(&core_misc_ ## _cpu_mask)); \
+ if (target >= 0) \
+ perf_pmu_migrate_context(pmu->pmu, cpu, target); \
+ } \
+}
+
+static void core_misc_cpu_exit(int cpu)
+{
+ struct core_misc_pmu *pmu;
+ int i, id, target;
+
+ __CORE_MISC_CPU_EXIT(core_pmu, core_cpu_mask, topology_core_id);
+ __CORE_MISC_CPU_EXIT(pkg_pmu, pkg_cpu_mask, topology_physical_package_id);
+}
+
+#define __CORE_MISC_CPU_INIT(_type, _cpu_mask, fn) \
+{ \
+ pmu = per_cpu(core_misc_ ## _type, cpu); \
+ if (pmu) { \
+ id = fn(cpu); \
+ for_each_cpu(i, &core_misc_ ## _cpu_mask) { \
+ if (id == fn(i)) \
+ break; \
+ } \
+ if (i >= nr_cpu_ids) \
+ cpumask_set_cpu(cpu, &core_misc_ ## _cpu_mask); \
+ } \
+}
+
+static void core_misc_cpu_init(int cpu)
+{
+ int i, id;
+ struct core_misc_pmu *pmu;
+
+ __CORE_MISC_CPU_INIT(core_pmu, core_cpu_mask, topology_core_id);
+ __CORE_MISC_CPU_INIT(pkg_pmu, pkg_cpu_mask, topology_physical_package_id);
+}
+
+#define __CORE_MISC_CPU_PREPARE(core_misc_pmu, type) \
+{ \
+ pmu = per_cpu(core_misc_pmu, cpu); \
+ if (pmu) \
+ break; \
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu)); \
+ spin_lock_init(&pmu->lock); \
+ INIT_LIST_HEAD(&pmu->active_list); \
+ pmu->pmu = &type->pmu; \
+ per_cpu(core_misc_pmu, cpu) = pmu; \
+}
+
+static int core_misc_cpu_prepare(int cpu)
+{
+ struct core_misc_pmu *pmu;
+ struct intel_core_misc_type *type;
+ int i;
+
+ for (i = 0; core_misc[i]; i++) {
+ type = core_misc[i];
+
+ switch (type->type) {
+ case perf_intel_core_misc_thread:
+ __CORE_MISC_CPU_PREPARE(core_misc_pmu, type)
+ break;
+ case perf_intel_core_misc_core:
+ __CORE_MISC_CPU_PREPARE(core_misc_core_pmu, type);
+ break;
+ case perf_intel_core_misc_pkg:
+ __CORE_MISC_CPU_PREPARE(core_misc_pkg_pmu, type);
+ break;
+ }
+ }
+
+ return 0;
+}
+
+#define __CORE_MISC_CPU_KREE(pmu_to_free) \
+{ \
+ if (per_cpu(pmu_to_free, cpu)) { \
+ kfree(per_cpu(pmu_to_free, cpu)); \
+ per_cpu(pmu_to_free, cpu) = NULL; \
+ } \
+}
+
+static void core_misc_cpu_kfree(int cpu)
+{
+ __CORE_MISC_CPU_KREE(core_misc_pmu_to_free);
+ __CORE_MISC_CPU_KREE(core_misc_core_pmu_to_free);
+ __CORE_MISC_CPU_KREE(core_misc_pkg_pmu_to_free);
+}
+
+#define __CORE_MISC_CPU_DYING(pmu, pmu_to_free) \
+{ \
+ if (per_cpu(pmu, cpu)) { \
+ per_cpu(pmu_to_free, cpu) = per_cpu(pmu, cpu); \
+ per_cpu(pmu, cpu) = NULL; \
+ } \
+}
+
+static int core_misc_cpu_dying(int cpu)
+{
+ __CORE_MISC_CPU_DYING(core_misc_pmu, core_misc_pmu_to_free);
+ __CORE_MISC_CPU_DYING(core_misc_core_pmu, core_misc_core_pmu_to_free);
+ __CORE_MISC_CPU_DYING(core_misc_pkg_pmu, core_misc_pkg_pmu_to_free);
+
+ return 0;
+}
+static int core_misc_cpu_notifier(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_UP_PREPARE:
+ core_misc_cpu_prepare(cpu);
+ break;
+ case CPU_STARTING:
+ core_misc_cpu_init(cpu);
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DYING:
+ core_misc_cpu_dying(cpu);
+ break;
+ case CPU_ONLINE:
+ case CPU_DEAD:
+ core_misc_cpu_kfree(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ core_misc_cpu_exit(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+#define CORE_MISC_CPU(_model, _ops) { \
+ .vendor = X86_VENDOR_INTEL, \
+ .family = 6, \
+ .model = _model, \
+ .driver_data = (kernel_ulong_t)&_ops, \
+ }
+
+static const struct x86_cpu_id core_misc_ids[] __initconst = {
+ CORE_MISC_CPU(0x37, slm_core_misc_types),/* Silvermont */
+ CORE_MISC_CPU(0x4d, slm_s_core_misc_types),/* Silvermont Avoton/Rangely */
+ CORE_MISC_CPU(0x4c, slm_core_misc_types),/* Airmont */
+ CORE_MISC_CPU(0x1e, nhm_core_misc_types),/* Nehalem */
+ CORE_MISC_CPU(0x1a, nhm_core_misc_types),/* Nehalem-EP */
+ CORE_MISC_CPU(0x2e, nhm_core_misc_types),/* Nehalem-EX */
+ CORE_MISC_CPU(0x25, nhm_core_misc_types),/* Westmere */
+ CORE_MISC_CPU(0x2c, nhm_core_misc_types),/* Westmere-EP */
+ CORE_MISC_CPU(0x2f, nhm_core_misc_types),/* Westmere-EX */
+ CORE_MISC_CPU(0x2a, snb_core_misc_types),/* SandyBridge */
+ CORE_MISC_CPU(0x2d, snb_core_misc_types),/* SandyBridge-E/EN/EP */
+ CORE_MISC_CPU(0x3a, snb_core_misc_types),/* IvyBridge */
+ CORE_MISC_CPU(0x3e, snb_core_misc_types),/* IvyBridge-EP/EX */
+ CORE_MISC_CPU(0x3c, snb_core_misc_types),/* Haswell Core */
+ CORE_MISC_CPU(0x3f, snb_core_misc_types),/* Haswell Server */
+ CORE_MISC_CPU(0x46, snb_core_misc_types),/* Haswell + GT3e */
+ CORE_MISC_CPU(0x45, hsw_ult_core_misc_types),/* Haswell ULT */
+ CORE_MISC_CPU(0x3d, snb_core_misc_types),/* Broadwell Core-M */
+ CORE_MISC_CPU(0x56, snb_core_misc_types),/* Broadwell Xeon D */
+ CORE_MISC_CPU(0x47, snb_core_misc_types),/* Broadwell + GT3e */
+ CORE_MISC_CPU(0x4f, snb_core_misc_types),/* Broadwell Server */
+ {}
+};
+
+static int __init core_misc_init(void)
+{
+ const struct x86_cpu_id *id;
+
+ id = x86_match_cpu(core_misc_ids);
+ if (!id)
+ return -ENODEV;
+
+ core_misc = (struct intel_core_misc_type **)id->driver_data;
+
+ return 0;
+}
+
+static void __init core_misc_cpumask_init(void)
+{
+ int cpu, err;
+
+ cpu_notifier_register_begin();
+
+ for_each_online_cpu(cpu) {
+ err = core_misc_cpu_prepare(cpu);
+ if (err) {
+ pr_info(" CPU prepare failed\n");
+ cpu_notifier_register_done();
+ return;
+ }
+ core_misc_cpu_init(cpu);
+ }
+
+ __perf_cpu_notifier(core_misc_cpu_notifier);
+
+ cpu_notifier_register_done();
+}
+
+static void __init core_misc_pmus_register(void)
+{
+ struct intel_core_misc_type *type;
+ int i, err;
+
+ for (i = 0; core_misc[i]; i++) {
+ type = core_misc[i];
+
+ type->pmu = (struct pmu) {
+ .attr_groups = type->pmu_group,
+ .task_ctx_nr = perf_invalid_context,
+ .event_init = core_misc_pmu_event_init,
+ .add = core_misc_pmu_event_add, /* must have */
+ .del = core_misc_pmu_event_del, /* must have */
+ .start = core_misc_pmu_event_start,
+ .stop = core_misc_pmu_event_stop,
+ .read = core_misc_pmu_event_update,
+ .capabilities = PERF_PMU_CAP_NO_INTERRUPT,
+ };
+
+ err = perf_pmu_register(&type->pmu, type->name, -1);
+ if (WARN_ON(err))
+ pr_info("Failed to register PMU %s error %d\n",
+ type->pmu.name, err);
+ }
+}
+
+static int __init core_misc_pmu_init(void)
+{
+ int err;
+
+ if (cpu_has_hypervisor)
+ return -ENODEV;
+
+ err = core_misc_init();
+ if (err)
+ return err;
+
+ core_misc_cpumask_init();
+
+ core_misc_pmus_register();
+
+ return 0;
+}
+device_initcall(core_misc_pmu_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_core_misc.h b/arch/x86/kernel/cpu/perf_event_intel_core_misc.h
new file mode 100644
index 0000000..0ed66e4
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_core_misc.h
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2015, Intel Corp.
+ * Author: Kan Liang ([email protected])
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Library General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Library General Public License for more details.
+ *
+ */
+
+#ifndef __PERF_EVENT_INTEL_CORE_MISC_H
+#define __PERF_EVENT_INTEL_CORE_MISC_H
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "perf_event.h"
+
+#define CORE_MISC_HRTIMER_INTERVAL (60LL * NSEC_PER_SEC)
+
+struct intel_core_misc_type {
+ struct pmu pmu;
+ const char *name;
+ int type;
+ const struct attribute_group **pmu_group;
+};
+
+enum perf_intel_core_misc_type {
+ perf_intel_core_misc_thread = 0,
+ perf_intel_core_misc_core,
+ perf_intel_core_misc_pkg,
+};
+
+struct perf_core_misc_event_msr {
+ int id;
+ u64 msr;
+};
+
+enum perf_core_misc_id {
+ /*
+ * core_misc events, generalized by the kernel:
+ */
+ PERF_POWER_APERF = 1,
+ PERF_POWER_MPERF = 2,
+ PERF_POWER_PPERF = 3,
+ PERF_SMI_COUNT = 4,
+ PERF_TSC = 5,
+ PERF_POWER_CORE_C1_RES = 6,
+ PERF_POWER_CORE_C3_RES = 7,
+ PERF_POWER_CORE_C6_RES = 8,
+ PERF_POWER_CORE_C7_RES = 9,
+ PERF_POWER_PKG_C2_RES = 10,
+ PERF_POWER_PKG_C3_RES = 11,
+ PERF_POWER_PKG_C6_RES = 12,
+ PERF_POWER_PKG_C7_RES = 13,
+ PERF_POWER_PKG_C8_RES = 14,
+ PERF_POWER_PKG_C9_RES = 15,
+ PERF_POWER_PKG_C10_RES = 16,
+ PERF_POWER_SLM_PKG_C6_RES = 17,
+
+ PERF_CORE_MISC_EVENT_MAX, /* non-ABI */
+};
+
+/*
+ * event code: LSB 8 bits, passed in attr->config
+ * any other bit is reserved
+ */
+#define CORE_MISC_EVENT_MASK 0xFFULL
+
+#define DEFINE_CORE_MISC_FORMAT_ATTR(_var, _name, _format) \
+static ssize_t __core_misc_##_var##_show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *page) \
+{ \
+ BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+ return snprintf(page, sizeof(_format) + 2, _format "\n"); \
+} \
+static struct kobj_attribute format_attr_##_var = \
+ __ATTR(_name, 0444, __core_misc_##_var##_show, NULL)
+
+
+struct core_misc_pmu {
+ spinlock_t lock;
+ int n_active;
+ struct list_head active_list;
+ struct intel_core_misc_type *core_misc_type;
+ struct pmu *pmu;
+};
+#endif
--
1.8.3.1
From: Kan Liang <[email protected]>
This patch implements core_misc PMU disable and enable functions.
core_misc PMU counters are free running counters, so it is impossible
to stop/start them. Here "disable" simply means the counters are not
read.
With the disable/enable functions, core_misc events can be "disabled"
when other PMU events stop. For example, we can stop reading the
core_misc counters while the PMU interrupt handler runs.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/include/asm/perf_event.h | 2 ++
arch/x86/kernel/cpu/perf_event.h | 10 ++++++
arch/x86/kernel/cpu/perf_event_intel.c | 4 +++
arch/x86/kernel/cpu/perf_event_intel_core_misc.c | 41 ++++++++++++++++++++++++
4 files changed, 57 insertions(+)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index dc0f6ed..2905f4c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -11,6 +11,8 @@
#define X86_PMC_IDX_MAX 64
+#define X86_CORE_MISC_COUNTER_MAX 64
+
#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3e7fd27..fb14f8a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -239,6 +239,12 @@ struct cpu_hw_events {
int excl_thread_id; /* 0 or 1 */
/*
+ * Intel core misc
+ */
+ struct perf_event *core_misc_events[X86_CORE_MISC_COUNTER_MAX]; /* in counter order */
+ unsigned long core_misc_active_mask[BITS_TO_LONGS(X86_CORE_MISC_COUNTER_MAX)];
+
+ /*
* AMD specific bits
*/
struct amd_nb *amd_nb;
@@ -927,6 +933,10 @@ int p6_pmu_init(void);
int knc_pmu_init(void);
+void intel_core_misc_pmu_enable(void);
+
+void intel_core_misc_pmu_disable(void);
+
ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index b9826a9..651a86d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
if (!x86_pmu.late_ack)
apic_write(APIC_LVTPC, APIC_DM_NMI);
__intel_pmu_disable_all();
+ if (cpuc->core_misc_active_mask)
+ intel_core_misc_pmu_disable();
handled = intel_pmu_drain_bts_buffer();
handled += intel_bts_interrupt();
status = intel_pmu_get_status();
@@ -1671,6 +1673,8 @@ again:
done:
__intel_pmu_enable_all(0, true);
+ if (cpuc->core_misc_active_mask)
+ intel_core_misc_pmu_enable();
/*
* Only unmask the NMI after the overflow counters
* have been reset. This avoids spurious NMIs on
diff --git a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
index c6c82ac..4efe842 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
@@ -250,12 +250,19 @@ static void __core_misc_pmu_event_start(struct core_misc_pmu *pmu,
static void core_misc_pmu_event_start(struct perf_event *event, int mode)
{
struct core_misc_pmu *pmu = get_core_misc_pmu(event);
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int idx = event->hw.idx;
unsigned long flags;
if (pmu == NULL)
return;
spin_lock_irqsave(&pmu->lock, flags);
+
+ if (pmu->pmu->type == perf_intel_core_misc_thread) {
+ cpuc->core_misc_events[idx] = event;
+ __set_bit(idx, cpuc->core_misc_active_mask);
+ }
__core_misc_pmu_event_start(pmu, event);
spin_unlock_irqrestore(&pmu->lock, flags);
}
@@ -264,6 +271,7 @@ static void core_misc_pmu_event_stop(struct perf_event *event, int mode)
{
struct core_misc_pmu *pmu = get_core_misc_pmu(event);
struct hw_perf_event *hwc = &event->hw;
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
unsigned long flags;
if (pmu == NULL)
@@ -273,6 +281,8 @@ static void core_misc_pmu_event_stop(struct perf_event *event, int mode)
/* mark event as deactivated and stopped */
if (!(hwc->state & PERF_HES_STOPPED)) {
+ if (__test_and_clear_bit(hwc->idx, cpuc->core_misc_active_mask))
+ cpuc->core_misc_events[hwc->idx] = NULL;
WARN_ON_ONCE(pmu->n_active <= 0);
pmu->n_active--;
@@ -294,6 +304,32 @@ static void core_misc_pmu_event_stop(struct perf_event *event, int mode)
spin_unlock_irqrestore(&pmu->lock, flags);
}
+void intel_core_misc_pmu_enable(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ u64 start;
+ int bit;
+
+ for_each_set_bit(bit, cpuc->core_misc_active_mask,
+ X86_CORE_MISC_COUNTER_MAX) {
+ event = cpuc->core_misc_events[bit];
+ start = core_misc_pmu_read_counter(event);
+ local64_set(&event->hw.prev_count, start);
+ }
+}
+
+void intel_core_misc_pmu_disable(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int bit;
+
+ for_each_set_bit(bit, cpuc->core_misc_active_mask,
+ X86_CORE_MISC_COUNTER_MAX) {
+ core_misc_pmu_event_update(cpuc->core_misc_events[bit]);
+ }
+}
+
static void core_misc_pmu_event_del(struct perf_event *event, int mode)
{
core_misc_pmu_event_stop(event, PERF_EF_UPDATE);
@@ -863,6 +899,11 @@ static void __init core_misc_pmus_register(void)
.capabilities = PERF_PMU_CAP_NO_INTERRUPT,
};
+ if (type->type == perf_intel_core_misc_thread) {
+ type->pmu.pmu_disable = (void *) intel_core_misc_pmu_disable;
+ type->pmu.pmu_enable = (void *) intel_core_misc_pmu_enable;
+ }
+
err = perf_pmu_register(&type->pmu, type->name, -1);
if (WARN_ON(err))
pr_info("Failed to register PMU %s error %d\n",
--
1.8.3.1
From: Kan Liang <[email protected]>
Use is_hardware_event() instead of !is_software_event() to indicate a
hardware event.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 7 ++++++-
kernel/events/core.c | 6 +++---
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..fea0ddf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -761,13 +761,18 @@ static inline bool is_sampling_event(struct perf_event *event)
}
/*
- * Return 1 for a software event, 0 for a hardware event
+ * Return 1 for a software event, 0 for other events
*/
static inline int is_software_event(struct perf_event *event)
{
return event->pmu->task_ctx_nr == perf_sw_context;
}
+static inline int is_hardware_event(struct perf_event *event)
+{
+ return event->pmu->task_ctx_nr == perf_hw_context;
+}
+
extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..9077867 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1347,7 +1347,7 @@ static void perf_group_attach(struct perf_event *event)
WARN_ON_ONCE(group_leader->ctx != event->ctx);
if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
- !is_software_event(event))
+ is_hardware_event(event))
group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
list_add_tail(&event->group_entry, &group_leader->sibling_list);
@@ -1553,7 +1553,7 @@ event_sched_out(struct perf_event *event,
event->pmu->del(event, 0);
event->oncpu = -1;
- if (!is_software_event(event))
+ if (is_hardware_event(event))
cpuctx->active_oncpu--;
if (!--ctx->nr_active)
perf_event_ctx_deactivate(ctx);
@@ -1881,7 +1881,7 @@ event_sched_in(struct perf_event *event,
goto out;
}
- if (!is_software_event(event))
+ if (is_hardware_event(event))
cpuctx->active_oncpu++;
if (!ctx->nr_active++)
perf_event_ctx_activate(ctx);
--
1.8.3.1
From: Kan Liang <[email protected]>
This patch special-cases per-cpu core_misc PMU events and allows them
to be part of any hardware/software group for system-wide monitoring.
A useful example is including the ASTATE/MSTATE events in a sampling
group. This can be used to calculate the frequency during each
sampling period and track it over time.
A new context type (perf_free_context) is introduced to indicate these
per-cpu core misc PMU events. They:
- are free running counters
- have no state to switch on context switch and never fail to schedule
- have no sampling support
- only support system-wide monitoring
- are per-cpu
We also define a new perf event type, PERF_TYPE_CORE_MISC_FREE, for
them.
It is safe to mix cpu PMU events and CORE_MISC_FREE events in a group,
because when the cpu PMU events are disabled/enabled, the
CORE_MISC_FREE events can be disabled/enabled at the same time without
failure.
Since there is no sampling support for these events, they are only
available for group reading and system-wide monitoring.
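For example (as in the cover letter), a sampling hardware event can be
grouped with the free running counters:
  $ perf record -e \
    '{ref-cycles,core_misc/tsc/,core_misc/power-mperf/,core_misc/power-aperf/}:S' \
    --running-time -a ~/tchain_edit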
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_core_misc.c | 8 +++++---
include/linux/perf_event.h | 10 ++++++++++
include/linux/sched.h | 1 +
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 9 +++++++++
5 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
index 4efe842..dad4495 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
@@ -889,7 +889,6 @@ static void __init core_misc_pmus_register(void)
type->pmu = (struct pmu) {
.attr_groups = type->pmu_group,
- .task_ctx_nr = perf_invalid_context,
.event_init = core_misc_pmu_event_init,
.add = core_misc_pmu_event_add, /* must have */
.del = core_misc_pmu_event_del, /* must have */
@@ -902,9 +901,12 @@ static void __init core_misc_pmus_register(void)
if (type->type == perf_intel_core_misc_thread) {
type->pmu.pmu_disable = (void *) intel_core_misc_pmu_disable;
type->pmu.pmu_enable = (void *) intel_core_misc_pmu_enable;
+ type->pmu.task_ctx_nr = perf_free_context;
+ err = perf_pmu_register(&type->pmu, type->name, PERF_TYPE_CORE_MISC_FREE);
+ } else {
+ type->pmu.task_ctx_nr = perf_invalid_context;
+ err = perf_pmu_register(&type->pmu, type->name, -1);
}
-
- err = perf_pmu_register(&type->pmu, type->name, -1);
if (WARN_ON(err))
pr_info("Failed to register PMU %s error %d\n",
type->pmu.name, err);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fea0ddf..3538f1c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -773,6 +773,16 @@ static inline int is_hardware_event(struct perf_event *event)
return event->pmu->task_ctx_nr == perf_hw_context;
}
+static inline int is_free_event(struct perf_event *event)
+{
+ return event->pmu->task_ctx_nr == perf_free_context;
+}
+
+static inline int has_context_event(struct perf_event *event)
+{
+ return event->pmu->task_ctx_nr > perf_invalid_context;
+}
+
extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..717f492 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1335,6 +1335,7 @@ union rcu_special {
struct rcu_node;
enum perf_event_task_context {
+ perf_free_context = -2,
perf_invalid_context = -1,
perf_hw_context = 0,
perf_sw_context,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d97f84c..232b674 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -32,6 +32,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE = 3,
PERF_TYPE_RAW = 4,
PERF_TYPE_BREAKPOINT = 5,
+ PERF_TYPE_CORE_MISC_FREE = 6,
PERF_TYPE_MAX, /* non-ABI */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9077867..995b436 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8019,6 +8019,15 @@ SYSCALL_DEFINE5(perf_event_open,
}
/*
+ * Special case per-cpu free counter events and allow them to be part of
+ * any hardware/software group for system-wide monitoring.
+ */
+ if (group_leader && !task &&
+ is_free_event(event) &&
+ has_context_event(group_leader))
+ pmu = group_leader->pmu;
+
+ /*
* Get the target context (task or percpu):
*/
ctx = find_get_context(pmu, task, event);
--
1.8.3.1
From: Kan Liang <[email protected]>
An evsel may have different cpus and threads than its evlist.
Use the evsel's own cpus and threads when opening it in perf record.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-record.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 283fe96..eec3ee8 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -279,7 +279,7 @@ static int record__open(struct record *rec)
evlist__for_each(evlist, pos) {
try_again:
- if (perf_evsel__open(pos, evlist->cpus, evlist->threads) < 0) {
+ if (perf_evsel__open(pos, pos->cpus, pos->threads) < 0) {
if (perf_evsel__fallback(pos, errno, msg, sizeof(msg))) {
if (verbose)
ui__warning("%s\n", msg);
--
1.8.3.1
From: Kan Liang <[email protected]>
The group read results of the TSC/ASTATE/MSTATE events can be used to
calculate the frequency during each sampling period.
Show it in perf report -D.
Here is an example:
$ perf record -e \
    '{ref-cycles,core_misc/tsc/,core_misc/power-mperf/,core_misc/power-aperf/}:S' \
    --running-time -a ~/tchain_edit
Here is one sample from perf report -D
18 506413677835 0x3f1d8 [0x90]: PERF_RECORD_SAMPLE(IP, 0x1): 8/8:
0xffffffff810cba6d period: 62219 addr: 0
... sample_read:
...... time enabled 000000000025a464
...... time running 000000000025a464
.... group nr 4
..... id 00000000000001a0, value 000000000008a605
..... id 00000000000001e2, value 0000000000565ac5
..... id 0000000000000222, value 0000000000079bc8
..... id 0000000000000262, value 0000000000068d69
..... TSC_MHz 2294
..... AVG_MHz 174
..... Bzy_MHz 2663
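Decoding the hex values above (assuming the group values appear in the
same order as the events were opened, i.e. ref-cycles, tsc, mperf,
aperf):
  time_running = 0x25a464 = 2466916 ns
  tsc   = 0x565ac5 = 5659333 => TSC_MHz = 1000 * 5659333 / 2466916 = 2294
  aperf = 0x068d69 =  429417 => AVG_MHz = 1000 *  429417 / 2466916 = 174
  mperf = 0x079bc8 =  498632 => Bzy_MHz = TSC_MHz * 498632 / 429417 = 2663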
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/session.c | 40 +++++++++++++++++++++++++++++++++++-----
tools/perf/util/session.h | 4 ++++
2 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ed9dc25..6a142d8 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -851,8 +851,14 @@ static void perf_evlist__print_tstamp(struct perf_evlist *evlist,
printf("%" PRIu64 " ", sample->time);
}
-static void sample_read__printf(struct perf_sample *sample, u64 read_format)
+static void sample_read__printf(struct perf_evlist *evlist,
+ struct perf_sample *sample,
+ u64 read_format)
{
+ struct perf_evsel *evsel;
+ struct perf_sample_id *sid;
+ u64 tsc = 0, aperf = 0, mperf = 0;
+
printf("... sample_read:\n");
if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -875,10 +881,33 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
printf("..... id %016" PRIx64
", value %016" PRIx64 "\n",
value->id, value->value);
+
+ if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
+ sid = perf_evlist__id2sid(evlist, value->id);
+ evsel = sid->evsel;
+ if ((evsel != NULL) &&
+ (evsel->attr.type == PERF_TYPE_CORE_MISC_FREE)) {
+ if (evsel->attr.config == PERF_POWER_APERF)
+ aperf = value->value;
+ if (evsel->attr.config == PERF_POWER_MPERF)
+ mperf = value->value;
+ if (evsel->attr.config == PERF_TSC)
+ tsc = value->value;
+ }
+ }
}
} else
printf("..... id %016" PRIx64 ", value %016" PRIx64 "\n",
sample->read.one.id, sample->read.one.value);
+
+ if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
+ if (tsc > 0)
+ printf("..... TSC_MHz %lu\n", (1000 * tsc) / sample->read.time_running);
+ if (aperf > 0)
+ printf("..... AVG_MHz %lu\n", (1000 * aperf) / sample->read.time_running);
+ if ((tsc > 0) && (aperf > 0) && (mperf > 0))
+ printf("..... Bzy_MHz %lu\n", (1000 * tsc / aperf * mperf) / sample->read.time_running);
+ }
}
static void dump_event(struct perf_evlist *evlist, union perf_event *event,
@@ -899,8 +928,8 @@ static void dump_event(struct perf_evlist *evlist, union perf_event *event,
event->header.size, perf_event__name(event->header.type));
}
-static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
- struct perf_sample *sample)
+static void dump_sample(struct perf_evlist *evlist, struct perf_evsel *evsel,
+ union perf_event *event, struct perf_sample *sample)
{
u64 sample_type;
@@ -938,7 +967,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
printf("... transaction: %" PRIx64 "\n", sample->transaction);
if (sample_type & PERF_SAMPLE_READ)
- sample_read__printf(sample, evsel->attr.read_format);
+ sample_read__printf(evlist, sample, evsel->attr.read_format);
}
static struct machine *machines__find_for_cpumode(struct machines *machines,
@@ -1053,11 +1082,12 @@ static int machines__deliver_event(struct machines *machines,
switch (event->header.type) {
case PERF_RECORD_SAMPLE:
- dump_sample(evsel, event, sample);
if (evsel == NULL) {
++evlist->stats.nr_unknown_id;
return 0;
}
+ dump_sample(evlist, evsel, event, sample);
+
if (machine == NULL) {
++evlist->stats.nr_unprocessable_samples;
return 0;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index b44afc7..220cfb3 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -42,6 +42,10 @@ struct perf_session {
#define PRINT_IP_OPT_ONELINE (1<<4)
#define PRINT_IP_OPT_SRCLINE (1<<5)
+#define PERF_POWER_APERF 1
+#define PERF_POWER_MPERF 2
+#define PERF_TSC 5
+
struct perf_tool;
struct perf_session *perf_session__new(struct perf_data_file *file,
--
1.8.3.1
From: Kan Liang <[email protected]>
Save APERF/MPERF/TSC in struct perf_sample, so that subsequent sample
processing functions can easily handle them.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/event.h | 3 +++
tools/perf/util/session.c | 20 ++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index c53f363..5a5431f 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -191,6 +191,9 @@ struct perf_sample {
u64 data_src;
u32 flags;
u16 insn_len;
+ u64 tsc;
+ u64 aperf;
+ u64 mperf;
void *raw_data;
struct ip_callchain *callchain;
struct branch_stack *branch_stack;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6a142d8..bffa58b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1005,6 +1005,8 @@ static int deliver_sample_value(struct perf_evlist *evlist,
struct machine *machine)
{
struct perf_sample_id *sid = perf_evlist__id2sid(evlist, v->id);
+ struct perf_evsel *evsel;
+ u64 nr = 0;
if (sid) {
sample->id = v->id;
@@ -1017,6 +1019,24 @@ static int deliver_sample_value(struct perf_evlist *evlist,
return 0;
}
+ if (perf_evsel__is_group_leader(sid->evsel)) {
+ evsel = sid->evsel;
+ evlist__for_each_continue(evlist, evsel) {
+ if ((evsel->leader != sid->evsel) ||
+ (++nr >= sample->read.group.nr))
+ break;
+
+ if (evsel->attr.type == PERF_TYPE_CORE_MISC_FREE) {
+ if (evsel->attr.config == PERF_POWER_APERF)
+ sample->aperf = sample->read.group.values[nr].value;
+ if (evsel->attr.config == PERF_POWER_MPERF)
+ sample->mperf = sample->read.group.values[nr].value;
+ if (evsel->attr.config == PERF_TSC)
+ sample->tsc = sample->read.group.values[nr].value;
+ }
+ }
+ }
+
return tool->sample(tool, event, sample, sid->evsel, machine);
}
--
1.8.3.1
From: Kan Liang <[email protected]>
Introduce a new hist_iter ops (hist_iter_freq) to calculate the
tsc/avg/bzy frequencies when processing samples, and save them in
hist_entry.
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-annotate.c | 2 +-
tools/perf/builtin-diff.c | 2 +-
tools/perf/tests/hists_link.c | 4 ++--
tools/perf/util/hist.c | 52 ++++++++++++++++++++++++++++++++++++++-----
tools/perf/util/hist.h | 2 ++
tools/perf/util/sort.h | 3 +++
tools/perf/util/symbol.h | 6 +++++
7 files changed, 61 insertions(+), 10 deletions(-)
diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index 2c1bec3..06e2f87 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -71,7 +71,7 @@ static int perf_evsel__add_sample(struct perf_evsel *evsel,
return 0;
}
- he = __hists__add_entry(hists, al, NULL, NULL, NULL, 1, 1, 0, true);
+ he = __hists__add_entry(hists, al, NULL, NULL, NULL, 1, 1, 0, NULL, true);
if (he == NULL)
return -ENOMEM;
diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index daaa7dc..2fffcc4 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -315,7 +315,7 @@ static int hists__add_entry(struct hists *hists,
u64 weight, u64 transaction)
{
if (__hists__add_entry(hists, al, NULL, NULL, NULL, period, weight,
- transaction, true) != NULL)
+ transaction, NULL, true) != NULL)
return 0;
return -ENOMEM;
}
diff --git a/tools/perf/tests/hists_link.c b/tools/perf/tests/hists_link.c
index 8c102b0..5d9f9e3 100644
--- a/tools/perf/tests/hists_link.c
+++ b/tools/perf/tests/hists_link.c
@@ -90,7 +90,7 @@ static int add_hist_entries(struct perf_evlist *evlist, struct machine *machine)
goto out;
he = __hists__add_entry(hists, &al, NULL,
- NULL, NULL, 1, 1, 0, true);
+ NULL, NULL, 1, 1, 0, NULL, true);
if (he == NULL) {
addr_location__put(&al);
goto out;
@@ -116,7 +116,7 @@ static int add_hist_entries(struct perf_evlist *evlist, struct machine *machine)
goto out;
he = __hists__add_entry(hists, &al, NULL,
- NULL, NULL, 1, 1, 0, true);
+ NULL, NULL, 1, 1, 0, NULL, true);
if (he == NULL) {
addr_location__put(&al);
goto out;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6f28d53..ce32bd58 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -436,7 +436,9 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
struct symbol *sym_parent,
struct branch_info *bi,
struct mem_info *mi,
- u64 period, u64 weight, u64 transaction,
+ u64 period, u64 weight,
+ u64 transaction,
+ struct freq_info *freq,
bool sample_self)
{
struct hist_entry entry = {
@@ -454,6 +456,9 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
.nr_events = 1,
.period = period,
.weight = weight,
+ .tsc_freq = (freq != NULL) ? freq->tsc_freq : 0,
+ .avg_freq = (freq != NULL) ? freq->avg_freq : 0,
+ .bzy_freq = (freq != NULL) ? freq->bzy_freq : 0,
},
.parent = sym_parent,
.filtered = symbol__parent_filter(sym_parent) | al->filtered,
@@ -481,6 +486,33 @@ iter_add_next_nop_entry(struct hist_entry_iter *iter __maybe_unused,
}
static int
+iter_add_single_freq_entry(struct hist_entry_iter *iter, struct addr_location *al)
+{
+ struct perf_evsel *evsel = iter->evsel;
+ struct perf_sample *sample = iter->sample;
+ struct hist_entry *he;
+	struct freq_info freq = {};
+
+ if (sample->read.time_running > 0) {
+ freq.tsc_freq = (1000 * sample->tsc) / sample->read.time_running;
+ freq.avg_freq = (1000 * sample->aperf) / sample->read.time_running;
+ if (sample->aperf > 0)
+ freq.bzy_freq = freq.tsc_freq * sample->mperf / sample->aperf;
+ else
+ freq.bzy_freq = 0;
+ }
+
+ he = __hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
+ sample->period, sample->weight,
+ sample->transaction, &freq, true);
+ if (he == NULL)
+ return -ENOMEM;
+
+ iter->he = he;
+ return 0;
+}
+
+static int
iter_prepare_mem_entry(struct hist_entry_iter *iter, struct addr_location *al)
{
struct perf_sample *sample = iter->sample;
@@ -517,7 +549,7 @@ iter_add_single_mem_entry(struct hist_entry_iter *iter, struct addr_location *al
* and the he_stat__add_period() function.
*/
he = __hists__add_entry(hists, al, iter->parent, NULL, mi,
- cost, cost, 0, true);
+ cost, cost, 0, NULL, true);
if (!he)
return -ENOMEM;
@@ -618,7 +650,7 @@ iter_add_next_branch_entry(struct hist_entry_iter *iter, struct addr_location *a
* and not events sampled. Thus we use a pseudo period of 1.
*/
he = __hists__add_entry(hists, al, iter->parent, &bi[i], NULL,
- 1, 1, 0, true);
+ 1, 1, 0, NULL, true);
if (he == NULL)
return -ENOMEM;
@@ -656,7 +688,7 @@ iter_add_single_normal_entry(struct hist_entry_iter *iter, struct addr_location
he = __hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
sample->period, sample->weight,
- sample->transaction, true);
+ sample->transaction, NULL, true);
if (he == NULL)
return -ENOMEM;
@@ -718,7 +750,7 @@ iter_add_single_cumulative_entry(struct hist_entry_iter *iter,
he = __hists__add_entry(hists, al, iter->parent, NULL, NULL,
sample->period, sample->weight,
- sample->transaction, true);
+ sample->transaction, NULL, true);
if (he == NULL)
return -ENOMEM;
@@ -791,7 +823,7 @@ iter_add_next_cumulative_entry(struct hist_entry_iter *iter,
he = __hists__add_entry(evsel__hists(evsel), al, iter->parent, NULL, NULL,
sample->period, sample->weight,
- sample->transaction, false);
+ sample->transaction, NULL, false);
if (he == NULL)
return -ENOMEM;
@@ -813,6 +845,14 @@ iter_finish_cumulative_entry(struct hist_entry_iter *iter,
return 0;
}
+const struct hist_iter_ops hist_iter_freq = {
+ .prepare_entry = iter_prepare_normal_entry,
+ .add_single_entry = iter_add_single_freq_entry,
+ .next_entry = iter_next_nop_entry,
+ .add_next_entry = iter_add_next_nop_entry,
+ .finish_entry = iter_finish_normal_entry,
+};
+
const struct hist_iter_ops hist_iter_mem = {
.prepare_entry = iter_prepare_mem_entry,
.add_single_entry = iter_add_single_mem_entry,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 5ed8d9c..3601658 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -102,6 +102,7 @@ extern const struct hist_iter_ops hist_iter_normal;
extern const struct hist_iter_ops hist_iter_branch;
extern const struct hist_iter_ops hist_iter_mem;
extern const struct hist_iter_ops hist_iter_cumulative;
+extern const struct hist_iter_ops hist_iter_freq;
struct hist_entry *__hists__add_entry(struct hists *hists,
struct addr_location *al,
@@ -109,6 +110,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
struct branch_info *bi,
struct mem_info *mi, u64 period,
u64 weight, u64 transaction,
+ struct freq_info *freq,
bool sample_self);
int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al,
int max_stack_depth, void *arg);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index e97cd47..5720076 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -54,6 +54,9 @@ struct he_stat {
u64 period_guest_us;
u64 weight;
u32 nr_events;
+ u64 tsc_freq;
+ u64 avg_freq;
+ u64 bzy_freq;
};
struct hist_entry_diff {
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index b98ce51..b71d575 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -180,6 +180,12 @@ struct mem_info {
union perf_mem_data_src data_src;
};
+struct freq_info {
+ u64 tsc_freq;
+ u64 avg_freq;
+ u64 bzy_freq;
+};
+
struct addr_location {
struct machine *machine;
struct thread *thread;
--
1.8.3.1
From: Kan Liang <[email protected]>
Show the freq for each symbol in perf report with --stdio --show-freq.
In a sampling group, only the group leader does sampling, so we only need
to print the group leader's freq in --group mode.
Here is an example.
$ perf report --stdio --group --show-freq
# Samples: 71K of event 'anon group { ref-cycles, core_misc/tsc/, core_misc/power-mperf/, core_misc/power-aperf/ }'
# Event count (approx.): 215265868412
#
# Overhead                          TSC MHz   AVG MHz   BZY MHz   Command       Shared Object     Symbol
# ................................  ........  ........  ........  ............  ................  ..................................
#
    98.85%   5.41%  98.89%  98.95%      2293      1474      2302  tchain_edit   tchain_edit       [.] f3
     0.39%   1.64%   0.39%   0.37%      2295         1      3053  kworker/25:1  [kernel.vmlinux]  [k] delay_tsc
     0.08%  24.20%   0.07%   0.06%      2295        82      2746  swapper       [kernel.vmlinux]  [k] acpi_idle_do_entry
     0.05%   0.00%   0.05%   0.05%      2295      2289      2295  tchain_edit   tchain_edit       [.] f2
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/Documentation/perf-report.txt | 10 +++++
tools/perf/builtin-report.c | 17 ++++++++
tools/perf/perf.h | 1 +
tools/perf/ui/hist.c | 69 +++++++++++++++++++++++++++++---
tools/perf/util/hist.h | 3 ++
tools/perf/util/session.c | 2 +-
tools/perf/util/sort.c | 3 ++
tools/perf/util/symbol.h | 3 +-
tools/perf/util/util.c | 4 ++
9 files changed, 105 insertions(+), 7 deletions(-)
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index c33b69f..fb82390 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -303,6 +303,16 @@ OPTIONS
special event -e cpu/mem-loads/ or -e cpu/mem-stores/. See
'perf mem' for simpler access.
+--show-freq::
+ Show frequency result from sample read.
+ To generate the frequency output, the perf.data file must have been
+ obtained using perf record -a --running-time, using special events
+ -e core_misc/tsc/, core_misc/power-mperf/ or core_misc/power-aperf/,
+ and group read.
+ TSC MHz: average MHz that the TSC ran during the sample interval.
+ AVG MHz: number of cycles executed divided by time elapsed.
+ BZY MHz: average clock rate while the CPU was busy (in "c0" state).
+
--percent-limit::
Do not show entries which have an overhead under that percent.
(Default: 0).
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 95a4771..00e77e2 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -164,6 +164,8 @@ static int process_sample_event(struct perf_tool *tool,
iter.ops = &hist_iter_mem;
else if (symbol_conf.cumulate_callchain)
iter.ops = &hist_iter_cumulative;
+ else if (symbol_conf.show_freq)
+ iter.ops = &hist_iter_freq;
else
iter.ops = &hist_iter_normal;
@@ -721,6 +723,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
OPT_BOOLEAN(0, "demangle-kernel", &symbol_conf.demangle_kernel,
"Enable kernel symbol demangling"),
OPT_BOOLEAN(0, "mem-mode", &report.mem_mode, "mem access profile"),
+ OPT_BOOLEAN(0, "show-freq", &symbol_conf.show_freq, "frequency profile"),
OPT_CALLBACK(0, "percent-limit", &report, "percent",
"Don't show entries under that percent", parse_percent_limit),
OPT_CALLBACK(0, "percentage", NULL, "relative|absolute",
@@ -733,6 +736,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
struct perf_data_file file = {
.mode = PERF_DATA_MODE_READ,
};
+ struct perf_evsel *pos;
int ret = hists__init();
if (ret < 0)
@@ -818,6 +822,19 @@ repeat:
symbol_conf.cumulate_callchain = false;
}
+ if (symbol_conf.show_freq) {
+ evlist__for_each(session->evlist, pos) {
+ if (pos->attr.type == PERF_TYPE_CORE_MISC_FREE) {
+ if (pos->attr.config == PERF_POWER_APERF)
+ perf_aperf = true;
+ if (pos->attr.config == PERF_POWER_MPERF)
+ perf_mperf = true;
+ if (pos->attr.config == PERF_TSC)
+ perf_tsc = true;
+ }
+ }
+ }
+
if (setup_sorting() < 0) {
if (sort_order)
parse_options_usage(report_usage, options, "s", 1);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 937b16a..54d248ec 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -33,6 +33,7 @@ static inline unsigned long long rdclock(void)
extern const char *input_name;
extern bool perf_host, perf_guest;
+extern bool perf_tsc, perf_aperf, perf_mperf;
extern const char perf_version_string[];
void pthread__unblock_sigwinch(void);
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 25d6083..3cb2cd5 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -17,7 +17,7 @@
static int __hpp__fmt(struct perf_hpp *hpp, struct hist_entry *he,
hpp_field_fn get_field, const char *fmt, int len,
- hpp_snprint_fn print_fn, bool fmt_percent)
+ hpp_snprint_fn print_fn, bool fmt_percent, bool single)
{
int ret;
struct hists *hists = he->hists;
@@ -36,7 +36,7 @@ static int __hpp__fmt(struct perf_hpp *hpp, struct hist_entry *he,
} else
ret = hpp__call_print_fn(hpp, print_fn, fmt, len, get_field(he));
- if (perf_evsel__is_group_event(evsel)) {
+ if (perf_evsel__is_group_event(evsel) && !single) {
int prev_idx, idx_delta;
struct hist_entry *pair;
int nr_members = evsel->nr_members;
@@ -109,10 +109,17 @@ int hpp__fmt(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
const char *fmtstr, hpp_snprint_fn print_fn, bool fmt_percent)
{
int len = fmt->user_len ?: fmt->len;
+ bool single = false;
+
+ if (symbol_conf.show_freq &&
+ ((fmt == &perf_hpp__format[PERF_HPP__TSC]) ||
+ (fmt == &perf_hpp__format[PERF_HPP__AVG]) ||
+ (fmt == &perf_hpp__format[PERF_HPP__BZY])))
+ single = true;
if (symbol_conf.field_sep) {
return __hpp__fmt(hpp, he, get_field, fmtstr, 1,
- print_fn, fmt_percent);
+ print_fn, fmt_percent, single);
}
if (fmt_percent)
@@ -120,7 +127,7 @@ int hpp__fmt(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
else
len -= 1;
- return __hpp__fmt(hpp, he, get_field, fmtstr, len, print_fn, fmt_percent);
+ return __hpp__fmt(hpp, he, get_field, fmtstr, len, print_fn, fmt_percent, single);
}
int hpp__fmt_acc(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
@@ -234,6 +241,30 @@ static int hpp__header_fn(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
return scnprintf(hpp->buf, hpp->size, "%*s", len, fmt->name);
}
+static int hpp__single_width_fn(struct perf_hpp_fmt *fmt,
+ struct perf_hpp *hpp __maybe_unused,
+ struct perf_evsel *evsel)
+{
+ int len = fmt->user_len ?: fmt->len;
+
+ if (symbol_conf.event_group && !symbol_conf.show_freq)
+ len = max(len, evsel->nr_members * fmt->len);
+
+ if (len < (int)strlen(fmt->name))
+ len = strlen(fmt->name);
+
+ return len;
+}
+
+static int hpp__single_header_fn(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
+ struct perf_evsel *evsel)
+{
+ int len = hpp__single_width_fn(fmt, hpp, evsel);
+
+ return scnprintf(hpp->buf, hpp->size, "%*s", len, fmt->name);
+}
+
+
static int hpp_color_scnprintf(struct perf_hpp *hpp, const char *fmt, ...)
{
va_list args;
@@ -363,6 +394,9 @@ HPP_PERCENT_ACC_FNS(overhead_acc, period)
HPP_RAW_FNS(samples, nr_events)
HPP_RAW_FNS(period, period)
+HPP_RAW_FNS(tsc_freq, tsc_freq)
+HPP_RAW_FNS(avg_freq, avg_freq)
+HPP_RAW_FNS(bzy_freq, bzy_freq)
static int64_t hpp__nop_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
struct hist_entry *a __maybe_unused,
@@ -395,6 +429,17 @@ static int64_t hpp__nop_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
.sort = hpp__sort_ ## _fn, \
}
+#define HPP__SINGLE_PRINT_FNS(_name, _fn) \
+ { \
+ .name = _name, \
+ .header = hpp__single_header_fn, \
+ .width = hpp__single_width_fn, \
+ .entry = hpp__entry_ ## _fn, \
+ .cmp = hpp__nop_cmp, \
+ .collapse = hpp__nop_cmp, \
+ .sort = hpp__sort_ ## _fn, \
+ }
+
#define HPP__PRINT_FNS(_name, _fn) \
{ \
.name = _name, \
@@ -414,7 +459,10 @@ struct perf_hpp_fmt perf_hpp__format[] = {
HPP__COLOR_PRINT_FNS("guest usr", overhead_guest_us),
HPP__COLOR_ACC_PRINT_FNS("Children", overhead_acc),
HPP__PRINT_FNS("Samples", samples),
- HPP__PRINT_FNS("Period", period)
+ HPP__PRINT_FNS("Period", period),
+ HPP__SINGLE_PRINT_FNS("TSC MHz", tsc_freq),
+ HPP__SINGLE_PRINT_FNS("AVG MHz", avg_freq),
+ HPP__SINGLE_PRINT_FNS("BZY MHz", bzy_freq)
};
LIST_HEAD(perf_hpp__list);
@@ -485,6 +533,14 @@ void perf_hpp__init(void)
if (symbol_conf.show_total_period)
perf_hpp__column_enable(PERF_HPP__PERIOD);
+ if (symbol_conf.show_freq) {
+ if (perf_tsc)
+ perf_hpp__column_enable(PERF_HPP__TSC);
+ if (perf_aperf)
+ perf_hpp__column_enable(PERF_HPP__AVG);
+ if (perf_mperf && perf_tsc && perf_aperf)
+ perf_hpp__column_enable(PERF_HPP__BZY);
+ }
/* prepend overhead field for backward compatiblity. */
list = &perf_hpp__format[PERF_HPP__OVERHEAD].sort_list;
if (list_empty(list))
@@ -661,6 +717,9 @@ void perf_hpp__reset_width(struct perf_hpp_fmt *fmt, struct hists *hists)
case PERF_HPP__OVERHEAD_GUEST_SYS:
case PERF_HPP__OVERHEAD_GUEST_US:
+ case PERF_HPP__TSC:
+ case PERF_HPP__AVG:
+ case PERF_HPP__BZY:
fmt->len = 9;
break;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 3601658..71e2fa3 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -237,6 +237,9 @@ enum {
PERF_HPP__OVERHEAD_ACC,
PERF_HPP__SAMPLES,
PERF_HPP__PERIOD,
+ PERF_HPP__TSC,
+ PERF_HPP__AVG,
+ PERF_HPP__BZY,
PERF_HPP__MAX_INDEX
};
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index bffa58b..e14128a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1026,7 +1026,7 @@ static int deliver_sample_value(struct perf_evlist *evlist,
(++nr >= sample->read.group.nr))
break;
- if (evsel->attr.type == PERF_TYPE_CORE_MISC_FREE) {
+ if (symbol_conf.show_freq && evsel->attr.type == PERF_TYPE_CORE_MISC_FREE) {
if (evsel->attr.config == PERF_POWER_APERF)
sample->aperf = sample->read.group.values[nr].value;
if (evsel->attr.config == PERF_POWER_MPERF)
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 4c65a14..f618fba 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -1225,6 +1225,9 @@ static struct hpp_dimension hpp_sort_dimensions[] = {
DIM(PERF_HPP__OVERHEAD_ACC, "overhead_children"),
DIM(PERF_HPP__SAMPLES, "sample"),
DIM(PERF_HPP__PERIOD, "period"),
+ DIM(PERF_HPP__TSC, "tsc"),
+ DIM(PERF_HPP__AVG, "avg"),
+ DIM(PERF_HPP__BZY, "bzy"),
};
#undef DIM
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index b71d575..60a6f1a 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -106,7 +106,8 @@ struct symbol_conf {
filter_relative,
show_hist_headers,
branch_callstack,
- has_filter;
+ has_filter,
+ show_freq;
const char *vmlinux_name,
*kallsyms_name,
*source_prefix,
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index edc2d63..c299af9 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -34,6 +34,10 @@ bool test_attr__enabled;
bool perf_host = true;
bool perf_guest = false;
+bool perf_tsc = false;
+bool perf_aperf = false;
+bool perf_mperf = false;
+
char tracing_events_path[PATH_MAX + 1] = "/sys/kernel/debug/tracing/events";
void event_attr_init(struct perf_event_attr *attr)
--
1.8.3.1
On Thu, Jul 16, 2015 at 09:33:45PM +0100, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Using is_hardware_event to replace !is_software_event to indicate a
> hardware event.
Why...?
For an uncore event e, is_hardware_event(e) != !is_software_event(e), so
this will be a change of behaviour...
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> include/linux/perf_event.h | 7 ++++++-
> kernel/events/core.c | 6 +++---
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809..fea0ddf 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -761,13 +761,18 @@ static inline bool is_sampling_event(struct perf_event *event)
> }
>
> /*
> - * Return 1 for a software event, 0 for a hardware event
> + * Return 1 for a software event, 0 for other event
> */
> static inline int is_software_event(struct perf_event *event)
> {
> return event->pmu->task_ctx_nr == perf_sw_context;
> }
>
> +static inline int is_hardware_event(struct perf_event *event)
> +{
> + return event->pmu->task_ctx_nr == perf_hw_context;
> +}
> +
> extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
>
> extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d3dae34..9077867 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1347,7 +1347,7 @@ static void perf_group_attach(struct perf_event *event)
> WARN_ON_ONCE(group_leader->ctx != event->ctx);
>
> if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
> - !is_software_event(event))
> + is_hardware_event(event))
> group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
>
> list_add_tail(&event->group_entry, &group_leader->sibling_list);
> @@ -1553,7 +1553,7 @@ event_sched_out(struct perf_event *event,
> event->pmu->del(event, 0);
> event->oncpu = -1;
>
> - if (!is_software_event(event))
> + if (is_hardware_event(event))
> cpuctx->active_oncpu--;
> if (!--ctx->nr_active)
> perf_event_ctx_deactivate(ctx);
> @@ -1881,7 +1881,7 @@ event_sched_in(struct perf_event *event,
> goto out;
> }
>
> - if (!is_software_event(event))
> + if (is_hardware_event(event))
> cpuctx->active_oncpu++;
> if (!ctx->nr_active++)
> perf_event_ctx_activate(ctx);
... whereby we won't account uncore events as active, and therefore
will never perform throttling.
That doesn't sound right.
Mark.
* [email protected] <[email protected]> wrote:
> From: Kan Liang <[email protected]>
>
> This patchkit intends to support Intel core misc PMUs. There are miscellaneous
> free running (read-only) counters in core. Some new PMUs called core misc PMUs
> are composed to include these counters. The counters include TSC, IA32_APERF,
> IA32_MPERF, IA32_PPERF, SMI_COUNT, CORE_C*_RESIDENCY and PKG_C*_RESIDENCY. There
> could be more in future platform.
Could you please do something like:
s/perf_event_intel_core_misc.c/perf_event_x86/
and in general propagate it to a core perf x86 position?
This feature is not Intel specific per se, although the initial MSRs you are
supporting are Intel specific (and that is fine).
Thanks,
Ingo
On Thu, Jul 16, 2015 at 09:33:44PM +0100, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> This patch implements core_misc PMU disable and enable functions.
> core_misc PMU counters are free running counters, so it's impossible to
> stop/start them.
Doesn't that effectively mean you can't group them? You'll get arbitrary
noise because counters will be incrementing as you read them.
[...]
> @@ -927,6 +933,10 @@ int p6_pmu_init(void);
>
> int knc_pmu_init(void);
>
> +void intel_core_misc_pmu_enable(void);
> +
> +void intel_core_misc_pmu_disable(void);
> +
> ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
> char *page);
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index b9826a9..651a86d 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> if (!x86_pmu.late_ack)
> apic_write(APIC_LVTPC, APIC_DM_NMI);
> __intel_pmu_disable_all();
> + if (cpuc->core_misc_active_mask)
> + intel_core_misc_pmu_disable();
Huh? Free running counters have nothing to do with the PMU interrupt;
there's nothing they can do to trigger it. This feels very hacky.
If this is necessary, surely it should live in __intel_pmu_disable_all?
[...]
> +void intel_core_misc_pmu_enable(void)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + struct perf_event *event;
> + u64 start;
> + int bit;
> +
> + for_each_set_bit(bit, cpuc->core_misc_active_mask,
> + X86_CORE_MISC_COUNTER_MAX) {
> + event = cpuc->core_misc_events[bit];
> + start = core_misc_pmu_read_counter(event);
> + local64_set(&event->hw.prev_count, start);
> + }
> +}
> +
> +void intel_core_misc_pmu_disable(void)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + int bit;
> +
> + for_each_set_bit(bit, cpuc->core_misc_active_mask,
> + X86_CORE_MISC_COUNTER_MAX) {
> + core_misc_pmu_event_update(cpuc->core_misc_events[bit]);
> + }
> +}
> +
> static void core_misc_pmu_event_del(struct perf_event *event, int mode)
> {
> core_misc_pmu_event_stop(event, PERF_EF_UPDATE);
> @@ -863,6 +899,11 @@ static void __init core_misc_pmus_register(void)
> .capabilities = PERF_PMU_CAP_NO_INTERRUPT,
> };
>
> + if (type->type == perf_intel_core_misc_thread) {
> + type->pmu.pmu_disable = (void *) intel_core_misc_pmu_disable;
> + type->pmu.pmu_enable = (void *) intel_core_misc_pmu_enable;
Why are you suppressing an entirely valid compiler warning here?
The signatures of intel_core_misc_pmu_{enable,disable} aren't right. Fix
them to take a struct pmu *.
Mark.
On Thu, Jul 16, 2015 at 09:33:46PM +0100, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> This patch special case per-cpu core_misc PMU events and allow them to
> be part of any hardware/software group for system-wide monitoring.
> A useful example would be to include the ASTATE/MSTATE event in a
> sampling group. This can be used to calculate the frequency during each
> sampling period, and track it over time.
>
> A new context type (perf_free_context) is introduced to indicate these
> per-cpu core misc PMU events. They are
> - Free running counter
> - Don't have any state to switch on context switch and never fails
> to schedule
> - No sampling support
> - Only support system-wide monitoring
> - per-cpu
> We also defined a new PERF event type PERF_TYPE_CORE_MISC_FREE for them.
>
> It's safe to mix cpu PMU events and CORE_MISC_FREE events in a group.
> Because when cpu PMU events disable/enable, we can disable/enable
> them at the same time without failure.
Which effectively means you're context-switching their state (given what
your enable/disable code does).
As with my earlier comments, I don't think these can be grouped with
events (not even from the same PMU given their free-running nature).
They're CPU-affine, so you can associate them with work done on that
CPU.
So as far as I can see, you should be able to handle the per-cpu misc
events in the perf_hardware_context, providing you reject grouping in
your event_init functions.
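Rejecting grouping could be as simple as a check like this called from the
driver's event_init callback (untested sketch; the helper name is made up):

	static int core_misc_validate_group(struct perf_event *event)
	{
		/* Free-running, per-cpu counters: refuse to be grouped. */
		if (event->group_leader != event)
			return -EINVAL;

		return 0;
	}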
What does this extra context give you?
Mark.
> Since there is no sampling support for these events. They are only
> available for group reading and system-wide monitoring.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event_intel_core_misc.c | 8 +++++---
> include/linux/perf_event.h | 10 ++++++++++
> include/linux/sched.h | 1 +
> include/uapi/linux/perf_event.h | 1 +
> kernel/events/core.c | 9 +++++++++
> 5 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
> index 4efe842..dad4495 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_core_misc.c
> @@ -889,7 +889,6 @@ static void __init core_misc_pmus_register(void)
>
> type->pmu = (struct pmu) {
> .attr_groups = type->pmu_group,
> - .task_ctx_nr = perf_invalid_context,
> .event_init = core_misc_pmu_event_init,
> .add = core_misc_pmu_event_add, /* must have */
> .del = core_misc_pmu_event_del, /* must have */
> @@ -902,9 +901,12 @@ static void __init core_misc_pmus_register(void)
> if (type->type == perf_intel_core_misc_thread) {
> type->pmu.pmu_disable = (void *) intel_core_misc_pmu_disable;
> type->pmu.pmu_enable = (void *) intel_core_misc_pmu_enable;
> + type->pmu.task_ctx_nr = perf_free_context;
> + err = perf_pmu_register(&type->pmu, type->name, PERF_TYPE_CORE_MISC_FREE);
> + } else {
> + type->pmu.task_ctx_nr = perf_invalid_context;
> + err = perf_pmu_register(&type->pmu, type->name, -1);
> }
> -
> - err = perf_pmu_register(&type->pmu, type->name, -1);
> if (WARN_ON(err))
> pr_info("Failed to register PMU %s error %d\n",
> type->pmu.name, err);
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fea0ddf..3538f1c 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -773,6 +773,16 @@ static inline int is_hardware_event(struct perf_event *event)
> return event->pmu->task_ctx_nr == perf_hw_context;
> }
>
> +static inline int is_free_event(struct perf_event *event)
> +{
> + return event->pmu->task_ctx_nr == perf_free_context;
> +}
> +
> +static inline int has_context_event(struct perf_event *event)
> +{
> + return event->pmu->task_ctx_nr > perf_invalid_context;
> +}
> +
> extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
>
> extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ae21f15..717f492 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1335,6 +1335,7 @@ union rcu_special {
> struct rcu_node;
>
> enum perf_event_task_context {
> + perf_free_context = -2,
> perf_invalid_context = -1,
> perf_hw_context = 0,
> perf_sw_context,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index d97f84c..232b674 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -32,6 +32,7 @@ enum perf_type_id {
> PERF_TYPE_HW_CACHE = 3,
> PERF_TYPE_RAW = 4,
> PERF_TYPE_BREAKPOINT = 5,
> + PERF_TYPE_CORE_MISC_FREE = 6,
>
> PERF_TYPE_MAX, /* non-ABI */
> };
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 9077867..995b436 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8019,6 +8019,15 @@ SYSCALL_DEFINE5(perf_event_open,
> }
>
> /*
> + * Special case per-cpu free counter events and allow them to be part of
> + * any hardware/software group for system-wide monitoring.
> + */
> + if (group_leader && !task &&
> + is_free_event(event) &&
> + has_context_event(group_leader))
> + pmu = group_leader->pmu;
> +
> + /*
> * Get the target context (task or percpu):
> */
> ctx = find_get_context(pmu, task, event);
> --
> 1.8.3.1
>
On Fri, Jul 17, 2015 at 01:21:06PM +0100, Mark Rutland wrote:
>
> As with my earlier comments, I don't think these can be grouped with
> events (not even from the same PMU given their free-running nature).
>
> They're CPU-affine, so you can associate them with work done on that
> CPU.
Just record the deltas from them while you're on.
On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
> > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> > index b9826a9..651a86d 100644
> > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> > if (!x86_pmu.late_ack)
> > apic_write(APIC_LVTPC, APIC_DM_NMI);
> > __intel_pmu_disable_all();
> > + if (cpuc->core_misc_active_mask)
> > + intel_core_misc_pmu_disable();
>
> Huh? Free running counters have nothing to do with the PMU interrupt;
> there's nothing they can do to trigger it. This feels very hacky.
>
> If this is necessary, surely it should live in __intel_pmu_disable_all?
>
> [...]
Yeah this is crazy. It should not live in the regular PMU at all, not be
Intel specific.
On Fri, Jul 17, 2015 at 03:46:29PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
> > > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> > > index b9826a9..651a86d 100644
> > > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
> > > if (!x86_pmu.late_ack)
> > > apic_write(APIC_LVTPC, APIC_DM_NMI);
> > > __intel_pmu_disable_all();
> > > + if (cpuc->core_misc_active_mask)
> > > + intel_core_misc_pmu_disable();
> >
> > Huh? Free running counters have nothing to do with the PMU interrupt;
> > there's nothing they can do to trigger it. This feels very hacky.
> >
> > If this is necessary, surely it should live in __intel_pmu_disable_all?
> >
> > [...]
>
> Yeah this is crazy. It should not live in the regular PMU at all, not be
> Intel specific.
lkml.kernel.org/r/2c37309d20afadf88ad4a82cf0ce02b9152801e2.1430256154.git.luto@kernel.org
That does the right thing for free running MSRs.
Take it and expand.
>
> On Thu, Jul 16, 2015 at 09:33:45PM +0100, [email protected] wrote:
> > From: Kan Liang <[email protected]>
> >
> > Using is_hardware_event to replace !is_software_event to indicate a
> > hardware event.
>
> Why...?
First, the comment on is_software_event() is not correct.
A return of 0, i.e. !is_software_event, does not mean a hardware event;
is_hardware_event() is what identifies a hardware event.
Also, a following patch makes a core_misc event able to be part of a hw/sw
group, so !is_software_event could match either a hw event or a core_misc
event. We need an accurate definition here.
>
> For an uncore event e, is_hardware_event(e) != !is_software_event(e),
> so this will be a change of behaviour...
An uncore event cannot be part of a hw/sw event group, so it doesn't change the behavior.
>
> >
> > Signed-off-by: Kan Liang <[email protected]>
> > ---
> > include/linux/perf_event.h | 7 ++++++-
> > kernel/events/core.c | 6 +++---
> > 2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 2027809..fea0ddf 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -761,13 +761,18 @@ static inline bool is_sampling_event(struct
> > perf_event *event) }
> >
> > /*
> > - * Return 1 for a software event, 0 for a hardware event
> > + * Return 1 for a software event, 0 for other event
> > */
> > static inline int is_software_event(struct perf_event *event) {
> > return event->pmu->task_ctx_nr == perf_sw_context; }
> >
> > +static inline int is_hardware_event(struct perf_event *event) {
> > + return event->pmu->task_ctx_nr == perf_hw_context; }
> > +
> > extern struct static_key
> perf_swevent_enabled[PERF_COUNT_SW_MAX];
> >
> > extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64); diff
> > --git a/kernel/events/core.c b/kernel/events/core.c index
> > d3dae34..9077867 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1347,7 +1347,7 @@ static void perf_group_attach(struct
> perf_event *event)
> > WARN_ON_ONCE(group_leader->ctx != event->ctx);
> >
> > if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
> > - !is_software_event(event))
> > + is_hardware_event(event))
> > group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
> >
> > list_add_tail(&event->group_entry, &group_leader->sibling_list);
> @@
> > -1553,7 +1553,7 @@ event_sched_out(struct perf_event *event,
> > event->pmu->del(event, 0);
> > event->oncpu = -1;
> >
> > - if (!is_software_event(event))
> > + if (is_hardware_event(event))
> > cpuctx->active_oncpu--;
> > if (!--ctx->nr_active)
> > perf_event_ctx_deactivate(ctx);
> > @@ -1881,7 +1881,7 @@ event_sched_in(struct perf_event *event,
> > goto out;
> > }
> >
> > - if (!is_software_event(event))
> > + if (is_hardware_event(event))
> > cpuctx->active_oncpu++;
> > if (!ctx->nr_active++)
> > perf_event_ctx_activate(ctx);
>
> ... whereby we won't account uncore events as active, and therefore will
> never perform throttling.
>
> That doesn't sound right.
I think active_oncpu only matters when the group is exclusive.
The change makes a pure perf_invalid_context event group never exclusive.
If that's a problem, I will change this part back.
Thanks,
Kan
>
> Mark.
> On Fri, Jul 17, 2015 at 03:46:29PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
> > > > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
> > > > b/arch/x86/kernel/cpu/perf_event_intel.c
> > > > index b9826a9..651a86d 100644
> > > > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > > > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct
> pt_regs *regs)
> > > > if (!x86_pmu.late_ack)
> > > > apic_write(APIC_LVTPC, APIC_DM_NMI);
> > > > __intel_pmu_disable_all();
> > > > + if (cpuc->core_misc_active_mask)
> > > > + intel_core_misc_pmu_disable();
> > >
> > > Huh? Free running counters have nothing to do with the PMU
> > > interrupt; there's nothing they can do to trigger it. This feels very hacky.
> > >
> > > If this is necessary, surely it should live in __intel_pmu_disable_all?
> > >
> > > [...]
> >
> > Yeah this is crazy. It should not live in the regular PMU at all, not
> > be Intel specific.
>
> lkml.kernel.org/r/2c37309d20afadf88ad4a82cf0ce02b9152801e2.143025615
> [email protected]
>
> That does the right thing for free running MSRs.
>
> Take it and expand.
The first patch does something similar to the link you shared.
Here is the first patch:
https://lkml.org/lkml/2015/7/16/953
This patch extends the per-core core_misc PMU based on the first patch.
I implemented this patch because one of the biggest upstream concerns
about mixed PMU groups is that they break group semantics:
when one PMU is stopped, the other PMU is still running.
So I introduced the enable/disable functions. Other PMUs can discard the counter
values of core_misc events when they are stopped or in the irq handler.
If you think it should not live in the regular PMU, I can just remove the code.
We just keep the core_misc events running and there is no harm in it.
Thanks,
Kan
On Fri, Jul 17, 2015 at 04:03:36PM +0100, Liang, Kan wrote:
> >
> > On Thu, Jul 16, 2015 at 09:33:45PM +0100, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > Using is_hardware_event to replace !is_software_event to indicate a
> > > hardware event.
> >
> > Why...?
>
> First, the comments of is_software_event is not correct.
> 0 or !is_software_event is not for a hardware event.
> is_hardware_event is for a hardware event.
Circular logic is fantastic.
> Also, the following patch make mix core_misc event be part of hw/sw
> event, !is_software_event could be either hw event or core_misc event.
!is_software_event is also true for an uncore event currently, and the
code relies on this fact. Blindly replacing !is_software_event with
is_hardware_event changes the behaviour of the code for uncore events.
> > For an uncore event e, is_hardware_event(e) != !is_software_event(e),
> > so this will be a change of behaviour...
>
> Uncore event cannot be part of hw/sw event group. So it doesn't change the behavior.
My complaint had _nothing_ to do with groups. It had to do with the
accounting for throttling, where it _does_ change the behaviour.
However, now that you mention the group logic...
> > > /*
> > > - * Return 1 for a software event, 0 for a hardware event
> > > + * Return 1 for a software event, 0 for other event
> > > */
> > > static inline int is_software_event(struct perf_event *event) {
> > > return event->pmu->task_ctx_nr == perf_sw_context; }
> > >
> > > +static inline int is_hardware_event(struct perf_event *event) {
> > > + return event->pmu->task_ctx_nr == perf_hw_context; }
> > > +
> > > extern struct static_key
> > perf_swevent_enabled[PERF_COUNT_SW_MAX];
> > >
> > > extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64); diff
> > > --git a/kernel/events/core.c b/kernel/events/core.c index
> > > d3dae34..9077867 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -1347,7 +1347,7 @@ static void perf_group_attach(struct
> > perf_event *event)
> > > WARN_ON_ONCE(group_leader->ctx != event->ctx);
> > >
> > > if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
> > > - !is_software_event(event))
> > > + is_hardware_event(event))
> > > group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
> > >
...this changes the behaviour of attaching an uncore event to a software
group.
Before, we'd correctly clear the PERF_GROUP_SOFTWARE flag on the leader.
After this patch, we don't. That is a bug.
My original complaint was with the changes below.
> > > list_add_tail(&event->group_entry, &group_leader->sibling_list);
> > @@
> > > -1553,7 +1553,7 @@ event_sched_out(struct perf_event *event,
> > > event->pmu->del(event, 0);
> > > event->oncpu = -1;
> > >
> > > - if (!is_software_event(event))
> > > + if (is_hardware_event(event))
> > > cpuctx->active_oncpu--;
> > > if (!--ctx->nr_active)
> > > perf_event_ctx_deactivate(ctx);
Previously we'd call perf_event_ctx_deactivate() for an uncore PMU's
contexts, but now we never will.
> > > @@ -1881,7 +1881,7 @@ event_sched_in(struct perf_event *event,
> > > goto out;
> > > }
> > >
> > > - if (!is_software_event(event))
> > > + if (is_hardware_event(event))
> > > cpuctx->active_oncpu++;
> > > if (!ctx->nr_active++)
> > > perf_event_ctx_activate(ctx);
Similarly for perf_event_ctx_activate().
As I mention below, that means we will no longer perform throttling for
an uncore PMU's cpu context (see perf_event_task_tick()).
> > ... whereby we won't account uncore events as active, and therefore will
> > never perform throttling.
> >
> > That doesn't sound right.
>
> I think active_oncpu should only impact if the group is exclusive.
> The changes will make pure perf_invalid_context event group never exclusive.
> If that's a problem, I will change this part back.
I'm not sure what you mean here -- I can't see what a group being
exclusive has to do with any of the points above.
What am I missing?
Thanks,
Mark.
On Fri, Jul 17, 2015 at 04:47:26PM +0100, Mark Rutland wrote:
> On Fri, Jul 17, 2015 at 04:03:36PM +0100, Liang, Kan wrote:
> > >
> > > On Thu, Jul 16, 2015 at 09:33:45PM +0100, [email protected] wrote:
> > > > From: Kan Liang <[email protected]>
> > > >
> > > > Using is_hardware_event to replace !is_software_event to indicate a
> > > > hardware event.
> > >
> > > Why...?
> >
> > First, the comments of is_software_event is not correct.
> > 0 or !is_software_event is not for a hardware event.
> > is_hardware_event is for a hardware event.
>
> Circular logic is fantastic.
Sorry for the snark here. I completely misread this.
I agree that the comment is wrong. However, changing !is_software_event
to is_hardware_event is not always correct.
For example, perf_group_attach tests for the addition of a non-software
event to a software group, so we can mark the group as not consisting
solely of software events. For that to be done correctly, we need to
check !is_software_event.
I was wrong about the throttling, having confused active_oncpu and
nr_active. Sorry for the noise on that. However, as you mention, that
does prevent the use of exclusive events for uncore PMUs, and I don't
see why that should change.
Thanks,
Mark.
On Fri, Jul 17, 2015 at 8:35 AM, Liang, Kan <[email protected]> wrote:
>
>
>> On Fri, Jul 17, 2015 at 03:46:29PM +0200, Peter Zijlstra wrote:
>> > On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
>> > > > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
>> > > > b/arch/x86/kernel/cpu/perf_event_intel.c
>> > > > index b9826a9..651a86d 100644
>> > > > --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> > > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> > > > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct
>> pt_regs *regs)
>> > > > if (!x86_pmu.late_ack)
>> > > > apic_write(APIC_LVTPC, APIC_DM_NMI);
>> > > > __intel_pmu_disable_all();
>> > > > + if (cpuc->core_misc_active_mask)
>> > > > + intel_core_misc_pmu_disable();
>> > >
>> > > Huh? Free running counters have nothing to do with the PMU
>> > > interrupt; there's nothing they can do to trigger it. This feels very hacky.
>> > >
>> > > If this is necessary, surely it should live in __intel_pmu_disable_all?
>> > >
>> > > [...]
>> >
>> > Yeah this is crazy. It should not live in the regular PMU at all, not
>> > be Intel specific.
>>
>> lkml.kernel.org/r/2c37309d20afadf88ad4a82cf0ce02b9152801e2.143025615
>> [email protected]
>>
>> That does the right thing for free running MSRs.
>>
>> Take it and expand.
>
> The first patch did the similar thing as the link you shared with.
> Here is the first patch.
> https://lkml.org/lkml/2015/7/16/953
>
> This patch is expend the per-core core_misc PMU based on the first patch.
> I implemented this patch is because that one of the biggest concern
> from upstream for mix PMU group is that it breaks group semantics.
> When one PMU is stop, the other PMU is still running.
> So I introduce the enable/disable function. Other PMUs can discard the counter
> value for core_misc event when they are stop or in irq.
>
> If you think it should not live in the regular PMU, I can just remove the codes.
> We just keep core_misc event running and no harm in it.
I know very little about perf pmu organization, but I think that AMD
supports APERF and MPERF, too, so it may make sense to have that thing
live outside a file with "intel" in the name.
Also, should the driver detect those using the cpuid bit?
--Andy
>
> On Fri, Jul 17, 2015 at 8:35 AM, Liang, Kan <[email protected]> wrote:
> >
> >
> >> On Fri, Jul 17, 2015 at 03:46:29PM +0200, Peter Zijlstra wrote:
> >> > On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
> >> > > > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
> >> > > > b/arch/x86/kernel/cpu/perf_event_intel.c
> >> > > > index b9826a9..651a86d 100644
> >> > > > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> >> > > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> >> > > > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct
> >> pt_regs *regs)
> >> > > > if (!x86_pmu.late_ack)
> >> > > > apic_write(APIC_LVTPC, APIC_DM_NMI);
> >> > > > __intel_pmu_disable_all();
> >> > > > + if (cpuc->core_misc_active_mask)
> >> > > > + intel_core_misc_pmu_disable();
> >> > >
> >> > > Huh? Free running counters have nothing to do with the PMU
> >> > > interrupt; there's nothing they can do to trigger it. This feels very
> hacky.
> >> > >
> >> > > If this is necessary, surely it should live in __intel_pmu_disable_all?
> >> > >
> >> > > [...]
> >> >
> >> > Yeah this is crazy. It should not live in the regular PMU at all,
> >> > not be Intel specific.
> >>
> >>
> lkml.kernel.org/r/2c37309d20afadf88ad4a82cf0ce02b9152801e2.143025615
> >> [email protected]
> >>
> >> That does the right thing for free running MSRs.
> >>
> >> Take it and expand.
> >
> > The first patch did the similar thing as the link you shared with.
> > Here is the first patch.
> > https://lkml.org/lkml/2015/7/16/953
> >
> > This patch is expend the per-core core_misc PMU based on the first
> patch.
> > I implemented this patch is because that one of the biggest concern
> > from upstream for mix PMU group is that it breaks group semantics.
> > When one PMU is stop, the other PMU is still running.
> > So I introduce the enable/disable function. Other PMUs can discard the
> > counter value for core_misc event when they are stop or in irq.
> >
> > If you think it should not live in the regular PMU, I can just remove the
> codes.
> > We just keep core_misc event running and no harm in it.
>
> I know very little about perf pmu organization, but I think that AMD
> supports APERF and MPERF, too, so it may make sense to have that thing
> live outside a file with "intel" in the name.
>
> Also, should the driver detect those using the cpuid bit?
>
Hi Andy,
Yes, it detects the cpuid to determine which counters are available.
If we want to implement a common file for both Intel and AMD,
we can also check cpuid.06h.ecx[bit 0] for APERF/MPERF availability on Intel
platforms.
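A minimal sketch of such a check (untested; the helper name is made up, but
X86_FEATURE_APERFMPERF is the existing kernel flag for CPUID.06H:ECX[0], so
it should be set on any x86 CPU that advertises the bit, Intel or AMD):

	static bool core_misc_has_aperf_mperf(void)
	{
		return boot_cpu_has(X86_FEATURE_APERFMPERF);
	}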
Do you have a V2 patch already? I'm asking because you once
mentioned it... :)
Hi Peter,
I think I misread your meaning after go through all your comments
for Andy's patch.
Sorry for that.
Yes, I can add the APERF and MPERF part into Andy's patch.
As you suggested, make that perf_sw_context.
So we don't need to special case for mix PMUs.
But I think we still need a patch for support CORE_C*_RESIDENCY
and PKG_C*_RESIDENCY, which are Intel specific?
I will send it separately.
Thanks,
Kan
On Fri, Jul 17, 2015 at 10:52 AM, Liang, Kan <[email protected]> wrote:
>
>
>>
>> On Fri, Jul 17, 2015 at 8:35 AM, Liang, Kan <[email protected]> wrote:
>> >
>> >
>> >> On Fri, Jul 17, 2015 at 03:46:29PM +0200, Peter Zijlstra wrote:
>> >> > On Fri, Jul 17, 2015 at 01:11:41PM +0100, Mark Rutland wrote:
>> >> > > > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
>> >> > > > b/arch/x86/kernel/cpu/perf_event_intel.c
>> >> > > > index b9826a9..651a86d 100644
>> >> > > > --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> >> > > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> >> > > > @@ -1586,6 +1586,8 @@ static int intel_pmu_handle_irq(struct
>> >> pt_regs *regs)
>> >> > > > if (!x86_pmu.late_ack)
>> >> > > > apic_write(APIC_LVTPC, APIC_DM_NMI);
>> >> > > > __intel_pmu_disable_all();
>> >> > > > + if (cpuc->core_misc_active_mask)
>> >> > > > + intel_core_misc_pmu_disable();
>> >> > >
>> >> > > Huh? Free running counters have nothing to do with the PMU
>> >> > > interrupt; there's nothing they can do to trigger it. This feels very
>> hacky.
>> >> > >
>> >> > > If this is necessary, surely it should live in __intel_pmu_disable_all?
>> >> > >
>> >> > > [...]
>> >> >
>> >> > Yeah this is crazy. It should not live in the regular PMU at all,
>> >> > not be Intel specific.
>> >>
>> >>
>> lkml.kernel.org/r/2c37309d20afadf88ad4a82cf0ce02b9152801e2.143025615
>> >> [email protected]
>> >>
>> >> That does the right thing for free running MSRs.
>> >>
>> >> Take it and expand.
>> >
>> > The first patch did the similar thing as the link you shared with.
>> > Here is the first patch.
>> > https://lkml.org/lkml/2015/7/16/953
>> >
>> > This patch is expend the per-core core_misc PMU based on the first
>> patch.
>> > I implemented this patch is because that one of the biggest concern
>> > from upstream for mix PMU group is that it breaks group semantics.
>> > When one PMU is stop, the other PMU is still running.
>> > So I introduce the enable/disable function. Other PMUs can discard the
>> > counter value for core_misc event when they are stop or in irq.
>> >
>> > If you think it should not live in the regular PMU, I can just remove the
>> codes.
>> > We just keep core_misc event running and no harm in it.
>>
>> I know very little about perf pmu organization, but I think that AMD
>> supports APERF and MPERF, too, so it may make sense to have that thing
>> live outside a file with "intel" in the name.
>>
>> Also, should the driver detect those using the cpuid bit?
>>
>
> Hi Andy,
>
> Yes, it detects the cpuid to determine which counters are available.
> If we want to implement a common file for both Intel and AMD,
> we can also check cupid.06h.ecx[bit 0] for a/mperf availability on Intel
> platform.
>
> Do you have a V2 patch already? I'm asking is because you once
> mentioned it... :)
No, and I also don't have PPERF hardware, etc, so there's not really
much I can do. Feel free to do whatever you like with my v1.
--Andy
Hi,
On Fri, Jul 17, 2015 at 5:55 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 17, 2015 at 01:21:06PM +0100, Mark Rutland wrote:
>>
>> As with my earlier comments, I don't think these can be grouped with
>> events (not even from the same PMU given their free-running nature).
>>
>> They're CPU-affine, so you can associate them with work done on that
>> CPU.
>
> Just record the deltas from them while you're on.
Yes, free-running counters are already handled by the kernel; the RAPL
counters are a good example. The uncore IMC counters on SNB/IVB/HSW/BDW
client processors are another example.
Just compute deltas, and make sure you do not miss a wrap-around of the
counter if it is not wide enough to never wrap around.
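For a counter narrower than 64 bits, the wrap-safe delta could look roughly
like this (illustrative sketch only, not the actual patch code; 'width' is
the hardware counter width in bits):

	static u64 free_running_delta(u64 prev, u64 now, int width)
	{
		u64 mask = (width < 64) ? ((1ULL << width) - 1) : ~0ULL;

		/* Unsigned subtraction plus the mask handles a single wrap. */
		return (now - prev) & mask;
	}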
> >
> > Do you have a V2 patch already? I'm asking is because you once
> > mentioned it... :)
>
> No, and I also don't have PPERF hardware, etc, so there's not really much I
> can do. Feel free to do whatever you like with my v1.
>
OK. I will add my code based on V1, and send it back to you.
I don't have an AMD machine. Could you please test on one?
Thanks,
Kan
On Fri, Jul 17, 2015 at 11:15 AM, Liang, Kan <[email protected]> wrote:
>
>> >
>> > Do you have a V2 patch already? I'm asking is because you once
>> > mentioned it... :)
>>
>> No, and I also don't have PPERF hardware, etc, so there's not really much I
>> can do. Feel free to do whatever you like with my v1.
>>
> OK. I will add my code based on V1, and send it back to you.
> I don't have AMD machine. Could you please test on them?
>
No, because I don't either. But I'm sure someone will volunteer.
--Andy
> Thanks,
> Kan
--
Andy Lutomirski
AMA Capital Management, LLC
> As with my earlier comments, I don't think these can be grouped with
> events (not even from the same PMU given their free-running nature).
Mark, we already went through this last time. There is nothing
stopping handling free running counters as part of other groups.
A perf event logically has a 64bit counter that accumulates counts from
a less wide hardware counter. A free running counter just has
to be sampled at the beginning and at the end of the measurement
period, and the difference between the two values added to the perf
counter. To handle CPU switches the counter is just sampled, and
accumulated into the software counter, before switching to another CPU.
Then you start the next measurement period with a sample from the
new CPU etc.
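In code, that accumulation is roughly the usual pattern (illustrative only,
not the patch code; read_free_running_counter() stands in for whatever
rdmsrl() wrapper the driver uses):

	static void free_running_event_update(struct perf_event *event)
	{
		u64 prev, now;

		/* Fold the delta since the last sample into the 64-bit count. */
		do {
			prev = local64_read(&event->hw.prev_count);
			now = read_free_running_counter(event);
		} while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);

		local64_add(now - prev, &event->count);
	}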
-Andi
> + if (sample->read.time_running > 0) {
> + freq.tsc_freq = (1000 * sample->tsc) / sample->read.time_running;
> + freq.avg_freq = (1000 * sample->aperf) / sample->read.time_running;
> + if (sample->aperf > 0)
> + freq.bzy_freq = freq.tsc_freq * sample->mperf / sample->aperf;
Sorry, I didn't notice that earlier. The formula is not correct.
aperf/mperf is not necessarily the frequency; it is essentially a load average
of the CPU, and should be reported as such. Also, only the ratio is
architecturally defined.
The right way to compute frequency is cycles / ref-cycles.
TSC can be used to accurately compute CPU utilization: tsc / ref-cycles.
It would be useful to report all three metrics.
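A hedged sketch of those metrics, assuming cycles, ref_cycles and tsc are
deltas over the same sample interval, and reading "utilization" as the
unhalted fraction of that interval (tsc_mhz is an assumed TSC frequency in
MHz; this is illustrative only, not the patch code):

	/* Ratio of actual to base (TSC) frequency while unhalted. */
	double freq_ratio = (double)cycles / (double)ref_cycles;

	/* Fraction of the interval the CPU spent unhalted. */
	double cpu_util = (double)ref_cycles / (double)tsc;

	/* Average busy MHz over the interval. */
	double busy_mhz = tsc_mhz * freq_ratio;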
-Andi
>
> > + if (sample->read.time_running > 0) {
> > + freq.tsc_freq = (1000 * sample->tsc) / sample-
> >read.time_running;
> > + freq.avg_freq = (1000 * sample->aperf) / sample-
> >read.time_running;
> > + if (sample->aperf > 0)
> > + freq.bzy_freq = freq.tsc_freq * sample->mperf /
> sample->aperf;
>
> Sorry didn't notice that earlier. The formula is not correct.
> aperf/mperf is not necessarily the frequency, it is essentially a load
> average of the CPU. It should be reported as such. Also only the ratio is
> architecturally defined.
>
The concepts of tsc, avg and bzy freq are from turbostat.
Here are the definitions from the turbostat readme:
- AVG_MHz = APERF_delta/measurement_interval. This is the actual
number of elapsed cycles divided by the entire sample interval
- TSC_MHz = TSC_delta/measurement_interval.
On a system with an invariant TSC, this value will be constant
and will closely match the base frequency value
- Bzy_MHz = TSC_delta/APERF_delta/MPERF_delta/measurement_interval
Only the Bzy_MHz one is wrong and has a typo.
The other formulas should be correct.
If it's confusing, I will change the names and make them consistent with turbostat.
> The right way to compute frequency is cycles / ref-cycles TSC can be used
> to accurately compute CPU utilization tsc / ref-cycles
I think I can add support for the frequency and CPU% calculations,
and show them in --stdio.
Thanks,
Kan
>
> It would be useful to report all three metrics.
>
> -Andi
On Fri, Jul 17, 2015 at 06:15:05PM +0000, Liang, Kan wrote:
>
> > >
> > > Do you have a V2 patch already? I'm asking is because you once
> > > mentioned it... :)
> >
> > No, and I also don't have PPERF hardware, etc, so there's not really much I
> > can do. Feel free to do whatever you like with my v1.
> >
> OK. I will add my code based on V1, and send it back to you.
> I don't have AMD machine. Could you please test on them?
I have an AMD machine (interlagos based).
> The concept of tsc, avg and bzy are from turbostat.
> Here is the definition from turbostat readme.
turbostat can do this because it is model specific, but perf user code
is not; it has to stick to architectural definitions only. And the
architectural definition of ASTATE/MSTATE in the SDM is that it is only
valid as a ratio.
Please use the formulas I described.
-Andi
--
[email protected] -- Speaking for myself only
On Fri, Jul 17, 2015 at 09:17:47PM +0100, Andi Kleen wrote:
> > As with my earlier comments, I don't think these can be grouped with
> > events (not even from the same PMU given their free-running nature).
>
> Mark, we already went through this last time. There is nothing
> stopping handling free running counters as part of other groups.
Ok. It's inexact (but only marginally), and not the end of the world if
you know that's the case.
My concern last time was mainly with the grouping of uncore and CPU
events.
Thanks,
Mark.