LinuxLists.cc - [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

2018-09-19 17:30:50

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

Dear Maintainers,

This new series fixing the lack of coordination between the
pseudo-locking measurement code and perf addresses all feedback received
up to and including V4.

Changes since V4:
- Move the helper to obtain the performance counter index
(x86_perf_rdpmc_index()) to arch/x86/events/core.c with an extern
declaration in arch/x86/include/asm/perf_event.h. Please do note that
this change introduces a checkpatch.pl warning that was discussed
and found to be a false positive in:
http://lkml.kernel.org/r/[email protected]
The new warning is:
> CHECK: extern prototypes should be avoided in .h files
> #66: FILE: arch/x86/include/asm/perf_event.h:273:
> +extern int x86_perf_rdpmc_index(struct perf_event *event);

- Change the x86_perf_rdpmc_index() comment slightly:
"... could be checked for validity ..." -> "... should be checked for
validity ...". No code changes.
- Add comments to the measurement routines to document that on measurement
failure zeroes will be written to the trace buffer instead of leaving an
empty trace buffer. No code changes.

Changes since V3:
- The kbuild test robot reported a build issue that at first seems to be
related to "make ARCH=i386" but was actually a result of the kernel
configuration used not having CONFIG_TRACEPOINTS set.
The consequence was that include/linux/perf_event.h was not being included
via the tracing code (from include/trace/define_trace.h) and we thus
encounter the build failure where struct perf_event_attr is unknown.
Fix this by explicitly including perf_event.h

- Below is verbatim from V3 submission (except for diffstat below) -

Changes since V2:
- Move the helper to obtain the performance counter index to
include/linux/perf_event.h. The request was actually to move this helper
to arch/x86/include/asm/perf_event.h - but doing so would be more
involved since this header file does not know about struct perf_event
that is used by this helper. There was no response for further
clarification of the request to move this helper so I proceeded to move it
to include/linux/perf_event.h instead.
- Change name of helper to obtain the index to perf_rdpmc_index() - the
original request was to name it x86_perf_rdpmc_index() but this seems to
be tied to the suggested header location. With the header location of
include/linux/perf_event.h the name perf_rdpmc_index() seems to fit
better with the new destination. There was no response for further
clarification of the naming change request so I proceeded with the change.
- Replace all local register variables used in the measurement routines
with local variables using READ_ONCE().
- The removal of local register variables also enable us to replace the
direct __wrmsr() with wrmsr().
- Merge the L2 and L3 measurement routines following Peter's suggested
framework.
- Do not copy the text from SDM that refers to serializing instructions.
- Include another LFENCE call after loop reading pseudo-locked memory.

The above addresses all feedback received for V2. The one unanswered
question that remains following the review is why the memory reading
was done with asm: the reason I did so was to avoid any compiler
optimizations while constraining the code exactly to what needed to be
measured. By using the asm instruction I am able to use a single instruction
to read a cache line into a register. To me this seemed the most constrained
way to measure if a cache line is in the cache.

This is the second attempt at fixing the lack of coordination between the
pseudo-locking measurement code and perf. Thank you very much for your
feedback on the first version. The entire solution, including the cover
letter, has been reworked based on your feedback, while submitted as a V2,
none of the patches from V1 remained.

Changes since V1:
- Use in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner. perf maintains role as performing
resource arbitration for PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an interrupts
disabled section.
- Internals of events are never accessed directly, inline function used.
- Due to "pinned" usage the scheduling of event may have failed. Error
state is checked in recommended way and have a credible error
handling.
- use X86_CONFIG

This code is based on the x86/cache branch of tip.git

The success of Cache Pseudo-Locking, as measured by how many cache lines
from a physical memory region has been locked to cache, can be measured
via the use of hardware performance events. Specifically, the number of
cache hits and misses reading a memory region after it has been
pseudo-locked to cache. This measurement is triggered via the resctrl
debugfs interface.

The current solution accesses performance counters and their configuration
registers directly without coordination with other performance event users
(perf).
Two of the issues that exist with the current solution:
- By writing to the performance monitoring registers directly a new owner
for these resources is introduced. The perf infrastructure already exist
to perform resource arbitration and the in-kernel infrastructure should
be used to do so.
- The current lack of coordination with perf will have consequences any time
two users, for example perf and cache pseudo-locking, attempt to do any
kind of measurement at the same time.

In this series the measurement of Cache Pseudo-Lock regions is moved to use
the in-kernel interface to perf. During the rework of the measurement
function the L2 and L3 cache measurements are separated to avoid the
additional code needed to decide on which measurement causing unrelated
cache hits and misses.

Your feedback on this work will be greatly appreciated.

Reinette

Reinette Chatre (6):
perf/core: Add sanity check to deal with pinned event failure
perf/x86: Add helper to obtain performance counter index
x86/intel_rdt: Remove local register variables
x86/intel_rdt: Create required perf event attributes
x86/intel_rdt: Use perf infrastructure for measurements
x86/intel_rdt: Re-enable pseudo-lock measurements

Documentation/x86/intel_rdt_ui.txt | 22 +-
arch/x86/events/core.c | 21 ++
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
kernel/events/core.c | 6 +
5 files changed, 261 insertions(+), 161 deletions(-)

--
2.17.0

2018-09-19 17:30:33

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents additional cache
misses incurred during the extra testing code used to decide if a
L2 and/or L3 measurement should be made.

Signed-off-by: Reinette Chatre <[email protected]>
---
Documentation/x86/intel_rdt_ui.txt | 22 +-
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 305 ++++++++++++--------
2 files changed, 204 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
2) Cache hit and miss measurements using model specific precision counters if
available. Depending on the levels of cache on the system the pseudo_lock_l2
and pseudo_lock_l3 tracepoints are available.
- WARNING: triggering this measurement uses from two (for just L2
- measurements) to four (for L2 and L3 measurements) precision counters on
- the system, if any other measurements are in progress the counters and
- their corresponding event registers will be clobbered.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 - writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..c35b975e3b1e 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
#include <asm/intel_rdt_sched.h>
#include <asm/perf_event.h>

+#include "../../events/perf_event.h" /* For X86_CONFIG() */
#include "intel_rdt.h"

#define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
return 0;
}

-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
- __wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
/**
* pseudo_lock_minor_get - Obtain available minor number
* @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
* The actual configuration of the event is set right before use in order
* to use the X86_CONFIG macro.
*/
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
.exclude_user = 1,
};

-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -941,139 +932,216 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
.exclude_user = 1,
};

-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+ u64 miss_before, hits_before;
+ u64 miss_after, hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+ struct perf_event_attr *hit_attr,
+ struct pseudo_lock_region *plr,
+ struct residency_counts *counts)
{
- unsigned long long l3_hits = 0, l3_miss = 0;
- u64 l3_hit_bits = 0, l3_miss_bits = 0;
- struct pseudo_lock_region *plr = _plr;
- unsigned long long l2_hits, l2_miss;
- u64 l2_hit_bits, l2_miss_bits;
+ u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+ struct perf_event *miss_event, *hit_event;
+ int hit_pmcnum, miss_pmcnum;
unsigned int line_size;
unsigned int size;
unsigned long i;
void *mem_r;
+ u64 tmp;

- /*
- * Non-architectural event for the Goldmont Microarchitecture
- * from Intel x86 Architecture Software Developer Manual (SDM):
- * MEM_LOAD_UOPS_RETIRED D1H (event number)
- * Umask values:
- * L1_HIT 01H
- * L2_HIT 02H
- * L1_MISS 08H
- * L2_MISS 10H
- *
- * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
- * has two "no fix" errata associated with it: BDM35 and BDM100. On
- * this platform we use the following events instead:
- * L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
- * REFERENCES FFH
- * MISS 3FH
- * LONGEST_LAT_CACHE 2EH (Documented in SDM)
- * REFERENCE 4FH
- * MISS 41H
- */
+ miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(miss_event))
+ goto out;

+ hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(hit_event))
+ goto out_miss;
+
+ local_irq_disable();
/*
- * Start by setting flags for IA32_PERFEVTSELx:
- * OS (Operating system mode) 0x2
- * INT (APIC interrupt enable) 0x10
- * EN (Enable counter) 0x40
- *
- * Then add the Umask value and event number to select performance
- * event.
+ * Check any possible error state of events used by performing
+ * one local read.
*/
-
- switch (boot_cpu_data.x86_model) {
- case INTEL_FAM6_ATOM_GOLDMONT:
- case INTEL_FAM6_ATOM_GEMINI_LAKE:
- l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
- l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
- break;
- case INTEL_FAM6_BROADWELL_X:
- /* On BDW the l2_hit_bits count references, not hits */
- l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
- l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
- /* On BDW the l3_hit_bits count references, not hits */
- l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
- l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
- break;
- default:
- goto out;
+ if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
+ }
+ if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
}

- local_irq_disable();
/*
* Disable hardware prefetchers.
*/
wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
- /* Disable events and reset counters */
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
- }
- /* Set and enable the L2 counters */
+
+ /* Initialize rest of local variables */
+ /*
+ * Performance event has been validated right before this with
+ * interrupts disabled - it is thus safe to read the counter index.
+ */
+ miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+ hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+ line_size = READ_ONCE(plr->line_size);
mem_r = READ_ONCE(plr->kmem);
size = READ_ONCE(plr->size);
- line_size = READ_ONCE(plr->line_size);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits);
- }
+
+ /*
+ * Read counter variables twice - first to load the instructions
+ * used in L1 cache, second to capture accurate value that does not
+ * include cache misses incurred because of instruction loads.
+ */
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * From SDM: Performing back-to-back fast reads are not guaranteed
+ * to be monotonic.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
for (i = 0; i < size; i += line_size) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
asm volatile("mov (%0,%1,1), %%eax\n\t"
:
: "r" (mem_r), "r" (i)
: "%eax", "memory");
}
/*
- * Call wrmsr directly (no tracing) to not influence
- * the cache access counters as they are disabled.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
*/
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
- l2_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
- l2_miss_bits & ~(0x40ULL << 16));
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits & ~(0x40ULL << 16));
- }
- l2_hits = native_read_pmc(0);
- l2_miss = native_read_pmc(1);
- if (l3_hit_bits > 0) {
- l3_hits = native_read_pmc(2);
- l3_miss = native_read_pmc(3);
- }
+ rmb();
+ rdpmcl(hit_pmcnum, hits_after);
+ rdpmcl(miss_pmcnum, miss_after);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ /* Re-enable hardware prefetchers */
wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
local_irq_enable();
+out_hit:
+ perf_event_release_kernel(hit_event);
+out_miss:
+ perf_event_release_kernel(miss_event);
+out:
+ /*
+ * All counts will be zero on failure.
+ */
+ counts->miss_before = miss_before;
+ counts->hits_before = hits_before;
+ counts->miss_after = miss_after;
+ counts->hits_after = hits_after;
+ return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
+ /*
+ * Non-architectural event for the Goldmont Microarchitecture
+ * from Intel x86 Architecture Software Developer Manual (SDM):
+ * MEM_LOAD_UOPS_RETIRED D1H (event number)
+ * Umask values:
+ * L2_HIT 02H
+ * L2_MISS 10H
+ */
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ATOM_GOLDMONT:
+ case INTEL_FAM6_ATOM_GEMINI_LAKE:
+ perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x10);
+ perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x2);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+ trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+ counts.miss_after - counts.miss_before);
+out:
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
/*
- * On BDW we count references and misses, need to adjust. Sometimes
- * the "hits" counter is a bit more than the references, for
- * example, x references but x + 1 hits. To not report invalid
- * hit values in this case we treat that as misses eaqual to
- * references.
+ * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+ * has two "no fix" errata associated with it: BDM35 and BDM100. On
+ * this platform the following events are used instead:
+ * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+ * REFERENCE 4FH
+ * MISS 41H
*/
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
- trace_pseudo_lock_l2(l2_hits, l2_miss);
- if (l3_hit_bits > 0) {
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
- trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_BROADWELL_X:
+ /* On BDW the hit event counts references, not hits */
+ perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x4f);
+ perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x41);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+
+ counts.miss_after -= counts.miss_before;
+ if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+ /*
+ * On BDW references and misses are counted, need to adjust.
+ * Sometimes the "hits" counter is a bit more than the
+ * references, for example, x references but x + 1 hits.
+ * To not report invalid hit values in this case we treat
+ * that as misses equal to references.
+ */
+ /* First compute the number of cache references measured */
+ counts.hits_after -= counts.hits_before;
+ /* Next convert references to cache hits */
+ counts.hits_after -= counts.miss_after > counts.hits_after ?
+ counts.hits_after : counts.miss_after;
+ } else {
+ counts.hits_after -= counts.hits_before;
}

+ trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
out:
plr->thread_done = 1;
wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1180,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
goto out;
}

+ plr->cpu = cpu;
+
if (sel == 1)
thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);
else if (sel == 2)
- thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+ thread = kthread_create_on_node(measure_l2_residency, plr,
+ cpu_to_node(cpu),
+ "pseudo_lock_measure/%u",
+ cpu);
+ else if (sel == 3)
+ thread = kthread_create_on_node(measure_l3_residency, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);
--
2.17.0

2018-09-19 17:30:41

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index

perf_event_read_local() is the safest way to obtain measurements
associated with performance events. In some cases the overhead
introduced by perf_event_read_local() affects the measurements and the
use of rdpmcl() is needed. rdpmcl() requires the index
of the performance counter used so a helper is introduced to determine
the index used by a provided performance event.

The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check the use of this counter
needs to be done with care to ensure it is queried and used within the
same disabled interrupts section.

This change introduces a new checkpatch warning:
CHECK: extern prototypes should be avoided in .h files
+extern int x86_perf_rdpmc_index(struct perf_event *event);

This warning was discussed and designated as a false positive in
http://lkml.kernel.org/r/[email protected]

Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
---
arch/x86/events/core.c | 21 +++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 1 +
2 files changed, 22 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dfb2f7c0d019..3550d800b030 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,6 +1033,27 @@ static inline void x86_assign_hw_event(struct perf_event *event,
}
}

+/**
+ * x86_perf_rdpmc_index - Return PMC counter used for event
+ * @event: the perf_event to which the PMC counter was assigned
+ *
+ * The counter assigned to this performance event may change if interrupts
+ * are enabled. This counter should thus never be used while interrupts are
+ * enabled. Before this function is used to obtain the assigned counter the
+ * event should be checked for validity using, for example,
+ * perf_event_read_local(), within the same interrupt disabled section in
+ * which this counter is planned to be used.
+ *
+ * Return: The index of the performance monitoring counter assigned to
+ * @perf_event.
+ */
+int x86_perf_rdpmc_index(struct perf_event *event)
+{
+ lockdep_assert_irqs_disabled();
+
+ return event->hw.event_base_rdpmc;
+}
+
static inline int match_prev_assignment(struct hw_perf_event *hwc,
struct cpu_hw_events *cpuc,
int i)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 12f54082f4c8..b2cf84c35a6d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -270,6 +270,7 @@ struct perf_guest_switch_msr {
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
+extern int x86_perf_rdpmc_index(struct perf_event *event);
#else
static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
{
--
2.17.0

2018-09-19 17:30:48

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes

A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.

In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, what specifies what needs to be measured, is what
will be different between the events used. This configuration needs to
be done with X86_CONFIG that cannot be used as part of the designated
initializers used here, this will be introduced later.

Although they do look identical at this time the attribute structures
needs to be maintained separately since a perf_event will maintain a
pointer to its unique attributes.

In support of patch testing the new structs are given the unused attribute
until their use in later patches.

Signed-off-by: Reinette Chatre <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 26 +++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 8ad83eb3fc89..33d7968f152a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -17,6 +17,7 @@
#include <linux/debugfs.h>
#include <linux/kthread.h>
#include <linux/mman.h>
+#include <linux/perf_event.h>
#include <linux/pm_qos.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
@@ -915,6 +916,31 @@ static int measure_cycles_lat_fn(void *_plr)
return 0;
}

+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
+static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
static int measure_cycles_perf_fn(void *_plr)
{
unsigned long long l3_hits = 0, l3_miss = 0;
--
2.17.0

2018-09-19 17:31:27

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements

Commit 4a7a54a55e72 ("x86/intel_rdt: Disable PMU access") disabled
measurements of pseudo-locked regions because of incorrect usage
of the performance monitoring hardware.

Cache pseudo-locking measurements are now done correctly with the
in-kernel perf API and its use can be re-enabled at this time.

The adjustment to the in-kernel perf API also separated the L2 and L3
measurements that can be triggered separately from user space. The
re-enabling of the measurements is thus not a simple revert of the
original disable in order to accommodate the additional parameter
possible.

Signed-off-by: Reinette Chatre <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index c35b975e3b1e..edf30c94c226 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1237,7 +1237,7 @@ static ssize_t pseudo_lock_measure_trigger(struct file *file,
buf[buf_size] = '\0';
ret = kstrtoint(buf, 10, &sel);
if (ret == 0) {
- if (sel != 1)
+ if (sel != 1 && sel != 2 && sel != 3)
return -EINVAL;
ret = debugfs_file_get(file->f_path.dentry);
if (ret)
--
2.17.0

2018-09-19 17:32:07

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure

It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and have a credible error.

Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
---
kernel/events/core.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a62b96600ad..191c0c2e10de 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3940,6 +3940,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
goto out;
}

+ /* If this is a pinned event it must be running on this CPU */
+ if (event->attr.pinned && event->oncpu != smp_processor_id()) {
+ ret = -EBUSY;
+ goto out;
+ }
+
/*
* If the event is currently on this CPU, its either a per-task event,
* or local to this CPU. Furthermore it means its ACTIVE (otherwise
--
2.17.0

2018-09-19 17:32:30

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V5 3/6] x86/intel_rdt: Remove local register variables

Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.

While local register variables do accomplish the goal they do require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.

Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.

Make use of local variables in the critical code but do so with
READ_ONCE() to prevent the compiler from merging or refetching reads.
Ensure these variables are initialized before the measurement starts,
and ensure it is only the local variables that are accessed during
the measurement.

With the removal of the local register variables and using READ_ONCE()
there is no longer a motivation for using a direct wrmsr call (that
avoids the additional tracing code that may clobber the local register
variables).

Signed-off-by: Reinette Chatre <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 53 ++++-----------------
1 file changed, 9 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 40f3903ae5d9..8ad83eb3fc89 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -886,31 +886,14 @@ static int measure_cycles_lat_fn(void *_plr)
struct pseudo_lock_region *plr = _plr;
unsigned long i;
u64 start, end;
-#ifdef CONFIG_KASAN
- /*
- * The registers used for local register variables are also used
- * when KASAN is active. When KASAN is active we use a regular
- * variable to ensure we always use a valid pointer to access memory.
- * The cost is that accessing this pointer, which could be in
- * cache, will be included in the measurement of memory read latency.
- */
void *mem_r;
-#else
-#ifdef CONFIG_X86_64
- register void *mem_r asm("rbx");
-#else
- register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */

local_irq_disable();
/*
- * The wrmsr call may be reordered with the assignment below it.
- * Call wrmsr as directly as possible to avoid tracing clobbering
- * local register variable used for memory pointer.
+ * Disable hardware prefetchers.
*/
- __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
- mem_r = plr->kmem;
+ wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+ mem_r = READ_ONCE(plr->kmem);
/*
* Dummy execute of the time measurement to load the needed
* instructions into the L1 instruction cache.
@@ -939,26 +922,10 @@ static int measure_cycles_perf_fn(void *_plr)
struct pseudo_lock_region *plr = _plr;
unsigned long long l2_hits, l2_miss;
u64 l2_hit_bits, l2_miss_bits;
- unsigned long i;
-#ifdef CONFIG_KASAN
- /*
- * The registers used for local register variables are also used
- * when KASAN is active. When KASAN is active we use regular variables
- * at the cost of including cache access latency to these variables
- * in the measurements.
- */
unsigned int line_size;
unsigned int size;
+ unsigned long i;
void *mem_r;
-#else
- register unsigned int line_size asm("esi");
- register unsigned int size asm("edi");
-#ifdef CONFIG_X86_64
- register void *mem_r asm("rbx");
-#else
- register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */

/*
* Non-architectural event for the Goldmont Microarchitecture
@@ -1011,11 +978,9 @@ static int measure_cycles_perf_fn(void *_plr)

local_irq_disable();
/*
- * Call wrmsr direcly to avoid the local register variables from
- * being overwritten due to reordering of their assignment with
- * the wrmsr calls.
+ * Disable hardware prefetchers.
*/
- __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+ wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
/* Disable events and reset counters */
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
@@ -1028,6 +993,9 @@ static int measure_cycles_perf_fn(void *_plr)
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
}
/* Set and enable the L2 counters */
+ mem_r = READ_ONCE(plr->kmem);
+ size = READ_ONCE(plr->size);
+ line_size = READ_ONCE(plr->line_size);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
if (l3_hit_bits > 0) {
@@ -1036,9 +1004,6 @@ static int measure_cycles_perf_fn(void *_plr)
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
l3_miss_bits);
}
- mem_r = plr->kmem;
- size = plr->size;
- line_size = plr->line_size;
for (i = 0; i < size; i += line_size) {
asm volatile("mov (%0,%1,1), %%eax\n\t"
:
--
2.17.0

2018-09-20 14:12:10

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements

On Wed, Sep 19, 2018 at 10:29:10AM -0700, Reinette Chatre wrote:
> +static int measure_l3_residency(void *_plr)
> +{

> + counts.miss_after -= counts.miss_before;
> + if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
> + /*
> + * On BDW references and misses are counted, need to adjust.
> + * Sometimes the "hits" counter is a bit more than the
> + * references, for example, x references but x + 1 hits.
> + * To not report invalid hit values in this case we treat
> + * that as misses equal to references.
> + */
> + /* First compute the number of cache references measured */
> + counts.hits_after -= counts.hits_before;

> + /* Next convert references to cache hits */
> + counts.hits_after -= counts.miss_after > counts.hits_after ?
> + counts.hits_after : counts.miss_after;

Maybe:
counts.hits_after -= min(counts.hits_after, counts.miss_after);

2018-09-20 14:12:33

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> Reinette Chatre (6):
> perf/core: Add sanity check to deal with pinned event failure
> perf/x86: Add helper to obtain performance counter index
> x86/intel_rdt: Remove local register variables
> x86/intel_rdt: Create required perf event attributes
> x86/intel_rdt: Use perf infrastructure for measurements
> x86/intel_rdt: Re-enable pseudo-lock measurements
>
> Documentation/x86/intel_rdt_ui.txt | 22 +-
> arch/x86/events/core.c | 21 ++
> arch/x86/include/asm/perf_event.h | 1 +
> arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
> kernel/events/core.c | 6 +
> 5 files changed, 261 insertions(+), 161 deletions(-)

Yeah, these look good, thanks!

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2018-09-20 19:03:46

by Reinette Chatre

[permalink] [raw]

Subject: [PATCH V6 5/6] x86/intel_rdt: Use perf infrastructure for measurements

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents additional cache
misses incurred during the extra testing code used to decide if a
L2 and/or L3 measurement should be made.

Signed-off-by: Reinette Chatre <[email protected]>
---
V6:
- Replace an expanded minimum check with min(). Thanks to Peter for
noticing this.

Documentation/x86/intel_rdt_ui.txt | 22 +-
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 304 ++++++++++++--------
2 files changed, 203 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
2) Cache hit and miss measurements using model specific precision counters if
available. Depending on the levels of cache on the system the pseudo_lock_l2
and pseudo_lock_l3 tracepoints are available.
- WARNING: triggering this measurement uses from two (for just L2
- measurements) to four (for L2 and L3 measurements) precision counters on
- the system, if any other measurements are in progress the counters and
- their corresponding event registers will be clobbered.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 - writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..d68836139cf9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
#include <asm/intel_rdt_sched.h>
#include <asm/perf_event.h>

+#include "../../events/perf_event.h" /* For X86_CONFIG() */
#include "intel_rdt.h"

#define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
return 0;
}

-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
- __wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
/**
* pseudo_lock_minor_get - Obtain available minor number
* @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
* The actual configuration of the event is set right before use in order
* to use the X86_CONFIG macro.
*/
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
.exclude_user = 1,
};

-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -941,139 +932,215 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
.exclude_user = 1,
};

-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+ u64 miss_before, hits_before;
+ u64 miss_after, hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+ struct perf_event_attr *hit_attr,
+ struct pseudo_lock_region *plr,
+ struct residency_counts *counts)
{
- unsigned long long l3_hits = 0, l3_miss = 0;
- u64 l3_hit_bits = 0, l3_miss_bits = 0;
- struct pseudo_lock_region *plr = _plr;
- unsigned long long l2_hits, l2_miss;
- u64 l2_hit_bits, l2_miss_bits;
+ u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+ struct perf_event *miss_event, *hit_event;
+ int hit_pmcnum, miss_pmcnum;
unsigned int line_size;
unsigned int size;
unsigned long i;
void *mem_r;
+ u64 tmp;

- /*
- * Non-architectural event for the Goldmont Microarchitecture
- * from Intel x86 Architecture Software Developer Manual (SDM):
- * MEM_LOAD_UOPS_RETIRED D1H (event number)
- * Umask values:
- * L1_HIT 01H
- * L2_HIT 02H
- * L1_MISS 08H
- * L2_MISS 10H
- *
- * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
- * has two "no fix" errata associated with it: BDM35 and BDM100. On
- * this platform we use the following events instead:
- * L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
- * REFERENCES FFH
- * MISS 3FH
- * LONGEST_LAT_CACHE 2EH (Documented in SDM)
- * REFERENCE 4FH
- * MISS 41H
- */
+ miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(miss_event))
+ goto out;

+ hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(hit_event))
+ goto out_miss;
+
+ local_irq_disable();
/*
- * Start by setting flags for IA32_PERFEVTSELx:
- * OS (Operating system mode) 0x2
- * INT (APIC interrupt enable) 0x10
- * EN (Enable counter) 0x40
- *
- * Then add the Umask value and event number to select performance
- * event.
+ * Check any possible error state of events used by performing
+ * one local read.
*/
-
- switch (boot_cpu_data.x86_model) {
- case INTEL_FAM6_ATOM_GOLDMONT:
- case INTEL_FAM6_ATOM_GEMINI_LAKE:
- l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
- l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
- break;
- case INTEL_FAM6_BROADWELL_X:
- /* On BDW the l2_hit_bits count references, not hits */
- l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
- l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
- /* On BDW the l3_hit_bits count references, not hits */
- l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
- l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
- break;
- default:
- goto out;
+ if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
+ }
+ if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
}

- local_irq_disable();
/*
* Disable hardware prefetchers.
*/
wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
- /* Disable events and reset counters */
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
- }
- /* Set and enable the L2 counters */
+
+ /* Initialize rest of local variables */
+ /*
+ * Performance event has been validated right before this with
+ * interrupts disabled - it is thus safe to read the counter index.
+ */
+ miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+ hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+ line_size = READ_ONCE(plr->line_size);
mem_r = READ_ONCE(plr->kmem);
size = READ_ONCE(plr->size);
- line_size = READ_ONCE(plr->line_size);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits);
- }
+
+ /*
+ * Read counter variables twice - first to load the instructions
+ * used in L1 cache, second to capture accurate value that does not
+ * include cache misses incurred because of instruction loads.
+ */
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * From SDM: Performing back-to-back fast reads are not guaranteed
+ * to be monotonic.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
for (i = 0; i < size; i += line_size) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
asm volatile("mov (%0,%1,1), %%eax\n\t"
:
: "r" (mem_r), "r" (i)
: "%eax", "memory");
}
/*
- * Call wrmsr directly (no tracing) to not influence
- * the cache access counters as they are disabled.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
*/
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
- l2_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
- l2_miss_bits & ~(0x40ULL << 16));
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits & ~(0x40ULL << 16));
- }
- l2_hits = native_read_pmc(0);
- l2_miss = native_read_pmc(1);
- if (l3_hit_bits > 0) {
- l3_hits = native_read_pmc(2);
- l3_miss = native_read_pmc(3);
- }
+ rmb();
+ rdpmcl(hit_pmcnum, hits_after);
+ rdpmcl(miss_pmcnum, miss_after);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ /* Re-enable hardware prefetchers */
wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
local_irq_enable();
+out_hit:
+ perf_event_release_kernel(hit_event);
+out_miss:
+ perf_event_release_kernel(miss_event);
+out:
+ /*
+ * All counts will be zero on failure.
+ */
+ counts->miss_before = miss_before;
+ counts->hits_before = hits_before;
+ counts->miss_after = miss_after;
+ counts->hits_after = hits_after;
+ return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
+ /*
+ * Non-architectural event for the Goldmont Microarchitecture
+ * from Intel x86 Architecture Software Developer Manual (SDM):
+ * MEM_LOAD_UOPS_RETIRED D1H (event number)
+ * Umask values:
+ * L2_HIT 02H
+ * L2_MISS 10H
+ */
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ATOM_GOLDMONT:
+ case INTEL_FAM6_ATOM_GEMINI_LAKE:
+ perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x10);
+ perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x2);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+ trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+ counts.miss_after - counts.miss_before);
+out:
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
/*
- * On BDW we count references and misses, need to adjust. Sometimes
- * the "hits" counter is a bit more than the references, for
- * example, x references but x + 1 hits. To not report invalid
- * hit values in this case we treat that as misses eaqual to
- * references.
+ * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+ * has two "no fix" errata associated with it: BDM35 and BDM100. On
+ * this platform the following events are used instead:
+ * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+ * REFERENCE 4FH
+ * MISS 41H
*/
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
- trace_pseudo_lock_l2(l2_hits, l2_miss);
- if (l3_hit_bits > 0) {
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
- trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_BROADWELL_X:
+ /* On BDW the hit event counts references, not hits */
+ perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x4f);
+ perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x41);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+
+ counts.miss_after -= counts.miss_before;
+ if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+ /*
+ * On BDW references and misses are counted, need to adjust.
+ * Sometimes the "hits" counter is a bit more than the
+ * references, for example, x references but x + 1 hits.
+ * To not report invalid hit values in this case we treat
+ * that as misses equal to references.
+ */
+ /* First compute the number of cache references measured */
+ counts.hits_after -= counts.hits_before;
+ /* Next convert references to cache hits */
+ counts.hits_after -= min(counts.miss_after, counts.hits_after);
+ } else {
+ counts.hits_after -= counts.hits_before;
}

+ trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
out:
plr->thread_done = 1;
wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1179,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
goto out;
}

+ plr->cpu = cpu;
+
if (sel == 1)
thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);
else if (sel == 2)
- thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+ thread = kthread_create_on_node(measure_l2_residency, plr,
+ cpu_to_node(cpu),
+ "pseudo_lock_measure/%u",
+ cpu);
+ else if (sel == 3)
+ thread = kthread_create_on_node(measure_l3_residency, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);
--
2.17.0

2018-09-21 16:52:29

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

Dear Maintainers,

On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
>> Reinette Chatre (6):
>> perf/core: Add sanity check to deal with pinned event failure
>> perf/x86: Add helper to obtain performance counter index
>> x86/intel_rdt: Remove local register variables
>> x86/intel_rdt: Create required perf event attributes
>> x86/intel_rdt: Use perf infrastructure for measurements
>> x86/intel_rdt: Re-enable pseudo-lock measurements
>>
>> Documentation/x86/intel_rdt_ui.txt | 22 +-
>> arch/x86/events/core.c | 21 ++
>> arch/x86/include/asm/perf_event.h | 1 +
>> arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
>> kernel/events/core.c | 6 +
>> 5 files changed, 261 insertions(+), 161 deletions(-)
>
> Yeah, these look good, thanks!
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>

Could you please consider this series for inclusion into v4.19?

- This series is needed to complete the initial cache pseudo-locking
enabling that is first introduced in v4.19. Without this series users
are able to create pseudo-locked regions but unable to accurately
measure their success.
- This is not adding a new feature of pseudo-locked region measurement
but fixing the existing measurement code that was disabled after found
to be poorly integrated.
- This series consists out of 6 patches. Patches 2 to 6 are either new
code in support of this fix or fixes to code that does not exist in
kernels before v4.19.
- Patch 1 is a fix to existing code. It is small, was suggested by, and
does have support from maintainer. Even so, if it is felt that this is
too risky then patches 2 to 6 could still be merged since it does not
actually depend on patch 1. In retrospect I should have submitted it
separately.

Your consideration would be greatly appreciated.

Thank you

Reinette

2018-09-27 20:40:00

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

On Fri, 21 Sep 2018, Reinette Chatre wrote:

> Dear Maintainers,

Sorry for replying late.

> On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> > On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> >> Reinette Chatre (6):
> >> perf/core: Add sanity check to deal with pinned event failure
> >> perf/x86: Add helper to obtain performance counter index
> >> x86/intel_rdt: Remove local register variables
> >> x86/intel_rdt: Create required perf event attributes
> >> x86/intel_rdt: Use perf infrastructure for measurements
> >> x86/intel_rdt: Re-enable pseudo-lock measurements
> >>
> >> Documentation/x86/intel_rdt_ui.txt | 22 +-
> >> arch/x86/events/core.c | 21 ++
> >> arch/x86/include/asm/perf_event.h | 1 +
> >> arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
> >> kernel/events/core.c | 6 +
> >> 5 files changed, 261 insertions(+), 161 deletions(-)
> >
> > Yeah, these look good, thanks!
> >
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
> >
>
> Could you please consider this series for inclusion into v4.19?

So in principle I'm having no objections as this really is mostly a RDT
only issue.

Peter, any objections against the Perf part of it, aside the core patch
which is an obvious fix?

Thanks,

tglx

2018-09-28 07:00:48

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

On Thu, Sep 27, 2018 at 10:39:01PM +0200, Thomas Gleixner wrote:
> On Fri, 21 Sep 2018, Reinette Chatre wrote:
>
> > Dear Maintainers,
>
> Sorry for replying late.
>
> > On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> > > On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> > >> Reinette Chatre (6):
> > >> perf/core: Add sanity check to deal with pinned event failure
> > >> perf/x86: Add helper to obtain performance counter index
> > >> x86/intel_rdt: Remove local register variables
> > >> x86/intel_rdt: Create required perf event attributes
> > >> x86/intel_rdt: Use perf infrastructure for measurements
> > >> x86/intel_rdt: Re-enable pseudo-lock measurements
> > >>
> > >> Documentation/x86/intel_rdt_ui.txt | 22 +-
> > >> arch/x86/events/core.c | 21 ++
> > >> arch/x86/include/asm/perf_event.h | 1 +
> > >> arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
> > >> kernel/events/core.c | 6 +
> > >> 5 files changed, 261 insertions(+), 161 deletions(-)
> > >
> > > Yeah, these look good, thanks!
> > >
> > > Acked-by: Peter Zijlstra (Intel) <[email protected]>
> > >
> >
> > Could you please consider this series for inclusion into v4.19?
>
> So in principle I'm having no objections as this really is mostly a RDT
> only issue.
>
> Peter, any objections against the Perf part of it, aside the core patch
> which is an obvious fix?

Nope, look up a few lines to observe my Ack ;-)

2018-09-28 20:49:51

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:perf/urgent] perf/core: Add sanity check to deal with pinned event failure

Commit-ID: befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7
Gitweb: https://git.kernel.org/tip/befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7
Author: Reinette Chatre <[email protected]>
AuthorDate: Wed, 19 Sep 2018 10:29:06 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Sep 2018 22:44:53 +0200

perf/core: Add sanity check to deal with pinned event failure

It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and have a credible error.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com

---
kernel/events/core.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c80549bf82c6..dcb093e7b377 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3935,6 +3935,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
goto out;
}

+ /* If this is a pinned event it must be running on this CPU */
+ if (event->attr.pinned && event->oncpu != smp_processor_id()) {
+ ret = -EBUSY;
+ goto out;
+ }
+
/*
* If the event is currently on this CPU, its either a per-task event,
* or local to this CPU. Furthermore it means its ACTIVE (otherwise

2018-09-29 07:02:43

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:x86/cache] perf/x86: Add helper to obtain performance counter index

Commit-ID: 1182a49529edde899be4b4f0e1ab76e626976eb6
Gitweb: https://git.kernel.org/tip/1182a49529edde899be4b4f0e1ab76e626976eb6
Author: Reinette Chatre <[email protected]>
AuthorDate: Wed, 19 Sep 2018 10:29:07 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Sep 2018 22:48:26 +0200

perf/x86: Add helper to obtain performance counter index

perf_event_read_local() is the safest way to obtain measurements
associated with performance events. In some cases the overhead
introduced by perf_event_read_local() affects the measurements and the
use of rdpmcl() is needed. rdpmcl() requires the index
of the performance counter used so a helper is introduced to determine
the index used by a provided performance event.

The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check the use of this counter
needs to be done with care to ensure it is queried and used within the
same disabled interrupts section.

This change introduces a new checkpatch warning:
CHECK: extern prototypes should be avoided in .h files
+extern int x86_perf_rdpmc_index(struct perf_event *event);

This warning was discussed and designated as a false positive in
http://lkml.kernel.org/r/[email protected]

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/b277ffa78a51254f5414f7b1bc1923826874566e.1537377064.git.reinette.chatre@intel.com

---
arch/x86/events/core.c | 21 +++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 1 +
2 files changed, 22 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dfb2f7c0d019..3550d800b030 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,6 +1033,27 @@ static inline void x86_assign_hw_event(struct perf_event *event,
}
}

+/**
+ * x86_perf_rdpmc_index - Return PMC counter used for event
+ * @event: the perf_event to which the PMC counter was assigned
+ *
+ * The counter assigned to this performance event may change if interrupts
+ * are enabled. This counter should thus never be used while interrupts are
+ * enabled. Before this function is used to obtain the assigned counter the
+ * event should be checked for validity using, for example,
+ * perf_event_read_local(), within the same interrupt disabled section in
+ * which this counter is planned to be used.
+ *
+ * Return: The index of the performance monitoring counter assigned to
+ * @perf_event.
+ */
+int x86_perf_rdpmc_index(struct perf_event *event)
+{
+ lockdep_assert_irqs_disabled();
+
+ return event->hw.event_base_rdpmc;
+}
+
static inline int match_prev_assignment(struct hw_perf_event *hwc,
struct cpu_hw_events *cpuc,
int i)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 12f54082f4c8..b2cf84c35a6d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -270,6 +270,7 @@ struct perf_guest_switch_msr {
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
+extern int x86_perf_rdpmc_index(struct perf_event *event);
#else
static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
{

2018-09-29 07:03:46

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:x86/cache] x86/intel_rdt: Remove local register variables

Commit-ID: b5e4274ef7f00a03ce8d728701409a6e2c99146b
Gitweb: https://git.kernel.org/tip/b5e4274ef7f00a03ce8d728701409a6e2c99146b
Author: Reinette Chatre <[email protected]>
AuthorDate: Wed, 19 Sep 2018 10:29:08 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Sep 2018 22:48:26 +0200

x86/intel_rdt: Remove local register variables

Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.

While local register variables do accomplish the goal they do require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.

Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.

Make use of local variables in the critical code but do so with
READ_ONCE() to prevent the compiler from merging or refetching reads.
Ensure these variables are initialized before the measurement starts,
and ensure it is only the local variables that are accessed during
the measurement.

With the removal of the local register variables and using READ_ONCE()
there is no longer a motivation for using a direct wrmsr call (that
avoids the additional tracing code that may clobber the local register
variables).

Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/f430f57347414e0691765d92b144758ab93d8407.1537377064.git.reinette.chatre@intel.com

---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 53 +++++------------------------
1 file changed, 9 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 40f3903ae5d9..8ad83eb3fc89 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -886,31 +886,14 @@ static int measure_cycles_lat_fn(void *_plr)
struct pseudo_lock_region *plr = _plr;
unsigned long i;
u64 start, end;
-#ifdef CONFIG_KASAN
- /*
- * The registers used for local register variables are also used
- * when KASAN is active. When KASAN is active we use a regular
- * variable to ensure we always use a valid pointer to access memory.
- * The cost is that accessing this pointer, which could be in
- * cache, will be included in the measurement of memory read latency.
- */
void *mem_r;
-#else
-#ifdef CONFIG_X86_64
- register void *mem_r asm("rbx");
-#else
- register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */

local_irq_disable();
/*
- * The wrmsr call may be reordered with the assignment below it.
- * Call wrmsr as directly as possible to avoid tracing clobbering
- * local register variable used for memory pointer.
+ * Disable hardware prefetchers.
*/
- __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
- mem_r = plr->kmem;
+ wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+ mem_r = READ_ONCE(plr->kmem);
/*
* Dummy execute of the time measurement to load the needed
* instructions into the L1 instruction cache.
@@ -939,26 +922,10 @@ static int measure_cycles_perf_fn(void *_plr)
struct pseudo_lock_region *plr = _plr;
unsigned long long l2_hits, l2_miss;
u64 l2_hit_bits, l2_miss_bits;
- unsigned long i;
-#ifdef CONFIG_KASAN
- /*
- * The registers used for local register variables are also used
- * when KASAN is active. When KASAN is active we use regular variables
- * at the cost of including cache access latency to these variables
- * in the measurements.
- */
unsigned int line_size;
unsigned int size;
+ unsigned long i;
void *mem_r;
-#else
- register unsigned int line_size asm("esi");
- register unsigned int size asm("edi");
-#ifdef CONFIG_X86_64
- register void *mem_r asm("rbx");
-#else
- register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */

/*
* Non-architectural event for the Goldmont Microarchitecture
@@ -1011,11 +978,9 @@ static int measure_cycles_perf_fn(void *_plr)

local_irq_disable();
/*
- * Call wrmsr direcly to avoid the local register variables from
- * being overwritten due to reordering of their assignment with
- * the wrmsr calls.
+ * Disable hardware prefetchers.
*/
- __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+ wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
/* Disable events and reset counters */
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
@@ -1028,6 +993,9 @@ static int measure_cycles_perf_fn(void *_plr)
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
}
/* Set and enable the L2 counters */
+ mem_r = READ_ONCE(plr->kmem);
+ size = READ_ONCE(plr->size);
+ line_size = READ_ONCE(plr->line_size);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
if (l3_hit_bits > 0) {
@@ -1036,9 +1004,6 @@ static int measure_cycles_perf_fn(void *_plr)
pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
l3_miss_bits);
}
- mem_r = plr->kmem;
- size = plr->size;
- line_size = plr->line_size;
for (i = 0; i < size; i += line_size) {
asm volatile("mov (%0,%1,1), %%eax\n\t"
:

2018-09-29 07:04:31

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:x86/cache] x86/intel_rdt: Create required perf event attributes

Commit-ID: 0a701c9dd5351cbaef1677a0c8d37950e158cd55
Gitweb: https://git.kernel.org/tip/0a701c9dd5351cbaef1677a0c8d37950e158cd55
Author: Reinette Chatre <[email protected]>
AuthorDate: Wed, 19 Sep 2018 10:29:09 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Sep 2018 22:48:27 +0200

x86/intel_rdt: Create required perf event attributes

A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.

In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, what specifies what needs to be measured, is what
will be different between the events used. This configuration needs to
be done with X86_CONFIG that cannot be used as part of the designated
initializers used here, this will be introduced later.

Although they do look identical at this time the attribute structures
needs to be maintained separately since a perf_event will maintain a
pointer to its unique attributes.

In support of patch testing the new structs are given the unused attribute
until their use in later patches.

Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/1822f6164e221a497648d108913d056ab675d5d0.1537377064.git.reinette.chatre@intel.com

---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 8ad83eb3fc89..33d7968f152a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -17,6 +17,7 @@
#include <linux/debugfs.h>
#include <linux/kthread.h>
#include <linux/mman.h>
+#include <linux/perf_event.h>
#include <linux/pm_qos.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
@@ -915,6 +916,31 @@ static int measure_cycles_lat_fn(void *_plr)
return 0;
}

+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
+static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
static int measure_cycles_perf_fn(void *_plr)
{
unsigned long long l3_hits = 0, l3_miss = 0;

2018-09-29 07:05:13

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:x86/cache] x86/intel_rdt: Use perf infrastructure for measurements

Commit-ID: dd45407c0b2445bc2aa0ecfea744d5af3a146577
Gitweb: https://git.kernel.org/tip/dd45407c0b2445bc2aa0ecfea744d5af3a146577
Author: Reinette Chatre <[email protected]>
AuthorDate: Thu, 20 Sep 2018 12:02:11 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Sep 2018 22:48:27 +0200

x86/intel_rdt: Use perf infrastructure for measurements

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents additional cache
misses incurred during the extra testing code used to decide if a
L2 and/or L3 measurement should be made.

Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/fc24e728b446404f42c78573c506e98cd0599873.1537468643.git.reinette.chatre@intel.com

---
Documentation/x86/intel_rdt_ui.txt | 22 +-
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 304 +++++++++++++++++-----------
2 files changed, 203 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
2) Cache hit and miss measurements using model specific precision counters if
available. Depending on the levels of cache on the system the pseudo_lock_l2
and pseudo_lock_l3 tracepoints are available.
- WARNING: triggering this measurement uses from two (for just L2
- measurements) to four (for L2 and L3 measurements) precision counters on
- the system, if any other measurements are in progress the counters and
- their corresponding event registers will be clobbered.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 - writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..d68836139cf9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
#include <asm/intel_rdt_sched.h>
#include <asm/perf_event.h>

+#include "../../events/perf_event.h" /* For X86_CONFIG() */
#include "intel_rdt.h"

#define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
return 0;
}

-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
- __wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
/**
* pseudo_lock_minor_get - Obtain available minor number
* @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
* The actual configuration of the event is set right before use in order
* to use the X86_CONFIG macro.
*/
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
.exclude_user = 1,
};

-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
.type = PERF_TYPE_RAW,
.size = sizeof(struct perf_event_attr),
.pinned = 1,
@@ -941,139 +932,215 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
.exclude_user = 1,
};

-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+ u64 miss_before, hits_before;
+ u64 miss_after, hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+ struct perf_event_attr *hit_attr,
+ struct pseudo_lock_region *plr,
+ struct residency_counts *counts)
{
- unsigned long long l3_hits = 0, l3_miss = 0;
- u64 l3_hit_bits = 0, l3_miss_bits = 0;
- struct pseudo_lock_region *plr = _plr;
- unsigned long long l2_hits, l2_miss;
- u64 l2_hit_bits, l2_miss_bits;
+ u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+ struct perf_event *miss_event, *hit_event;
+ int hit_pmcnum, miss_pmcnum;
unsigned int line_size;
unsigned int size;
unsigned long i;
void *mem_r;
+ u64 tmp;

- /*
- * Non-architectural event for the Goldmont Microarchitecture
- * from Intel x86 Architecture Software Developer Manual (SDM):
- * MEM_LOAD_UOPS_RETIRED D1H (event number)
- * Umask values:
- * L1_HIT 01H
- * L2_HIT 02H
- * L1_MISS 08H
- * L2_MISS 10H
- *
- * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
- * has two "no fix" errata associated with it: BDM35 and BDM100. On
- * this platform we use the following events instead:
- * L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
- * REFERENCES FFH
- * MISS 3FH
- * LONGEST_LAT_CACHE 2EH (Documented in SDM)
- * REFERENCE 4FH
- * MISS 41H
- */
+ miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(miss_event))
+ goto out;

+ hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(hit_event))
+ goto out_miss;
+
+ local_irq_disable();
/*
- * Start by setting flags for IA32_PERFEVTSELx:
- * OS (Operating system mode) 0x2
- * INT (APIC interrupt enable) 0x10
- * EN (Enable counter) 0x40
- *
- * Then add the Umask value and event number to select performance
- * event.
+ * Check any possible error state of events used by performing
+ * one local read.
*/
-
- switch (boot_cpu_data.x86_model) {
- case INTEL_FAM6_ATOM_GOLDMONT:
- case INTEL_FAM6_ATOM_GEMINI_LAKE:
- l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
- l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
- break;
- case INTEL_FAM6_BROADWELL_X:
- /* On BDW the l2_hit_bits count references, not hits */
- l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
- l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
- /* On BDW the l3_hit_bits count references, not hits */
- l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
- l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
- break;
- default:
- goto out;
+ if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
+ }
+ if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
}

- local_irq_disable();
/*
* Disable hardware prefetchers.
*/
wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
- /* Disable events and reset counters */
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
- }
- /* Set and enable the L2 counters */
+
+ /* Initialize rest of local variables */
+ /*
+ * Performance event has been validated right before this with
+ * interrupts disabled - it is thus safe to read the counter index.
+ */
+ miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+ hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+ line_size = READ_ONCE(plr->line_size);
mem_r = READ_ONCE(plr->kmem);
size = READ_ONCE(plr->size);
- line_size = READ_ONCE(plr->line_size);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits);
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits);
- }
+
+ /*
+ * Read counter variables twice - first to load the instructions
+ * used in L1 cache, second to capture accurate value that does not
+ * include cache misses incurred because of instruction loads.
+ */
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * From SDM: Performing back-to-back fast reads are not guaranteed
+ * to be monotonic.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ rdpmcl(hit_pmcnum, hits_before);
+ rdpmcl(miss_pmcnum, miss_before);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
for (i = 0; i < size; i += line_size) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
asm volatile("mov (%0,%1,1), %%eax\n\t"
:
: "r" (mem_r), "r" (i)
: "%eax", "memory");
}
/*
- * Call wrmsr directly (no tracing) to not influence
- * the cache access counters as they are disabled.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
*/
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
- l2_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
- l2_miss_bits & ~(0x40ULL << 16));
- if (l3_hit_bits > 0) {
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
- l3_hit_bits & ~(0x40ULL << 16));
- pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
- l3_miss_bits & ~(0x40ULL << 16));
- }
- l2_hits = native_read_pmc(0);
- l2_miss = native_read_pmc(1);
- if (l3_hit_bits > 0) {
- l3_hits = native_read_pmc(2);
- l3_miss = native_read_pmc(3);
- }
+ rmb();
+ rdpmcl(hit_pmcnum, hits_after);
+ rdpmcl(miss_pmcnum, miss_after);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ /* Re-enable hardware prefetchers */
wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
local_irq_enable();
+out_hit:
+ perf_event_release_kernel(hit_event);
+out_miss:
+ perf_event_release_kernel(miss_event);
+out:
+ /*
+ * All counts will be zero on failure.
+ */
+ counts->miss_before = miss_before;
+ counts->hits_before = hits_before;
+ counts->miss_after = miss_after;
+ counts->hits_after = hits_after;
+ return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
+ /*
+ * Non-architectural event for the Goldmont Microarchitecture
+ * from Intel x86 Architecture Software Developer Manual (SDM):
+ * MEM_LOAD_UOPS_RETIRED D1H (event number)
+ * Umask values:
+ * L2_HIT 02H
+ * L2_MISS 10H
+ */
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ATOM_GOLDMONT:
+ case INTEL_FAM6_ATOM_GEMINI_LAKE:
+ perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x10);
+ perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x2);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+ trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+ counts.miss_after - counts.miss_before);
+out:
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
/*
- * On BDW we count references and misses, need to adjust. Sometimes
- * the "hits" counter is a bit more than the references, for
- * example, x references but x + 1 hits. To not report invalid
- * hit values in this case we treat that as misses eaqual to
- * references.
+ * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+ * has two "no fix" errata associated with it: BDM35 and BDM100. On
+ * this platform the following events are used instead:
+ * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+ * REFERENCE 4FH
+ * MISS 41H
*/
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
- trace_pseudo_lock_l2(l2_hits, l2_miss);
- if (l3_hit_bits > 0) {
- if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
- l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
- trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_BROADWELL_X:
+ /* On BDW the hit event counts references, not hits */
+ perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x4f);
+ perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x41);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+ * If a failure prevented the measurements from succeeding
+ * tracepoints will still be written and all counts will be zero.
+ */
+
+ counts.miss_after -= counts.miss_before;
+ if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+ /*
+ * On BDW references and misses are counted, need to adjust.
+ * Sometimes the "hits" counter is a bit more than the
+ * references, for example, x references but x + 1 hits.
+ * To not report invalid hit values in this case we treat
+ * that as misses equal to references.
+ */
+ /* First compute the number of cache references measured */
+ counts.hits_after -= counts.hits_before;
+ /* Next convert references to cache hits */
+ counts.hits_after -= min(counts.miss_after, counts.hits_after);
+ } else {
+ counts.hits_after -= counts.hits_before;
}

+ trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
out:
plr->thread_done = 1;
wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1179,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
goto out;
}

+ plr->cpu = cpu;
+
if (sel == 1)
thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);
else if (sel == 2)
- thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+ thread = kthread_create_on_node(measure_l2_residency, plr,
+ cpu_to_node(cpu),
+ "pseudo_lock_measure/%u",
+ cpu);
+ else if (sel == 3)
+ thread = kthread_create_on_node(measure_l3_residency, plr,
cpu_to_node(cpu),
"pseudo_lock_measure/%u",
cpu);

2018-09-29 14:24:30

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

On 9/27/2018 11:58 PM, Peter Zijlstra wrote:
> On Thu, Sep 27, 2018 at 10:39:01PM +0200, Thomas Gleixner wrote:
>> On Fri, 21 Sep 2018, Reinette Chatre wrote:
>>
>>> Dear Maintainers,
>>
>> Sorry for replying late.
>>
>>> On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
>>>> On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
>>>>> Reinette Chatre (6):
>>>>> perf/core: Add sanity check to deal with pinned event failure
>>>>> perf/x86: Add helper to obtain performance counter index
>>>>> x86/intel_rdt: Remove local register variables
>>>>> x86/intel_rdt: Create required perf event attributes
>>>>> x86/intel_rdt: Use perf infrastructure for measurements
>>>>> x86/intel_rdt: Re-enable pseudo-lock measurements
>>>>>
>>>>> Documentation/x86/intel_rdt_ui.txt | 22 +-
>>>>> arch/x86/events/core.c | 21 ++
>>>>> arch/x86/include/asm/perf_event.h | 1 +
>>>>> arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
>>>>> kernel/events/core.c | 6 +
>>>>> 5 files changed, 261 insertions(+), 161 deletions(-)
>>>>
>>>> Yeah, these look good, thanks!
>>>>
>>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>>>
>>>
>>> Could you please consider this series for inclusion into v4.19?
>>
>> So in principle I'm having no objections as this really is mostly a RDT
>> only issue.
>>
>> Peter, any objections against the Perf part of it, aside the core patch
>> which is an obvious fix?
>
> Nope, look up a few lines to observe my Ack ;-)
>

I interpreted Thomas and Peter's responses to mean that there are no
objections for this to be included in v4.19 as a fix.

If I understand the tip branches correctly the core patch seems to be
headed to v4.19 while the rest (excluding the final patch
"x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.

Have you decided against including this into v4.19 or did I
misunderstand the responses and/or branches?

Thank you for helping me to sort this out

Reinette

2018-09-29 18:00:02

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

Reinette,

On Sat, 29 Sep 2018, Reinette Chatre wrote:
> I interpreted Thomas and Peter's responses to mean that there are no
> objections for this to be included in v4.19 as a fix.
>
> If I understand the tip branches correctly the core patch seems to be
> headed to v4.19 while the rest (excluding the final patch
> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
>
> Have you decided against including this into v4.19 or did I
> misunderstand the responses and/or branches?

I did not decide anything yet. It's not going into -rc6 as it's not yet
through next and the other standard testing.

I'm also looking at the other set of RDT fixes, which obviously want to go
as well. So not sure how to deal with all of that.

Thanks,

tglx

2018-10-03 19:41:43

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

Hi Thomas,

On 9/29/2018 10:56 AM, Thomas Gleixner wrote:
> Reinette,
>
> On Sat, 29 Sep 2018, Reinette Chatre wrote:
>> I interpreted Thomas and Peter's responses to mean that there are no
>> objections for this to be included in v4.19 as a fix.
>>
>> If I understand the tip branches correctly the core patch seems to be
>> headed to v4.19 while the rest (excluding the final patch
>> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
>>
>> Have you decided against including this into v4.19 or did I
>> misunderstand the responses and/or branches?
>
> I did not decide anything yet. It's not going into -rc6 as it's not yet
> through next and the other standard testing.
>
> I'm also looking at the other set of RDT fixes, which obviously want to go
> as well. So not sure how to deal with all of that.
>

When you do deal with this series could you please also include the
final patch ("x86/intel_rdt: Re-enable pseudo-lock measurements")? I
noticed that patches 1/6 to 5/6 have been merged into x86/cache in
tip.git and then some other work on top of it. It is not clear to me why
6/6 was omitted and this fix does require it.

Thank you very much

Reinette

2018-10-03 19:45:53

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf

On Wed, 3 Oct 2018, Reinette Chatre wrote:
> On 9/29/2018 10:56 AM, Thomas Gleixner wrote:
> > Reinette,
> >
> > On Sat, 29 Sep 2018, Reinette Chatre wrote:
> >> I interpreted Thomas and Peter's responses to mean that there are no
> >> objections for this to be included in v4.19 as a fix.
> >>
> >> If I understand the tip branches correctly the core patch seems to be
> >> headed to v4.19 while the rest (excluding the final patch
> >> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
> >>
> >> Have you decided against including this into v4.19 or did I
> >> misunderstand the responses and/or branches?
> >
> > I did not decide anything yet. It's not going into -rc6 as it's not yet
> > through next and the other standard testing.
> >
> > I'm also looking at the other set of RDT fixes, which obviously want to go
> > as well. So not sure how to deal with all of that.
> >
>
> When you do deal with this series could you please also include the
> final patch ("x86/intel_rdt: Re-enable pseudo-lock measurements")? I
> noticed that patches 1/6 to 5/6 have been merged into x86/cache in
> tip.git and then some other work on top of it. It is not clear to me why
> 6/6 was omitted and this fix does require it.

Hmm. That was not intentional. /me goes to rumage around.

Thanks,

tglx

2018-10-03 19:58:26

by tip-bot for Vasyl Gomonovych

[permalink] [raw]

Subject: [tip:x86/cache] x86/intel_rdt: Re-enable pseudo-lock measurements

Commit-ID: 53ed74af05517cf5fe83d87c4bb7ab998adfd631
Gitweb: https://git.kernel.org/tip/53ed74af05517cf5fe83d87c4bb7ab998adfd631
Author: Reinette Chatre <[email protected]>
AuthorDate: Wed, 19 Sep 2018 10:29:11 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 3 Oct 2018 21:53:35 +0200

x86/intel_rdt: Re-enable pseudo-lock measurements

Commit 4a7a54a55e72 ("x86/intel_rdt: Disable PMU access") disabled
measurements of pseudo-locked regions because of incorrect usage
of the performance monitoring hardware.

Cache pseudo-locking measurements are now done correctly with the
in-kernel perf API and its use can be re-enabled at this time.

The adjustment to the in-kernel perf API also separated the L2 and L3
measurements that can be triggered separately from user space. The
re-enabling of the measurements is thus not a simple revert of the
original disable in order to accommodate the additional parameter
possible.

Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/bfb9fc31692e0c62d9ca39062e55eceb6a0635b5.1537377064.git.reinette.chatre@intel.com

---
arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index d68836139cf9..30e6c9f5a0ad 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1236,7 +1236,7 @@ static ssize_t pseudo_lock_measure_trigger(struct file *file,
buf[buf_size] = '\0';
ret = kstrtoint(buf, 10, &sel);
if (ret == 0) {
- if (sel != 1)
+ if (sel != 1 && sel != 2 && sel != 3)
return -EINVAL;
ret = debugfs_file_get(file->f_path.dentry);
if (ret)