2024-01-03 03:09:44

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 00/11] pmu test bugs fix and improvements

When running the pmu test on Sapphire Rapids, we found that the pmu test
sometimes reports the following failures.

1. FAIL: Intel: all counters
2. FAIL: Intel: core cycles-0
3. FAIL: Intel: llc misses-4

Further investigation shows these failures are all false alarms rather
than real vPMU issues.

Failure 1 is caused by a bug in check_counters_many(), which defines a
cnt[] array with length 10. On Sapphire Rapids KVM supports 8 GP
counters and 3 fixed counters, so the total counter number (11) on
Sapphire Rapids exceeds the current cnt[] length of 10. This causes an
out-of-bounds memory access and leads to the "all counters" false
alarm. Patches 02~03 fix this issue.
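For illustration, the overflow condition boils down to the sanity check
below, which the old code lacked (a sketch, not the patch code,
assuming the pmu.nr_gp_counters/pmu.nr_fixed_counters fields from
lib/x86/pmu.h and the ARRAY_SIZE() helper already used by
kvm-unit-tests):

	pmu_counter_t cnt[10];

	/*
	 * check_counters_many() fills one cnt[] entry per GP counter and
	 * one per fixed counter; 8 + 3 = 11 entries do not fit into 10.
	 */
	assert(pmu.nr_gp_counters + pmu.nr_fixed_counters <= ARRAY_SIZE(cnt));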

Failure 2 is caused by pipeline and cache warm-up latency. Currently
"core cycles" is the first event to be measured. When the measured
loop() program is executed for the first time, the cache hierarchy and
pipeline need to warm up. All this warm-up work consumes so many cycles
that the count exceeds the predefined upper boundary and causes the
failure. Patch 04 fixes this issue.

Failure 3 is caused by a zero llc misses count. It is possible and
reasonable that no LLC miss happens for such a simple loop() asm blob,
especially with the larger and larger LLC sizes on new processors.
Patch 09 fixes this issue by introducing a clflush instruction to force
an LLC miss.
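The idea is roughly the following (a minimal sketch, not the exact
patch code; see patch 09 for the real asm blob):

	/*
	 * Evict the buffer line from all cache levels, serialize, and then
	 * load it again so that at least one LLC miss is guaranteed.
	 */
	asm volatile("clflush (%0)\n\t"
		     "mfence\n\t"
		     "mov (%0), %%rax\n\t"
		     : : "r"(buf) : "rax", "memory");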

Besides the above bug fixes, this patch series also includes several
optimizations.

One important optimization (patches 07~08) is to move GLOBAL_CTRL
enabling/disabling into the loop asm blob, so a precise count for the
instructions and branches events can be measured and the verification
can be done against that precise count instead of a rough count range.
This improves the verification accuracy.
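Conceptually, the measured region then looks like the sketch below
(simplified from the real PRECISE_LOOP_ASM in patch 07; the register
shuffling of the actual blob is omitted and the workload is reduced to
a bare counting loop):

	u32 global_ctl = pmu.msr_global_ctl;
	u64 cntrs = BIT_ULL(0);           /* e.g. enable GP counter 0 only */
	u32 lo = cntrs, hi = cntrs >> 32;
	unsigned long iters = N;

	asm volatile("wrmsr\n\t"          /* GLOBAL_CTRL = cntrs: counting starts */
		     "1: dec %2\n\t"
		     "jnz 1b\n\t"         /* measured workload, nothing else */
		     "xor %%eax, %%eax\n\t"
		     "xor %%edx, %%edx\n\t"
		     "wrmsr\n\t"          /* GLOBAL_CTRL = 0: counting stops */
		     : "+a"(lo), "+d"(hi), "+b"(iters)
		     : "c"(global_ctl));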

Another important optimization (patches 10~11) is to leverage the IBPB
command to force a branch miss, so the lower boundary of the branch
misses event can be set to 1 instead of the ambiguous 0.
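The IBPB trick works roughly as follows (a sketch only; the real asm
blob is in patch 10). Writing bit 0 of IA32_PRED_CMD (MSR 0x49) issues
the barrier, so the indirect jump right after it cannot be predicted:

	/*
	 * Flush the branch predictor right before an indirect jump so the
	 * jump is mispredicted and at least one branch miss is counted.
	 */
	wrmsr(0x49 /* IA32_PRED_CMD */, 1 /* PRED_CMD_IBPB */);
	asm volatile("lea 1f(%%rip), %%rax\n\t"
		     "jmp *%%rax\n\t"
		     "1:\n\t"
		     : : : "rax");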

All these changes have been tested on an Intel Sapphire Rapids server
platform and the pmu test passes. Since I have no AMD platforms at
hand, these changes are not verified on AMD platforms yet. If someone
can help to verify these changes on AMD platforms, it is welcome and
appreciated.

Changes:
v2 -> v3:
fix "core cycles" failure,
introduce precise verification for instructions/branches,
leverage IBPB command to optimize branch misses verification,
drop v2 introduced slots event verification
v1 -> v2:
introduce clflush to optimize llc misses verification
introduce rdrand to optimize branch misses verification

History:
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Dapeng Mi (10):
x86: pmu: Enlarge cnt[] length to 64 in check_counters_many()
x86: pmu: Add asserts to warn inconsistent fixed events and counters
x86: pmu: Switch instructions and core cycles events sequence
x86: pmu: Refine fixed_events[] names
x86: pmu: Remove blank line and redundant space
x86: pmu: Enable and disable PMCs in loop() asm blob
x86: pmu: Improve instruction and branches events verification
x86: pmu: Improve LLC misses event verification
x86: pmu: Add IBPB indirect jump asm blob
x86: pmu: Improve branch misses event verification

Xiong Zhang (1):
x86: pmu: Remove duplicate code in pmu_init()

lib/x86/pmu.c | 5 --
x86/pmu.c | 201 ++++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 171 insertions(+), 35 deletions(-)

--
2.34.1



2024-01-03 03:10:07

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 01/11] x86: pmu: Remove duplicate code in pmu_init()

From: Xiong Zhang <[email protected]>

There is duplicated code in the pmu_init() helper; remove the
duplicate code.

Signed-off-by: Xiong Zhang <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
lib/x86/pmu.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/lib/x86/pmu.c b/lib/x86/pmu.c
index 0f2afd650bc9..d06e94553024 100644
--- a/lib/x86/pmu.c
+++ b/lib/x86/pmu.c
@@ -16,11 +16,6 @@ void pmu_init(void)
pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
}

- if (pmu.version > 1) {
- pmu.nr_fixed_counters = cpuid_10.d & 0x1f;
- pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
- }
-
pmu.nr_gp_counters = (cpuid_10.a >> 8) & 0xff;
pmu.gp_counter_width = (cpuid_10.a >> 16) & 0xff;
pmu.gp_counter_mask_length = (cpuid_10.a >> 24) & 0xff;
--
2.34.1


2024-01-03 03:10:43

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 02/11] x86: pmu: Enlarge cnt[] length to 64 in check_counters_many()

There are already 8 GP counters and 4 fixed counters on the latest
Intel processors, such as Sapphire Rapids. The original cnt[] array
length of 10 is definitely not enough to cover all supported PMU
counters on these new processors, even though KVM currently supports at
most 3 fixed counters. This would cause an out-of-bounds memory access
and may trigger a false alarm in the PMU counter validation.

More and more GP and fixed counters will probably be introduced in the
future, so directly extend the cnt[] array length to 64 once and for
all.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 0def28695c70..a13b8a8398c6 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -254,7 +254,7 @@ static void check_fixed_counters(void)

static void check_counters_many(void)
{
- pmu_counter_t cnt[10];
+ pmu_counter_t cnt[64];
int i, n;

for (i = 0, n = 0; n < pmu.nr_gp_counters; i++) {
--
2.34.1


2024-01-03 03:10:57

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 03/11] x86: pmu: Add asserts to warn inconsistent fixed events and counters

The current PMU code doesn't check whether the PMU fixed counter
number is larger than the number of pre-defined fixed events. If it is,
out-of-range memory accesses would occur.

So add asserts to warn about this invalid case.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index a13b8a8398c6..a42fff8d8b36 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -111,8 +111,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
for (i = 0; i < gp_events_size; i++)
if (gp_events[i].unit_sel == (cnt->config & 0xffff))
return &gp_events[i];
- } else
- return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
+ } else {
+ int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
+
+ assert(idx < ARRAY_SIZE(fixed_events));
+ return &fixed_events[idx];
+ }

return (void*)0;
}
@@ -245,6 +249,7 @@ static void check_fixed_counters(void)
};
int i;

+ assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
for (i = 0; i < pmu.nr_fixed_counters; i++) {
cnt.ctr = fixed_events[i].unit_sel;
measure_one(&cnt);
@@ -266,6 +271,7 @@ static void check_counters_many(void)
gp_events[i % gp_events_size].unit_sel;
n++;
}
+ assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
for (i = 0; i < pmu.nr_fixed_counters; i++) {
cnt[n].ctr = fixed_events[i].unit_sel;
cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
--
2.34.1


2024-01-03 03:11:19

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 04/11] x86: pmu: Switch instructions and core cycles events sequence

When running the pmu test on SPR, sometimes the following failure is
reported.

PMU version: 2
GP counters: 8
GP counter width: 48
Mask length: 8
Fixed counters: 3
Fixed counter width: 48
1000000 <= 55109398 <= 50000000
FAIL: Intel: core cycles-0
1000000 <= 18279571 <= 50000000
PASS: Intel: core cycles-1
1000000 <= 12238092 <= 50000000
PASS: Intel: core cycles-2
1000000 <= 7981727 <= 50000000
PASS: Intel: core cycles-3
1000000 <= 6984711 <= 50000000
PASS: Intel: core cycles-4
1000000 <= 6773673 <= 50000000
PASS: Intel: core cycles-5
1000000 <= 6697842 <= 50000000
PASS: Intel: core cycles-6
1000000 <= 6747947 <= 50000000
PASS: Intel: core cycles-7

The count of the "core cycles" on first counter would exceed the upper
boundary and leads to a failure, and then the "core cycles" count would
drop gradually and reach a stable state.

That looks reasonable. The "core cycles" event is defined as the 1st
event in xxx_gp_events[] array and it is always verified at first.
when the program loop() is executed at the first time it needs to warm
up the pipeline and cache, such as it has to wait for cache is filled.
All these warm-up work leads to a quite large core cycles count which
may exceeds the verification range.

The event "instructions" instead of "core cycles" is a good choice as
the warm-up event since it would always return a fixed count. Thus
switch instructions and core cycles events sequence in the
xxx_gp_events[] array.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index a42fff8d8b36..67ebfbe55b49 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -31,16 +31,16 @@ struct pmu_event {
int min;
int max;
} intel_gp_events[] = {
- {"core cycles", 0x003c, 1*N, 50*N},
{"instructions", 0x00c0, 10*N, 10.2*N},
+ {"core cycles", 0x003c, 1*N, 50*N},
{"ref cycles", 0x013c, 1*N, 30*N},
{"llc references", 0x4f2e, 1, 2*N},
{"llc misses", 0x412e, 1, 1*N},
{"branches", 0x00c4, 1*N, 1.1*N},
{"branch misses", 0x00c5, 0, 0.1*N},
}, amd_gp_events[] = {
- {"core cycles", 0x0076, 1*N, 50*N},
{"instructions", 0x00c0, 10*N, 10.2*N},
+ {"core cycles", 0x0076, 1*N, 50*N},
{"branches", 0x00c2, 1*N, 1.1*N},
{"branch misses", 0x00c3, 0, 0.1*N},
}, fixed_events[] = {
@@ -307,7 +307,7 @@ static void check_counter_overflow(void)
int i;
pmu_counter_t cnt = {
.ctr = MSR_GP_COUNTERx(0),
- .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
+ .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
};
overflow_preset = measure_for_overflow(&cnt);

@@ -365,11 +365,11 @@ static void check_gp_counter_cmask(void)
{
pmu_counter_t cnt = {
.ctr = MSR_GP_COUNTERx(0),
- .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
+ .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
};
cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
measure_one(&cnt);
- report(cnt.count < gp_events[1].min, "cmask");
+ report(cnt.count < gp_events[0].min, "cmask");
}

static void do_rdpmc_fast(void *ptr)
@@ -446,7 +446,7 @@ static void check_running_counter_wrmsr(void)
uint64_t count;
pmu_counter_t evt = {
.ctr = MSR_GP_COUNTERx(0),
- .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
+ .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
};

report_prefix_push("running counter wrmsr");
@@ -455,7 +455,7 @@ static void check_running_counter_wrmsr(void)
loop();
wrmsr(MSR_GP_COUNTERx(0), 0);
stop_event(&evt);
- report(evt.count < gp_events[1].min, "cntr");
+ report(evt.count < gp_events[0].min, "cntr");

/* clear status before overflow test */
if (this_cpu_has_perf_global_status())
@@ -493,7 +493,7 @@ static void check_emulated_instr(void)
pmu_counter_t instr_cnt = {
.ctr = MSR_GP_COUNTERx(1),
/* instructions */
- .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
+ .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
};
report_prefix_push("emulated instruction");

--
2.34.1


2024-01-03 03:11:43

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 05/11] x86: pmu: Refine fixed_events[] names

In the SDM the fixed counters are numbered from 0, but currently the
fixed_events[] names are numbered from 1. This causes confusion for
users. So change the fixed_events[] names to be numbered from 0 as well
and keep them identical with the SDM.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 67ebfbe55b49..a2c64a1ce95b 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -44,9 +44,9 @@ struct pmu_event {
{"branches", 0x00c2, 1*N, 1.1*N},
{"branch misses", 0x00c3, 0, 0.1*N},
}, fixed_events[] = {
- {"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
- {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
- {"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
+ {"fixed 0", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
+ {"fixed 1", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
+ {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
};

char *buf;
--
2.34.1


2024-01-03 03:12:07

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 06/11] x86: pmu: Remove blank line and redundant space

Code style changes: remove a blank line and a redundant space.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index a2c64a1ce95b..46bed66c5c9f 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -207,8 +207,7 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
static bool verify_event(uint64_t count, struct pmu_event *e)
{
// printf("%d <= %ld <= %d\n", e->min, count, e->max);
- return count >= e->min && count <= e->max;
-
+ return count >= e->min && count <= e->max;
}

static bool verify_counter(pmu_counter_t *cnt)
--
2.34.1


2024-01-03 03:12:34

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 07/11] x86: pmu: Enable and disable PMCs in loop() asm blob

Currently enabling PMCs, executing loop() and disabling PMCs are
divided into 3 separate functions, so other instructions could be
executed between enabling the PMCs and running loop(), or between
running loop() and disabling the PMCs. For example, if multiple
counters are enabled in the measure_many() function, the instructions
which enable the 2nd and subsequent counters are counted by the 1st
counter.

So the current implementation can only verify the correctness of the
count against a rough range rather than a precise count, even for the
instructions and branches events. Strictly speaking, this verification
is meaningless as the test could still pass even though KVM vPMU has
something wrong and reports an incorrect instructions or branches count
which falls in the rough range.

Thus, move the PMC enabling and disabling into the loop() asm blob and
ensure only the loop asm instructions are counted; then the
instructions and branches events can be verified with a precise count
instead of a rough range.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 83 +++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 69 insertions(+), 14 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 46bed66c5c9f..88b89ad889b9 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -18,6 +18,20 @@
#define EXPECTED_INSTR 17
#define EXPECTED_BRNCH 5

+// Instrustion number of LOOP_ASM code
+#define LOOP_INSTRNS 10
+#define LOOP_ASM \
+ "1: mov (%1), %2; add $64, %1;\n\t" \
+ "nop; nop; nop; nop; nop; nop; nop;\n\t" \
+ "loop 1b;\n\t"
+
+#define PRECISE_LOOP_ASM \
+ "wrmsr;\n\t" \
+ "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
+ LOOP_ASM \
+ "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
+ "wrmsr;\n\t"
+
typedef struct {
uint32_t ctr;
uint64_t config;
@@ -54,13 +68,43 @@ char *buf;
static struct pmu_event *gp_events;
static unsigned int gp_events_size;

-static inline void loop(void)
+
+static inline void __loop(void)
+{
+ unsigned long tmp, tmp2, tmp3;
+
+ asm volatile(LOOP_ASM
+ : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
+ : "0"(N), "1"(buf));
+}
+
+/*
+ * Enable and disable counters in a whole asm blob to ensure
+ * no other instructions are counted in the time slot between
+ * counters enabling and really LOOP_ASM code executing.
+ * Thus counters can verify instructions and branches events
+ * against precise counts instead of a rough valid count range.
+ */
+static inline void __precise_count_loop(u64 cntrs)
{
unsigned long tmp, tmp2, tmp3;
+ unsigned int global_ctl = pmu.msr_global_ctl;
+ u32 eax = cntrs & (BIT_ULL(32) - 1);
+ u32 edx = cntrs >> 32;

- asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
- : "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
+ asm volatile(PRECISE_LOOP_ASM
+ : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
+ : "a"(eax), "d"(edx), "c"(global_ctl),
+ "0"(N), "1"(buf)
+ : "edi");
+}

+static inline void loop(u64 cntrs)
+{
+ if (!this_cpu_has_perf_global_ctrl())
+ __loop();
+ else
+ __precise_count_loop(cntrs);
}

volatile uint64_t irq_received;
@@ -159,18 +203,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
}
- global_enable(evt);
apic_write(APIC_LVTPC, PMI_VECTOR);
}

static void start_event(pmu_counter_t *evt)
{
__start_event(evt, 0);
+ global_enable(evt);
}

-static void stop_event(pmu_counter_t *evt)
+static void __stop_event(pmu_counter_t *evt)
{
- global_disable(evt);
if (is_gp(evt)) {
wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
evt->config & ~EVNTSEL_EN);
@@ -182,14 +225,24 @@ static void stop_event(pmu_counter_t *evt)
evt->count = rdmsr(evt->ctr);
}

+static void stop_event(pmu_counter_t *evt)
+{
+ global_disable(evt);
+ __stop_event(evt);
+}
+
static noinline void measure_many(pmu_counter_t *evt, int count)
{
int i;
+ u64 cntrs = 0;
+
+ for (i = 0; i < count; i++) {
+ __start_event(&evt[i], 0);
+ cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
+ }
+ loop(cntrs);
for (i = 0; i < count; i++)
- start_event(&evt[i]);
- loop();
- for (i = 0; i < count; i++)
- stop_event(&evt[i]);
+ __stop_event(&evt[i]);
}

static void measure_one(pmu_counter_t *evt)
@@ -199,9 +252,11 @@ static void measure_one(pmu_counter_t *evt)

static noinline void __measure(pmu_counter_t *evt, uint64_t count)
{
+ u64 cntrs = BIT_ULL(event_to_global_idx(evt));
+
__start_event(evt, count);
- loop();
- stop_event(evt);
+ loop(cntrs);
+ __stop_event(evt);
}

static bool verify_event(uint64_t count, struct pmu_event *e)
@@ -451,7 +506,7 @@ static void check_running_counter_wrmsr(void)
report_prefix_push("running counter wrmsr");

start_event(&evt);
- loop();
+ __loop();
wrmsr(MSR_GP_COUNTERx(0), 0);
stop_event(&evt);
report(evt.count < gp_events[0].min, "cntr");
@@ -468,7 +523,7 @@ static void check_running_counter_wrmsr(void)

wrmsr(MSR_GP_COUNTERx(0), count);

- loop();
+ __loop();
stop_event(&evt);

if (this_cpu_has_perf_global_status()) {
--
2.34.1


2024-01-03 03:12:56

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 08/11] x86: pmu: Improve instruction and branches events verification

If HW supports the GLOBAL_CTRL MSR, enabling and disabling PMCs are
moved into __precise_count_loop(). Thus, the instructions and branches
events can be verified against a precise count instead of a rough
range.
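With the loop parameters already used by the test (N = 1000000
iterations, LOOP_INSTRNS = 10 instructions per iteration) plus the
2 + 4 extra instructions for the two GLOBAL_CTRL wrmsr sequences, the
expected exact counts work out to:

	PRECISE_LOOP_INSTRNS  = N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS
	                      = 1000000 * 10 + 6 = 10000006
	PRECISE_LOOP_BRANCHES = N = 1000000   /* one "loop 1b" branch per iteration */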

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/x86/pmu.c b/x86/pmu.c
index 88b89ad889b9..b764827c1c3d 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -25,6 +25,10 @@
"nop; nop; nop; nop; nop; nop; nop;\n\t" \
"loop 1b;\n\t"

+/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
+#define PRECISE_EXTRA_INSTRNS (2 + 4)
+#define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
+#define PRECISE_LOOP_BRANCHES (N)
#define PRECISE_LOOP_ASM \
"wrmsr;\n\t" \
"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
@@ -107,6 +111,24 @@ static inline void loop(u64 cntrs)
__precise_count_loop(cntrs);
}

+static void adjust_events_range(struct pmu_event *gp_events, int branch_idx)
+{
+ /*
+ * If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are
+ * moved in __precise_count_loop(). Thus, instructions and branches
+ * events can be verified against a precise count instead of a rough
+ * range.
+ */
+ if (this_cpu_has_perf_global_ctrl()) {
+ /* instructions event */
+ gp_events[0].min = PRECISE_LOOP_INSTRNS;
+ gp_events[0].max = PRECISE_LOOP_INSTRNS;
+ /* branches event */
+ gp_events[branch_idx].min = PRECISE_LOOP_BRANCHES;
+ gp_events[branch_idx].max = PRECISE_LOOP_BRANCHES;
+ }
+}
+
volatile uint64_t irq_received;

static void cnt_overflow(isr_regs_t *regs)
@@ -771,6 +793,7 @@ static void check_invalid_rdpmc_gp(void)

int main(int ac, char **av)
{
+ int branch_idx;
setup_vm();
handle_irq(PMI_VECTOR, cnt_overflow);
buf = malloc(N*64);
@@ -784,13 +807,16 @@ int main(int ac, char **av)
}
gp_events = (struct pmu_event *)intel_gp_events;
gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
+ branch_idx = 5;
report_prefix_push("Intel");
set_ref_cycle_expectations();
} else {
gp_events_size = sizeof(amd_gp_events)/sizeof(amd_gp_events[0]);
gp_events = (struct pmu_event *)amd_gp_events;
+ branch_idx = 2;
report_prefix_push("AMD");
}
+ adjust_events_range(gp_events, branch_idx);

printf("PMU version: %d\n", pmu.version);
printf("GP counters: %d\n", pmu.nr_gp_counters);
--
2.34.1


2024-01-03 03:13:36

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 09/11] x86: pmu: Improve LLC misses event verification

When running the pmu test on SPR, sometimes the following failure is
reported.

1 <= 0 <= 1000000
FAIL: Intel: llc misses-4

Currently whether an LLC miss occurs depends only on probability. It
is possible that no LLC miss happens in the whole loop(), especially as
processors have larger and larger cache sizes, just like what we
observed on SPR.

Thus, add a clflush instruction into the loop() asm blob and ensure at
least one LLC miss is triggered.

Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 43 ++++++++++++++++++++++++++++++-------------
1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index b764827c1c3d..8fd3db0fbf81 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -20,19 +20,21 @@

// Instrustion number of LOOP_ASM code
#define LOOP_INSTRNS 10
-#define LOOP_ASM \
+#define LOOP_ASM(_clflush) \
+ _clflush "\n\t" \
+ "mfence;\n\t" \
"1: mov (%1), %2; add $64, %1;\n\t" \
"nop; nop; nop; nop; nop; nop; nop;\n\t" \
"loop 1b;\n\t"

-/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
-#define PRECISE_EXTRA_INSTRNS (2 + 4)
+/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
+#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2)
#define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
#define PRECISE_LOOP_BRANCHES (N)
-#define PRECISE_LOOP_ASM \
+#define PRECISE_LOOP_ASM(_clflush) \
"wrmsr;\n\t" \
"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
- LOOP_ASM \
+ LOOP_ASM(_clflush) \
"mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
"wrmsr;\n\t"

@@ -72,14 +74,30 @@ char *buf;
static struct pmu_event *gp_events;
static unsigned int gp_events_size;

+#define _loop_asm(_clflush) \
+do { \
+ asm volatile(LOOP_ASM(_clflush) \
+ : "=c"(tmp), "=r"(tmp2), "=r"(tmp3) \
+ : "0"(N), "1"(buf)); \
+} while (0)
+
+#define _precise_loop_asm(_clflush) \
+do { \
+ asm volatile(PRECISE_LOOP_ASM(_clflush) \
+ : "=b"(tmp), "=r"(tmp2), "=r"(tmp3) \
+ : "a"(eax), "d"(edx), "c"(global_ctl), \
+ "0"(N), "1"(buf) \
+ : "edi"); \
+} while (0)

static inline void __loop(void)
{
unsigned long tmp, tmp2, tmp3;

- asm volatile(LOOP_ASM
- : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
- : "0"(N), "1"(buf));
+ if (this_cpu_has(X86_FEATURE_CLFLUSH))
+ _loop_asm("clflush (%1)");
+ else
+ _loop_asm("nop");
}

/*
@@ -96,11 +114,10 @@ static inline void __precise_count_loop(u64 cntrs)
u32 eax = cntrs & (BIT_ULL(32) - 1);
u32 edx = cntrs >> 32;

- asm volatile(PRECISE_LOOP_ASM
- : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
- : "a"(eax), "d"(edx), "c"(global_ctl),
- "0"(N), "1"(buf)
- : "edi");
+ if (this_cpu_has(X86_FEATURE_CLFLUSH))
+ _precise_loop_asm("clflush (%1)");
+ else
+ _precise_loop_asm("nop");
}

static inline void loop(u64 cntrs)
--
2.34.1


2024-01-03 03:13:44

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 10/11] x86: pmu: Add IBPB indirect jump asm blob

Currently the lower boundary of the branch misses event is set to 0.
Strictly speaking, 0 shouldn't be a valid count since it can't tell us
whether the branch misses event counter works correctly or is even
disabled. On the other hand, it is also possible and reasonable that
the branch misses count is 0, especially for such a simple loop()
program with an advanced branch predictor.

To eliminate this ambiguity and make the branch misses event
verification more accurate, an extra IBPB indirect jump asm blob is
appended and the IBPB command is leveraged to clear the branch target
buffer and force a branch miss on the indirect jump.

Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 56 +++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 8fd3db0fbf81..c8d4a0dcd362 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -27,14 +27,26 @@
"nop; nop; nop; nop; nop; nop; nop;\n\t" \
"loop 1b;\n\t"

-/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
-#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2)
+#define IBPB_JMP_INSTRNS 7
+#define IBPB_JMP_BRANCHES 1
+#define IBPB_JMP_ASM(_wrmsr) \
+ "mov $1, %%eax; xor %%edx, %%edx;\n\t" \
+ "mov $73, %%ecx;\n\t" \
+ _wrmsr "\n\t" \
+ "lea 2f, %%rax;\n\t" \
+ "jmp *%%rax;\n\t" \
+ "nop;\n\t" \
+ "2: nop;\n\t"
+
+/* GLOBAL_CTRL enable + disable + clflush/mfence + IBPB_JMP */
+#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2 + IBPB_JMP_INSTRNS)
#define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
-#define PRECISE_LOOP_BRANCHES (N)
-#define PRECISE_LOOP_ASM(_clflush) \
+#define PRECISE_LOOP_BRANCHES (N + IBPB_JMP_BRANCHES)
+#define PRECISE_LOOP_ASM(_clflush, _wrmsr) \
"wrmsr;\n\t" \
"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
LOOP_ASM(_clflush) \
+ IBPB_JMP_ASM(_wrmsr) \
"mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
"wrmsr;\n\t"

@@ -74,30 +86,42 @@ char *buf;
static struct pmu_event *gp_events;
static unsigned int gp_events_size;

-#define _loop_asm(_clflush) \
+#define _loop_asm(_clflush, _wrmsr) \
do { \
asm volatile(LOOP_ASM(_clflush) \
+ IBPB_JMP_ASM(_wrmsr) \
: "=c"(tmp), "=r"(tmp2), "=r"(tmp3) \
- : "0"(N), "1"(buf)); \
+ : "0"(N), "1"(buf) \
+ : "eax", "edx"); \
} while (0)

-#define _precise_loop_asm(_clflush) \
+#define _precise_loop_asm(_clflush, _wrmsr) \
do { \
- asm volatile(PRECISE_LOOP_ASM(_clflush) \
+ asm volatile(PRECISE_LOOP_ASM(_clflush, _wrmsr) \
: "=b"(tmp), "=r"(tmp2), "=r"(tmp3) \
: "a"(eax), "d"(edx), "c"(global_ctl), \
"0"(N), "1"(buf) \
: "edi"); \
} while (0)

+static int has_ibpb(void)
+{
+ return this_cpu_has(X86_FEATURE_SPEC_CTRL) ||
+ this_cpu_has(X86_FEATURE_AMD_IBPB);
+}
+
static inline void __loop(void)
{
unsigned long tmp, tmp2, tmp3;

- if (this_cpu_has(X86_FEATURE_CLFLUSH))
- _loop_asm("clflush (%1)");
+ if (this_cpu_has(X86_FEATURE_CLFLUSH) && has_ibpb())
+ _loop_asm("clflush (%1)", "wrmsr");
+ else if (this_cpu_has(X86_FEATURE_CLFLUSH))
+ _loop_asm("clflush (%1)", "nop");
+ else if (has_ibpb())
+ _loop_asm("nop", "wrmsr");
else
- _loop_asm("nop");
+ _loop_asm("nop", "nop");
}

/*
@@ -114,10 +138,14 @@ static inline void __precise_count_loop(u64 cntrs)
u32 eax = cntrs & (BIT_ULL(32) - 1);
u32 edx = cntrs >> 32;

- if (this_cpu_has(X86_FEATURE_CLFLUSH))
- _precise_loop_asm("clflush (%1)");
+ if (this_cpu_has(X86_FEATURE_CLFLUSH) && has_ibpb())
+ _precise_loop_asm("clflush (%1)", "wrmsr");
+ else if (this_cpu_has(X86_FEATURE_CLFLUSH))
+ _precise_loop_asm("clflush (%1)", "nop");
+ else if (has_ibpb())
+ _precise_loop_asm("nop", "wrmsr");
else
- _precise_loop_asm("nop");
+ _precise_loop_asm("nop", "nop");
}

static inline void loop(u64 cntrs)
--
2.34.1


2024-01-03 03:14:26

by Mi, Dapeng

Subject: [kvm-unit-tests Patch v3 11/11] x86: pmu: Improve branch misses event verification

Since the IBPB command is already leveraged to force one branch miss,
the lower boundary of the branch misses event can be set to 1 instead
of 0 on processors which support IBPB. Thus the ambiguity from 0 is
eliminated.

Signed-off-by: Dapeng Mi <[email protected]>
---
x86/pmu.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/x86/pmu.c b/x86/pmu.c
index c8d4a0dcd362..d5c3fcfaa84c 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -172,6 +172,16 @@ static void adjust_events_range(struct pmu_event *gp_events, int branch_idx)
gp_events[branch_idx].min = PRECISE_LOOP_BRANCHES;
gp_events[branch_idx].max = PRECISE_LOOP_BRANCHES;
}
+
+ /*
+ * If HW supports IBPB, one branch miss is forced to trigger by
+ * IBPB command. Thus overwrite the lower boundary of branch misses
+ * event to 1.
+ */
+ if (has_ibpb()) {
+ /* branch misses event */
+ gp_events[branch_idx + 1].min = 1;
+ }
}

volatile uint64_t irq_received;
--
2.34.1


2024-01-24 09:20:58

by Mi, Dapeng

Subject: Re: [kvm-unit-tests Patch v3 00/11] pmu test bugs fix and improvements

Kindly ping ...


2024-03-25 21:42:09

by Jim Mattson

Subject: Re: [kvm-unit-tests Patch v3 02/11] x86: pmu: Enlarge cnt[] length to 64 in check_counters_many()

On Tue, Jan 2, 2024 at 7:09 PM Dapeng Mi <[email protected]> wrote:
>
> Considering there are already 8 GP counters and 4 fixed counters on
> latest Intel processors, like Sapphire Rapids. The original cnt[] array
> length 10 is definitely not enough to cover all supported PMU counters on these
> new processors even through currently KVM only supports 3 fixed counters
> at most. This would cause out of bound memory access and may trigger
> false alarm on PMU counter validation
>
> It's probably more and more GP and fixed counters are introduced in the
> future and then directly extends the cnt[] array length to 64 once and
> for all.
>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 0def28695c70..a13b8a8398c6 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -254,7 +254,7 @@ static void check_fixed_counters(void)
>
> static void check_counters_many(void)
> {
> - pmu_counter_t cnt[10];
> + pmu_counter_t cnt[64];

I think 48 should be sufficient, based on the layout of
IA32_PERF_GLOBAL_CTRL and IA32_PERF_GLOBAL_STATUS.

In any event, let's bump this up!

Reviewed-by: Jim Mattson <[email protected]>

2024-03-27 05:30:30

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 03/11] x86: pmu: Add asserts to warn inconsistent fixed events and counters

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> Current PMU code deosn't check whether PMU fixed counter number is
> larger than pre-defined fixed events. If so, it would cause memory
> access out of range.
>
> So add assert to warn this invalid case.
>
> Signed-off-by: Dapeng Mi <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> x86/pmu.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a13b8a8398c6..a42fff8d8b36 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -111,8 +111,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
> for (i = 0; i < gp_events_size; i++)
> if (gp_events[i].unit_sel == (cnt->config & 0xffff))
> return &gp_events[i];
> - } else
> - return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
> + } else {
> + int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
maybe unsigned int is better?
> +
> + assert(idx < ARRAY_SIZE(fixed_events));
> + return &fixed_events[idx];
> + }
>
> return (void*)0;
> }
> @@ -245,6 +249,7 @@ static void check_fixed_counters(void)
> };
> int i;
>
> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
> for (i = 0; i < pmu.nr_fixed_counters; i++) {
> cnt.ctr = fixed_events[i].unit_sel;
> measure_one(&cnt);
> @@ -266,6 +271,7 @@ static void check_counters_many(void)
> gp_events[i % gp_events_size].unit_sel;
> n++;
> }
> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
> for (i = 0; i < pmu.nr_fixed_counters; i++) {
> cnt[n].ctr = fixed_events[i].unit_sel;
> cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
> --
> 2.34.1
>

2024-03-27 05:37:17

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 04/11] x86: pmu: Switch instructions and core cycles events sequence

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> When running pmu test on SPR, sometimes the following failure is
> reported.
>
> PMU version: 2
> GP counters: 8
> GP counter width: 48
> Mask length: 8
> Fixed counters: 3
> Fixed counter width: 48
> 1000000 <= 55109398 <= 50000000
> FAIL: Intel: core cycles-0
> 1000000 <= 18279571 <= 50000000
> PASS: Intel: core cycles-1
> 1000000 <= 12238092 <= 50000000
> PASS: Intel: core cycles-2
> 1000000 <= 7981727 <= 50000000
> PASS: Intel: core cycles-3
> 1000000 <= 6984711 <= 50000000
> PASS: Intel: core cycles-4
> 1000000 <= 6773673 <= 50000000
> PASS: Intel: core cycles-5
> 1000000 <= 6697842 <= 50000000
> PASS: Intel: core cycles-6
> 1000000 <= 6747947 <= 50000000
> PASS: Intel: core cycles-7
>
> The count of the "core cycles" on first counter would exceed the upper
> boundary and leads to a failure, and then the "core cycles" count would
> drop gradually and reach a stable state.
>
> That looks reasonable. The "core cycles" event is defined as the 1st
> event in xxx_gp_events[] array and it is always verified at first.
> when the program loop() is executed at the first time it needs to warm
> up the pipeline and cache, such as it has to wait for cache is filled.
> All these warm-up work leads to a quite large core cycles count which
> may exceeds the verification range.
>
> The event "instructions" instead of "core cycles" is a good choice as
> the warm-up event since it would always return a fixed count. Thus
> switch instructions and core cycles events sequence in the
> xxx_gp_events[] array.

The observation is great. However, it is hard to agree with fixing the
problem by switching the order. Maybe directly tweaking the upper bound
(50*N) to a larger value makes more sense.

Thanks.
-Mingwei
>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a42fff8d8b36..67ebfbe55b49 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -31,16 +31,16 @@ struct pmu_event {
> int min;
> int max;
> } intel_gp_events[] = {
> - {"core cycles", 0x003c, 1*N, 50*N},
> {"instructions", 0x00c0, 10*N, 10.2*N},
> + {"core cycles", 0x003c, 1*N, 50*N},
> {"ref cycles", 0x013c, 1*N, 30*N},
> {"llc references", 0x4f2e, 1, 2*N},
> {"llc misses", 0x412e, 1, 1*N},
> {"branches", 0x00c4, 1*N, 1.1*N},
> {"branch misses", 0x00c5, 0, 0.1*N},
> }, amd_gp_events[] = {
> - {"core cycles", 0x0076, 1*N, 50*N},
> {"instructions", 0x00c0, 10*N, 10.2*N},
> + {"core cycles", 0x0076, 1*N, 50*N},
> {"branches", 0x00c2, 1*N, 1.1*N},
> {"branch misses", 0x00c3, 0, 0.1*N},
> }, fixed_events[] = {
> @@ -307,7 +307,7 @@ static void check_counter_overflow(void)
> int i;
> pmu_counter_t cnt = {
> .ctr = MSR_GP_COUNTERx(0),
> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
> };
> overflow_preset = measure_for_overflow(&cnt);
>
> @@ -365,11 +365,11 @@ static void check_gp_counter_cmask(void)
> {
> pmu_counter_t cnt = {
> .ctr = MSR_GP_COUNTERx(0),
> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
> };
> cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
> measure_one(&cnt);
> - report(cnt.count < gp_events[1].min, "cmask");
> + report(cnt.count < gp_events[0].min, "cmask");
> }
>
> static void do_rdpmc_fast(void *ptr)
> @@ -446,7 +446,7 @@ static void check_running_counter_wrmsr(void)
> uint64_t count;
> pmu_counter_t evt = {
> .ctr = MSR_GP_COUNTERx(0),
> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
> };
>
> report_prefix_push("running counter wrmsr");
> @@ -455,7 +455,7 @@ static void check_running_counter_wrmsr(void)
> loop();
> wrmsr(MSR_GP_COUNTERx(0), 0);
> stop_event(&evt);
> - report(evt.count < gp_events[1].min, "cntr");
> + report(evt.count < gp_events[0].min, "cntr");
>
> /* clear status before overflow test */
> if (this_cpu_has_perf_global_status())
> @@ -493,7 +493,7 @@ static void check_emulated_instr(void)
> pmu_counter_t instr_cnt = {
> .ctr = MSR_GP_COUNTERx(1),
> /* instructions */
> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
> };
> report_prefix_push("emulated instruction");
>
> --
> 2.34.1
>

2024-03-27 05:38:28

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 05/11] x86: pmu: Refine fixed_events[] names

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> In SDM the fixed counter is numbered from 0 but currently the
> fixed_events names are numbered from 1. It would cause confusion for
> users. So Change the fixed_events[] names to number from 0 as well and
> keep identical with SDM.
>
> Signed-off-by: Dapeng Mi <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> x86/pmu.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 67ebfbe55b49..a2c64a1ce95b 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -44,9 +44,9 @@ struct pmu_event {
> {"branches", 0x00c2, 1*N, 1.1*N},
> {"branch misses", 0x00c3, 0, 0.1*N},
> }, fixed_events[] = {
> - {"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
> - {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
> - {"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
> + {"fixed 0", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
> + {"fixed 1", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
> + {"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
> };
>
> char *buf;
> --
> 2.34.1
>

2024-03-27 05:38:58

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 06/11] x86: pmu: Remove blank line and redundant space

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> code style changes.
>
> Signed-off-by: Dapeng Mi <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
> ---
> x86/pmu.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a2c64a1ce95b..46bed66c5c9f 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -207,8 +207,7 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
> static bool verify_event(uint64_t count, struct pmu_event *e)
> {
> // printf("%d <= %ld <= %d\n", e->min, count, e->max);
> - return count >= e->min && count <= e->max;
> -
> + return count >= e->min && count <= e->max;
> }
>
> static bool verify_counter(pmu_counter_t *cnt)
> --
> 2.34.1
>

2024-03-27 06:08:29

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 07/11] x86: pmu: Enable and disable PMCs in loop() asm blob

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> Currently enabling PMCs, executing loop() and disabling PMCs are divided
> 3 separated functions. So there could be other instructions executed
> between enabling PMCS and running loop() or running loop() and disabling
> PMCs, e.g. if there are multiple counters enabled in measure_many()
> function, the instructions which enabling the 2nd and more counters
> would be counted in by the 1st counter.
>
> So current implementation can only verify the correctness of count by an
> rough range rather than a precise count even for instructions and
> branches events. Strictly speaking, this verification is meaningless as
> the test could still pass even though KVM vPMU has something wrong and
> reports an incorrect instructions or branches count which is in the rough
> range.
>
> Thus, move the PMCs enabling and disabling into the loop() asm blob and
> ensure only the loop asm instructions would be counted, then the
> instructions or branches events can be verified with an precise count
> instead of an rough range.
>
> Signed-off-by: Dapeng Mi <[email protected]>

> ---
> x86/pmu.c | 83 +++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 69 insertions(+), 14 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 46bed66c5c9f..88b89ad889b9 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -18,6 +18,20 @@
> #define EXPECTED_INSTR 17
> #define EXPECTED_BRNCH 5
>
> +// Instrustion number of LOOP_ASM code
> +#define LOOP_INSTRNS 10
> +#define LOOP_ASM \
> + "1: mov (%1), %2; add $64, %1;\n\t" \
> + "nop; nop; nop; nop; nop; nop; nop;\n\t" \
> + "loop 1b;\n\t"
> +
> +#define PRECISE_LOOP_ASM \
> + "wrmsr;\n\t" \
> + "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
> + LOOP_ASM \
> + "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
> + "wrmsr;\n\t"

Can we add "FEP" prefix into the above blob? This way, we can expand the
testing for emulated instructions.
> +
> typedef struct {
> uint32_t ctr;
> uint64_t config;
> @@ -54,13 +68,43 @@ char *buf;
> static struct pmu_event *gp_events;
> static unsigned int gp_events_size;
>
> -static inline void loop(void)
> +
> +static inline void __loop(void)
> +{
> + unsigned long tmp, tmp2, tmp3;
> +
> + asm volatile(LOOP_ASM
> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
> + : "0"(N), "1"(buf));
> +}
> +
> +/*
> + * Enable and disable counters in a whole asm blob to ensure
> + * no other instructions are counted in the time slot between
> + * counters enabling and really LOOP_ASM code executing.
> + * Thus counters can verify instructions and branches events
> + * against precise counts instead of a rough valid count range.
> + */
> +static inline void __precise_count_loop(u64 cntrs)
> {
> unsigned long tmp, tmp2, tmp3;
> + unsigned int global_ctl = pmu.msr_global_ctl;
> + u32 eax = cntrs & (BIT_ULL(32) - 1);
> + u32 edx = cntrs >> 32;
>
> - asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
> + asm volatile(PRECISE_LOOP_ASM
> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
> + : "a"(eax), "d"(edx), "c"(global_ctl),
> + "0"(N), "1"(buf)
> + : "edi");
> +}
>
> +static inline void loop(u64 cntrs)
> +{
> + if (!this_cpu_has_perf_global_ctrl())
> + __loop();
> + else
> + __precise_count_loop(cntrs);
> }
>
> volatile uint64_t irq_received;
> @@ -159,18 +203,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
> ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
> wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
> }
> - global_enable(evt);
> apic_write(APIC_LVTPC, PMI_VECTOR);
> }
>
> static void start_event(pmu_counter_t *evt)
> {
> __start_event(evt, 0);
> + global_enable(evt);
> }
>
> -static void stop_event(pmu_counter_t *evt)
> +static void __stop_event(pmu_counter_t *evt)
> {
> - global_disable(evt);
> if (is_gp(evt)) {
> wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
> evt->config & ~EVNTSEL_EN);
> @@ -182,14 +225,24 @@ static void stop_event(pmu_counter_t *evt)
> evt->count = rdmsr(evt->ctr);
> }
>
> +static void stop_event(pmu_counter_t *evt)
> +{
> + global_disable(evt);
> + __stop_event(evt);
> +}
> +
> static noinline void measure_many(pmu_counter_t *evt, int count)
> {
> int i;
> + u64 cntrs = 0;
> +
> + for (i = 0; i < count; i++) {
> + __start_event(&evt[i], 0);
> + cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
> + }
> + loop(cntrs);
> for (i = 0; i < count; i++)
> - start_event(&evt[i]);
> - loop();
> - for (i = 0; i < count; i++)
> - stop_event(&evt[i]);
> + __stop_event(&evt[i]);
> }
>
> static void measure_one(pmu_counter_t *evt)
> @@ -199,9 +252,11 @@ static void measure_one(pmu_counter_t *evt)
>
> static noinline void __measure(pmu_counter_t *evt, uint64_t count)
> {
> + u64 cntrs = BIT_ULL(event_to_global_idx(evt));
> +
> __start_event(evt, count);
> - loop();
> - stop_event(evt);
> + loop(cntrs);
> + __stop_event(evt);
> }
>
> static bool verify_event(uint64_t count, struct pmu_event *e)
> @@ -451,7 +506,7 @@ static void check_running_counter_wrmsr(void)
> report_prefix_push("running counter wrmsr");
>
> start_event(&evt);
> - loop();
> + __loop();
> wrmsr(MSR_GP_COUNTERx(0), 0);
> stop_event(&evt);
> report(evt.count < gp_events[0].min, "cntr");
> @@ -468,7 +523,7 @@ static void check_running_counter_wrmsr(void)
>
> wrmsr(MSR_GP_COUNTERx(0), count);
>
> - loop();
> + __loop();
> stop_event(&evt);
>
> if (this_cpu_has_perf_global_status()) {
> --
> 2.34.1
>

2024-03-27 06:15:24

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 08/11] x86: pmu: Improve instruction and branches events verification

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are moved in
> __precise_count_loop(). Thus, instructions and branches events can be
> verified against a precise count instead of a rough range.
>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 88b89ad889b9..b764827c1c3d 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -25,6 +25,10 @@
> "nop; nop; nop; nop; nop; nop; nop;\n\t" \
> "loop 1b;\n\t"
>
> +/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
> +#define PRECISE_EXTRA_INSTRNS (2 + 4)
> +#define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
> +#define PRECISE_LOOP_BRANCHES (N)
> #define PRECISE_LOOP_ASM \
> "wrmsr;\n\t" \
> "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
> @@ -107,6 +111,24 @@ static inline void loop(u64 cntrs)
> __precise_count_loop(cntrs);
> }
>
> +static void adjust_events_range(struct pmu_event *gp_events, int branch_idx)
> +{
> + /*
> + * If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are
> + * moved in __precise_count_loop(). Thus, instructions and branches
> + * events can be verified against a precise count instead of a rough
> + * range.
> + */
> + if (this_cpu_has_perf_global_ctrl()) {
> + /* instructions event */
> + gp_events[0].min = PRECISE_LOOP_INSTRNS;
> + gp_events[0].max = PRECISE_LOOP_INSTRNS;
> + /* branches event */
> + gp_events[branch_idx].min = PRECISE_LOOP_BRANCHES;
> + gp_events[branch_idx].max = PRECISE_LOOP_BRANCHES;
> + }
> +}
> +
> volatile uint64_t irq_received;
>
> static void cnt_overflow(isr_regs_t *regs)
> @@ -771,6 +793,7 @@ static void check_invalid_rdpmc_gp(void)
>
> int main(int ac, char **av)
> {
> + int branch_idx;
> setup_vm();
> handle_irq(PMI_VECTOR, cnt_overflow);
> buf = malloc(N*64);
> @@ -784,13 +807,16 @@ int main(int ac, char **av)
> }
> gp_events = (struct pmu_event *)intel_gp_events;
> gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
> + branch_idx = 5;

This (and the follow-up one) hardcoded index is hacky and, more
importantly, error prone, especially when the code gets refactored
later. Please use a proper way via a macro? E.g., checking
INTEL_ARCH_BRANCHES_RETIRED_INDEX in pmu_counters_test.c might be a
good one.
> report_prefix_push("Intel");
> set_ref_cycle_expectations();
> } else {
> gp_events_size = sizeof(amd_gp_events)/sizeof(amd_gp_events[0]);
> gp_events = (struct pmu_event *)amd_gp_events;
> + branch_idx = 2;
> report_prefix_push("AMD");
> }
> + adjust_events_range(gp_events, branch_idx);
>
> printf("PMU version: %d\n", pmu.version);
> printf("GP counters: %d\n", pmu.nr_gp_counters);
> --
> 2.34.1
>

2024-03-27 06:33:26

by Mingwei Zhang

Subject: Re: [kvm-unit-tests Patch v3 09/11] x86: pmu: Improve LLC misses event verification

On Wed, Jan 03, 2024, Dapeng Mi wrote:
> When running pmu test on SPR, sometimes the following failure is
> reported.
>
> 1 <= 0 <= 1000000
> FAIL: Intel: llc misses-4
>
> Currently The LLC misses occurring only depends on probability. It's
> possible that there is no LLC misses happened in the whole loop(),
> especially along with processors have larger and larger cache size just
> like what we observed on SPR.
>
> Thus, add clflush instruction into the loop() asm blob and ensure once
> LLC miss is triggered at least.
>
> Suggested-by: Jim Mattson <[email protected]>
> Signed-off-by: Dapeng Mi <[email protected]>

I wonder if we can skip all LLC tests if CPU does not have
clflush/clflushopt properties?
> ---
> x86/pmu.c | 43 ++++++++++++++++++++++++++++++-------------
> 1 file changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index b764827c1c3d..8fd3db0fbf81 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -20,19 +20,21 @@
>
> // Instrustion number of LOOP_ASM code
> #define LOOP_INSTRNS 10
> -#define LOOP_ASM \
> +#define LOOP_ASM(_clflush) \
> + _clflush "\n\t" \
> + "mfence;\n\t" \
> "1: mov (%1), %2; add $64, %1;\n\t" \
> "nop; nop; nop; nop; nop; nop; nop;\n\t" \
> "loop 1b;\n\t"
>
> -/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
> -#define PRECISE_EXTRA_INSTRNS (2 + 4)
> +/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
> +#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2)
> #define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
> #define PRECISE_LOOP_BRANCHES (N)
> -#define PRECISE_LOOP_ASM \
> +#define PRECISE_LOOP_ASM(_clflush) \
> "wrmsr;\n\t" \
> "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
> - LOOP_ASM \
> + LOOP_ASM(_clflush) \
> "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
> "wrmsr;\n\t"
>
> @@ -72,14 +74,30 @@ char *buf;
> static struct pmu_event *gp_events;
> static unsigned int gp_events_size;
>
> +#define _loop_asm(_clflush) \
> +do { \
> + asm volatile(LOOP_ASM(_clflush) \
> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3) \
> + : "0"(N), "1"(buf)); \
> +} while (0)
> +
> +#define _precise_loop_asm(_clflush) \
> +do { \
> + asm volatile(PRECISE_LOOP_ASM(_clflush) \
> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3) \
> + : "a"(eax), "d"(edx), "c"(global_ctl), \
> + "0"(N), "1"(buf) \
> + : "edi"); \
> +} while (0)
>
> static inline void __loop(void)
> {
> unsigned long tmp, tmp2, tmp3;
>
> - asm volatile(LOOP_ASM
> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
> - : "0"(N), "1"(buf));
> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
> + _loop_asm("clflush (%1)");
> + else
> + _loop_asm("nop");
> }
>
> /*
> @@ -96,11 +114,10 @@ static inline void __precise_count_loop(u64 cntrs)
> u32 eax = cntrs & (BIT_ULL(32) - 1);
> u32 edx = cntrs >> 32;
>
> - asm volatile(PRECISE_LOOP_ASM
> - : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
> - : "a"(eax), "d"(edx), "c"(global_ctl),
> - "0"(N), "1"(buf)
> - : "edi");
> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
> + _precise_loop_asm("clflush (%1)");
> + else
> + _precise_loop_asm("nop");
> }
>
> static inline void loop(u64 cntrs)
> --
> 2.34.1
>

2024-03-27 08:55:49

by Mi, Dapeng

Subject: Re: [kvm-unit-tests Patch v3 04/11] x86: pmu: Switch instructions and core cycles events sequence


On 3/27/2024 1:36 PM, Mingwei Zhang wrote:
> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>> When running pmu test on SPR, sometimes the following failure is
>> reported.
>>
>> PMU version: 2
>> GP counters: 8
>> GP counter width: 48
>> Mask length: 8
>> Fixed counters: 3
>> Fixed counter width: 48
>> 1000000 <= 55109398 <= 50000000
>> FAIL: Intel: core cycles-0
>> 1000000 <= 18279571 <= 50000000
>> PASS: Intel: core cycles-1
>> 1000000 <= 12238092 <= 50000000
>> PASS: Intel: core cycles-2
>> 1000000 <= 7981727 <= 50000000
>> PASS: Intel: core cycles-3
>> 1000000 <= 6984711 <= 50000000
>> PASS: Intel: core cycles-4
>> 1000000 <= 6773673 <= 50000000
>> PASS: Intel: core cycles-5
>> 1000000 <= 6697842 <= 50000000
>> PASS: Intel: core cycles-6
>> 1000000 <= 6747947 <= 50000000
>> PASS: Intel: core cycles-7
>>
>> The count of the "core cycles" on first counter would exceed the upper
>> boundary and leads to a failure, and then the "core cycles" count would
>> drop gradually and reach a stable state.
>>
>> That looks reasonable. The "core cycles" event is defined as the 1st
>> event in xxx_gp_events[] array and it is always verified at first.
>> when the program loop() is executed at the first time it needs to warm
>> up the pipeline and cache, such as it has to wait for cache is filled.
>> All these warm-up work leads to a quite large core cycles count which
>> may exceeds the verification range.
>>
>> The event "instructions" instead of "core cycles" is a good choice as
>> the warm-up event since it would always return a fixed count. Thus
>> switch instructions and core cycles events sequence in the
>> xxx_gp_events[] array.
> The observation is great. However, it is hard to agree that we fix the
> problem by switching the order. Maybe directly tweaking the N from 50 to
> a larger value makes more sense.
>
> Thanks.
> -Mingwei

Yeah, a larger upper boundary can fix the failure as well, but the
question is how large would be enough. The cycles needed for warm-up
could differ across CPU models, so we may have to set a quite large
upper boundary, but a large boundary would decrease the credibility of
this validation. Not sure which one is better. Any input from others?


>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> x86/pmu.c | 16 ++++++++--------
>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index a42fff8d8b36..67ebfbe55b49 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -31,16 +31,16 @@ struct pmu_event {
>> int min;
>> int max;
>> } intel_gp_events[] = {
>> - {"core cycles", 0x003c, 1*N, 50*N},
>> {"instructions", 0x00c0, 10*N, 10.2*N},
>> + {"core cycles", 0x003c, 1*N, 50*N},
>> {"ref cycles", 0x013c, 1*N, 30*N},
>> {"llc references", 0x4f2e, 1, 2*N},
>> {"llc misses", 0x412e, 1, 1*N},
>> {"branches", 0x00c4, 1*N, 1.1*N},
>> {"branch misses", 0x00c5, 0, 0.1*N},
>> }, amd_gp_events[] = {
>> - {"core cycles", 0x0076, 1*N, 50*N},
>> {"instructions", 0x00c0, 10*N, 10.2*N},
>> + {"core cycles", 0x0076, 1*N, 50*N},
>> {"branches", 0x00c2, 1*N, 1.1*N},
>> {"branch misses", 0x00c3, 0, 0.1*N},
>> }, fixed_events[] = {
>> @@ -307,7 +307,7 @@ static void check_counter_overflow(void)
>> int i;
>> pmu_counter_t cnt = {
>> .ctr = MSR_GP_COUNTERx(0),
>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
>> };
>> overflow_preset = measure_for_overflow(&cnt);
>>
>> @@ -365,11 +365,11 @@ static void check_gp_counter_cmask(void)
>> {
>> pmu_counter_t cnt = {
>> .ctr = MSR_GP_COUNTERx(0),
>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
>> };
>> cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
>> measure_one(&cnt);
>> - report(cnt.count < gp_events[1].min, "cmask");
>> + report(cnt.count < gp_events[0].min, "cmask");
>> }
>>
>> static void do_rdpmc_fast(void *ptr)
>> @@ -446,7 +446,7 @@ static void check_running_counter_wrmsr(void)
>> uint64_t count;
>> pmu_counter_t evt = {
>> .ctr = MSR_GP_COUNTERx(0),
>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
>> };
>>
>> report_prefix_push("running counter wrmsr");
>> @@ -455,7 +455,7 @@ static void check_running_counter_wrmsr(void)
>> loop();
>> wrmsr(MSR_GP_COUNTERx(0), 0);
>> stop_event(&evt);
>> - report(evt.count < gp_events[1].min, "cntr");
>> + report(evt.count < gp_events[0].min, "cntr");
>>
>> /* clear status before overflow test */
>> if (this_cpu_has_perf_global_status())
>> @@ -493,7 +493,7 @@ static void check_emulated_instr(void)
>> pmu_counter_t instr_cnt = {
>> .ctr = MSR_GP_COUNTERx(1),
>> /* instructions */
>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
>> };
>> report_prefix_push("emulated instruction");
>>
>> --
>> 2.34.1
>>

2024-03-27 08:56:30

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 07/11] x86: pmu: Enable and disable PMCs in loop() asm blob


On 3/27/2024 2:07 PM, Mingwei Zhang wrote:
> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>> Currently enabling PMCs, executing loop() and disabling PMCs are divided
>> 3 separated functions. So there could be other instructions executed
>> between enabling PMCS and running loop() or running loop() and disabling
>> PMCs, e.g. if there are multiple counters enabled in measure_many()
>> function, the instructions which enabling the 2nd and more counters
>> would be counted in by the 1st counter.
>>
>> So current implementation can only verify the correctness of count by an
>> rough range rather than a precise count even for instructions and
>> branches events. Strictly speaking, this verification is meaningless as
>> the test could still pass even though KVM vPMU has something wrong and
>> reports an incorrect instructions or branches count which is in the rough
>> range.
>>
>> Thus, move the PMCs enabling and disabling into the loop() asm blob and
>> ensure only the loop asm instructions would be counted, then the
>> instructions or branches events can be verified with an precise count
>> instead of an rough range.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> x86/pmu.c | 83 +++++++++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 69 insertions(+), 14 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 46bed66c5c9f..88b89ad889b9 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -18,6 +18,20 @@
>> #define EXPECTED_INSTR 17
>> #define EXPECTED_BRNCH 5
>>
>> +// Instrustion number of LOOP_ASM code
>> +#define LOOP_INSTRNS 10
>> +#define LOOP_ASM \
>> + "1: mov (%1), %2; add $64, %1;\n\t" \
>> + "nop; nop; nop; nop; nop; nop; nop;\n\t" \
>> + "loop 1b;\n\t"
>> +
>> +#define PRECISE_LOOP_ASM \
>> + "wrmsr;\n\t" \
>> + "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
>> + LOOP_ASM \
>> + "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
>> + "wrmsr;\n\t"
> Can we add "FEP" prefix into the above blob? This way, we can expand the
> testing for emulated instructions.


Yeah, that sounds like a new feature request. I would add it in the next
version.
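
As a rough idea only (sketched against the patch 07 form of LOOP_ASM; the
macro plumbing and the local KVM_FEP definition below are assumptions, the
"ud2" + 'k','v','m' bytes mirror the forced-emulation prefix used by the
existing emulator tests):

/* Forced-emulation prefix, as used elsewhere in kvm-unit-tests; defining
 * it locally here is an assumption for illustration. */
#define KVM_FEP "ud2; .byte 'k', 'v', 'm';\n\t"

/* Hypothetical: take a per-iteration prefix so the first load of every
 * iteration can optionally be forced through the instruction emulator. */
#define LOOP_ASM(_fep)					\
	"1: " _fep "mov (%1), %2; add $64, %1;\n\t"	\
	"nop; nop; nop; nop; nop; nop; nop;\n\t"	\
	"loop 1b;\n\t"

/* LOOP_ASM("") keeps today's behavior; LOOP_ASM(KVM_FEP) forces the mov
 * in each iteration through the emulator. */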


>> +
>> typedef struct {
>> uint32_t ctr;
>> uint64_t config;
>> @@ -54,13 +68,43 @@ char *buf;
>> static struct pmu_event *gp_events;
>> static unsigned int gp_events_size;
>>
>> -static inline void loop(void)
>> +
>> +static inline void __loop(void)
>> +{
>> + unsigned long tmp, tmp2, tmp3;
>> +
>> + asm volatile(LOOP_ASM
>> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
>> + : "0"(N), "1"(buf));
>> +}
>> +
>> +/*
>> + * Enable and disable counters in a whole asm blob to ensure
>> + * no other instructions are counted in the time slot between
>> + * counters enabling and really LOOP_ASM code executing.
>> + * Thus counters can verify instructions and branches events
>> + * against precise counts instead of a rough valid count range.
>> + */
>> +static inline void __precise_count_loop(u64 cntrs)
>> {
>> unsigned long tmp, tmp2, tmp3;
>> + unsigned int global_ctl = pmu.msr_global_ctl;
>> + u32 eax = cntrs & (BIT_ULL(32) - 1);
>> + u32 edx = cntrs >> 32;
>>
>> - asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
>> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
>> + asm volatile(PRECISE_LOOP_ASM
>> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
>> + : "a"(eax), "d"(edx), "c"(global_ctl),
>> + "0"(N), "1"(buf)
>> + : "edi");
>> +}
>>
>> +static inline void loop(u64 cntrs)
>> +{
>> + if (!this_cpu_has_perf_global_ctrl())
>> + __loop();
>> + else
>> + __precise_count_loop(cntrs);
>> }
>>
>> volatile uint64_t irq_received;
>> @@ -159,18 +203,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
>> ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
>> wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
>> }
>> - global_enable(evt);
>> apic_write(APIC_LVTPC, PMI_VECTOR);
>> }
>>
>> static void start_event(pmu_counter_t *evt)
>> {
>> __start_event(evt, 0);
>> + global_enable(evt);
>> }
>>
>> -static void stop_event(pmu_counter_t *evt)
>> +static void __stop_event(pmu_counter_t *evt)
>> {
>> - global_disable(evt);
>> if (is_gp(evt)) {
>> wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
>> evt->config & ~EVNTSEL_EN);
>> @@ -182,14 +225,24 @@ static void stop_event(pmu_counter_t *evt)
>> evt->count = rdmsr(evt->ctr);
>> }
>>
>> +static void stop_event(pmu_counter_t *evt)
>> +{
>> + global_disable(evt);
>> + __stop_event(evt);
>> +}
>> +
>> static noinline void measure_many(pmu_counter_t *evt, int count)
>> {
>> int i;
>> + u64 cntrs = 0;
>> +
>> + for (i = 0; i < count; i++) {
>> + __start_event(&evt[i], 0);
>> + cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
>> + }
>> + loop(cntrs);
>> for (i = 0; i < count; i++)
>> - start_event(&evt[i]);
>> - loop();
>> - for (i = 0; i < count; i++)
>> - stop_event(&evt[i]);
>> + __stop_event(&evt[i]);
>> }
>>
>> static void measure_one(pmu_counter_t *evt)
>> @@ -199,9 +252,11 @@ static void measure_one(pmu_counter_t *evt)
>>
>> static noinline void __measure(pmu_counter_t *evt, uint64_t count)
>> {
>> + u64 cntrs = BIT_ULL(event_to_global_idx(evt));
>> +
>> __start_event(evt, count);
>> - loop();
>> - stop_event(evt);
>> + loop(cntrs);
>> + __stop_event(evt);
>> }
>>
>> static bool verify_event(uint64_t count, struct pmu_event *e)
>> @@ -451,7 +506,7 @@ static void check_running_counter_wrmsr(void)
>> report_prefix_push("running counter wrmsr");
>>
>> start_event(&evt);
>> - loop();
>> + __loop();
>> wrmsr(MSR_GP_COUNTERx(0), 0);
>> stop_event(&evt);
>> report(evt.count < gp_events[0].min, "cntr");
>> @@ -468,7 +523,7 @@ static void check_running_counter_wrmsr(void)
>>
>> wrmsr(MSR_GP_COUNTERx(0), count);
>>
>> - loop();
>> + __loop();
>> stop_event(&evt);
>>
>> if (this_cpu_has_perf_global_status()) {
>> --
>> 2.34.1
>>

2024-03-27 09:08:36

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 02/11] x86: pmu: Enlarge cnt[] length to 64 in check_counters_many()


On 3/26/2024 5:41 AM, Jim Mattson wrote:
> On Tue, Jan 2, 2024 at 7:09 PM Dapeng Mi <[email protected]> wrote:
>> Considering there are already 8 GP counters and 4 fixed counters on
>> latest Intel processors, like Sapphire Rapids. The original cnt[] array
>> length 10 is definitely not enough to cover all supported PMU counters on these
>> new processors even through currently KVM only supports 3 fixed counters
>> at most. This would cause out of bound memory access and may trigger
>> false alarm on PMU counter validation
>>
>> It's probably more and more GP and fixed counters are introduced in the
>> future and then directly extends the cnt[] array length to 64 once and
>> for all.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> x86/pmu.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 0def28695c70..a13b8a8398c6 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -254,7 +254,7 @@ static void check_fixed_counters(void)
>>
>> static void check_counters_many(void)
>> {
>> - pmu_counter_t cnt[10];
>> + pmu_counter_t cnt[64];
> I think 48 should be sufficient, based on the layout of
> IA32_PERF_GLOBAL_CTRL and IA32_PERF_GLOBAL_STATUS.
>
> In any event, let's bump this up!
>
> Reviewed-by: Jim Mattson <[email protected]>

Yeah, I would shrink it to 48.

Thanks, Jim. I'm glad you have time to review this patchset. Do you also
have time to review the other patches?


2024-03-27 09:18:20

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 09/11] x86: pmu: Improve LLC misses event verification


On 3/27/2024 2:23 PM, Mingwei Zhang wrote:
> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>> When running pmu test on SPR, sometimes the following failure is
>> reported.
>>
>> 1 <= 0 <= 1000000
>> FAIL: Intel: llc misses-4
>>
>> Currently The LLC misses occurring only depends on probability. It's
>> possible that there is no LLC misses happened in the whole loop(),
>> especially along with processors have larger and larger cache size just
>> like what we observed on SPR.
>>
>> Thus, add clflush instruction into the loop() asm blob and ensure once
>> LLC miss is triggered at least.
>>
>> Suggested-by: Jim Mattson <[email protected]>
>> Signed-off-by: Dapeng Mi <[email protected]>
> I wonder if we can skip all LLC tests if CPU does not have
> clflush/clflushopt properties?

That looks reasonable. The LLC misses event should be skipped at least
when clflush/clflushopt are not supported, since the validation can't
guarantee correctness in that case. But I think we can keep the LLC
references event, since the loop() asm blob would always trigger at
least one LLC reference.

If there is no objection, I would change the code in the next version to
skip the LLC misses event when clflush/clflushopt are not supported.
Thanks.
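
A minimal sketch of that skip (the helper name, the use of report_skip(),
and the call site shown in the comment are assumptions for illustration,
not the final code):

/* Sketch: the "llc misses" verification is only meaningful when a miss
 * can be forced with clflush (a clflushopt check could be OR'ed in if a
 * feature flag is available). */
static bool llc_miss_forceable(void)
{
	return this_cpu_has(X86_FEATURE_CLFLUSH);
}

/* Hypothetical use while walking gp_events[], e.g. in check_gp_counters():
 *
 *	if (!strcmp(gp_events[i].name, "llc misses") &&
 *	    !llc_miss_forceable()) {
 *		report_skip("llc misses");
 *		continue;
 *	}
 */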


>> ---
>> x86/pmu.c | 43 ++++++++++++++++++++++++++++++-------------
>> 1 file changed, 30 insertions(+), 13 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index b764827c1c3d..8fd3db0fbf81 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -20,19 +20,21 @@
>>
>> // Instrustion number of LOOP_ASM code
>> #define LOOP_INSTRNS 10
>> -#define LOOP_ASM \
>> +#define LOOP_ASM(_clflush) \
>> + _clflush "\n\t" \
>> + "mfence;\n\t" \
>> "1: mov (%1), %2; add $64, %1;\n\t" \
>> "nop; nop; nop; nop; nop; nop; nop;\n\t" \
>> "loop 1b;\n\t"
>>
>> -/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
>> -#define PRECISE_EXTRA_INSTRNS (2 + 4)
>> +/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
>> +#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2)
>> #define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
>> #define PRECISE_LOOP_BRANCHES (N)
>> -#define PRECISE_LOOP_ASM \
>> +#define PRECISE_LOOP_ASM(_clflush) \
>> "wrmsr;\n\t" \
>> "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
>> - LOOP_ASM \
>> + LOOP_ASM(_clflush) \
>> "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
>> "wrmsr;\n\t"
>>
>> @@ -72,14 +74,30 @@ char *buf;
>> static struct pmu_event *gp_events;
>> static unsigned int gp_events_size;
>>
>> +#define _loop_asm(_clflush) \
>> +do { \
>> + asm volatile(LOOP_ASM(_clflush) \
>> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3) \
>> + : "0"(N), "1"(buf)); \
>> +} while (0)
>> +
>> +#define _precise_loop_asm(_clflush) \
>> +do { \
>> + asm volatile(PRECISE_LOOP_ASM(_clflush) \
>> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3) \
>> + : "a"(eax), "d"(edx), "c"(global_ctl), \
>> + "0"(N), "1"(buf) \
>> + : "edi"); \
>> +} while (0)
>>
>> static inline void __loop(void)
>> {
>> unsigned long tmp, tmp2, tmp3;
>>
>> - asm volatile(LOOP_ASM
>> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
>> - : "0"(N), "1"(buf));
>> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
>> + _loop_asm("clflush (%1)");
>> + else
>> + _loop_asm("nop");
>> }
>>
>> /*
>> @@ -96,11 +114,10 @@ static inline void __precise_count_loop(u64 cntrs)
>> u32 eax = cntrs & (BIT_ULL(32) - 1);
>> u32 edx = cntrs >> 32;
>>
>> - asm volatile(PRECISE_LOOP_ASM
>> - : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
>> - : "a"(eax), "d"(edx), "c"(global_ctl),
>> - "0"(N), "1"(buf)
>> - : "edi");
>> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
>> + _precise_loop_asm("clflush (%1)");
>> + else
>> + _precise_loop_asm("nop");
>> }
>>
>> static inline void loop(u64 cntrs)
>> --
>> 2.34.1
>>

2024-03-27 10:21:58

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 03/11] x86: pmu: Add asserts to warn inconsistent fixed events and counters


On 3/27/2024 1:30 PM, Mingwei Zhang wrote:
> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>> Current PMU code deosn't check whether PMU fixed counter number is
>> larger than pre-defined fixed events. If so, it would cause memory
>> access out of range.
>>
>> So add assert to warn this invalid case.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
> Reviewed-by: Mingwei Zhang <[email protected]>
>> ---
>> x86/pmu.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index a13b8a8398c6..a42fff8d8b36 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -111,8 +111,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
>> for (i = 0; i < gp_events_size; i++)
>> if (gp_events[i].unit_sel == (cnt->config & 0xffff))
>> return &gp_events[i];
>> - } else
>> - return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
>> + } else {
>> + int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
> maybe unsigned int is better?

Makes sense. Thanks for the review.

>> +
>> + assert(idx < ARRAY_SIZE(fixed_events));
>> + return &fixed_events[idx];
>> + }
>>
>> return (void*)0;
>> }
>> @@ -245,6 +249,7 @@ static void check_fixed_counters(void)
>> };
>> int i;
>>
>> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
>> for (i = 0; i < pmu.nr_fixed_counters; i++) {
>> cnt.ctr = fixed_events[i].unit_sel;
>> measure_one(&cnt);
>> @@ -266,6 +271,7 @@ static void check_counters_many(void)
>> gp_events[i % gp_events_size].unit_sel;
>> n++;
>> }
>> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
>> for (i = 0; i < pmu.nr_fixed_counters; i++) {
>> cnt[n].ctr = fixed_events[i].unit_sel;
>> cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
>> --
>> 2.34.1
>>

2024-03-27 14:40:25

by Jim Mattson

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 03/11] x86: pmu: Add asserts to warn inconsistent fixed events and counters

On Tue, Jan 2, 2024 at 7:09 PM Dapeng Mi <[email protected]> wrote:
>
> Current PMU code deosn't check whether PMU fixed counter number is
> larger than pre-defined fixed events. If so, it would cause memory
> access out of range.
>
> So add assert to warn this invalid case.
>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a13b8a8398c6..a42fff8d8b36 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -111,8 +111,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
> for (i = 0; i < gp_events_size; i++)
> if (gp_events[i].unit_sel == (cnt->config & 0xffff))
> return &gp_events[i];
> - } else
> - return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
> + } else {
> + int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
> +
> + assert(idx < ARRAY_SIZE(fixed_events));
> + return &fixed_events[idx];
> + }
>
> return (void*)0;
> }
> @@ -245,6 +249,7 @@ static void check_fixed_counters(void)
> };
> int i;
>
> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
> for (i = 0; i < pmu.nr_fixed_counters; i++) {
> cnt.ctr = fixed_events[i].unit_sel;
> measure_one(&cnt);
> @@ -266,6 +271,7 @@ static void check_counters_many(void)
> gp_events[i % gp_events_size].unit_sel;
> n++;
> }
> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));

Can we assert this just once, in main()?

> for (i = 0; i < pmu.nr_fixed_counters; i++) {
> cnt[n].ctr = fixed_events[i].unit_sel;
> cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
> --
> 2.34.1
>

2024-03-27 15:13:17

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 08/11] x86: pmu: Improve instruction and branches events verification


On 3/27/2024 2:14 PM, Mingwei Zhang wrote:
> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>> If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are moved in
>> __precise_count_loop(). Thus, instructions and branches events can be
>> verified against a precise count instead of a rough range.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> x86/pmu.c | 26 ++++++++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index 88b89ad889b9..b764827c1c3d 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -25,6 +25,10 @@
>> "nop; nop; nop; nop; nop; nop; nop;\n\t" \
>> "loop 1b;\n\t"
>>
>> +/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
>> +#define PRECISE_EXTRA_INSTRNS (2 + 4)
>> +#define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
>> +#define PRECISE_LOOP_BRANCHES (N)
>> #define PRECISE_LOOP_ASM \
>> "wrmsr;\n\t" \
>> "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
>> @@ -107,6 +111,24 @@ static inline void loop(u64 cntrs)
>> __precise_count_loop(cntrs);
>> }
>>
>> +static void adjust_events_range(struct pmu_event *gp_events, int branch_idx)
>> +{
>> + /*
>> + * If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are
>> + * moved in __precise_count_loop(). Thus, instructions and branches
>> + * events can be verified against a precise count instead of a rough
>> + * range.
>> + */
>> + if (this_cpu_has_perf_global_ctrl()) {
>> + /* instructions event */
>> + gp_events[0].min = PRECISE_LOOP_INSTRNS;
>> + gp_events[0].max = PRECISE_LOOP_INSTRNS;
>> + /* branches event */
>> + gp_events[branch_idx].min = PRECISE_LOOP_BRANCHES;
>> + gp_events[branch_idx].max = PRECISE_LOOP_BRANCHES;
>> + }
>> +}
>> +
>> volatile uint64_t irq_received;
>>
>> static void cnt_overflow(isr_regs_t *regs)
>> @@ -771,6 +793,7 @@ static void check_invalid_rdpmc_gp(void)
>>
>> int main(int ac, char **av)
>> {
>> + int branch_idx;
>> setup_vm();
>> handle_irq(PMI_VECTOR, cnt_overflow);
>> buf = malloc(N*64);
>> @@ -784,13 +807,16 @@ int main(int ac, char **av)
>> }
>> gp_events = (struct pmu_event *)intel_gp_events;
>> gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
>> + branch_idx = 5;
> This (and the follow up one) hardcoded index is hacky and more
> importantly, error prone especially when code get refactored later.
> Please use a proper way via macro? Eg., checking
> INTEL_ARCH_BRANCHES_RETIRED_INDEX in pmu_counters_test.c might be a good
> one.

Yeah, I would define an enum to enumerate these indexes. Thanks.
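
Something along these lines, roughly (enumerator names are placeholders,
and how the Intel/AMD split is detected in main() follows the existing
code):

/* Sketch: name the indexes into the reordered intel_gp_events[] and
 * amd_gp_events[] arrays so call sites don't hard-code magic numbers.
 * The names below are placeholders, not the final ones. */
enum {
	INTEL_INSTRUCTIONS_IDX = 0,
	INTEL_CORE_CYCLES_IDX,
	INTEL_REF_CYCLES_IDX,
	INTEL_LLC_REFS_IDX,
	INTEL_LLC_MISSES_IDX,
	INTEL_BRANCHES_IDX,
	INTEL_BRANCH_MISSES_IDX,
};

enum {
	AMD_INSTRUCTIONS_IDX = 0,
	AMD_CORE_CYCLES_IDX,
	AMD_BRANCHES_IDX,
	AMD_BRANCH_MISSES_IDX,
};

/* Then main() picks the named constant instead of the hard-coded 5/2:
 *	branch_idx = INTEL_BRANCHES_IDX;   (Intel path)
 *	branch_idx = AMD_BRANCHES_IDX;     (AMD path)
 */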


>> report_prefix_push("Intel");
>> set_ref_cycle_expectations();
>> } else {
>> gp_events_size = sizeof(amd_gp_events)/sizeof(amd_gp_events[0]);
>> gp_events = (struct pmu_event *)amd_gp_events;
>> + branch_idx = 2;
>> report_prefix_push("AMD");
>> }
>> + adjust_events_range(gp_events, branch_idx);
>>
>> printf("PMU version: %d\n", pmu.version);
>> printf("GP counters: %d\n", pmu.nr_gp_counters);
>> --
>> 2.34.1
>>

2024-03-27 15:26:13

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 09/11] x86: pmu: Improve LLC misses event verification

On 1/3/2024 11:14 AM, Dapeng Mi wrote:
> When running pmu test on SPR, sometimes the following failure is
> reported.
>
> 1 <= 0 <= 1000000
> FAIL: Intel: llc misses-4
>
> Currently The LLC misses occurring only depends on probability. It's
> possible that there is no LLC misses happened in the whole loop(),
> especially along with processors have larger and larger cache size just
> like what we observed on SPR.
>
> Thus, add clflush instruction into the loop() asm blob and ensure once
> LLC miss is triggered at least.
>
> Suggested-by: Jim Mattson <[email protected]>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 43 ++++++++++++++++++++++++++++++-------------
> 1 file changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index b764827c1c3d..8fd3db0fbf81 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -20,19 +20,21 @@
>
> // Instrustion number of LOOP_ASM code
> #define LOOP_INSTRNS 10
> -#define LOOP_ASM \
> +#define LOOP_ASM(_clflush) \
> + _clflush "\n\t" \
> + "mfence;\n\t" \
> "1: mov (%1), %2; add $64, %1;\n\t" \
> "nop; nop; nop; nop; nop; nop; nop;\n\t" \
> "loop 1b;\n\t"
>
> -/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
> -#define PRECISE_EXTRA_INSTRNS (2 + 4)
> +/*Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
> +#define PRECISE_EXTRA_INSTRNS (2 + 4 + 2)
> #define PRECISE_LOOP_INSTRNS (N * LOOP_INSTRNS + PRECISE_EXTRA_INSTRNS)
> #define PRECISE_LOOP_BRANCHES (N)
> -#define PRECISE_LOOP_ASM \
> +#define PRECISE_LOOP_ASM(_clflush) \
> "wrmsr;\n\t" \
> "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
> - LOOP_ASM \
> + LOOP_ASM(_clflush) \
> "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
> "wrmsr;\n\t"
>
> @@ -72,14 +74,30 @@ char *buf;
> static struct pmu_event *gp_events;
> static unsigned int gp_events_size;
>
> +#define _loop_asm(_clflush) \
> +do { \
> + asm volatile(LOOP_ASM(_clflush) \
> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3) \
> + : "0"(N), "1"(buf)); \
> +} while (0)
> +
> +#define _precise_loop_asm(_clflush) \
> +do { \
> + asm volatile(PRECISE_LOOP_ASM(_clflush) \
> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3) \
> + : "a"(eax), "d"(edx), "c"(global_ctl), \
> + "0"(N), "1"(buf) \
> + : "edi"); \
> +} while (0)
>
> static inline void __loop(void)
> {
> unsigned long tmp, tmp2, tmp3;

Can you move these tmp variables into the macro's do...while block, since
they're not needed here?
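
i.e. something like this, presumably (a sketch only; _precise_loop_asm()
would still need eax/edx/global_ctl from its caller):

#define _loop_asm(_clflush)					\
do {								\
	unsigned long tmp, tmp2, tmp3;				\
								\
	asm volatile(LOOP_ASM(_clflush)				\
		     : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)	\
		     : "0"(N), "1"(buf));			\
} while (0)

static inline void __loop(void)
{
	if (this_cpu_has(X86_FEATURE_CLFLUSH))
		_loop_asm("clflush (%1)");
	else
		_loop_asm("nop");
}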

>
> - asm volatile(LOOP_ASM
> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
> - : "0"(N), "1"(buf));
> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
> + _loop_asm("clflush (%1)");
> + else
> + _loop_asm("nop");
> }
>
> /*
> @@ -96,11 +114,10 @@ static inline void __precise_count_loop(u64 cntrs)
> u32 eax = cntrs & (BIT_ULL(32) - 1);
> u32 edx = cntrs >> 32;

Ditto.

>
> - asm volatile(PRECISE_LOOP_ASM
> - : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
> - : "a"(eax), "d"(edx), "c"(global_ctl),
> - "0"(N), "1"(buf)
> - : "edi");
> + if (this_cpu_has(X86_FEATURE_CLFLUSH))
> + _precise_loop_asm("clflush (%1)");
> + else
> + _precise_loop_asm("nop");
> }
>
> static inline void loop(u64 cntrs)


2024-03-27 17:07:05

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 04/11] x86: pmu: Switch instructions and core cycles events sequence

On Wed, Mar 27, 2024, Mi, Dapeng wrote:
>
> On 3/27/2024 1:36 PM, Mingwei Zhang wrote:
> > On Wed, Jan 03, 2024, Dapeng Mi wrote:
> > > When running pmu test on SPR, sometimes the following failure is
> > > reported.
> > >
> > > PMU version: 2
> > > GP counters: 8
> > > GP counter width: 48
> > > Mask length: 8
> > > Fixed counters: 3
> > > Fixed counter width: 48
> > > 1000000 <= 55109398 <= 50000000
> > > FAIL: Intel: core cycles-0
> > > 1000000 <= 18279571 <= 50000000
> > > PASS: Intel: core cycles-1
> > > 1000000 <= 12238092 <= 50000000
> > > PASS: Intel: core cycles-2
> > > 1000000 <= 7981727 <= 50000000
> > > PASS: Intel: core cycles-3
> > > 1000000 <= 6984711 <= 50000000
> > > PASS: Intel: core cycles-4
> > > 1000000 <= 6773673 <= 50000000
> > > PASS: Intel: core cycles-5
> > > 1000000 <= 6697842 <= 50000000
> > > PASS: Intel: core cycles-6
> > > 1000000 <= 6747947 <= 50000000
> > > PASS: Intel: core cycles-7
> > >
> > > The count of the "core cycles" on first counter would exceed the upper
> > > boundary and leads to a failure, and then the "core cycles" count would
> > > drop gradually and reach a stable state.
> > >
> > > That looks reasonable. The "core cycles" event is defined as the 1st
> > > event in xxx_gp_events[] array and it is always verified at first.
> > > when the program loop() is executed at the first time it needs to warm
> > > up the pipeline and cache, such as it has to wait for cache is filled.
> > > All these warm-up work leads to a quite large core cycles count which
> > > may exceeds the verification range.
> > >
> > > The event "instructions" instead of "core cycles" is a good choice as
> > > the warm-up event since it would always return a fixed count. Thus
> > > switch instructions and core cycles events sequence in the
> > > xxx_gp_events[] array.
> > The observation is great. However, it is hard to agree that we fix the
> > problem by switching the order. Maybe directly tweaking the N from 50 to
> > a larger value makes more sense.
> >
> > Thanks.
> > -Mingwei
>
> yeah, a larger upper boundary can fix the fault as well, but the question is
> how large it would be enough. For different CPU model, the needed cycles
> could be different for warming up. So we may have to set a quite large upper
> boundary but a large boundary would decrease credibility of this validation.
> Not sure which one is better. Any inputs from other ones?
>

Just checked with an expert on our side: "core cycles" (0x003c) is
affected by the current CPU state/frequency, i.e., its count can vary
widely. In that sense, warming up seems reasonable. However, switching
the order would be a terrible idea for maintenance, since people will
forget why it was done and the problem will come back.

From another perspective, warming up is just a best effort. Nobody knows
how warm is really warm. Besides, some systems might turn off some
C-states or cap the max turbo frequency, and all of these directly
affect the warm-up process and the counting result of 0x003c.

So, while adding a warm-up blob is reasonable, tweaking the heuristics
seems equally acceptable to me. Regarding the value, I will rely on your
experiments and observations.

Thanks.
-Mingwei
>
> > > Signed-off-by: Dapeng Mi <[email protected]>
> > > ---
> > > x86/pmu.c | 16 ++++++++--------
> > > 1 file changed, 8 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/x86/pmu.c b/x86/pmu.c
> > > index a42fff8d8b36..67ebfbe55b49 100644
> > > --- a/x86/pmu.c
> > > +++ b/x86/pmu.c
> > > @@ -31,16 +31,16 @@ struct pmu_event {
> > > int min;
> > > int max;
> > > } intel_gp_events[] = {
> > > - {"core cycles", 0x003c, 1*N, 50*N},
> > > {"instructions", 0x00c0, 10*N, 10.2*N},
> > > + {"core cycles", 0x003c, 1*N, 50*N},
> > > {"ref cycles", 0x013c, 1*N, 30*N},
> > > {"llc references", 0x4f2e, 1, 2*N},
> > > {"llc misses", 0x412e, 1, 1*N},
> > > {"branches", 0x00c4, 1*N, 1.1*N},
> > > {"branch misses", 0x00c5, 0, 0.1*N},
> > > }, amd_gp_events[] = {
> > > - {"core cycles", 0x0076, 1*N, 50*N},
> > > {"instructions", 0x00c0, 10*N, 10.2*N},
> > > + {"core cycles", 0x0076, 1*N, 50*N},
> > > {"branches", 0x00c2, 1*N, 1.1*N},
> > > {"branch misses", 0x00c3, 0, 0.1*N},
> > > }, fixed_events[] = {
> > > @@ -307,7 +307,7 @@ static void check_counter_overflow(void)
> > > int i;
> > > pmu_counter_t cnt = {
> > > .ctr = MSR_GP_COUNTERx(0),
> > > - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
> > > + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
> > > };
> > > overflow_preset = measure_for_overflow(&cnt);
> > > @@ -365,11 +365,11 @@ static void check_gp_counter_cmask(void)
> > > {
> > > pmu_counter_t cnt = {
> > > .ctr = MSR_GP_COUNTERx(0),
> > > - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
> > > + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
> > > };
> > > cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
> > > measure_one(&cnt);
> > > - report(cnt.count < gp_events[1].min, "cmask");
> > > + report(cnt.count < gp_events[0].min, "cmask");
> > > }
> > > static void do_rdpmc_fast(void *ptr)
> > > @@ -446,7 +446,7 @@ static void check_running_counter_wrmsr(void)
> > > uint64_t count;
> > > pmu_counter_t evt = {
> > > .ctr = MSR_GP_COUNTERx(0),
> > > - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
> > > + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
> > > };
> > > report_prefix_push("running counter wrmsr");
> > > @@ -455,7 +455,7 @@ static void check_running_counter_wrmsr(void)
> > > loop();
> > > wrmsr(MSR_GP_COUNTERx(0), 0);
> > > stop_event(&evt);
> > > - report(evt.count < gp_events[1].min, "cntr");
> > > + report(evt.count < gp_events[0].min, "cntr");
> > > /* clear status before overflow test */
> > > if (this_cpu_has_perf_global_status())
> > > @@ -493,7 +493,7 @@ static void check_emulated_instr(void)
> > > pmu_counter_t instr_cnt = {
> > > .ctr = MSR_GP_COUNTERx(1),
> > > /* instructions */
> > > - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
> > > + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
> > > };
> > > report_prefix_push("emulated instruction");
> > > --
> > > 2.34.1
> > >

2024-03-28 01:22:02

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 01/11] x86: pmu: Remove duplicate code in pmu_init()


On 3/28/2024 9:19 AM, Yang, Weijiang wrote:
> On 1/3/2024 11:13 AM, Dapeng Mi wrote:
>> From: Xiong Zhang <[email protected]>
>>
>> There are totally same code in pmu_init() helper, remove the duplicate
>> code.
>>
>> Signed-off-by: Xiong Zhang <[email protected]>
>> Signed-off-by: Dapeng Mi <[email protected]>
>
> Jim has already added RB tag for this patch in v2, you may add it here.

Oh, I missed it. Thanks for the reminder.

>
>> ---
>>   lib/x86/pmu.c | 5 -----
>>   1 file changed, 5 deletions(-)
>>
>> diff --git a/lib/x86/pmu.c b/lib/x86/pmu.c
>> index 0f2afd650bc9..d06e94553024 100644
>> --- a/lib/x86/pmu.c
>> +++ b/lib/x86/pmu.c
>> @@ -16,11 +16,6 @@ void pmu_init(void)
>>               pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
>>           }
>>   -        if (pmu.version > 1) {
>> -            pmu.nr_fixed_counters = cpuid_10.d & 0x1f;
>> -            pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
>> -        }
>> -
>>           pmu.nr_gp_counters = (cpuid_10.a >> 8) & 0xff;
>>           pmu.gp_counter_width = (cpuid_10.a >> 16) & 0xff;
>>           pmu.gp_counter_mask_length = (cpuid_10.a >> 24) & 0xff;
>

2024-03-28 01:22:55

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 01/11] x86: pmu: Remove duplicate code in pmu_init()

On 1/3/2024 11:13 AM, Dapeng Mi wrote:
> From: Xiong Zhang <[email protected]>
>
> There are totally same code in pmu_init() helper, remove the duplicate
> code.
>
> Signed-off-by: Xiong Zhang <[email protected]>
> Signed-off-by: Dapeng Mi <[email protected]>

Jim has already added his RB tag for this patch in v2; you may carry it over here.

> ---
> lib/x86/pmu.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/lib/x86/pmu.c b/lib/x86/pmu.c
> index 0f2afd650bc9..d06e94553024 100644
> --- a/lib/x86/pmu.c
> +++ b/lib/x86/pmu.c
> @@ -16,11 +16,6 @@ void pmu_init(void)
> pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
> }
>
> - if (pmu.version > 1) {
> - pmu.nr_fixed_counters = cpuid_10.d & 0x1f;
> - pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
> - }
> -
> pmu.nr_gp_counters = (cpuid_10.a >> 8) & 0xff;
> pmu.gp_counter_width = (cpuid_10.a >> 16) & 0xff;
> pmu.gp_counter_mask_length = (cpuid_10.a >> 24) & 0xff;


2024-03-28 01:25:05

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 06/11] x86: pmu: Remove blank line and redundant space

On 1/3/2024 11:14 AM, Dapeng Mi wrote:
> code style changes.
>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> x86/pmu.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a2c64a1ce95b..46bed66c5c9f 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -207,8 +207,7 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
> static bool verify_event(uint64_t count, struct pmu_event *e)
> {
> // printf("%d <= %ld <= %d\n", e->min, count, e->max);
> - return count >= e->min && count <= e->max;
> -
> + return count >= e->min && count <= e->max;

I don't think it's necessary to fix this nit in a separate patch; just squash it
into one of the other patches with an "Opportunistically ..." note.

> }
>
> static bool verify_counter(pmu_counter_t *cnt)


2024-03-28 09:29:54

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 03/11] x86: pmu: Add asserts to warn inconsistent fixed events and counters


On 3/27/2024 9:11 PM, Jim Mattson wrote:
> On Tue, Jan 2, 2024 at 7:09 PM Dapeng Mi <[email protected]> wrote:
>> Current PMU code deosn't check whether PMU fixed counter number is
>> larger than pre-defined fixed events. If so, it would cause memory
>> access out of range.
>>
>> So add assert to warn this invalid case.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> x86/pmu.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index a13b8a8398c6..a42fff8d8b36 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -111,8 +111,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
>> for (i = 0; i < gp_events_size; i++)
>> if (gp_events[i].unit_sel == (cnt->config & 0xffff))
>> return &gp_events[i];
>> - } else
>> - return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
>> + } else {
>> + int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
>> +
>> + assert(idx < ARRAY_SIZE(fixed_events));
>> + return &fixed_events[idx];
>> + }
>>
>> return (void*)0;
>> }
>> @@ -245,6 +249,7 @@ static void check_fixed_counters(void)
>> };
>> int i;
>>
>> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
>> for (i = 0; i < pmu.nr_fixed_counters; i++) {
>> cnt.ctr = fixed_events[i].unit_sel;
>> measure_one(&cnt);
>> @@ -266,6 +271,7 @@ static void check_counters_many(void)
>> gp_events[i % gp_events_size].unit_sel;
>> n++;
>> }
>> + assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));
> Can we assert this just once, in main()?
Sure, will do.
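e.g. a single sanity check in main() once the counters are enumerated
(the exact placement is a guess, not the final code):

	/* Sketch: one check in main() instead of repeating the assert
	 * in every check_*() helper. */
	assert(pmu.nr_fixed_counters <= ARRAY_SIZE(fixed_events));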
>
>> for (i = 0; i < pmu.nr_fixed_counters; i++) {
>> cnt[n].ctr = fixed_events[i].unit_sel;
>> cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
>> --
>> 2.34.1
>>

2024-03-28 10:12:36

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 06/11] x86: pmu: Remove blank line and redundant space


On 3/28/2024 9:23 AM, Yang, Weijiang wrote:
> On 1/3/2024 11:14 AM, Dapeng Mi wrote:
>> code style changes.
>>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>>   x86/pmu.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index a2c64a1ce95b..46bed66c5c9f 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -207,8 +207,7 @@ static noinline void __measure(pmu_counter_t
>> *evt, uint64_t count)
>>   static bool verify_event(uint64_t count, struct pmu_event *e)
>>   {
>>       // printf("%d <= %ld <= %d\n", e->min, count, e->max);
>> -    return count >= e->min  && count <= e->max;
>> -
>> +    return count >= e->min && count <= e->max;
>
> I don't think it's necessary to fix the nit in a separate patch, just
> squash it in some patch with
> "Opportunistically ...."

Not sure about this; reviewers have always asked me to put code style
fixes in a separate patch. It seems to be an unwritten rule for Linux.


>
>>   }
>>     static bool verify_counter(pmu_counter_t *cnt)
>

2024-03-28 15:16:29

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 04/11] x86: pmu: Switch instructions and core cycles events sequence


On 3/28/2024 1:06 AM, Mingwei Zhang wrote:
> On Wed, Mar 27, 2024, Mi, Dapeng wrote:
>> On 3/27/2024 1:36 PM, Mingwei Zhang wrote:
>>> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>>>> When running pmu test on SPR, sometimes the following failure is
>>>> reported.
>>>>
>>>> PMU version: 2
>>>> GP counters: 8
>>>> GP counter width: 48
>>>> Mask length: 8
>>>> Fixed counters: 3
>>>> Fixed counter width: 48
>>>> 1000000 <= 55109398 <= 50000000
>>>> FAIL: Intel: core cycles-0
>>>> 1000000 <= 18279571 <= 50000000
>>>> PASS: Intel: core cycles-1
>>>> 1000000 <= 12238092 <= 50000000
>>>> PASS: Intel: core cycles-2
>>>> 1000000 <= 7981727 <= 50000000
>>>> PASS: Intel: core cycles-3
>>>> 1000000 <= 6984711 <= 50000000
>>>> PASS: Intel: core cycles-4
>>>> 1000000 <= 6773673 <= 50000000
>>>> PASS: Intel: core cycles-5
>>>> 1000000 <= 6697842 <= 50000000
>>>> PASS: Intel: core cycles-6
>>>> 1000000 <= 6747947 <= 50000000
>>>> PASS: Intel: core cycles-7
>>>>
>>>> The count of the "core cycles" on first counter would exceed the upper
>>>> boundary and leads to a failure, and then the "core cycles" count would
>>>> drop gradually and reach a stable state.
>>>>
>>>> That looks reasonable. The "core cycles" event is defined as the 1st
>>>> event in xxx_gp_events[] array and it is always verified at first.
>>>> when the program loop() is executed at the first time it needs to warm
>>>> up the pipeline and cache, such as it has to wait for cache is filled.
>>>> All these warm-up work leads to a quite large core cycles count which
>>>> may exceeds the verification range.
>>>>
>>>> The event "instructions" instead of "core cycles" is a good choice as
>>>> the warm-up event since it would always return a fixed count. Thus
>>>> switch instructions and core cycles events sequence in the
>>>> xxx_gp_events[] array.
>>> The observation is great. However, it is hard to agree that we fix the
>>> problem by switching the order. Maybe directly tweaking the N from 50 to
>>> a larger value makes more sense.
>>>
>>> Thanks.
>>> -Mingwei
>> yeah, a larger upper boundary can fix the fault as well, but the question is
>> how large it would be enough. For different CPU model, the needed cycles
>> could be different for warming up. So we may have to set a quite large upper
>> boundary but a large boundary would decrease credibility of this validation.
>> Not sure which one is better. Any inputs from other ones?
>>
> Just checked with an expert from our side, so "core cycles" (0x003c)
> is affected the current CPU state/frequency, ie., its counting value
> could vary largely. In that sense, "warming" up seems reasonable.
> However, switching the order would be a terrible idea for maintanence
> since people will forget it and the problem will come back.
>
> From another perspective, "warming" up seems just a best effort. Nobody
> knows how warm is really warm. Besides, some systems might turn off some
> C-State and may set a cap on max turbo frequency. All of these will
> directly affect the warm-up process and the counting result of 0x003c.
>
> So, while adding a warm-up blob is reasonable, tweaking the heuristics
> seems to be same for me. Regarding the value, I think I will rely on
> your experiments and observation.

Per my understanding, most of the extra CPU cycles come from warming up
the cache. If we don't want to change the validation order, it should be
doable to add an extra warm-up phase before starting the validation, so
we don't need to enlarge the upper boundary. Enlarging the boundary is
the less preferable option since it would decrease the credibility of
the validation.

Let me try adding a warm-up phase first and check whether it works as
expected.
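
A first stab could be as simple as the sketch below (the iteration count
and the call site in main() are guesses, not measured values):

/*
 * Sketch: run the measured loop a few times before the first event is
 * verified so caches and the pipeline are already warm when "core
 * cycles" is counted.
 */
static void warm_up(void)
{
	int i;

	/* 10 iterations is an arbitrary guess. */
	for (i = 0; i < 10; i++)
		loop(0);
}

/* and in main(), before the first check_*() call:
 *	warm_up();
 */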


>
> Thanks.
> -Mingwei
>>>> Signed-off-by: Dapeng Mi <[email protected]>
>>>> ---
>>>> x86/pmu.c | 16 ++++++++--------
>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/x86/pmu.c b/x86/pmu.c
>>>> index a42fff8d8b36..67ebfbe55b49 100644
>>>> --- a/x86/pmu.c
>>>> +++ b/x86/pmu.c
>>>> @@ -31,16 +31,16 @@ struct pmu_event {
>>>> int min;
>>>> int max;
>>>> } intel_gp_events[] = {
>>>> - {"core cycles", 0x003c, 1*N, 50*N},
>>>> {"instructions", 0x00c0, 10*N, 10.2*N},
>>>> + {"core cycles", 0x003c, 1*N, 50*N},
>>>> {"ref cycles", 0x013c, 1*N, 30*N},
>>>> {"llc references", 0x4f2e, 1, 2*N},
>>>> {"llc misses", 0x412e, 1, 1*N},
>>>> {"branches", 0x00c4, 1*N, 1.1*N},
>>>> {"branch misses", 0x00c5, 0, 0.1*N},
>>>> }, amd_gp_events[] = {
>>>> - {"core cycles", 0x0076, 1*N, 50*N},
>>>> {"instructions", 0x00c0, 10*N, 10.2*N},
>>>> + {"core cycles", 0x0076, 1*N, 50*N},
>>>> {"branches", 0x00c2, 1*N, 1.1*N},
>>>> {"branch misses", 0x00c3, 0, 0.1*N},
>>>> }, fixed_events[] = {
>>>> @@ -307,7 +307,7 @@ static void check_counter_overflow(void)
>>>> int i;
>>>> pmu_counter_t cnt = {
>>>> .ctr = MSR_GP_COUNTERx(0),
>>>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
>>>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
>>>> };
>>>> overflow_preset = measure_for_overflow(&cnt);
>>>> @@ -365,11 +365,11 @@ static void check_gp_counter_cmask(void)
>>>> {
>>>> pmu_counter_t cnt = {
>>>> .ctr = MSR_GP_COUNTERx(0),
>>>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
>>>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel /* instructions */,
>>>> };
>>>> cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
>>>> measure_one(&cnt);
>>>> - report(cnt.count < gp_events[1].min, "cmask");
>>>> + report(cnt.count < gp_events[0].min, "cmask");
>>>> }
>>>> static void do_rdpmc_fast(void *ptr)
>>>> @@ -446,7 +446,7 @@ static void check_running_counter_wrmsr(void)
>>>> uint64_t count;
>>>> pmu_counter_t evt = {
>>>> .ctr = MSR_GP_COUNTERx(0),
>>>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
>>>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
>>>> };
>>>> report_prefix_push("running counter wrmsr");
>>>> @@ -455,7 +455,7 @@ static void check_running_counter_wrmsr(void)
>>>> loop();
>>>> wrmsr(MSR_GP_COUNTERx(0), 0);
>>>> stop_event(&evt);
>>>> - report(evt.count < gp_events[1].min, "cntr");
>>>> + report(evt.count < gp_events[0].min, "cntr");
>>>> /* clear status before overflow test */
>>>> if (this_cpu_has_perf_global_status())
>>>> @@ -493,7 +493,7 @@ static void check_emulated_instr(void)
>>>> pmu_counter_t instr_cnt = {
>>>> .ctr = MSR_GP_COUNTERx(1),
>>>> /* instructions */
>>>> - .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
>>>> + .config = EVNTSEL_OS | EVNTSEL_USR | gp_events[0].unit_sel,
>>>> };
>>>> report_prefix_push("emulated instruction");
>>>> --
>>>> 2.34.1
>>>>

2024-04-08 23:22:01

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 07/11] x86: pmu: Enable and disable PMCs in loop() asm blob

On Wed, Mar 27, 2024, Mi, Dapeng wrote:
>
> On 3/27/2024 2:07 PM, Mingwei Zhang wrote:
> > On Wed, Jan 03, 2024, Dapeng Mi wrote:
> > > Currently enabling PMCs, executing loop() and disabling PMCs are divided
> > > 3 separated functions. So there could be other instructions executed
> > > between enabling PMCS and running loop() or running loop() and disabling
> > > PMCs, e.g. if there are multiple counters enabled in measure_many()
> > > function, the instructions which enabling the 2nd and more counters
> > > would be counted in by the 1st counter.
> > >
> > > So current implementation can only verify the correctness of count by an
> > > rough range rather than a precise count even for instructions and
> > > branches events. Strictly speaking, this verification is meaningless as
> > > the test could still pass even though KVM vPMU has something wrong and
> > > reports an incorrect instructions or branches count which is in the rough
> > > range.
> > >
> > > Thus, move the PMCs enabling and disabling into the loop() asm blob and
> > > ensure only the loop asm instructions would be counted, then the
> > > instructions or branches events can be verified with an precise count
> > > instead of an rough range.
> > >
> > > Signed-off-by: Dapeng Mi <[email protected]>
> > > ---
> > > x86/pmu.c | 83 +++++++++++++++++++++++++++++++++++++++++++++----------
> > > 1 file changed, 69 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/x86/pmu.c b/x86/pmu.c
> > > index 46bed66c5c9f..88b89ad889b9 100644
> > > --- a/x86/pmu.c
> > > +++ b/x86/pmu.c
> > > @@ -18,6 +18,20 @@
> > > #define EXPECTED_INSTR 17
> > > #define EXPECTED_BRNCH 5
> > > +// Instrustion number of LOOP_ASM code
> > > +#define LOOP_INSTRNS 10
> > > +#define LOOP_ASM \
> > > + "1: mov (%1), %2; add $64, %1;\n\t" \
> > > + "nop; nop; nop; nop; nop; nop; nop;\n\t" \
> > > + "loop 1b;\n\t"
> > > +
> > > +#define PRECISE_LOOP_ASM \
> > > + "wrmsr;\n\t" \
> > > + "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
> > > + LOOP_ASM \
> > > + "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
> > > + "wrmsr;\n\t"
> > Can we add "FEP" prefix into the above blob? This way, we can expand the
> > testing for emulated instructions.
Dapeng,

Sorry, I did not clarify that this is not a hard request. I am not
pushing for this to be done in your next version if it takes time to do
so. (FEP is of course nice to have :), but this test already supports it
somewhere else.)

Once your next version is ready, please send it out as soon as you can
and I am happy to give my reviews until it is merged.

Thanks.
-Mingwei
>
>
> Yeah, that sounds like a new feature request. I would add it in next
> version.
>
>
> > > +
> > > typedef struct {
> > > uint32_t ctr;
> > > uint64_t config;
> > > @@ -54,13 +68,43 @@ char *buf;
> > > static struct pmu_event *gp_events;
> > > static unsigned int gp_events_size;
> > > -static inline void loop(void)
> > > +
> > > +static inline void __loop(void)
> > > +{
> > > + unsigned long tmp, tmp2, tmp3;
> > > +
> > > + asm volatile(LOOP_ASM
> > > + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
> > > + : "0"(N), "1"(buf));
> > > +}
> > > +
> > > +/*
> > > + * Enable and disable counters in a whole asm blob to ensure
> > > + * no other instructions are counted in the time slot between
> > > + * counters enabling and really LOOP_ASM code executing.
> > > + * Thus counters can verify instructions and branches events
> > > + * against precise counts instead of a rough valid count range.
> > > + */
> > > +static inline void __precise_count_loop(u64 cntrs)
> > > {
> > > unsigned long tmp, tmp2, tmp3;
> > > + unsigned int global_ctl = pmu.msr_global_ctl;
> > > + u32 eax = cntrs & (BIT_ULL(32) - 1);
> > > + u32 edx = cntrs >> 32;
> > > - asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
> > > - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
> > > + asm volatile(PRECISE_LOOP_ASM
> > > + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
> > > + : "a"(eax), "d"(edx), "c"(global_ctl),
> > > + "0"(N), "1"(buf)
> > > + : "edi");
> > > +}
> > > +static inline void loop(u64 cntrs)
> > > +{
> > > + if (!this_cpu_has_perf_global_ctrl())
> > > + __loop();
> > > + else
> > > + __precise_count_loop(cntrs);
> > > }
> > > volatile uint64_t irq_received;
> > > @@ -159,18 +203,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
> > > ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
> > > wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
> > > }
> > > - global_enable(evt);
> > > apic_write(APIC_LVTPC, PMI_VECTOR);
> > > }
> > > static void start_event(pmu_counter_t *evt)
> > > {
> > > __start_event(evt, 0);
> > > + global_enable(evt);
> > > }
> > > -static void stop_event(pmu_counter_t *evt)
> > > +static void __stop_event(pmu_counter_t *evt)
> > > {
> > > - global_disable(evt);
> > > if (is_gp(evt)) {
> > > wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
> > > evt->config & ~EVNTSEL_EN);
> > > @@ -182,14 +225,24 @@ static void stop_event(pmu_counter_t *evt)
> > > evt->count = rdmsr(evt->ctr);
> > > }
> > > +static void stop_event(pmu_counter_t *evt)
> > > +{
> > > + global_disable(evt);
> > > + __stop_event(evt);
> > > +}
> > > +
> > > static noinline void measure_many(pmu_counter_t *evt, int count)
> > > {
> > > int i;
> > > + u64 cntrs = 0;
> > > +
> > > + for (i = 0; i < count; i++) {
> > > + __start_event(&evt[i], 0);
> > > + cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
> > > + }
> > > + loop(cntrs);
> > > for (i = 0; i < count; i++)
> > > - start_event(&evt[i]);
> > > - loop();
> > > - for (i = 0; i < count; i++)
> > > - stop_event(&evt[i]);
> > > + __stop_event(&evt[i]);
> > > }
> > > static void measure_one(pmu_counter_t *evt)
> > > @@ -199,9 +252,11 @@ static void measure_one(pmu_counter_t *evt)
> > > static noinline void __measure(pmu_counter_t *evt, uint64_t count)
> > > {
> > > + u64 cntrs = BIT_ULL(event_to_global_idx(evt));
> > > +
> > > __start_event(evt, count);
> > > - loop();
> > > - stop_event(evt);
> > > + loop(cntrs);
> > > + __stop_event(evt);
> > > }
> > > static bool verify_event(uint64_t count, struct pmu_event *e)
> > > @@ -451,7 +506,7 @@ static void check_running_counter_wrmsr(void)
> > > report_prefix_push("running counter wrmsr");
> > > start_event(&evt);
> > > - loop();
> > > + __loop();
> > > wrmsr(MSR_GP_COUNTERx(0), 0);
> > > stop_event(&evt);
> > > report(evt.count < gp_events[0].min, "cntr");
> > > @@ -468,7 +523,7 @@ static void check_running_counter_wrmsr(void)
> > > wrmsr(MSR_GP_COUNTERx(0), count);
> > > - loop();
> > > + __loop();
> > > stop_event(&evt);
> > > if (this_cpu_has_perf_global_status()) {
> > > --
> > > 2.34.1
> > >

2024-04-09 03:34:57

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [kvm-unit-tests Patch v3 07/11] x86: pmu: Enable and disable PMCs in loop() asm blob


On 4/9/2024 7:17 AM, Mingwei Zhang wrote:
> On Wed, Mar 27, 2024, Mi, Dapeng wrote:
>> On 3/27/2024 2:07 PM, Mingwei Zhang wrote:
>>> On Wed, Jan 03, 2024, Dapeng Mi wrote:
>>>> Currently enabling PMCs, executing loop() and disabling PMCs are divided
>>>> 3 separated functions. So there could be other instructions executed
>>>> between enabling PMCS and running loop() or running loop() and disabling
>>>> PMCs, e.g. if there are multiple counters enabled in measure_many()
>>>> function, the instructions which enabling the 2nd and more counters
>>>> would be counted in by the 1st counter.
>>>>
>>>> So current implementation can only verify the correctness of count by an
>>>> rough range rather than a precise count even for instructions and
>>>> branches events. Strictly speaking, this verification is meaningless as
>>>> the test could still pass even though KVM vPMU has something wrong and
>>>> reports an incorrect instructions or branches count which is in the rough
>>>> range.
>>>>
>>>> Thus, move the PMCs enabling and disabling into the loop() asm blob and
>>>> ensure only the loop asm instructions would be counted, then the
>>>> instructions or branches events can be verified with an precise count
>>>> instead of an rough range.
>>>>
>>>> Signed-off-by: Dapeng Mi <[email protected]>
>>>> ---
>>>> x86/pmu.c | 83 +++++++++++++++++++++++++++++++++++++++++++++----------
>>>> 1 file changed, 69 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/x86/pmu.c b/x86/pmu.c
>>>> index 46bed66c5c9f..88b89ad889b9 100644
>>>> --- a/x86/pmu.c
>>>> +++ b/x86/pmu.c
>>>> @@ -18,6 +18,20 @@
>>>> #define EXPECTED_INSTR 17
>>>> #define EXPECTED_BRNCH 5
>>>> +// Instrustion number of LOOP_ASM code
>>>> +#define LOOP_INSTRNS 10
>>>> +#define LOOP_ASM \
>>>> + "1: mov (%1), %2; add $64, %1;\n\t" \
>>>> + "nop; nop; nop; nop; nop; nop; nop;\n\t" \
>>>> + "loop 1b;\n\t"
>>>> +
>>>> +#define PRECISE_LOOP_ASM \
>>>> + "wrmsr;\n\t" \
>>>> + "mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t" \
>>>> + LOOP_ASM \
>>>> + "mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t" \
>>>> + "wrmsr;\n\t"
>>> Can we add "FEP" prefix into the above blob? This way, we can expand the
>>> testing for emulated instructions.
> Dapeng,
>
> Sorry, did not clarify that this is not a hard request. I am not
> pushing that this need to be done in your next version if it takes
> time to do so. (FEP is of couse nice to have :), but this test already
> supports it in somewhere else.).
>
> Once your next version is ready, please send it out as soon as you can
> and I am happy to give my reviews until it is merged.
>
> Thanks.
> -Mingwei

Yeah, I see there are some FEP-related test cases in this test; I'm not
sure whether they already meet the requirement, so I would look at it
later. I'm currently busy with some high-priority work, but I expect to
have the bandwidth to refresh a new version next week. Thanks.


>>
>> Yeah, that sounds like a new feature request. I would add it in next
>> version.
>>
>>
>>>> +
>>>> typedef struct {
>>>> uint32_t ctr;
>>>> uint64_t config;
>>>> @@ -54,13 +68,43 @@ char *buf;
>>>> static struct pmu_event *gp_events;
>>>> static unsigned int gp_events_size;
>>>> -static inline void loop(void)
>>>> +
>>>> +static inline void __loop(void)
>>>> +{
>>>> + unsigned long tmp, tmp2, tmp3;
>>>> +
>>>> + asm volatile(LOOP_ASM
>>>> + : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
>>>> + : "0"(N), "1"(buf));
>>>> +}
>>>> +
>>>> +/*
>>>> + * Enable and disable counters in a whole asm blob to ensure
>>>> + * no other instructions are counted in the time slot between
>>>> + * counters enabling and really LOOP_ASM code executing.
>>>> + * Thus counters can verify instructions and branches events
>>>> + * against precise counts instead of a rough valid count range.
>>>> + */
>>>> +static inline void __precise_count_loop(u64 cntrs)
>>>> {
>>>> unsigned long tmp, tmp2, tmp3;
>>>> + unsigned int global_ctl = pmu.msr_global_ctl;
>>>> + u32 eax = cntrs & (BIT_ULL(32) - 1);
>>>> + u32 edx = cntrs >> 32;
>>>> - asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
>>>> - : "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
>>>> + asm volatile(PRECISE_LOOP_ASM
>>>> + : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
>>>> + : "a"(eax), "d"(edx), "c"(global_ctl),
>>>> + "0"(N), "1"(buf)
>>>> + : "edi");
>>>> +}
>>>> +static inline void loop(u64 cntrs)
>>>> +{
>>>> + if (!this_cpu_has_perf_global_ctrl())
>>>> + __loop();
>>>> + else
>>>> + __precise_count_loop(cntrs);
>>>> }
>>>> volatile uint64_t irq_received;
>>>> @@ -159,18 +203,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
>>>> ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
>>>> wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
>>>> }
>>>> - global_enable(evt);
>>>> apic_write(APIC_LVTPC, PMI_VECTOR);
>>>> }
>>>> static void start_event(pmu_counter_t *evt)
>>>> {
>>>> __start_event(evt, 0);
>>>> + global_enable(evt);
>>>> }
>>>> -static void stop_event(pmu_counter_t *evt)
>>>> +static void __stop_event(pmu_counter_t *evt)
>>>> {
>>>> - global_disable(evt);
>>>> if (is_gp(evt)) {
>>>> wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
>>>> evt->config & ~EVNTSEL_EN);
>>>> @@ -182,14 +225,24 @@ static void stop_event(pmu_counter_t *evt)
>>>> evt->count = rdmsr(evt->ctr);
>>>> }
>>>> +static void stop_event(pmu_counter_t *evt)
>>>> +{
>>>> + global_disable(evt);
>>>> + __stop_event(evt);
>>>> +}
>>>> +
>>>> static noinline void measure_many(pmu_counter_t *evt, int count)
>>>> {
>>>> int i;
>>>> + u64 cntrs = 0;
>>>> +
>>>> + for (i = 0; i < count; i++) {
>>>> + __start_event(&evt[i], 0);
>>>> + cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
>>>> + }
>>>> + loop(cntrs);
>>>> for (i = 0; i < count; i++)
>>>> - start_event(&evt[i]);
>>>> - loop();
>>>> - for (i = 0; i < count; i++)
>>>> - stop_event(&evt[i]);
>>>> + __stop_event(&evt[i]);
>>>> }
>>>> static void measure_one(pmu_counter_t *evt)
>>>> @@ -199,9 +252,11 @@ static void measure_one(pmu_counter_t *evt)
>>>> static noinline void __measure(pmu_counter_t *evt, uint64_t count)
>>>> {
>>>> + u64 cntrs = BIT_ULL(event_to_global_idx(evt));
>>>> +
>>>> __start_event(evt, count);
>>>> - loop();
>>>> - stop_event(evt);
>>>> + loop(cntrs);
>>>> + __stop_event(evt);
>>>> }
>>>> static bool verify_event(uint64_t count, struct pmu_event *e)
>>>> @@ -451,7 +506,7 @@ static void check_running_counter_wrmsr(void)
>>>> report_prefix_push("running counter wrmsr");
>>>> start_event(&evt);
>>>> - loop();
>>>> + __loop();
>>>> wrmsr(MSR_GP_COUNTERx(0), 0);
>>>> stop_event(&evt);
>>>> report(evt.count < gp_events[0].min, "cntr");
>>>> @@ -468,7 +523,7 @@ static void check_running_counter_wrmsr(void)
>>>> wrmsr(MSR_GP_COUNTERx(0), count);
>>>> - loop();
>>>> + __loop();
>>>> stop_event(&evt);
>>>> if (this_cpu_has_perf_global_status()) {
>>>> --
>>>> 2.34.1
>>>>