2023-10-04 18:38:30

by Tony Luck

Subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

This is the core functionality to track CMCI storms at the
machine check bank granularity. Subsequent patches will add
the vendor specific hooks to supply input to the storm
detection and take actions on the start/end of a storm.

Maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled.

In normal operation the interval between polls of these banks
determines how far to shift the history. Each history bit represents
roughly one second, so the 64 bit width covers just over a minute.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank
is added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the history
covers just over a minute.

Declare a storm for that bank if the number of corrected errors
seen in that history is above some threshold (defined as 5 in this
series; it could be tuned later if there is data to suggest a better
value).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (defined as 30, which may also change). That calls
the CPU vendor specific function to revert to normal operational mode
and changes the polling rate back to the default.

Signed-off-by: Tony Luck <[email protected]>

---
v8: Updated with review comments from Yazen.
Link: https://lore.kernel.org/r/[email protected]
v9: Fix lkp randconfig build issue when neither INTEL nor AMD configured
---
arch/x86/kernel/cpu/mce/internal.h | 40 +++++++++-
arch/x86/kernel/cpu/mce/core.c | 25 +++++--
arch/x86/kernel/cpu/mce/threshold.c | 112 ++++++++++++++++++++++++++++
3 files changed, 168 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 616732ec16f7..4aa957442c8d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -54,7 +54,45 @@ static inline void intel_clear_lmce(void) { }
static inline bool intel_filter_mce(struct mce *m) { return false; }
#endif

-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_is_storm(void);
+void mce_set_storm(bool storm);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_is_storm(void) { return false; }
+static inline void mce_set_storm(bool storm) {}
+#endif
+
+/*
+ * history: bitmask tracking whether errors were seen or not seen in
+ * the most recent polls of a bank.
+ * timestamp: last time (in jiffies) that the bank was polled
+ * storm: Is this bank in storm mode?
+ */
+struct storm_bank {
+	u64 history;
+	u64 timestamp;
+	bool storm;
+};
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of a machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmask in the history
+ * field of struct storm_bank the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD 29

#ifdef CONFIG_ACPI_APEI
int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f6e87443b37a..7c931f0c9251 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		barrier();
 		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
 
+		mce_track_storm(&m);
+
 		/* If this entry is not valid, ignore it */
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
@@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
 	else
 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
 
-	__this_cpu_write(mce_next_interval, iv);
-	__start_timer(t, iv);
+	if (mce_is_storm()) {
+		__start_timer(t, HZ);
+	} else {
+		__this_cpu_write(mce_next_interval, iv);
+		__start_timer(t, iv);
+	}
 }

 /*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
  */
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long iv = __this_cpu_read(mce_next_interval);
 
-	__start_timer(t, interval);
+	mce_set_storm(storm);
 
-	if (interval < iv)
-		__this_cpu_write(mce_next_interval, interval);
+	if (storm)
+		__start_timer(t, HZ);
+	else
+		__this_cpu_write(mce_next_interval, check_interval * HZ);
 }

/* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ef4e7bb5fd88..ecdf13f1bb7d 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
apic_eoi();
}
+
+/*
+ * banks: per-cpu, per-bank details
+ * stormy_bank_count: count of MC banks in storm state
+ * poll_mode: CPU is in poll mode
+ */
+struct mca_storm_desc {
+	struct storm_bank banks[MAX_NR_BANKS];
+	u8 stormy_bank_count;
+	bool poll_mode;
+};
+
+static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	storm->banks[bank].history = ~0ull;
+	storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_is_storm(void)
+{
+	return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm(bool storm)
+{
+	__this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+	switch (boot_cpu_data.x86_vendor) {
+	}
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__set_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].storm = true;
+
+	/*
+	 * If this is the first bank on this CPU to enter storm mode
+	 * start polling.
+	 */
+	if (++storm->stormy_bank_count == 1)
+		mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].history = 0;
+	storm->banks[bank].storm = false;
+
+	/* If no banks left in storm mode, stop polling. */
+	if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+		mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+	unsigned long now = jiffies, delta;
+	unsigned int shift = 1;
+	u64 history = 0;
+
+	/*
+	 * When a bank is in storm mode it is polled once per second and
+	 * the history mask will record about the last minute of poll results.
+	 * If it is not in storm mode, then the bank is only checked when
+	 * there is a CMCI interrupt. Check how long it has been since
+	 * this bank was last checked, and adjust the amount of "shift"
+	 * to apply to history.
+	 */
+	if (!storm->banks[mce->bank].storm) {
+		delta = now - storm->banks[mce->bank].timestamp;
+		shift = (delta + HZ) / HZ;
+	}
+
+	/* If it has been a long time since the last poll, clear history. */
+	if (shift < 64)
+		history = storm->banks[mce->bank].history << shift;
+
+	storm->banks[mce->bank].timestamp = now;
+
+	/* History keeps track of corrected errors. VAL=1 && UC=0 */
+	if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+		history |= 1;
+
+	storm->banks[mce->bank].history = history;
+
+	if (storm->banks[mce->bank].storm) {
+		if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, false);
+		cmci_storm_end(mce->bank);
+	} else {
+		if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, true);
+		cmci_storm_begin(mce->bank);
+	}
+}
--
2.41.0


2023-10-11 09:13:57

by kernel test robot

Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation



Hello,

kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:


commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

nr_threads: 10%
disk: 1HDD
testtime: 60s
fs: ext4
class: os
test: clock
cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231011/[email protected]

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/10%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp7/clock/stress-ng/60s

commit:
cbb2ef4fc1 ("x86/mce: Remove old CMCI storm mitigation code")
26bff7b04b ("x86/mce: Add per-bank CMCI storm mitigation")

cbb2ef4fc1f946b3 26bff7b04b829cccc6a97726d63
---------------- ---------------------------
%stddev %change %stddev
\ | \
118898 ? 7% -18.5% 96883 ? 8% meminfo.Mapped
2971 ? 7% +18.3% 3513 ? 9% perf-c2c.HITM.local
0.01 -20.0% 0.00 perf-sched.wait_time.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep
3865971 ? 4% -23.2% 2970833 ? 2% turbostat.C1
1.48 ? 6% -0.3 1.22 ? 4% turbostat.C1%
292422 -9.4% 264956 vmstat.system.cs
219398 -7.4% 203113 vmstat.system.in
4715828 -8.8% 4301560 stress-ng.clock.ops
78597 -8.8% 71692 stress-ng.clock.ops_per_sec
305.30 +1.7% 310.35 stress-ng.time.system_time
9425269 -8.8% 8596889 stress-ng.time.voluntary_context_switches
157229 ? 2% -5.1% 149224 proc-vmstat.nr_active_anon
904759 -1.7% 889679 proc-vmstat.nr_file_pages
123722 ? 2% -6.1% 116162 ? 2% proc-vmstat.nr_inactive_anon
29778 ? 7% -18.5% 24260 ? 8% proc-vmstat.nr_mapped
194663 -7.8% 179535 proc-vmstat.nr_shmem
157229 ? 2% -5.1% 149224 proc-vmstat.nr_zone_active_anon
123722 ? 2% -6.1% 116162 ? 2% proc-vmstat.nr_zone_inactive_anon
844968 -5.5% 798662 proc-vmstat.numa_hit
778755 -5.9% 732457 ? 2% proc-vmstat.numa_local
259547 -5.7% 244756 proc-vmstat.pgactivate
1060836 -7.1% 986009 ? 2% proc-vmstat.pgalloc_normal
755630 ? 2% -7.0% 702996 ? 4% proc-vmstat.pgfree
2.70 -17.4% 2.23 perf-stat.i.MPKI
1.428e+09 -5.8% 1.345e+09 perf-stat.i.branch-instructions
0.99 +0.1 1.09 ? 2% perf-stat.i.branch-miss-rate%
35.43 -7.4 28.07 perf-stat.i.cache-miss-rate%
19762268 -22.2% 15365818 perf-stat.i.cache-misses
55484296 -1.5% 54670651 perf-stat.i.cache-references
305445 -8.7% 278849 perf-stat.i.context-switches
2.88 +6.9% 3.08 perf-stat.i.cpi
400.01 ? 3% -6.5% 373.89 ? 3% perf-stat.i.cpu-migrations
1090 +27.2% 1387 perf-stat.i.cycles-between-cache-misses
1.989e+09 -6.6% 1.857e+09 perf-stat.i.dTLB-loads
1.153e+09 -7.6% 1.066e+09 perf-stat.i.dTLB-stores
7.703e+09 -6.3% 7.221e+09 perf-stat.i.instructions
0.38 -5.8% 0.36 perf-stat.i.ipc
1124 -6.3% 1053 perf-stat.i.metric.K/sec
71.46 -6.6% 66.71 perf-stat.i.metric.M/sec
3530238 ? 3% +48.1% 5229217 ? 3% perf-stat.i.node-load-misses
90.35 -5.3 85.09 perf-stat.i.node-store-miss-rate%
14871425 ? 2% -40.5% 8846683 perf-stat.i.node-store-misses
1393999 ? 2% +10.0% 1533309 perf-stat.i.node-stores
2.57 -17.0% 2.13 perf-stat.overall.MPKI
1.03 ? 2% +0.1 1.14 ? 3% perf-stat.overall.branch-miss-rate%
35.62 -7.5 28.11 perf-stat.overall.cache-miss-rate%
2.73 +6.4% 2.90 perf-stat.overall.cpi
1063 +28.2% 1363 perf-stat.overall.cycles-between-cache-misses
0.37 -6.0% 0.34 perf-stat.overall.ipc
91.42 -6.2 85.23 perf-stat.overall.node-store-miss-rate%
1.405e+09 -5.8% 1.323e+09 perf-stat.ps.branch-instructions
19444141 -22.2% 15118353 perf-stat.ps.cache-misses
54591369 -1.5% 53789356 perf-stat.ps.cache-references
300511 -8.7% 274335 perf-stat.ps.context-switches
393.53 ? 3% -6.5% 367.80 ? 3% perf-stat.ps.cpu-migrations
1.957e+09 -6.6% 1.827e+09 perf-stat.ps.dTLB-loads
1.135e+09 -7.6% 1.049e+09 perf-stat.ps.dTLB-stores
7.58e+09 -6.3% 7.105e+09 perf-stat.ps.instructions
3473672 ? 3% +48.1% 5144934 ? 3% perf-stat.ps.node-load-misses
14630992 ? 2% -40.5% 8703485 perf-stat.ps.node-store-misses
1371912 ? 2% +10.0% 1508626 perf-stat.ps.node-stores
4.721e+11 -5.7% 4.45e+11 perf-stat.total.instructions
13.27 -1.8 11.46 ? 2% perf-profile.calltrace.cycles-pp.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
7.00 ? 3% -1.3 5.74 ? 2% perf-profile.calltrace.cycles-pp.syscall
5.50 ? 2% -1.2 4.26 ? 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
5.25 ? 2% -1.2 4.02 ? 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
4.27 ? 2% -1.1 3.13 ? 9% perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
14.93 -1.1 13.82 ? 2% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
17.13 -1.1 16.06 ? 2% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
14.04 -1.0 13.00 ? 2% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
14.22 -1.0 13.18 ? 2% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
16.74 -1.0 15.77 ? 2% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
16.78 -1.0 15.80 ? 2% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
16.77 -1.0 15.80 ? 2% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
3.29 ? 7% -0.8 2.51 ? 13% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
2.50 ? 7% -0.6 1.94 ? 6% perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
0.68 ? 3% -0.4 0.26 ?100% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
2.09 -0.4 1.67 ? 2% perf-profile.calltrace.cycles-pp.timekeeping_update.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
4.87 -0.4 4.49 ? 4% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
4.70 -0.4 4.31 ? 4% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
1.99 ? 7% -0.4 1.62 ? 6% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
1.78 ? 7% -0.3 1.44 ? 6% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter
1.40 ? 8% -0.3 1.08 ? 6% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state
1.40 ? 8% -0.3 1.08 ? 6% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq
0.56 ? 4% -0.3 0.26 ?100% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
3.07 ? 3% -0.3 2.80 ? 5% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
3.08 ? 3% -0.3 2.81 ? 5% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
1.74 ? 6% -0.2 1.51 ? 6% perf-profile.calltrace.cycles-pp.timer_settime
2.12 ? 2% -0.2 1.90 ? 5% perf-profile.calltrace.cycles-pp.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.95 -0.2 0.73 ? 4% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
1.58 ? 6% -0.2 1.36 ? 7% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
0.71 ? 7% -0.2 0.50 ? 45% perf-profile.calltrace.cycles-pp.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
1.60 ? 6% -0.2 1.39 ? 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.timer_settime
1.55 ? 7% -0.2 1.33 ? 6% perf-profile.calltrace.cycles-pp.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
0.70 ? 7% -0.2 0.49 ? 46% perf-profile.calltrace.cycles-pp.ktime_get_raw_ts64.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.34 ? 8% -0.2 1.14 ? 7% perf-profile.calltrace.cycles-pp.do_timer_settime.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
2.84 -0.2 2.64 ? 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
2.79 ? 2% -0.2 2.61 ? 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
2.30 -0.2 2.12 perf-profile.calltrace.cycles-pp.open64
2.24 -0.2 2.07 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
2.25 -0.2 2.08 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64
2.23 -0.2 2.06 perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
2.22 -0.2 2.06 perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
1.36 ? 4% -0.2 1.20 ? 18% perf-profile.calltrace.cycles-pp.__entry_text_start.syscall
1.79 ? 4% -0.2 1.63 ? 6% perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
2.70 ? 2% -0.2 2.54 ? 4% perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
1.12 ? 3% -0.1 0.97 ? 6% perf-profile.calltrace.cycles-pp.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
0.60 ? 4% -0.1 0.46 ? 44% perf-profile.calltrace.cycles-pp.__close
2.02 ? 2% -0.1 1.88 perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.00 ? 2% -0.1 1.86 perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
0.95 ? 3% -0.1 0.82 ? 6% perf-profile.calltrace.cycles-pp.blake2s_compress.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime
1.26 ? 2% -0.1 1.13 ? 3% perf-profile.calltrace.cycles-pp.do_open.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
1.07 ? 3% -0.1 0.98 ? 5% perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.08 ? 3% -0.1 0.99 ? 5% perf-profile.calltrace.cycles-pp.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
0.86 ? 2% -0.1 0.78 ? 6% perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
1.01 ? 3% -0.1 0.93 ? 6% perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64
0.96 -0.1 0.88 ? 2% perf-profile.calltrace.cycles-pp.read_tsc.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex
0.69 ? 4% -0.1 0.62 ? 3% perf-profile.calltrace.cycles-pp.chrdev_open.do_dentry_open.do_open.path_openat.do_filp_open
0.98 ? 4% -0.1 0.91 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
0.86 ? 4% -0.1 0.80 ? 3% perf-profile.calltrace.cycles-pp.do_dentry_open.do_open.path_openat.do_filp_open.do_sys_openat2
0.96 ? 3% +0.1 1.09 ? 5% perf-profile.calltrace.cycles-pp.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
0.87 ? 6% +0.1 1.00 ? 5% perf-profile.calltrace.cycles-pp.posix_timer_unhash_and_free.__x64_sys_timer_delete.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_delete
6.08 ? 2% +0.4 6.44 ? 2% perf-profile.calltrace.cycles-pp.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.27 ? 2% +0.4 6.66 ? 2% perf-profile.calltrace.cycles-pp.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
6.39 ? 2% +0.4 6.78 ? 2% perf-profile.calltrace.cycles-pp.clock_gettime
6.32 ? 2% +0.4 6.72 ? 2% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_gettime
6.32 ? 2% +0.4 6.71 ? 2% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
6.30 ? 2% +0.4 6.70 ? 2% perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
0.08 ?223% +0.5 0.55 ? 5% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary
2.58 ? 7% +0.5 3.13 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64
2.30 ? 7% +0.6 2.89 ? 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime
53.43 +2.9 56.32 perf-profile.calltrace.cycles-pp.clock_adjtime
53.05 +2.9 55.98 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_adjtime
52.98 +2.9 55.92 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
52.81 +2.9 55.75 perf-profile.calltrace.cycles-pp.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
51.37 +3.1 54.49 perf-profile.calltrace.cycles-pp.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
47.33 +3.4 50.75 perf-profile.calltrace.cycles-pp.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
43.22 +4.3 47.50 perf-profile.calltrace.cycles-pp.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
4.17 ? 13% +4.7 8.85 ? 4% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex
21.83 +5.8 27.67 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
14.07 -1.8 12.26 ? 2% perf-profile.children.cycles-pp.ktime_get_update_offsets_now
7.34 -1.4 5.98 ? 4% perf-profile.children.cycles-pp.syscall
15.24 -1.2 14.06 ? 2% perf-profile.children.cycles-pp.cpuidle_idle_call
14.48 -1.1 13.37 ? 2% perf-profile.children.cycles-pp.cpuidle_enter_state
14.51 -1.1 13.40 ? 2% perf-profile.children.cycles-pp.cpuidle_enter
17.11 -1.1 16.04 ? 2% perf-profile.children.cycles-pp.do_idle
17.13 -1.1 16.06 ? 2% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
17.13 -1.1 16.06 ? 2% perf-profile.children.cycles-pp.cpu_startup_entry
16.78 -1.0 15.80 ? 2% perf-profile.children.cycles-pp.start_secondary
10.61 ? 2% -0.7 9.86 ? 3% perf-profile.children.cycles-pp.__x64_sys_clock_gettime
9.62 -0.7 8.90 ? 3% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
8.93 -0.7 8.26 ? 3% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
2.76 ? 6% -0.6 2.14 ? 6% perf-profile.children.cycles-pp.intel_idle_irq
6.40 -0.6 5.81 ? 3% perf-profile.children.cycles-pp.hrtimer_interrupt
6.41 -0.6 5.83 ? 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
4.11 ? 2% -0.5 3.60 ? 4% perf-profile.children.cycles-pp.__hrtimer_run_queues
1.09 ? 4% -0.5 0.62 ? 8% perf-profile.children.cycles-pp.ktime_get_real_ts64
0.88 ? 6% -0.4 0.45 ? 14% perf-profile.children.cycles-pp.posix_get_realtime_timespec
2.09 -0.4 1.67 ? 2% perf-profile.children.cycles-pp.timekeeping_update
2.12 ? 2% -0.2 1.90 ? 5% perf-profile.children.cycles-pp.add_device_randomness
1.78 ? 6% -0.2 1.56 ? 5% perf-profile.children.cycles-pp.timer_settime
1.55 ? 7% -0.2 1.34 ? 6% perf-profile.children.cycles-pp.__x64_sys_timer_settime
1.35 ? 8% -0.2 1.14 ? 7% perf-profile.children.cycles-pp.do_timer_settime
1.19 ? 3% -0.2 0.99 ? 5% perf-profile.children.cycles-pp.try_to_wake_up
1.19 ? 3% -0.2 0.99 ? 4% perf-profile.children.cycles-pp.hrtimer_wakeup
0.77 ? 4% -0.2 0.58 ? 8% perf-profile.children.cycles-pp.ktime_get_with_offset
0.52 ? 7% -0.2 0.33 ? 17% perf-profile.children.cycles-pp.posix_get_monotonic_timespec
0.51 ? 6% -0.2 0.32 ? 19% perf-profile.children.cycles-pp.ktime_get_ts64
2.25 -0.2 2.08 perf-profile.children.cycles-pp.do_sys_openat2
2.31 -0.2 2.14 perf-profile.children.cycles-pp.open64
2.26 -0.2 2.08 perf-profile.children.cycles-pp.__x64_sys_openat
2.10 ? 3% -0.2 1.92 ? 5% perf-profile.children.cycles-pp.tick_sched_timer
2.71 ? 2% -0.2 2.54 ? 4% perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
2.55 -0.2 2.39 ? 4% perf-profile.children.cycles-pp.hrtimer_nanosleep
2.45 ? 2% -0.2 2.30 ? 5% perf-profile.children.cycles-pp.do_nanosleep
1.12 ? 3% -0.1 0.98 ? 6% perf-profile.children.cycles-pp.blake2s_update
0.36 ? 6% -0.1 0.21 ? 17% perf-profile.children.cycles-pp.smp_call_function_many_cond
0.39 ? 3% -0.1 0.24 ? 4% perf-profile.children.cycles-pp.update_vsyscall
0.72 ? 7% -0.1 0.57 ? 13% perf-profile.children.cycles-pp.posix_get_monotonic_raw
0.53 -0.1 0.38 ? 8% perf-profile.children.cycles-pp.posix_get_tai_timespec
0.70 ? 7% -0.1 0.56 ? 14% perf-profile.children.cycles-pp.ktime_get_raw_ts64
2.03 ? 2% -0.1 1.89 perf-profile.children.cycles-pp.path_openat
2.04 ? 2% -0.1 1.91 perf-profile.children.cycles-pp.do_filp_open
0.95 ? 3% -0.1 0.82 ? 6% perf-profile.children.cycles-pp.blake2s_compress
1.50 ? 2% -0.1 1.37 ? 2% perf-profile.children.cycles-pp.read_tsc
1.26 ? 3% -0.1 1.14 ? 3% perf-profile.children.cycles-pp.do_open
1.38 ? 2% -0.1 1.26 ? 6% perf-profile.children.cycles-pp.hrtimer_start_range_ns
0.54 ? 4% -0.1 0.43 ? 3% perf-profile.children.cycles-pp.ttwu_do_activate
0.26 ? 8% -0.1 0.16 ? 21% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
0.25 ? 6% -0.1 0.14 ? 21% perf-profile.children.cycles-pp.posix_get_realtime_coarse
1.47 ? 2% -0.1 1.37 ? 3% perf-profile.children.cycles-pp.__entry_text_start
1.09 ? 3% -0.1 0.99 ? 5% perf-profile.children.cycles-pp.common_nsleep_timens
0.56 ? 3% -0.1 0.46 ? 7% perf-profile.children.cycles-pp.update_fast_timekeeper
1.53 ? 3% -0.1 1.44 ? 3% perf-profile.children.cycles-pp.tick_sched_handle
1.14 ? 2% -0.1 1.05 ? 2% perf-profile.children.cycles-pp._copy_to_user
0.51 ? 2% -0.1 0.43 ? 4% perf-profile.children.cycles-pp.timekeeping_forward_now
0.16 ? 7% -0.1 0.08 ? 18% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.32 ? 3% -0.1 0.25 ? 7% perf-profile.children.cycles-pp.enqueue_entity
0.69 ? 4% -0.1 0.62 ? 4% perf-profile.children.cycles-pp.chrdev_open
0.36 ? 3% -0.1 0.29 ? 7% perf-profile.children.cycles-pp.activate_task
0.15 ? 9% -0.1 0.08 ? 23% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.34 ? 4% -0.1 0.28 ? 7% perf-profile.children.cycles-pp.enqueue_task_fair
0.30 ? 7% -0.1 0.24 ? 6% perf-profile.children.cycles-pp.capable
0.54 ? 4% -0.1 0.48 ? 3% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.40 ? 7% -0.1 0.34 ? 11% perf-profile.children.cycles-pp.timerfd_clock_was_set
0.62 ? 4% -0.1 0.56 ? 4% perf-profile.children.cycles-pp.__close
0.57 ? 4% -0.1 0.51 ? 6% perf-profile.children.cycles-pp.menu_select
0.54 ? 3% -0.1 0.48 ? 4% perf-profile.children.cycles-pp.__x64_sys_close
0.31 ? 9% -0.1 0.25 ? 5% perf-profile.children.cycles-pp.lockref_get_not_dead
0.28 ? 5% -0.1 0.22 ? 6% perf-profile.children.cycles-pp.apparmor_capable
0.14 ? 16% -0.1 0.08 ? 27% perf-profile.children.cycles-pp.ktime_mono_to_any
0.13 ? 8% -0.1 0.07 ? 20% perf-profile.children.cycles-pp.sysvec_call_function_single
0.32 ? 8% -0.1 0.26 ? 5% perf-profile.children.cycles-pp.__legitimize_path
0.30 ? 5% -0.1 0.24 ? 5% perf-profile.children.cycles-pp.security_capable
0.08 ? 17% -0.1 0.02 ? 99% perf-profile.children.cycles-pp.security_file_free
0.52 ? 5% -0.1 0.47 ? 4% perf-profile.children.cycles-pp.alloc_empty_file
0.11 ? 6% -0.1 0.06 ? 21% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.45 ± 4% -0.1 0.40 ± 6% perf-profile.children.cycles-pp.__fput
0.40 ± 5% -0.1 0.35 ± 4% perf-profile.children.cycles-pp.init_file
0.31 ± 5% -0.1 0.26 ± 6% perf-profile.children.cycles-pp.notifier_call_chain
0.69 ± 3% -0.0 0.64 ± 2% perf-profile.children.cycles-pp.kmem_cache_alloc
0.18 ± 9% -0.0 0.14 ± 8% perf-profile.children.cycles-pp.kobject_get
0.18 ± 9% -0.0 0.13 ± 14% perf-profile.children.cycles-pp.check_preempt_curr
0.23 ± 6% -0.0 0.19 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.15 ± 11% -0.0 0.11 ± 14% perf-profile.children.cycles-pp.resched_curr
0.29 ± 5% -0.0 0.25 ± 5% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.60 ± 3% -0.0 0.56 ± 2% perf-profile.children.cycles-pp.lapic_next_deadline
0.22 ± 5% -0.0 0.19 ± 8% perf-profile.children.cycles-pp.security_file_alloc
0.18 ± 6% -0.0 0.15 ± 12% perf-profile.children.cycles-pp.smpboot_thread_fn
0.11 ± 8% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.run_ksoftirqd
0.26 ± 5% -0.0 0.23 ± 4% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.14 ± 9% -0.0 0.11 ± 9% perf-profile.children.cycles-pp.getname_flags
0.24 ± 3% -0.0 0.22 ± 4% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.10 ± 6% +0.0 0.12 ± 6% perf-profile.children.cycles-pp.posix_clock_poll
0.12 ± 6% +0.0 0.14 ± 4% perf-profile.children.cycles-pp.do_poll
0.27 ± 5% +0.0 0.29 ± 3% perf-profile.children.cycles-pp.down_read
0.15 ± 9% +0.0 0.18 ± 7% perf-profile.children.cycles-pp.up_read
0.12 ± 6% +0.0 0.15 ± 6% perf-profile.children.cycles-pp.link_path_walk
0.18 ± 5% +0.0 0.22 ± 7% perf-profile.children.cycles-pp.__switch_to
0.21 ± 4% +0.1 0.26 ± 13% perf-profile.children.cycles-pp.finish_task_switch
0.00 +0.1 0.08 ± 21% perf-profile.children.cycles-pp.step_into
0.87 ± 6% +0.1 1.00 ± 5% perf-profile.children.cycles-pp.posix_timer_unhash_and_free
6.08 ± 2% +0.4 6.45 ± 2% perf-profile.children.cycles-pp.igb_ptp_gettimex_82580
6.27 ± 2% +0.4 6.66 ± 2% perf-profile.children.cycles-pp.pc_clock_gettime
6.39 ± 2% +0.4 6.78 ± 2% perf-profile.children.cycles-pp.clock_gettime
77.76 +1.4 79.16 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
77.38 +1.4 78.81 perf-profile.children.cycles-pp.do_syscall_64
53.58 +2.9 56.46 perf-profile.children.cycles-pp.clock_adjtime
52.83 +2.9 55.78 perf-profile.children.cycles-pp.__do_sys_clock_adjtime
51.38 +3.1 54.50 perf-profile.children.cycles-pp.do_adjtimex
47.34 +3.4 50.76 perf-profile.children.cycles-pp.timekeeping_inject_offset
43.40 +4.3 47.66 perf-profile.children.cycles-pp.clock_was_set
9.11 ± 8% +5.2 14.27 ± 2% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
29.22 +6.0 35.19 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
13.04 -1.7 11.30 ± 2% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
1.03 ± 5% -0.5 0.57 ± 8% perf-profile.self.cycles-pp.ktime_get_real_ts64
4.24 ± 4% -0.3 3.92 ± 5% perf-profile.self.cycles-pp.ktime_get
0.81 ± 4% -0.2 0.58 ± 7% perf-profile.self.cycles-pp.intel_idle_irq
0.70 ± 5% -0.2 0.51 ± 9% perf-profile.self.cycles-pp.ktime_get_with_offset
0.47 ± 7% -0.2 0.29 ± 21% perf-profile.self.cycles-pp.ktime_get_ts64
0.39 ± 3% -0.1 0.24 ± 4% perf-profile.self.cycles-pp.update_vsyscall
0.66 ± 8% -0.1 0.52 ± 15% perf-profile.self.cycles-pp.ktime_get_raw_ts64
0.74 ± 3% -0.1 0.61 ± 4% perf-profile.self.cycles-pp.__do_sys_clock_adjtime
0.31 ± 7% -0.1 0.18 ± 16% perf-profile.self.cycles-pp.smp_call_function_many_cond
0.95 ± 3% -0.1 0.82 ± 5% perf-profile.self.cycles-pp.blake2s_compress
0.80 ± 2% -0.1 0.68 ± 3% perf-profile.self.cycles-pp.timekeeping_update
1.43 ± 2% -0.1 1.32 ± 2% perf-profile.self.cycles-pp.read_tsc
0.26 ± 7% -0.1 0.15 ± 22% perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
0.98 ± 4% -0.1 0.87 ± 5% perf-profile.self.cycles-pp.poll_idle
0.56 ± 3% -0.1 0.46 ± 7% perf-profile.self.cycles-pp.update_fast_timekeeper
0.32 ± 5% -0.1 0.23 ± 5% perf-profile.self.cycles-pp.timekeeping_inject_offset
1.12 ± 2% -0.1 1.03 ± 2% perf-profile.self.cycles-pp._copy_to_user
0.50 ± 2% -0.1 0.41 ± 5% perf-profile.self.cycles-pp.timekeeping_forward_now
0.51 ± 7% -0.1 0.45 ± 3% perf-profile.self.cycles-pp.hrtimer_interrupt
0.30 ± 8% -0.1 0.24 ± 3% perf-profile.self.cycles-pp.lockref_get_not_dead
0.10 ± 9% -0.1 0.04 ± 45% perf-profile.self.cycles-pp.enqueue_entity
0.13 ± 18% -0.1 0.08 ± 25% perf-profile.self.cycles-pp.ktime_mono_to_any
0.27 ± 6% -0.1 0.21 ± 5% perf-profile.self.cycles-pp.apparmor_capable
0.16 ± 8% -0.1 0.10 ± 21% perf-profile.self.cycles-pp.hrtimer_start_range_ns
0.18 ± 9% -0.0 0.14 ± 8% perf-profile.self.cycles-pp.kobject_get
0.15 ± 11% -0.0 0.11 ± 14% perf-profile.self.cycles-pp.resched_curr
0.59 ± 2% -0.0 0.56 ± 2% perf-profile.self.cycles-pp.lapic_next_deadline
0.41 ± 4% -0.0 0.38 ± 3% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.33 ± 3% -0.0 0.30 ± 5% perf-profile.self.cycles-pp.do_syscall_64
0.20 ± 7% -0.0 0.17 ± 5% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.26 ± 5% -0.0 0.23 ± 4% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.19 ± 3% -0.0 0.16 ± 8% perf-profile.self.cycles-pp.apparmor_file_alloc_security
0.10 ± 4% -0.0 0.08 ± 8% perf-profile.self.cycles-pp.update_curr
0.14 ± 11% +0.0 0.18 ± 7% perf-profile.self.cycles-pp.up_read
0.17 ± 5% +0.0 0.21 ± 6% perf-profile.self.cycles-pp.__switch_to
0.01 ±223% +0.1 0.08 ± 8% perf-profile.self.cycles-pp.dequeue_task_fair
20.56 +0.9 21.48 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
9.08 ± 8% +5.2 14.24 ± 2% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




2023-10-11 15:17:11

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

> kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
>
>
> commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> patch link: https://lore.kernel.org/all/[email protected]/
> patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
>
> testcase: stress-ng
> test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> parameters:
>
> nr_threads: 10%
> disk: 1HDD
> testtime: 60s
> fs: ext4
> class: os
> test: clock
> cpufreq_governor: performance
>
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-lkp/[email protected]

Is the test injecting massive numbers of corrected memory errors? The code in this patch
is only executed when handling CMCI interrupts, or polling machine check banks (at most
once per second).

I'm guessing this report is just because alignment of some hot path code changed.

-Tony

2023-10-11 15:52:14

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> >
> >
> > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > patch link: https://lore.kernel.org/all/[email protected]/
> > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> >
> > testcase: stress-ng
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> >
> > nr_threads: 10%
> > disk: 1HDD
> > testtime: 60s
> > fs: ext4
> > class: os
> > test: clock
> > cpufreq_governor: performance
> >
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <[email protected]>
> > | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>
> Is the test injecting massive numbers of corrected memory errors? The code in this patch
> is only executed when handling CMCI interrupts, or polling machine check banks (at most
> once per second).
>
> I'm guessing this report is just because alignment of some hot path code changed.

IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
kernel config for quite a while, to force each function's start
address to be aligned on 64 bytes. I don't know if this has changed
recently.

Also I noticed the patch introduces a new per-cpu variable 'storm_desc';
if the function addresses are 64B aligned, then per-cpu data alignment
may be related.

Thanks,
Feng

>
> -Tony

2023-10-11 17:24:20

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each funtion's start
> address aligned on 64 bytes. Don't know if this has been changed
> recently.
>
> Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.

If adding (removing, changing) per-cpu variables can cause 8% performance
changes, then maybe there needs to be some way to insulate the builds
from these side effects (as was done with the 64-byte function alignment).
I've no ideas on how to achieve that :-(

"storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
quite get it into the top 20.

In the config used for my kernel build, it does appear just before a bunch
of interrupt related per-cpu variables. Those might be sensitive to alignment?

000000000001ff40 d smca_misc_banks_map
000000000001ff48 d bank_map
000000000001ff50 d threshold_banks
000000000001ff60 d storm_desc
0000000000020580 d tsc_adjust
00000000000205c0 d lapic_events
00000000000206c0 d vector_cleanup
0000000000020720 D __uv_cpu_info
0000000000020738 d ipi_mask

-Tony

2023-10-12 02:36:29

by Li, Philip

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Wed, Oct 11, 2023 at 11:42:26PM +0800, Feng Tang wrote:
> On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> > >
> > >
> > > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > > patch link: https://lore.kernel.org/all/[email protected]/
> > > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> > >
> > > testcase: stress-ng
> > > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > > parameters:
> > >
> > > nr_threads: 10%
> > > disk: 1HDD
> > > testtime: 60s
> > > fs: ext4
> > > class: os
> > > test: clock
> > > cpufreq_governor: performance
> > >
> > >
> > >
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <[email protected]>
> > > | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >
> > Is the test injecting massive numbers of corrected memory errors? The code in this patch
> > is only executed when handling CMCI interrupts, or polling machine check banks (at most
> > once per second).
> >
> > I'm guessing this report is just because alignment of some hot path code changed.
>
> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each funtion's start
> address aligned on 64 bytes. Don't know if this has been changed
> recently.

yes, the kernel is always built with -falign-functions=64 to test
performance now.

>
> Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.
>
> Thanks,
> Feng
>
> >
> > -Tony
>

2023-10-12 05:47:22

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > kernel config for quite a while, to force each funtion's start
> > address aligned on 64 bytes. Don't know if this has been changed
> > recently.
> >
> > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > if the function address is 64B aligned, then per-cpu data alignment
> > may be related.
>
> If adding (removing, changing) per-cpu variables can cause 8% performance
> changes, then maybe there needs to be some way to insulate the builds
> from these side effects (as was done with the 64-byte function alignment).
> I've no ideas on how to achieve that :-(

Philip helped clarify that the 0Day kernel build does enable 64-byte
function address alignment.

> "storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
> quite get it into the top 20.
>
> In the config used for my kernel build, it does appear just before a bunch
> of interrupt related per-cpu variables. Those might be sensitive to alignment?
>
> 000000000001ff40 d smca_misc_banks_map
> 000000000001ff48 d bank_map
> 000000000001ff50 d threshold_banks
> 000000000001ff60 d storm_desc
> 0000000000020580 d tsc_adjust
> 00000000000205c0 d lapic_events
> 00000000000206c0 d vector_cleanup
> 0000000000020720 D __uv_cpu_info
> 0000000000020738 d ipi_mask

I did a further check by putting 'storm_desc' into another section with
the patch below, and the performance is restored, which indicates the
regression is related to the per-cpu data alignment change.

diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ecdf13f1bb7d5..7f32126f216bc 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -41,7 +41,7 @@ struct mca_storm_desc {
bool poll_mode;
};

-static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct mca_storm_desc, storm_desc);

Thanks,
Feng

>
> -Tony

2023-10-12 06:06:42

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Thu, Oct 12, 2023 at 01:36:41PM +0800, Feng Tang wrote:
> On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > > kernel config for quite a while, to force each funtion's start
> > > address aligned on 64 bytes. Don't know if this has been changed
> > > recently.
> > >
> > > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > > if the function address is 64B aligned, then per-cpu data alignment
> > > may be related.
> >
> > If adding (removing, changing) per-cpu variables can cause 8% performance
> > changes, then maybe there needs to be some way to insulate the builds
> > from these side effects (as was done with the 64-byte function alignment).
> > I've no ideas on how to achieve that :-(

As for mitigation (how to reduce these kind of strange performance
changes), I have proposed a "selective isolation" in LPC 2021, in
page 15 of https://lpc.events/event/11/contributions/895/attachments/770/1603/Strange_kernel_performance_changes_lpc_2021.pdf

As kernel binary is compactly linked together, the alignment change
of text/data in earlier modules could affect other modules after them.
MCE module is much easier to hit these strange performance changes,
just because it sit in arch/ folder and get linked early in kernel
binary.

The idea of "selective isolation" is simple: sparsely pick some modules
and enforce alignment on their functions/data/per-cpu data, so that
they act as fences/borders that separate the kernel into multiple
capsules, making an alignment change take effect only inside its own
capsule.

Any thoughts? thanks!

- Feng

> Philip helped to clarify that 0Day kernel build does enable the 64 bytes
> function address alignment.

[...]

2023-10-19 15:12:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> +/*
> + * history: bitmask tracking whether errors were seen or not seen in
> + * the most recent polls of a bank.

each bit in that bitmask represents an error seen.

> + * timestamp: last time (in jiffies) that the bank was polled
> + * storm: Is this bank in storm mode?
> + */
> +struct storm_bank {
> + u64 history;
> + u64 timestamp;
> + bool storm;

I guess "in_storm_mode" is even more descriptive:

storm->banks[bank].in_storm_mode = false;

etc.

> +};
> +
> +/* How many errors within the history buffer mark the start of a storm. */
> +#define STORM_BEGIN_THRESHOLD 5
> +
> +/*
> + * How many polls of machine check bank without an error before declaring
> + * the storm is over. Since it is tracked by the bitmaks in the history
> + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> + */
> +#define STORM_END_POLL_THRESHOLD 29
>
> #ifdef CONFIG_ACPI_APEI
> int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index f6e87443b37a..7c931f0c9251 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> barrier();
> m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
>
> + mce_track_storm(&m);

Lemme see if I understand the idea here:

the default polling interval is 5 mins. So the storm tracking is called
every 5 mins once to see how are banks "doing", so to speak and to get
them in or out of storm mode. So far so good...

> +
> /* If this entry is not valid, ignore it */
> if (!(m.status & MCI_STATUS_VAL))
> continue;

Btw, you're tracking storm even if the error is not valid - conditional
above here. Why?

> @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
> else
> iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
>
> - __this_cpu_write(mce_next_interval, iv);
> - __start_timer(t, iv);
> + if (mce_is_storm()) {
> + __start_timer(t, HZ);
> + } else {
> + __this_cpu_write(mce_next_interval, iv);
> + __start_timer(t, iv);
> + }

... this is where it becomes, hm, interesting: the check interval will
be halved if an error has been seen during this round but then if we're
in storm mode, that check interval doesn't matter - you'll run the timer
each second.

Then you need to restructure this to check the storm condition and not
do anything to iv if storm.

Or, am I missing something?

> diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> index ef4e7bb5fd88..ecdf13f1bb7d 100644
> --- a/arch/x86/kernel/cpu/mce/threshold.c
> +++ b/arch/x86/kernel/cpu/mce/threshold.c
> @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
> trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
> apic_eoi();
> }
> +
> +/*
> + * banks: per-cpu, per-bank details
> + * stormy_bank_count: count of MC banks in storm state
> + * poll_mode: CPU is in poll mode
> + */
> +struct mca_storm_desc {
> + struct storm_bank banks[MAX_NR_BANKS];
> + u8 stormy_bank_count;
> + bool poll_mode;
> +};

Yeah, put the struct definition into internal.h pls.

> +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> +
> +void mce_inherit_storm(unsigned int bank)
> +{
> + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> +
> + storm->banks[bank].history = ~0ull;

So upon inheriting a bank, you set its history that it has seen errors
each time?

That's weird.

> + storm->banks[bank].timestamp = jiffies;
> +}
> +
> +bool mce_is_storm(void)

That's a weird name. mce_get_storm_mode() perhaps?

> +{
> + return __this_cpu_read(storm_desc.poll_mode);
> +}
> +
> +void mce_set_storm(bool storm)

mce_set_storm_mode()

> +{
> + __this_cpu_write(storm_desc.poll_mode, storm);
> +}
> +
> +static void mce_handle_storm(unsigned int bank, bool on)
> +{
> + switch (boot_cpu_data.x86_vendor) {
> + }
> +}

...

> +void mce_track_storm(struct mce *mce)
> +{
> + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> + unsigned long now = jiffies, delta;
> + unsigned int shift = 1;
> + u64 history = 0;
> +
> + /*
> + * When a bank is in storm mode it is polled once per second and
> + * the history mask will record about the last minute of poll results.
> + * If it is not in storm mode, then the bank is only checked when
> + * there is a CMCI interrupt. Check how long it has been since
> + * this bank was last checked, and adjust the amount of "shift"
> + * to apply to history.
> + */
> + if (!storm->banks[mce->bank].storm) {

Yeah, if this were

if (!storm->banks[mce->bank].in_storm_mode)

it would've been perfectly clear what the condition tests.

> + delta = now - storm->banks[mce->bank].timestamp;
> + shift = (delta + HZ) / HZ;
> + }
> +
> + /* If it has been a long time since the last poll, clear history. */
> + if (shift < 64)

Use a properly named define instead of a naked number.

...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-23 18:14:42

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Thu, Oct 19, 2023 at 05:12:11PM +0200, Borislav Petkov wrote:
> On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> > +/*
> > + * history: bitmask tracking whether errors were seen or not seen in
> > + * the most recent polls of a bank.
>
> each bit in that bitmask represents an error seen.

Yes. That reads better.

>
> > + * timestamp: last time (in jiffies) that the bank was polled
> > + * storm: Is this bank in storm mode?
> > + */
> > +struct storm_bank {
> > + u64 history;
> > + u64 timestamp;
> > + bool storm;
>
> I guess "in_storm_mode" is even more descriptive:
>
> storm->banks[bank].in_storm_mode = false;
>
> etc.

Will fix.

>
> > +};
> > +
> > +/* How many errors within the history buffer mark the start of a storm. */
> > +#define STORM_BEGIN_THRESHOLD 5
> > +
> > +/*
> > + * How many polls of machine check bank without an error before declaring
> > + * the storm is over. Since it is tracked by the bitmaks in the history
> > + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> > + */
> > +#define STORM_END_POLL_THRESHOLD 29
> >
> > #ifdef CONFIG_ACPI_APEI
> > int apei_write_mce(struct mce *m);
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index f6e87443b37a..7c931f0c9251 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> > barrier();
> > m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
> >
> > + mce_track_storm(&m);
>
> Lemme see if I understand the idea here:
>
> the default polling interval is 5 mins. So the storm tracking is called
> every 5 mins once to see how are banks "doing", so to speak and to get
> them in or out of storm mode. So far so good...
>
> > +
> > /* If this entry is not valid, ignore it */
> > if (!(m.status & MCI_STATUS_VAL))
> > continue;
>
> Btw, you're tracking storm even if the error is not valid - conditional
> above here. Why?

I want to track whether each bank is in storm mode, or not. But there's
no indication when a CMCI is delivered which bank was the source. Code
just has to scan all the banks, and might find more than one with an
error. While no bank is in polling mode, there isn't a set time interval
between scanning banks. A scan is just triggered when a CMCI happens.
So it's non-trivial to compute a rate. Would require saving a timestamp
for every logged error.

In a simple case there's just one bank responsible for a ton of CMCI.
No need for complexity here, the count of interrupts from that bank will
hit a threshold and a storm is declared.

But more complex scenarios are possible. Other banks may trigger small
numbers of CMCI. Not enough to call it a storm. Or multiple banks may
be screaming together.

By tracking both the hits and misses in each bank, I end up with a
bitmap history for the past 64 polls. If there are enough "1" bits in
that bitmap to meet the threshold, then declare a storm for that bank.

>
> > @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
> > else
> > iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> >
> > - __this_cpu_write(mce_next_interval, iv);
> > - __start_timer(t, iv);
> > + if (mce_is_storm()) {
> > + __start_timer(t, HZ);
> > + } else {
> > + __this_cpu_write(mce_next_interval, iv);
> > + __start_timer(t, iv);
> > + }
>
> ... this is where it becomes, hm, interesting: the check interval will
> be halved if an error has been seen during this round but then if we're
> in storm mode, that check interval doesn't matter - you'll run the timer
> each second.
>
> Then you need to restructure this to check the storm condition and not
> do anything to iv if storm.
>
> Or, am I missing something?

I need to stare at this again to refresh my memory of what's going on
here. This code may need pulling apart into a routine that is used for
systems with no CMCI (or have CMCI disabled). Then the whole "divide the
poll interval by two" when you see an error and double the interval
when you don't see an error makes sense.

For systems with CMCI ... I think just polling a one second interval
until the storm is over makes sense.

>
> > diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> > index ef4e7bb5fd88..ecdf13f1bb7d 100644
> > --- a/arch/x86/kernel/cpu/mce/threshold.c
> > +++ b/arch/x86/kernel/cpu/mce/threshold.c
> > @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
> > trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
> > apic_eoi();
> > }
> > +
> > +/*
> > + * banks: per-cpu, per-bank details
> > + * stormy_bank_count: count of MC banks in storm state
> > + * poll_mode: CPU is in poll mode
> > + */
> > +struct mca_storm_desc {
> > + struct storm_bank banks[MAX_NR_BANKS];
> > + u8 stormy_bank_count;
> > + bool poll_mode;
> > +};
>
> Yeah, put the struct definition into internal.h pls.

These are only used in threshold.c now. What's the point of putting
them in internal.h? That's for definitions shared by multiple mce/*.c
files, isn't it? But I will move them there if you still want this.

>
> > +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> > +
> > +void mce_inherit_storm(unsigned int bank)
> > +{
> > + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> > +
> > + storm->banks[bank].history = ~0ull;
>
> So upon inheriting a bank, you set its history that it has seen errors
> each time?
>
> That's weird.

Ideally the new CPU would inherit the precise state of the previous
owner of this bank. But there's no easy way to track that as the bank
is abandoned by the CPU going offline, and there is a free-for-all with
the remaining CPUs racing to claim ownership. It is known that this bank
was in storm mode (because the threshold in the CTL2 bank register is
set to CMCI_STORM_THRESHOLD).

I went with "worst case" to make sure the new CPU didn't prematurely
declare an end to the storm.

I'll add a comment in mce_inherit_storm() to explain this.

>
> > + storm->banks[bank].timestamp = jiffies;
> > +}
> > +
> > +bool mce_is_storm(void)
>
> That's a weird name. mce_get_storm_mode() perhaps?

Sure.

>
> > +{
> > + return __this_cpu_read(storm_desc.poll_mode);
> > +}
> > +
> > +void mce_set_storm(bool storm)
>
> mce_set_storm_mode()

Also sure.

>
> > +{
> > + __this_cpu_write(storm_desc.poll_mode, storm);
> > +}
> > +
> > +static void mce_handle_storm(unsigned int bank, bool on)
> > +{
> > + switch (boot_cpu_data.x86_vendor) {
> > + }
> > +}
>
> ...
>
> > +void mce_track_storm(struct mce *mce)
> > +{
> > + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> > + unsigned long now = jiffies, delta;
> > + unsigned int shift = 1;
> > + u64 history = 0;
> > +
> > + /*
> > + * When a bank is in storm mode it is polled once per second and
> > + * the history mask will record about the last minute of poll results.
> > + * If it is not in storm mode, then the bank is only checked when
> > + * there is a CMCI interrupt. Check how long it has been since
> > + * this bank was last checked, and adjust the amount of "shift"
> > + * to apply to history.
> > + */
> > + if (!storm->banks[mce->bank].storm) {
>
> Yeah, if this were
>
> if (!storm->banks[mce->bank].in_storm_mode)
>
> it would've been perfectly clear what the condition tests.

Yup.

>
> > + delta = now - storm->banks[mce->bank].timestamp;
> > + shift = (delta + HZ) / HZ;
> > + }
> > +
> > + /* If it has been a long time since the last poll, clear history. */
> > + if (shift < 64)
>
> Use a properly named define instead of a naked number.

Like this?

#define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)

if (shift < NUM_HISTORY_BITS)
>
> ...
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.

-Tony

2023-11-21 12:00:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> Before any storm happens, machine_check_poll() may only be called once
> a month, or less, when errors occur.

Err:

[ 317.825546] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 317.825585] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[ 317.825585] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 629.121536] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 629.121536] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 629.121560] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 629.121561] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[ 629.121561] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 629.121569] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 940.417507] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 940.417509] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
...

That's from my coffeelake test box.

The irq context level thing says we're in softirq context when the
polling happens.

> When a storm is detected for a bank, that bank (and any others in storm
> mode) will be checked once per second.

Ok.

> For a bank that doesn't support CMCI, then polling is the only way
> to find errors. You are right, these will feed into the history
> tracker, but while at 5-minute interval will not be able to trigger
> a storm.

Yes. But you need to call into the storm handling code somehow. So you
do that from the polling code.

And if the machine supports CMCI, you do the same - call the polling
code which then does the storm check.

> Since that 5-minute interval is halved every time error is
> found consecutively, it is possible at the 1-second poll interval to
> fill out enough bits to indicate a storm. I think I need to add some
> code to handle that case as it makes no sense to mess with the CMCI
> threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI.
> Probably will just skip tracking for any such banks.

Ok.

> Aren't interrupts disabled while running the code after the timer fires?

No, see above.

> Whichever of the timer and the CMCI happens first will run. Second to
> arrive will pend the interrupt and be handled when interrupts are
> enabled as the first completes.

So I still don't like the timer calling machine_check_poll() and
cmci_mc_poll_banks() doing the same without any proper synchronization
between the two.

Yes, when you get a CMCI interrupt, you poll and do the call the storm
code. Now what happens if the polling runs from softirq context and you
get a CMCI interrupt at exactly the same time. I.e., is
machine_check_poll() reentrant and audited properly?

I hope I'm making more sense.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-27 19:51:14

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > Whichever of the timer and the CMCI happens first will run. Second to
> > arrive will pend the interrupt and be handled when interrupts are
> > enabled as the first completes.
>
> So I still don't like the timer calling machine_check_poll() and
> cmci_mc_poll_banks() doing the same without any proper synchronization
> between the two.

But it isn't doing the same thing. The timer calls:

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

and cmci_mc_poll_banks() calls:

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

A bank is either in the bitmap of banks to poll from the timer,
or in one of the per-cpu bitmaps of banks "owned" by that CPU to
be checked when a CMCI occurs. But it can't be in both.

> Yes, when you get a CMCI interrupt, you poll and then call the storm
> code. Now what happens if the polling runs from softirq context and you
> get a CMCI interrupt at exactly the same time. I.e., is
> machine_check_poll() reentrant and audited properly?

So nothing bad happens. If Linux was polling some set of banks from
the timer and is interrupted by CMCI, the interrupt will check some
disjoint set of banks. All the history tracking code is done per-bank,
so there is no overlap.

> I hope I'm making more sense.

Yes. Totally making sense. I was under the mistaken impression that
the mce timers used TIMER_IRQSAFE, and that a nested CMCI therefore
couldn't happen while processing a timed poll. So I learned something
here too.

I'll think of some comment to add to the history tracking code
to summarize this thread.

-Tony

2023-11-27 20:14:52

by Tony Luck

Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> But it isn't doing the same thing. The timer calls:
>
> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>
> and cmci_mc_poll_banks() calls:
>
> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

Bah ... I've cut & pasted the same thing ... but I think there
are separate bit maps ... maybe I'm wrong. Will go back and
look again.

-Tony

2023-11-28 00:42:13

by Tony Luck

Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > But it isn't doing the same thing. The timer calls:
> >
> > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> >
> > and cmci_mc_poll_banks() calls:
> >
> > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

>
> Bah ... I've cut & pasted the same thing ... but I think there
> are separate bit maps ... maybe I'm wrong. Will go back and
> look again.
>
> -Tony

2023-11-28 15:32:25

by Yazen Ghannam

Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On 11/27/2023 7:42 PM, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
>> On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
>>> On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
>>>> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
>>> But it isn't doing the same thing. The timer calls:
>>>
>>> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>>>
>>> and cmci_mc_poll_banks() calls:
>>>
>>> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>
> machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));
>
>>
>> Bah ... I've cut & pasted the same thing ... but I think there
>> are separate bit maps ... maybe I'm wrong. Will go back and
>> look again.
>>

This mutually exclusive behavior is not enforced on AMD systems. A bank
can be both polled and signaled using hardware interrupts.

I've been thinking of changing this in order to save polling cycles. Now
it seems there's another reason. :)

Thanks,
Yazen

2023-12-14 18:03:46

by Tony Luck

Subject: RE: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

> and you're saying
>
> mce_poll_banks and mce_banks_owned
>
> are disjoint.
>
> That's what you mean, right?

Yes. A bank is one or the other bitmap, but not both.

> Because if so, yes, that makes sense. If the sets of MCA banks polled
> and handled in the thresholding interrupt are disjoint, we should be ok.

Yes.

-Tony

2023-12-15 00:09:09

by Borislav Petkov

Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

On Mon, Nov 27, 2023 at 04:42:02PM -0800, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
> > On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> > > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > > But it isn't doing the same thing. The timer calls:
> > >
> > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > >
> > > and cmci_mc_poll_banks() calls:
> > >
> > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>
> machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

Hmm, so I applied your v10 and this call with mce_banks_owned is done in
cmci_recheck() only, which is on some init path.

The thresholding interrupt calls it too.

The timer ends up calling mc_poll_banks_default() which does

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

I presume we don't do:

	if (!cmci_supported(&banks)) {
		mc_poll_banks = cmci_mc_poll_banks;
		return;
	}

usually on Intel. And even if we did, cmci_mc_poll_banks() calls

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

too.

So regardless what machine you have, you do call the mc_poll_banks
pointer which in both cases does

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

The *thresholding* interrupt does

machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

and you're saying

mce_poll_banks and mce_banks_owned

are disjoint.

That's what you mean, right?

Because if so, yes, that makes sense. If the sets of MCA banks polled
and handled in the thresholding interrupt are disjoint, we should be ok.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette