Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbindv() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.
Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only to
the CPUs which have not reported within the timeout window.
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
---
V3: Use a cpumask to make the NMI case slightly safer - Ashok
---
arch/x86/include/asm/cpu.h | 2 +
arch/x86/kernel/process.c | 23 +++++++++++++-
arch/x86/kernel/smp.c | 71 +++++++++++++++++++++++++++++++--------------
3 files changed, 73 insertions(+), 23 deletions(-)
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
+extern struct cpumask cpus_stop_mask;
+
#endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,13 +759,23 @@ bool xen_set_default_idle(void)
}
#endif
+struct cpumask cpus_stop_mask;
+
void __noreturn stop_this_cpu(void *dummy)
{
+ unsigned int cpu = smp_processor_id();
+
local_irq_disable();
+
/*
- * Remove this CPU:
+ * Remove this CPU from the online mask and disable it
+ * unconditionally. This might be redundant in case that the reboot
+ * vector was handled late and stop_other_cpus() sent an NMI.
+ *
+ * According to SDM and APM NMIs can be accepted even after soft
+ * disabling the local APIC.
*/
- set_cpu_online(smp_processor_id(), false);
+ set_cpu_online(cpu, false);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
@@ -783,6 +793,15 @@ void __noreturn stop_this_cpu(void *dumm
*/
if (cpuid_eax(0x8000001f) & BIT(0))
native_wbinvd();
+
+ /*
+ * This brings a cache line back and dirties it, but
+ * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+ * observed that all CPUs reported stop. This write will invalidate
+ * the related cache line on this CPU.
+ */
+ cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
for (;;) {
/*
* Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
#include <asm/mmu_context.h>
#include <asm/proto.h>
#include <asm/apic.h>
+#include <asm/cpu.h>
#include <asm/idtentry.h>
#include <asm/nmi.h>
#include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait)
{
- unsigned long flags;
- unsigned long timeout;
+ unsigned int cpu = smp_processor_id();
+ unsigned long flags, timeout;
if (reboot_force)
return;
- /*
- * Use an own vector here because smp_call_function
- * does lots of things not suitable in a panic situation.
- */
+ /* Only proceed if this is the first CPU to reach this code */
+ if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+ return;
/*
- * We start by using the REBOOT_VECTOR irq.
- * The irq is treated as a sync point to allow critical
- * regions of code on other cpus to release their spin locks
- * and re-enable irqs. Jumping straight to an NMI might
- * accidentally cause deadlocks with further shutdown/panic
- * code. By syncing, we give the cpus up to one second to
- * finish their work before we force them off with the NMI.
+ * 1) Send an IPI on the reboot vector to all other CPUs.
+ *
+ * The other CPUs should react on it after leaving critical
+ * sections and re-enabling interrupts. They might still hold
+ * locks, but there is nothing which can be done about that.
+ *
+ * 2) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * 3) If #2 timed out send an NMI to the CPUs which did not
+ * yet report
+ *
+ * 4) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * #3 can obviously race against a CPU reaching the HLT loop late.
+ * That CPU will have reported already and the "have all CPUs
+ * reached HLT" condition will be true despite the fact that the
+ * other CPU is still handling the NMI. Again, there is no
+ * protection against that as "disabled" APICs still respond to
+ * NMIs.
*/
- if (num_online_cpus() > 1) {
- /* did someone beat us here? */
- if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
- return;
+ cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+ cpumask_clear_cpu(cpu, &cpus_stop_mask);
+ if (!cpumask_empty(&cpus_stop_mask)) {
/* sync above data before sending IRQ */
wmb();
@@ -183,24 +196,34 @@ static void native_stop_other_cpus(int w
* CPUs reach shutdown state.
*/
timeout = USEC_PER_SEC;
- while (num_online_cpus() > 1 && timeout--)
+ while (!cpumask_empty(&cpus_stop_mask) && timeout--)
udelay(1);
}
/* if the REBOOT_VECTOR didn't work, try with the NMI */
- if (num_online_cpus() > 1) {
+ if (!cpumask_empty(&cpus_stop_mask)) {
/*
* If NMI IPI is enabled, try to register the stop handler
* and send the IPI. In any case try to wait for the other
* CPUs to stop.
*/
if (!smp_no_nmi_ipi && !register_stop_handler()) {
+ u32 dm;
+
/* Sync above data before sending IRQ */
wmb();
pr_emerg("Shutting down cpus with NMI\n");
- apic_send_IPI_allbutself(NMI_VECTOR);
+ dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
+ dm |= APIC_DM_NMI;
+
+ for_each_cpu(cpu, &cpus_stop_mask) {
+ u32 apicid = apic->cpu_present_to_apicid(cpu);
+
+ apic_icr_write(dm, apicid);
+ apic_wait_icr_idle();
+ }
}
/*
* Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +231,7 @@ static void native_stop_other_cpus(int w
* one or more CPUs do not reach shutdown state.
*/
timeout = USEC_PER_MSEC * 10;
- while (num_online_cpus() > 1 && (wait || timeout--))
+ while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
udelay(1);
}
@@ -216,6 +239,12 @@ static void native_stop_other_cpus(int w
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
local_irq_restore(flags);
+
+ /*
+ * Ensure that the cpus_stop_mask cache lines are invalidated on
+ * the other CPUs. See comment vs. SME in stop_this_cpu().
+ */
+ cpumask_clear(&cpus_stop_mask);
}
/*
Hi Thomas,
On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbindv() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
>
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
>
> CPU0 CPU1
>
> stop_other_cpus()
> send_IPIs(REBOOT); stop_this_cpu()
> while (num_online_cpus() > 1); set_online(false);
> proceed... -> hang
> wbinvd()
>
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.
>
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
>
> This issue exists independent of WBINVD, but the delays caused by WBINVD
> make it more prominent.
>
> Make this more robust by adding a cpumask which is initialized to the
> online CPU mask before sending the IPIs and CPUs clear their bit in
> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
> become empty in stop_other_cpus() instead of watching num_online_cpus().
>
> The cpumask cannot plug all holes either, but it's better than a raw
> counter and allows to restrict the NMI fallback IPI to be sent only to
> the CPUs which have not reported within the timeout window.
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/all/[email protected]
> ---
> V3: Use a cpumask to make the NMI case slightly safer - Ashok
> ---
> arch/x86/include/asm/cpu.h | 2 +
> arch/x86/kernel/process.c | 23 +++++++++++++-
> arch/x86/kernel/smp.c | 71 +++++++++++++++++++++++++++++++--------------
> 3 files changed, 73 insertions(+), 23 deletions(-)
I tested them and seems to work fine on my system.
Maybe Tony can check in his setup would be great.
One thought on sending NMI below.
[snip]
>
> /* if the REBOOT_VECTOR didn't work, try with the NMI */
> - if (num_online_cpus() > 1) {
> + if (!cpumask_empty(&cpus_stop_mask)) {
> /*
> * If NMI IPI is enabled, try to register the stop handler
> * and send the IPI. In any case try to wait for the other
> * CPUs to stop.
> */
> if (!smp_no_nmi_ipi && !register_stop_handler()) {
> + u32 dm;
> +
> /* Sync above data before sending IRQ */
> wmb();
>
> pr_emerg("Shutting down cpus with NMI\n");
>
> - apic_send_IPI_allbutself(NMI_VECTOR);
> + dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> + dm |= APIC_DM_NMI;
> +
> + for_each_cpu(cpu, &cpus_stop_mask) {
> + u32 apicid = apic->cpu_present_to_apicid(cpu);
> +
> + apic_icr_write(dm, apicid);
> + apic_wait_icr_idle();
can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
On Thu, Jun 15 2023 at 18:58, Ashok Raj wrote:
> On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
>> + dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
>> + dm |= APIC_DM_NMI;
>> +
>> + for_each_cpu(cpu, &cpus_stop_mask) {
>> + u32 apicid = apic->cpu_present_to_apicid(cpu);
>> +
>> + apic_icr_write(dm, apicid);
>> + apic_wait_icr_idle();
>
> can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
That would not set APIC_DM_NMI in delivery mode and the IPI would be
sent with APIC_DM_FIXED.
Unfortunately we don't have apic->send_NMI() ...
Thanks,
tglx
On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> On Thu, Jun 15 2023 at 18:58, Ashok Raj wrote:
> > On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> >> + dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> >> + dm |= APIC_DM_NMI;
> >> +
> >> + for_each_cpu(cpu, &cpus_stop_mask) {
> >> + u32 apicid = apic->cpu_present_to_apicid(cpu);
> >> +
> >> + apic_icr_write(dm, apicid);
> >> + apic_wait_icr_idle();
> >
> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
>
> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> sent with APIC_DM_FIXED.
That's correct.
Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
__x2apic_send_IPI_mask()
__x2apic_send_IPI_dest()
unsigned long cfg = __prepare_ICR(0, vector, dest);
native_x2apic_icr_write(cfg, apicid);
__prepare_ICR() seems to have the magic for APIC_DM_NMI?
I suppose this works for non-x2apic parts as well
Cheers,
Ashok
On 6/15/23 21:58, Ashok Raj wrote:
> Hi Thomas,
>
> On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
>> Tony reported intermittent lockups on poweroff. His analysis identified the
>> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
>> on SME enabled machines a kexec() does not leave any stale data in the
>> caches when switching from encrypted to non-encrypted mode or vice versa.
>>
>> That wbindv() is conditional on the SME feature bit which is read directly
>> from CPUID. But that readout does not check whether the CPUID leaf is
>> available or not. If it's not available the CPU will return the value of
>> the highest supported leaf instead. Depending on the content the "SME" bit
>> might be set or not.
>>
>> That's incorrect but harmless. Making the CPUID readout conditional makes
>> the observed hangs go away, but it does not fix the underlying problem:
>>
>> CPU0 CPU1
>>
>> stop_other_cpus()
>> send_IPIs(REBOOT); stop_this_cpu()
>> while (num_online_cpus() > 1); set_online(false);
>> proceed... -> hang
>> wbinvd()
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>>
>> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
>> which causes the system to hang.
>>
>> This issue exists independent of WBINVD, but the delays caused by WBINVD
>> make it more prominent.
>>
>> Make this more robust by adding a cpumask which is initialized to the
>> online CPU mask before sending the IPIs and CPUs clear their bit in
>> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
>> become empty in stop_other_cpus() instead of watching num_online_cpus().
>>
>> The cpumask cannot plug all holes either, but it's better than a raw
>> counter and allows to restrict the NMI fallback IPI to be sent only to
>> the CPUs which have not reported within the timeout window.
>>
>> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
>> Reported-by: Tony Battersby <[email protected]>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/all/[email protected]
>> ---
>> V3: Use a cpumask to make the NMI case slightly safer - Ashok
>> ---
>> arch/x86/include/asm/cpu.h | 2 +
>> arch/x86/kernel/process.c | 23 +++++++++++++-
>> arch/x86/kernel/smp.c | 71 +++++++++++++++++++++++++++++++--------------
>> 3 files changed, 73 insertions(+), 23 deletions(-)
> I tested them and seems to work fine on my system.
>
> Maybe Tony can check in his setup would be great.
>
plain 6.4-rc6: 50% failure rate
poweroff success: 2
poweroff fail: 2
6.4-rc6 with tglx v3 patch #1 only: 0% failure rate
poweroff success: 10
poweroff fail: 0
6.4-rc6 with all 7 tglx v3 patches: 0% failure rate
poweroff success: 10
poweroff fail: 0
Fixes my problem.
Tested-by: Tony Battersby <[email protected]>
On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
>> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
>>
>> That would not set APIC_DM_NMI in delivery mode and the IPI would be
>> sent with APIC_DM_FIXED.
>
> That's correct.
>
> Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
>
> apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
> __x2apic_send_IPI_mask()
> __x2apic_send_IPI_dest()
> unsigned long cfg = __prepare_ICR(0, vector, dest);
> native_x2apic_icr_write(cfg, apicid);
>
> __prepare_ICR() seems to have the magic for APIC_DM_NMI?
Bah. I clearly neither can read nor can remember any of the internals of
that code:
apic->send_IPI()
x2apic_send_IPI()
__x2apic_send_IPI_dest()
It also works for all other APIC incarnations.
Thanks,
tglx
---
From: Thomas Gleixner <[email protected]>
Subject: x86/smp: Make stop_other_cpus() more robust
Date: Wed, 26 Apr 2023 18:37:00 +0200
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbindv() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.
Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
---
V3: Use a cpumask to make the NMI case slightly safer - Ashok
V4: Simplify the NMI IPI code - Ashok
---
arch/x86/include/asm/cpu.h | 2 +
arch/x86/kernel/process.c | 23 +++++++++++++++-
arch/x86/kernel/smp.c | 62 +++++++++++++++++++++++++++++----------------
3 files changed, 64 insertions(+), 23 deletions(-)
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
+extern struct cpumask cpus_stop_mask;
+
#endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,13 +759,23 @@ bool xen_set_default_idle(void)
}
#endif
+struct cpumask cpus_stop_mask;
+
void __noreturn stop_this_cpu(void *dummy)
{
+ unsigned int cpu = smp_processor_id();
+
local_irq_disable();
+
/*
- * Remove this CPU:
+ * Remove this CPU from the online mask and disable it
+ * unconditionally. This might be redundant in case that the reboot
+ * vector was handled late and stop_other_cpus() sent an NMI.
+ *
+ * According to SDM and APM NMIs can be accepted even after soft
+ * disabling the local APIC.
*/
- set_cpu_online(smp_processor_id(), false);
+ set_cpu_online(cpu, false);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
@@ -783,6 +793,15 @@ void __noreturn stop_this_cpu(void *dumm
*/
if (cpuid_eax(0x8000001f) & BIT(0))
native_wbinvd();
+
+ /*
+ * This brings a cache line back and dirties it, but
+ * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+ * observed that all CPUs reported stop. This write will invalidate
+ * the related cache line on this CPU.
+ */
+ cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
for (;;) {
/*
* Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
#include <asm/mmu_context.h>
#include <asm/proto.h>
#include <asm/apic.h>
+#include <asm/cpu.h>
#include <asm/idtentry.h>
#include <asm/nmi.h>
#include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait)
{
- unsigned long flags;
- unsigned long timeout;
+ unsigned int cpu = smp_processor_id();
+ unsigned long flags, timeout;
if (reboot_force)
return;
- /*
- * Use an own vector here because smp_call_function
- * does lots of things not suitable in a panic situation.
- */
+ /* Only proceed if this is the first CPU to reach this code */
+ if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+ return;
/*
- * We start by using the REBOOT_VECTOR irq.
- * The irq is treated as a sync point to allow critical
- * regions of code on other cpus to release their spin locks
- * and re-enable irqs. Jumping straight to an NMI might
- * accidentally cause deadlocks with further shutdown/panic
- * code. By syncing, we give the cpus up to one second to
- * finish their work before we force them off with the NMI.
+ * 1) Send an IPI on the reboot vector to all other CPUs.
+ *
+ * The other CPUs should react on it after leaving critical
+ * sections and re-enabling interrupts. They might still hold
+ * locks, but there is nothing which can be done about that.
+ *
+ * 2) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * 3) If #2 timed out send an NMI to the CPUs which did not
+ * yet report
+ *
+ * 4) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * #3 can obviously race against a CPU reaching the HLT loop late.
+ * That CPU will have reported already and the "have all CPUs
+ * reached HLT" condition will be true despite the fact that the
+ * other CPU is still handling the NMI. Again, there is no
+ * protection against that as "disabled" APICs still respond to
+ * NMIs.
*/
- if (num_online_cpus() > 1) {
- /* did someone beat us here? */
- if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
- return;
+ cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+ cpumask_clear_cpu(cpu, &cpus_stop_mask);
+ if (!cpumask_empty(&cpus_stop_mask)) {
/* sync above data before sending IRQ */
wmb();
@@ -183,12 +196,12 @@ static void native_stop_other_cpus(int w
* CPUs reach shutdown state.
*/
timeout = USEC_PER_SEC;
- while (num_online_cpus() > 1 && timeout--)
+ while (!cpumask_empty(&cpus_stop_mask) && timeout--)
udelay(1);
}
/* if the REBOOT_VECTOR didn't work, try with the NMI */
- if (num_online_cpus() > 1) {
+ if (!cpumask_empty(&cpus_stop_mask)) {
/*
* If NMI IPI is enabled, try to register the stop handler
* and send the IPI. In any case try to wait for the other
@@ -200,7 +213,8 @@ static void native_stop_other_cpus(int w
pr_emerg("Shutting down cpus with NMI\n");
- apic_send_IPI_allbutself(NMI_VECTOR);
+ for_each_cpu(cpu, &cpus_stop_mask)
+ apic->send_IPI(cpu, NMI_VECTOR);
}
/*
* Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +222,7 @@ static void native_stop_other_cpus(int w
* one or more CPUs do not reach shutdown state.
*/
timeout = USEC_PER_MSEC * 10;
- while (num_online_cpus() > 1 && (wait || timeout--))
+ while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
udelay(1);
}
@@ -216,6 +230,12 @@ static void native_stop_other_cpus(int w
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
local_irq_restore(flags);
+
+ /*
+ * Ensure that the cpus_stop_mask cache lines are invalidated on
+ * the other CPUs. See comment vs. SME in stop_this_cpu().
+ */
+ cpumask_clear(&cpus_stop_mask);
}
/*
On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> > On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> >> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
> >>
> >> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> >> sent with APIC_DM_FIXED.
> >
> > That's correct.
> >
> > Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
> >
> > apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
> > __x2apic_send_IPI_mask()
> > __x2apic_send_IPI_dest()
> > unsigned long cfg = __prepare_ICR(0, vector, dest);
> > native_x2apic_icr_write(cfg, apicid);
> >
> > __prepare_ICR() seems to have the magic for APIC_DM_NMI?
>
> Bah. I clearly neither can read nor can remember any of the internals of
> that code:
>
> apic->send_IPI()
> x2apic_send_IPI()
> __x2apic_send_IPI_dest()
>
> It also works for all other APIC incarnations.
Works!
On Fri, Jun 16, 2023 at 01:57:59PM -0700, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> > On Fri, Jun 16 2023 at 07:13, Ashok Raj wrote:
> > > On Fri, Jun 16, 2023 at 09:53:25AM +0200, Thomas Gleixner wrote:
> > >> > can we simplify this by just apic->send_IPI(cpu, NMI_VECTOR); ??
> > >>
> > >> That would not set APIC_DM_NMI in delivery mode and the IPI would be
> > >> sent with APIC_DM_FIXED.
> > >
> > > That's correct.
> > >
> > > Maybe if we use apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR)
> > >
> > > apic->send_IPI_mask(cpumask_of(cpu),NMI_VECTOR)
> > > __x2apic_send_IPI_mask()
> > > __x2apic_send_IPI_dest()
> > > unsigned long cfg = __prepare_ICR(0, vector, dest);
> > > native_x2apic_icr_write(cfg, apicid);
> > >
> > > __prepare_ICR() seems to have the magic for APIC_DM_NMI?
> >
> > Bah. I clearly neither can read nor can remember any of the internals of
> > that code:
> >
> > apic->send_IPI()
> > x2apic_send_IPI()
> > __x2apic_send_IPI_dest()
> >
> > It also works for all other APIC incarnations.
>
> Works!
>
Forgot to add
Reviewed-by: Ashok Raj <[email protected]>
On Fri, Jun 16, 2023 at 08:01:56PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <[email protected]>
> Subject: x86/smp: Make stop_other_cpus() more robust
> Date: Wed, 26 Apr 2023 18:37:00 +0200
>
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbindv() is conditional on the SME feature bit which is read directly
wbinvd()
Otherwise:
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette