2023-01-10 10:26:00

by Zeng Heng

[permalink] [raw]
Subject: [PATCH] x86/boot/compressed: Register NMI handler in EFI boot loader

If kdump is enabled, when using mce_inject to inject errors, EFI
boot loader would decompressed & load second kernel for saving
vmcore file.

For normal errors that is fine. However, in MCEs cases, the panic
cpu that firstly enters into mce_panic(), is running within nmi
interrupt context, and the processor blocks delivery of subsequent
NMIs until the next execution of the IRET instruction.

When the panic cpu takes long time in the panic processing route,
and causes the watchdog timeout, at this moment, the processor
already receives NMI interrupt in the background.

In the following processure, panic cpu would run into EFI loader
and raise page fault exception (like visiting `vidmem` variable
when attempts to call debug_putstr()), the cpu would execute IRET
instruction when exits from page fault handler.

But the loader never registers handler for NMI vector in IDT,
lack of vector handler would cause reboot, which interrupts
kdump processure and fails to save vmcore file.

Here is steps to reproduce the above issue (Have a certain probability):
1. # cat uncorrected
CPU 1 BANK 4
STATUS uncorrected 0xc0
MCGSTATUS EIPV MCIP
ADDR 0x1234
RIP 0xdeadbabe
RAISINGCPU 0
MCGCAP SER CMCI TES 0x6
2. # modprobe mce_inject
3. # mce-inject uncorrected

For increasing probability of issue reproduce, there is two ways of
modification to select:
1. modify the threshold value of watchdog;
2. add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro;

Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
Signed-off-by: Zeng Heng <[email protected]>
---
arch/x86/boot/compressed/ident_map_64.c | 5 +++++
arch/x86/boot/compressed/idt_64.c | 1 +
arch/x86/boot/compressed/idt_handlers_64.S | 1 +
arch/x86/boot/compressed/misc.h | 1 +
4 files changed, 8 insertions(+)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index d4a314cc50d6..6893127f673f 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -379,3 +379,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
*/
kernel_add_identity_map(address, end);
}
+
+void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ /* ignore */
+}
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 6debb816e83d..b169c9728d52 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -60,6 +60,7 @@ void load_stage2_idt(void)
{
boot_idt_desc.address = (unsigned long)boot_idt;

+ set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
set_idt_entry(X86_TRAP_PF, boot_page_fault);

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 22890e199f5b..2aef8e1b515b 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -69,6 +69,7 @@ SYM_FUNC_END(\name)
.text
.code64

+EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0
EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 62208ec04ca4..d89d3f8417f6 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { }
#endif

/* IDT Entry Points */
+void boot_nmi_fault(void);
void boot_page_fault(void);
void boot_stage1_vc(void);
void boot_stage2_vc(void);
--
2.25.1


2023-01-10 11:40:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/boot/compressed: Register NMI handler in EFI boot loader


* Zeng Heng <[email protected]> wrote:

> +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> + /* ignore */
> +}
> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> index 6debb816e83d..b169c9728d52 100644
> --- a/arch/x86/boot/compressed/idt_64.c
> +++ b/arch/x86/boot/compressed/idt_64.c
> @@ -60,6 +60,7 @@ void load_stage2_idt(void)
> {
> boot_idt_desc.address = (unsigned long)boot_idt;
>
> + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
> set_idt_entry(X86_TRAP_PF, boot_page_fault);

So it's a bit sad to install a dummy handler that does nothing, while
something clearly sent an NMI and expects an intelligent reaction - OTOH
the unexpected NMIs from from watchdog during a kdump clearly make things
worse by crashing the bootup.

Anyway, I cannot think of a better response here that the boot loading code
could do either, so I've applied your fix to tip:x86/boot.

Thanks,

Ingo

2023-01-10 12:04:46

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


* Ingo Molnar <[email protected]> wrote:

>
> * Zeng Heng <[email protected]> wrote:
>
> > +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
> > +{
> > + /* ignore */
> > +}
> > diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> > index 6debb816e83d..b169c9728d52 100644
> > --- a/arch/x86/boot/compressed/idt_64.c
> > +++ b/arch/x86/boot/compressed/idt_64.c
> > @@ -60,6 +60,7 @@ void load_stage2_idt(void)
> > {
> > boot_idt_desc.address = (unsigned long)boot_idt;
> >
> > + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
> > set_idt_entry(X86_TRAP_PF, boot_page_fault);
>
> So it's a bit sad to install a dummy handler that does nothing, while
> something clearly sent an NMI and expects an intelligent reaction - OTOH
> the unexpected NMIs from from watchdog during a kdump clearly make things
> worse by crashing the bootup.
>
> Anyway, I cannot think of a better response here that the boot loading code
> could do either, so I've applied your fix to tip:x86/boot.

BTW., the changelog had very poor quality, and the patch added no comments
to explain the presence of the dummy NMI.

The -v2 version below should address most of those problems.

Thanks,

Ingo

=============>
From: Zeng Heng <[email protected]>
Date: Tue, 10 Jan 2023 18:27:45 +0800
Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

If kdump is enabled, when using mce_inject to inject errors, EFI
boot loader would decompress & load second kernel for saving the
vmcore file.

For normal errors that is fine. However, in the MCE case, the panic
CPU that firstly enters into mce_panic() is running within NMI
interrupt context, and the processor blocks delivery of subsequent
NMIs until the next execution of the IRET instruction.

When the panic CPU takes long time in the panic processing route,
and causes the watchdog timeout, at this moment, the processor
already receives NMI interrupt in the background.

In the reproducer sequence below, panic CPU would run into EFI loader
and raise page fault exception (like visiting `vidmem` variable
when attempting to call debug_putstr()), the CPU would execute IRET
instruction when it exits from the page fault handler.

But the loader never registers handler for NMI vector in IDT,
lack of vector handler would cause reboot, which interrupts
kdump procedure and fails to save the vmcore file.

Here is steps to reproduce the above issue (it's sporadic):

1. # cat uncorrected
CPU 1 BANK 4
STATUS uncorrected 0xc0
MCGSTATUS EIPV MCIP
ADDR 0x1234
RIP 0xdeadbabe
RAISINGCPU 0
MCGCAP SER CMCI TES 0x6
2. # modprobe mce_inject
3. # mce-inject uncorrected

For increasing the probability of reproduction of this issue, there are
two ways to increase the probability of the bug:

1. modify the threshold value of watchdog (increase NMI frequency);
2. and/or add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro;

Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
Signed-off-by: Zeng Heng <[email protected]>
[ Tidy up changelog, add comments. ]
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/ident_map_64.c | 12 ++++++++++++
arch/x86/boot/compressed/idt_64.c | 1 +
arch/x86/boot/compressed/idt_handlers_64.S | 1 +
arch/x86/boot/compressed/misc.h | 1 +
4 files changed, 15 insertions(+)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index d4a314cc50d6..cbfdefcf9657 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -379,3 +379,15 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
*/
kernel_add_identity_map(address, end);
}
+
+void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ /*
+ * Default boot loader placeholder fault handler - there's no real
+ * kernel running yet, so there's not much we can do - but NMIs
+ * can arrive in a kdump scenario, for example by the NMI watchdog.
+ *
+ * Not having any handler would cause the CPU to silently reboot,
+ * so we do the second-worst thing here and ignore the NMI.
+ */
+}
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 6debb816e83d..b169c9728d52 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -60,6 +60,7 @@ void load_stage2_idt(void)
{
boot_idt_desc.address = (unsigned long)boot_idt;

+ set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
set_idt_entry(X86_TRAP_PF, boot_page_fault);

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 22890e199f5b..2aef8e1b515b 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -69,6 +69,7 @@ SYM_FUNC_END(\name)
.text
.code64

+EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0
EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 62208ec04ca4..d89d3f8417f6 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { }
#endif

/* IDT Entry Points */
+void boot_nmi_fault(void);
void boot_page_fault(void);
void boot_stage1_vc(void);
void boot_stage2_vc(void);

Subject: [tip: x86/boot] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: f919e902ac2371ec46b5582a42c50544dcb097f6
Gitweb: https://git.kernel.org/tip/f919e902ac2371ec46b5582a42c50544dcb097f6
Author: Zeng Heng <[email protected]>
AuthorDate: Tue, 10 Jan 2023 18:27:45 +08:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Jan 2023 11:56:55 +01:00

x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

If kdump is enabled, when using mce_inject to inject errors, EFI
boot loader would decompressed & load second kernel for saving
vmcore file.

For normal errors that is fine. However, in MCEs cases, the panic
CPU that firstly enters into mce_panic(), is running within nmi
interrupt context, and the processor blocks delivery of subsequent
NMIs until the next execution of the IRET instruction.

When the panic CPU takes long time in the panic processing route,
and causes the watchdog timeout, at this moment, the processor
already receives NMI interrupt in the background.

In the following processure, panic CPU would run into EFI loader
and raise page fault exception (like visiting `vidmem` variable
when attempts to call debug_putstr()), the CPU would execute IRET
instruction when exits from page fault handler.

But the loader never registers handler for NMI vector in IDT,
lack of vector handler would cause reboot, which interrupts
kdump processure and fails to save vmcore file.

Here is steps to reproduce the above issue (Have a certain probability):

1. # cat uncorrected
CPU 1 BANK 4
STATUS uncorrected 0xc0
MCGSTATUS EIPV MCIP
ADDR 0x1234
RIP 0xdeadbabe
RAISINGCPU 0
MCGCAP SER CMCI TES 0x6
2. # modprobe mce_inject
3. # mce-inject uncorrected

For increasing probability of reproduction of this issue, there is two ways of
modification to select:

1. modify the threshold value of watchdog;
2. add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro;

Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
Signed-off-by: Zeng Heng <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/ident_map_64.c | 5 +++++
arch/x86/boot/compressed/idt_64.c | 1 +
arch/x86/boot/compressed/idt_handlers_64.S | 1 +
arch/x86/boot/compressed/misc.h | 1 +
4 files changed, 8 insertions(+)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index d4a314c..6893127 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -379,3 +379,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
*/
kernel_add_identity_map(address, end);
}
+
+void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ /* ignore */
+}
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 6debb81..b169c97 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -60,6 +60,7 @@ void load_stage2_idt(void)
{
boot_idt_desc.address = (unsigned long)boot_idt;

+ set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
set_idt_entry(X86_TRAP_PF, boot_page_fault);

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 22890e1..2aef8e1 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -69,6 +69,7 @@ SYM_FUNC_END(\name)
.text
.code64

+EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0
EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1

#ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 62208ec..d89d3f8 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { }
#endif

/* IDT Entry Points */
+void boot_nmi_fault(void);
void boot_page_fault(void);
void boot_stage1_vc(void);
void boot_stage2_vc(void);

2023-01-10 12:34:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 01:11:29PM +0100, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote:
> > From: Zeng Heng <[email protected]>
> > Date: Tue, 10 Jan 2023 18:27:45 +0800
> > Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
> >
> > If kdump is enabled, when using mce_inject to inject errors, EFI
>
> Why does "EFI" matter here? Any boot loader would do...
>
> > boot loader would decompress & load second kernel for saving the
>
> s/&/and/
>
> > vmcore file.
> >
> > For normal errors that is fine.
>
> Useless sentence.
>
> > However, in the MCE case, the panic
> > CPU that firstly enters into mce_panic() is running within NMI
> > interrupt context,
>
> "#MC context" it is non-maskable but that's not "NMI interrupt context"
>
> > and the processor blocks delivery of subsequent
> > NMIs until the next execution of the IRET instruction.
> >
> > When the panic CPU takes long time in the panic processing route,
>
> I'm still unclear on the order of events here. It sounds like
>
> 1. MCE injected
> 2. panic
> 3. kdump gets loaded
>
> If that is the case, then I presume the flow is:
>
> mce_panic -> panic -> __crash_kexec()
>
> Yes?
>
> If so, then we should make sure we have *exited* #MC context before calling
> panic() and not have to add hacks like this one of adding an empty NMI handler.
>
> But I'm only speculating as it is hard to make sense of all this text.

IOW, does this help?

---
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7832a69d170e..55437d8a4fad 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
panic(msg);
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);



--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-10 12:48:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


* Borislav Petkov <[email protected]> wrote:

> > mce_panic -> panic -> __crash_kexec()
> >
> > Yes?
> >
> > If so, then we should make sure we have *exited* #MC context before calling
> > panic() and not have to add hacks like this one of adding an empty NMI handler.
> >
> > But I'm only speculating as it is hard to make sense of all this text.
>
> IOW, does this help?
>
> ---
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7832a69d170e..55437d8a4fad 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> if (panic_timeout == 0)
> panic_timeout = mca_cfg.panic_timeout;
> panic(msg);
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);

So your suggestion was to exit MC context 'before' the panic() call - but
the patch calls it 'after' - was that intentional?

Thanks,

Ingo

2023-01-10 13:07:16

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/10 20:11, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote:
>> From: Zeng Heng <[email protected]>
>> Date: Tue, 10 Jan 2023 18:27:45 +0800
>> Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
>>
>> If kdump is enabled, when using mce_inject to inject errors, EFI
> Why does "EFI" matter here? Any boot loader would do...
>
>> boot loader would decompress & load second kernel for saving the
> s/&/and/
>
>> vmcore file.
>>
>> For normal errors that is fine.
> Useless sentence.
>
>> However, in the MCE case, the panic
>> CPU that firstly enters into mce_panic() is running within NMI
>> interrupt context,
> "#MC context" it is non-maskable but that's not "NMI interrupt context"

mce is registered on NMI handler by inject_init().

And here is the context of mce-inject:

#0  relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55
#1  0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400)
    at arch/x86/kernel/machine_kexec_64.c:391
#2  0xffffffff811a9573 in __crash_kexec (regs=regs@entry=0x0
<fixed_percpu_data>)
    at kernel/kexec_core.c:1057
#3  0xffffffff81a5b4e4 in panic (fmt=fmt@entry=0xffffffff823211c8 "Fatal
machine check")
    at kernel/panic.c:393
#4  0xffffffff81aa65f5 in mce_panic (
    msg=msg@entry=0xffffffff823211c8 "Fatal machine check",
    final=final@entry=0xffff88813ac9eec0, exp=<optimized out>)
    at arch/x86/kernel/cpu/mce/core.c:380
#5  0xffffffff8103863b in mce_reign () at
arch/x86/kernel/cpu/mce/core.c:1042
#6  0xffffffff81aa682f in mce_end (order=order@entry=1)
    at arch/x86/kernel/cpu/mce/core.c:1175
#7  0xffffffff81aa6d57 in do_machine_check (regs=0xfffffe00000beef8)
    at arch/x86/kernel/cpu/mce/core.c:1567
#8  0xffffffffc0495167 in raise_exception (m=m@entry=0xffff88813ad9ef40,
    pregs=<optimized out>) at arch/x86/kernel/cpu/mce/inject.c:152
#9  0xffffffffc0495e7f in mce_raise_notify (cmd=<optimized out>,
regs=<optimized out>)
    at arch/x86/kernel/cpu/mce/inject.c:168
#10 0xffffffff810204b8 in nmi_handle ()
#11 0xffffffff81aa5e62 in default_do_nmi
(regs=regs@entry=0xfffffe00000beef8)
    at arch/x86/kernel/nmi.c:335
#12 0xffffffff81aa608d in exc_nmi (regs=0xfffffe00000beef8) at
arch/x86/kernel/nmi.c:517
#13 0xffffffff81c014e8 in asm_exc_nmi () at arch/x86/entry/entry_64.S:1440


>
>> and the processor blocks delivery of subsequent
>> NMIs until the next execution of the IRET instruction.
>>
>> When the panic CPU takes long time in the panic processing route,
> I'm still unclear on the order of events here. It sounds like
>
> 1. MCE injected
> 2. panic
> 3. kdump gets loaded
>
> If that is the case, then I presume the flow is:
>
> mce_panic -> panic -> __crash_kexec()
>
> Yes?

Yes, exactly. The following procedure is like:

panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image
-> EFI loader -> secondary kernel

>
> If so, then we should make sure we have *exited* #MC context before calling
> panic() and not have to add hacks like this one of adding an empty NMI handler.
>
> But I'm only speculating as it is hard to make sense of all this text.
>
> Thx.
>

2023-01-10 13:23:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote:
> From: Zeng Heng <[email protected]>
> Date: Tue, 10 Jan 2023 18:27:45 +0800
> Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
>
> If kdump is enabled, when using mce_inject to inject errors, EFI

Why does "EFI" matter here? Any boot loader would do...

> boot loader would decompress & load second kernel for saving the

s/&/and/

> vmcore file.
>
> For normal errors that is fine.

Useless sentence.

> However, in the MCE case, the panic
> CPU that firstly enters into mce_panic() is running within NMI
> interrupt context,

"#MC context" it is non-maskable but that's not "NMI interrupt context"

> and the processor blocks delivery of subsequent
> NMIs until the next execution of the IRET instruction.
>
> When the panic CPU takes long time in the panic processing route,

I'm still unclear on the order of events here. It sounds like

1. MCE injected
2. panic
3. kdump gets loaded

If that is the case, then I presume the flow is:

mce_panic -> panic -> __crash_kexec()

Yes?

If so, then we should make sure we have *exited* #MC context before calling
panic() and not have to add hacks like this one of adding an empty NMI handler.

But I'm only speculating as it is hard to make sense of all this text.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-10 13:23:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote:
> mce is registered on NMI handler by inject_init().

That's a handler for the NMI raised by raise_mce(). That's for the injection
case, which is simulated. If you're fixing the injection case, then surely not
with a bogus boot NMI handler.

> Yes, exactly. The following procedure is like:
>
> panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image ->
> EFI loader -> secondary kernel

I'm doubtful now as you're injecting errors so you're not really in #MC context
but in this contrived context which is actually an NMI one. So we need to think
about how to fix this case.

Certainly not with an empty NMI handler...

Regardless, we should do

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7832a69d170e..57fe376ed049 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -286,6 +286,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);

so that we not run kexec in #MC context.

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-10 13:24:17

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/10 20:34, Ingo Molnar wrote:
> * Borislav Petkov <[email protected]> wrote:
>
>>> mce_panic -> panic -> __crash_kexec()
>>>
>>> Yes?
>>>
>>> If so, then we should make sure we have *exited* #MC context before calling
>>> panic() and not have to add hacks like this one of adding an empty NMI handler.
>>>
>>> But I'm only speculating as it is hard to make sense of all this text.
>> IOW, does this help?
>>
>> ---
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 7832a69d170e..55437d8a4fad 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>> if (panic_timeout == 0)
>> panic_timeout = mca_cfg.panic_timeout;
>> panic(msg);
>> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);

I'm willing to test any patch provided, but the panic() is never return
and the

mce_wrmsrl() would be never called. Am I wrong?

B.R.,

Zeng Heng

> So your suggestion was to exit MC context 'before' the panic() call - but
> the patch calls it 'after' - was that intentional?
>
> Thanks,
>
> Ingo

2023-01-10 13:39:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 01:34:35PM +0100, Ingo Molnar wrote:
> So your suggestion was to exit MC context 'before' the panic() call - but
> the patch calls it 'after' - was that intentional?

Bah, you're right - it should be:

mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
panic();

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-10 14:19:00

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/10 20:57, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote:
>> mce is registered on NMI handler by inject_init().
> That's a handler for the NMI raised by raise_mce(). That's for the injection
> case, which is simulated. If you're fixing the injection case, then surely not
> with a bogus boot NMI handler.

OK, mce-injection is the simulated one.

>
>> Yes, exactly. The following procedure is like:
>>
>> panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image ->
>> EFI loader -> secondary kernel
> I'm doubtful now as you're injecting errors so you're not really in #MC context
> but in this contrived context which is actually an NMI one. So we need to think
> about how to fix this case.
>
> Certainly not with an empty NMI handler...
>
> Regardless, we should do
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7832a69d170e..57fe376ed049 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -286,6 +286,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> if (!fake_panic) {
> if (panic_timeout == 0)
> panic_timeout = mca_cfg.panic_timeout;
> +
> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> panic(msg);
> } else
> pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
>
> so that we not run kexec in #MC context.
>
> Hmmm.

I don't have ready test case for real MCE to verify whether it has
exited #MC context before panic() or not.

In mce-inject case that based on NMI, it doesn't work as mentioned indeed.

B.R.,

Zeng Heng

2023-01-10 15:00:46

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/10 20:01, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> * Zeng Heng <[email protected]> wrote:
>>
>>> +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
>>> +{
>>> + /* ignore */
>>> +}
>>> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
>>> index 6debb816e83d..b169c9728d52 100644
>>> --- a/arch/x86/boot/compressed/idt_64.c
>>> +++ b/arch/x86/boot/compressed/idt_64.c
>>> @@ -60,6 +60,7 @@ void load_stage2_idt(void)
>>> {
>>> boot_idt_desc.address = (unsigned long)boot_idt;
>>>
>>> + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
>>> set_idt_entry(X86_TRAP_PF, boot_page_fault);
>> So it's a bit sad to install a dummy handler that does nothing, while
>> something clearly sent an NMI and expects an intelligent reaction - OTOH
>> the unexpected NMIs from from watchdog during a kdump clearly make things
>> worse by crashing the bootup.
>>
>> Anyway, I cannot think of a better response here that the boot loading code
>> could do either, so I've applied your fix to tip:x86/boot.
> BTW., the changelog had very poor quality, and the patch added no comments
> to explain the presence of the dummy NMI.
>
> The -v2 version below should address most of those problems.
>
> Thanks,
>
> Ingo

Thanks for your work.

B.R.,

Zeng Heng

> =============>
> From: Zeng Heng <[email protected]>
> Date: Tue, 10 Jan 2023 18:27:45 +0800
> Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
>
> If kdump is enabled, when using mce_inject to inject errors, EFI
> boot loader would decompress & load second kernel for saving the
> vmcore file.
>
> For normal errors that is fine. However, in the MCE case, the panic
> CPU that firstly enters into mce_panic() is running within NMI
> interrupt context, and the processor blocks delivery of subsequent
> NMIs until the next execution of the IRET instruction.
>
> When the panic CPU takes long time in the panic processing route,
> and causes the watchdog timeout, at this moment, the processor
> already receives NMI interrupt in the background.
>
> In the reproducer sequence below, panic CPU would run into EFI loader
> and raise page fault exception (like visiting `vidmem` variable
> when attempting to call debug_putstr()), the CPU would execute IRET
> instruction when it exits from the page fault handler.
>
> But the loader never registers handler for NMI vector in IDT,
> lack of vector handler would cause reboot, which interrupts
> kdump procedure and fails to save the vmcore file.
>
> Here is steps to reproduce the above issue (it's sporadic):
>
> 1. # cat uncorrected
> CPU 1 BANK 4
> STATUS uncorrected 0xc0
> MCGSTATUS EIPV MCIP
> ADDR 0x1234
> RIP 0xdeadbabe
> RAISINGCPU 0
> MCGCAP SER CMCI TES 0x6
> 2. # modprobe mce_inject
> 3. # mce-inject uncorrected
>
> For increasing the probability of reproduction of this issue, there are
> two ways to increase the probability of the bug:
>
> 1. modify the threshold value of watchdog (increase NMI frequency);
> 2. and/or add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro;
>
> Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> Signed-off-by: Zeng Heng <[email protected]>
> [ Tidy up changelog, add comments. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> ---
> arch/x86/boot/compressed/ident_map_64.c | 12 ++++++++++++
> arch/x86/boot/compressed/idt_64.c | 1 +
> arch/x86/boot/compressed/idt_handlers_64.S | 1 +
> arch/x86/boot/compressed/misc.h | 1 +
> 4 files changed, 15 insertions(+)
>
> diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
> index d4a314cc50d6..cbfdefcf9657 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -379,3 +379,15 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
> */
> kernel_add_identity_map(address, end);
> }
> +
> +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> + /*
> + * Default boot loader placeholder fault handler - there's no real
> + * kernel running yet, so there's not much we can do - but NMIs
> + * can arrive in a kdump scenario, for example by the NMI watchdog.
> + *
> + * Not having any handler would cause the CPU to silently reboot,
> + * so we do the second-worst thing here and ignore the NMI.
> + */
> +}
> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> index 6debb816e83d..b169c9728d52 100644
> --- a/arch/x86/boot/compressed/idt_64.c
> +++ b/arch/x86/boot/compressed/idt_64.c
> @@ -60,6 +60,7 @@ void load_stage2_idt(void)
> {
> boot_idt_desc.address = (unsigned long)boot_idt;
>
> + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault);
> set_idt_entry(X86_TRAP_PF, boot_page_fault);
>
> #ifdef CONFIG_AMD_MEM_ENCRYPT
> diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
> index 22890e199f5b..2aef8e1b515b 100644
> --- a/arch/x86/boot/compressed/idt_handlers_64.S
> +++ b/arch/x86/boot/compressed/idt_handlers_64.S
> @@ -69,6 +69,7 @@ SYM_FUNC_END(\name)
> .text
> .code64
>
> +EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0
> EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1
>
> #ifdef CONFIG_AMD_MEM_ENCRYPT
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 62208ec04ca4..d89d3f8417f6 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { }
> #endif
>
> /* IDT Entry Points */
> +void boot_nmi_fault(void);
> void boot_page_fault(void);
> void boot_stage1_vc(void);
> void boot_stage2_vc(void);

2023-01-10 15:01:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote:
> And here is the context of mce-inject:
>
> #0  relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55
> #1  0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400)
>     at arch/x86/kernel/machine_kexec_64.c:391

Before we continue with this any further: are you doing this "exercise" in
qemu/kvm and nothing of that is happening on real hardware?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-10 16:24:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes

On Tue, Jan 10, 2023 at 01:57:05PM +0100, Borislav Petkov wrote:
> I'm doubtful now as you're injecting errors so you're not really in #MC context
> but in this contrived context which is actually an NMI one. So we need to think
> about how to fix this case.

I did some more thinking:

*if* this really is a real issue - and not some silly qemu games - then
native_machine_crash_shutdown() does all the cleanup before the kdump kernel is
started.

Any NMI clearing, maybe using iret_to_self() etc, #MC resetting etc should
happen there and not anywhere else.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-11 04:23:47

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/10 22:53, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote:
>> And here is the context of mce-inject:
>>
>> #0  relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55
>> #1  0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400)
>>     at arch/x86/kernel/machine_kexec_64.c:391
> Before we continue with this any further: are you doing this "exercise" in
> qemu/kvm and nothing of that is happening on real hardware?

Real hardware and QEMU both can reproduce the issue.

Here is description about NMI from Intel 64 and IA-32 Architectures
Software Developer's Manual in chapter 6.7.1:

While an NMI interrupt handler is executing, the processor blocks
delivery of subsequent NMIs until the next execution of the IRET
instruction. This blocking of NMIs prevents nested execution of the NMI
handler. It is recommended that the NMI interrupt handler be accessed
through an interrupt gate to disable maskable hardware interrupts (see
Section 6.8.1, “Masking Maskable Hardware Interrupts”).

2023-01-12 02:26:44

by Zeng Heng

[permalink] [raw]
Subject: Re: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes


On 2023/1/11 0:09, Borislav Petkov wrote:
> On Tue, Jan 10, 2023 at 01:57:05PM +0100, Borislav Petkov wrote:
>> I'm doubtful now as you're injecting errors so you're not really in #MC context
>> but in this contrived context which is actually an NMI one. So we need to think
>> about how to fix this case.
> I did some more thinking:
>
> *if* this really is a real issue - and not some silly qemu games - then
> native_machine_crash_shutdown() does all the cleanup before the kdump kernel is
> started.
>
> Any NMI clearing, maybe using iret_to_self() etc, #MC resetting etc should
> happen there and not anywhere else.

You mean native_machine_crash_shutdown() should cleanup the NMI
interrupt status

before enter kexec?


But how about the watchdog raise NMI interrupt after
native_machine_crash_shutdown()

or mce_wrmsrl(MSR_IA32_MCG_STATUS, 0) or any else cleanup function?


B.R.,

Zeng Heng