2023-01-27 03:56:48

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
"perf" executes. "./perf record sleep 20" is an example.

Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
and actual reading of DB7 (which in turn causes #VC) every function is
inlined and no stack frame is created (?). Replacing __always_inline with
noinline in local_db_save() or native_get_debugreg() fixes the problem.

The crash does not happen with CONFIG_PARAVIRT_XXL as in this case
paravirt_get_debugreg() is used and there is an indirect call via
PVOP_CALL1(). It has not been noticed as the most configs have CONFIG_XEN
enabled which enables CONFIG_PARAVIRT_XXL.

This happens with the recent tip/master, here is my test kernel
and the config:
https://github.com/aik/linux/commits/debug_dr7

Found this while testing DebugSwap (which also fixes the crash as
it eliminates DB7 interception == #VC):
https://lore.kernel.org/all/[email protected]

Define local_db_save_exc_nmi() to demostrate that the problem better.

Why is this crash happening and how to fix that? I am still reading
the assembly but was hoping for a shortcut here :) Thanks,

aik-Standard-PC-i440FX-PIIX-1996 login: [ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
[ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
[ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[ 15.775323] RIP: 0010:error_entry+0x17/0x140
[ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
[ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
[ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
[ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
[ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
[ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
[ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
[ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
[ 15.775342] Call Trace:
[ 15.775352] <NMI>
[ 15.775355] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775357] ? exc_page_fault+0x11/0x120
[ 15.775360] ? asm_exc_page_fault+0x22/0x30
[ 15.775364] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775365] ? exc_page_fault+0x11/0x120
[ 15.775367] ? asm_exc_page_fault+0x22/0x30
[ 15.775368] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775369] ? exc_page_fault+0x11/0x120
[ 15.775371] ? asm_exc_page_fault+0x22/0x30
[ 15.775372] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775373] ? exc_page_fault+0x11/0x120
[ 15.775374] ? asm_exc_page_fault+0x22/0x30
[ 15.775375] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775376] ? exc_page_fault+0x11/0x120
[ 15.775378] ? asm_exc_page_fault+0x22/0x30
[ 15.775379] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775380] ? exc_page_fault+0x11/0x120
[ 15.775381] ? asm_exc_page_fault+0x22/0x30
[ 15.775382] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775383] ? exc_page_fault+0x11/0x120
[ 15.775384] ? asm_exc_page_fault+0x22/0x30
[ 15.775385] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775386] ? exc_page_fault+0x11/0x120
[ 15.775388] ? asm_exc_page_fault+0x22/0x30
[ 15.775389] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775390] ? exc_page_fault+0x11/0x120
[ 15.775391] ? asm_exc_page_fault+0x22/0x30
[ 15.775392] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775393] ? exc_page_fault+0x11/0x120
[ 15.775395] ? asm_exc_page_fault+0x22/0x30
[ 15.775396] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775397] ? exc_page_fault+0x11/0x120
[ 15.775398] ? asm_exc_page_fault+0x22/0x30
[ 15.775399] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775400] ? exc_page_fault+0x11/0x120
[ 15.775401] ? asm_exc_page_fault+0x22/0x30
[ 15.775403] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775404] ? exc_page_fault+0x11/0x120
[ 15.775405] ? asm_exc_page_fault+0x22/0x30
[ 15.775406] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775407] ? exc_page_fault+0x11/0x120
[ 15.775408] ? asm_exc_page_fault+0x22/0x30
[ 15.775409] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775410] ? exc_page_fault+0x11/0x120
[ 15.775412] ? asm_exc_page_fault+0x22/0x30
[ 15.775413] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775414] ? exc_page_fault+0x11/0x120
[ 15.775415] ? asm_exc_page_fault+0x22/0x30
[ 15.775416] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775420] ? exc_page_fault+0x11/0x120
[ 15.775421] ? asm_exc_page_fault+0x22/0x30
[ 15.775422] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775423] ? exc_page_fault+0x11/0x120
[ 15.775425] ? asm_exc_page_fault+0x22/0x30
[ 15.775426] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775427] ? exc_page_fault+0x11/0x120
[ 15.775431] ? asm_exc_page_fault+0x22/0x30
[ 15.775432] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775433] ? exc_page_fault+0x11/0x120
[ 15.775435] ? asm_exc_page_fault+0x22/0x30
[ 15.775436] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775437] ? exc_page_fault+0x11/0x120
[ 15.775438] ? asm_exc_page_fault+0x22/0x30
[ 15.775439] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775440] ? exc_page_fault+0x11/0x120
[ 15.775441] ? asm_exc_page_fault+0x22/0x30
[ 15.775442] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775443] ? exc_page_fault+0x11/0x120
[ 15.775445] ? asm_exc_page_fault+0x22/0x30
[ 15.775446] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775447] ? exc_page_fault+0x11/0x120
[ 15.775448] ? asm_exc_page_fault+0x22/0x30
[ 15.775449] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775450] ? exc_page_fault+0x11/0x120
[ 15.775454] ? asm_exc_page_fault+0x22/0x30
[ 15.775455] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775456] ? exc_page_fault+0x11/0x120
[ 15.775458] ? asm_exc_page_fault+0x22/0x30
[ 15.775459] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775460] ? exc_page_fault+0x11/0x120
[ 15.775461] ? asm_exc_page_fault+0x22/0x30
[ 15.775462] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775463] ? exc_page_fault+0x11/0x120
[ 15.775465] ? asm_exc_page_fault+0x22/0x30
[ 15.775466] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775467] ? exc_page_fault+0x11/0x120
[ 15.775468] ? asm_exc_page_fault+0x22/0x30
[ 15.775469] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775470] ? exc_page_fault+0x11/0x120
[ 15.775471] ? asm_exc_page_fault+0x22/0x30
[ 15.775472] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775473] ? exc_page_fault+0x11/0x120
[ 15.775475] ? asm_exc_page_fault+0x22/0x30
[ 15.775476] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775477] ? exc_page_fault+0x11/0x120
[ 15.775478] ? asm_exc_page_fault+0x22/0x30
[ 15.775482] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775483] ? exc_page_fault+0x11/0x120
[ 15.775485] ? asm_exc_page_fault+0x22/0x30
[ 15.775486] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775487] ? exc_page_fault+0x11/0x120
[ 15.775488] ? asm_exc_page_fault+0x22/0x30
[ 15.775490] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775491] ? exc_page_fault+0x11/0x120
[ 15.775492] ? asm_exc_page_fault+0x22/0x30
[ 15.775493] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775494] ? exc_page_fault+0x11/0x120
[ 15.775495] ? asm_exc_page_fault+0x22/0x30
[ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775497] ? exc_page_fault+0x11/0x120
[ 15.775499] ? asm_exc_page_fault+0x22/0x30
[ 15.775500] ? check_preemption_disabled+0x8/0xe0
[ 15.775502] ? __sev_es_ist_enter+0x13/0x100
[ 15.775503] ? exc_nmi+0x10e/0x150
[ 15.775505] ? end_repeat_nmi+0x16/0x67
[ 15.775506] ? asm_exc_double_fault+0x30/0x30
[ 15.775507] ? asm_exc_double_fault+0x30/0x30
[ 15.775508] ? asm_exc_double_fault+0x30/0x30
[ 15.775509] </NMI>
[ 15.775509] <#VC>
[ 15.775510] ? __show_regs.cold+0x18e/0x23d
[ 15.775511] </#VC>
[ 15.775511] <#DF>
[ 15.775512] ? __die_body.cold+0x1a/0x1f
[ 15.775513] ? die+0x26/0x40
[ 15.775517] ? handle_stack_overflow+0x44/0x60
[ 15.775518] ? exc_double_fault+0x14b/0x180
[ 15.775519] ? asm_exc_double_fault+0x1f/0x30
[ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775521] ? asm_exc_page_fault+0x9/0x30
[ 15.775522] ? error_entry+0x17/0x140
[ 15.775523] </#DF>
[ 15.775523] WARNING: stack recursion on stack type 6
[ 15.775524] Modules linked in: msr efivarfs
[ 15.837935] ---[ end trace 0000000000000000 ]---

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
arch/x86/kernel/nmi.c | 2 +-
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d950612f..396e2ea114dc 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
set_debugreg(dr7, 7);
}

+/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
+static noinline unsigned long native_get_debugreg7(void)
+{
+ unsigned long dr7;
+ asm("mov %%db7, %0" :"=r" (dr7));
+ return dr7;
+}
+
+static __always_inline unsigned long local_db_save_exc_nmi(void)
+{
+ unsigned long dr7;
+
+ if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
+ return 0;
+
+ dr7 = native_get_debugreg7();
+ dr7 &= ~0x400; /* architecturally set bit */
+ if (dr7)
+ set_debugreg(0, 7);
+ /*
+ * Ensure the compiler doesn't lower the above statements into
+ * the critical section; disabling breakpoints late would not
+ * be good.
+ */
+ barrier();
+
+ return dr7;
+}
+
#ifdef CONFIG_CPU_SUP_AMD
extern void set_dr_addr_mask(unsigned long mask, int dr);
#else
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index cec0bfa3bc04..400b5b6b74f6 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
*/
sev_es_ist_enter(regs);

- this_cpu_write(nmi_dr7, local_db_save());
+ this_cpu_write(nmi_dr7, local_db_save_exc_nmi());

irq_state = irqentry_nmi_enter(regs);

--
2.39.1



2023-01-27 09:08:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
> "perf" executes. "./perf record sleep 20" is an example.
>
> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
> and actual reading of DB7 (which in turn causes #VC) every function is
> inlined

Very much on purpose.

> and no stack frame is created (?).

Silly compilers ;-) I think you can force it to by using inline asm with
a rsp dependency or somesuch.

> Replacing __always_inline with
> noinline in local_db_save() or native_get_debugreg() fixes the problem.

It will create others.

> Why is this crash happening and how to fix that? I am still reading
> the assembly but was hoping for a shortcut here :) Thanks,

Welcome to the wonderful shit show that is x86 exceptions :/

I thought sev_es_*() is supposed to fix this. Joerg, any clue?

> aik-Standard-PC-i440FX-PIIX-1996 login: [ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
> [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
> [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
> [ 15.775323] RIP: 0010:error_entry+0x17/0x140
> [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
> [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
> [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
> [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
> [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
> [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
> [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
> [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
> [ 15.775342] Call Trace:
> [ 15.775352] <NMI>

<snip>

> [ 15.775495] ? asm_exc_page_fault+0x22/0x30
> [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22
> [ 15.775497] ? exc_page_fault+0x11/0x120
> [ 15.775499] ? asm_exc_page_fault+0x22/0x30
> [ 15.775500] ? check_preemption_disabled+0x8/0xe0
> [ 15.775502] ? __sev_es_ist_enter+0x13/0x100
> [ 15.775503] ? exc_nmi+0x10e/0x150
> [ 15.775505] ? end_repeat_nmi+0x16/0x67
> [ 15.775506] ? asm_exc_double_fault+0x30/0x30
> [ 15.775507] ? asm_exc_double_fault+0x30/0x30
> [ 15.775508] ? asm_exc_double_fault+0x30/0x30
> [ 15.775509] </NMI>
> [ 15.775509] <#VC>
> [ 15.775510] ? __show_regs.cold+0x18e/0x23d
> [ 15.775511] </#VC>
> [ 15.775511] <#DF>
> [ 15.775512] ? __die_body.cold+0x1a/0x1f
> [ 15.775513] ? die+0x26/0x40
> [ 15.775517] ? handle_stack_overflow+0x44/0x60
> [ 15.775518] ? exc_double_fault+0x14b/0x180
> [ 15.775519] ? asm_exc_double_fault+0x1f/0x30
> [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22
> [ 15.775521] ? asm_exc_page_fault+0x9/0x30
> [ 15.775522] ? error_entry+0x17/0x140
> [ 15.775523] </#DF>
> [ 15.775523] WARNING: stack recursion on stack type 6
> [ 15.775524] Modules linked in: msr efivarfs
> [ 15.837935] ---[ end trace 0000000000000000 ]---
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
> arch/x86/kernel/nmi.c | 2 +-
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
> index b049d950612f..396e2ea114dc 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
> set_debugreg(dr7, 7);
> }
>
> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
> +static noinline unsigned long native_get_debugreg7(void)
> +{
> + unsigned long dr7;
> + asm("mov %%db7, %0" :"=r" (dr7));
> + return dr7;
> +}
> +
> +static __always_inline unsigned long local_db_save_exc_nmi(void)
> +{
> + unsigned long dr7;
> +
> + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
> + return 0;
> +
> + dr7 = native_get_debugreg7();
> + dr7 &= ~0x400; /* architecturally set bit */
> + if (dr7)
> + set_debugreg(0, 7);
> + /*
> + * Ensure the compiler doesn't lower the above statements into
> + * the critical section; disabling breakpoints late would not
> + * be good.
> + */
> + barrier();
> +
> + return dr7;
> +}

This is broken, and building with DEBUG_ENTRY=y would've told you.

> +
> #ifdef CONFIG_CPU_SUP_AMD
> extern void set_dr_addr_mask(unsigned long mask, int dr);
> #else
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index cec0bfa3bc04..400b5b6b74f6 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
> */
> sev_es_ist_enter(regs);
>
> - this_cpu_write(nmi_dr7, local_db_save());
> + this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>
> irq_state = irqentry_nmi_enter(regs);

So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
stack frame, that has an actual call in it (although admittedly
conditional).

2023-01-27 10:37:54

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
> Welcome to the wonderful shit show that is x86 exceptions :/
>
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?

Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
sev_es_* function calls in the NMI path are for re-enabling NMI and
adjusting the #VC IST stack to allow nested VCs.

Alexey, can you try to get a more stable backtrace? For example by
building the kernel with frame pointers?

Regards,

Joerg

2023-01-27 12:05:22

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)



On 27/1/23 21:37, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
>> Welcome to the wonderful shit show that is x86 exceptions :/
>>
>> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
>
> Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
> sev_es_* function calls in the NMI path are for re-enabling NMI and
> adjusting the #VC IST stack to allow nested VCs.
>
> Alexey, can you try to get a more stable backtrace? For example by
> building the kernel with frame pointers?

Do you mean these guys (below)?

aik@aik-Standard-PC-i440FX-PIIX-1996:~$ grep FRAME_POINTER
/boot/config-$(uname -r)
CONFIG_SCHED_OMIT_FRAME_POINTER=y

CONFIG_FRAME_POINTER=y

CONFIG_UNWINDER_FRAME_POINTER=y


Here is the complete output of that VM (200k so not in the email):

https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Note that the original long backtrace appears more than once, I just did
not copy all of that in the first email.

--
Alexey

2023-01-27 12:18:41

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)



On 27/1/23 20:08, Peter Zijlstra wrote:
> On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
>> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
>> "perf" executes. "./perf record sleep 20" is an example.
>>
>> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
>> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
>> and actual reading of DB7 (which in turn causes #VC) every function is
>> inlined
>
> Very much on purpose.
>
>> and no stack frame is created (?).
>
> Silly compilers ;-) I think you can force it to by using inline asm with
> a rsp dependency or somesuch.
>
>> Replacing __always_inline with
>> noinline in local_db_save() or native_get_debugreg() fixes the problem.
>
> It will create others.
>
>> Why is this crash happening and how to fix that? I am still reading
>> the assembly but was hoping for a shortcut here :) Thanks,
>
> Welcome to the wonderful shit show that is x86 exceptions :/
>
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
>
>> aik-Standard-PC-i440FX-PIIX-1996 login: [ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
>> [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
>> [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
>> [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
>> [ 15.775323] RIP: 0010:error_entry+0x17/0x140
>> [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
>> [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
>> [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
>> [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
>> [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
>> [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
>> [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
>> [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
>> [ 15.775342] Call Trace:
>> [ 15.775352] <NMI>
>
> <snip>
>
>> [ 15.775495] ? asm_exc_page_fault+0x22/0x30
>> [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22
>> [ 15.775497] ? exc_page_fault+0x11/0x120
>> [ 15.775499] ? asm_exc_page_fault+0x22/0x30
>> [ 15.775500] ? check_preemption_disabled+0x8/0xe0
>> [ 15.775502] ? __sev_es_ist_enter+0x13/0x100
>> [ 15.775503] ? exc_nmi+0x10e/0x150
>> [ 15.775505] ? end_repeat_nmi+0x16/0x67
>> [ 15.775506] ? asm_exc_double_fault+0x30/0x30
>> [ 15.775507] ? asm_exc_double_fault+0x30/0x30
>> [ 15.775508] ? asm_exc_double_fault+0x30/0x30
>> [ 15.775509] </NMI>
>> [ 15.775509] <#VC>
>> [ 15.775510] ? __show_regs.cold+0x18e/0x23d
>> [ 15.775511] </#VC>
>> [ 15.775511] <#DF>
>> [ 15.775512] ? __die_body.cold+0x1a/0x1f
>> [ 15.775513] ? die+0x26/0x40
>> [ 15.775517] ? handle_stack_overflow+0x44/0x60
>> [ 15.775518] ? exc_double_fault+0x14b/0x180
>> [ 15.775519] ? asm_exc_double_fault+0x1f/0x30
>> [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22
>> [ 15.775521] ? asm_exc_page_fault+0x9/0x30
>> [ 15.775522] ? error_entry+0x17/0x140
>> [ 15.775523] </#DF>
>> [ 15.775523] WARNING: stack recursion on stack type 6
>> [ 15.775524] Modules linked in: msr efivarfs
>> [ 15.837935] ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
>> arch/x86/kernel/nmi.c | 2 +-
>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>> index b049d950612f..396e2ea114dc 100644
>> --- a/arch/x86/include/asm/debugreg.h
>> +++ b/arch/x86/include/asm/debugreg.h
>> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
>> set_debugreg(dr7, 7);
>> }
>>
>> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
>> +static noinline unsigned long native_get_debugreg7(void)
>> +{
>> + unsigned long dr7;
>> + asm("mov %%db7, %0" :"=r" (dr7));
>> + return dr7;
>> +}
>> +
>> +static __always_inline unsigned long local_db_save_exc_nmi(void)
>> +{
>> + unsigned long dr7;
>> +
>> + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
>> + return 0;
>> +
>> + dr7 = native_get_debugreg7();
>> + dr7 &= ~0x400; /* architecturally set bit */
>> + if (dr7)
>> + set_debugreg(0, 7);
>> + /*
>> + * Ensure the compiler doesn't lower the above statements into
>> + * the critical section; disabling breakpoints late would not
>> + * be good.
>> + */
>> + barrier();
>> +
>> + return dr7;
>> +}
>
> This is broken, and building with DEBUG_ENTRY=y would've told you.


Huh, good to know. Is this it telling me so?

vmlinux.o: warning: objtool: exc_nmi+0x73: call to
native_get_debugreg7() leaves .noinstr.text section



>> +
>> #ifdef CONFIG_CPU_SUP_AMD
>> extern void set_dr_addr_mask(unsigned long mask, int dr);
>> #else
>> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
>> index cec0bfa3bc04..400b5b6b74f6 100644
>> --- a/arch/x86/kernel/nmi.c
>> +++ b/arch/x86/kernel/nmi.c
>> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>> */
>> sev_es_ist_enter(regs);
>>
>> - this_cpu_write(nmi_dr7, local_db_save());
>> + this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>>
>> irq_state = irqentry_nmi_enter(regs);
>
> So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> stack frame, that has an actual call in it (although admittedly
> conditional).

Is not the frame gone when sev_es_ist_enter() (which does not get
inlined as per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
") returned?



--
Alexey

2023-01-27 12:42:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Fri, Jan 27, 2023 at 11:13:38PM +1100, Alexey Kardashevskiy wrote:

> > This is broken, and building with DEBUG_ENTRY=y would've told you.
>
>
> Huh, good to know. Is this it telling me so?
>
> vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7()
> leaves .noinstr.text section
>

Yep. The ramification of all that is that by calling non-noinstr code
(double negative, iow, regular instrumented code) is that you can end up
in the tracers/*SAN/breakpoints etc.. code -- something we're very much
not ready for at this point.

> > > +
> > > #ifdef CONFIG_CPU_SUP_AMD
> > > extern void set_dr_addr_mask(unsigned long mask, int dr);
> > > #else
> > > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> > > index cec0bfa3bc04..400b5b6b74f6 100644
> > > --- a/arch/x86/kernel/nmi.c
> > > +++ b/arch/x86/kernel/nmi.c
> > > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
> > > */
> > > sev_es_ist_enter(regs);
> > > - this_cpu_write(nmi_dr7, local_db_save());
> > > + this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
> > > irq_state = irqentry_nmi_enter(regs);
> >
> > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> > stack frame, that has an actual call in it (although admittedly
> > conditional).
>
> Is not the frame gone when sev_es_ist_enter() (which does not get inlined as
> per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
> ") returned?

Well, returning would consume the callframe, but the stack setup of the
caller should remain. Let me go stare at some asm.

2023-01-27 12:59:20

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> Here is the complete output of that VM (200k so not in the email):
>
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Thanks. So looking at the code in the traces:

Code starting with the faulting instruction
===========================================
0: 65 48 8b 04 25 c0 db mov %gs:0x2dbc0,%rax
7: 02 00
9: 48 8b 80 a8 08 00 00 mov 0x8a8(%rax),%rax
10: 0f 0d 48 70 prefetchw 0x70(%rax)
14: e8 .byte 0xe8
15: 82 .byte 0x82

I think the fault in the page-fault handler happens here:

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
irqentry_state_t state;

prefetchw(&current->mm->mmap_lock); <--- Here

To be precise, it faults while dereferencing current. That means that
GS_BASE is likely broken, need to find out why...

This at least explains why it page-faults in a loop until the stack
overflows and the guard page is hit.

Regards,

Joerg


2023-01-27 17:25:24

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Okay, I reproduced the problem here and the root cause turned out to be
that the compiler moved the DR7 read instruction before the 5-byte NOP
which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
guaranteed to cause #VC exception stack recursion if the NMI was
triggered on the #VC stack, and that leads to all kinds of undefined
behavior.

Regards,

Joerg

2023-01-28 11:25:16

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)



On 28/1/23 04:25, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
>
> Okay, I reproduced the problem here and the root cause turned out to be
> that the compiler moved the DR7 read instruction before the 5-byte NOP
> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
> guaranteed to cause #VC exception stack recursion if the NMI was
> triggered on the #VC stack, and that leads to all kinds of undefined
> behavior.

Cool!

(out of curiosity) where do you see these NOPs? "objdump -D vmlinux"
does not show any, is this after lifepatching?

Meanwhile, this seems to be doing the right thing:


diff --git a/arch/x86/include/asm/debugreg.h
b/arch/x86/include/asm/debugreg.h
index b049d950612f..687b15297057 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,7 @@ static __always_inline unsigned long
native_get_debugreg(int regno)
asm("mov %%db6, %0" :"=r" (val));
break;
case 7:
- asm("mov %%db7, %0" :"=r" (val));
+ asm volatile ("mov %%db7, %0" :"=r" (val));



--
Alexey

2023-01-28 13:52:30

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Sat, Jan 28, 2023 at 10:24:56PM +1100, Alexey Kardashevskiy wrote:
> (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does
> not show any, is this after lifepatching?

Here is the disassembly of exc_nmi of a kernel built from tip/master
with CONFIG_PARAVIRT=n:

<exc_nmi>:
41 54 push %r12
55 push %rbp
48 89 fd mov %rdi,%rbp
53 push %rbx
0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
65 8b 05 69 66 41 7e mov %gs:0x7e416669(%rip),%eax # 3254c <pcpu_hot+0xc>
48 98 cltq
48 0f a3 05 33 00 2b bt %rax,0x12b0033(%rip) # ffffffff82ecbf20 <__cpu_online_mask>
01
0f 83 c9 00 00 00 jae ffffffff81c1bfbc <exc_nmi+0xec>
65 8b 05 f6 41 40 7e mov %gs:0x7e4041f6(%rip),%eax # 200f0 <nmi_state>
85 c0 test %eax,%eax
0f 85 f8 00 00 00 jne ffffffff81c1bffa <exc_nmi+0x12a>
65 c7 05 e3 41 40 7e movl $0x1,%gs:0x7e4041e3(%rip) # 200f0 <nmi_state>
01 00 00 00
0f 20 d0 mov %cr2,%rax
65 48 89 05 d0 41 40 mov %rax,%gs:0x7e4041d0(%rip) # 200e8 <nmi_cr2>
7e
41 0f 21 fc mov %db7,%r12 <-- here is the DR7 read
0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) <-- here are the NOPS that become a
call to sev_es_ist_enter() in
SEV-ES guests

The DR7 read will cause a #VC exception, switching to the #VC IST stack.
If the NMI was raised while already on the #VC IST stack, this DR7 read
will overwrite the previous stack frame and cause stack recursion, with
all funny side effects.


> diff --git a/arch/x86/include/asm/debugreg.h
> b/arch/x86/include/asm/debugreg.h
> index b049d950612f..687b15297057 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -39,7 +39,7 @@ static __always_inline unsigned long
> native_get_debugreg(int regno)
> asm("mov %%db6, %0" :"=r" (val));
> break;
> case 7:
> - asm("mov %%db7, %0" :"=r" (val));
> + asm volatile ("mov %%db7, %0" :"=r" (val));

Yeah, something like this will be the fix. I am still thinking about
the right place to put the volatile to make it explicit to the situation
we are encountering here (which is SEV-ES specific).

Best would be an explicit barrier in C code between sev_es_ist_enter()
and the DR7 read, but all barriers I tried to far only seem to affect
memory instructions and had no influence on the DR7 read (which is
obviously not considered as a memory read by the compiler).

The best place to put the barrier is in the sev_es_ist_enter() inline
function, right after the static_call to __sev_es_ist_enter().

Regards,

Joerg

2023-01-30 09:17:44

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Sat, Jan 28, 2023 at 02:52:23PM +0100, Joerg Roedel wrote:
> Yeah, something like this will be the fix. I am still thinking about
> the right place to put the volatile to make it explicit to the situation
> we are encountering here (which is SEV-ES specific).
>
> Best would be an explicit barrier in C code between sev_es_ist_enter()
> and the DR7 read, but all barriers I tried to far only seem to affect
> memory instructions and had no influence on the DR7 read (which is
> obviously not considered as a memory read by the compiler).
>
> The best place to put the barrier is in the sev_es_ist_enter() inline
> function, right after the static_call to __sev_es_ist_enter().

Okay, after some investigation I was not able to find a compiler barrier
which affects DR7 read ordering. This leaves us with the only solution
of directly forbidding DR7 register access re-ordering by adding a
volatile to the asm, like you did before.

I will send a fix later today.

Regards,

Joerg

2023-01-30 17:31:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On January 28, 2023 3:24:56 AM PST, Alexey Kardashevskiy <[email protected]> wrote:
>
>
>On 28/1/23 04:25, Joerg Roedel wrote:
>> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
>>
>> Okay, I reproduced the problem here and the root cause turned out to be
>> that the compiler moved the DR7 read instruction before the 5-byte NOP
>> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
>> guaranteed to cause #VC exception stack recursion if the NMI was
>> triggered on the #VC stack, and that leads to all kinds of undefined
>> behavior.
>
>Cool!
>
>(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching?
>
>Meanwhile, this seems to be doing the right thing:
>
>
>diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>index b049d950612f..687b15297057 100644
>--- a/arch/x86/include/asm/debugreg.h
>+++ b/arch/x86/include/asm/debugreg.h
>@@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno)
> asm("mov %%db6, %0" :"=r" (val));
> break;
> case 7:
>- asm("mov %%db7, %0" :"=r" (val));
>+ asm volatile ("mov %%db7, %0" :"=r" (val));
>
>
>

It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is not... %dr6 is the status register!

I believe they should all be volatile (the compiler semantics is that volatile operations are always executed exactly once, in strict program order with respect to any other volatile operations); the real question is if there should also be memory clobbers on %dr6 reads and any %dr write.

2023-01-30 18:05:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

Yeah, as a precaution I think we should make all those volatile. Just in
case.

> I believe they should all be volatile (the compiler semantics is that
> volatile operations are always executed exactly once, in strict
> program order with respect to any other volatile operations); the real
> question is if there should also be memory clobbers on %dr6 reads and
> any %dr write.

Yes, I think so too. From gcc docs:


"6.47.2.1 Volatile
.................

...

Note that the compiler can move even 'volatile asm' instructions
relative to other code, including across jump instructions."

We already have __FORCE_ORDER for exactly things like that.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-31 09:02:32

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

The reason is that on SEV-ES only accesses to DR7 will cause #VC
exceptions, DR0-DR6 are not intercepted.

Regards,

Joerg


2023-01-31 10:37:32

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 31859551393bc00f705cae2e1f9d31b80c62f365
Gitweb: https://git.kernel.org/tip/31859551393bc00f705cae2e1f9d31b80c62f365
Author: Joerg Roedel <[email protected]>
AuthorDate: Tue, 31 Jan 2023 09:57:18 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Tue, 31 Jan 2023 11:26:15 +01:00

x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the
DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter().

This is problematic when running as an SEV-ES guest because in this
environment the DR7 read might cause a #VC exception, and taking #VC
exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run.

The result is stack recursion if the NMI was caused on the #VC IST
stack, because a subsequent #VC exception in the NMI handler will
overwrite the stack frame of the interrupted #VC handler.

As there are no compiler barriers affecting the ordering of DR7
reads/writes, make the accesses to this register volatile, forbidding
the compiler to re-order them.

[ bp: Massage text, make them volatile too, to make sure some
aggressive compiler optimization pass doesn't discard them. ]

Fixes: 315562c9af3d ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Reported-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/debugreg.h | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d95..ff1a924 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,20 @@ static __always_inline unsigned long native_get_debugreg(int regno)
asm("mov %%db6, %0" :"=r" (val));
break;
case 7:
- asm("mov %%db7, %0" :"=r" (val));
+ /*
+ * Apply __FORCE_ORDER to DR7 reads to forbid re-ordering them
+ * with other code.
+ *
+ * This is needed because a DR7 access can cause a #VC exception
+ * when running under SEV-ES. Taking a #VC exception is not a
+ * safe thing to do just anywhere in the entry code and
+ * re-ordering might place the access into an unsafe location.
+ *
+ * This happened in the NMI handler, where the DR7 read was
+ * re-ordered to happen before the call to sev_es_ist_enter(),
+ * causing stack recursion.
+ */
+ asm volatile("mov %%db7, %0" : "=r" (val) : __FORCE_ORDER);
break;
default:
BUG();
@@ -66,8 +79,16 @@ static __always_inline void native_set_debugreg(int regno, unsigned long value)
asm("mov %0, %%db6" ::"r" (value));
break;
case 7:
- asm("mov %0, %%db7" ::"r" (value));
- break;
+ /*
+ * Apply __FORCE_ORDER to DR7 writes to forbid re-ordering them
+ * with other code.
+ *
+ * While is didn't happen with a DR7 write (see the DR7 read
+ * comment above which explains where it happened), add the
+ * __FORCE_ORDER here too to avoid similar problems in the
+ * future.
+ */
+ asm volatile("mov %0, %%db7" ::"r" (value), __FORCE_ORDER); break;
default:
BUG();
}

2023-01-31 11:57:19

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 9d2c7203ffdb846399b82b0660563c89e918c751
Gitweb: https://git.kernel.org/tip/9d2c7203ffdb846399b82b0660563c89e918c751
Author: Joerg Roedel <[email protected]>
AuthorDate: Tue, 31 Jan 2023 09:57:18 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Tue, 31 Jan 2023 12:51:19 +01:00

x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the
DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter().

This is problematic when running as an SEV-ES guest because in this
environment the DR7 read might cause a #VC exception, and taking #VC
exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run.

The result is stack recursion if the NMI was caused on the #VC IST
stack, because a subsequent #VC exception in the NMI handler will
overwrite the stack frame of the interrupted #VC handler.

As there are no compiler barriers affecting the ordering of DR7
reads/writes, make the accesses to this register volatile, forbidding
the compiler to re-order them.

[ bp: Massage text, make them volatile too, to make sure some
aggressive compiler optimization pass doesn't discard them. ]

Fixes: 315562c9af3d ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Reported-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/debugreg.h | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d95..ca97442 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,20 @@ static __always_inline unsigned long native_get_debugreg(int regno)
asm("mov %%db6, %0" :"=r" (val));
break;
case 7:
- asm("mov %%db7, %0" :"=r" (val));
+ /*
+ * Apply __FORCE_ORDER to DR7 reads to forbid re-ordering them
+ * with other code.
+ *
+ * This is needed because a DR7 access can cause a #VC exception
+ * when running under SEV-ES. Taking a #VC exception is not a
+ * safe thing to do just anywhere in the entry code and
+ * re-ordering might place the access into an unsafe location.
+ *
+ * This happened in the NMI handler, where the DR7 read was
+ * re-ordered to happen before the call to sev_es_ist_enter(),
+ * causing stack recursion.
+ */
+ asm volatile("mov %%db7, %0" : "=r" (val) : __FORCE_ORDER);
break;
default:
BUG();
@@ -66,7 +79,16 @@ static __always_inline void native_set_debugreg(int regno, unsigned long value)
asm("mov %0, %%db6" ::"r" (value));
break;
case 7:
- asm("mov %0, %%db7" ::"r" (value));
+ /*
+ * Apply __FORCE_ORDER to DR7 writes to forbid re-ordering them
+ * with other code.
+ *
+ * While is didn't happen with a DR7 write (see the DR7 read
+ * comment above which explains where it happened), add the
+ * __FORCE_ORDER here too to avoid similar problems in the
+ * future.
+ */
+ asm volatile("mov %0, %%db7" ::"r" (value), __FORCE_ORDER);
break;
default:
BUG();

2023-01-31 15:53:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> > not... %dr6 is the status register!
>
> The reason is that on SEV-ES only accesses to DR7 will cause #VC
> exceptions, DR0-DR6 are not intercepted.

I don't think that is technically true. A _well-behaved_ hypervisor will not
intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
architecture enforces that behavior.

2023-01-31 16:00:47

by Joerg Roedel

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> I don't think that is technically true. A _well-behaved_ hypervisor will not
> intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> architecture enforces that behavior.

Not from the hardware architecture side, but the GHCB spec does not
list NAE events for DR0-DR6 accesses, so a guest is not required to
handle them in the VC handler.

Linux under SEV-ES will crash if the HV intercepts debug registers,
except DR7.

Regards,

Joerg

2023-01-31 16:47:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> > I don't think that is technically true. A _well-behaved_ hypervisor will not
> > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> > architecture enforces that behavior.
>
> Not from the hardware architecture side, but the GHCB spec does not
> list NAE events for DR0-DR6 accesses, so a guest is not required to
> handle them in the VC handler.
>
> Linux under SEV-ES will crash if the HV intercepts debug registers,
> except DR7.

Right, I'm just objecting to the wording of "DR0-DR6 are not intercepted". E.g.
from a security perspective, the kernel shouldn't rely on DR0-DR6 to execute
cleanly.