2022-01-24 19:24:03

by Kirill A. Shutemov

Subject: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to unmapped pages (EPT violation)

In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard-to-debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause problems: returning from the exception handler
with IRET re-enables NMIs, and a nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory or
MMIO regions, and do not use #VE-triggering MSRs, instructions, or CPUID
leaves that might generate #VE. The VMM can remove memory from the TD at
any point, but access to unaccepted (or missing) private memory leads to
VM termination, not to a #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (double fault)
exception is delivered to the guest, which will result in an oops.

Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP by calling ve_raise_fault()
directly. ve_raise_fault() is similar to the #GP handler: it sends
SIGSEGV for user-mode faults, dies (via die_addr()) on unrecoverable
kernel-mode faults, and notifies debuggers and other die-chain users.
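
As an illustration (a hypothetical user-space snippet, not part of this
patch): a CPUID leaf in the hypervisor range (0x40000000+) is typically
reflected to the guest kernel as a #VE, so with only this patch applied
a program like the one below, run inside a TDX guest, ends up in
tdx_virt_exception_user() and gets SIGSEGV until later patches add CPUID
#VE handling. Which leaves actually cause #VE depends on the TDX module
configuration.

#include <stdio.h>

static void cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx,
                  unsigned int *ecx, unsigned int *edx)
{
        asm volatile("cpuid"
                     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                     : "0" (leaf), "2" (0));
}

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* Hypervisor-range leaf; typically causes a #VE inside a TDX guest. */
        cpuid(0x40000000, &eax, &ebx, &ecx, &edx);
        printf("%x %x %x %x\n", eax, ebx, ecx, edx);
        return 0;
}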

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
 arch/x86/include/asm/idtentry.h |   4 ++
 arch/x86/include/asm/tdx.h      |  21 ++++++
 arch/x86/kernel/idt.c           |   3 +
 arch/x86/kernel/tdx.c           |  63 ++++++++++++++++++
 arch/x86/kernel/traps.c         | 110 ++++++++++++++++++++++++++++++++
 5 files changed, 201 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5107a4d9ba8f..d17143290f0a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -4,6 +4,7 @@
#define _ASM_X86_TDX_H

#include <linux/init.h>
+#include <asm/ptrace.h>

#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "
@@ -40,6 +41,22 @@ struct tdx_hypercall_output {
u64 r15;
};

+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software-only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+        u64 exit_reason;
+        u64 exit_qual;
+        /* Guest Linear (virtual) Address */
+        u64 gla;
+        /* Guest Physical Address */
+        u64 gpa;
+        u32 instr_len;
+        u32 instr_info;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);
@@ -53,6 +70,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
u64 r15, struct tdx_hypercall_output *out);

+bool tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d40b6df51e26..5a5b25f9c4d3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,9 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>

+/* TDX module Call Leaf IDs */
+#define TDX_GET_VEINFO 3
+
static bool tdx_guest_detected __ro_after_init;

/*
@@ -32,6 +35,66 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}

+bool tdx_get_ve_info(struct ve_info *ve)
+{
+        struct tdx_module_output out;
+
+        /*
+         * NMIs and machine checks are suppressed. Before this point any
+         * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+         * additional #VEs are permitted (but it is expected not to
+         * happen unless kernel panics).
+         */
+        if (__tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out))
+                return false;
+
+        ve->exit_reason = out.rcx;
+        ve->exit_qual = out.rdx;
+        ve->gla = out.r8;
+        ve->gpa = out.r9;
+        ve->instr_len = lower_32_bits(out.r10);
+        ve->instr_info = upper_32_bits(out.r10);
+
+        return true;
+}
+
+/*
+ * Handle the user-initiated #VE.
+ *
+ * For example, executing the CPUID instruction from user space
+ * is a valid case and hence the resulting #VE has to be handled.
+ *
+ * For disallowed or invalid #VE just return failure.
+ */
+static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+        pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+        return false;
+}
+
+/* Handle the kernel #VE */
+static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+        pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+        return false;
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+        bool ret;
+
+        if (user_mode(regs))
+                ret = tdx_virt_exception_user(regs, ve);
+        else
+                ret = tdx_virt_exception_kernel(regs, ve);
+
+        /* After successful #VE handling, move the IP */
+        if (ret)
+                regs->ip += ve->instr_len;
+
+        return ret;
+}
+
bool is_tdx_guest(void)
{
return tdx_guest_detected;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..428504535912 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1212,6 +1213,115 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+        struct task_struct *tsk = current;
+
+        if (user_mode(regs)) {
+                tsk->thread.error_code = error_code;
+                tsk->thread.trap_nr = X86_TRAP_VE;
+                show_signal(tsk, SIGSEGV, "", VE_FAULT_STR, regs, error_code);
+                force_sig(SIGSEGV);
+                return;
+        }
+
+        /*
+         * Attempt to recover from #VE exception failure without
+         * triggering OOPS (useful for MSR read/write failures)
+         */
+        if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+                return;
+
+        tsk->thread.error_code = error_code;
+        tsk->thread.trap_nr = X86_TRAP_VE;
+
+        /*
+         * To be potentially processing a kprobe fault and to trust the result
+         * from kprobe_running(), it should be non-preemptible.
+         */
+        if (!preemptible() && kprobe_running() &&
+            kprobe_fault_handler(regs, X86_TRAP_VE))
+                return;
+
+        /* Notify about #VE handling failure, useful for debugger hooks */
+        if (notify_die(DIE_GPF, VE_FAULT_STR, regs, error_code,
+                       X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
+                return;
+
+        /* Trigger OOPS and panic */
+        die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ * * Specific instructions (WBINVD, for example)
+ * * Specific MSR accesses
+ * * Specific CPUID leaf accesses
+ * * Access to unmapped pages (EPT violation)
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted.
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard-to-debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause problems: returning from the exception handler
+ * with IRET re-enables NMIs, and a nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory or
+ * MMIO regions, and do not use #VE-triggering MSRs, instructions, or
+ * CPUID leaves that might generate #VE. The VMM can remove memory from
+ * the TD at any point, but access to unaccepted (or missing) private
+ * memory leads to VM termination, not to a #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (double fault)
+ * exception is delivered to the guest, which will result in an oops.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+        struct ve_info ve;
+        bool ret;
+
+        /*
+         * NMIs/Machine-checks/Interrupts will be in a disabled state
+         * till TDGETVEINFO TDCALL is executed. This ensures that VE
+         * info cannot be overwritten by a nested #VE.
+         */
+        ret = tdx_get_ve_info(&ve);
+
+        cond_local_irq_enable(regs);
+
+        if (ret)
+                ret = tdx_handle_virt_exception(regs, &ve);
+        /*
+         * If tdx_handle_virt_exception() could not process
+         * it successfully, treat it as #GP(0) and handle it.
+         */
+        if (!ret)
+                ve_raise_fault(regs, 0);
+
+        cond_local_irq_disable(regs);
+}
+
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.34.1


2022-02-02 10:08:04

by Sean Christopherson

Subject: Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest

On Tue, Feb 01, 2022, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index df0fa695bb09..1da074123c16 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> > */
> > INTG(X86_TRAP_PF, asm_exc_page_fault),
> > #endif
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> > +#endif
> >
> > +bool tdx_get_ve_info(struct ve_info *ve)
> > +{
> > + struct tdx_module_output out;
> > +
> > + /*
> > + * NMIs and machine checks are suppressed. Before this point any
> > + * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
> > + * additional #VEs are permitted (but it is expected not to
> > + * happen unless kernel panics).
>
> I really do not understand that comment. #NMI and #MC are suppressed
> according to the above. How long are they suppressed and what's the
> mechanism? Are they unblocked on return from __tdx_module_call() ?

TDX_GET_VEINFO is a call into the TDX module to get the data from the #VE info
struct pointed at by the VMCS. Doing TDX_GET_VEINFO also clears the "valid" flag
in the struct. It's basically a CMPXCHG on the #VE info struct, except that it
routes through the TDX module.

The TDX module treats virtual NMIs as blocked if the #VE valid flag is set, i.e.
refuses to inject NMI until the guest does TDX_GET_VEINFO to retrieve the info for
the last #VE.

I don't understand the blurb about #MC. Unless things have changed, the TDX module
doesn't support injecting #MC into the guest.

> What prevents a nested #VE? If it happens what makes it fatal? Is it
> converted to a #DF or detected by software?

A #VE that would occur is morphed to a #DF by the TDX module if the #VE info valid
flag is already set. But nested #VE should work, so long as the nested #VE happens
after TDX_GET_VEINFO.
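
In other words, conceptually (just a sketch of the semantics described
above, not actual TDX module code; all names here are made up):

/* Conceptual model only -- not TDX module code. */
#include <stdbool.h>

struct ve_info_slot {
        bool valid;                     /* the "#VE info valid" flag */
        unsigned long exit_reason;
};

enum injected { INJECT_VE, INJECT_DF };

/* What the TDX module conceptually does when a #VE condition arises. */
static enum injected deliver_ve(struct ve_info_slot *slot, unsigned long reason)
{
        if (slot->valid)
                return INJECT_DF;       /* previous #VE info not consumed yet */

        slot->valid = true;             /* also blocks virtual NMI injection */
        slot->exit_reason = reason;
        return INJECT_VE;
}

/* What TDX_GET_VEINFO conceptually does: read and clear, like a cmpxchg. */
static bool get_veinfo(struct ve_info_slot *slot, unsigned long *reason)
{
        if (!slot->valid)
                return false;

        *reason = slot->exit_reason;
        slot->valid = false;            /* NMI injection is unblocked again */
        return true;
}

I.e. a #VE taken after TDX_GET_VEINFO just sets the flag again; only a #VE
taken while the flag is still set gets escalated to a #DF.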

> Also I do not understand what the last sentence tries to tell me. If the
> suppression of #NMI and #MC is lifted on return from tdcall then both
> can be delivered immediately afterwards, right?

Yep, NMI can be injected on the instruction following the TDCALL.

Something like this?

/*
 * Retrieve the #VE info from the TDX module, which also clears the "#VE
 * valid" flag. This must be done before anything else as any #VE that
 * occurs while the valid flag is set, i.e. before the previous #VE info
 * was consumed, is morphed to a #DF by the TDX module. Note, the TDX
 * module also treats virtual NMIs as inhibited if the #VE valid flag is
 * set, e.g. so that NMI=>#VE will not result in a #DF.
 */

> I assume the additional #VE is triggered by software or a bug in the
> kernel.

I'm curious if that will even hold true, there's sooo much stuff that can happen
from NMI context. I don't see much value in speculating what will/won't happen
after retrieving the #VE info.

2022-02-02 18:32:44

by Thomas Gleixner

Subject: Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index df0fa695bb09..1da074123c16 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> */
> INTG(X86_TRAP_PF, asm_exc_page_fault),
> #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
>
> +bool tdx_get_ve_info(struct ve_info *ve)
> +{
> + struct tdx_module_output out;
> +
> + /*
> + * NMIs and machine checks are suppressed. Before this point any
> + * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
> + * additional #VEs are permitted (but it is expected not to
> + * happen unless kernel panics).

I really do not understand that comment. #NMI and #MC are suppressed
according to the above. How long are they suppressed and what's the
mechanism? Are they unblocked on return from __tdx_module_call() ?

What prevents a nested #VE? If it happens what makes it fatal? Is it
converted to a #DF or detected by software?

Also I do not understand what the last sentence tries to tell me. If the
suppression of #NMI and #MC is lifted on return from tdcall then both
can be delivered immediately afterwards, right?

I assume the additional #VE is triggered by software or a bug in the
kernel.

Confused.

> + */
> + if (__tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out))
> + return false;
> +
> + ve->exit_reason = out.rcx;
> + ve->exit_qual = out.rdx;
> + ve->gla = out.r8;
> + ve->gpa = out.r9;
> + ve->instr_len = lower_32_bits(out.r10);
> + ve->instr_info = upper_32_bits(out.r10);
> +
> + return true;
> +}
> +
> +/*
> + * Handle the user initiated #VE.
> + *
> + * For example, executing the CPUID instruction from user space
> + * is a valid case and hence the resulting #VE has to be handled.
> + *
> + * For dis-allowed or invalid #VE just return failure.
> + */
> +static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +/* Handle the kernel #VE */
> +static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> +{
> + bool ret;
> +
> + if (user_mode(regs))
> + ret = tdx_virt_exception_user(regs, ve);
> + else
> + ret = tdx_virt_exception_kernel(regs, ve);
> +
> + /* After successful #VE handling, move the IP */
> + if (ret)
> + regs->ip += ve->instr_len;
> +
> + return ret;
> +}
> +
> bool is_tdx_guest(void)
> {
> return tdx_guest_detected;
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c9d566dcf89a..428504535912 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/vdso.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
> @@ -1212,6 +1213,115 @@ DEFINE_IDTENTRY(exc_device_not_available)
> }
> }
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +
> +#define VE_FAULT_STR "VE fault"
> +
> +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> +{
> + struct task_struct *tsk = current;
> +
> + if (user_mode(regs)) {
> + tsk->thread.error_code = error_code;
> + tsk->thread.trap_nr = X86_TRAP_VE;
> + show_signal(tsk, SIGSEGV, "", VE_FAULT_STR, regs, error_code);
> + force_sig(SIGSEGV);
> + return;
> + }
> +
> + /*
> + * Attempt to recover from #VE exception failure without
> + * triggering OOPS (useful for MSR read/write failures)
> + */
> + if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
> + return;
> +
> + tsk->thread.error_code = error_code;
> + tsk->thread.trap_nr = X86_TRAP_VE;
> +
> + /*
> + * To be potentially processing a kprobe fault and to trust the result
> + * from kprobe_running(), it should be non-preemptible.
> + */
> + if (!preemptible() && kprobe_running() &&
> + kprobe_fault_handler(regs, X86_TRAP_VE))
> + return;
> +
> + /* Notify about #VE handling failure, useful for debugger hooks */
> + if (notify_die(DIE_GPF, VE_FAULT_STR, regs, error_code,
> + X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
> + return;
> +
> + /* Trigger OOPS and panic */
> + die_addr(VE_FAULT_STR, regs, error_code, 0);

This is pretty much a copy of the #GP handling. So why not consolidate
this properly?

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -559,6 +559,36 @@ static bool fixup_iopl_exception(struct
return true;
}

+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr, long error_code,
+                                    const char *str)
+{
+        if (fixup_exception(regs, trapnr, error_code, 0))
+                return true;
+
+        current->thread.error_code = error_code;
+        current->thread.trap_nr = trapnr;
+
+        /*
+         * To be potentially processing a kprobe fault and to trust the result
+         * from kprobe_running(), we have to be non-preemptible.
+         */
+        if (!preemptible() && kprobe_running() &&
+            kprobe_fault_handler(regs, trapnr))
+                return true;
+
+        return notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV) == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr, long error_code,
+                                   const char *str)
+{
+        current->thread.error_code = error_code;
+        current->thread.trap_nr = trapnr;
+        show_signal(current, SIGSEGV, "", str, regs, error_code);
+        force_sig(SIGSEGV);
+}
+
DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
{
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
@@ -587,34 +617,14 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr
                 if (fixup_iopl_exception(regs))
                         goto exit;

-                tsk->thread.error_code = error_code;
-                tsk->thread.trap_nr = X86_TRAP_GP;
-
                 if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
                         goto exit;

-                show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
-                force_sig(SIGSEGV);
+                gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
                 goto exit;
         }

-        if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-                goto exit;
-
-        tsk->thread.error_code = error_code;
-        tsk->thread.trap_nr = X86_TRAP_GP;
-
-        /*
-         * To be potentially processing a kprobe fault and to trust the result
-         * from kprobe_running(), we have to be non-preemptible.
-         */
-        if (!preemptible() &&
-            kprobe_running() &&
-            kprobe_fault_handler(regs, X86_TRAP_GP))
-                goto exit;
-
-        ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
-        if (ret == NOTIFY_STOP)
+        if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
                 goto exit;

         if (error_code)

which makes this:

static void ve_raise_fault(struct pt_regs *regs, long error_code)
{
        if (user_mode(regs)) {
                gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
                return;
        }

        if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
                return;

        die_addr(VE_FAULT_STR, regs, error_code, 0);
}

Hmm?

> +/*
> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> + * specific guest actions which may happen in either user space or the
> + * kernel:
> + *
> + * * Specific instructions (WBINVD, for example)
> + * * Specific MSR accesses
> + * * Specific CPUID leaf accesses
> + * * Access to unmapped pages (EPT violation)
> + *
> + * In the settings that Linux will run in, virtualization exceptions are
> + * never generated on accesses to normal, TD-private memory that has been
> + * accepted.
> + *
> + * Syscall entry code has a critical window where the kernel stack is not
> + * yet set up. Any exception in this window leads to hard to debug issues
> + * and can be exploited for privilege escalation. Exceptions in the NMI
> + * entry code also cause issues. Returning from the exception handler with
> + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> + *
> + * For these reasons, the kernel avoids #VEs during the syscall gap and
> + * the NMI entry code. Entry code paths do not access TD-shared memory,
> + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> + * that might generate #VE.

How is that enforced or validated? What checks for a violation of that
assumption?

Thanks,

tglx

2022-02-13 15:47:25

by Kirill A. Shutemov

Subject: Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest

On Tue, Feb 01, 2022 at 10:02:41PM +0100, Thomas Gleixner wrote:
> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + * * Specific instructions (WBINVD, for example)
> > + * * Specific MSR accesses
> > + * * Specific CPUID leaf accesses
> > + * * Access to unmapped pages (EPT violation)
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> > + *
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE.
>
> How is that enforced or validated? What checks for a violation of that
> assumption?

Hm. I think we would have to rely on code audit for it.

Entry code has no #VE-inducing things: no port I/O, CPUID, HLT,
MONITOR/MWAIT, WBINVD/INVD, VMCALL.

There's a single MSR read of MSR_GS_BASE in paranoid_entry(), but it
doesn't trigger #VE either.

Another possible source of #VE is shared memory. If somebody tricks the
kernel into accessing shared memory from entry code, we have a bigger
problem to deal with than a #VE in the syscall gap.

Or do you have something more strict than code audit in mind? I don't see
it.

--
Kirill A. Shutemov