2022-03-17 03:20:40

by Kirill A. Shutemov

Subject: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard-to-debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
in the NMI entry code. Entry code paths do not access TD-shared memory
or MMIO regions, and do not use #VE-triggering MSRs, instructions, or
CPUID leaves. The VMM can remove memory from the TD at any point, but
access to unaccepted (or missing) private memory leads to VM
termination, not to a #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. This prevents #VE nesting until the kernel reads
the #VE info.

TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else
because any #VE that occurs while the valid flag is set escalates to a
#DF by the TDX module, which results in an oops.

Virtual NMIs are inhibited while the #VE valid flag is set. An NMI will
not be delivered until TDGETVEINFO is called.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP, by calling
ve_raise_fault() directly. ve_raise_fault() is similar to the #GP
handler: it is responsible for sending a SIGSEGV to userspace, for the
CPU die path, and for notifying debuggers and other die-chain users.
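
For context on where this series is headed: tdx_handle_virt_exception()
below is a stub that rejects everything, and later patches fill it in
by forwarding each #VE class to the VMM via a TDVMCALL. As an
illustration only, an MSR-read #VE ends up being handled roughly like
this (the TDX_HCALL_HAS_OUTPUT flag, the register layout and the helper
name follow the later patches, not this one):

static bool read_msr(struct pt_regs *regs)
{
	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = EXIT_REASON_MSR_READ,	/* hypercall sub-function */
		.r12 = regs->cx,		/* MSR index */
	};

	/*
	 * Ask the VMM to emulate the RDMSR. A non-zero return code
	 * means the MSR read is rejected; the caller then falls back
	 * to #GP-style handling.
	 */
	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
		return false;

	/* The emulated MSR value comes back in r11. */
	regs->ax = lower_32_bits(args.r11);
	regs->dx = upper_32_bits(args.r11);
	return true;
}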

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 31 +++++++++++++
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/tdx.h | 21 +++++++++
arch/x86/kernel/idt.c | 3 ++
arch/x86/kernel/traps.c | 81 +++++++++++++++++++++++++++++++++
5 files changed, 140 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 069a93d2d331..c886a9e022f1 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -10,6 +10,7 @@

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
+#define TDX_GET_VEINFO 3

/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
@@ -66,6 +67,36 @@ static void get_info(unsigned int *gpa_width)
*gpa_width = out.rcx & GENMASK(5, 0);
}

+void tdx_get_ve_info(struct ve_info *ve)
+{
+ struct tdx_module_output out;
+
+ /*
+ * Called during #VE handling to retrieve the #VE info from the
+ * TDX module.
+ *
+ * This should called done early in #VE handling. A "nested"
+ * #VE which occurs before this will raise a #DF and is not
+ * recoverable.
+ */
+ tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
+
+ /* Interrupts and NMIs can be delivered again. */
+ ve->exit_reason = out.rcx;
+ ve->exit_qual = out.rdx;
+ ve->gla = out.r8;
+ ve->gpa = out.r9;
+ ve->instr_len = lower_32_bits(out.r10);
+ ve->instr_info = upper_32_bits(out.r10);
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+ pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+
+ return false;
+}
+
void __init tdx_early_init(void)
{
unsigned int gpa_width;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
#endif

+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
+#endif
+
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ccdfe7de24a2..f4f5da1febbe 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,7 @@

#include <linux/bits.h>
#include <linux/init.h>
+#include <asm/ptrace.h>

#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "
@@ -55,6 +56,22 @@ struct tdx_hypercall_args {
u64 r15;
};

+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+ u64 exit_reason;
+ u64 exit_qual;
+ /* Guest Linear (virtual) Address */
+ u64 gla;
+ /* Guest Physical Address */
+ u64 gpa;
+ u32 instr_len;
+ u32 instr_info;
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);
@@ -66,6 +83,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
/* Used to request services from the VMM */
u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);

+void tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
*/
INTG(X86_TRAP_PF, asm_exc_page_fault),
#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
};

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 59ed14d8c53f..9ab8570c48f5 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -62,6 +62,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/vdso.h>
+#include <asm/tdx.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -1275,6 +1276,86 @@ DEFINE_IDTENTRY(exc_device_not_available)
}
}

+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+ if (user_mode(regs)) {
+ gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+ return;
+ }
+
+ if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
+ return;
+
+ die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ * * Specific instructions (WBINVD, for example)
+ * * Specific MSR accesses
+ * * Specific CPUID leaf accesses
+ * * Access to specific guest physical addresses
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted.
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard-to-debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * in the NMI entry code. Entry code paths do not access TD-shared memory
+ * or MMIO regions, and do not use #VE-triggering MSRs, instructions, or
+ * CPUID leaves. The VMM can remove memory from the TD at any point, but
+ * access to unaccepted (or missing) private memory leads to VM
+ * termination, not to a #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. This prevents #VE nesting until the kernel reads
+ * the #VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (double
+ * fault) is delivered to the guest, which will result in an oops.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+ struct ve_info ve;
+
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * until the TDGETVEINFO TDCALL is executed. This ensures that the
+ * #VE info cannot be overwritten by a nested #VE.
+ */
+ tdx_get_ve_info(&ve);
+
+ cond_local_irq_enable(regs);
+
+ /*
+ * If tdx_handle_virt_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (!tdx_handle_virt_exception(regs, &ve))
+ ve_raise_fault(regs, 0);
+
+ cond_local_irq_disable(regs);
+}
+
+#endif
+
#ifdef CONFIG_X86_32
DEFINE_IDTENTRY_SW(iret_error)
{
--
2.34.1


2022-03-17 06:05:04

by Thomas Gleixner

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Wed, Mar 16 2022 at 05:08, Kirill A. Shutemov wrote:
> +void tdx_get_ve_info(struct ve_info *ve)
> +{
> + struct tdx_module_output out;
> +
> + /*
> + * Called during #VE handling to retrieve the #VE info from the
> + * TDX module.
> + *
> + * This should called done early in #VE handling. A "nested"


... has to be called early ..

> + * #VE which occurs before this will raise a #DF and is not
> + * recoverable.
> + */
> + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
> +
> + /* Interrupts and NMIs can be delivered again. */

Please put a new line between this comment and the code below because
they are completely unrelated. Also interrupts cannot be delivered here
because this code runs with interrupts disabled ...

And I rather have this comment above the tdx_module_call() invocation:

* recoverable.
*
* <Useful comment about VE info and NMIs>

When I reviewed this last time, Sean provided a very concise comment for
this.

*/
tdx_module_call(...);

/* Transfer the output parameters */

> + ve->exit_reason = out.rcx;
....

Hmm?

> +/*
> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> + * specific guest actions which may happen in either user space or the
> + * kernel:
> + *
> + * * Specific instructions (WBINVD, for example)
> + * * Specific MSR accesses
> + * * Specific CPUID leaf accesses
> + * * Access to specific guest physical addresses
> + *
> + * In the settings that Linux will run in, virtualization exceptions are
> + * never generated on accesses to normal, TD-private memory that has been
> + * accepted.
> + *
> + * Syscall entry code has a critical window where the kernel stack is not
> + * yet set up. Any exception in this window leads to hard-to-debug issues
> + * and can be exploited for privilege escalation. Exceptions in the NMI
> + * entry code also cause issues. Returning from the exception handler with
> + * IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.
> + *
> + * For these reasons, the kernel avoids #VEs during the syscall gap and
> + * in the NMI entry code. Entry code paths do not access TD-shared memory
> + * or MMIO regions, and do not use #VE-triggering MSRs, instructions, or
> + * CPUID leaves.

I asked that before:

"How is that enforced or validated? What checks for a violation of that
assumption?"

This is still exactly the same comment which is based on testing which
did not yet explode in your face, right?

So what's the point of this blurb? Create expectations which are not
accountable?

The point is that any #VE in such a code path is fatal and you better
come up with some reasonable explanation why this is not the case in
those code pathes and how a potential violation of that assumption might
be detected especially in rarely used corner cases. If such a violation
is not detectable by audit, CI, static code analysis or whatever then
document the consequences instead of pretending that the problem does
not exist and the kernel is perfect today and forever.

Thanks,

tglx

2022-03-17 19:42:07

by Kirill A. Shutemov

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17, 2022 at 01:48:54AM +0100, Thomas Gleixner wrote:
> On Wed, Mar 16 2022 at 05:08, Kirill A. Shutemov wrote:
> Hmm?

Does the changed version below address your concerns?

void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;

/*
* Called during #VE handling to retrieve the #VE info from the
* TDX module.
*
* This has to be called early in #VE handling. A "nested" #VE which
* occurs before this will raise a #DF and is not recoverable.
*
* The call retrieves the #VE info from the TDX module, which also
* clears the "#VE valid" flag. This must be done before anything else
* because any #VE that occurs while the valid flag is set will lead to
* #DF.
*
* Note, the TDX module treats virtual NMIs as inhibited if the #VE
* valid flag is set. It means that NMI=>#VE will not result in a #DF.
*/
tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

/* Transfer the output parameters */
ve->exit_reason = out.rcx;
ve->exit_qual = out.rdx;
ve->gla = out.r8;
ve->gpa = out.r9;
ve->instr_len = lower_32_bits(out.r10);
ve->instr_info = upper_32_bits(out.r10);
}

> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + * * Specific instructions (WBINVD, for example)
> > + * * Specific MSR accesses
> > + * * Specific CPUID leaf accesses
> > + * * Access to specific guest physical addresses
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> > + *
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard-to-debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * in the NMI entry code. Entry code paths do not access TD-shared memory
> > + * or MMIO regions, and do not use #VE-triggering MSRs, instructions, or
> > + * CPUID leaves.
>
> I asked that before:
>
> "How is that enforced or validated? What checks for a violation of that
> assumption?"
>
> This is still exactly the same comment which is based on testing which
> did not yet explode in your face, right?

[ Disclaimer: I have limited understanding of the entry code complexity
and may miss some crucial details. But I try my best. ]

Yes, it is the same comment, but it is based on code audit, not only on
testing.

I claim that the kernel does not do anything that can possibly trigger
a #VE where the kernel cannot deal with it:

- in the syscall entry code before the kernel stack is set up (a few
instructions at the beginning of entry_SYSCALL_64())

- in the NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
+ for an NMI from user mode, before the switch to the thread stack
+ for an NMI from kernel mode, up to end_repeat_nmi

After those points #VE is safe.

> So what's the point of this blurb? Create expectations which are not
> accountable?

I don't have such intentions.

> The point is that any #VE in such a code path is fatal and you better
> come up with some reasonable explanation why this is not the case in
> those code pathes and how a potential violation of that assumption might
> be detected especially in rarely used corner cases. If such a violation
> is not detectable by audit, CI, static code analysis or whatever then
> document the consequences instead of pretending that the problem does
> not exist and the kernel is perfect today and forever.

It is detectable by audit. The critical windows are very limited and
are located in the highly scrutinized entry code. But, yes, I cannot
guarantee that this code will be perfect forever.

Consequences of #VE in these critical windows are mentioned in the
comment:

Any exception in this window leads to hard-to-debug issues and can
be exploited for privilege escalation.

I have a hard time understanding what I have to change here. Do you
want the details of the audit to be documented? Should the consequences
of a #VE at the wrong point be made more prominent in the comment?

--
Kirill A. Shutemov

2022-03-17 20:10:16

by Thomas Gleixner

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17 2022 at 20:33, Kirill A. Shutemov wrote:
> On Thu, Mar 17, 2022 at 01:48:54AM +0100, Thomas Gleixner wrote:
>> On Wed, Mar 16 2022 at 05:08, Kirill A. Shutemov wrote:
>> Hmm?
>
> Does the changed version below address your concerns?
>
> void tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out;
>
> /*
> * Called during #VE handling to retrieve the #VE info from the
> * TDX module.
> *
> * This has to be called early in #VE handling. A "nested" #VE which
> * occurs before this will raise a #DF and is not recoverable.
> *
> * The call retrieves the #VE info from the TDX module, which also
> * clears the "#VE valid" flag. This must be done before anything else
> * because any #VE that occurs while the valid flag is set will lead to
> * #DF.
> *
> * Note, the TDX module treats virtual NMIs as inhibited if the #VE
> * valid flag is set. It means that NMI=>#VE will not result in a #DF.
> */
> tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
>
> /* Transfer the output parameters */
> ve->exit_reason = out.rcx;
> ve->exit_qual = out.rdx;
> ve->gla = out.r8;
> ve->gpa = out.r9;
> ve->instr_len = lower_32_bits(out.r10);
> ve->instr_info = upper_32_bits(out.r10);
> }

Nice.

>> The point is that any #VE in such a code path is fatal and you better
>> come up with some reasonable explanation why this is not the case in
>> those code pathes and how a potential violation of that assumption might
>> be detected especially in rarely used corner cases. If such a violation
>> is not detectable by audit, CI, static code analysis or whatever then
>> document the consequences instead of pretending that the problem does
>> not exist and the kernel is perfect today and forever.
>
> It is detectable by audit. The critical windows are very limited and
> are located in the highly scrutinized entry code. But, yes, I cannot
> guarantee that this code will be perfect forever.

Fair enough.

> Consequences of #VE in these critical windows are mentioned in the
> comment:
>
> Any exception in this window leads to hard-to-debug issues and can
> be exploited for privilege escalation.
>
> I have a hard time understanding what I have to change here. Do you
> want the details of the audit to be documented? Should the consequences
> of a #VE at the wrong point be made more prominent in the comment?

So having something like this in the comment would be helpful:

*
* The entry code has been audited carefully for following these
* expectations. Changes in the entry code have to be audited for
* correctness vs. this aspect. #VE in these places will cause
* [an instant kernel panic | whatever | fill the blanks ]
*

Thanks,

tglx

2022-03-17 20:55:53

by Peter Zijlstra

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17, 2022 at 08:33:54PM +0300, Kirill A. Shutemov wrote:

> [ Disclaimer: I have limited understanding of the entry code complexity
> and may miss some crucial details. But I try my best. ]
>
> Yes, it is the same comment, but it is based on code audit, not only on
> testing.
>
> I claim that the kernel does not do anything that can possibly trigger
> a #VE where the kernel cannot deal with it:
>
> - in the syscall entry code before the kernel stack is set up (a few
> instructions at the beginning of entry_SYSCALL_64())
>
> - in the NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
> + for an NMI from user mode, before the switch to the thread stack
> + for an NMI from kernel mode, up to end_repeat_nmi
>
> After those points #VE is safe.

In what way is it guaranteed that #VE isn't raised in those places? What
does an auditor / future coder looking to change things need to
consider to keep this so?

From vague memories #VE can be raised on any memop, loading the stack
address in the syscall-gap is a memop. What makes that special? Can we
get a comment _there_ to explain how this is safe such that we can keep
it so?

Same for the NMI path I suppose.

2022-03-17 21:04:10

by Dave Hansen

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On 3/17/22 13:21, Peter Zijlstra wrote:
> From vague memories #VE can be raised on any memop, loading the stack
> address in the syscall-gap is a memop. What makes that special? Can we
> get a comment _there_ to explain how this is safe such that we can keep
> it so?

#GP and #PF can be raised from any memop too. But, we know that if we
only touch normal, valid kernel mappings, we don't have to worry about them.

The memop rules to avoid #VE are basically the same, except they also
include "TD-shared" memory. But "normal" kernel memory is never shared,
because shared memory is under the control of the untrusted hypervisor.
The kernel trusts shared memory like it trusts userspace-mapped memory.

The kernel would be insane to, for instance, put its stack in
userspace-writable memory. The same goes for TD-shared memory.

In the end, I'm asserting that we don't really have to be any more
careful to avoid a #VE on a memory access than we are to avoid #GP and #PF.
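
To underline the point: shared memory is nothing the kernel wanders
into by accident; it has to be converted explicitly. A sketch of what
that opt-in looks like from a driver's point of view (this uses the
existing set_memory_decrypted() API; the TDX backend for it lands later
in the series, and the helper name here is made up for illustration):

#include <linux/gfp.h>
#include <linux/set_memory.h>

/* Allocate a page and explicitly share it with the (untrusted) VMM. */
static void *alloc_vmm_shared_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return NULL;

	/* Flip the GPA shared bit for the page and notify the VMM. */
	if (set_memory_decrypted((unsigned long)page_address(page), 1)) {
		/*
		 * The conversion state is unknown on failure. Leak the
		 * page rather than return a possibly-shared page to
		 * the allocator.
		 */
		return NULL;
	}

	return page_address(page);
}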

The TDX rules are *much* nicer than SEV. They're also a lot nicer on
TDX _now_ than they used to be. There are a few stubborn people at
Intel who managed to add some drops of sanity to the architecture.

2022-03-18 14:28:53

by Kirill A. Shutemov

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Fri, Mar 18, 2022 at 11:55:11AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 17, 2022 at 01:32:07PM -0700, Dave Hansen wrote:
>
> > The TDX rules are *much* nicer than SEV. They're also a lot nicer on
> > TDX _now_ than they used to be. There are a few stubborn people at
> > Intel who managed to add some drops of sanity to the architecture.
>
> Right; that is saner than it used to be. I have definite memories that
> pages could be taken back by the TDX thing and would need
> re-authentication. A pool of 'fixed' pages was talked about. I'm glad to
> hear all that is gone.

Right, the VMM can still pull memory from the guest at any point, but a
reference to such memory from the guest would lead not to a #VE as
before, but to TD termination.

--
Kirill A. Shutemov

2022-03-18 15:25:19

by Peter Zijlstra

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17, 2022 at 01:32:07PM -0700, Dave Hansen wrote:

> The TDX rules are *much* nicer than SEV. They're also a lot nicer on
> TDX _now_ than they used to be. There are a few stubborn people at
> Intel who managed to add some drops of sanity to the architecture.

Right; that is saner than it used to be. I have definite memories that
pages could be taken back by the TDX thing and would need
re-authentication. A pool of 'fixed' pages was talked about. I'm glad to
hear all that is gone.

2022-03-18 17:52:11

by Thomas Gleixner

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17 2022 at 21:21, Peter Zijlstra wrote:
> On Thu, Mar 17, 2022 at 08:33:54PM +0300, Kirill A. Shutemov wrote:
>> - in the NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
>> + for an NMI from user mode, before the switch to the thread stack
>> + for an NMI from kernel mode, up to end_repeat_nmi
>>
>> After those points #VE is safe.
>
> In what way is it guaranteed that #VE isn't raised in those places? What
> does an auditor / future coder looking to change things need to
> consider to keep this so?
>
> From vague memories #VE can be raised on any memop, loading the stack
> address in the syscall-gap is a memop. What makes that special? Can we
> get a comment _there_ to explain how this is safe such that we can keep
> it so?
>
> Same for the NMI path I suppose.

#VE is raised by HLT, CPUID, I/O port accesses, MSR reads/writes, and EPT violations

So in the hairy places:

- HLT: No business
- I/O Ports: That would be outright stupid to use

- CPUID: Should never be used - Emphasis on should :)
- MSRs: Same as CPUID

- EPT: Well....
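
For reference, that list maps almost one-to-one onto the dispatch the
later patches in this series add on top of the stub from this patch
(handler names are taken from the eventual code and are assumptions at
this point in the series):

static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
	switch (ve->exit_reason) {
	case EXIT_REASON_HLT:
		return handle_halt();
	case EXIT_REASON_MSR_READ:
		return read_msr(regs);
	case EXIT_REASON_MSR_WRITE:
		return write_msr(regs);
	case EXIT_REASON_CPUID:
		return handle_cpuid(regs);
	case EXIT_REASON_EPT_VIOLATION:
		return handle_mmio(regs, ve);
	case EXIT_REASON_IO_INSTRUCTION:
		return handle_io(regs, ve->exit_qual);
	default:
		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
		return false;
	}
}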

Thanks,

Thomas

2022-03-19 14:22:09

by Kirill A. Shutemov

Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Fri, Mar 18, 2022 at 03:19:34PM +0100, Thomas Gleixner wrote:
> On Thu, Mar 17 2022 at 21:21, Peter Zijlstra wrote:
> > On Thu, Mar 17, 2022 at 08:33:54PM +0300, Kirill A. Shutemov wrote:
> >> - in the NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
> >> + for an NMI from user mode, before the switch to the thread stack
> >> + for an NMI from kernel mode, up to end_repeat_nmi
> >>
> >> After those points #VE is safe.
> >
> > In what way is it guaranteed that #VE isn't raised in those places? What
> > does an auditor / future coder looking to change things need to
> > consider to keep this so?
> >
> > From vague memories #VE can be raised on any memop, loading the stack
> > address in the syscall-gap is a memop. What makes that special? Can we
> > get a comment _there_ to explain how this is safe such that we can keep
> > it so?
> >
> > Same for the NMI path I suppose.
>
> #VE is raised by HLT, CPUID, I/O-Port access, MSR read/write, EPT violations
>
> So in the hairy places:
>
> - HLT: No business
> - I/O Ports: That would be outright stupid to use
>
> - CPUID: Should never be used - Emphasis on should :)
> - MSRs: Same as CPUID
>
> - EPT: Well....

An EPT violation may result in a #VE only on shared memory. If the
entry code touches shared memory, we have a bigger problem than the
syscall gap.

--
Kirill A. Shutemov