LinuxLists.cc - [PATCH v2 0/5] TDX host: kexec() support

2024-03-19 01:49:10

[permalink] [raw]

Subject: [PATCH v2 0/5] TDX host: kexec() support

Currently kexec() support and TDX host are muturally exclusive in the
Kconfig. This series adds the TDX host kexec support so that they can
work together and can be enabled at the same time in the Kconfig.

v1 -> v2:
- Do unconditional WBINVD during kexec() -- Boris
- Change to cover crash kexec() -- Rick
- Add a new patch (last one) to add a mechanism to reset all TDX private
pages due to having to cover crash kexec().
- Other code improvements -- Dave
- Rebase to latest tip/master.

Hi Dave, Sean, Paolo,

The last patch provides a new mechanism to handle all other TDX private
pages when they become possible to exist, e.g., when KVM is ready to run
TDX guests. It's not mandatory at this stage because currently we only
have PAMT as private pages, but if we agree it's the right way to do then
it can be applied together with rest patches too.

KVM will be the first user of this, could you help to review?

Thanks in advance.

Hi Tom, Ashish,

This series touches AMD SME code too, and I don't have AMD machine to
test. I appreciate if you can help to review and/or test.

Kai Huang (5):
x86/kexec: do unconditional WBINVD in stop_this_cpu()
x86/kexec: do unconditional WBINVD in relocate_kernel()
x86/kexec: Reset TDX private memory on platforms with TDX erratum
x86/virt/tdx: Remove the !KEXEC_CORE dependency
x86/virt/tdx: Add TDX memory reset notifier to reset other private
pages

arch/x86/Kconfig | 1 -
arch/x86/include/asm/kexec.h | 3 +-
arch/x86/include/asm/tdx.h | 16 +++++
arch/x86/kernel/machine_kexec_64.c | 30 ++++++--
arch/x86/kernel/process.c | 17 ++---
arch/x86/kernel/relocate_kernel_64.S | 13 +---
arch/x86/virt/vmx/tdx/tdx.c | 100 +++++++++++++++++++++++++++
7 files changed, 150 insertions(+), 30 deletions(-)

base-commit: 7e19a79344df2ed5e106091c29338962261b0290
--
2.34.1

2024-03-19 01:49:36

[permalink] [raw]

Subject: [PATCH v2 1/5] x86/kexec: do unconditional WBINVD in stop_this_cpu()

TL;DR:

Change to do unconditional WBINVD in stop_this_cpu() to cover kexec
support for both AMD SME and Intel TDX, despite there _was_ some issue
preventing doing so but now has got fixed.

Long version:

Both SME and TDX can leave caches in incoherent state due to memory
encryption. During kexec, the caches must be flushed before jumping to
the second kernel to avoid silent memory corruption to the second kernel.

Currently, for SME the kernel only does WBINVD in stop_this_cpu() when
the kernel determines the hardware supports SME. To support TDX, one
option is to extend that specific check to cover both SME and TDX.

However, instead of sprinkling around vendor-specific checks, it's
better to just do unconditional WBINVD. Kexec() is a slow path, and it
is acceptable to have an additional WBINVD in order to have simple
and easy to maintain code.

Note:

Historically, there _was_ an issue preventing doing unconditional WBINVD
but that has been fixed.

When SME kexec() support was initially added in commit

bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

WBINVD was done unconditionally. However since then some issues were
reported that different Intel systems would hang or reset due to that
commit.

To try to fix, a later commit

f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

then changed to only do WBINVD when hardware supports SME.

While this commit made the reported issues go away, it didn't pinpoint
the root cause. Also, it didn't handle a corner case[*] correctly, which
resulted in the reveal of the root cause and the final fix by commit

1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of doing unconditional WBINVD based on the above fix on
the problematic machines (that issues were originally reported)
confirmed the issues couldn't be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD now.

[*] The commit didn't check whether the CPUID leaf is available or not.
Making unsupported CPUID leaf on Intel returns garbage resulting in
unintended WBINVD which caused some issue (followed by the analysis and
the reveal of the final root cause). The corner case was independently
fixed by commit

9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")

[1]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#m300f3f9790850b5daa20a71abcc200ae8d94a12a
[2]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#ma7263a7765483db0dabdeef62a1110940e634846
[3]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#mc043191f2ff860d649c8466775dc61ac1e0ae320
[4]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#md23f1a8f6afcc59fa2b0ac1967f18e418e24347c

Signed-off-by: Kai Huang <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Dave Young <[email protected]>
---
arch/x86/kernel/process.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b8441147eb5e..b375f069dd2a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,19 +813,14 @@ void __noreturn stop_this_cpu(void *dummy)
mcheck_cpu_clear(c);

/*
- * Use wbinvd on processors that support SME. This provides support
- * for performing a successful kexec when going from SME inactive
- * to SME active (or vice-versa). The cache must be cleared so that
- * if there are entries with the same physical address, both with and
- * without the encryption bit, they don't race each other when flushed
- * and potentially end up with the wrong entry being committed to
- * memory.
+ * The kernel could leave caches in incoherent state on SME/TDX
+ * capable platforms. Flush cache to avoid silent memory
+ * corruption for these platforms.
*
- * Test the CPUID bit directly because the machine might've cleared
- * X86_FEATURE_SME due to cmdline options.
+ * stop_this_cpu() is not a fast path, just do unconditional
+ * WBINVD for simplicity.
*/
- if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
- native_wbinvd();
+ native_wbinvd();

/*
* This brings a cache line back and dirties it, but
--
2.34.1

2024-03-19 01:49:40

[permalink] [raw]

Subject: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

Both SME and TDX can leave caches in incoherent state due to memory
encryption. During kexec, the caches must be flushed before jumping to
the second kernel to avoid silent memory corruption to the second kernel.

During kexec, the WBINVD in stop_this_cpu() flushes caches for all
remote cpus when they are being stopped. For SME, the WBINVD in
relocate_kernel() flushes the cache for the last running cpu (which is
executing the kexec).

Similarly, for TDX after stopping all remote cpus with cache flushed, to
support kexec, the kernel needs to flush cache for the last running cpu.

Make the WBINVD in the relocate_kernel() unconditional to cover both SME
and TDX.

Signed-off-by: Kai Huang <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Dave Young <[email protected]>
---
arch/x86/include/asm/kexec.h | 3 +--
arch/x86/kernel/machine_kexec_64.c | 3 +--
arch/x86/kernel/relocate_kernel_64.S | 13 +++----------
3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 91ca9a9ee3a2..9754794242ad 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -127,8 +127,7 @@ unsigned long
relocate_kernel(unsigned long indirection_page,
unsigned long page_list,
unsigned long start_address,
- unsigned int preserve_context,
- unsigned int host_mem_enc_active);
+ unsigned int preserve_context);
#endif

#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index bc0a5348b4a6..b9a632479b36 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -357,8 +357,7 @@ void machine_kexec(struct kimage *image)
image->start = relocate_kernel((unsigned long)image->head,
(unsigned long)page_list,
image->start,
- image->preserve_context,
- cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+ image->preserve_context);

#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..66b628686dbc 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -50,7 +50,6 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
* %rsi page_list
* %rdx start address
* %rcx preserve_context
- * %r8 host_mem_enc_active
*/

/* Save the CPU context, used for jumping back */
@@ -78,9 +77,6 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
pushq $0
popfq

- /* Save SME active flag */
- movq %r8, %r12
-
/*
* get physical address of control page now
* this is impossible after page table switch
@@ -160,14 +156,11 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %r9, %cr3

/*
- * If SME is active, there could be old encrypted cache line
- * entries that will conflict with the now unencrypted memory
- * used by kexec. Flush the caches before copying the kernel.
+ * The kernel could leave caches in incoherent state on SME/TDX
+ * capable platforms. Just do unconditional cache flush to avoid
+ * silent memory corruption to the new kernel for these platforms.
*/
- testq %r12, %r12
- jz 1f
wbinvd
-1:

movq %rcx, %r11
call swap_pages
--
2.34.1

2024-03-19 01:49:53

[permalink] [raw]

Subject: [PATCH v2 3/5] x86/kexec: Reset TDX private memory on platforms with TDX erratum

TL;DR:

On the platforms with TDX "partial write machine check" erratum, during
kexec, convert TDX private memory back to normal before jumping to the
second kernel to avoid the second kernel seeing potential unexpected
machine check.

Long version:

The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check. According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory accesses operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A fast warm reset doesn't reset TDX private memory. Kexec() can also
boot into the new kernel directly. Thus if the old kernel has left any
TDX private pages on the platform with this erratum, the new kernel
might get unexpected machine check.

Note that w/o this erratum any kernel read/write on TDX private memory
should never cause machine check, thus it's OK for the old kernel to
leave TDX private pages as is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after kexec(). The BIOS is also expected to disable fast warm reset as
a workaround to this erratum, thus this implementation doesn't try to
reset TDX private memory for the reboot case in the kernel but depends
on the BIOS to enable the workaround.

Convert TDX private pages back to normal (using MOVDIR64B to clear these
pages) after all remote cpus have been stopped and cache flush has been
done on all cpus, when no more TDX activity can happen further.

Do it in machine_kexec() to cover both normal kexec, and crash kexec.

For now TDX private memory can only be PAMT pages. It would be ideal to
cover all types of TDX private memory here, but there are practical
problems to do so:

1) There's no existing infrastructure to track TDX private pages;
2) It's not feasible to query the TDX module about page type, because
VMX, which making SEAMCALL requires, has already been disabled;
3) Even if it is feasible to query the TDX module, the result may not be
accurate. E.g., the remote CPU could be stopped right before
MOVDIR64B.

One temporary solution is to blindly convert all memory pages, but it's
problematic to do so too, because not all pages are mapped as writable
in the direct mapping. It can be done by switching to the identical
mapping created for kexec(), or a new page table, but the complexity
looks overkill.

Therefore, rather than doing something dramatic, only reset PAMT pages
here.

Leave resetting other TDX private pages as a future work when they
become possible to exist.

Signed-off-by: Kai Huang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
---

v1 -> v2:
- Remove using reboot notifier to stop TDX module as it doesn't
cover crash kexec. Change to use a variable with barrier instead.
(Rick)
- Introduce kexec_save_processor_start() to make code better, and
make the comment around calling site of tdx_reset_memory() more
concise. (Dave)
- Mention cache for all other cpus have been flushed around
native_wbinvd() in tdx_reset_memory(). (Dave)
- Remove the extended alternaties discussion from the comment, but leave
it in the changelog. Point out what does current code do and point out
risk. (Dave)

---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kernel/machine_kexec_64.c | 27 ++++++++--
arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++
3 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..ed3ac9a8a079 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,11 +116,13 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
int tdx_cpu_enable(void);
int tdx_enable(void);
const char *tdx_dump_mce_info(struct mce *m);
+void tdx_reset_memory(void);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline void tdx_reset_memory(void) { }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b9a632479b36..fdc2ffdc7d90 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
#include <asm/setup.h>
#include <asm/set_memory.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>

#ifdef CONFIG_ACPI
/*
@@ -288,6 +289,14 @@ void machine_kexec_cleanup(struct kimage *image)
free_transition_pgtable(image);
}

+static void kexec_save_processor_start(struct kimage *image)
+{
+#ifdef CONFIG_KEXEC_JUMP
+ if (image->preserve_context)
+ save_processor_state();
+#endif
+}
+
/*
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
@@ -298,10 +307,20 @@ void machine_kexec(struct kimage *image)
void *control_page;
int save_ftrace_enabled;

-#ifdef CONFIG_KEXEC_JUMP
- if (image->preserve_context)
- save_processor_state();
-#endif
+ kexec_save_processor_start(image);
+
+ /*
+ * Convert TDX private memory back to normal (when needed) to
+ * avoid the second kernel potentially seeing unexpected machine
+ * check.
+ *
+ * However skip this when preserve_context is on. By reaching
+ * here, TDX (if ever got enabled by the kernel) has survived
+ * from the suspend when preserve_context is on, and it can
+ * continue to work after jumping back from the second kernel.
+ */
+ if (!image->preserve_context)
+ tdx_reset_memory();

save_ftrace_enabled = __ftrace_enabled_save();

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4d6826a76f78..1443d486858a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -53,6 +53,8 @@ static DEFINE_MUTEX(tdx_module_lock);
/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
static LIST_HEAD(tdx_memlist);

+static bool tdx_may_have_private_memory __read_mostly;
+
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -1097,6 +1099,18 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
return 0;
}

+static void mark_may_have_private_memory(bool may)
+{
+ tdx_may_have_private_memory = may;
+
+ /*
+ * Ensure update to tdx_may_have_private_memory is visible to all
+ * cpus. This ensures when any remote cpu reads it as true, the
+ * 'tdx_tdmr_list' must be stable for reading PAMTs.
+ */
+ smp_wmb();
+}
+
static int init_tdx_module(void)
{
struct tdx_tdmr_sysinfo tdmr_sysinfo;
@@ -1142,6 +1156,12 @@ static int init_tdx_module(void)
if (ret)
goto err_reset_pamts;

+ /*
+ * Starting from this point the system is possible to have
+ * TDX private memory.
+ */
+ mark_may_have_private_memory(true);
+
/* Initialize TDMRs to complete the TDX module initialization */
ret = init_tdmrs(&tdx_tdmr_list);
if (ret)
@@ -1173,6 +1193,7 @@ static int init_tdx_module(void)
* as suggested by the TDX spec.
*/
tdmrs_reset_pamt_all(&tdx_tdmr_list);
+ mark_may_have_private_memory(false);
err_free_pamts:
tdmrs_free_pamt_all(&tdx_tdmr_list);
err_free_tdmrs:
@@ -1490,3 +1511,61 @@ void __init tdx_init(void)

check_tdx_erratum();
}
+
+void tdx_reset_memory(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
+ return;
+
+ /*
+ * Converting TDX private pages back to normal must be done
+ * when there's no TDX activity anymore on all remote cpus.
+ * Verify this is only called when all remote cpus have
+ * been stopped.
+ */
+ WARN_ON_ONCE(num_online_cpus() != 1);
+
+ /*
+ * Kernel read/write to TDX private memory doesn't cause
+ * machine check on hardware w/o this erratum.
+ */
+ if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+ return;
+
+ /*
+ * Nothing to convert if it's not possible to have any TDX
+ * private pages.
+ */
+ if (!tdx_may_have_private_memory)
+ return;
+
+ /*
+ * Ensure the 'tdx_tdmr_list' is stable for reading PAMTs
+ * when tdx_may_have_private_memory reads true, paired with
+ * the smp_wmb() in mark_may_have_private_memory().
+ */
+ smp_rmb();
+
+ /*
+ * All remote cpus have been stopped, and their caches have
+ * been flushed in stop_this_cpu(). Now flush cache for the
+ * last running cpu _before_ converting TDX private pages.
+ */
+ native_wbinvd();
+
+ /*
+ * It's ideal to cover all types of TDX private pages here, but
+ * currently there's no unified way to tell whether a given page
+ * is TDX private page or not.
+ *
+ * Just convert PAMT pages now, as currently TDX private pages
+ * can only be PAMT pages.
+ *
+ * TODO:
+ *
+ * This leaves all other types of TDX private pages undealt
+ * with. They must be handled in _some_ way when they become
+ * possible to exist.
+ */
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
--
2.34.1

2024-03-19 01:50:04

[permalink] [raw]

Subject: [PATCH v2 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency

Now TDX host can work with kexec(). Remove the !KEXEC_CORE dependency.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/Kconfig | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e88f6f7b6b41..89b65e2e1115 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1970,7 +1970,6 @@ config INTEL_TDX_HOST
depends on X86_X2APIC
select ARCH_KEEP_MEMBLOCK
depends on CONTIG_ALLOC
- depends on !KEXEC_CORE
depends on X86_MCE
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
--
2.34.1

2024-03-19 01:50:34

[permalink] [raw]

Subject: [PATCH v2 5/5] x86/virt/tdx: Add TDX memory reset notifier to reset other private pages

TL;DR:

To cover both normal kexec and crash kexec, add a TDX specific memory
reset notifier to let "in-kernel TDX users" use their own way to convert
TDX private pages (that they manage respectively) in tdx_reset_memory().

Long version:

On the platforms with TDX "partial write machine check" erratum, during
kexec, the kernel needs to convert TDX private memory back to normal
before jumping to the second kernel to avoid the second kernel seeing
potential machine check.

For now tdx_reset_memory() only resets PAMT pages. KVM will be the
first in-kernel TDX user to support running TDX guests, and by then
other TDX private pages will start to exist. They need to be covered
too.

Currently the kernel doesn't have a unified way to tell whether a given
page is TDX private page or not. One choice is to add such unified way,
and there are couple of options to do it:

1) Use a bitmap, or Xarray, etc to track TDX private page for all PFNs;
2) Use a "software-only" bit in the direct-mapping PTE to mark a given
page is TDX private page;
3) Use a new flag in 'struct page' to mark TDX private page;
4) ... potential other ways.

Option 1) consumes additional memory. E.g., if using bitmap, the
overhead is "number of total RAM pages / 8" bytes.

Option 2) would cause splitting large-page mapping to 4K mapping in the
direct mapping when one page is allocated as TDX private page, and cause
additional TLB flush etc. It's not ideal for such use case.

Option 3) apparently contradicts to the effort to reduce the use of the
flags of 'struct page'.

None of above is ideal.

Therefore, instead of providing a unified way to tell whether a given
page is TDX private page or not, leave "resetting TDX private pages" to
the "in-kernel user" of TDX.

This is motivated by the fact that KVM is already maintaining an Xarray
to track "memory attributes (e.g., private or shared)" for each GFN for
each guest. Thus KVM can use its own way to find all TDX private pages
that it manages and convert them back to normal.

For the normal kexec the reboot notifier could be used, but it doesn't
cover the cash kexec.

Add a TDX specific memory reset notifier to achieve this. The in-kernel
TDX users will need to register their own notifiers to reset TDX private
pages. Call these notifiers in tdx_reset_memory() right before
resetting PAMT pages.

KVM will be the first user of this notifier. Export the "register" and
"unregister" APIs for KVM to use.

Signed-off-by: Kai Huang <[email protected]>
---
arch/x86/include/asm/tdx.h | 14 ++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 45 +++++++++++++++++++++++++++----------
2 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ed3ac9a8a079..7c2c0a0b9754 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -117,12 +117,26 @@ int tdx_cpu_enable(void);
int tdx_enable(void);
const char *tdx_dump_mce_info(struct mce *m);
void tdx_reset_memory(void);
+
+struct notifier_block;
+
+int tdx_register_memory_reset_notifier(struct notifier_block *nb);
+void tdx_unregister_memory_reset_notifier(struct notifier_block *nb);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
static inline void tdx_reset_memory(void) { }
+
+struct notifier_block;
+
+static inline int tdx_register_memory_reset_notifier(struct notifier_block *nb)
+{
+ return -EOPNOTSUPP;
+}
+static inline void tdx_unregister_memory_reset_notifier(
+ struct notifier_block *nb) { }
#endif /* CONFIG_INTEL_TDX_HOST */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1443d486858a..a3daea33c59f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -28,6 +28,7 @@
#include <linux/acpi.h>
#include <linux/suspend.h>
#include <linux/acpi.h>
+#include <linux/notifier.h>
#include <asm/page.h>
#include <asm/special_insns.h>
#include <asm/msr-index.h>
@@ -55,6 +56,8 @@ static LIST_HEAD(tdx_memlist);

static bool tdx_may_have_private_memory __read_mostly;

+static BLOCKING_NOTIFIER_HEAD(tdx_memory_reset_chain);
+
typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);

static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -1512,6 +1515,27 @@ void __init tdx_init(void)
check_tdx_erratum();
}

+int tdx_register_memory_reset_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&tdx_memory_reset_chain, nb);
+}
+EXPORT_SYMBOL_GPL(tdx_register_memory_reset_notifier);
+
+void tdx_unregister_memory_reset_notifier(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&tdx_memory_reset_chain, nb);
+}
+EXPORT_SYMBOL_GPL(tdx_unregister_memory_reset_notifier);
+
+static int notify_reset_memory(void)
+{
+ int ret;
+
+ ret = blocking_notifier_call_chain(&tdx_memory_reset_chain, 0, NULL);
+
+ return notifier_to_errno(ret);
+}
+
void tdx_reset_memory(void)
{
if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
@@ -1554,18 +1578,15 @@ void tdx_reset_memory(void)
native_wbinvd();

/*
- * It's ideal to cover all types of TDX private pages here, but
- * currently there's no unified way to tell whether a given page
- * is TDX private page or not.
- *
- * Just convert PAMT pages now, as currently TDX private pages
- * can only be PAMT pages.
- *
- * TODO:
- *
- * This leaves all other types of TDX private pages undealt
- * with. They must be handled in _some_ way when they become
- * possible to exist.
+ * Tell all in-kernel TDX users to reset TDX private pages
+ * that they manage.
+ */
+ if (notify_reset_memory())
+ pr_err("Failed to reset all TDX private pages.\n");
+
+ /*
+ * The only remaining TDX private pages are PAMT pages.
+ * Reset them.
*/
tdmrs_reset_pamt_all(&tdx_tdmr_list);
}
--
2.34.1

2024-03-19 12:30:05

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
> Both SME and TDX can leave caches in incoherent state due to memory
> encryption. During kexec, the caches must be flushed before jumping to
> the second kernel to avoid silent memory corruption to the second kernel.
>
> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
> remote cpus when they are being stopped. For SME, the WBINVD in
> relocate_kernel() flushes the cache for the last running cpu (which is
> executing the kexec).
>
> Similarly, for TDX after stopping all remote cpus with cache flushed, to
> support kexec, the kernel needs to flush cache for the last running cpu.
>
> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
> and TDX.

Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-19 14:38:46

by Tom Lendacky

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 3/19/24 06:13, Kirill A. Shutemov wrote:
> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>> Both SME and TDX can leave caches in incoherent state due to memory
>> encryption. During kexec, the caches must be flushed before jumping to
>> the second kernel to avoid silent memory corruption to the second kernel.
>>
>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>> remote cpus when they are being stopped. For SME, the WBINVD in
>> relocate_kernel() flushes the cache for the last running cpu (which is
>> executing the kexec).
>>
>> Similarly, for TDX after stopping all remote cpus with cache flushed, to
>> support kexec, the kernel needs to flush cache for the last running cpu.
>>
>> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
>> and TDX.
>
> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.

Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.

Thanks,
Tom

>

2024-03-19 15:42:08

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
> Both SME and TDX can leave caches in incoherent state due to memory
> encryption. During kexec, the caches must be flushed before jumping to
> the second kernel to avoid silent memory corruption to the second kernel.
>
> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
> remote cpus when they are being stopped. For SME, the WBINVD in
> relocate_kernel() flushes the cache for the last running cpu (which is
> executing the kexec).
>
> Similarly, for TDX after stopping all remote cpus with cache flushed, to
> support kexec, the kernel needs to flush cache for the last running cpu.
>
> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
> and TDX.
>
> Signed-off-by: Kai Huang <[email protected]>
> Suggested-by: Borislav Petkov <[email protected]>

Well, I suggested what you have in patch 1 but not this.

You can't just slap tags willy nilly to patches.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-19 21:08:37

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 20/03/2024 4:41 am, Borislav Petkov wrote:
> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>> Both SME and TDX can leave caches in incoherent state due to memory
>> encryption. During kexec, the caches must be flushed before jumping to
>> the second kernel to avoid silent memory corruption to the second kernel.
>>
>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>> remote cpus when they are being stopped. For SME, the WBINVD in
>> relocate_kernel() flushes the cache for the last running cpu (which is
>> executing the kexec).
>>
>> Similarly, for TDX after stopping all remote cpus with cache flushed, to
>> support kexec, the kernel needs to flush cache for the last running cpu.
>>
>> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
>> and TDX.
>>
>> Signed-off-by: Kai Huang <[email protected]>
>> Suggested-by: Borislav Petkov <[email protected]>
>
> Well, I suggested what you have in patch 1 but not this.
>
> You can't just slap tags willy nilly to patches.
>

Apologize. Will remove.

2024-03-19 21:21:15

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 20/03/2024 3:38 am, Tom Lendacky wrote:
> On 3/19/24 06:13, Kirill A. Shutemov wrote:
>> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>>> Both SME and TDX can leave caches in incoherent state due to memory
>>> encryption. During kexec, the caches must be flushed before jumping to
>>> the second kernel to avoid silent memory corruption to the second
>>> kernel.
>>>
>>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>>> remote cpus when they are being stopped. For SME, the WBINVD in
>>> relocate_kernel() flushes the cache for the last running cpu (which is
>>> executing the kexec).
>>>
>>> Similarly, for TDX after stopping all remote cpus with cache flushed, to
>>> support kexec, the kernel needs to flush cache for the last running cpu.
>>>
>>> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
>>> and TDX.
>>
>> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.
>
> Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.
>

Oh I forgot these.

Hi Kirill,

Then I think patch 1 will also break TDX guest after your series to
enable multiple cpus for the second kernel after kexec()?

Hi Tom,

I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
Does patch 1 break them?

2024-03-20 00:20:05

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Wed, Mar 20, 2024 at 10:20:50AM +1300, Huang, Kai wrote:
>
>
> On 20/03/2024 3:38 am, Tom Lendacky wrote:
> > On 3/19/24 06:13, Kirill A. Shutemov wrote:
> > > On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
> > > > Both SME and TDX can leave caches in incoherent state due to memory
> > > > encryption.? During kexec, the caches must be flushed before jumping to
> > > > the second kernel to avoid silent memory corruption to the
> > > > second kernel.
> > > >
> > > > During kexec, the WBINVD in stop_this_cpu() flushes caches for all
> > > > remote cpus when they are being stopped.? For SME, the WBINVD in
> > > > relocate_kernel() flushes the cache for the last running cpu (which is
> > > > executing the kexec).
> > > >
> > > > Similarly, for TDX after stopping all remote cpus with cache flushed, to
> > > > support kexec, the kernel needs to flush cache for the last running cpu.
> > > >
> > > > Make the WBINVD in the relocate_kernel() unconditional to cover both SME
> > > > and TDX.
> > >
> > > Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.
> >
> > Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.
> >
>
> Oh I forgot these.
>
> Hi Kirill,
>
> Then I think patch 1 will also break TDX guest after your series to enable
> multiple cpus for the second kernel after kexec()?

Well, not exactly.

My patchset overrides stop_this_cpu() with own implementation for MADT
wakeup method that doesn't have WBINVD. So the patch doesn't break
anything, but if in the future TDX (or SEV) host would use MADT wake up
method instead of IPI we will get back to the problem with missing
WBINVD.

I don't know if we care. There's no reason for host to use MADT wake up
method.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-20 00:45:59

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 20/03/2024 1:19 pm, Kirill A. Shutemov wrote:
> On Wed, Mar 20, 2024 at 10:20:50AM +1300, Huang, Kai wrote:
>>
>>
>> On 20/03/2024 3:38 am, Tom Lendacky wrote:
>>> On 3/19/24 06:13, Kirill A. Shutemov wrote:
>>>> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>>>>> Both SME and TDX can leave caches in incoherent state due to memory
>>>>> encryption. During kexec, the caches must be flushed before jumping to
>>>>> the second kernel to avoid silent memory corruption to the
>>>>> second kernel.
>>>>>
>>>>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>>>>> remote cpus when they are being stopped. For SME, the WBINVD in
>>>>> relocate_kernel() flushes the cache for the last running cpu (which is
>>>>> executing the kexec).
>>>>>
>>>>> Similarly, for TDX after stopping all remote cpus with cache flushed, to
>>>>> support kexec, the kernel needs to flush cache for the last running cpu.
>>>>>
>>>>> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
>>>>> and TDX.
>>>>
>>>> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.
>>>
>>> Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.
>>>
>>
>> Oh I forgot these.
>>
>> Hi Kirill,
>>
>> Then I think patch 1 will also break TDX guest after your series to enable
>> multiple cpus for the second kernel after kexec()?
>
> Well, not exactly.
>
> My patchset overrides stop_this_cpu() with own implementation for MADT
> wakeup method that doesn't have WBINVD. So the patch doesn't break
> anything,

Well, your callback actually only gets called _after_ this WBINVD, so...

I guess I should have that checked by myself. :-)

but if in the future TDX (or SEV) host would use MADT wake up
> method instead of IPI we will get back to the problem with missing
> WBINVD.
>
> I don't know if we care. There's no reason for host to use MADT wake up
> method.
>

I don't think MADT wake up will be used at any native environment.

Anyway, regardless whether patch 1 will break TDX/SEV-ES/SEV-SNP guests,
I think to resolve this, we can simply adjust our mindset from ...

"do unconditional WBINVD"

to ...

"do unconditional WBINVD when it can be done safely"

For now, AFAICT, only TDX guests and SEV-ES/SEV-SNP guests are such guests.

And they all report the CC_ATTR_GUEST_MEM_ENCRYPT flag as true, so we
can change to only do WBINVD when the kernel sees that flag.

if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
native_wbinvd();

Alternatively, we can have a dedicated X86_FEATURE_NO_WBINVD and get it
set for TDX/SEV-ES/SEV-SNP guests (and any guests if this is true), and do:

if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
native_wbinvd();

It seems the first one is too generic (for any CoCo VMs), and the second
one is better.

Any comments?

Hi Boris/Dave,

Do you have any comments?

2024-03-20 12:51:42

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Wed, Mar 20, 2024 at 01:45:32PM +1300, Huang, Kai wrote:
> Anyway, regardless whether patch 1 will break TDX/SEV-ES/SEV-SNP guests, I
> think to resolve this, we can simply adjust our mindset from ...
>
> "do unconditional WBINVD"
>
> to ...
>
> "do unconditional WBINVD when it can be done safely"
>
> For now, AFAICT, only TDX guests and SEV-ES/SEV-SNP guests are such guests.
>
> And they all report the CC_ATTR_GUEST_MEM_ENCRYPT flag as true, so we can
> change to only do WBINVD when the kernel sees that flag.
>
> if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> native_wbinvd();
>
> Alternatively, we can have a dedicated X86_FEATURE_NO_WBINVD and get it set
> for TDX/SEV-ES/SEV-SNP guests (and any guests if this is true), and do:
>
> if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
> native_wbinvd();
>
> It seems the first one is too generic (for any CoCo VMs), and the second one
> is better.
>
> Any comments?

I like cc_platform_has() variant better. There's no good reason to invent
X86_FEATURE if we don't cases outside of CC.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-20 13:50:13

by Tom Lendacky

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 3/19/24 16:20, Huang, Kai wrote:
>
>
> On 20/03/2024 3:38 am, Tom Lendacky wrote:
>> On 3/19/24 06:13, Kirill A. Shutemov wrote:
>>> On Tue, Mar 19, 2024 at 01:48:45AM +0000, Kai Huang wrote:
>>>> Both SME and TDX can leave caches in incoherent state due to memory
>>>> encryption. During kexec, the caches must be flushed before jumping to
>>>> the second kernel to avoid silent memory corruption to the second kernel.
>>>>
>>>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>>>> remote cpus when they are being stopped. For SME, the WBINVD in
>>>> relocate_kernel() flushes the cache for the last running cpu (which is
>>>> executing the kexec).
>>>>
>>>> Similarly, for TDX after stopping all remote cpus with cache flushed, to
>>>> support kexec, the kernel needs to flush cache for the last running cpu.
>>>>
>>>> Make the WBINVD in the relocate_kernel() unconditional to cover both SME
>>>> and TDX.
>>>
>>> Nope. It breaks TDX guest. WBINVD triggers #VE for TDX guests.
>>
>> Ditto for SEV-ES/SEV-SNP, a #VC is generated and crashes the guest.
>>
>
> Oh I forgot these.
>
> Hi Kirill,
>
> Then I think patch 1 will also break TDX guest after your series to enable
> multiple cpus for the second kernel after kexec()?
>
> Hi Tom,
>
> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests. Does
> patch 1 break them?

SNP guests can kexec with some patches that are currently in process
around shared to private memory conversions. ES guests can only kexec with
a single vCPU. There was a recent patch series to add support for multiple
vCPUs.

Patch #1 doesn't break either ES or SNP because we still have an IDT and
traditional kernel addressing in place, so the #VC can be handled.

Whereas patch #2 has switched to identity mapping and removed the IDT, so
a #VC causes a triple fault.

Thanks,
Tom

2024-03-20 20:48:55

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

>> Hi Tom,
>>
>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>> Does patch 1 break them?
>
> SNP guests can kexec with some patches that are currently in process
> around shared to private memory conversions. ES guests can only kexec
> with a single vCPU. There was a recent patch series to add support for
> multiple vCPUs.
>
> Patch #1 doesn't break either ES or SNP because we still have an IDT and
> traditional kernel addressing in place, so the #VC can be handled.

How about plain SEV guest?

>
> Whereas patch #2 has switched to identity mapping and removed the IDT,
> so a #VC causes a triple fault.

That makes sense. Thanks.

Hi Kirill,

Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu()
can be handled although it causes #VE, while WBINVD in relocate_kernel()
will just triple fault the guest?

2024-03-20 21:08:02

by Tom Lendacky

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 3/20/24 15:48, Huang, Kai wrote:
>
>>> Hi Tom,
>>>
>>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>>> Does patch 1 break them?
>>
>> SNP guests can kexec with some patches that are currently in process
>> around shared to private memory conversions. ES guests can only kexec
>> with a single vCPU. There was a recent patch series to add support for
>> multiple vCPUs.
>>
>> Patch #1 doesn't break either ES or SNP because we still have an IDT and
>> traditional kernel addressing in place, so the #VC can be handled.
>
> How about plain SEV guest?

A plain SEV guest is fine, since WBINVD is intercepted and would just exit
to the hypervisor (#VC doesn't happen with plain SEV).

Thanks,
Tom

>
>>
>> Whereas patch #2 has switched to identity mapping and removed the IDT,
>> so a #VC causes a triple fault.
>
> That makes sense. Thanks.
>
> Hi Kirill,
>
> Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu()
> can be handled although it causes #VE, while WBINVD in relocate_kernel()
> will just triple fault the guest?

2024-03-20 21:58:29

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 21/03/2024 10:06 am, Tom Lendacky wrote:
> On 3/20/24 15:48, Huang, Kai wrote:
>>
>>>> Hi Tom,
>>>>
>>>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>>>> Does patch 1 break them?
>>>
>>> SNP guests can kexec with some patches that are currently in process
>>> around shared to private memory conversions. ES guests can only kexec
>>> with a single vCPU. There was a recent patch series to add support
>>> for multiple vCPUs.
>>>
>>> Patch #1 doesn't break either ES or SNP because we still have an IDT
>>> and traditional kernel addressing in place, so the #VC can be handled.
>>
>> How about plain SEV guest?
>
> A plain SEV guest is fine, since WBINVD is intercepted and would just
> exit to the hypervisor (#VC doesn't happen with plain SEV).
>
> Thanks,
> Tom
>

Thanks for the info.

2024-03-20 23:11:01

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
>
> > > Hi Tom,
> > >
> > > I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
> > > Does patch 1 break them?
> >
> > SNP guests can kexec with some patches that are currently in process
> > around shared to private memory conversions. ES guests can only kexec
> > with a single vCPU. There was a recent patch series to add support for
> > multiple vCPUs.
> >
> > Patch #1 doesn't break either ES or SNP because we still have an IDT and
> > traditional kernel addressing in place, so the #VC can be handled.
>
> How about plain SEV guest?
>
> >
> > Whereas patch #2 has switched to identity mapping and removed the IDT,
> > so a #VC causes a triple fault.
>
> That makes sense. Thanks.
>
> Hi Kirill,
>
> Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
> be handled although it causes #VE, while WBINVD in relocate_kernel() will
> just triple fault the guest?

No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
only option is to ask host to do this. We cannot guarantee host will do
anything useful with the request. I guess it can be potential attack
vector if host strategically ignores WBINVD to induce bad guest behaviour.

And it is not good from host PoV either. If it does WBINVD on every guest
request we get guest->host DoS attack possibility.

Tom, I am curious, how do you deal with these problems?

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-21 21:02:30

by Tom Lendacky

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 3/20/24 18:10, Kirill A. Shutemov wrote:
> On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
>>
>>>> Hi Tom,
>>>>
>>>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>>>> Does patch 1 break them?
>>>
>>> SNP guests can kexec with some patches that are currently in process
>>> around shared to private memory conversions. ES guests can only kexec
>>> with a single vCPU. There was a recent patch series to add support for
>>> multiple vCPUs.
>>>
>>> Patch #1 doesn't break either ES or SNP because we still have an IDT and
>>> traditional kernel addressing in place, so the #VC can be handled.
>>
>> How about plain SEV guest?
>>
>>>
>>> Whereas patch #2 has switched to identity mapping and removed the IDT,
>>> so a #VC causes a triple fault.
>>
>> That makes sense. Thanks.
>>
>> Hi Kirill,
>>
>> Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
>> be handled although it causes #VE, while WBINVD in relocate_kernel() will
>> just triple fault the guest?
>
> No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> only option is to ask host to do this. We cannot guarantee host will do

Is the WBINVD performed or ignored in that case?

> anything useful with the request. I guess it can be potential attack
> vector if host strategically ignores WBINVD to induce bad guest behaviour.

With SNP, memory is coherent so there isn't a need for a WBINVD within a
guest and so issuing it should not be an issue whether the hypervisor
performs the operation or not. I don't know what can happen in the case
where, say, you have a non-coherent TDISP device attached or such, but
that would be very unusual/unlikely.

>
> And it is not good from host PoV either. If it does WBINVD on every guest
> request we get guest->host DoS attack possibility.

Yeah, that can happen today, regardless of the type of VM running.

>
> Tom, I am curious, how do you deal with these problems?

If the WBINVD is being intercepted, then it will generate a #VC and we use
the GHCB protocol to communicate that back to the hypervisor to handle.

Thanks,
Tom

>

2024-03-22 10:41:15

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
> On 3/20/24 18:10, Kirill A. Shutemov wrote:
> > On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
> > >
> > > > > Hi Tom,
> > > > >
> > > > > I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
> > > > > Does patch 1 break them?
> > > >
> > > > SNP guests can kexec with some patches that are currently in process
> > > > around shared to private memory conversions. ES guests can only kexec
> > > > with a single vCPU. There was a recent patch series to add support for
> > > > multiple vCPUs.
> > > >
> > > > Patch #1 doesn't break either ES or SNP because we still have an IDT and
> > > > traditional kernel addressing in place, so the #VC can be handled.
> > >
> > > How about plain SEV guest?
> > >
> > > >
> > > > Whereas patch #2 has switched to identity mapping and removed the IDT,
> > > > so a #VC causes a triple fault.
> > >
> > > That makes sense. Thanks.
> > >
> > > Hi Kirill,
> > >
> > > Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
> > > be handled although it causes #VE, while WBINVD in relocate_kernel() will
> > > just triple fault the guest?
> >
> > No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> > only option is to ask host to do this. We cannot guarantee host will do
>
> Is the WBINVD performed or ignored in that case?

We crash the guest if it tries to use WBINVD. There's no legitimate reason
for it.

> > anything useful with the request. I guess it can be potential attack
> > vector if host strategically ignores WBINVD to induce bad guest behaviour.
>
> With SNP, memory is coherent so there isn't a need for a WBINVD within a
> guest and so issuing it should not be an issue whether the hypervisor
> performs the operation or not. I don't know what can happen in the case
> where, say, you have a non-coherent TDISP device attached or such, but that
> would be very unusual/unlikely.

Looks like SNP is in the same position as TDX.

> > And it is not good from host PoV either. If it does WBINVD on every guest
> > request we get guest->host DoS attack possibility.
>
> Yeah, that can happen today, regardless of the type of VM running.
>
> >
> > Tom, I am curious, how do you deal with these problems?
>
> If the WBINVD is being intercepted, then it will generate a #VC and we use
> the GHCB protocol to communicate that back to the hypervisor to handle.

I would argue that forwarding it to hypervisor is worse than crashing. It
gives false sense of doing something. Hypervisor is outside TCB and
considered hostile.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-22 14:51:09

by Tom Lendacky

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On 3/22/24 05:40, Kirill A. Shutemov wrote:
> On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
>> On 3/20/24 18:10, Kirill A. Shutemov wrote:
>>> On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
>>>>
>>>>>> Hi Tom,
>>>>>>
>>>>>> I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
>>>>>> Does patch 1 break them?
>>>>>
>>>>> SNP guests can kexec with some patches that are currently in process
>>>>> around shared to private memory conversions. ES guests can only kexec
>>>>> with a single vCPU. There was a recent patch series to add support for
>>>>> multiple vCPUs.
>>>>>
>>>>> Patch #1 doesn't break either ES or SNP because we still have an IDT and
>>>>> traditional kernel addressing in place, so the #VC can be handled.
>>>>
>>>> How about plain SEV guest?
>>>>
>>>>>
>>>>> Whereas patch #2 has switched to identity mapping and removed the IDT,
>>>>> so a #VC causes a triple fault.
>>>>
>>>> That makes sense. Thanks.
>>>>
>>>> Hi Kirill,
>>>>
>>>> Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
>>>> be handled although it causes #VE, while WBINVD in relocate_kernel() will
>>>> just triple fault the guest?
>>>
>>> No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
>>> only option is to ask host to do this. We cannot guarantee host will do
>>
>> Is the WBINVD performed or ignored in that case?
>
> We crash the guest if it tries to use WBINVD. There's no legitimate reason
> for it.
>
>>> anything useful with the request. I guess it can be potential attack
>>> vector if host strategically ignores WBINVD to induce bad guest behaviour.
>>
>> With SNP, memory is coherent so there isn't a need for a WBINVD within a
>> guest and so issuing it should not be an issue whether the hypervisor
>> performs the operation or not. I don't know what can happen in the case
>> where, say, you have a non-coherent TDISP device attached or such, but that
>> would be very unusual/unlikely.
>
> Looks like SNP is in the same position as TDX.
>
>>> And it is not good from host PoV either. If it does WBINVD on every guest
>>> request we get guest->host DoS attack possibility.
>>
>> Yeah, that can happen today, regardless of the type of VM running.
>>
>>>
>>> Tom, I am curious, how do you deal with these problems?
>>
>> If the WBINVD is being intercepted, then it will generate a #VC and we use
>> the GHCB protocol to communicate that back to the hypervisor to handle.
>
> I would argue that forwarding it to hypervisor is worse than crashing. It
> gives false sense of doing something. Hypervisor is outside TCB and
> considered hostile.

Since the memory is coherent, it really doesn't matter what the hypervisor
does in regards to WBINVD (ignore it or perform it). And the hypervisor
can do anything it wants on any exit, regardless of this intercept.

Thanks,
Tom

>

2024-03-25 15:36:01

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Fri, 2024-03-22 at 09:50 -0500, Tom Lendacky wrote:
> On 3/22/24 05:40, Kirill A. Shutemov wrote:
> > On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
> > > On 3/20/24 18:10, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
> > > > >
> > > > > > > Hi Tom,
> > > > > > >
> > > > > > > I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
> > > > > > > Does patch 1 break them?
> > > > > >
> > > > > > SNP guests can kexec with some patches that are currently in process
> > > > > > around shared to private memory conversions. ES guests can only kexec
> > > > > > with a single vCPU. There was a recent patch series to add support for
> > > > > > multiple vCPUs.
> > > > > >
> > > > > > Patch #1 doesn't break either ES or SNP because we still have an IDT and
> > > > > > traditional kernel addressing in place, so the #VC can be handled.
> > > > >
> > > > > How about plain SEV guest?
> > > > >
> > > > > >
> > > > > > Whereas patch #2 has switched to identity mapping and removed the IDT,
> > > > > > so a #VC causes a triple fault.
> > > > >
> > > > > That makes sense. Thanks.
> > > > >
> > > > > Hi Kirill,
> > > > >
> > > > > Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
> > > > > be handled although it causes #VE, while WBINVD in relocate_kernel() will
> > > > > just triple fault the guest?
> > > >
> > > > No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> > > > only option is to ask host to do this. We cannot guarantee host will do
> > >
> > > Is the WBINVD performed or ignored in that case?
> >
> > We crash the guest if it tries to use WBINVD. There's no legitimate reason
> > for it.
> >
> > > > anything useful with the request. I guess it can be potential attack
> > > > vector if host strategically ignores WBINVD to induce bad guest behaviour.
> > >
> > > With SNP, memory is coherent so there isn't a need for a WBINVD within a
> > > guest and so issuing it should not be an issue whether the hypervisor
> > > performs the operation or not. I don't know what can happen in the case
> > > where, say, you have a non-coherent TDISP device attached or such, but that
> > > would be very unusual/unlikely.
> >
> > Looks like SNP is in the same position as TDX.
> >
> > > > And it is not good from host PoV either. If it does WBINVD on every guest
> > > > request we get guest->host DoS attack possibility.
> > >
> > > Yeah, that can happen today, regardless of the type of VM running.
> > >
> > > >
> > > > Tom, I am curious, how do you deal with these problems?
> > >
> > > If the WBINVD is being intercepted, then it will generate a #VC and we use
> > > the GHCB protocol to communicate that back to the hypervisor to handle.
> >
> > I would argue that forwarding it to hypervisor is worse than crashing. It
> > gives false sense of doing something. Hypervisor is outside TCB and
> > considered hostile.
>
> Since the memory is coherent, it really doesn't matter what the hypervisor
> does in regards to WBINVD (ignore it or perform it). And the hypervisor
> can do anything it wants on any exit, regardless of this intercept.
>

I guess it makes sense to not handle #VE due to WBINVD in the sense that guest
shouldn't do WBINVD when memory is coherent from guest's view, although it is
harmless to make the WBINVD and let hypervisor handle it.

Anyway, the current TDX guest doesn't handle #VE due to WBINVD, so I think for
simplicity we just don't do WBINVD in stop_this_cpu() and relocate_kernel() for
both TDX and SNP/SEV-ES guests.

As mentioned in my earlier reply, we can achieve this by skipping WBINVD when
the CC_ATTR_GUEST_MEM_ENCRYPT is true:

if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
native_wbinvd();

(This skips WBINVD for plain SEV guest too, but this exactly is the current
behaviour of the upstream code, so I don't see any problem.)

Alternatively, we can have a dedicated CPU feature flag such as
X86_FEATURE_NO_WBINVD,

if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
native_wbinvd();

Or, we can just change to our mindset to "do unconditional WBINVD, but not in
virtualized environment":

if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
native_wbinvd();

Hi Boris/Dave/Tom/Kirill,

Do you have any comments?

2024-03-28 19:16:23

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

On Mon, Mar 25, 2024 at 01:04:47PM +0000, Huang, Kai wrote:
> On Fri, 2024-03-22 at 09:50 -0500, Tom Lendacky wrote:
> > On 3/22/24 05:40, Kirill A. Shutemov wrote:
> > > On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
> > > > On 3/20/24 18:10, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
> > > > > >
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
> > > > > > > > Does patch 1 break them?
> > > > > > >
> > > > > > > SNP guests can kexec with some patches that are currently in process
> > > > > > > around shared to private memory conversions. ES guests can only kexec
> > > > > > > with a single vCPU. There was a recent patch series to add support for
> > > > > > > multiple vCPUs.
> > > > > > >
> > > > > > > Patch #1 doesn't break either ES or SNP because we still have an IDT and
> > > > > > > traditional kernel addressing in place, so the #VC can be handled.
> > > > > >
> > > > > > How about plain SEV guest?
> > > > > >
> > > > > > >
> > > > > > > Whereas patch #2 has switched to identity mapping and removed the IDT,
> > > > > > > so a #VC causes a triple fault.
> > > > > >
> > > > > > That makes sense. Thanks.
> > > > > >
> > > > > > Hi Kirill,
> > > > > >
> > > > > > Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
> > > > > > be handled although it causes #VE, while WBINVD in relocate_kernel() will
> > > > > > just triple fault the guest?
> > > > >
> > > > > No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> > > > > only option is to ask host to do this. We cannot guarantee host will do
> > > >
> > > > Is the WBINVD performed or ignored in that case?
> > >
> > > We crash the guest if it tries to use WBINVD. There's no legitimate reason
> > > for it.
> > >
> > > > > anything useful with the request. I guess it can be potential attack
> > > > > vector if host strategically ignores WBINVD to induce bad guest behaviour.
> > > >
> > > > With SNP, memory is coherent so there isn't a need for a WBINVD within a
> > > > guest and so issuing it should not be an issue whether the hypervisor
> > > > performs the operation or not. I don't know what can happen in the case
> > > > where, say, you have a non-coherent TDISP device attached or such, but that
> > > > would be very unusual/unlikely.
> > >
> > > Looks like SNP is in the same position as TDX.
> > >
> > > > > And it is not good from host PoV either. If it does WBINVD on every guest
> > > > > request we get guest->host DoS attack possibility.
> > > >
> > > > Yeah, that can happen today, regardless of the type of VM running.
> > > >
> > > > >
> > > > > Tom, I am curious, how do you deal with these problems?
> > > >
> > > > If the WBINVD is being intercepted, then it will generate a #VC and we use
> > > > the GHCB protocol to communicate that back to the hypervisor to handle.
> > >
> > > I would argue that forwarding it to hypervisor is worse than crashing. It
> > > gives false sense of doing something. Hypervisor is outside TCB and
> > > considered hostile.
> >
> > Since the memory is coherent, it really doesn't matter what the hypervisor
> > does in regards to WBINVD (ignore it or perform it). And the hypervisor
> > can do anything it wants on any exit, regardless of this intercept.
> >
>
> I guess it makes sense to not handle #VE due to WBINVD in the sense that guest
> shouldn't do WBINVD when memory is coherent from guest's view, although it is
> harmless to make the WBINVD and let hypervisor handle it.

>
> Anyway, the current TDX guest doesn't handle #VE due to WBINVD, so I think for
> simplicity we just don't do WBINVD in stop_this_cpu() and relocate_kernel() for
> both TDX and SNP/SEV-ES guests.
>
> As mentioned in my earlier reply, we can achieve this by skipping WBINVD when
> the CC_ATTR_GUEST_MEM_ENCRYPT is true:
>
> if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> native_wbinvd(); ?
>
> (This skips WBINVD for plain SEV guest too, but this exactly is the current
> behaviour of the upstream code, so I don't see any problem.)
>
> Alternatively, we can have a dedicated CPU feature flag such as
> X86_FEATURE_NO_WBINVD,
>
> if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
> native_wbinvd();
>
> Or, we can just change to our mindset to "do unconditional WBINVD, but not in
> virtualized environment":
>
> if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> native_wbinvd();

ACPI_FLUSH_CPU_CACHE() uses cpu_feature_enabled(X86_FEATURE_HYPERVISOR)
check.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-04-01 09:14:05

[permalink] [raw]

Subject: Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

> > > > > >
>
> >
> > Anyway, the current TDX guest doesn't handle #VE due to WBINVD, so I think for
> > simplicity we just don't do WBINVD in stop_this_cpu() and relocate_kernel() for
> > both TDX and SNP/SEV-ES guests.
> >
> > As mentioned in my earlier reply, we can achieve this by skipping WBINVD when
> > the CC_ATTR_GUEST_MEM_ENCRYPT is true:
> >
> > if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> > native_wbinvd();
> >
> > (This skips WBINVD for plain SEV guest too, but this exactly is the current
> > behaviour of the upstream code, so I don't see any problem.)
> >
> > Alternatively, we can have a dedicated CPU feature flag such as
> > X86_FEATURE_NO_WBINVD,
> >
> > if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
> > native_wbinvd();
> >
> > Or, we can just change to our mindset to "do unconditional WBINVD, but not in
> > virtualized environment":
> >
> > if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > native_wbinvd();
>
> ACPI_FLUSH_CPU_CACHE() uses cpu_feature_enabled(X86_FEATURE_HYPERVISOR)
> check.
>
>

Thanks for pointing out this. Yeah I think skipping WBINVD in virtualized
environment makes sense. Will use this way.