The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check. According to the TDX hardware spec, neither of these things
should have happened.
== Background ==
Virtually all kernel memory accesses operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
== Problem ==
A fast warm reset doesn't reset TDX private memory. Kexec() can also
boot into the new kernel directly. Thus if the old kernel has enabled
TDX on the platform with this erratum, the new kernel may get unexpected
machine check.
Note that w/o this erratum any kernel read/write on TDX private memory
should never cause machine check, thus it's OK for the old kernel to
leave TDX private pages as is.
== Solution ==
In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after kexec(). The BIOS is also expected to disable fast warm reset as
a workaround to this erratum, thus this implementation doesn't try to
reset TDX private memory for the reboot case in the kernel but depend on
the BIOS to enable the workaround.
For now TDX private memory can only be PAMT pages. It would be ideal to
cover all types of TDX private memory here (TDX guest private pages and
Secure-EPT pages are yet to be implemented when TDX gets supported in
KVM), but there's no existing infrastructure to track TDX private pages.
It's not feasible to query the TDX module about page type either because
VMX has already been stopped when KVM receives the reboot notifier.
Another option is to blindly convert all memory pages. But this may
bring non-trivial latency to kexec() on large memory systems (especially
when the number of TDX private pages is small). Thus even with this
temporary solution, eventually it's better for the kernel to only reset
TDX private pages. Also, it's problematic to convert all memory pages
because not all pages are mapped as writable in the direct-mapping. The
kernel needs to switch to another page table which maps all pages as
writable (e.g., the identical-mapping table for kexec(), or a new page
table) to do so, but this looks overkill.
Therefore, rather than doing something dramatic, only reset PAMT pages
for now. Do it in machine_kexec() to avoid additional overhead to the
machine reboot/shutdown as the kernel depends on the BIOS to disable
fast warm reset as a workaround for the reboot case.
Signed-off-by: Kai Huang <[email protected]>
---
v11 -> v12:
- Changed comment/changelog to say kernel doesn't try to handle fast
warm reset but depends on BIOS to enable workaround (Kirill)
- Added a new tdx_may_has_private_mem to indicate system may have TDX
private memory and PAMTs/TDMRs are stable to access. (Dave).
- Use atomic_t for tdx_may_has_private_mem for build-in memory barrier
(Dave)
- Changed calling x86_platform.memory_shutdown() to calling
tdx_reset_memory() directly from machine_kexec() to avoid overhead to
normal reboot case.
v10 -> v11:
- New patch
---
arch/x86/include/asm/tdx.h | 2 +
arch/x86/kernel/machine_kexec_64.c | 9 ++++
arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++
3 files changed, 90 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 91416fd600cd..e95c9fbf52e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -100,10 +100,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
bool platform_tdx_enabled(void);
int tdx_cpu_enable(void);
int tdx_enable(void);
+void tdx_reset_memory(void);
#else /* !CONFIG_INTEL_TDX_HOST */
static inline bool platform_tdx_enabled(void) { return false; }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
+static inline void tdx_reset_memory(void) { }
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1a3e2c05a8a5..232253bd7ccd 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
#include <asm/setup.h>
#include <asm/set_memory.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>
#ifdef CONFIG_ACPI
/*
@@ -301,6 +302,14 @@ void machine_kexec(struct kimage *image)
void *control_page;
int save_ftrace_enabled;
+ /*
+ * On the platform with "partial write machine check" erratum,
+ * all TDX private pages need to be converted back to normal
+ * before booting to the new kernel, otherwise the new kernel
+ * may get unexpected machine check.
+ */
+ tdx_reset_memory();
+
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
save_processor_state();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 85b24b2e9417..1107f4227568 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
static struct tdmr_info_list tdx_tdmr_list;
+static atomic_t tdx_may_has_private_mem;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
*/
wbinvd_on_all_cpus();
+ /*
+ * Starting from this point the system may have TDX private
+ * memory. Make it globally visible so tdx_reset_memory() only
+ * reads TDMRs/PAMTs when they are stable.
+ *
+ * Note using atomic_inc_return() to provide the explicit memory
+ * ordering isn't mandatory here as the WBINVD above already
+ * does that. Compiler barrier isn't needed here either.
+ */
+ atomic_inc_return(&tdx_may_has_private_mem);
+
/* Config the key of global KeyID on all packages */
ret = config_global_keyid();
if (ret)
@@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
* as suggested by the TDX spec.
*/
tdmrs_reset_pamt_all(&tdx_tdmr_list);
+ /*
+ * No more TDX private pages now, and PAMTs/TDMRs are
+ * going to be freed. Make this globally visible so
+ * tdx_reset_memory() can read stable TDMRs/PAMTs.
+ *
+ * Note atomic_dec_return(), which is an atomic RMW with
+ * return value, always enforces the memory barrier.
+ */
+ atomic_dec_return(&tdx_may_has_private_mem);
out_free_pamts:
tdmrs_free_pamt_all(&tdx_tdmr_list);
out_free_tdmrs:
@@ -1229,6 +1251,63 @@ int tdx_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_enable);
+/*
+ * Convert TDX private pages back to normal on platforms with
+ * "partial write machine check" erratum.
+ *
+ * Called from machine_kexec() before booting to the new kernel.
+ */
+void tdx_reset_memory(void)
+{
+ if (!platform_tdx_enabled())
+ return;
+
+ /*
+ * Kernel read/write to TDX private memory doesn't
+ * cause machine check on hardware w/o this erratum.
+ */
+ if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+ return;
+
+ /* Called from kexec() when only rebooting cpu is alive */
+ WARN_ON_ONCE(num_online_cpus() != 1);
+
+ if (!atomic_read(&tdx_may_has_private_mem))
+ return;
+
+ /*
+ * Ideally it's better to cover all types of TDX private pages,
+ * but there's no existing infrastructure to tell whether a page
+ * is TDX private memory or not. Using SEAMCALL to query TDX
+ * module isn't feasible either because: 1) VMX has been turned
+ * off by reaching here so SEAMCALL cannot be made; 2) Even
+ * SEAMCALL can be made the result from TDX module may not be
+ * accurate (e.g., remote CPU can be stopped while the kernel
+ * is in the middle of reclaiming one TDX private page and doing
+ * MOVDIR64B).
+ *
+ * One solution could be just converting all memory pages, but
+ * this may bring non-trivial latency on large memory systems
+ * (especially when the number of TDX private pages is small).
+ * So even with this temporary solution, eventually the kernel
+ * should only convert TDX private pages.
+ *
+ * Also, not all pages are mapped as writable in direct mapping,
+ * thus it's problematic to do so. It can be done by switching
+ * to the identical mapping table for kexec() or a new page table
+ * which maps all pages as writable, but the complexity looks
+ * overkill.
+ *
+ * Thus instead of doing something dramatic to convert all pages,
+ * only convert PAMTs as for now TDX private pages can only be
+ * PAMT.
+ *
+ * All other cpus are already dead. TDMRs/PAMTs are stable when
+ * @tdx_may_has_private_mem reads true.
+ */
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
u32 *nr_tdx_keyids)
{
--
2.40.1
On 26.06.23 г. 17:12 ч., Kai Huang wrote:
<snip>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 85b24b2e9417..1107f4227568 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
>
> static struct tdmr_info_list tdx_tdmr_list;
>
> +static atomic_t tdx_may_has_private_mem;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
> */
> wbinvd_on_all_cpus();
>
> + /*
> + * Starting from this point the system may have TDX private
> + * memory. Make it globally visible so tdx_reset_memory() only
> + * reads TDMRs/PAMTs when they are stable.
> + *
> + * Note using atomic_inc_return() to provide the explicit memory
> + * ordering isn't mandatory here as the WBINVD above already
> + * does that. Compiler barrier isn't needed here either.
> + */
If it's not needed, then why use it? Simply do atomic_inc() and instead
rephrase the comment to state what are the ordering guarantees and how
they are achieved (i.e by using wbinvd above).
> + atomic_inc_return(&tdx_may_has_private_mem);
> +
> /* Config the key of global KeyID on all packages */
> ret = config_global_keyid();
> if (ret)
> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> * as suggested by the TDX spec.
> */
> tdmrs_reset_pamt_all(&tdx_tdmr_list);
> + /*
> + * No more TDX private pages now, and PAMTs/TDMRs are
> + * going to be freed. Make this globally visible so
> + * tdx_reset_memory() can read stable TDMRs/PAMTs.
> + *
> + * Note atomic_dec_return(), which is an atomic RMW with
> + * return value, always enforces the memory barrier.
> + */
> + atomic_dec_return(&tdx_may_has_private_mem);
Make a comment here which either refers to the comment at the increment
site.
> out_free_pamts:
> tdmrs_free_pamt_all(&tdx_tdmr_list);
> out_free_tdmrs:
> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> }
> EXPORT_SYMBOL_GPL(tdx_enable);
>
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> + if (!platform_tdx_enabled())
> + return;
> +
> + /*
> + * Kernel read/write to TDX private memory doesn't
> + * cause machine check on hardware w/o this erratum.
> + */
> + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> + return;
> +
> + /* Called from kexec() when only rebooting cpu is alive */
> + WARN_ON_ONCE(num_online_cpus() != 1);
> +
> + if (!atomic_read(&tdx_may_has_private_mem))
> + return;
I think a comment is warranted here explicitly calling our the ordering
requirement/guarantees. Actually this is a non-rmw operation so it
doesn't have any bearing on the ordering/implicit mb's achieved at the
"increment" site.
<snip>
On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
> */
> wbinvd_on_all_cpus();
>
> + /*
> + * Starting from this point the system may have TDX private
> + * memory. Make it globally visible so tdx_reset_memory() only
> + * reads TDMRs/PAMTs when they are stable.
> + *
> + * Note using atomic_inc_return() to provide the explicit memory
> + * ordering isn't mandatory here as the WBINVD above already
> + * does that. Compiler barrier isn't needed here either.
> + */
> + atomic_inc_return(&tdx_may_has_private_mem);
Why do we need atomics at all here? Writers seems serialized with
tdx_module_lock and reader accesses the variable when all CPUs, but one is
down and cannot race.
Hm?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, 2023-06-28 at 15:29 +0300, [email protected] wrote:
> On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> > @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
> > */
> > wbinvd_on_all_cpus();
> >
> > + /*
> > + * Starting from this point the system may have TDX private
> > + * memory. Make it globally visible so tdx_reset_memory() only
> > + * reads TDMRs/PAMTs when they are stable.
> > + *
> > + * Note using atomic_inc_return() to provide the explicit memory
> > + * ordering isn't mandatory here as the WBINVD above already
> > + * does that. Compiler barrier isn't needed here either.
> > + */
> > + atomic_inc_return(&tdx_may_has_private_mem);
>
> Why do we need atomics at all here? Writers seems serialized with
> tdx_module_lock and reader accesses the variable when all CPUs, but one is
> down and cannot race.
>
In kexec() the reader reads this when all remote cpus are dead w/o holding
module lock. All remote cpus can be stopped at _ANY_ time, meaning they can be
stopped right in any place in middle of init_tdx_module(). IIUC the module lock
doesn't help here.
On 6/28/23 02:20, Nikolay Borisov wrote:
>>
>> + /*
>> + * Starting from this point the system may have TDX private
>> + * memory. Make it globally visible so tdx_reset_memory() only
>> + * reads TDMRs/PAMTs when they are stable.
>> + *
>> + * Note using atomic_inc_return() to provide the explicit memory
>> + * ordering isn't mandatory here as the WBINVD above already
>> + * does that. Compiler barrier isn't needed here either.
>> + */
>
> If it's not needed, then why use it? Simply do atomic_inc() and instead
> rephrase the comment to state what are the ordering guarantees and how
> they are achieved (i.e by using wbinvd above).
Even better, explain why the barrier needs to be there and *IGNORE* the
WBVIND.
If the WBINVD gets moved -- or if the gods ever bless us with a halfway
reasonable way to flush the caches that's not full serializing -- this
code is screwed.
There is _zero_ reason to try and "optimize" this junk by trying to get
rid of a memory barrier at the risk of screwing it over later.
I use "optimize" in quotes because that's a highly charitable way of
describing this activity.
On Wed, 2023-06-28 at 17:32 -0700, Dave Hansen wrote:
> On 6/28/23 02:20, Nikolay Borisov wrote:
> > >
> > > + /*
> > > + * Starting from this point the system may have TDX private
> > > + * memory. Make it globally visible so tdx_reset_memory() only
> > > + * reads TDMRs/PAMTs when they are stable.
> > > + *
> > > + * Note using atomic_inc_return() to provide the explicit memory
> > > + * ordering isn't mandatory here as the WBINVD above already
> > > + * does that. Compiler barrier isn't needed here either.
> > > + */
> >
> > If it's not needed, then why use it? Simply do atomic_inc() and instead
> > rephrase the comment to state what are the ordering guarantees and how
> > they are achieved (i.e by using wbinvd above).
>
> Even better, explain why the barrier needs to be there and *IGNORE* the
> WBVIND.
>
> If the WBINVD gets moved -- or if the gods ever bless us with a halfway
> reasonable way to flush the caches that's not full serializing -- this
> code is screwed.
>
> There is _zero_ reason to try and "optimize" this junk by trying to get
> rid of a memory barrier at the risk of screwing it over later.
>
> I use "optimize" in quotes because that's a highly charitable way of
> describing this activity.
>
Agreed. I'll try to explain this well and come back.
Thanks!
On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > + atomic_inc_return(&tdx_may_has_private_mem);
> > +
> > /* Config the key of global KeyID on all packages */
> > ret = config_global_keyid();
> > if (ret)
> > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> > * as suggested by the TDX spec.
> > */
> > tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > + /*
> > + * No more TDX private pages now, and PAMTs/TDMRs are
> > + * going to be freed. Make this globally visible so
> > + * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > + *
> > + * Note atomic_dec_return(), which is an atomic RMW with
> > + * return value, always enforces the memory barrier.
> > + */
> > + atomic_dec_return(&tdx_may_has_private_mem);
>
> Make a comment here which either refers to the comment at the increment
> site.
I guess I got your point. Will try to make better comments.
>
> > out_free_pamts:
> > tdmrs_free_pamt_all(&tdx_tdmr_list);
> > out_free_tdmrs:
> > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> > }
> > EXPORT_SYMBOL_GPL(tdx_enable);
> >
> > +/*
> > + * Convert TDX private pages back to normal on platforms with
> > + * "partial write machine check" erratum.
> > + *
> > + * Called from machine_kexec() before booting to the new kernel.
> > + */
> > +void tdx_reset_memory(void)
> > +{
> > + if (!platform_tdx_enabled())
> > + return;
> > +
> > + /*
> > + * Kernel read/write to TDX private memory doesn't
> > + * cause machine check on hardware w/o this erratum.
> > + */
> > + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > + return;
> > +
> > + /* Called from kexec() when only rebooting cpu is alive */
> > + WARN_ON_ONCE(num_online_cpus() != 1);
> > +
> > + if (!atomic_read(&tdx_may_has_private_mem))
> > + return;
>
> I think a comment is warranted here explicitly calling our the ordering
> requirement/guarantees. Actually this is a non-rmw operation so it
> doesn't have any bearing on the ordering/implicit mb's achieved at the
> "increment" site.
We don't need explicit ordering/barrier here, if I am not missing something.
The atomic_{inc/dec}_return() already made sure the memory ordering -- which
guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
be stable.
Quoted from Documentation/atomic_t.txt:
"
- RMW operations that have a return value are fully ordered;
...
Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.
"
Am I missing anything?
On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > > + atomic_inc_return(&tdx_may_has_private_mem);
> > > +
> > > /* Config the key of global KeyID on all packages */
> > > ret = config_global_keyid();
> > > if (ret)
> > > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> > > * as suggested by the TDX spec.
> > > */
> > > tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > > + /*
> > > + * No more TDX private pages now, and PAMTs/TDMRs are
> > > + * going to be freed. Make this globally visible so
> > > + * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > > + *
> > > + * Note atomic_dec_return(), which is an atomic RMW with
> > > + * return value, always enforces the memory barrier.
> > > + */
> > > + atomic_dec_return(&tdx_may_has_private_mem);
> >
> > Make a comment here which either refers to the comment at the increment
> > site.
>
> I guess I got your point. Will try to make better comments.
>
> >
> > > out_free_pamts:
> > > tdmrs_free_pamt_all(&tdx_tdmr_list);
> > > out_free_tdmrs:
> > > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> > > }
> > > EXPORT_SYMBOL_GPL(tdx_enable);
> > >
> > > +/*
> > > + * Convert TDX private pages back to normal on platforms with
> > > + * "partial write machine check" erratum.
> > > + *
> > > + * Called from machine_kexec() before booting to the new kernel.
> > > + */
> > > +void tdx_reset_memory(void)
> > > +{
> > > + if (!platform_tdx_enabled())
> > > + return;
> > > +
> > > + /*
> > > + * Kernel read/write to TDX private memory doesn't
> > > + * cause machine check on hardware w/o this erratum.
> > > + */
> > > + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > > + return;
> > > +
> > > + /* Called from kexec() when only rebooting cpu is alive */
> > > + WARN_ON_ONCE(num_online_cpus() != 1);
> > > +
> > > + if (!atomic_read(&tdx_may_has_private_mem))
> > > + return;
> >
> > I think a comment is warranted here explicitly calling our the ordering
> > requirement/guarantees. Actually this is a non-rmw operation so it
> > doesn't have any bearing on the ordering/implicit mb's achieved at the
> > "increment" site.
>
> We don't need explicit ordering/barrier here, if I am not missing something.
> The atomic_{inc/dec}_return() already made sure the memory ordering -- which
> guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
> be stable.
>
> Quoted from Documentation/atomic_t.txt:
>
> "
> - RMW operations that have a return value are fully ordered;
>
> ...
>
> Fully ordered primitives are ordered against everything prior and everything
> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
> before and an smp_mb() after the primitive.
> "
>
>
> Am I missing anything?
OK I guess I figured out by myself after more thinking. Although the
atomic_{inc|dec}_return() code path has guaranteed when @tdx_may_has_private_mem
is true, TDMRs/PAMTs are stable, but here in the reading path, the code below
tdmrs_reset_pamt_all(&tdx_tdmr_list);
may still be executed speculatively before the if () statement completes
if (!atomic_read(&tdx_may_has_private_mem))
return;
So we need CPU memory barrier instead of compiler barrier.
I'll change to use "RMW with return value" to provide the CPU barrier.
Thanks for the feedback!
On 29.06.23 г. 12:45 ч., Huang, Kai wrote:
> On Thu, 2023-06-29 at 05:38 +0000, Huang, Kai wrote:
>> On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
>>> On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
>>>>> + atomic_inc_return(&tdx_may_has_private_mem);
>>>>> +
>>>>> /* Config the key of global KeyID on all packages */
>>>>> ret = config_global_keyid();
>>>>> if (ret)
>>>>> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
>>>>> * as suggested by the TDX spec.
>>>>> */
>>>>> tdmrs_reset_pamt_all(&tdx_tdmr_list);
>>>>> + /*
>>>>> + * No more TDX private pages now, and PAMTs/TDMRs are
>>>>> + * going to be freed. Make this globally visible so
>>>>> + * tdx_reset_memory() can read stable TDMRs/PAMTs.
>>>>> + *
>>>>> + * Note atomic_dec_return(), which is an atomic RMW with
>>>>> + * return value, always enforces the memory barrier.
>>>>> + */
>>>>> + atomic_dec_return(&tdx_may_has_private_mem);
>>>>
>>>> Make a comment here which either refers to the comment at the increment
>>>> site.
>>>
>>> I guess I got your point. Will try to make better comments.
>>>
>>>>
>>>>> out_free_pamts:
>>>>> tdmrs_free_pamt_all(&tdx_tdmr_list);
>>>>> out_free_tdmrs:
>>>>> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(tdx_enable);
>>>>>
>>>>> +/*
>>>>> + * Convert TDX private pages back to normal on platforms with
>>>>> + * "partial write machine check" erratum.
>>>>> + *
>>>>> + * Called from machine_kexec() before booting to the new kernel.
>>>>> + */
>>>>> +void tdx_reset_memory(void)
>>>>> +{
>>>>> + if (!platform_tdx_enabled())
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * Kernel read/write to TDX private memory doesn't
>>>>> + * cause machine check on hardware w/o this erratum.
>>>>> + */
>>>>> + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>>>>> + return;
>>>>> +
>>>>> + /* Called from kexec() when only rebooting cpu is alive */
>>>>> + WARN_ON_ONCE(num_online_cpus() != 1);
>>>>> +
>>>>> + if (!atomic_read(&tdx_may_has_private_mem))
>>>>> + return;
>>>>
>>>> I think a comment is warranted here explicitly calling our the ordering
>>>> requirement/guarantees. Actually this is a non-rmw operation so it
>>>> doesn't have any bearing on the ordering/implicit mb's achieved at the
>>>> "increment" site.
>>>
>>> We don't need explicit ordering/barrier here, if I am not missing something.
>>> The atomic_{inc/dec}_return() already made sure the memory ordering -- which
>>> guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
>>> be stable.
>>>
>>> Quoted from Documentation/atomic_t.txt:
>>>
>>> "
>>> - RMW operations that have a return value are fully ordered;
>>>
>>> ...
>>>
>>> Fully ordered primitives are ordered against everything prior and everything
>>> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
>>> before and an smp_mb() after the primitive.
>>> "
>>>
>>>
>>> Am I missing anything?
>>
>> OK I guess I figured out by myself after more thinking. Although the
>> atomic_{inc|dec}_return() code path has guaranteed when @tdx_may_has_private_mem
>> is true, TDMRs/PAMTs are stable, but here in the reading path, the code below
>>
>> tdmrs_reset_pamt_all(&tdx_tdmr_list);
>>
>> may still be executed speculatively before the if () statement completes
>>
>> if (!atomic_read(&tdx_may_has_private_mem))
>> return;
>>
>> So we need CPU memory barrier instead of compiler barrier.
>>
>
> (Sorry for multiple replies)
>
> Hmm.. reading the SDM more carefully, the speculative execution shouldn't
> matter. It may cause instruction/data being fetched to the cache, etc, but the
> instruction shouldn't take effort unless the above branch predication truly
> turns out to be the right result.
>
> What matters is memory reads/writes order. On x86, per SDM on single processor
> (which is the case here) basically reads/writes are not reordered:
>
> "
> In a single-processor system for memory regions defined as write-back cacheable,
> the memory-ordering model respects the following principles ...:
> • Reads are not reordered with other reads.
> • Writes are not reordered with older reads.
> • Writes to memory are not reordered with other writes, with the following
> exceptions:
> (string operations/non-temporal moves)
> ...
> "
>
> So in practice there should be no problem. But I will just use the correct
> atomic_t operations to force a memory barrier here.
as-per memory-barriers.txt there needs to be proper comment explaining a
particular ordering scenario. Let's not forget that this code will have
to be maintained for years to come not necessarily by you and the poor
sod that comes after you should be provided all the help in terms of
context to understand the code :)
On Thu, 2023-06-29 at 05:38 +0000, Huang, Kai wrote:
> On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > > > + atomic_inc_return(&tdx_may_has_private_mem);
> > > > +
> > > > /* Config the key of global KeyID on all packages */
> > > > ret = config_global_keyid();
> > > > if (ret)
> > > > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> > > > * as suggested by the TDX spec.
> > > > */
> > > > tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > > > + /*
> > > > + * No more TDX private pages now, and PAMTs/TDMRs are
> > > > + * going to be freed. Make this globally visible so
> > > > + * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > > > + *
> > > > + * Note atomic_dec_return(), which is an atomic RMW with
> > > > + * return value, always enforces the memory barrier.
> > > > + */
> > > > + atomic_dec_return(&tdx_may_has_private_mem);
> > >
> > > Make a comment here which either refers to the comment at the increment
> > > site.
> >
> > I guess I got your point. Will try to make better comments.
> >
> > >
> > > > out_free_pamts:
> > > > tdmrs_free_pamt_all(&tdx_tdmr_list);
> > > > out_free_tdmrs:
> > > > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> > > > }
> > > > EXPORT_SYMBOL_GPL(tdx_enable);
> > > >
> > > > +/*
> > > > + * Convert TDX private pages back to normal on platforms with
> > > > + * "partial write machine check" erratum.
> > > > + *
> > > > + * Called from machine_kexec() before booting to the new kernel.
> > > > + */
> > > > +void tdx_reset_memory(void)
> > > > +{
> > > > + if (!platform_tdx_enabled())
> > > > + return;
> > > > +
> > > > + /*
> > > > + * Kernel read/write to TDX private memory doesn't
> > > > + * cause machine check on hardware w/o this erratum.
> > > > + */
> > > > + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > > > + return;
> > > > +
> > > > + /* Called from kexec() when only rebooting cpu is alive */
> > > > + WARN_ON_ONCE(num_online_cpus() != 1);
> > > > +
> > > > + if (!atomic_read(&tdx_may_has_private_mem))
> > > > + return;
> > >
> > > I think a comment is warranted here explicitly calling our the ordering
> > > requirement/guarantees. Actually this is a non-rmw operation so it
> > > doesn't have any bearing on the ordering/implicit mb's achieved at the
> > > "increment" site.
> >
> > We don't need explicit ordering/barrier here, if I am not missing something.
> > The atomic_{inc/dec}_return() already made sure the memory ordering -- which
> > guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
> > be stable.
> >
> > Quoted from Documentation/atomic_t.txt:
> >
> > "
> > - RMW operations that have a return value are fully ordered;
> >
> > ...
> >
> > Fully ordered primitives are ordered against everything prior and everything
> > subsequent. Therefore a fully ordered primitive is like having an smp_mb()
> > before and an smp_mb() after the primitive.
> > "
> >
> >
> > Am I missing anything?
>
> OK I guess I figured out by myself after more thinking. Although the
> atomic_{inc|dec}_return() code path has guaranteed when @tdx_may_has_private_mem
> is true, TDMRs/PAMTs are stable, but here in the reading path, the code below
>
> tdmrs_reset_pamt_all(&tdx_tdmr_list);
>
> may still be executed speculatively before the if () statement completes
>
> if (!atomic_read(&tdx_may_has_private_mem))
> return;
>
> So we need CPU memory barrier instead of compiler barrier.
>
(Sorry for multiple replies)
Hmm.. reading the SDM more carefully, the speculative execution shouldn't
matter. It may cause instruction/data being fetched to the cache, etc, but the
instruction shouldn't take effort unless the above branch predication truly
turns out to be the right result.
What matters is memory reads/writes order. On x86, per SDM on single processor
(which is the case here) basically reads/writes are not reordered:
"
In a single-processor system for memory regions defined as write-back cacheable,
the memory-ordering model respects the following principles ...:
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following
exceptions:
(string operations/non-temporal moves)
...
"
So in practice there should be no problem. But I will just use the correct
atomic_t operations to force a memory barrier here.
On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> The first few generations of TDX hardware have an erratum. A partial
> write to a TDX private memory cacheline will silently "poison" the
> line. Subsequent reads will consume the poison and generate a machine
> check. According to the TDX hardware spec, neither of these things
> should have happened.
>
> == Background ==
>
> Virtually all kernel memory accesses operations happen in full
> cachelines. In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> == Problem ==
>
> A fast warm reset doesn't reset TDX private memory. Kexec() can also
> boot into the new kernel directly. Thus if the old kernel has enabled
> TDX on the platform with this erratum, the new kernel may get unexpected
> machine check.
>
> Note that w/o this erratum any kernel read/write on TDX private memory
> should never cause machine check, thus it's OK for the old kernel to
> leave TDX private pages as is.
>
> == Solution ==
>
> In short, with this erratum, the kernel needs to explicitly convert all
> TDX private pages back to normal to give the new kernel a clean slate
> after kexec(). The BIOS is also expected to disable fast warm reset as
> a workaround to this erratum, thus this implementation doesn't try to
> reset TDX private memory for the reboot case in the kernel but depend on
> the BIOS to enable the workaround.
>
> For now TDX private memory can only be PAMT pages. It would be ideal to
> cover all types of TDX private memory here (TDX guest private pages and
> Secure-EPT pages are yet to be implemented when TDX gets supported in
> KVM), but there's no existing infrastructure to track TDX private pages.
> It's not feasible to query the TDX module about page type either because
> VMX has already been stopped when KVM receives the reboot notifier.
>
> Another option is to blindly convert all memory pages. But this may
> bring non-trivial latency to kexec() on large memory systems (especially
> when the number of TDX private pages is small). Thus even with this
> temporary solution, eventually it's better for the kernel to only reset
> TDX private pages. Also, it's problematic to convert all memory pages
> because not all pages are mapped as writable in the direct-mapping. The
> kernel needs to switch to another page table which maps all pages as
> writable (e.g., the identical-mapping table for kexec(), or a new page
> table) to do so, but this looks overkill.
>
> Therefore, rather than doing something dramatic, only reset PAMT pages
> for now. Do it in machine_kexec() to avoid additional overhead to the
> machine reboot/shutdown as the kernel depends on the BIOS to disable
> fast warm reset as a workaround for the reboot case.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v11 -> v12:
> - Changed comment/changelog to say kernel doesn't try to handle fast
> warm reset but depends on BIOS to enable workaround (Kirill)
> - Added a new tdx_may_has_private_mem to indicate system may have TDX
> private memory and PAMTs/TDMRs are stable to access. (Dave).
> - Use atomic_t for tdx_may_has_private_mem for build-in memory barrier
> (Dave)
> - Changed calling x86_platform.memory_shutdown() to calling
> tdx_reset_memory() directly from machine_kexec() to avoid overhead to
> normal reboot case.
>
> v10 -> v11:
> - New patch
>
>
> ---
> arch/x86/include/asm/tdx.h | 2 +
> arch/x86/kernel/machine_kexec_64.c | 9 ++++
> arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++
> 3 files changed, 90 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 91416fd600cd..e95c9fbf52e4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -100,10 +100,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> bool platform_tdx_enabled(void);
> int tdx_cpu_enable(void);
> int tdx_enable(void);
> +void tdx_reset_memory(void);
> #else /* !CONFIG_INTEL_TDX_HOST */
> static inline bool platform_tdx_enabled(void) { return false; }
> static inline int tdx_cpu_enable(void) { return -ENODEV; }
> static inline int tdx_enable(void) { return -ENODEV; }
> +static inline void tdx_reset_memory(void) { }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 1a3e2c05a8a5..232253bd7ccd 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
> #include <asm/setup.h>
> #include <asm/set_memory.h>
> #include <asm/cpu.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_ACPI
> /*
> @@ -301,6 +302,14 @@ void machine_kexec(struct kimage *image)
> void *control_page;
> int save_ftrace_enabled;
>
> + /*
> + * On the platform with "partial write machine check" erratum,
> + * all TDX private pages need to be converted back to normal
> + * before booting to the new kernel, otherwise the new kernel
> + * may get unexpected machine check.
> + */
> + tdx_reset_memory();
> +
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> save_processor_state();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 85b24b2e9417..1107f4227568 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
>
> static struct tdmr_info_list tdx_tdmr_list;
>
> +static atomic_t tdx_may_has_private_mem;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
> */
> wbinvd_on_all_cpus();
>
> + /*
> + * Starting from this point the system may have TDX private
> + * memory. Make it globally visible so tdx_reset_memory() only
> + * reads TDMRs/PAMTs when they are stable.
> + *
> + * Note using atomic_inc_return() to provide the explicit memory
> + * ordering isn't mandatory here as the WBINVD above already
WBINVD is serial instruction to make sure all things happen before
it must be committed before finish the exection of this instruction,
but it should not impact the instructions after it.
(SDM Vol.3 9.3 Jun 2023)
I think the atomic operation used below is to make sure the
change to tdx_may_has_private_mem becomes visible immediately
to other LPs which read it, e.g running tdx_reset_memory().
atomic_inc() should be enough for this case because the
locked Instructions are total order.
(SDM Vol.3 9.2.3.8 June 2023).
So per my understanding the key here is the atomic
operation's guarantee on memory changes visibility, not the
guarantee from WBINVD, the comment should be changed if
this is the correct understanding.
> + * does that. Compiler barrier isn't needed here either.
> + */
> + atomic_inc_return(&tdx_may_has_private_mem);
> +
> /* Config the key of global KeyID on all packages */
> ret = config_global_keyid();
> if (ret)
> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> * as suggested by the TDX spec.
> */
> tdmrs_reset_pamt_all(&tdx_tdmr_list);
> + /*
> + * No more TDX private pages now, and PAMTs/TDMRs are
> + * going to be freed. Make this globally visible so
> + * tdx_reset_memory() can read stable TDMRs/PAMTs.
> + *
> + * Note atomic_dec_return(), which is an atomic RMW with
> + * return value, always enforces the memory barrier.
> + */
> + atomic_dec_return(&tdx_may_has_private_mem);
> out_free_pamts:
> tdmrs_free_pamt_all(&tdx_tdmr_list);
> out_free_tdmrs:
> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> }
> EXPORT_SYMBOL_GPL(tdx_enable);
>
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> + if (!platform_tdx_enabled())
> + return;
> +
> + /*
> + * Kernel read/write to TDX private memory doesn't
> + * cause machine check on hardware w/o this erratum.
> + */
> + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> + return;
> +
> + /* Called from kexec() when only rebooting cpu is alive */
> + WARN_ON_ONCE(num_online_cpus() != 1);
> +
> + if (!atomic_read(&tdx_may_has_private_mem))
> + return;
> +
> + /*
> + * Ideally it's better to cover all types of TDX private pages,
> + * but there's no existing infrastructure to tell whether a page
> + * is TDX private memory or not. Using SEAMCALL to query TDX
> + * module isn't feasible either because: 1) VMX has been turned
> + * off by reaching here so SEAMCALL cannot be made; 2) Even
> + * SEAMCALL can be made the result from TDX module may not be
> + * accurate (e.g., remote CPU can be stopped while the kernel
> + * is in the middle of reclaiming one TDX private page and doing
> + * MOVDIR64B).
> + *
> + * One solution could be just converting all memory pages, but
> + * this may bring non-trivial latency on large memory systems
> + * (especially when the number of TDX private pages is small).
> + * So even with this temporary solution, eventually the kernel
> + * should only convert TDX private pages.
> + *
> + * Also, not all pages are mapped as writable in direct mapping,
> + * thus it's problematic to do so. It can be done by switching
> + * to the identical mapping table for kexec() or a new page table
> + * which maps all pages as writable, but the complexity looks
> + * overkill.
> + *
> + * Thus instead of doing something dramatic to convert all pages,
> + * only convert PAMTs as for now TDX private pages can only be
> + * PAMT.
> + *
> + * All other cpus are already dead. TDMRs/PAMTs are stable when
> + * @tdx_may_has_private_mem reads true.
> + */
> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
> static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> u32 *nr_tdx_keyids)
> {
> --
> 2.40.1
>