2023-06-04 14:59:25

by Kai Huang

Subject: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

TDX memory has integrity and confidentiality protections. Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself. In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.

Alas, the first few generations of TDX hardware have an erratum. A
"partial" write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.

Virtually all kernel memory accesses happen in full cachelines. In
practice, writing a "byte" of memory usually reads a 64-byte cacheline,
modifies it, then writes the whole line back. Those operations do not
trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
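
To make the distinction concrete, here is a rough userspace sketch (not
part of this patch; the function names are made up, and the non-temporal
store uses the SSE2 MOVNTI intrinsic):

#include <emmintrin.h>
#include <stdint.h>

void cached_store(uint8_t *p)
{
        *p = 0xff;      /* write-back mapping: the CPU does a full
                           64-byte cacheline read-modify-write */
}

void nontemporal_store(int64_t *p)
{
        /* MOVNTI: can land at the memory controller as a partial write */
        _mm_stream_si64((long long *)p, 0);
}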

With this erratum, additional things need to be done around the machine
check handler, kexec(), etc. Similar to other CPU bugs, use a CPU bug
bit to indicate this erratum, and detect it during early boot. Note
this bug reflects the hardware, thus it is detected regardless of
whether the kernel is built with TDX support or not.
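
For reference, a minimal sketch (not part of this patch) of how later
code would typically consult the bit; the helper name below is made up,
but boot_cpu_has_bug() is the standard accessor:

/* e.g. in the #MC handler or kexec() path; needs <asm/cpufeature.h> */
static inline bool tdx_needs_pw_mce_workaround(void)
{
        return boot_cpu_has_bug(X86_BUG_TDX_PW_MCE);
}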

Signed-off-by: Kai Huang <[email protected]>
---

v10 -> v11:
- New patch

---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..dc8701f8d88b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -483,5 +483,6 @@
#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
#define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
+#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 1c4639588ff9..251b333e53d2 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)

return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
}
+
+/*
+ * These CPUs have an erratum. A partial write from non-TD
+ * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+ * private memory poisons that memory, and a subsequent read of
+ * that memory triggers #MC.
+ */
+static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
+       X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
+       { }
+};
+
+static int __init tdx_erratum_detect(void)
+{
+       if (x86_match_cpu(tdx_pw_mce_cpu_ids))
+               setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
+
+       return 0;
+}
+early_initcall(tdx_erratum_detect);
--
2.40.1



2023-06-06 12:50:35

by Kirill A. Shutemov

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Mon, Jun 05, 2023 at 02:27:17AM +1200, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
>
> Alas, the first few generations of TDX hardware have an erratum. A
> "partial" write to a TDX private memory cacheline will silently "poison"
> the line. Subsequent reads will consume the poison and generate a
> machine check. According to the TDX hardware spec, neither of these
> things should have happened.
>
> Virtually all kernel memory accesses happen in full cachelines. In
> practice, writing a "byte" of memory usually reads a 64-byte cacheline,
> modifies it, then writes the whole line back. Those operations do not
> trigger this problem.
>
> This problem is triggered by "partial" writes, where a write transaction
> of less than a cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> With this erratum, additional things need to be done around the machine
> check handler, kexec(), etc. Similar to other CPU bugs, use a CPU bug
> bit to indicate this erratum, and detect it during early boot. Note
> this bug reflects the hardware, thus it is detected regardless of
> whether the kernel is built with TDX support or not.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index cb8ca46213be..dc8701f8d88b 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -483,5 +483,6 @@
> #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 1c4639588ff9..251b333e53d2 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
>
> return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> }
> +
> +/*
> + * These CPUs have an erratum. A partial write from non-TD
> + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> + * private memory poisons that memory, and a subsequent read of
> + * that memory triggers #MC.
> + */
> +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> + { }
> +};
> +
> +static int __init tdx_erratum_detect(void)
> +{
> + if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> +
> + return 0;
> +}
> +early_initcall(tdx_erratum_detect);

Initcall? Don't we already have a codepath to call it directly?
Maybe cpu_set_bug_bits()?

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-06 23:20:07

by Kai Huang

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Tue, 2023-06-06 at 15:38 +0300, [email protected] wrote:
> On Mon, Jun 05, 2023 at 02:27:17AM +1200, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> >
> > Alas, the first few generations of TDX hardware have an erratum. A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line. Subsequent reads will consume the poison and generate a
> > machine check. According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > Virtually all kernel memory accesses happen in full cachelines. In
> > practice, writing a "byte" of memory usually reads a 64-byte cacheline,
> > modifies it, then writes the whole line back. Those operations do not
> > trigger this problem.
> >
> > This problem is triggered by "partial" writes, where a write transaction
> > of less than a cacheline lands at the memory controller. The CPU does
> > these via non-temporal write instructions (like MOVNTI), or through
> > UC/WC memory mappings. The issue can also be triggered away from the
> > CPU by devices doing partial writes via DMA.
> >
> > With this erratum, additional things need to be done around the machine
> > check handler, kexec(), etc. Similar to other CPU bugs, use a CPU bug
> > bit to indicate this erratum, and detect it during early boot. Note
> > this bug reflects the hardware, thus it is detected regardless of
> > whether the kernel is built with TDX support or not.
> >
> > Signed-off-by: Kai Huang <[email protected]>
> > ---
> >
> > v10 -> v11:
> > - New patch
> >
> > ---
> > arch/x86/include/asm/cpufeatures.h | 1 +
> > arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index cb8ca46213be..dc8701f8d88b 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -483,5 +483,6 @@
> > #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> > #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> > #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> > +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
> >
> > #endif /* _ASM_X86_CPUFEATURES_H */
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 1c4639588ff9..251b333e53d2 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
> >
> > return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> > }
> > +
> > +/*
> > + * These CPUs have an erratum. A partial write from non-TD
> > + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> > + * private memory poisons that memory, and a subsequent read of
> > + * that memory triggers #MC.
> > + */
> > +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> > + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> > + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> > + { }
> > +};
> > +
> > +static int __init tdx_erratum_detect(void)
> > +{
> > + if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> > + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> > +
> > + return 0;
> > +}
> > +early_initcall(tdx_erratum_detect);
>
> Initcall? Don't we already have a codepath to call it directly?
> Maybe cpu_set_bug_bits()?
>
I didn't like doing it in cpu_set_bug_bits() because the bugs handled in
that function seem to have some interdependency. For instance, if a CPU
is in the NO_SPECULATION whitelist, the function simply returns and
assumes all other bugs are not present:

static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
        u64 ia32_cap = x86_read_arch_cap_msr();

        ...

        if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
                return;

        setup_force_cpu_bug(X86_BUG_SPECTRE_V1);

        ...
}

This TDX erratum is quite self-contained, thus I think using an initcall
is the cleanest way to do it.

And there are other bug flags that are handled in other places but not in
cpu_set_bug_bits(), for instance,

static void init_intel(struct cpuinfo_x86 *c)
{
        ...

        if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
            ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
                set_cpu_bug(c, X86_BUG_MONITOR);

        ...
}

So it seems there's no hard rule that all bugs need to be done in
cpu_set_bug_bits().


2023-06-07 14:33:21

by Dave Hansen

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 6/4/23 07:27, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.

At the risk of patting myself on the back by acking a changelog that I
wrote 95% of:

Reviewed-by: Dave Hansen <[email protected]>


2023-06-07 15:18:12

by Kirill A. Shutemov

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Tue, Jun 06, 2023 at 10:58:04PM +0000, Huang, Kai wrote:
> So it seems there's no hard rule that all bugs need to be done in
> cpu_set_bug_bits().

Yes, CPU identification is a mess, but an initcall makes it worse. An
initcall is a lazy way out that contributes to the mess. Maybe
cpu_set_bug_bits() is the wrong place; find the right one.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-07 22:56:10

by Kai Huang

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Wed, 2023-06-07 at 07:15 -0700, Hansen, Dave wrote:
> On 6/4/23 07:27, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
>
> At the risk of patting myself on the back by acking a changelog that I
> wrote 95% of:
>
> Reviewed-by: Dave Hansen <[email protected]>
>

Thanks!

2023-06-19 12:21:00

by Kai Huang

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Wed, 2023-06-07 at 22:43 +0000, Huang, Kai wrote:
> On Wed, 2023-06-07 at 07:15 -0700, Hansen, Dave wrote:
> > On 6/4/23 07:27, Kai Huang wrote:
> > > TDX memory has integrity and confidentiality protections. Violations of
> > > this integrity protection are supposed to only affect TDX operations and
> > > are never supposed to affect the host kernel itself. In other words,
> > > the host kernel should never, itself, see machine checks induced by the
> > > TDX integrity hardware.
> >
> > At the risk of patting myself on the back by acking a changelog that I
> > wrote 95% of:
> >
> > Reviewed-by: Dave Hansen <[email protected]>
> >
>
> Thanks!

Hi Dave,

Thanks for reviewing and providing the tag. However, I found there's a
bug if we use early_initcall() to detect the erratum here -- in the later
kexec() patch, early_initcall(tdx_init) sets up the
x86_platform.memory_shutdown() callback to reset TDX private memory
depending on the presence of the erratum, but there's no guarantee the
erratum detection will be done before tdx_init() because they are both
early_initcall()s.

Kirill also said early_initcall() isn't the right place, so I moved the
detection to an earlier phase, in bsp_init_intel(), because we only need
to match the CPU once, on the BSP, assuming the CPU model is consistent
across all CPUs (which is the assumption of x86_match_cpu() anyway).

Please let me know for any comments?

+/*
+ * These CPUs have an erratum. A partial write from non-TD
+ * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+ * private memory poisons that memory, and a subsequent read of
+ * that memory triggers #MC.
+ */
+static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
+       X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
+       { }
+};
+
 static void bsp_init_intel(struct cpuinfo_x86 *c)
 {
        resctrl_cpu_detect(c);
+
+       if (x86_match_cpu(tdx_pw_mce_cpu_ids))
+               setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
 }
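
To make the ordering point concrete, a rough sketch of the consumer side
(the callback name is taken from the description above, tdx_reset_memory()
is a made-up placeholder, and the exact shape of the later patch may
differ): tdx_init() runs from an early_initcall(), by which time
bsp_init_intel() has already forced the bug bit, so the check below is
reliable:

static int __init tdx_init(void)
{
        /* bsp_init_intel() has run long before any early_initcall() */
        if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
                x86_platform.memory_shutdown = tdx_reset_memory;

        return 0;
}
early_initcall(tdx_init);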

2023-06-19 12:27:02

by David Hildenbrand

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 04.06.23 16:27, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections. Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself. In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
>
> Alas, the first few generations of TDX hardware have an erratum. A
> "partial" write to a TDX private memory cacheline will silently "poison"
> the line. Subsequent reads will consume the poison and generate a
> machine check. According to the TDX hardware spec, neither of these
> things should have happened.
>
> Virtually all kernel memory accesses happen in full cachelines. In
> practice, writing a "byte" of memory usually reads a 64-byte cacheline,
> modifies it, then writes the whole line back. Those operations do not
> trigger this problem.

So, ordinary writes to TD private memory are not a problem? I thought
one motivation for the unmapped-guest-memory discussion was to prevent
host (userspace) writes to such memory because it would trigger a MC and
eventually crash the host.

I recall that this would happen easily (not just in some weird "partial"
case) and that the spec would allow for it.

1) Does that, in general, not happen anymore (was the hardware fixed?)?

2) Will new hardware prevent/"fix" that completely (was the spec updated?)?


... or was my understanding wrong?

Thanks!

>
> This problem is triggered by "partial" writes, where a write transaction
> of less than a cacheline lands at the memory controller. The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings. The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> With this erratum, additional things need to be done around the machine
> check handler, kexec(), etc. Similar to other CPU bugs, use a CPU bug
> bit to indicate this erratum, and detect it during early boot. Note
> this bug reflects the hardware, thus it is detected regardless of
> whether the kernel is built with TDX support or not.
>
> Signed-off-by: Kai Huang <[email protected]>
> ---
>
> v10 -> v11:
> - New patch
>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/intel.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index cb8ca46213be..dc8701f8d88b 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -483,5 +483,6 @@
> #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 1c4639588ff9..251b333e53d2 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
>
> return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> }
> +
> +/*
> + * These CPUs have an erratum. A partial write from non-TD
> + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> + * private memory poisons that memory, and a subsequent read of
> + * that memory triggers #MC.
> + */
> +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> + { }
> +};
> +
> +static int __init tdx_erratum_detect(void)
> +{
> + if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> +
> + return 0;
> +}
> +early_initcall(tdx_erratum_detect);

--
Cheers,

David / dhildenb


2023-06-20 10:44:01

by Kai Huang

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Mon, 2023-06-19 at 14:21 +0200, David Hildenbrand wrote:
> On 04.06.23 16:27, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections. Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself. In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> >
> > Alas, the first few generations of TDX hardware have an erratum. A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line. Subsequent reads will consume the poison and generate a
> > machine check. According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > Virtually all kernel memory accesses happen in full cachelines. In
> > practice, writing a "byte" of memory usually reads a 64-byte cacheline,
> > modifies it, then writes the whole line back. Those operations do not
> > trigger this problem.
>
> So, ordinary writes to TD private memory are not a problem? 
>

Not a problem for the kernel, as such a write won't poison the memory
directly, so if the kernel reads that memory there won't be a #MC.

However, if a TDX guest reads that memory (which was previously written
by the kernel or userspace), the memory is marked as poisoned when read
and a #MC is triggered.

> I thought
> one motivation for the unmapped-guest-memory discussion was to prevent
> host (userspace) writes to such memory because it would trigger a MC and
> eventually crash the host.

Yeah, the #MC will be triggered inside the TDX guest. I think in most
cases such a #MC won't crash the host kernel; only the victim TDX guest
is killed. But there might be some cases where we are not able to handle
the #MC gracefully, e.g., with some particular BIOS settings. One
example: with LMCE disabled, any #MC would be broadcast to all LPs,
causing all other TDX guests running on other LPs to be killed.

Also quoting Chao Peng, who has been working on the unmapped-guest-memory
work since the early days:

"
The problem is we may not always be able to handle #MC gracefully, in
some configurations (BIOS settings) the #MC can cause the whole system
reset, not just kill the TD. At least this is the original motivation
for Intel to start this series. I think the case is still true unless I
missed something. From KVM community, they have motivation to unmap the
private memory from userspace even the #MC is not fatal, just to prevent
possible unintended accesses from userspace (that's why they ask AMD to
use this series even their machine doesn't cause system reset when the
same happens).
"

>
> I recall that this would happen easily (not just in some weird "partial"
> case) and that the spec would allow for it.

No. As mentioned above, this partial-write #MC is different from the one
triggered inside a TDX guest.

>
> 1) Does that, in general, not happen anymore (was the hardware fixed?)?
>
> 2) Will new hardware prevent/"fix" that completely (was the spec updated?)?

Yes, this erratum will be fixed in later generations of TDX hardware. It
only appears on SPR and EMR (the first two generations of TDX hardware).



2023-06-20 15:52:03

by Dave Hansen

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 6/19/23 05:21, David Hildenbrand wrote:
> So, ordinary writes to TD private memory are not a problem? I thought
> one motivation for the unmapped-guest-memory discussion was to prevent
> host (userspace) writes to such memory because it would trigger a MC and
> eventually crash the host.

Those are two different problems.

Problem #1 (this patch): The host encounters poison when going about its
normal business accessing normal memory. This happens when something in
the host accidentally clobbers some TDX memory and *then* reads it.
Only occurs with partial writes.

Problem #2 (addressed with unmapping): Host *userspace* intentionally
and maliciously clobbers some TDX memory and then the TDX module or a
TDX guest can't run because the memory integrity checks (checksum or TD
bit) fail. This can also take the system down because #MC's are nasty.

Host userspace unmapping doesn't prevent problem #1 because it's the
kernel who screwed up with the _kernel_ mapping.



2023-06-20 16:14:20

by Dave Hansen

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 6/19/23 04:37, Huang, Kai wrote:
> Please let me know for any comments?

Can you please go look at where most of the X86_BUG_* bits are set? Can
you set yours near one of the existing sites instead of plopping a new
one into the code?


2023-06-20 16:25:40

by David Hildenbrand

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 20.06.23 17:39, Dave Hansen wrote:
> On 6/19/23 05:21, David Hildenbrand wrote:
>> So, ordinary writes to TD private memory are not a problem? I thought
>> one motivation for the unmapped-guest-memory discussion was to prevent
>> host (userspace) writes to such memory because it would trigger a MC and
>> eventually crash the host.
>
> Those are two different problems.
>
> Problem #1 (this patch): The host encounters poison when going about its
> normal business accessing normal memory. This happens when something in
> the host accidentally clobbers some TDX memory and *then* reads it.
> Only occurs with partial writes.
>
> Problem #2 (addressed with unmapping): Host *userspace* intentionally
> and maliciously clobbers some TDX memory and then the TDX module or a
> TDX guest can't run because the memory integrity checks (checksum or TD
> bit) fail. This can also take the system down because #MC's are nasty.
>
> Host userspace unmapping doesn't prevent problem #1 because it's the
> kernel who screwed up with the _kernel_ mapping.

Ahh, thanks for verifying. I was hoping that problem #2 would get fixed
in HW as well (and treated like a BUG).


Because problem #2 also sounds like something that directly violates the
first paragraph of this patch description "violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself."

So I would expect the TDX guest to fail hard, but not other TDX guests
(or the host kernel).

--
Cheers,

David / dhildenb


2023-06-20 16:26:18

by Dave Hansen

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On 6/20/23 09:03, David Hildenbrand wrote:
> On 20.06.23 17:39, Dave Hansen wrote:
>> On 6/19/23 05:21, David Hildenbrand wrote:
>>> So, ordinary writes to TD private memory are not a problem? I thought
>>> one motivation for the unmapped-guest-memory discussion was to prevent
>>> host (userspace) writes to such memory because it would trigger a MC and
>>> eventually crash the host.
>>
>> Those are two different problems.
>>
>> Problem #1 (this patch): The host encounters poison when going about its
>> normal business accessing normal memory.  This happens when something in
>> the host accidentally clobbers some TDX memory and *then* reads it.
>> Only occurs with partial writes.
>>
>> Problem #2 (addressed with unmapping): Host *userspace* intentionally
>> and maliciously clobbers some TDX memory and then the TDX module or a
>> TDX guest can't run because the memory integrity checks (checksum or TD
>> bit) fail.  This can also take the system down because #MC's are nasty.
>>
>> Host userspace unmapping doesn't prevent problem #1 because it's the
>> kernel who screwed up with the _kernel_ mapping.
>
> Ahh, thanks for verifying. I was hoping that problem #2 would get fixed
> in HW as well (and treated like a BUG).

No, it's really working as designed.

#1 _can_ be fixed because the hardware can just choose to let the host
run merrily along corrupting TDX data and blissfully unaware of the
carnage until TDX stumbles on the mess. Blissful ignorance really is a
useful feature here. It means, for instance, that if the kernel screws
up, it can still blissfully kexec(), reboot , boot a new kernel, or dump
to the console without fear of #MC.

#2 is much harder because the TDX data is destroyed and yet the TDX side
still wants to run. The SEV folks chose page faults on write to stop
SEV from running and the TDX folks chose #MC on reads as the mechanism.

All of the nastiness on the TDX side is (IMNHO) really a consequence of
that decision to use machine checks.

(Aside: I'm not specifically crapping on the TDX CPU designers here. I
don't particularly like the SEV approach either. But this mess is a
result of the TDX design choices. There are other messes in other
patch series from SEV. )

> Because problem #2 also sounds like something that directly violates the
> first paragraph of this patch description "violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself."
>
> So I would expect the TDX guest to fail hard, but not other TDX guests
> (or the host kernel).

This is more fallout from the #MC design choice.

Let's use page faults as an example since our SEV friends are using
them. *ANY* instruction that reads memory can page fault, have the
kernel fix up the fault, and continue merrily along its way.

#MC is fundamentally different. The exceptions can be declared to be
unrecoverable. The CPU says, "whoopsie, I managed to deliver this #MC,
but it would be too hard for me so I can't continue." These "too hard"
scenarios are shrinking over time, but they do exist. They're fatal.



2023-06-21 01:06:33

by Kai Huang

Subject: Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

On Tue, 2023-06-20 at 08:44 -0700, Dave Hansen wrote:
> On 6/19/23 04:37, Huang, Kai wrote:
> > Please let me know for any comments?
>
> Can you please go look at where most of the X86_BUG_* bits are set? Can
> you set yours near one of the existing sites instead of plopping a new
> one into the code?
>

Sure I'll try again and sync with Kirill first. Thanks!