Subject: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

Current TDX spec does not have support to emulate the WBINVD
instruction. If any feature that uses WBINVD is enabled/used
in TDX guest, it will lead to un-handled #VE exception, which
will be handled as #GP fault.

ACPI drivers also uses WBINVD instruction for cache flushes in
reboot or shutdown code path. Since TDX guest has requirement
to support shutdown feature, skip WBINVD instruction usage
in ACPI drivers for TDX guest.

Since cache is always coherent in TDX guests, making wbinvd as
noop should not cause any issues in above mentioned code path.
The end-behavior is the same as KVM guest (treat as noops).

In future, once TDX guest specification adds support for WBINVD
hypercall, we can pass the handle to KVM to handle it.
   
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v3:
* Fixed commit log as per review comments.
* Instead of fixing all usages of ACPI_FLUSH_CPU_CACHE(),
created TDX specific exception for it in its implementation.

Changes since RFC v2-fix-v2:
* Instead of handling WBINVD #VE exception as nop, we skip its
usage in currently enabled drivers.
* Adapted commit log for above change.

arch/x86/include/asm/acenv.h | 7 ++++++-
arch/x86/kernel/tdx.c | 1 +
include/linux/protected_guest.h | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..36c87b69366b 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
#define _ASM_X86_ACENV_H

#include <asm/special_insns.h>
+#include <linux/protected_guest.h>

/* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD)) \
+ wbinvd(); \
+} while (0)

int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 06fcbca402cb..fd27cf651f0b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -92,6 +92,7 @@ bool tdx_protected_guest_has(unsigned long flag)
case PR_GUEST_MEM_ENCRYPT_ACTIVE:
case PR_GUEST_UNROLL_STRING_IO:
case PR_GUEST_SHARED_MAPPING_INIT:
+ case PR_GUEST_DISABLE_WBINVD:
return true;
}

diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index adfa62e2615e..0ec4dab86f67 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -18,6 +18,8 @@
#define PR_GUEST_HOST_MEM_ENCRYPT 0x103
/* Support for shared mapping initialization (after early init) */
#define PR_GUEST_SHARED_MAPPING_INIT 0x104
+/* Support to disable WBINVD */
+#define PR_GUEST_DISABLE_WBINVD 0x105

#if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)

--
2.25.1


2021-06-09 12:39:02

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> Current TDX spec does not have support to emulate the WBINVD
> instruction. If any feature that uses WBINVD is enabled/used
> in TDX guest, it will lead to un-handled #VE exception, which
> will be handled as #GP fault.
>
> ACPI drivers also uses WBINVD instruction for cache flushes in
> reboot or shutdown code path. Since TDX guest has requirement
> to support shutdown feature, skip WBINVD instruction usage
> in ACPI drivers for TDX guest.

This sounds awkward...

> Since cache is always coherent in TDX guests, making wbinvd as

This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...

> noop should not cause any issues in above mentioned code path.

..."should" is a famous last word...

> The end-behavior is the same as KVM guest (treat as noops).

..."KVM gets away with it" is not a justification that TDX can stand
on otherwise we would not be here fixing up ACPICA properly.

How about:

"TDX guests use standard ACPI mechanisms to signal sleep state entry
(including reboot) to the host. The ACPI specification mandates WBINVD
on any sleep state entry with the expectation that the platform is
only responsible for maintaining the state of memory over sleep
states, not preserving dirty data in any CPU caches. ACPI cache
flushing requirements pre-date the advent of virtualization. Given TDX
guest sleep state entry does not affect any host power rails it is not
required to flush caches. The host is responsible for maintaining
cache state over its own bare metal sleep state transitions that
power-off the cache. If the host fails to manage caches over its sleep
state transitions the guest..."

I don't know how to finish the last sentence. What does TDX do if it
is resumed after host suspend and the host somehow arranged for dirty
TDX lines to be lost. Will that be noticed by TDX integrity
mechanisms? I did not immediately find an answer to this with a brief
look at the specs.

>
> In future, once TDX guest specification adds support for WBINVD
> hypercall, we can pass the handle to KVM to handle it.

I expect if the specification wanted operating systems to plan for
this eventuality it would have made a note of it. I expect this
sentence can just be deleted.

2021-06-09 12:42:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/8/21 8:40 PM, Dan Williams wrote:
> On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
> <[email protected]> wrote:
>>
>> Current TDX spec does not have support to emulate the WBINVD
>> instruction. If any feature that uses WBINVD is enabled/used
>> in TDX guest, it will lead to un-handled #VE exception, which
>> will be handled as #GP fault.
>>
>> ACPI drivers also uses WBINVD instruction for cache flushes in
>> reboot or shutdown code path. Since TDX guest has requirement
>> to support shutdown feature, skip WBINVD instruction usage
>> in ACPI drivers for TDX guest.
>
> This sounds awkward...
>
>> Since cache is always coherent in TDX guests, making wbinvd as
>
> This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...
>
>> noop should not cause any issues in above mentioned code path.
>
> ..."should" is a famous last word...
>
>> The end-behavior is the same as KVM guest (treat as noops).
>
> ..."KVM gets away with it" is not a justification that TDX can stand
> on otherwise we would not be here fixing up ACPICA properly.
>
> How about:
>
> "TDX guests use standard ACPI mechanisms to signal sleep state entry
> (including reboot) to the host. The ACPI specification mandates WBINVD
> on any sleep state entry with the expectation that the platform is
> only responsible for maintaining the state of memory over sleep
> states, not preserving dirty data in any CPU caches. ACPI cache
> flushing requirements pre-date the advent of virtualization. Given TDX
> guest sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. If the host fails to manage caches over its sleep
> state transitions the guest..."
>

I like this description, but shouldn't the logic be:

if (!CPUID has hypervisor bit set)
wbinvd();

As far as I know, most hypervisors will turn WBINVD into a noop and,
even if they don't, it seems to be that something must be really quite
wrong for a guest to need to WBINVD for ACPI purposes.

-Andy

2021-06-09 12:47:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest


here is no resume path.

> Host is free to go into S3 independent of any guest state.

Actually my understanding is that none of the systems which support TDX
support S3. S3 has been deprecated for a long time.


> A hostile
> host is free to do just enough cache management so that it can resume
> from S3 while arranging for TDX guest dirty data to be lost. Does a
> TDX guest go fatal if the cache loses power?

That would be a machine check, and yes it would be fatal.

-Andi

Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/8/21 9:40 PM, Andi Kleen wrote:
>
>>> KVM only turns it into a noop if there is no VT-d, because with VT-d you
>>> might need it to turn mappings into uncached and vice versa.
>> Wow, I found the kvm_arch_register_noncoherent_dma() stuff.  That's horrifying.  What's it for?  e
>
> e.g. if you want to run a GPU it really needs some uncached memory. Same is true for other more
> complex devices.
>
> Now modern Linux of course will be preferring CLFLUSH instead for the conversion, but there are old
> versions that preferred WBINVD.
>
> I don't think it's a DoS, as long as you're not too picky about latencies on the host.
>
> -Andi
>

Currently we use prot_guest_has(PR_GUEST_DISABLE_WBINVD)) check for disabling the wbinvd()
usage (which can be selectively enabled for tested guests).

Is it alright to generalize it with boot_cpu_has(X86_FEATURE_HYPERVISOR) without
verify it?

>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-09 15:35:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/8/21 6:10 PM, Kuppuswamy Sathyanarayanan wrote:
> Since cache is always coherent in TDX guests, making wbinvd as
> noop should not cause any issues in above mentioned code path.
> The end-behavior is the same as KVM guest (treat as noops).

I don't see anything in the specs to back up such a broad statement.

For Secure-EPT, I see in the TDX "EAS" that "Ignore PAT" is "Set to 1".
This, presumably along with the "TD VMCS Guest MSRs... IA32_PAT" being
set to 0x0007040600070406 (I didn't decode it, I'm just guessing),
ensures that guests using Secure-EPT have no architectural way of
creating non-coherent mappings using the guest x86 page tables.

That covers one of the memory types to which guests have access.

Guests can also access TD-shared memory. Those mappings are controlled
by the VMM and not mapped by Secure-EPT. This is the part that concerns
me and is not consistent with the statement above. Is it
architecturally impossible for a VMM to create an non-coherent mapping
and expose it to a guest? If it is impossible, please include citations
of the spec or the logic behind this so that a reader can understand,
just as I did above.

If it is possible to have non-coherent mappings in a guest, then please
remove the above statement.

2021-06-09 16:37:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On 6/9/21 8:09 AM, Dan Williams wrote:
> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <[email protected]> wrote:
>>
>>
>> here is no resume path.
>>
>>> Host is free to go into S3 independent of any guest state.
>>
>> Actually my understanding is that none of the systems which support TDX
>> support S3. S3 has been deprecated for a long time.
>
> Ok, I wanted to imply any power state that might power-off caches.
>
>>
>>
>>> A hostile
>>> host is free to do just enough cache management so that it can resume
>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
>>> TDX guest go fatal if the cache loses power?
>>
>> That would be a machine check, and yes it would be fatal.
>
> Sounds good, so incorporating this and Andy's feedback:
>
> "TDX guests, like other typical guests, use standard ACPI mechanisms
> to signal sleep state entry (including reboot) to the host. The ACPI
> specification mandates WBINVD on any sleep state entry with the
> expectation that the platform is only responsible for maintaining the
> state of memory over sleep states, not preserving dirty data in any
> CPU caches. ACPI cache flushing requirements pre-date the advent of
> virtualization. Given guest sleep state entry does not affect any host
> power rails it is not required to flush caches. The host is
> responsible for maintaining cache state over its own bare metal sleep
> state transitions that power-off the cache. A TDX guest, unlike a
> typical guest, will machine check if the CPU cache is powered off."
>
> Andi, is that machine check behavior relative to power states
> mentioned in the docs?

I don't think there's anything about power states. There is a general
documented mechanism to integrity-check TD guest memory, but it is *not*
replay-resistant. So, if the guest dirties a cache line, and the cache
line is lost, it seems entirely plausible that the guest would get
silently corrupted.

I would argue that, if this happens, it's a host, TD module, or
architecture bug, and it's not the guest's fault.

--Andy

2021-06-09 17:06:09

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 8:56 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
>
>
> On 6/8/21 8:40 PM, Dan Williams wrote:
> > ..."KVM gets away with it" is not a justification that TDX can stand
> > on otherwise we would not be here fixing up ACPICA properly.
> >
> > How about:
> >
> > "TDX guests use standard ACPI mechanisms to signal sleep state entry
> > (including reboot) to the host. The ACPI specification mandates WBINVD
> > on any sleep state entry with the expectation that the platform is
> > only responsible for maintaining the state of memory over sleep
> > states, not preserving dirty data in any CPU caches. ACPI cache
> > flushing requirements pre-date the advent of virtualization. Given TDX
> > guest sleep state entry does not affect any host power rails it is not
> > required to flush caches. The host is responsible for maintaining
> > cache state over its own bare metal sleep state transitions that
> > power-off the cache. If the host fails to manage caches over its sleep
> > state transitions the guest..."
>
> >
> > I don't know how to finish the last sentence. What does TDX do if it
> > is resumed after host suspend and the host somehow arranged for dirty
> > TDX lines to be lost.
>
> TDX guest does not support S3. It will be disabled in ACPI tables. It
> is a TDX firmware spec requirement. Please check the following spec,
> sec 2.1
>
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

I'm not asking about TDX guest entering S3...

>
> In TDX guest, we encounter cache flushes only in shutdown and reboot path.
> So there is no resume path.

Host is free to go into S3 independent of any guest state. A hostile
host is free to do just enough cache management so that it can resume
from S3 while arranging for TDX guest dirty data to be lost. Does a
TDX guest go fatal if the cache loses power?

2021-06-09 17:06:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021, at 9:25 PM, Andi Kleen wrote:
>
> > I like this description, but shouldn't the logic be:
> >
> > if (!CPUID has hypervisor bit set)
> > wbinvd();
> >
> > As far as I know, most hypervisors will turn WBINVD into a noop and,
> > even if they don't, it seems to be that something must be really quite
> > wrong for a guest to need to WBINVD for ACPI purposes.
>
> KVM only turns it into a noop if there is no VT-d, because with VT-d you
> might need it to turn mappings into uncached and vice versa.

Wow, I found the kvm_arch_register_noncoherent_dma() stuff. That's horrifying. What's it for? Are there actually guests that use devices exposed by VFIO that expect WBINVD to work? That's a giant DoS hole.

>
> But yes the change would make sense for reboot. BTW I suspect for the
> reboot path it isn't really needed anywhere modern, so it might actually
> be ok to completely disable it. But that's some risk, so doing it only
> for hypervisor is reasonable.

I agree.

2021-06-09 17:06:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest


>> KVM only turns it into a noop if there is no VT-d, because with VT-d you
>> might need it to turn mappings into uncached and vice versa.
> Wow, I found the kvm_arch_register_noncoherent_dma() stuff. That's horrifying. What's it for? e

e.g. if you want to run a GPU it really needs some uncached memory. Same
is true for other more complex devices.

Now modern Linux of course will be preferring CLFLUSH instead for the
conversion, but there are old versions that preferred WBINVD.

I don't think it's a DoS, as long as you're not too picky about
latencies on the host.

-Andi



Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/8/21 8:40 PM, Dan Williams wrote:
> ..."KVM gets away with it" is not a justification that TDX can stand
> on otherwise we would not be here fixing up ACPICA properly.
>
> How about:
>
> "TDX guests use standard ACPI mechanisms to signal sleep state entry
> (including reboot) to the host. The ACPI specification mandates WBINVD
> on any sleep state entry with the expectation that the platform is
> only responsible for maintaining the state of memory over sleep
> states, not preserving dirty data in any CPU caches. ACPI cache
> flushing requirements pre-date the advent of virtualization. Given TDX
> guest sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. If the host fails to manage caches over its sleep
> state transitions the guest..."

>
> I don't know how to finish the last sentence. What does TDX do if it
> is resumed after host suspend and the host somehow arranged for dirty
> TDX lines to be lost.

TDX guest does not support S3. It will be disabled in ACPI tables. It
is a TDX firmware spec requirement. Please check the following spec,
sec 2.1

https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

In TDX guest, we encounter cache flushes only in shutdown and reboot path.
So there is no resume path.


Will that be noticed by TDX integrity
> mechanisms? I did not immediately find an answer to this with a brief
> look at the specs.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-09 17:07:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest


> I like this description, but shouldn't the logic be:
>
> if (!CPUID has hypervisor bit set)
> wbinvd();
>
> As far as I know, most hypervisors will turn WBINVD into a noop and,
> even if they don't, it seems to be that something must be really quite
> wrong for a guest to need to WBINVD for ACPI purposes.

KVM only turns it into a noop if there is no VT-d, because with VT-d you
might need it to turn mappings into uncached and vice versa.

But yes the change would make sense for reboot. BTW I suspect for the
reboot path it isn't really needed anywhere modern, so it might actually
be ok to completely disable it. But that's some risk, so doing it only
for hypervisor is reasonable.

I can see it making sense for the S3 path, but nobody supports S3 for
guests.

-Andi


>
> -Andy

2021-06-09 17:07:37

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 9:02 PM Andy Lutomirski <[email protected]> wrote:
>
> On 6/8/21 8:40 PM, Dan Williams wrote:
> > On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
> > <[email protected]> wrote:
> >>
> >> Current TDX spec does not have support to emulate the WBINVD
> >> instruction. If any feature that uses WBINVD is enabled/used
> >> in TDX guest, it will lead to un-handled #VE exception, which
> >> will be handled as #GP fault.
> >>
> >> ACPI drivers also uses WBINVD instruction for cache flushes in
> >> reboot or shutdown code path. Since TDX guest has requirement
> >> to support shutdown feature, skip WBINVD instruction usage
> >> in ACPI drivers for TDX guest.
> >
> > This sounds awkward...
> >
> >> Since cache is always coherent in TDX guests, making wbinvd as
> >
> > This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...
> >
> >> noop should not cause any issues in above mentioned code path.
> >
> > ..."should" is a famous last word...
> >
> >> The end-behavior is the same as KVM guest (treat as noops).
> >
> > ..."KVM gets away with it" is not a justification that TDX can stand
> > on otherwise we would not be here fixing up ACPICA properly.
> >
> > How about:
> >
> > "TDX guests use standard ACPI mechanisms to signal sleep state entry
> > (including reboot) to the host. The ACPI specification mandates WBINVD
> > on any sleep state entry with the expectation that the platform is
> > only responsible for maintaining the state of memory over sleep
> > states, not preserving dirty data in any CPU caches. ACPI cache
> > flushing requirements pre-date the advent of virtualization. Given TDX
> > guest sleep state entry does not affect any host power rails it is not
> > required to flush caches. The host is responsible for maintaining
> > cache state over its own bare metal sleep state transitions that
> > power-off the cache. If the host fails to manage caches over its sleep
> > state transitions the guest..."
> >
>
> I like this description, but shouldn't the logic be:
>
> if (!CPUID has hypervisor bit set)
> wbinvd();
>
> As far as I know, most hypervisors will turn WBINVD into a noop and,
> even if they don't, it seems to be that something must be really quite
> wrong for a guest to need to WBINVD for ACPI purposes.

Agree, a well behaved guest should not pretend its callouts to the
virtual ACPI BIOS actually affect a host power rail.

2021-06-09 18:01:17

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <[email protected]> wrote:
>
>
> here is no resume path.
>
> > Host is free to go into S3 independent of any guest state.
>
> Actually my understanding is that none of the systems which support TDX
> support S3. S3 has been deprecated for a long time.

Ok, I wanted to imply any power state that might power-off caches.

>
>
> > A hostile
> > host is free to do just enough cache management so that it can resume
> > from S3 while arranging for TDX guest dirty data to be lost. Does a
> > TDX guest go fatal if the cache loses power?
>
> That would be a machine check, and yes it would be fatal.

Sounds good, so incorporating this and Andy's feedback:

"TDX guests, like other typical guests, use standard ACPI mechanisms
to signal sleep state entry (including reboot) to the host. The ACPI
specification mandates WBINVD on any sleep state entry with the
expectation that the platform is only responsible for maintaining the
state of memory over sleep states, not preserving dirty data in any
CPU caches. ACPI cache flushing requirements pre-date the advent of
virtualization. Given guest sleep state entry does not affect any host
power rails it is not required to flush caches. The host is
responsible for maintaining cache state over its own bare metal sleep
state transitions that power-off the cache. A TDX guest, unlike a
typical guest, will machine check if the CPU cache is powered off."

Andi, is that machine check behavior relative to power states
mentioned in the docs?

2021-06-09 18:16:42

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest

On Wed, Jun 9, 2021 at 10:28 AM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
>
>
> On 6/9/21 9:12 AM, Andy Lutomirski wrote:
> > On 6/9/21 8:09 AM, Dan Williams wrote:
> >> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <[email protected]> wrote:
> >>>
> >>>
> >>> here is no resume path.
> >>>
> >>>> Host is free to go into S3 independent of any guest state.
> >>>
> >>> Actually my understanding is that none of the systems which support TDX
> >>> support S3. S3 has been deprecated for a long time.
> >>
> >> Ok, I wanted to imply any power state that might power-off caches.
> >>
> >>>
> >>>
> >>>> A hostile
> >>>> host is free to do just enough cache management so that it can resume
> >>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
> >>>> TDX guest go fatal if the cache loses power?
> >>>
> >>> That would be a machine check, and yes it would be fatal.
> >>
> >> Sounds good, so incorporating this and Andy's feedback:
> >>
> >> "TDX guests, like other typical guests, use standard ACPI mechanisms
> >> to signal sleep state entry (including reboot) to the host. The ACPI
> >> specification mandates WBINVD on any sleep state entry with the
> >> expectation that the platform is only responsible for maintaining the
> >> state of memory over sleep states, not preserving dirty data in any
> >> CPU caches. ACPI cache flushing requirements pre-date the advent of
> >> virtualization. Given guest sleep state entry does not affect any host
> >> power rails it is not required to flush caches. The host is
> >> responsible for maintaining cache state over its own bare metal sleep
> >> state transitions that power-off the cache. A TDX guest, unlike a
> >> typical guest, will machine check if the CPU cache is powered off."
> >>
> >> Andi, is that machine check behavior relative to power states
> >> mentioned in the docs?
> >
> > I don't think there's anything about power states. There is a general
> > documented mechanism to integrity-check TD guest memory, but it is *not*
> > replay-resistant. So, if the guest dirties a cache line, and the cache
> > line is lost, it seems entirely plausible that the guest would get
> > silently corrupted.
> >
> > I would argue that, if this happens, it's a host, TD module, or
> > architecture bug, and it's not the guest's fault.
>
> If you want to apply this fix for all hypervisors (using boot_cpu_has
> (X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
> reference in commit log right? It can be generalized for all VM guests.
>
> agree?

No, because there is a note needed about the integrity implications in
the TDX case that makes it distinct from typical hypervisor enabling.

Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/9/21 10:31 AM, Dan Williams wrote:
>> If you want to apply this fix for all hypervisors (using boot_cpu_has
>> (X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
>> reference in commit log right? It can be generalized for all VM guests.
>>
>> agree?
> No, because there is a note needed about the integrity implications in
> the TDX case that makes it distinct from typical hypervisor enabling.

Generalized the commit log (but left the TDX related info). Final version
will look like below.

x86: Skip WBINVD instruction for VM guest

VM guests that supports ACPI, use standard ACPI mechanisms to signal sleep
state entry (including reboot) to the host. The ACPI specification mandates
WBINVD on any sleep state entry with the expectation that the platform is
only responsible for maintaining the state of memory over sleep states, not
preserving dirty data in any CPU caches. ACPI cache flushing requirements
pre-date the advent of virtualization. Given guest sleep state entry does not
affect any host power rails it is not required to flush caches. The host is
responsible for maintaining cache state over its own bare metal sleep state
transitions that power-off the cache. A TDX guest, unlike a typical guest,
will machine check if the CPU cache is powered off.

--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
#define _ASM_X86_ACENV_H

#include <asm/special_insns.h>
+#include <asm/cpu.h>

/* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) \
+ wbinvd(); \
+} while (0)

int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest



On 6/9/21 9:12 AM, Andy Lutomirski wrote:
> On 6/9/21 8:09 AM, Dan Williams wrote:
>> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <[email protected]> wrote:
>>>
>>>
>>> here is no resume path.
>>>
>>>> Host is free to go into S3 independent of any guest state.
>>>
>>> Actually my understanding is that none of the systems which support TDX
>>> support S3. S3 has been deprecated for a long time.
>>
>> Ok, I wanted to imply any power state that might power-off caches.
>>
>>>
>>>
>>>> A hostile
>>>> host is free to do just enough cache management so that it can resume
>>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
>>>> TDX guest go fatal if the cache loses power?
>>>
>>> That would be a machine check, and yes it would be fatal.
>>
>> Sounds good, so incorporating this and Andy's feedback:
>>
>> "TDX guests, like other typical guests, use standard ACPI mechanisms
>> to signal sleep state entry (including reboot) to the host. The ACPI
>> specification mandates WBINVD on any sleep state entry with the
>> expectation that the platform is only responsible for maintaining the
>> state of memory over sleep states, not preserving dirty data in any
>> CPU caches. ACPI cache flushing requirements pre-date the advent of
>> virtualization. Given guest sleep state entry does not affect any host
>> power rails it is not required to flush caches. The host is
>> responsible for maintaining cache state over its own bare metal sleep
>> state transitions that power-off the cache. A TDX guest, unlike a
>> typical guest, will machine check if the CPU cache is powered off."
>>
>> Andi, is that machine check behavior relative to power states
>> mentioned in the docs?
>
> I don't think there's anything about power states. There is a general
> documented mechanism to integrity-check TD guest memory, but it is *not*
> replay-resistant. So, if the guest dirties a cache line, and the cache
> line is lost, it seems entirely plausible that the guest would get
> silently corrupted.
>
> I would argue that, if this happens, it's a host, TD module, or
> architecture bug, and it's not the guest's fault.

If you want to apply this fix for all hypervisors (using boot_cpu_has
(X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
reference in commit log right? It can be generalized for all VM guests.

agree?

>
> --Andy
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest

VM guests that supports ACPI, use standard ACPI mechanisms to
signal sleep state entry (including reboot) to the host. The
ACPI specification mandates WBINVD on any sleep state entry
with the expectation that the platform is only responsible for
maintaining the state of memory over sleep states, not
preserving dirty data in any CPU caches. ACPI cache flushing
requirements pre-date the advent of virtualization. Given guest
sleep state entry does not affect any host power rails it is not
required to flush caches. The host is responsible for maintaining
cache state over its own bare metal sleep state transitions that
power-off the cache. A TDX guest, unlike a typical guest, will
machine check if the CPU cache is powered off.
   
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since RFC v2-fix-v4:
* Fixed commit log as per Dan's comments.
* Used boot_cpu_has(X86_FEATURE_HYPERVISOR) instead of
prot_guest_has(PR_GUEST_DISABLE_WBINVD) check.

Changes since RFC v2-fix-v3:
* Fixed commit log as per review comments.
* Instead of fixing all usages of ACPI_FLUSH_CPU_CACHE(),
created TDX specific exception for it in its implementation.

Changes since RFC v2-fix-v2:
* Instead of handling WBINVD #VE exception as nop, we skip its
usage in currently enabled drivers.
* Adapted commit log for above change.

arch/x86/include/asm/acenv.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d4162e94bee8 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
#define _ASM_X86_ACENV_H

#include <asm/special_insns.h>
+#include <asm/cpu.h>

/* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE() \
+do { \
+ if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) \
+ wbinvd(); \
+} while (0)

int __acpi_acquire_global_lock(unsigned int *lock);
int __acpi_release_global_lock(unsigned int *lock);
--
2.25.1

2021-06-09 20:01:08

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest

[ add back linux-acpi and Rafael ]


On Wed, Jun 9, 2021 at 12:49 PM Kuppuswamy Sathyanarayanan
<[email protected]> wrote:
>
> VM guests that supports ACPI, use standard ACPI mechanisms to
> signal sleep state entry (including reboot) to the host. The
> ACPI specification mandates WBINVD on any sleep state entry
> with the expectation that the platform is only responsible for
> maintaining the state of memory over sleep states, not
> preserving dirty data in any CPU caches. ACPI cache flushing
> requirements pre-date the advent of virtualization. Given guest
> sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. A TDX guest, unlike a typical guest, will
> machine check if the CPU cache is powered off.

Looks like you are wrapping at column 62 than 72, double check that
for the final submission of this series. Other than that this looks
good to me.

Reviewed-by: Dan Williams <[email protected]>

2021-06-09 21:05:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest

This changelog lacks both clear problem statements and a clear solution
implemented within the patch.

Here's a proposed changelog. It clearly spells out the two problems
caused by WBINVD within a guest, and the proposed solution which fixes
those two problems.

Is this missing anything?

--

VM guests that support ACPI use standard ACPI mechanisms to signal sleep
state entry to the host. To ACPI, reboot is simply another sleep state.

ACPI specifies that the platform preserve memory contents over (some)
sleep states. It does not specify any requirements for data
preservation in CPU caches. The ACPI specification mandates the use of
WBINVD to flush the contents of the CPU caches to memory before entering
specific sleep states, thus ensuring data in caches can survive sleep
state transitions.e

Unlike when entering sleep states bare metal, no actions within a guest
can cause data in processor caches to be lost. That makes these WBINVD
invocations harmless but superfluous within a guest. (<--- problem #1)

In TDX guests, these WBINVD operations cause #VE exceptions. For debug,
it would be ideal for the #VE handler to be able to WARN() when an
unexpected WBINVD occurs. (<--- problem #2)

Avoid WBINVD for all ACPI cache-flushing operations which occur while
running under a hypervisor, which includes TDX guests. This both avoids
TDX warnings and optimizes away superfluous WBINVD invocations. (<----
solution)

2021-06-09 21:42:10

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest

On Wed, Jun 9, 2021 at 2:03 PM Dave Hansen <[email protected]> wrote:
>
> This changelog lacks both clear problem statements and a clear solution
> implemented within the patch.
>
> Here's a proposed changelog. It clearly spells out the two problems
> caused by WBINVD within a guest, and the proposed solution which fixes
> those two problems.

Looks good to me modulo the comment below...

>
> Is this missing anything?
>
> --
>
> VM guests that support ACPI use standard ACPI mechanisms to signal sleep
> state entry to the host. To ACPI, reboot is simply another sleep state.
>
> ACPI specifies that the platform preserve memory contents over (some)
> sleep states. It does not specify any requirements for data
> preservation in CPU caches. The ACPI specification mandates the use of
> WBINVD to flush the contents of the CPU caches to memory before entering
> specific sleep states, thus ensuring data in caches can survive sleep
> state transitions.e
>
> Unlike when entering sleep states bare metal, no actions within a guest
> can cause data in processor caches to be lost. That makes these WBINVD
> invocations harmless but superfluous within a guest. (<--- problem #1)
>
> In TDX guests, these WBINVD operations cause #VE exceptions. For debug,
> it would be ideal for the #VE handler to be able to WARN() when an
> unexpected WBINVD occurs. (<--- problem #2)

...but it doesn't WARN() it triggers unhandled #VE, unless I missed
another patch that precedes this that turns it into a WARN()? If a
code path expects WBINVD for correct operation and the guest can't
execute that sounds fatal, not a WARN to me.

> Avoid WBINVD for all ACPI cache-flushing operations which occur while
> running under a hypervisor, which includes TDX guests. This both avoids
> TDX warnings and optimizes away superfluous WBINVD invocations. (<----
> solution)
>

Subject: Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest



On 6/9/21 2:38 PM, Dan Williams wrote:
>> In TDX guests, these WBINVD operations cause #VE exceptions. For debug,
>> it would be ideal for the #VE handler to be able to WARN() when an
>> unexpected WBINVD occurs. (<--- problem #2)
> ...but it doesn't WARN() it triggers unhandled #VE, unless I missed
> another patch that precedes this that turns it into a WARN()? If a
> code path expects WBINVD for correct operation and the guest can't
> execute that sounds fatal, not a WARN to me.

Yes. It is not WARN. It is a fatal unhandled exception.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-06-09 23:58:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest

On 6/9/21 2:42 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 6/9/21 2:38 PM, Dan Williams wrote:
>>> In TDX guests, these WBINVD operations cause #VE exceptions.  For debug,
>>> it would be ideal for the #VE handler to be able to WARN() when an
>>> unexpected WBINVD occurs. (<--- problem #2)
>> ...but it doesn't WARN() it triggers unhandled #VE, unless I missed
>> another patch that precedes this that turns it into a WARN()? If a
>> code path expects WBINVD for correct operation and the guest can't
>> execute that sounds fatal, not a WARN to me.
>
> Yes. It is not WARN. It is a fatal unhandled exception.

That makes the problem statement a wee bit different, but it should
still be pretty easy to explain:

In TDX guests, these WBINVD operations cause #VE exceptions.
While some #VE exceptions can be handled, there is no recourse
for a TDX guest to handle a WBINVD and it will panic(). (<---
problem #2)