On 11/9/23 03:55, Kai Huang wrote:
...
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -31,6 +31,7 @@
> #include <asm/realmode.h>
> #include <asm/x86_init.h>
> #include <asm/efi.h>
> +#include <asm/tdx.h>
>
> /*
> * Power off function, if any
> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> local_irq_disable();
> stop_other_cpus();
> #endif
> + /*
> + * stop_other_cpus() has flushed all dirty cachelines of TDX
> + * private memory on remote cpus. Unlike SME, which does the
> + * cache flush on _this_ cpu in the relocate_kernel(), flush
> + * the cache for _this_ cpu here. This is because on the
> + * platforms with "partial write machine check" erratum the
> + * kernel needs to convert all TDX private pages back to normal
> + * before booting to the new kernel in kexec(), and the cache
> + * flush must be done before that. If the kernel took SME's way,
> + * it would have to muck with the relocate_kernel() assembly to
> + * do memory conversion.
> + */
> + if (platform_tdx_enabled())
> + native_wbinvd();
Why can't the TDX host code just set host_mem_enc_active=1?
Sure, you'll end up *using* the SME WBINVD support, but then you don't
have two different WBINVD call sites. You also don't have to mess with
a single line of assembly.
On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> ...
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -31,6 +31,7 @@
> > #include <asm/realmode.h>
> > #include <asm/x86_init.h>
> > #include <asm/efi.h>
> > +#include <asm/tdx.h>
> >
> > /*
> > * Power off function, if any
> > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > local_irq_disable();
> > stop_other_cpus();
> > #endif
> > + /*
> > + * stop_other_cpus() has flushed all dirty cachelines of TDX
> > + * private memory on remote cpus. Unlike SME, which does the
> > + * cache flush on _this_ cpu in the relocate_kernel(), flush
> > + * the cache for _this_ cpu here. This is because on the
> > + * platforms with "partial write machine check" erratum the
> > + * kernel needs to convert all TDX private pages back to normal
> > + * before booting to the new kernel in kexec(), and the cache
> > + * flush must be done before that. If the kernel took SME's way,
> > + * it would have to muck with the relocate_kernel() assembly to
> > + * do memory conversion.
> > + */
> > + if (platform_tdx_enabled())
> > + native_wbinvd();
>
> Why can't the TDX host code just set host_mem_enc_active=1?
>
> Sure, you'll end up *using* the SME WBINVD support, but then you don't
> have two different WBINVD call sites. You also don't have to mess with
> a single line of assembly.
I wanted to avoid changing the assembly.
Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec)
when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
which happens at very late stage right before jumping to the new kernel.
However for TDX when the platform has erratum we need to convert TDX private
pages back to normal, which must be done after flushing cache. If we reuse
host_mem_enc_active=1, then we will need to change the assembly code to do that.
On Mon, 2023-11-27 at 19:33 +0000, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > On 11/9/23 03:55, Kai Huang wrote:
> > ...
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -31,6 +31,7 @@
> > > #include <asm/realmode.h>
> > > #include <asm/x86_init.h>
> > > #include <asm/efi.h>
> > > +#include <asm/tdx.h>
> > >
> > > /*
> > > * Power off function, if any
> > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > > local_irq_disable();
> > > stop_other_cpus();
> > > #endif
> > > + /*
> > > + * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > + * private memory on remote cpus. Unlike SME, which does the
> > > + * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > + * the cache for _this_ cpu here. This is because on the
> > > + * platforms with "partial write machine check" erratum the
> > > + * kernel needs to convert all TDX private pages back to normal
> > > + * before booting to the new kernel in kexec(), and the cache
> > > + * flush must be done before that. If the kernel took SME's way,
> > > + * it would have to muck with the relocate_kernel() assembly to
> > > + * do memory conversion.
> > > + */
> > > + if (platform_tdx_enabled())
> > > + native_wbinvd();
> >
> > Why can't the TDX host code just set host_mem_enc_active=1?
> >
> > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > have two different WBINVD call sites. You also don't have to mess with
> > a single line of assembly.
>
> I wanted to avoid changing the assembly.
>
> Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel.
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache. If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.
>
Forgot to say doing TDX page conversion in the relocate_assembly() isn't easy
because the cache flushing when host_mem_enc_active=1 happens after kernel has
switched to the identity mapping table, so we will need to do hacks like fixing
up symbol address etc.
On 11/27/23 11:33, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
>> On 11/9/23 03:55, Kai Huang wrote:
>> ...
>>> --- a/arch/x86/kernel/reboot.c
>>> +++ b/arch/x86/kernel/reboot.c
>>> @@ -31,6 +31,7 @@
>>> #include <asm/realmode.h>
>>> #include <asm/x86_init.h>
>>> #include <asm/efi.h>
>>> +#include <asm/tdx.h>
>>>
>>> /*
>>> * Power off function, if any
>>> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>>> local_irq_disable();
>>> stop_other_cpus();
>>> #endif
>>> + /*
>>> + * stop_other_cpus() has flushed all dirty cachelines of TDX
>>> + * private memory on remote cpus. Unlike SME, which does the
>>> + * cache flush on _this_ cpu in the relocate_kernel(), flush
>>> + * the cache for _this_ cpu here. This is because on the
>>> + * platforms with "partial write machine check" erratum the
>>> + * kernel needs to convert all TDX private pages back to normal
>>> + * before booting to the new kernel in kexec(), and the cache
>>> + * flush must be done before that. If the kernel took SME's way,
>>> + * it would have to muck with the relocate_kernel() assembly to
>>> + * do memory conversion.
>>> + */
>>> + if (platform_tdx_enabled())
>>> + native_wbinvd();
>>
>> Why can't the TDX host code just set host_mem_enc_active=1?
>>
>> Sure, you'll end up *using* the SME WBINVD support, but then you don't
>> have two different WBINVD call sites. You also don't have to mess with
>> a single line of assembly.
>
> I wanted to avoid changing the assembly.
>
> Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel.
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache. If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.
I honestly think you need to stop thinking about the partial write issue
at this point in the series. It's really causing a horrible amount of
unnecessary confusion.
Here's the golden rule:
The boot CPU needs to run WBINVD sometime after it stops writing
to private memory but before it starts treating the memory as
shared.
On SME kernels, that key point evidently in early boot when it's
enabling SME. I _think_ that point is also a valid place to do WBINVD
on no-TDX-erratum systems.
On TDX systems with the erratum, there's a *second* point before the
private=>shared conversion occurs. I think what you're trying to do
here is prematurely optimize the erratum-affected affected systems so
that they don't do two WBINVDs. Please stop trying to do that.
This WBINVD is only _needed_ for the erratum. It should be closer to
the actual erratum handing.
On Mon, 2023-11-27 at 12:05 -0800, Dave Hansen wrote:
> On 11/27/23 11:33, Huang, Kai wrote:
> > On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > > On 11/9/23 03:55, Kai Huang wrote:
> > > ...
> > > > --- a/arch/x86/kernel/reboot.c
> > > > +++ b/arch/x86/kernel/reboot.c
> > > > @@ -31,6 +31,7 @@
> > > > #include <asm/realmode.h>
> > > > #include <asm/x86_init.h>
> > > > #include <asm/efi.h>
> > > > +#include <asm/tdx.h>
> > > >
> > > > /*
> > > > * Power off function, if any
> > > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > > > local_irq_disable();
> > > > stop_other_cpus();
> > > > #endif
> > > > + /*
> > > > + * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > > + * private memory on remote cpus. Unlike SME, which does the
> > > > + * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > > + * the cache for _this_ cpu here. This is because on the
> > > > + * platforms with "partial write machine check" erratum the
> > > > + * kernel needs to convert all TDX private pages back to normal
> > > > + * before booting to the new kernel in kexec(), and the cache
> > > > + * flush must be done before that. If the kernel took SME's way,
> > > > + * it would have to muck with the relocate_kernel() assembly to
> > > > + * do memory conversion.
> > > > + */
> > > > + if (platform_tdx_enabled())
> > > > + native_wbinvd();
> > >
> > > Why can't the TDX host code just set host_mem_enc_active=1?
> > >
> > > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > > have two different WBINVD call sites. You also don't have to mess with
> > > a single line of assembly.
> >
> > I wanted to avoid changing the assembly.
> >
> > Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec)
> > when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> > which happens at very late stage right before jumping to the new kernel.
> > However for TDX when the platform has erratum we need to convert TDX private
> > pages back to normal, which must be done after flushing cache. If we reuse
> > host_mem_enc_active=1, then we will need to change the assembly code to do that.
>
> I honestly think you need to stop thinking about the partial write issue
> at this point in the series. It's really causing a horrible amount of
> unnecessary confusion.
>
> Here's the golden rule:
>
> The boot CPU needs to run WBINVD sometime after it stops writing
> to private memory but before it starts treating the memory as
> shared.
>
> On SME kernels, that key point evidently in early boot when it's
> enabling SME. I _think_ that point is also a valid place to do WBINVD
> on no-TDX-erratum systems.
You mean we could advertise cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) true for
TDX host? We could but IMHO it doesn't perfectly match.
SME kernel sets _PAGE_ENC on by default for all memory mappings IIUC, but TDX
host never actually sets any encryption bits in page tables managed by the
kernel.
So I think we can just do below, but not advertise CC_ATTR_HOST_MEM_ENCRYPT for
TDX host?
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
(unsigned long)page_list,
image->start,
image->preserve_context,
-
cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+ cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
||
+ platform_tdx_enabled());
>
> On TDX systems with the erratum, there's a *second* point before the
> private=>shared conversion occurs. I think what you're trying to do
> here is prematurely optimize the erratum-affected affected systems so
> that they don't do two WBINVDs. Please stop trying to do that.
>
> This WBINVD is only _needed_ for the erratum. It should be closer to
> the actual erratum handing.
If we do WBINVD early here then the second one isn't needed. But 100% agreed
this handling/optimization should be done later closer to the erratum handling.
On 11/27/23 12:52, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
> (unsigned long)page_list,
> image->start,
> image->preserve_context,
> -
> cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> + cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
> ||
> + platform_tdx_enabled());
Well, something more like the attached would be preferable, but you've
got the right idea logically.
On Mon, 2023-11-27 at 13:06 -0800, Hansen, Dave wrote:
> On 11/27/23 12:52, Huang, Kai wrote:
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
> > (unsigned long)page_list,
> > image->start,
> > image->preserve_context,
> > -
> > cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> > + cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
> > > >
> > + platform_tdx_enabled());
>
> Well, something more like the attached would be preferable, but you've
> got the right idea logically.
Thanks!
On top of that, I think below code (also attached the diff) should do
advertising the CC_ATTR_HOST_MEM_INCOHERENT for TDX host?
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 2e2d559169a8..bec70b967504 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -12,6 +12,8 @@
#include <asm/coco.h>
#include <asm/processor.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
static u64 cc_mask __ro_after_init;
@@ -23,7 +25,9 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_MEM_ENCRYPT:
case CC_ATTR_MEM_ENCRYPT:
- return true;
+ return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
+ case CC_ATTR_HOST_MEM_INCOHERENT:
+ return platform_tdx_enabled();
default:
return false;
}
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 73cd2f7b7d87..1ae21348edc1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1634,6 +1634,13 @@ static int __init tdx_init(void)
tdx_guest_keyid_start = tdx_keyid_start + 1;
tdx_nr_guest_keyids = nr_tdx_keyids - 1;
+ /*
+ * TDX doesn't guarantee cache coherency among different
+ * KeyIDs. Advertise the CC_ATTR_HOST_MEM_INCOHERENT
+ * attribute for TDX host.
+ */
+ cc_vendor = CC_VENDOR_INTEL;
+
return 0;
}
early_initcall(tdx_init);
I'll do some test with your code and the above code.