2024-04-05 16:58:58

by Sean Christopherson

Subject: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

- Recording and slides uploaded[1].

- Hold off on v20 for a few weeks, to try and land as much prep work as
possible before v20.

- Exactly how to slice n' dice the series to make it easier to review is TBD,
but generally speaking the plan is to queue patches into a "dead" branch,
e.g. kvm/kvm-coco-queue, when they are ready, to reduce the sheer volume of
the series and thus help alleviate reviewer fatigue.

- Don't hardcode fixed/required CPUID values in KVM, use available metadata
from TDX Module to reject "bad" guest CPUID (or let the TDX module reject?).
I.e. don't let a guest silently run with a CPUID that diverges from what
userspace provided.

- Ideally, the TDX Module would come with full metadata (not in JSON format)
that KVM can (a) use to reject a "bad" CPUID configuration (from userspace),
and (b) that KVM can provide to userspace to make debugging issues suck less.

- For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.

[1] https://drive.google.com/corp/drive/folders/1hm_ITeuB6DjT7dNd-6Ezybio4tRRQOlC
[2] https://lore.kernel.org/all/[email protected]
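
As a rough sketch of the compatibility check in the last bullet (the helper
names here are hypothetical, not from the actual series):

/*
 * Hypothetical sketch: refuse to enable TDX when KVM enumerates a usable
 * MAXPHYADDR of 52 (which wants 5-level EPT) but the TDX Module only
 * allows GPAW=0, i.e. only supports 4-level paging.
 */
static int __init tdx_check_gpaw_compat(void)
{
        if (kvm_usable_maxphyaddr() > 48 && !tdx_module_supports_gpaw1())
                return -EOPNOTSUPP;     /* refuse to enable TDX */
        return 0;
}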


2024-04-07 03:15:21

by Xiaoyao Li

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/6/2024 12:58 AM, Sean Christopherson wrote:
> - For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
> the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
> isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
> but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.

So userspace can get the supported GPAW from the usable MAXPHYADDR, i.e.,
CPUID(0x8000_0008).EAX[23:16] of KVM_GET_SUPPORTED_CPUID:
- if usable MAXPHYADDR == 52, supported GPAW is 0 and 1.
- if usable MAXPHYADDR <= 48, supported GPAW is only 0.
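
In code, the derivation is roughly (an illustrative userspace-side sketch,
not code from the series):

/*
 * Illustrative sketch: derive the supported GPAW values from the usable
 * MAXPHYADDR, CPUID(0x8000_0008).EAX[23:16], as returned by
 * KVM_GET_SUPPORTED_CPUID.
 */
static unsigned int tdx_supported_gpaw_mask(uint32_t eax_0x80000008)
{
        unsigned int usable_maxphyaddr = (eax_0x80000008 >> 16) & 0xff;
        unsigned int mask = 1u << 0;    /* GPAW=0 (48-bit GPA) */

        if (usable_maxphyaddr == 52)
                mask |= 1u << 1;        /* GPAW=1 (52-bit GPA) */

        return mask;
}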

There is another thing that needs to be discussed: how does userspace
configure GPAW for a TD guest?

Currently, KVM uses CPUID(0x8000_0008).EAX[7:0] in struct
kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) to deduce the
GPAW:

int max_pa = 36;
entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
if (entry)
        max_pa = entry->eax & 0xff;

...
if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
        return -EINVAL;
if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
        td_params->eptp_controls |= VMX_EPTP_PWL_5;
        td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
} else {
        td_params->eptp_controls |= VMX_EPTP_PWL_4;
}

The code implies that KVM allows the provided
CPUID(0x8000_0008).EAX[7:0] to be any value (when 5-level EPT is
supported): when it is > 48, KVM configures the TD's GPAW to 1,
otherwise to 0.

However, the virtual value of CPUID(0x8000_0008).EAX[7:0] inside the TD is
always the native value of the hardware (for current TDX).

So if we want to keep this behavior, we need to document somewhere that
CPUID(0x8000_0008).EAX[7:0] in struct
kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) is only for
configuring GPAW, not for userspace to configure the virtual CPUID value
for TD VMs.

Another option is that KVM doesn't allow userspace to configure
CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
kvm_tdx_init_vm for userspace to configure directly.
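
E.g., a hypothetical layout for the latter (the field name and placement are
illustrative only, not a concrete uAPI proposal):

struct kvm_tdx_init_vm {
        __u64 attributes;
        __u8  gpaw;             /* 0 = 48-bit GPA, 1 = 52-bit GPA */
        __u8  reserved[7];      /* must be zero */
        /* ... remaining fields (mrconfigid, cpuid entries, ...) unchanged ... */
};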

What do you prefer?

> [1]https://drive.google.com/corp/drive/folders/1hm_ITeuB6DjT7dNd-6Ezybio4tRRQOlC
> [2]https://lore.kernel.org/all/[email protected]


2024-04-08 16:21:43

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Sun, Apr 07, 2024, Xiaoyao Li wrote:
> On 4/6/2024 12:58 AM, Sean Christopherson wrote:
> > - For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
> > the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
> > isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
> > but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.
>
> So userspace can get supported GPAW from usable MAXPHYADDR, i.e.,
> CPUID(0x8000_0008).EAX[23:16] of KVM_GET_SUPPORTED_CPUID:
> - if usable MAXPHYADDR == 52, supported GPAW is 0 and 1.
> - if usable MAXPHYADDR <= 48, supported GPAW is only 0.
>
> There is another thing needs to be discussed. How does userspace configure
> GPAW for TD guest?
>
> Currently, KVM uses CPUID(0x8000_0008).EAX[7:0] in struct
> kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) to deduce the
> GPAW:
>
> int max_pa = 36;
> entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
> if (entry)
>         max_pa = entry->eax & 0xff;
>
> ...
> if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
>         return -EINVAL;
> if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
>         td_params->eptp_controls |= VMX_EPTP_PWL_5;
>         td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
> } else {
>         td_params->eptp_controls |= VMX_EPTP_PWL_4;
> }
>
> The code implies that KVM allows the provided CPUID(0x8000_0008).EAX[7:0] to
> be any value (when 5level ept is supported). when it > 48, configure GPAW of
> TD to 1, otherwise to 0.
>
> However, the virtual value of CPUID(0x8000_0008).EAX[7:0] inside TD is
> always the native value of hardware (for current TDX).
>
> So if we want to keep this behavior, we need to document it somewhere that
> CPUID(0x8000_0008).EAX[7:0] in struct kvm_tdx_init_vm::cpuid.entries[] of
> IOCTL(KVM_TDX_INIT_VM) is only for configuring GPAW, not for userspace to
> configure virtual CPUID value for TD VMs.
>
> Another option is that, KVM doesn't allow userspace to configure
> CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> kvm_tdx_init_vm for userspace to configure directly.
>
> What do you prefer?

Hmm, neither. I think the best approach is to build on Gerd's series to have KVM
select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
host.MAXPHYADDR.

With a moderate amount of refactoring, cache/compute guest_maxphyaddr as:

static void kvm_vcpu_refresh_maxphyaddr(struct kvm_vcpu *vcpu)
{
        struct kvm_cpuid_entry2 *best;

        best = kvm_find_cpuid_entry(vcpu, 0x80000000);
        if (!best || best->eax < 0x80000008)
                goto not_found;

        best = kvm_find_cpuid_entry(vcpu, 0x80000008);
        if (!best)
                goto not_found;

        vcpu->arch.maxphyaddr = best->eax & GENMASK(7, 0);

        if (best->eax & GENMASK(15, 8))
                vcpu->arch.guest_maxphyaddr = (best->eax & GENMASK(15, 8)) >> 8;
        else
                vcpu->arch.guest_maxphyaddr = vcpu->arch.maxphyaddr;

        return;

not_found:
        vcpu->arch.maxphyaddr = KVM_X86_DEFAULT_MAXPHYADDR;
        vcpu->arch.guest_maxphyaddr = KVM_X86_DEFAULT_MAXPHYADDR;
}

and then use vcpu->arch.guest_maxphyaddr instead of vcpu->arch.maxphyaddr when
selecting the TDP level.

static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
{
        /* tdp_root_level is architecture forced level, use it if nonzero */
        if (tdp_root_level)
                return tdp_root_level;

        /*
         * Use 5-level TDP if and only if it's useful/necessary. Definitely a
         * more verbose comment here.
         */
        if (max_tdp_level == 5 && vcpu->arch.guest_maxphyaddr <= 48)
                return 4;

        return max_tdp_level;
}

The only question is whether or not the behavior needs to be opt-in via a new
capability, e.g. in case there is some weird usage where userspace enumerates
guest.MAXPHYADDR < host.MAXPHYADDR but still wants/needs 5-level paging. I highly
doubt such a use case exists though.

I'll get Gerd's series applied, and will post a small series to implement the
above later this week.

2024-04-08 17:42:24

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > Another option is that, KVM doesn't allow userspace to configure
> > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > kvm_tdx_init_vm for userspace to configure directly.
> >
> > What do you prefer?
>
> Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> host.MAXPHYADDR.

So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.

Or are you suggesting that KVM should look at the value of CPUID(0x8000_0008).EAX[23:16] passed from
userspace?

I'm not following the code examples involving struct kvm_vcpu. Since TDX configures these at a VM
level, there isn't a vcpu.

The challenge for TDX with the KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID language of
enumeration/enabling of features is that not all CPUID leafs are configurable for TDX. Today
KVM_SET_CPUID sort of serves two roles: configuring CPUID values and actually enabling feature
virtualization logic. In the current TDX module CPUID 0x8000_0008 is not configurable. So the TDX
code in KVM would have to just peek at the value and discard it.

If we try to use a KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID model, then maybe we could clarify things
by exposing which leafs are configurable. Then userspace can know which CPUID leafs are information
for KVM, and which are actually controlling the result of the CPUID instruction in the TD.
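
For example, something shaped like this (purely hypothetical, just to make
the idea concrete):

/*
 * Hypothetical uAPI sketch: report which CPUID bits are directly
 * configurable for a TD, so userspace can distinguish "information for
 * KVM" from "controls the CPUID seen by the guest".
 */
struct kvm_tdx_cpuid_config {
        __u32 leaf;
        __u32 sub_leaf;
        __u32 eax, ebx, ecx, edx;       /* masks of configurable bits */
};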

2024-04-08 19:18:41

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > Another option is that, KVM doesn't allow userspace to configure
> > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > kvm_tdx_init_vm for userspace to configure directly.
> > >
> > > What do you prefer?
> >
> > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > host.MAXPHYADDR.
>
> So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.

Off topic, any chance I can bribe/convince you to wrap your email replies closer
to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
chars, but my brain is so well trained for 80 that it actually slows me down a
bit when reading mails that are wrapped at 100 chars.

> Or are you suggesting that KVM should look at the value of CPUID(0x8000_0008).EAX[23:16] passed from
> userspace?

This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
trying to go off memory.

> I'm not following the code examples involving struct kvm_vcpu. Since TDX
> configures these at a VM level, there isn't a vcpu.

Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
advance of the TDX INIT call? If so, the least awful solution might be to use
vCPU0's CPUID.

2024-04-08 21:56:36

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, 2024-04-08 at 18:51 +0000, Sean Christopherson wrote:
> Off topic, any chance I can bribe/convince you to wrap your email replies
> closer
> to 80 chars, not 100?  Yeah, checkpatch no longer complains when code exceeds
> 80
> chars, but my brain is so well trained for 80 that it actually slows me down a
> bit when reading mails that are wrapped at 100 chars.

Heh, sure. I was trying 100 chars recently as an experiment to better quote code
in mails. I was also getting thrown a little.

>
> > Or are you suggesting that KVM should look at the value of
> > CPUID(0x8000_0008).EAX[23:16] passed from
> > userspace?
>
> This.  Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> trying to go off memory.
>
> > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > configures these at a VM level, there isn't a vcpu.
>
> Ah, I take it GPAW is a VM-scope knob?

Yea.

>   I forget where we ended up with the ordering
> of TDX commands vs. creating vCPUs.  Does KVM allow creating vCPU structures
> in
> advance of the TDX INIT call?  If so, the least awful solution might be to use
> vCPU0's CPUID.

Currently the values for the directly settable CPUID leafs come via a TDX
specific init VM userspace API. There was some discussion on forcing the values
provided there to be consistent with the CPUIDs set on the VCPUs later:
https://lore.kernel.org/lkml/[email protected]/

Which lead to:
https://lore.kernel.org/kvm/ac424b167210288cdf32ac940bcc6ec84f8a45b9.1708933498.git.isaku.yamahata@intel.com/
https://lore.kernel.org/kvm/d394938197044b40bbe6d9ce2402f72a66a99e80.1708933498.git.isaku.yamahata@intel.com/

So KVM has to reject KVM_SET_CPUID if it doesn't match the VM-wide configuration
anyway, however the VM-scoped CPUID state ends up getting configured. Then, if we
leave the VM-scoped CPUID configuration with the VM-scoped operations, it doesn't
force KVM_SET_CPUID to learn about rejecting TDX-incompatible CPUID state (state
that is not directly configurable).
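
Roughly, the check would be shaped like this (the helper is invented, just
to illustrate the idea):

/*
 * Sketch: reject KVM_SET_CPUID2 state that contradicts the VM-wide TDX
 * configuration.  tdx_cpuid_matches_td_config() is hypothetical.
 */
static int tdx_vcpu_check_cpuid(struct kvm_vcpu *vcpu,
                                struct kvm_cpuid_entry2 *e2, int nent)
{
        if (!tdx_cpuid_matches_td_config(vcpu->kvm, e2, nent))
                return -EINVAL;
        return 0;
}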


So should we look at making the TDX side follow a
KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID pattern for feature enablement? Or am I
misreading general guidance out of this specific suggestion around GPAW?

2024-04-08 22:36:32

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, Apr 08, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-08 at 18:51 +0000, Sean Christopherson wrote:
> > > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > > configures these at a VM level, there isn't a vcpu.
> >
> > Ah, I take it GPAW is a VM-scope knob?
>
> Yea.
>
> > I forget where we ended up with the ordering of TDX commands vs. creating
> > vCPUs.  Does KVM allow creating vCPU structures in advance of the TDX INIT
> > call?  If so, the least awful solution might be to use vCPU0's CPUID.
>
> Currently the values for the directly settable CPUID leafs come via a TDX
> specific init VM userspace API.

Is guest.MAXPHYADDR one of those? If so, use that.

> So should we look at making the TDX side follow a
> KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID pattern for feature enablement? Or am I
> misreading general guidance out of this specific suggestion around GPAW?

No? Where I was going with that, is _if_ vCPUs can be created (in KVM) before
the GPAW is set (in the TDX module), then using vCPU0's guest.MAXPHYADDR to
compute the desired GPAW may be the least awful solution, all things considered.

2024-04-08 23:46:38

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, 2024-04-08 at 15:36 -0700, Sean Christopherson wrote:
> > Currently the values for the directly settable CPUID leafs come via a TDX
> > specific init VM userspace API.
>
> Is guest.MAXPHYADDR one of those?  If so, use that.

No, it is not configurable. I'm looking into making it configurable, but it is
not likely to happen before we were hoping to get basic support upstream. An
alternative would be to have the KVM API peek at the value, and then discard it
(not pass the leaf value to the TDX module). Not ideal. Or have a dedicated GPAW
field and expose the concept to userspace like Xiaoyao was talking about.

>
> > So should we look at making the TDX side follow a
> > KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID pattern for feature enablement? Or am
> > I
> > misreading general guidance out of this specific suggestion around GPAW?
>
> No?  Where I was going with that, is _if_ vCPUs can be created (in KVM) before
> the GPAW is set (in the TDX module), then using vCPU0's guest.MAXPHYADDR to
> compute the desired GPAW may be the least awful solution, all things
> considered.

Sorry, I was trying to uplevel the conversation to be about the general concept
of matching TD configuration to CPUID bits. Let me try to articulate the problem
a little better.

Today, KVM’s KVM_GET_SUPPORTED_CPUID is a way to specify which features are
virtualizable by KVM. Communicating this via CPUID leaf values works for the
most part, because CPUID is already designed to communicate which features are
supported. But TDX has a different language to communicate which features are
supported: special fields that are passed when creating a VM, namely XFAM
(matching XCR0 features) and ATTRIBUTES (TDX-specific flags for MSR-based
features like PKS, etc). So compared to KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID,
the TDX module instead accepts only a few CPUID bits to be set directly by the
VMM, and sets other CPUID leafs to match the features configured via XFAM and
ATTRIBUTES.
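
So, roughly (field names sketched from the description above, not the exact
TDX module ABI):

struct td_params_sketch {
        __u64 attributes;       /* TDX flags for MSR-based features, e.g. PKS */
        __u64 xfam;             /* XCR0-style feature mask */
        /* plus only the handful of directly configurable CPUID bits */
};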

There are also some bits/features that have fixed values. Which leafs are fixed
and what the values are isn't something provided by any current TDX module API.
Instead they are only known via documentation, which is subject to change. The
queryable information is limited to communicating which bits are directly
configurable.

So the current interface won't allow us to perfectly match the
KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID model, even excluding the VM-scoped vs
vCPU-scoped differences. However, we could try to match the general design a
little better.

Here we were discussing making GPAW configurable via a dedicated named field,
but the suggestion is to instead include it in CPUID bits. The current API takes
ATTRIBUTES as a dedicated field too, but there actually are CPUID bits for some
of those features; those CPUID bits are controlled instead via the associated
ATTRIBUTES. So we could expose such features via CPUID as well. Userspace would,
for example, pass the PKS CPUID bit in, and KVM would see it and configure PKS
via the ATTRIBUTES bit.
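
E.g., something like this sketch (the ATTRIBUTES bit position is my
recollection of the TDX docs; treat it as an assumption):

#define TDX_TD_ATTRIBUTE_PKS    BIT_ULL(30)     /* assumed bit position */

static void tdx_cpuid_to_attributes(const struct kvm_cpuid_entry2 *e,
                                    u64 *attributes)
{
        /* PKS is enumerated by CPUID(7, 0).ECX[31]. */
        if (e->function == 7 && e->index == 0 && (e->ecx & BIT(31)))
                *attributes |= TDX_TD_ATTRIBUTE_PKS;
}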

So what I was looking to understand is, what is the enthusiasm for generally
continuing to use CPUID as the main method for specifying which features should
be enabled/virtualized, if we can't match the existing
KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID APIs. Is the hope just to make userspace's
code more unified between TDX and normal VMs?

2024-04-09 02:58:24

by Xiaoyao Li

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/9/2024 12:20 AM, Sean Christopherson wrote:
> On Sun, Apr 07, 2024, Xiaoyao Li wrote:
>> On 4/6/2024 12:58 AM, Sean Christopherson wrote:
>>> - For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
>>> the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
>>> isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
>>> but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.
>>
>> So userspace can get supported GPAW from usable MAXPHYADDR, i.e.,
>> CPUID(0x8000_0008).EAX[23:16] of KVM_GET_SUPPORTED_CPUID:
>> - if usable MAXPHYADDR == 52, supported GPAW is 0 and 1.
>> - if usable MAXPHYADDR <= 48, supported GPAW is only 0.
>>
>> There is another thing needs to be discussed. How does userspace configure
>> GPAW for TD guest?
>>
>> Currently, KVM uses CPUID(0x8000_0008).EAX[7:0] in struct
>> kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) to deduce the
>> GPAW:
>>
>> int max_pa = 36;
>> entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
>> if (entry)
>>         max_pa = entry->eax & 0xff;
>>
>> ...
>> if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
>>         return -EINVAL;
>> if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
>>         td_params->eptp_controls |= VMX_EPTP_PWL_5;
>>         td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
>> } else {
>>         td_params->eptp_controls |= VMX_EPTP_PWL_4;
>> }
>>
>> The code implies that KVM allows the provided CPUID(0x8000_0008).EAX[7:0] to
>> be any value (when 5level ept is supported). when it > 48, configure GPAW of
>> TD to 1, otherwise to 0.
>>
>> However, the virtual value of CPUID(0x8000_0008).EAX[7:0] inside TD is
>> always the native value of hardware (for current TDX).
>>
>> So if we want to keep this behavior, we need to document it somewhere that
>> CPUID(0x8000_0008).EAX[7:0] in struct kvm_tdx_init_vm::cpuid.entries[] of
>> IOCTL(KVM_TDX_INIT_VM) is only for configuring GPAW, not for userspace to
>> configure virtual CPUID value for TD VMs.
>>
>> Another option is that, KVM doesn't allow userspace to configure
>> CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
>> kvm_tdx_init_vm for userspace to configure directly.
>>
>> What do you prefer?
>
> Hmm, neither. I think the best approach is to build on Gerd's series to have KVM
> select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> host.MAXPHYADDR.

I see no difference between using guest.MAXPHYADDR (EAX[23:16]) and
using host.MAXPHYADDR (EAX[7:0]) to determine the GPAW (and EPT level)
for a TD guest. The TDX case diverges from that of non-TDX VMs: the
value passed from userspace can only be used to configure the GPAW and
EPT level for the TD, but won't be reflected in CPUID inside the TD.

So I take it you prefer the former option over a dedicated GPAW field.

> With a moderate amount of refactoring, cache/compute guest_maxphyaddr as:
>
> static void kvm_vcpu_refresh_maxphyaddr(struct kvm_vcpu *vcpu)
> {
>         struct kvm_cpuid_entry2 *best;
>
>         best = kvm_find_cpuid_entry(vcpu, 0x80000000);
>         if (!best || best->eax < 0x80000008)
>                 goto not_found;
>
>         best = kvm_find_cpuid_entry(vcpu, 0x80000008);
>         if (!best)
>                 goto not_found;
>
>         vcpu->arch.maxphyaddr = best->eax & GENMASK(7, 0);
>
>         if (best->eax & GENMASK(15, 8))
>                 vcpu->arch.guest_maxphyaddr = (best->eax & GENMASK(15, 8)) >> 8;
>         else
>                 vcpu->arch.guest_maxphyaddr = vcpu->arch.maxphyaddr;
>
>         return;
>
> not_found:
>         vcpu->arch.maxphyaddr = KVM_X86_DEFAULT_MAXPHYADDR;
>         vcpu->arch.guest_maxphyaddr = KVM_X86_DEFAULT_MAXPHYADDR;
> }
>
> and then use vcpu->arch.guest_maxphyaddr instead of vcpu->arch.maxphyaddr when
> selecting the TDP level.
>
> static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
> {
>         /* tdp_root_level is architecture forced level, use it if nonzero */
>         if (tdp_root_level)
>                 return tdp_root_level;
>
>         /*
>          * Use 5-level TDP if and only if it's useful/necessary. Definitely a
>          * more verbose comment here.
>          */
>         if (max_tdp_level == 5 && vcpu->arch.guest_maxphyaddr <= 48)
>                 return 4;
>
>         return max_tdp_level;
> }
>
> The only question is whether or not the behavior needs to be opt-in via a new
> capability, e.g. in case there is some weird usage where userspace enumerates
> guest.MAXPHYADDR < host.MAXPHYADDR but still wants/needs 5-level paging. I highly
> doubt such a use case exists though.
>
> I'll get Gerd's series applied, and will post a small series to implement the
> above later this week.


2024-04-09 14:02:03

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, Apr 09, 2024, Xiaoyao Li wrote:
> On 4/9/2024 12:20 AM, Sean Christopherson wrote:
> > On Sun, Apr 07, 2024, Xiaoyao Li wrote:
> > > On 4/6/2024 12:58 AM, Sean Christopherson wrote:
> > > > - For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
> > > > the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
> > > > isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
> > > > but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.
> > >
> > > So userspace can get supported GPAW from usable MAXPHYADDR, i.e.,
> > > CPUID(0x8000_0008).EAX[23:16] of KVM_GET_SUPPORTED_CPUID:
> > > - if usable MAXPHYADDR == 52, supported GPAW is 0 and 1.
> > > - if usable MAXPHYADDR <= 48, supported GPAW is only 0.
> > >
> > > There is another thing needs to be discussed. How does userspace configure
> > > GPAW for TD guest?
> > >
> > > Currently, KVM uses CPUID(0x8000_0008).EAX[7:0] in struct
> > > kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) to deduce the
> > > GPAW:
> > >
> > > int max_pa = 36;
> > > entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
> > > if (entry)
> > >         max_pa = entry->eax & 0xff;
> > >
> > > ...
> > > if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
> > >         return -EINVAL;
> > > if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
> > >         td_params->eptp_controls |= VMX_EPTP_PWL_5;
> > >         td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
> > > } else {
> > >         td_params->eptp_controls |= VMX_EPTP_PWL_4;
> > > }
> > >
> > > The code implies that KVM allows the provided CPUID(0x8000_0008).EAX[7:0] to
> > > be any value (when 5level ept is supported). when it > 48, configure GPAW of
> > > TD to 1, otherwise to 0.
> > >
> > > However, the virtual value of CPUID(0x8000_0008).EAX[7:0] inside TD is
> > > always the native value of hardware (for current TDX).
> > >
> > > So if we want to keep this behavior, we need to document it somewhere that
> > > CPUID(0x8000_0008).EAX[7:0] in struct kvm_tdx_init_vm::cpuid.entries[] of
> > > IOCTL(KVM_TDX_INIT_VM) is only for configuring GPAW, not for userspace to
> > > configure virtual CPUID value for TD VMs.
> > >
> > > Another option is that, KVM doesn't allow userspace to configure
> > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > kvm_tdx_init_vm for userspace to configure directly.
> > >
> > > What do you prefer?
> >
> > Hmm, neither. I think the best approach is to build on Gerd's series to have KVM
> > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > host.MAXPHYADDR.
>
> I see no difference between using guest.MAXPHYADDR (EAX[23:16]) and using
> host.MAXPHYADDR (EAX[7:0]) to determine the GPAW (and EPT level) for a TD
> guest. The TDX case diverges from that of non-TDX VMs: the value passed
> from userspace can only be used to configure the GPAW and EPT level for
> the TD, but won't be reflected in CPUID inside the TD.

But the TDX module will emulate EAX[7:0] to match hardware, no? Whenever possible,
the CPUID entries passed to KVM should match the CPUID values that are observed
by the guest. E.g. if host.MAXPHYADDR=52, but the CPU only supports 4-level
paging, then KVM should get host.MAXPHYADDR=52, guest.MAXPHYADDR=48.
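
In CPUID terms, that example packs as (per the guest.MAXPHYADDR layout in the
series referenced above):

/* CPUID(0x8000_0008).EAX for the example above: */
u32 eax = 52            /* EAX[7:0]   = host.MAXPHYADDR  */
        | (48 << 16);   /* EAX[23:16] = guest.MAXPHYADDR */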

As I said in my response to Rick:

: > An alternative would be to have the KVM API peek at the value, and then
: > discard it (not pass the leaf value to the TDX module). Not ideal.
:
: Heh, I typed up this idea before reading ahead. This has my vote. Unless I'm
: misreading where things are headed, using guest.MAXPHYADDR to communicate what
: is essentially GPAW to the guest is about to become the de facto standard.
:
: At that point, KVM can basically treat the current TDX module behavior as an
: erratum, i.e. discarding guest.MAXPHYADDR becomes a workaround for a "CPU" bug,
: not some goofy KVM quirk.

2024-04-09 14:16:07

by Xiaoyao Li

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/9/2024 10:01 PM, Sean Christopherson wrote:
> On Tue, Apr 09, 2024, Xiaoyao Li wrote:
>> On 4/9/2024 12:20 AM, Sean Christopherson wrote:
>>> On Sun, Apr 07, 2024, Xiaoyao Li wrote:
>>>> On 4/6/2024 12:58 AM, Sean Christopherson wrote:
>>>>> - For guest MAXPHYADDR vs. GPAW, rely on KVM_GET_SUPPORTED_CPUID to enumerate
>>>>> the usable MAXPHYADDR[2], and simply refuse to enable TDX if the TDX Module
>>>>> isn't compatible. Specifically, if MAXPHYADDR=52, 5-level paging is enabled,
>>>>> but the TDX-Module only allows GPAW=0, i.e. only supports 4-level paging.
>>>>
>>>> So userspace can get supported GPAW from usable MAXPHYADDR, i.e.,
>>>> CPUID(0x8000_0008).EAX[23:16] of KVM_GET_SUPPORTED_CPUID:
>>>> - if usable MAXPHYADDR == 52, supported GPAW is 0 and 1.
>>>> - if usable MAXPHYADDR <= 48, supported GPAW is only 0.
>>>>
>>>> There is another thing needs to be discussed. How does userspace configure
>>>> GPAW for TD guest?
>>>>
>>>> Currently, KVM uses CPUID(0x8000_0008).EAX[7:0] in struct
>>>> kvm_tdx_init_vm::cpuid.entries[] of IOCTL(KVM_TDX_INIT_VM) to deduce the
>>>> GPAW:
>>>>
>>>> int max_pa = 36;
>>>> entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 0x80000008, 0);
>>>> if (entry)
>>>>         max_pa = entry->eax & 0xff;
>>>>
>>>> ...
>>>> if (!cpu_has_vmx_ept_5levels() && max_pa > 48)
>>>>         return -EINVAL;
>>>> if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
>>>>         td_params->eptp_controls |= VMX_EPTP_PWL_5;
>>>>         td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
>>>> } else {
>>>>         td_params->eptp_controls |= VMX_EPTP_PWL_4;
>>>> }
>>>>
>>>> The code implies that KVM allows the provided CPUID(0x8000_0008).EAX[7:0] to
>>>> be any value (when 5level ept is supported). when it > 48, configure GPAW of
>>>> TD to 1, otherwise to 0.
>>>>
>>>> However, the virtual value of CPUID(0x8000_0008).EAX[7:0] inside TD is
>>>> always the native value of hardware (for current TDX).
>>>>
>>>> So if we want to keep this behavior, we need to document it somewhere that
>>>> CPUID(0x8000_0008).EAX[7:0] in struct kvm_tdx_init_vm::cpuid.entries[] of
>>>> IOCTL(KVM_TDX_INIT_VM) is only for configuring GPAW, not for userspace to
>>>> configure virtual CPUID value for TD VMs.
>>>>
>>>> Another option is that, KVM doesn't allow userspace to configure
>>>> CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
>>>> kvm_tdx_init_vm for userspace to configure directly.
>>>>
>>>> What do you prefer?
>>>
>>> Hmm, neither. I think the best approach is to build on Gerd's series to have KVM
>>> select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
>>> host.MAXPHYADDR.
>>
>> I see no difference between using guest.MAXPHYADDR (EAX[23:16]) and using
>> host.MAXPHYADDR (EAX[7:0]) to determine the GPAW (and EPT level) for a TD
>> guest. The TDX case diverges from that of non-TDX VMs: the value passed
>> from userspace can only be used to configure the GPAW and EPT level for
>> the TD, but won't be reflected in CPUID inside the TD.
>
> But the TDX module will emulate EAX[7:0] to match hardware, no? Whenever possible,
> the CPUID entries passed to KVM should match the CPUID values that are observed
> by the guest. E.g. if host.MAXPHYADDR=52, but the CPU only supports 4-level
> paging, then KVM should get host.MAXPHYADDR=52, guest.MAXPHYADDR=48.

Side topic: do we expect KVM to check the input EAX[7:0] to match the
hardware value? Or a zero value? Or are both allowed?

> As I said in my response to Rick:
>
> : > An alternative would be to have the KVM API peek at the value, and then
> : > discard it (not pass the leaf value to the TDX module). Not ideal.
> :
> : Heh, I typed up this idea before reading ahead. This has my vote. Unless I'm
> : misreading where things are headed, using guest.MAXPHYADDR to communicate what
> : is essentially GPAW to the guest is about to become the de facto standard.
> :
> : At that point, KVM can basically treat the current TDX module behavior as an
> : erratum, i.e. discarding guest.MAXPHYADDR becomes a workaround for a "CPU" bug,
> : not some goofy KVM quirk.

Yes, bits [23:16] fit better.

2024-04-09 14:49:21

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, 2024-04-08 at 18:37 -0700, Sean Christopherson wrote:
> > > Is guest.MAXPHYADDR one of those?  If so, use that.
> >
> > No, it is not configurable. I'm looking into making it configurable, but it is
> > not
> > likely to happen before we were hoping to get basic support upstream.
>
> Yeah, love me some hardware defined software.
>
> > An alternative would be to have the KVM API peek at the value, and then
> > discard it (not pass the leaf value to the TDX module). Not ideal.
>
> Heh, I typed up this idea before reading ahead.  This has my vote.  Unless I'm
> misreading where things are headed, using guest.MAXPHYADDR to communicate what
> is essentially GPAW to the guest is about to become the de facto standard.
>
> At that point, KVM can basically treat the current TDX module behavior as an
> erratum, i.e. discarding guest.MAXPHYADDR becomes a workaround for a "CPU"
> bug,
> not some goofy KVM quirk.

Makes sense. I'd like to get to the point where we can say it's for sure coming.
Hopefully will happen soon.

> >
[snip]
> > >
>
> As I said in PUCK (and recorded in the notes), the fixed values should be
> provided
> in a data format that is easily consumed by C code, so that KVM can report
> that
> to userspace with

Right, I thought I heard this on the call, and to use the upper bits of that
leaf for GPAW. What has changed since then is a little more learning on the TDX
module behavior around CPUID bits.

The runtime API doesn't provide what the fixed values actually are, but per the
TDX module folks, which bits are fixed and what the values are could change
without an opt-in. This raised the question for me of what exactly KVM should
expect of TDX module backwards compatibility and what SW is expected to actually
do with that JSON file. I'm still trying to track that down. Long term we need
the TDX module to expose an interface to provide more info about the CPUID
leafs, and those discussions are just starting.

If KVM needs to expose the values of the fixed leafs today (doesn't seem like a
bad idea, but I'm still not clear on the exact consumption), then most would
have to be exposed as "unknown", or something like that.

>
> > So the current interface won't allow us to perfectly match the
> > KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID. Even excluding the vm-scoped vs vcpu-
> > scoped differences. However, we could try to match the general design a
> > little better.
>
> No, don't try to match KVM_GET_SUPPORTED_CPUID, it's a terrible API that no
> one
> likes.  The only reason we haven't replaced is because no one has come up with
> a
> universally better idea.  For feature flags, communicating what KVM supports
> is
> straightforward, mostly.  But for things like topology, communicating exactly
> what
> KVM "supports" is much more difficult.
>
> The TDX fixed bits are very different.  It's the TDX module, and thus KVM,
> saying
> "here are the bits that you _must_ set to these exact values".

Right, we would need something like a KVM_GET_SUPPORTED_CPUID_ON,
KVM_GET_SUPPORTED_CPUID_OFF and KVM_GET_SUPPORTED_CPUID_OPTIONAL. And we'd still
inherit the KVM_GET_SUPPORTED_CPUID problems for the leafs that are not simple
bits.

>
> > Here we were discussing making gpaw configurable via a dedicated named
> > field,
> > but the suggestion is to instead include it in CPUID bits. The current API
> > takes
> > ATTRIBUTES as a dedicated field too. But there actually are CPUID bits for
> > some
> > of those features. Those CPUID bits are controlled instead via the
> > associated
> > ATTRIBUTES. So we could expose such features via CPUID as well. Userspace
> > would
> > for example, pass the PKS CPUID bit in, and KVM would see it and configure
> > PKS
> > via the ATTRIBUTES bit.
> >
> > So what I was looking to understand is, what is the enthusiasm for generally
> > continuing to use CPUID has the main method for specifying which features
> > should
> > be enabled/virtualized, if we can't match the existing
> > KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID APIs. Is the hope just to make
> > userspace's
> > code more unified between TDX and normal VMs?
>
> I need to look at the TDX code more to form an (updated) opinion.  IIRC, my
> opinion
> from four years ago was to use ATTRIBUTES and then force CPUID to match. 
> Whether
> or not that's still my preferred approach probably depends on how many, and
> what,
> things are shoved into attributes.

Thanks. Paolo seemed eager to get the uAPI settled for TDX. Based on that, it's
one of the top priorities for us right now.

Although having userspace upstream before kernel makes me nervous. It caused a
pile of problems for my last project (shadow stack).

2024-04-09 15:27:34

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, Apr 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-08 at 18:37 -0700, Sean Christopherson wrote:
> > As I said in PUCK (and recorded in the notes), the fixed values should be
> > provided in a data format that is easily consumed by C code, so that KVM
> > can report that to userspace with
>
> Right, I thought I heard this on the call, and to use the upper bits of that
> leaf for GPAW. What has changed since then is a little more learning on the TDX
> module behavior around CPUID bits.
>
> The runtime API doesn't provide what the fixed values actually are, but per the
> TDX module folks, which bits are fixed and what the values are could change
> without an opt-in.

Change when? While the module is running? Between modules?

> This begged the questions for me of what exactly KVM should expect of TDX
> module backwards compatibility and what SW is expected to actually do with
> that JSON file. I'm still trying to track that down.

There is nothing to track down, we damn well state what KVM's requirements are,
and the TDX folks make it so.

I don't want JSON. I want a data payload that is easily consumable in C code,
which contains (a) the bits that are fixed and (b) their values. If a value can
change at runtime, it's not fixed.

The only question is, how do we document/define/structure KVM's uAPI so that _if_
the TDX module breaks backwards compatibility by mucking with fixed bits, then
it's Intel's problem, not KVM's problem.

2024-04-09 15:50:35

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, 2024-04-09 at 08:23 -0700, Sean Christopherson wrote:
> > Right, I thought I heard this on the call, and to use the upper bits of that
> > leaf for GPAW. What has changed since then is a little more learning on the
> > TDX
> > module behavior around CPUID bits.
> >
> > The runtime API doesn't provide what the fixed values actually are, but per
> > the
> > TDX module folks, which bits are fixed and what the values are could change
> > without an opt-in.
>
> Change when?  While the module is running?  Between modules?

Between modules. They are fixed for a specific TDX module version. But the TDX
module could change.

Ah! Maybe there is confusion about where the JSON file is coming from. It is
*not* coming from the TDX module, it is coming from the Intel site that has the
documentation to download. It is another form of documentation. Haha, if this is
the confusion, I see why you reacted that way to "JSON". That would be quite the
curious choice for a TDX module API.

So it is easy to convert it to a C struct and embed it in KVM. It's just not
that useful because it will not necessarily be valid for future TDX modules.

>
> > This begged the questions for me of what exactly KVM should expect of TDX
> > module backwards compatibility and what SW is expected to actually do with
> > that JSON file. I'm still trying to track that down.
>
> There is nothing to track down, we damn well state what KVM's requirements
> are,
> and the TDX folks make it so.

I'm on the same page.

>
> I don't want JSON.  I want a data payload that is easily consumable in C code,
> which contains (a) the bits that are fixed and (b) their values.  If a value
> can
> change at runtime, it's not fixed.

Right. The fixed values have to come in a reasonable format from the TDX module
at runtime, or require an opt-in for any CPUID bits to change in future TDX
modules.

>
> The only question is, how do we document/define/structure KVM's uAPI so that
> _if_
> the TDX module breaks backwards compatibility by mucking with fixed bits, then
> it's Intel's problem, not KVM's problem.

If we can't trust it at all, then we can always disable it if it behaves badly.
I think everyone would like to avoid this. Yes, if we have a better idea of how
it would change, we can try to avoid that scenario with API design.

2024-04-09 16:19:16

by Xiaoyao Li

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/10/2024 12:13 AM, Xiaoyao Li wrote:
> On 4/9/2024 11:49 PM, Edgecombe, Rick P wrote:
>>> I don't want JSON.  I want a data payload that is easily consumable
>>> in C code,
>>> which contains (a) the bits that are fixed and (b) their values.  If
>>> a value
>>> can
>>> change at runtime, it's not fixed.
>> Right. The fixed values have to come in a reasonable format from the
>> TDX module
>> at runtime, or require an opt-in for any CPUID bits to change in
>> future TDX
>> modules.
>
> I have a thought for the current situation where the TDX module doesn't
> report fixed CPUID bits via the SEAMCALL interface but defines them in
> docs. The VMM (KVM or userspace) can maintain a hardcoded array of fixed
> CPUID bits and their values according to the TDX docs.  And the VMM needs
> to update the fixed array by stripping out the bits that are reported in
> TDSYSINFO.CPUID_CONFIG[], which are configurable.
>
> If a newer TDX module changes some fixed bits to configurable bits,
> they will show up in TDSYSINFO.CPUID_CONFIG[]. So the VMM can update the
> fixed array correctly.
>
> In fact, this is how the TDX QEMU series currently implements it.
>
> However, it requires the TDX module to follow the rule that if any bit
> becomes not fixed, it needs to be reported in TDSYSINFO.CPUID_CONFIG[]
> as configurable.

If the TDX module flips a bit between fixed0 and fixed1, it doesn't work
either. :(

> It's just for the case where there is no interface from the TDX module to
> report the fixed CPUID bits in the end.
>


2024-04-09 16:26:18

by Sean Christopherson

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, Apr 09, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-04-09 at 08:23 -0700, Sean Christopherson wrote:
> > > Right, I thought I heard this on the call, and to use the upper bits of
> > > that leaf for GPAW. What has changed since then is a little more learning
> > > on the TDX module behavior around CPUID bits.
> > >
> > > The runtime API doesn't provide what the fixed values actually are, but
> > > per the TDX module folks, which bits are fixed and what the values are
> > > could change without an opt-in.
> >
> > Change when?  While the module is running?  Between modules?
>
> Between modules. They are fixed for a specific TDX module version. But the TDX
> module could change.
>
> Ah! Maybe there is confusion about where the JSON file is coming from. It is
> *not* coming from the TDX module, it is coming from the Intel site that has the
> documentation to download. It is another form of documentation.

I know.

> Haha, if this is the confusion, I see why you reacted that way to "JSON".
> That would be quite the curious choice for a TDX module API.
>
> So it is easy to convert it to a C struct and embed it in KVM. It's just not
> that useful because it will not necessarily be valid for future TDX modules.

No, I don't want to embed anything in KVM, that's the exact same as hardcoding
crud into KVM, which is what I want to avoid. I want to be able to roll out a
new TDX module without any kernel changes, and I want userspace to be able to
assert that, for a given TDX module, the effective guest CPUID configuration
aligns with userspace's desired vCPU model, i.e. that the values of fixed bits
match up with the guest CPUID that userspace wants to define.

Maybe that just means converting the JSON file into some binary format that the
kernel can already parse. But I want Intel to commit to providing that metadata
along with every TDX module.

2024-04-09 16:35:46

by Xiaoyao Li

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/9/2024 11:49 PM, Edgecombe, Rick P wrote:
>> I don't want JSON.  I want a data payload that is easily consumable in C code,
>> which contains (a) the bits that are fixed and (b) their values.  If a value
>> can
>> change at runtime, it's not fixed.
> Right. The fixed values have to come in a reasonable format from the TDX module
> at runtime, or require an opt-in for any CPUID bits to change in future TDX
> modules.

I have a thought for the current situation where the TDX module doesn't
report fixed CPUID bits via the SEAMCALL interface but defines them in
docs. The VMM (KVM or userspace) can maintain a hardcoded array of fixed
CPUID bits and their values according to the TDX docs. And the VMM needs
to update the fixed array by stripping out the bits that are reported in
TDSYSINFO.CPUID_CONFIG[], which are configurable.

If a newer TDX module changes some fixed bits to configurable bits,
they will show up in TDSYSINFO.CPUID_CONFIG[]. So the VMM can update the
fixed array correctly.

In fact, this is how the TDX QEMU series currently implements it.

However, it requires the TDX module to follow the rule that if any bit
becomes not fixed, it needs to be reported in TDSYSINFO.CPUID_CONFIG[]
as configurable.

It's just for the case where there is no interface from the TDX module to
report the fixed CPUID bits in the end.
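
Roughly, in C (struct layout and names are illustrative, not the actual TDX
module ABI):

#include <stdint.h>

struct cpuid_mask {
        uint32_t leaf, subleaf;
        uint32_t eax, ebx, ecx, edx;
};

/* Drop any doc-derived "fixed" bits the module says are configurable. */
static void strip_configurable(struct cpuid_mask *fixed, int nr_fixed,
                               const struct cpuid_mask *cfg, int nr_cfg)
{
        for (int i = 0; i < nr_fixed; i++) {
                for (int j = 0; j < nr_cfg; j++) {
                        if (fixed[i].leaf != cfg[j].leaf ||
                            fixed[i].subleaf != cfg[j].subleaf)
                                continue;
                        /* Configurable bits are, by definition, not fixed. */
                        fixed[i].eax &= ~cfg[j].eax;
                        fixed[i].ebx &= ~cfg[j].ebx;
                        fixed[i].ecx &= ~cfg[j].ecx;
                        fixed[i].edx &= ~cfg[j].edx;
                }
        }
}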

2024-04-10 01:05:38

by Huang, Kai

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy



On 10/04/2024 4:18 am, Xiaoyao Li wrote:
> On 4/10/2024 12:13 AM, Xiaoyao Li wrote:
>> On 4/9/2024 11:49 PM, Edgecombe, Rick P wrote:
>>>> I don't want JSON.  I want a data payload that is easily consumable
>>>> in C code,
>>>> which contains (a) the bits that are fixed and (b) their values.  If
>>>> a value
>>>> can
>>>> change at runtime, it's not fixed.
>>> Right. The fixed values have to come in a reasonable format from the
>>> TDX module
>>> at runtime, or require an opt-in for any CPUID bits to change in
>>> future TDX
>>> modules.
>>
>> I have a thought for current situation that TDX module doesn't report
>> fixed CPUID bits via SEAMCALL interface but defines them in docs. VMM
>> (KVM or userspace) can maintain a hardcoded array of fixed CPUID bits
>> and their values according to TDX docs.  And VMM needs to update the
>> fixed array by stripping out the bits that are reported in
>> TDSYSINFO.CPUID_CONFIG[], which are configurable.
>>
>> If the newer TDX module changes some fixed bits to configurable bits,
>> They will show up in TDSYSINFO.CPUID_CONFIG[]. So VMM can update fixed
>> array correctly.
>>
>> In fact, this is how the TDX QEMU series currently implements it.
>>
>> However, it requires TDX module to follow the rule that if any bit
>> becomes not fixed, it needs to be reported in TDSYSINFO.CPUID_CONFIG[]
>> as configurable.
>
> If the TDX module flips a bit between fixed0 and fixed1, it doesn't work
> either. :(

Exactly. I think we need to ask the TDX module to report such information
rather than depending on some JSON file.

2024-04-10 01:12:53

by Isaku Yamahata

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, Apr 08, 2024 at 06:51:40PM +0000,
Sean Christopherson <[email protected]> wrote:

> On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> > On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > > Another option is that, KVM doesn't allow userspace to configure
> > > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > > kvm_tdx_init_vm for userspace to configure directly.
> > > >
> > > > What do you prefer?
> > >
> > > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > > host.MAXPHYADDR.
> >
> > So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> > could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.
>
> Off topic, any chance I can bribe/convince you to wrap your email replies closer
> to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
> chars, but my brain is so well trained for 80 that it actually slows me down a
> bit when reading mails that are wrapped at 100 chars.
>
> > Or are you suggesting that KVM should look at the value of CPUID(0x8000_0008).EAX[23:16] passed from
> > userspace?
>
> This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> trying to go off memory.
>
> > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > configures these at a VM level, there isn't a vcpu.
>
> Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
> of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
> advance of the TDX INIT call? If so, the least awful solution might be to use
> vCPU0's CPUID.

The current order is: KVM VM creation (KVM_CREATE_VM),
KVM vCPU creation (KVM_CREATE_VCPU), TDX VM initialization (KVM_TDX_INIT_VM),
and TDX vCPU initialization (KVM_TDX_INIT_VCPU).
We can call KVM_SET_CPUID2 before KVM_TDX_INIT_VM. We can remove the cpuid part
from struct kvm_tdx_init_vm by using vCPU0's CPUID.
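
In userspace terms, the current order is roughly (assuming the TDX commands
are issued via KVM_MEMORY_ENCRYPT_OP, as in the current series, if I recall
correctly):

vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);
vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &init_vm_cmd);      /* KVM_TDX_INIT_VM */
ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &init_vcpu_cmd);  /* KVM_TDX_INIT_VCPU */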
--
Isaku Yamahata <[email protected]>

2024-04-10 14:12:39

by Huang, Kai

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, 2024-04-09 at 18:12 -0700, Isaku Yamahata wrote:
> On Mon, Apr 08, 2024 at 06:51:40PM +0000,
> Sean Christopherson <[email protected]> wrote:
>
> > On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> > > On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > > > Another option is that, KVM doesn't allow userspace to configure
> > > > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > > > kvm_tdx_init_vm for userspace to configure directly.
> > > > >
> > > > > What do you prefer?
> > > >
> > > > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > > > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > > > host.MAXPHYADDR.
> > >
> > > So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> > > could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.
> >
> > Off topic, any chance I can bribe/convince you to wrap your email replies closer
> > to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
> > chars, but my brain is so well trained for 80 that it actually slows me down a
> > bit when reading mails that are wrapped at 100 chars.
> >
> > > Or are you suggesting that KVM should look at the value of CPUID(0x8000_0008).EAX[23:16] passed from
> > > userspace?
> >
> > This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> > trying to go off memory.
> >
> > > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > > configures these at a VM level, there isn't a vcpu.
> >
> > Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
> > of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
> > advance of the TDX INIT call? If so, the least awful solution might be to use
> > vCPU0's CPUID.
>
> The current order is: KVM VM creation (KVM_CREATE_VM),
> KVM vCPU creation (KVM_CREATE_VCPU), TDX VM initialization (KVM_TDX_INIT_VM),
> and TDX vCPU initialization (KVM_TDX_INIT_VCPU).
> We can call KVM_SET_CPUID2 before KVM_TDX_INIT_VM. We can remove the cpuid part
> from struct kvm_tdx_init_vm by using vCPU0's CPUID.

What's the reason to call KVM_TDX_INIT_VM after KVM_CREATE_VCPU?

I guess I have been away from this for too long, but I had believed
KVM_TDX_INIT_VM was called before creating any vCPU, which turns out to be wrong.

I am not against making KVM_TDX_INIT_VM have to be called after creating (at
least one?) vCPU if there's a good reason, but it seems that if the purpose is
just to pass CPUID(0x8000_0008).EAX[23:16] to KVM so KVM can determine the GPAW
for a TDX guest, then we can also make KVM_TDX_INIT_VM pass that.

KVM would just need to manually handle CPUID(0x8000_0008) in KVM_TDX_INIT_VM, but
that's something KVM needs to do anyway even if we use vCPU0's CPUID.

Am I missing anything?

2024-04-11 01:04:06

by Isaku Yamahata

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Wed, Apr 10, 2024 at 02:03:26PM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Tue, 2024-04-09 at 18:12 -0700, Isaku Yamahata wrote:
> > On Mon, Apr 08, 2024 at 06:51:40PM +0000,
> > Sean Christopherson <[email protected]> wrote:
> >
> > > On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> > > > On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > > > > Another option is that, KVM doesn't allow userspace to configure
> > > > > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > > > > kvm_tdx_init_vm for userspace to configure directly.
> > > > > >
> > > > > > What do you prefer?
> > > > >
> > > > > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > > > > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > > > > host.MAXPHYADDR.
> > > >
> > > > So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> > > > could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.
> > >
> > > Off topic, any chance I can bribe/convince you to wrap your email replies closer
> > > to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
> > > chars, but my brain is so well trained for 80 that it actually slows me down a
> > > bit when reading mails that are wrapped at 100 chars.
> > >
> > > > Or are you suggesting that KVM should look at the value of CPUID(0x8000_0008).EAX[23:16] passed from
> > > > userspace?
> > >
> > > This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> > > trying to go off memory.
> > >
> > > > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > > > configures these at a VM level, there isn't a vcpu.
> > >
> > > Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
> > > of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
> > > advance of the TDX INIT call? If so, the least awful solution might be to use
> > > vCPU0's CPUID.
> >
> > The current order is, KVM vm creation (KVM_CREATE_VM),
> > KVM vcpu creation(KVM_CREATE_VCPU), TDX VM initialization (KVM_TDX_INIT_VM).
> > and TDX VCPU initialization(KVM_TDX_INIT_VCPU).
> > We can call KVM_SET_CPUID2 before KVM_TDX_INIT_VM. We can remove cpuid part
> > from struct kvm_tdx_init_vm by vcpu0 cpuid.
>
> What's the reason to call KVM_TDX_INIT_VM after KVM_CREATE_VCPU?

KVM_TDX_INIT_VM (which requires CPUIDs) doesn't require any ordering with
respect to KVM_CREATE_VCPU; we can call KVM_TDX_INIT_VM before or after
KVM_CREATE_VCPU because there is no constraint between the two.

The v5 TDX QEMU happens to call KVM_CREATE_VCPU and then KVM_TDX_INIT_VM
because it creates the CPUIDs for KVM_TDX_INIT_VM from QEMU vCPU structures
after KVM_GET_CPUID2, which is after KVM_CREATE_VCPU.


> I guess I have been away from this for too long, but I had believed
> KVM_TDX_INIT_VM was called before creating any vCPU, which turns out to be wrong.
>
> I am not against making KVM_TDX_INIT_VM have to be called after creating (at
> least one?) vCPU if there's a good reason, but it seems that if the purpose is
> just to pass CPUID(0x8000_0008).EAX[23:16] to KVM so KVM can determine the GPAW
> for a TDX guest, then we can also make KVM_TDX_INIT_VM pass that.
>
> KVM would just need to manually handle CPUID(0x8000_0008) in KVM_TDX_INIT_VM, but
> that's something KVM needs to do anyway even if we use vCPU0's CPUID.
>
> Am I missing anything?

The userspace VMM needs to create a CPUID list somehow for KVM_TDX_INIT_VM or
KVM_SET_CPUID2, whichever comes first. It's effortless to reuse that CPUID list
for KVM_SET_CPUID2.
--
Isaku Yamahata <[email protected]>

2024-04-11 01:14:04

by Edgecombe, Rick P

Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Tue, 2024-04-09 at 09:26 -0700, Sean Christopherson wrote:
> > Haha, if this is the confusion, I see why you reacted that way to "JSON".
> > That would be quite the curious choice for a TDX module API.
> >
> > So it is easy to convert it to a C struct and embed it in KVM. It's just not
> > that useful because it will not necessarily be valid for future TDX modules.
>
> No, I don't want to embed anything in KVM, that's the exact same as hardcoding
> crud into KVM, which is what I want to avoid.  I want to be able to roll out a
> new TDX module without any kernel changes, and I want userspace to be able to
> assert
> that, for a given TDX module, the effective guest CPUID configuration aligns
> with
> userspace's desired vCPU model, i.e. that the values of fixed bits match up
> with the guest CPUID that userspace wants to define.
>
> Maybe that just means converting the JSON file into some binary format that
> the
> kernel can already parse.  But I want Intel to commit to providing that
> metadata
> along with every TDX module.

Oof. It turns out in one of the JSON files there is a description of a different
interface (TDX module runtime interface) that provides a way to read CPUID data
that is configured in a TD, including fixed bits. It works like:
1. VMM queries which CPUID bits are directly configurable.
2. VMM provides directly configurable CPUID bits, along with XFAM and
ATTRIBUTES, via TDH.MNG.INIT. (KVM_TDX_INIT_VM)
3. Then VMM can use this other interface via TDH.MNG.RD, to query the resulting
values of specific CPUID leafs.

This does not provide a way to query the fixed bits specifically; it tells you
what ended up getting configured in a specific TD, which includes the fixed
bits and anything else. So we need to do KVM_TDX_INIT_VM before KVM_SET_CPUID in
order to have something to check against. But there was discussion of
KVM_SET_CPUID on CPU0 having the CPUID state to pass to KVM_TDX_INIT_VM. So that
would need to be sorted.
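
To make the check concrete, a rough sketch (assuming KVM grows some wrapper
around TDH.MNG.RD; td_read_effective_cpuid() is a hypothetical name, not an
existing API):

#include <stdbool.h>
#include <stdint.h>

struct cpuid_leaf {
        uint32_t fn, idx;                       /* leaf and subleaf */
        uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical: query what the TDX module actually configured (step 3). */
struct cpuid_leaf td_read_effective_cpuid(int vm_fd, uint32_t fn, uint32_t idx);

/* Compare the post-TDH.MNG.INIT values against the userspace view; fixed
 * bits are folded into the module's answer, so a straight compare catches
 * bad configurable and bad fixed values alike. */
static bool td_cpuid_consistent(int vm_fd, const struct cpuid_leaf *user,
                                int nent)
{
        for (int i = 0; i < nent; i++) {
                struct cpuid_leaf hw =
                        td_read_effective_cpuid(vm_fd, user[i].fn, user[i].idx);

                if (hw.eax != user[i].eax || hw.ebx != user[i].ebx ||
                    hw.ecx != user[i].ecx || hw.edx != user[i].edx)
                        return false;
        }
        return true;
}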

If we pass the directly configurable values with KVM_TDX_INIT_VM, like we do
today, then the data provided by this interface should allow us to check
consistency between KVM_SET_CPUID and the actual configured TD CPUID behavior.
But we still need to think through what guarantees we need from the TDX module
to prevent TDX module changes from breaking userspace. For example if something
goes from fixed 1 to fixed 0, and the KVM_SET_CPUID call starts getting rejected
where it didn't before. Some of the TDX docs have a statement on how to help
this situation:

"
A properly written VMM should be able to handle the fact that more CPUID bits
become configurable. The host VMM should always consult the list of directly
configurable CPUID leaves and sub-leaves, as enumerated by TDH.SYS.RD/RDALL
or TDH.SYS.INFO. If a CPUID bit is enumerated as configurable, and the VMM was
not designed to configure that bit, the VMM should set the configuration for
that bit to 1. If the host VMM neglects to configure CPUID bits that are
configurable, their virtual value (as seen by guest TDs) will be 0.
"

How this could translate to something sensible for the KVM API we are discussing
is not immediately obvious to me, but I need to think more on it.

We also need to verify this method of reading CPUID data a bit more. So this
is just sort of a status update to share the direction we are heading.

2024-04-09 01:37:45

by Sean Christopherson

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Mon, Apr 08, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-08 at 15:36 -0700, Sean Christopherson wrote:
> > > Currently the values for the directly settable CPUID leafs come via a TDX
> > > specific init VM userspace API.
> >
> > Is guest.MAXPHYADDR one of those?  If so, use that.
>
> No it is not configurable. I'm looking into making it configurable, but it is
> not likely to happen before we were hoping to get basic support upstream.

Yeah, love me some hardware defined software.

> An alternative would be to have the KVM API peek at the value, and then
> discard it (not pass the leaf value to the TDX module). Not ideal.

Heh, I typed up this idea before reading ahead. This has my vote. Unless I'm
misreading where things are headed, using guest.MAXPHYADDR to communicate what
is essentially GPAW to the guest is about to become the de facto standard.

At that point, KVM can basically treat the current TDX module behavior as an
erratum, i.e. discarding guest.MAXPHYADDR becomes a workaround for a "CPU" bug,
not some goofy KVM quirk.
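
A minimal sketch of that "peek and discard" workaround, purely illustrative
(struct cpuid_entry stands in for struct kvm_cpuid_entry2):

#include <stdint.h>

struct cpuid_entry {
        uint32_t function, index;
        uint32_t eax, ebx, ecx, edx;
};

/*
 * Peek at guest.MAXPHYADDR (CPUID(0x8000_0008).EAX[23:16]) to pick GPAW
 * (0 = 4-level/48-bit GPA, 1 = 5-level/52-bit GPA). The caller then skips
 * this leaf when building the TDH.MNG.INIT input, i.e. the value is
 * discarded rather than handed to the TDX module. Returns -1 if absent.
 */
static int peek_gpaw(const struct cpuid_entry *e, int nent)
{
        for (int i = 0; i < nent; i++) {
                if (e[i].function == 0x80000008 && e[i].index == 0)
                        return (((e[i].eax >> 16) & 0xff) > 48) ? 1 : 0;
        }
        return -1;
}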

> Or have a dedicated GPAW field and expose the concept to userspace like
> Xiaoyao was talking about.

I'd prefer not to. As above, it's not KVM's fault that the TDX module can't move
fast enough to adapt.

> > > So should we look at making the TDX side follow a
> > > KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID pattern for feature enablement? Or am
> > > I
> > > misreading general guidance out of this specific suggestion around GPAW?
> >
> > No?  Where I was going with that, is _if_ vCPUs can be created (in KVM) before
> > the GPAW is set (in the TDX module), then using vCPU0's guest.MAXPHYADDR to
> > compute the desired GPAW may be the least awful solution, all things
> > considered.
>
> Sorry, I was trying to uplevel the conversation to be about the general concept
> of matching TD configuration to CPUID bits. Let me try to articulate the problem
> a little better.
>
> Today, KVM’s KVM_GET_SUPPORTED_CPUID is a way to specify which features are
> virtualizable by KVM. Communicating this via CPUID leaf values works for the
> most part, because CPUID is already designed to communicate which features are
> supported. But TDX has a different language to communicate which features are
> supported. That is special fields that are passed when creating a VM: XFAM
> (matching XCR0 features) and ATTRIBUTES (TDX specific flags for MSR based
> features like PKS, etc). So compared to KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID,
> the TDX module instead accepts only a few CPUID bits to be set directly by the
> VMM, and sets other CPUID leafs to match the configured features via XFAM and
> ATTRIBUTES.
>
> There are also some bits/features that have fixed values. Which leafs are fixed
> and what the values are isn't something provided by any current TDX module API.
> Instead they are only known via documentation, which is subject to change. The
> queryable information is limited to communicating which bits are directly
> configurable.

As I said in PUCK (and recorded in the notes), the fixed values should be provided
in a data format that is easily consumed by C code, so that KVM can report them
to userspace.

> So the current interface won't allow us to perfectly match the
> KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID. Even excluding the vm-scoped vs vcpu-
> scoped differences. However, we could try to match the general design a
> little better.

No, don't try to match KVM_GET_SUPPORTED_CPUID, it's a terrible API that no one
likes. The only reason we haven't replaced it is that no one has come up with a
universally better idea. For feature flags, communicating what KVM supports is
straightforward, mostly. But for things like topology, communicating exactly what
KVM "supports" is much more difficult.

The TDX fixed bits are very different. It's the TDX module, and thus KVM, saying
"here are the bits that you _must_ set to these exact values".

> Here we were discussing making gpaw configurable via a dedicated named field,
> but the suggestion is to instead include it in CPUID bits. The current API takes
> ATTRIBUTES as a dedicated field too. But there actually are CPUID bits for some
> of those features. Those CPUID bits are controlled instead via the associated
> ATTRIBUTES. So we could expose such features via CPUID as well. Userspace would,
> for example, pass the PKS CPUID bit in, and KVM would see it and configure PKS
> via the ATTRIBUTES bit.
>
> So what I was looking to understand is, what is the enthusiasm for generally
> continuing to use CPUID as the main method for specifying which features should
> be enabled/virtualized, if we can't match the existing
> KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID APIs. Is the hope just to make userspace's
> code more unified between TDX and normal VMs?

I need to look at the TDX code more to form an (updated) opinion. IIRC, my opinion
from four years ago was to use ATTRIBUTES and then force CPUID to match. Whether
or not that's still my preferred approach probably depends on how many, and what,
things are shoved into attributes.
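
As a rough sketch of "use ATTRIBUTES and force CPUID to match" (the CPUID bit
below is the architectural PKS bit, CPUID.(7,0).ECX[31]; TDX_ATTR_PKS is a
stand-in name and bit position, not necessarily the real ATTRIBUTES layout):

#include <stdint.h>

#define TDX_ATTR_PKS            (1ull << 30)    /* illustrative position */
#define CPUID_7_0_ECX_PKS       (1u << 31)      /* CPUID.(7,0).ECX[31] */

static void force_cpuid_from_attributes(uint64_t attributes,
                                        uint32_t *cpuid_7_0_ecx)
{
        if (attributes & TDX_ATTR_PKS)
                *cpuid_7_0_ecx |= CPUID_7_0_ECX_PKS;    /* feature enabled */
        else
                *cpuid_7_0_ecx &= ~CPUID_7_0_ECX_PKS;   /* feature disabled */
}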

2024-04-11 03:47:04

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Wed, Apr 10, 2024 at 06:03:52PM -0700,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Apr 10, 2024 at 02:03:26PM +0000,
> "Huang, Kai" <[email protected]> wrote:
>
> > On Tue, 2024-04-09 at 18:12 -0700, Isaku Yamahata wrote:
> > > On Mon, Apr 08, 2024 at 06:51:40PM +0000,
> > > Sean Christopherson <[email protected]> wrote:
> > >
> > > > On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> > > > > On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > > > > > Another option is that, KVM doesn't allow userspace to configure
> > > > > > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > > > > > kvm_tdx_init_vm for userspace to configure directly.
> > > > > > >
> > > > > > > What do you prefer?
> > > > > >
> > > > > > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > > > > > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > > > > > host.MAXPHYADDR.
> > > > >
> > > > > So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> > > > > could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.
> > > >
> > > > Off topic, any chance I can bribe/convince you to wrap your email replies closer
> > > > to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
> > > > chars, but my brain is so well trained for 80 that it actually slows me down a
> > > > bit when reading mails that are wrapped at 100 chars.
> > > >
> > > > > Or are you suggesting that KVM should look at the value of CPUID(0X8000_0008).eax[23:16] passed from
> > > > > userspace?
> > > >
> > > > This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> > > > trying to go off memory.
> > > >
> > > > > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > > > > configures these at a VM level, there isn't a vcpu.
> > > >
> > > > Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
> > > > of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
> > > > advance of the TDX INIT call? If so, the least awful solution might be to use
> > > > vCPU0's CPUID.
> > >
> > > The current order is: KVM VM creation (KVM_CREATE_VM), KVM vCPU creation
> > > (KVM_CREATE_VCPU), TDX VM initialization (KVM_TDX_INIT_VM), and TDX vCPU
> > > initialization (KVM_TDX_INIT_VCPU).
> > > We can call KVM_SET_CPUID2 before KVM_TDX_INIT_VM. We can remove the cpuid
> > > part from struct kvm_tdx_init_vm by using vCPU0's CPUID.
> >
> > What's the reason to call KVM_TDX_INIT_VM after KVM_CREATE_VCPU?
>
> The KVM_TDX_INIT_VM (it requires CPUIDs) doesn't require any particular
> ordering between the two, KVM_TDX_INIT_VM and KVM_CREATE_VCPU. We can call
> KVM_TDX_INIT_VM before or after KVM_CREATE_VCPU because there is no ordering
> constraint between them.
>
> The v5 TDX QEMU happens to call KVM_CREATE_VCPU and then KVM_TDX_INIT_VM
> because it creates the CPUIDs for KVM_TDX_INIT_VM from QEMU vCPU structures
> after KVM_GET_CPUID2, which is after KVM_CREATE_VCPU.

Sorry, let me correct it. QEMU creates QEMU's vCPU struct with its CPUIDs,
then calls KVM_TDX_INIT_VM, KVM_CREATE_VCPU, and KVM_SET_CPUID2, in that order.
QEMU passes the CPUIDs as-is to KVM_SET_CPUID2.

The v19 KVM_TDX_INIT_VM checks that no KVM vCPU has been created yet. But that
can be relaxed.
--
Isaku Yamahata <[email protected]>

2024-04-11 13:46:36

by Huang, Kai

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Wed, 2024-04-10 at 20:46 -0700, Isaku Yamahata wrote:
> On Wed, Apr 10, 2024 at 06:03:52PM -0700,
> Isaku Yamahata <[email protected]> wrote:
>
> > On Wed, Apr 10, 2024 at 02:03:26PM +0000,
> > "Huang, Kai" <[email protected]> wrote:
> >
> > > On Tue, 2024-04-09 at 18:12 -0700, Isaku Yamahata wrote:
> > > > On Mon, Apr 08, 2024 at 06:51:40PM +0000,
> > > > Sean Christopherson <[email protected]> wrote:
> > > >
> > > > > On Mon, Apr 08, 2024, Edgecombe, Rick P wrote:
> > > > > > On Mon, 2024-04-08 at 09:20 -0700, Sean Christopherson wrote:
> > > > > > > > Another option is that, KVM doesn't allow userspace to configure
> > > > > > > > CPUID(0x8000_0008).EAX[7:0]. Instead, it provides a gpaw field in struct
> > > > > > > > kvm_tdx_init_vm for userspace to configure directly.
> > > > > > > >
> > > > > > > > What do you prefer?
> > > > > > >
> > > > > > > Hmm, neither.  I think the best approach is to build on Gerd's series to have KVM
> > > > > > > select 4-level vs. 5-level based on the enumerated guest.MAXPHYADDR, not on
> > > > > > > host.MAXPHYADDR.
> > > > > >
> > > > > > So then GPAW would be coded to basically best fit the supported guest.MAXPHYADDR within KVM. QEMU
> > > > > > could look at the supported guest.MAXPHYADDR and use matching logic to determine GPAW.
> > > > >
> > > > > Off topic, any chance I can bribe/convince you to wrap your email replies closer
> > > > > to 80 chars, not 100? Yeah, checkpatch no longer complains when code exceeds 80
> > > > > chars, but my brain is so well trained for 80 that it actually slows me down a
> > > > > bit when reading mails that are wrapped at 100 chars.
> > > > >
> > > > > > Or are you suggesting that KVM should look at the value of CPUID(0X8000_0008).eax[23:16] passed from
> > > > > > userspace?
> > > > >
> > > > > This. Note, my pseudo-patch incorrectly looked at bits 15:8, that was just me
> > > > > trying to go off memory.
> > > > >
> > > > > > I'm not following the code examples involving struct kvm_vcpu. Since TDX
> > > > > > configures these at a VM level, there isn't a vcpu.
> > > > >
> > > > > Ah, I take it GPAW is a VM-scope knob? I forget where we ended up with the ordering
> > > > > of TDX commands vs. creating vCPUs. Does KVM allow creating vCPU structures in
> > > > > advance of the TDX INIT call? If so, the least awful solution might be to use
> > > > > vCPU0's CPUID.
> > > >
> > > > The current order is: KVM VM creation (KVM_CREATE_VM), KVM vCPU creation
> > > > (KVM_CREATE_VCPU), TDX VM initialization (KVM_TDX_INIT_VM), and TDX vCPU
> > > > initialization (KVM_TDX_INIT_VCPU).
> > > > We can call KVM_SET_CPUID2 before KVM_TDX_INIT_VM. We can remove the cpuid
> > > > part from struct kvm_tdx_init_vm by using vCPU0's CPUID.
> > >
> > > What's the reason to call KVM_TDX_INIT_VM after KVM_CREATE_VCPU?
> >
> > The KVM_TDX_INIT_VM (it requires CPUIDs) doesn't require any particular
> > ordering between the two, KVM_TDX_INIT_VM and KVM_CREATE_VCPU. We can call
> > KVM_TDX_INIT_VM before or after KVM_CREATE_VCPU because there is no ordering
> > constraint between them.
> >
> > The v5 TDX QEMU happens to call KVM_CREATE_VCPU and then KVM_TDX_INIT_VM
> > because it creates the CPUIDs for KVM_TDX_INIT_VM from QEMU vCPU structures
> > after KVM_GET_CPUID2, which is after KVM_CREATE_VCPU.
>
> Sorry, let me correct it. QEMU creates QEMU's vCPU struct with its CPUIDs,
> then calls KVM_TDX_INIT_VM, KVM_CREATE_VCPU, and KVM_SET_CPUID2, in that order.
> QEMU passes the CPUIDs as-is to KVM_SET_CPUID2.
>
> The v19 KVM_TDX_INIT_VM checks that no KVM vCPU has been created yet. But that
> can be relaxed.

OK. So in the current implementation KVM_TDX_INIT_VM must be done before
KVM_CREATE_VCPU.

Which seems more reasonable conceptually, IMHO.

Doing KVM_TDX_INIT_VM after KVM_CREATE_VCPU (well, KVM_SET_CPUID2 actually)
looks more like a workaround, and it actually isn't sufficient if we want
KVM_TDX_INIT_VM to use vCPU0's CPUID.

For that we need to explicitly say:

KVM_TDX_INIT_VM must be called after KVM_SET_CPUID2 is called for vCPU0.

Which is kinda odd, because AFAICT userspace should legally be able to do
KVM_SET_CPUID2 in any order across vCPUs.

In other words, making KVM_TDX_INIT_VM use vCPU0's CPUID looks more like a
workaround, and it is a little bit odd.

When something looks odd, it's error-prone.

A less odd way is to require that KVM_TDX_INIT_VM be called after KVM_SET_CPUID2
has been called for _ALL_ vCPUs. That _looks_ better, but in practice I am not
sure KVM is able to do such a check.

Also, because CPUID is just one part of KVM_TDX_INIT_VM, we also need to ask
questions like whether it should be done after other vCPU IOCTL()s (e.g.,
KVM_SET_MSR etc.), because there is more than one IOCTL() to initialize a vCPU
from userspace.

Again, although all of the above issues are probably theoretical, to me it
looks better to just require that KVM_TDX_INIT_VM be called before creating
any vCPU.

For CPUID(0x8000_0008), we can request that KVM_TDX_INIT_VM pass it, and KVM can
get GPAW/guest.MAXPHYADDR from it. KVM will need to convert this CPUID entry to
what TDH.MNG.INIT can accept, but that is what KVM needs to do anyway even if it
uses vCPU0's CPUID.

Sean, any feedback?




2024-04-11 14:23:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Thu, Apr 11, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-04-09 at 09:26 -0700, Sean Christopherson wrote:
> > > Haha, if this is the confusion, I see why you reacted that way to "JSON".
> > > That would be quite the curious choice for a TDX module API.
> > >
> > > So it is easy to convert it to a C struct and embed it in KVM. It's just not
> > > that useful because it will not necessarily be valid for future TDX modules.
> >
> > No, I don't want to embed anything in KVM, that's the exact same as hardcoding
> > crud into KVM, which is what I want to avoid.  I want to be able to roll out a
> > new TDX module without any kernel changes, and I want userspace to be able to
> > assert that, for a given TDX module, the effective guest CPUID configuration
> > aligns with userspace's desired vCPU model, i.e. that the values of fixed bits
> > match up with the guest CPUID that userspace wants to define.
> >
> > Maybe that just means converting the JSON file into some binary format that the
> > kernel can already parse.  But I want Intel to commit to providing that metadata
> > along with every TDX module.
>
> Oof. It turns out in one of the JSON files there is a description of a different
> interface (TDX module runtime interface) that provides a way to read CPUID data
> that is configured in a TD, including fixed bits. It works like:
> 1. VMM queries which CPUID bits are directly configurable.
> 2. VMM provides directly configurable CPUID bits, along with XFAM and
> ATTRIBUTES, via TDH.MNG.INIT. (KVM_TDX_INIT_VM)
> 3. Then VMM can use this other interface via TDH.MNG.RD, to query the resulting
> values of specific CPUID leafs.
>
> This does not provide a way to query the fixed bits specifically; it tells you
> what ended up getting configured in a specific TD, which includes the fixed
> bits and anything else. So we need to do KVM_TDX_INIT_VM before KVM_SET_CPUID in
> order to have something to check against. But there was discussion of
> KVM_SET_CPUID on CPU0 having the CPUID state to pass to KVM_TDX_INIT_VM. So that
> would need to be sorted.
>
> If we pass the directly configurable values with KVM_TDX_INIT_VM, like we do
> today, then the data provided by this interface should allow us to check
> consistency between KVM_SET_CPUID and the actual configured TD CPUID behavior.

I think it would be a good (optional?) sanity check, e.g. KVM_BUG_ON() if the
post-KVM_TDX_INIT_VM CPUID set doesn't match KVM's internal data. But that alone
provides a terrible experience for userspace.

- The VMM would still need to hardcode knowledge of fixed bits, without a way
to do a sanity check of its own.

- Lack of a sanity check means the VMM can't fail VM creation early.

- KVM_SET_CPUID2 doesn't have a way to inform userspace _which_ CPUID bits are
"bad".

- Neither userspace nor KVM can programmatically detect when bits are fixed vs.
flexible. E.g. it's not impossible that userspace would want to do X if a
feature is fixed, but Y if it's flexible.

2024-04-11 15:30:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Thu, Apr 11, 2024, Xiaoyao Li wrote:
> flexible (configurable) bits are known to the VMM (KVM and userspace) because
> the TDX module has an interface to report them. So we can treat a bit as fixed
> if it is not reported in the flexible group. (Of course the dynamic bits are
> special and excluded.)

Does that interface report the fixed _values_?

2024-04-11 15:53:47

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/11/2024 10:22 PM, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Rick P Edgecombe wrote:
>> On Tue, 2024-04-09 at 09:26 -0700, Sean Christopherson wrote:
>>>> Haha, if this is the confusion, I see why you reacted that way to "JSON".
>>>> That would be quite the curious choice for a TDX module API.
>>>>
>>>> So it is easy to convert it to a C struct and embed it in KVM. It's just not
>>>> that useful because it will not necessarily be valid for future TDX modules.
>>>
>>> No, I don't want to embed anything in KVM, that's the exact same as hardcoding
>>> crud into KVM, which is what I want to avoid.  I want to be able to roll out a
>>> new TDX module without any kernel changes, and I want userspace to be able to
>>> assert that, for a given TDX module, the effective guest CPUID configuration
>>> aligns with userspace's desired vCPU model, i.e. that the values of fixed bits
>>> match up with the guest CPUID that userspace wants to define.
>>>
>>> Maybe that just means converting the JSON file into some binary format that the
>>> kernel can already parse.  But I want Intel to commit to providing that metadata
>>> along with every TDX module.
>>
>> Oof. It turns out in one of the JSON files there is a description of a different
>> interface (TDX module runtime interface) that provides a way to read CPUID data
>> that is configured in a TD, including fixed bits. It works like:
>> 1. VMM queries which CPUID bits are directly configurable.
>> 2. VMM provides directly configurable CPUID bits, along with XFAM and
>> ATTRIBUTES, via TDH.MNG.INIT. (KVM_TDX_INIT_VM)
>> 3. Then VMM can use this other interface via TDH.MNG.RD, to query the resulting
>> values of specific CPUID leafs.
>>
>> This does not provide a way to query the fixed bits specifically; it tells you
>> what ended up getting configured in a specific TD, which includes the fixed
>> bits and anything else. So we need to do KVM_TDX_INIT_VM before KVM_SET_CPUID in
>> order to have something to check against. But there was discussion of
>> KVM_SET_CPUID on CPU0 having the CPUID state to pass to KVM_TDX_INIT_VM. So that
>> would need to be sorted.
>>
>> If we pass the directly configurable values with KVM_TDX_INIT_VM, like we do
>> today, then the data provided by this interface should allow us to check
>> consistency between KVM_SET_CPUID and the actual configured TD CPUID behavior.
>
> I think it would be a good (optional?) sanity check, e.g. KVM_BUG_ON() if the
> post-KVM_TDX_INIT_VM CPUID set doesn't match KVM's internal data. But that alone
> provides a terrible experience for userspace.
>
> - The VMM would still need to hardcode knowledge of fixed bits, without a way
> to do a sanity check of its own.

Maybe we can do it this way to avoid hardcoding:

1. KVM can get the configurable CPUID bits from the TDX module with
TDH.SYS.RD (the same info as the old TD_SYSINFO.CPUID_CONFIG[]), and
report them to userspace;

2. userspace configures the configurable CPUID bits and passes them to KVM
to init the TD.

3. After the TD is initialized via TDH.MNG.INIT, KVM can get a full CPUID
list of the TD via TDH.MNG.RD. KVM provides an interface to report the full
CPUID list to userspace.

4. Userspace can sanity check the full CPUID list.
- the configurable bits reported in #1 should be what userspace
configured;
- the dynamic bits and other special bits will be checked case by case;
- the remaining bits should be fixed. If a value is not what the user
wants, userspace prints an error and stops.
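
A rough userspace-side sketch of step 4, assuming the interfaces from steps
1-3 exist (td_read_effective_cpuid() is a hypothetical wrapper for the
post-TDH.MNG.INIT query; dynamic/special bits are assumed to have been
excluded beforehand):

#include <stdbool.h>
#include <stdint.h>

struct leaf {
        uint32_t fn, idx;       /* leaf / subleaf */
        uint32_t regs[4];       /* eax, ebx, ecx, edx */
};

struct leaf td_read_effective_cpuid(int vm_fd, uint32_t fn, uint32_t idx);

/* 'want' is what userspace configured/expects; 'cfg_mask' marks the
 * configurable bits reported in step 1. Everything outside the mask is
 * treated as fixed. */
static bool td_leaf_ok(int vm_fd, const struct leaf *want,
                       const uint32_t cfg_mask[4])
{
        struct leaf got = td_read_effective_cpuid(vm_fd, want->fn, want->idx);

        for (int r = 0; r < 4; r++) {
                /* configurable bits must be what was configured ... */
                if ((got.regs[r] ^ want->regs[r]) & cfg_mask[r])
                        return false;
                /* ... and fixed bits must match what the user wants. */
                if ((got.regs[r] ^ want->regs[r]) & ~cfg_mask[r])
                        return false;
        }
        return true;
}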

Does it sound reasonable?

> - Lack of a sanity check means the VMM can't fail VM creation early.
>
> - KVM_SET_CPUID2 doesn't have a way to inform userspace _which_ CPUID bits are
> "bad".
>
> - Neither userspace nor KVM can programmatically detect when bits are fixed vs.
> flexible. E.g. it's not impossible that userspace would want to do X if a
> feature is fixed, but Y if it's flexible.

flexible (configurable) bits are known to the VMM (KVM and userspace) because
the TDX module has an interface to report them. So we can treat a bit as fixed
if it is not reported in the flexible group. (Of course the dynamic bits are
special and excluded.)

2024-04-11 17:05:34

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/11/2024 11:26 PM, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Xiaoyao Li wrote:
>> flexible (configurable) bits are known to the VMM (KVM and userspace) because
>> the TDX module has an interface to report them. So we can treat a bit as fixed
>> if it is not reported in the flexible group. (Of course the dynamic bits are
>> special and excluded.)
>
> Does that interface report the fixed _values_?

No.

But as I said, we can get what the fixed _values_ are after the TD is
initialized by TDH.MNG.INIT, via another interface.

Yes, it is a bit late. But at least we have an interface to get the fixed
values at runtime instead of hardcoding them.

Meanwhile, we are working internally with the TDX architecture team to
request a new interface that reports the fixed bits and values, like the
configurable bits, without requiring the TD to be initialized. But there is
no guarantee of it, and I am not sure when it will be public.

2024-04-11 19:01:54

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Thu, 2024-04-11 at 08:26 -0700, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Xiaoyao Li wrote:
> > flexible (configurable) bits are known to the VMM (KVM and userspace) because
> > the TDX module has an interface to report them. So we can treat a bit as fixed
> > if it is not reported in the flexible group. (Of course the dynamic bits are
> > special and excluded.)
>
> Does that interface report the fixed _values_?

So backing up a bit to the problem statement. I think there are two related
requirements. (please correct)

The first is a simple functional one. KVM is involved in virtualizing TD
behavior. It enables various logic based on its knowledge of a vCPU’s CPUID
state. Some of this logic is necessary for TDX. It can end up engaging logic for
at least:
X86_FEATURE_X2APIC
X86_FEATURE_XSAVE
X86_FEATURE_SMEP
X86_FEATURE_SMAP
X86_FEATURE_FSGSBASE
X86_FEATURE_PKU
X86_FEATURE_LA57
TDX pieces need some of these bits to get set in the vCPU, or KVM’s part of
the necessary virtualization won't function.
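
Schematically, the kind of CPUID-gated logic meant here (not a verbatim quote
of any KVM function; guest_cpuid_has() is KVM's existing helper in
arch/x86/kvm/cpuid.h):

        /* Paths like these silently stay disabled if the bits never get
         * set in the TDX vCPU's CPUID: */
        if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
                return 1;       /* e.g. refuse XSETBV emulation */

        if (guest_cpuid_has(vcpu, X86_FEATURE_LA57))
                allowed_cr4 |= X86_CR4_LA57;    /* allow 5-level paging */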

The second issue is that userspace can’t know what CPUID values are configured
in the TD. In the existing API for normal guests, it knows because it tells the
guest what CPUID values to have. But that model is complicated to fit into the
TDX module's API, where you tell it some things and it gives you the resulting
leaves. How to handle KVM_SET_CPUID kind of follows from this
issue.

One option is to demand the TDX module change to be able to efficiently wedge
into KVM’s existing “tell” model. This looks like the metadata API to query the
fixed bits. Then userspace can know what bits it has to set, and call
KVM_SET_CPUID with them. I think it is still kind of awkward. "Tell me what you
want to hear?", "Ok here it is".

Another option would be to add TDX specific KVM APIs that work for the TDX
module's “ask” model, and meet the enumerated two goals. It could look something
like:
1. KVM_TDX_GET_CONFIG_CPUID provides a list of directly configurable bits by
KVM. This is based on static data on what KVM supports, with a sanity check of
TD_SYSINFO.CPUID_CONFIG[]. Bits that KVM doesn’t know about, but are returned as
configurable by TD_SYSINFO.CPUID_CONFIG[], are not exposed as configurable. (they
will be set to 1 by KVM, per the recommendation)
2. KVM_TDX_INIT_VM is passed userspace's choice of configurable bits, along with
XFAM and ATTRIBUTES as dedicated fields. They go into TDH.MNG.INIT.
3. KVM_TDX_INIT_VCPU_CPUID takes a list of CPUID leafs. It pulls the CPUID bits
actually configured in the TD for these leafs. They go into the struct kvm_vcpu,
and are also passed up to userspace so everyone knows what actually got
configured.

KVM_SET_CPUID is not used for TDX.
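
Purely hypothetical shapes for those ioctls, to make the three steps concrete
(none of these structs or names exist; struct kvm_cpuid_entry2 is reused from
linux/kvm.h only for familiarity):

#include <stdint.h>
#include <linux/kvm.h>          /* struct kvm_cpuid_entry2 */

/* 1. KVM_TDX_GET_CONFIG_CPUID (out): entries hold masks of the bits that
 *    userspace may set directly. */
struct kvm_tdx_config_cpuid {
        uint32_t nent;
        uint32_t padding;
        struct kvm_cpuid_entry2 entries[];
};

/* 2. KVM_TDX_INIT_VM (in): userspace's choices; feature knobs travel as
 *    dedicated fields, not CPUID bits. */
struct kvm_tdx_init_vm_sketch {
        uint64_t attributes;
        uint64_t xfam;
        uint32_t cpuid_nent;
        uint32_t padding;
        struct kvm_cpuid_entry2 cpuid_entries[];
};

/* 3. KVM_TDX_INIT_VCPU_CPUID (in/out): in = leaf/subleaf indices of
 *    interest, out = the values the TDX module actually configured. */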

Then we get TDX module folks to commit to never breaking KVM/userspace that
follows this logic. One thing still missing is how to handle unknown future
leafs with fixed bits. If a future leaf is defined and gets fixed 1, QEMU
wouldn't know to query it. We might need to ask for some TDX module guarantees
around that. It might already be the expectation though, based on the
description of how to handle unknown configurable bits.

Xiaoyao, would it work for userspace?

2024-04-12 08:52:06

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On 4/12/2024 2:52 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-04-11 at 08:26 -0700, Sean Christopherson wrote:
>> On Thu, Apr 11, 2024, Xiaoyao Li wrote:
>>> flexible (configurable) bits is known to VMM (KVM and userspace) because TDX
>>> module has interface to report them. So we can treat a bit as fixed if it is
>>> not reported in the flexible group. (of course the dynamic bits are special
>>> and excluded.)
>>
>> Does that interface reported the fixed _values_?
>
> So backing up a bit to the problem statement. I think there are two related
> requirements. (please correct)
>
> The first is a simple functional one. KVM is involved in virtualizing TD
> behavior. It enables various logic based on its knowledge of a vCPU’s CPUID
> state. Some of this logic is necessary for TDX. It can end up engaging logic for
> at least:
> X86_FEATURE_X2APIC
> X86_FEATURE_XSAVE
> X86_FEATURE_SMEP
> X86_FEATURE_SMAP
> X86_FEATURE_FSGSBASE
> X86_FEATURE_PKU
> X86_FEATURE_LA57
> TDX pieces need some of these bits to get set in the vCPU, or KVM’s part of
> the necessary virtualization won't function.
>
> The second issue is that userspace can’t know what CPUID values are configured
> in the TD. In the existing API for normal guests, it knows because it tells the
> guest what CPUID values to have. But that model is complicated to fit into
> the TDX module's API, where you tell it some things and it gives you the
> resulting leaves. How to handle KVM_SET_CPUID kind of follows from this
> issue.
>
> One option is to demand the TDX module change to be able to efficiently wedge
> into KVM’s existing “tell” model. This looks like the metadata API to query the
> fixed bits. Then userspace can know what bits it has to set, and call
> KVM_SET_CPUID with them. I think it is still kind of awkward. "Tell me what you
> want to hear?", "Ok here it is".
>
> Another option would be to add TDX specific KVM APIs that work for the TDX
> module's “ask” model, and meet the enumerated two goals. It could look something
> like:
> 1. KVM_TDX_GET_CONFIG_CPUID provides a list of directly configurable bits by
> KVM. This is based on static data on what KVM supports, with a sanity check of
> TD_SYSINFO.CPUID_CONFIG[]. Bits that KVM doesn’t know about, but are returned
> as configurable by TD_SYSINFO.CPUID_CONFIG[], are not exposed as configurable.
> (they will be set to 1 by KVM, per the recommendation)

This is not how KVM works. KVM will never enable unknown features
blindly. If the feature is unknown to KVM, it cannot be enabled for the
guest. That's why every new feature needs an enabling patch in KVM, even
in the simplest case, which needs one patch to enumerate the CPUID of a
new instruction in KVM_GET_SUPPORTED_CPUID.
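
For illustration, the existing kvm_cpu_cap_mask()/F() pattern in
arch/x86/kvm/cpuid.c is where such a one-line enabling change lands;
F(NEW_INSN) is a placeholder for a hypothetical new feature bit:

        kvm_cpu_cap_mask(CPUID_7_0_EBX,
                F(FSGSBASE) | F(SMEP) | F(SMAP) |
                F(NEW_INSN)     /* the one-line "enabling patch" */
        );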

> 2. KVM_TDX_INIT_VM is passed userspace's choice of configurable bits, along with
> XFAM and ATTRIBUTES as dedicated fields. They go into TDH.MNG.INIT.
> 3. KVM_TDX_INIT_VCPU_CPUID takes a list of CPUID leafs. It pulls the CPUID bits
> actually configured in the TD for these leafs. They go into the struct kvm_vcpu,
> and are also passed up to userspace so everyone knows what actually got
> configured.
>
> KVM_SET_CPUID is not used for TDX.
>
> Then we get TDX module folks to commit to never breaking KVM/userspace that
> follows this logic. One thing still missing is how to handle unknown future
> leafs with fixed bits. If a future leaf is defined and gets fixed 1, QEMU
> wouldn't know to query it.

We can make KVM_TDX_INIT_VCPU_CPUID provide a large enough set of CPUID
leafs, and KVM reports every leaf to userspace, instead of userspace caring
about leafs X, Y, Z and KVM only reporting back leafs X, Y, Z via
KVM_TDX_INIT_VCPU_CPUID.

> We might need to ask for some TDX module guarantees
> around that. It might already be the expectation though, based on the
> description of how to handle unknown configurable bits.
>
> Xiaoyao, would it work for userspace?

As I see it, userspace has to be enlightened to run TDX. So whatever
interface KVM defines for TDX, userspace has to adjust itself to work
with it.

So my personal answer is yes. But only personal.

2024-04-12 17:40:05

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Fri, Apr 12, 2024 at 04:40:37PM +0800,
Xiaoyao Li <[email protected]> wrote:

> > The second issue is that userspace can’t know what CPUID values are configured
> > in the TD. In the existing API for normal guests, it knows because it tells the
> > guest what CPUID values to have. But that model is complicated to fit into
> > the TDX module's API, where you tell it some things and it gives you the
> > resulting leaves. How to handle KVM_SET_CPUID kind of follows from this
> > issue.
> >
> > One option is to demand the TDX module change to be able to efficiently wedge
> > into KVM’s existing “tell” model. This looks like the metadata API to query the
> > fixed bits. Then userspace can know what bits it has to set, and call
> > KVM_SET_CPUID with them. I think it is still kind of awkward. "Tell me what you
> > want to hear?", "Ok here it is".
> >
> > Another option would be to add TDX specific KVM APIs that work for the TDX
> > module's “ask” model, and meet the enumerated two goals. It could look something
> > like:
> > 1. KVM_TDX_GET_CONFIG_CPUID provides a list of directly configurable bits by
> > KVM. This is based on static data on what KVM supports, with a sanity check
> > of TD_SYSINFO.CPUID_CONFIG[]. Bits that KVM doesn’t know about, but are
> > returned as configurable by TD_SYSINFO.CPUID_CONFIG[], are not exposed as
> > configurable. (they will be set to 1 by KVM, per the recommendation)
>
> This is not how KVM works. KVM will never enable unknown features blindly.
> If the feature is unknown to KVM, it cannot be enabled for the guest. That's
> why every new feature needs an enabling patch in KVM, even in the simplest
> case, which needs one patch to enumerate the CPUID of a new instruction in
> KVM_GET_SUPPORTED_CPUID.

We can use device attributes as discussed at
https://lore.kernel.org/kvm/CABgObfZzkNiP3q8p=KpvvFnh8m6qcHX4=tATaJc7cvVv2QWpJQ@mail.gmail.com/
https://lore.kernel.org/kvm/[email protected]/

Something like

#define KVM_X86_GRP_TDX 2
ioctl(fd, KVM_GET_DEVICE_ATTR, (KVM_X86_GRP_TDX, metadata_field_id))
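
Expanded into the existing KVM_GET_DEVICE_ATTR calling convention (struct
kvm_device_attr is from linux/kvm.h; KVM_X86_GRP_TDX and the field-id encoding
are the proposed, not-yet-existing parts), it could look like:

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stdint.h>

static int tdx_get_metadata(int fd, uint64_t field_id, uint64_t *val)
{
        struct kvm_device_attr attr = {
                .group = 2,                     /* KVM_X86_GRP_TDX (proposed) */
                .attr  = field_id,              /* TDX metadata field id */
                .addr  = (uint64_t)(unsigned long)val,  /* value written here */
        };

        return ioctl(fd, KVM_GET_DEVICE_ATTR, &attr);
}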


> > 2. KVM_TDX_INIT_VM is passed userspace's choice of configurable bits, along with
> > XFAM and ATTRIBUTES as dedicated fields. They go into TDH.MNG.INIT.
> > 3. KVM_TDX_INIT_VCPU_CPUID takes a list of CPUID leafs. It pulls the CPUID bits
> > actually configured in the TD for these leafs. They go into the struct kvm_vcpu,
> > and are also passed up to userspace so everyone knows what actually got
> > configured.

Any reason to introduce KVM_TDX_INIT_VCPU_CPUID in addition to
KVM_TDX_INIT_VCPU? We can make a single vCPU KVM TDX ioctl do it all.


> > KVM_SET_CPUID is not used for TDX.

What CPUIDs does KVM_TDX_INIT_VCPU_CPUID accept? The ones that the TDX module
accepts with TDH.MNG.INIT()? Or any CPUIDs that KVM_SET_CPUID2 accepts?
I'm asking because the TDX module virtualizes only a subset of CPUIDs.
TDG.VP.VMCALL<CPUID> would need info from KVM_SET_CPUID.


> > Then we get TDX module folks to commit to never breaking KVM/userspace that
> > follows this logic. One thing still missing is how to handle unknown future
> > leafs with fixed bits. If a future leaf is defined and gets fixed 1, QEMU
> > wouldn't know to query it.
>
> We can make KVM_TDX_INIT_VCPU_CPUID provide a large enough set of CPUID
> leafs, and KVM reports every leaf to userspace, instead of userspace caring
> about leafs X, Y, Z and KVM only reporting back leafs X, Y, Z via
> KVM_TDX_INIT_VCPU_CPUID.

If a new CPUID index is introduced, will userspace get the default values of
those CPUIDs and not touch the unknown CPUIDs? Or will KVM_TDX_GET_CONFIG_CPUID
mask out CPUIDs unknown to KVM?
--
Isaku Yamahata <[email protected]>

2024-04-12 20:10:01

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Fri, 2024-04-12 at 10:39 -0700, Isaku Yamahata wrote:
> On Fri, Apr 12, 2024 at 04:40:37PM +0800,
> Xiaoyao Li <[email protected]> wrote:
>
> > > The second issue is that userspace can’t know what CPUID values are
> > > configured in the TD. In the existing API for normal guests, it knows
> > > because it tells the guest what CPUID values to have. But that model is
> > > complicated to fit into the TDX module's API, where you tell it some things
> > > and it gives you the resulting leaves. How to handle KVM_SET_CPUID kind of
> > > follows from this issue.
> > >
> > > One option is to demand the TDX module change to be able to efficiently
> > > wedge into KVM’s existing “tell” model. This looks like the metadata API to
> > > query the fixed bits. Then userspace can know what bits it has to set, and
> > > call KVM_SET_CPUID with them. I think it is still kind of awkward. "Tell me
> > > what you want to hear?", "Ok here it is".
> > >
> > > Another option would be to add TDX specific KVM APIs that work for the TDX
> > > module's “ask” model, and meet the enumerated two goals. It could look
> > > something like:
> > > 1. KVM_TDX_GET_CONFIG_CPUID provides a list of directly configurable bits
> > > by KVM. This is based on static data on what KVM supports, with a sanity
> > > check of TD_SYSINFO.CPUID_CONFIG[]. Bits that KVM doesn’t know about, but
> > > are returned as configurable by TD_SYSINFO.CPUID_CONFIG[], are not exposed
> > > as configurable. (they will be set to 1 by KVM, per the recommendation)
> >
> > This is not how KVM works. KVM will never enable unknown features blindly.
> > If the feature is unknown to KVM, it cannot be enabled for the guest. That's
> > why every new feature needs an enabling patch in KVM, even in the simplest
> > case, which needs one patch to enumerate the CPUID of a new instruction in
> > KVM_GET_SUPPORTED_CPUID.

I *think* the part in the docs that says VMMs need to do this is concerned with
fixed-1 bits becoming configurable. So not newly defined bits, but just ones
that were previously fixed as 1 and now need to be configured to 1 to result
in no change. We need to clarify this.
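
A small sketch of that recommendation: bits the TDX module reports as
configurable but that the VMM wasn't designed to handle get configured to 1,
so a previously-fixed-1 bit going configurable stays a no-op for the guest
(illustrative helper, one CPUID register at a time):

#include <stdint.h>

static uint32_t config_reg(uint32_t known_mask,  /* bits the VMM understands */
                           uint32_t wanted,      /* VMM's chosen values */
                           uint32_t configurable)/* from TDH.SYS.RD et al. */
{
        uint32_t unknown = configurable & ~known_mask;

        /* known bits as chosen; unknown-but-configurable bits forced to 1 */
        return (wanted & known_mask & configurable) | unknown;
}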

>
> We can use device attributes as discussed at
> https://lore.kernel.org/kvm/CABgObfZzkNiP3q8p=KpvvFnh8m6qcHX4=tATaJc7cvVv2QWpJQ@mail.gmail.com/
> https://lore.kernel.org/kvm/[email protected]/
>
> Something like
>
> #define KVM_X86_GRP_TDX         2
> ioctl(fd, KVM_GET_DEVICE_ATTR, (KVM_X86_GRP_TDX, metadata_field_id))

This would be instead of ATTRIBUTES and XFAM?

>
>
> > > 2. KVM_TDX_INIT_VM is passed userspace's choice of configurable bits,
> > > along with XFAM and ATTRIBUTES as dedicated fields. They go into
> > > TDH.MNG.INIT.
> > > 3. KVM_TDX_INIT_VCPU_CPUID takes a list of CPUID leafs. It pulls the CPUID
> > > bits actually configured in the TD for these leafs. They go into the
> > > struct kvm_vcpu, and are also passed up to userspace so everyone knows
> > > what actually got configured.
>
> Any reason to introduce KVM_TDX_INIT_VCPU_CPUID in addition to
> KVM_TDX_INIT_VCPU?  We can make a single vCPU KVM TDX ioctl do it all.

No reason; it is a good idea to use KVM_TDX_INIT_VCPU.

>
>
> > > KVM_SET_CPUID is not used for TDX.
>
> What CPUIDs does KVM_TDX_INIT_VCPU_CPUID accept?  The ones that the TDX module
> accepts with TDH.MNG.INIT()?  Or any CPUIDs that KVM_SET_CPUID2 accepts?
> I'm asking because the TDX module virtualizes only a subset of CPUIDs.
> TDG.VP.VMCALL<CPUID> would need info from KVM_SET_CPUID.

It should only take bits that are exposed by KVM as configurable for TDX, which
are those that KVM knows about AND that the TDX module supports. The other
features come in as ATTRIBUTES or XFAM bits (or KVM versions of the same
concept).

>
>
> > > Then we get TDX module folks to commit to never breaking KVM/userspace
> > > that
> > > follows this logic. One thing still missing is how to handle unknown
> > > future
> > > leafs with fixed bits. If a future leaf is defined and gets fixed 1, QEMU
> > > wouldn't know to query it.
> >
> > We can make KVM_TDX_INIT_VCPU_CPUID provide a large enough set of CPUID
> > leafs, and KVM reports every leaf to userspace, instead of userspace caring
> > about leafs X, Y, Z and KVM only reporting back leafs X, Y, Z via
> > KVM_TDX_INIT_VCPU_CPUID.
>
> If a new CPUID index is introduced, will userspace get the default values of
> those CPUIDs and not touch the unknown CPUIDs?  Or will KVM_TDX_GET_CONFIG_CPUID
> mask out CPUIDs unknown to KVM?

The API to get at this data only allows for 256 possible leafs. But it also
allows for 256 bits of either sub-leafs or specifying no subleaf, so the total
number of possible values is 256 * 256 = 65536. There is another field,
CPUID_VALID, which says "Non-architectural" and doesn't seem to have anything
on the subleafs either.

We could ask to make all future bits configurable and default 0. For the simple
instruction-support type of CPUID bits, I don't know if there would always be a
functional problem for KVM if they became the native value. But this requirement
would make things more consistent. Even if we have a new metadata API to get
specifically the fixed bits, it doesn't help with the problem of new bits that
start as host value or fixed 1. So I think TDX module guarantees are the only
way to deal with new bits, like KVM does today.

We could also ask for something more flexible, like: The TDX module cannot set
new bits to fixed 1 or native value, unless it won't break any existing kernels
or userspace.

This way, for some simple new instruction bits, they can leave them
native/fixed, which could simplify the TDX module. But then the knowledge of
what bits are set becomes inconsistent.

Hmm...

2024-04-15 21:04:34

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [ANNOUNCE] PUCK Notes - 2024.04.03 - TDX Upstreaming Strategy

On Fri, Apr 12, 2024 at 08:05:44PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> > We can use device attributes as discussed at
> > https://lore.kernel.org/kvm/CABgObfZzkNiP3q8p=KpvvFnh8m6qcHX4=tATaJc7cvVv2QWpJQ@mail.gmail.com/
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > Something like
> >
> > #define KVM_X86_GRP_TDX         2
> > ioctl(fd, KVM_GET_DEVICE_ATTR, (KVM_X86_GRP_TDX, metadata_field_id))
>
> This would be instead of ATTRIBUTES and XFAM?

This is to replace KVM_TDX_CAPABILITIES.
--
Isaku Yamahata <[email protected]>