2020-08-07 08:48:15

by Chenyi Qiang

Subject: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS is currently
implemented only when EPT is enabled, and it requires support for
VM_{ENTRY,EXIT}_LOAD_IA32_PKRS.

Signed-off-by: Chenyi Qiang <[email protected]>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            |  3 ++-
 arch/x86/kvm/vmx/vmx.c          | 15 ++++++++++++---
 arch/x86/kvm/x86.c              |  7 +++++--
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 736e56e023d5..36c7356693c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -99,7 +99,8 @@
| X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
- | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+ | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+ | X86_CR4_PKS))

#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..897749250afd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -325,7 +325,8 @@ void kvm_set_cpu_caps(void)
F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | F(RDPID) |
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
- F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/
+ F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
+ 0 /*PKS*/
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d91d59fb46fa..5cdf4d3848fb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3216,7 +3216,7 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
}

/*
- * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+ * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
* hardware. To emulate this behavior, SMEP/SMAP/PKU needs
* to be manually disabled when guest switches to non-paging
* mode.
@@ -3224,10 +3224,11 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
* If !enable_unrestricted_guest, the CPU is always running
* with CR0.PG=1 and CR4 needs to be modified.
* If enable_unrestricted_guest, the CPU automatically
- * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+ * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
*/
if (!is_paging(vcpu))
- hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+ hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+ X86_CR4_PKS);
}

vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7348,6 +7349,14 @@ static __init void vmx_set_cpu_caps(void)
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);

+ /*
+ * PKS is not yet implemented for shadow paging.
+ * Don't advertise PKS either if VM_{ENTRY,EXIT}_LOAD_IA32_PKRS
+ * is not supported.
+ */
+ if (enable_ept && cpu_has_load_ia32_pkrs())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
+
if (vmx_umip_emulated())
kvm_cpu_cap_set(X86_FEATURE_UMIP);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..8ad9622ee2b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -949,6 +949,8 @@ EXPORT_SYMBOL_GPL(kvm_set_xcr);
__reserved_bits |= X86_CR4_LA57; \
if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
__reserved_bits |= X86_CR4_UMIP; \
+ if (!__cpu_has(__c, X86_FEATURE_PKS)) \
+ __reserved_bits |= X86_CR4_PKS; \
__reserved_bits; \
})

@@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
unsigned long old_cr4 = kvm_read_cr4(vcpu);
unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
- X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
+ X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+ X86_CR4_PKS;

if (kvm_valid_cr4(vcpu, cr4))
return 1;
@@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
- MSR_IA32_UMWAIT_CONTROL,
+ MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
--
2.17.1
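
For illustration, the enumerate-and-enable sequence the commit message
describes, sketched with generic kernel helpers (cpuid_count(),
cr4_set_bits()); this is not code from the series:

	unsigned int eax, ebx, ecx, edx;

	cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
	if (ecx & (1U << 31))			/* CPUID.(EAX=7,ECX=0):ECX[31] */
		cr4_set_bits(X86_CR4_PKS);	/* set with long mode active */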


2020-08-13 19:06:26

by Jim Mattson

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <[email protected]> wrote:
>
> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> enabled by setting CR4.PKS when long mode is active. PKS is currently
> implemented only when EPT is enabled, and it requires support for
> VM_{ENTRY,EXIT}_LOAD_IA32_PKRS.
>
> Signed-off-by: Chenyi Qiang <[email protected]>

> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> {
> unsigned long old_cr4 = kvm_read_cr4(vcpu);
> unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> - X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> + X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> + X86_CR4_PKS;

This list already seems overly long, but I don't think CR4.PKS belongs
here. In volume 3 of the SDM, section 4.4.1, it says:

- If PAE paging would be in use following an execution of MOV to CR0
or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
then the PDPTEs are loaded from the address in CR3.

CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
Since it has no effect on PAE paging, I would be surprised if it did
result in a PDPTE load.
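
For reference, pdptr_bits only feeds the PAE-paging reload check
(roughly, from kvm_set_cr4() in this era of the tree):

	} else if (is_paging(vcpu) && (cr4 & X86_CR4_PAE)
		   && ((cr4 ^ old_cr4) & pdptr_bits)
		   && !load_pdptrs(vcpu, vcpu->arch.walk_mmu,
				   kvm_read_cr3(vcpu)))
		return 1;

so including X86_CR4_PKS in the mask would force an architecturally
unnecessary PDPTE reload whenever CR4.PKS is toggled under PAE paging.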

> if (kvm_valid_cr4(vcpu, cr4))
> return 1;
> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
> MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
> MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
> MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> - MSR_IA32_UMWAIT_CONTROL,
> + MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

Should MSR_IA32_PKRS be added to the switch statement in
kvm_init_msr_list()? Something like...

	case MSR_IA32_PKRS:
		if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
			continue;
		break;
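
For context, that case would sit in the feature-filtering switch that
kvm_init_msr_list() runs over msrs_to_save_all[]; roughly the shape of
the loop in this era of the tree:

	u32 dummy[2];
	int i;

	for (i = 0; i < ARRAY_SIZE(msrs_to_save_all); i++) {
		if (rdmsr_safe(msrs_to_save_all[i], &dummy[0], &dummy[1]) < 0)
			continue;

		switch (msrs_to_save_all[i]) {
		/* ... existing feature-gated cases ... */
		case MSR_IA32_PKRS:
			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
				continue;	/* hide the MSR from userspace */
			break;
		default:
			break;
		}

		msrs_to_save[num_msrs_to_save++] = msrs_to_save_all[i];
	}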

2020-08-14 02:37:39

by Chenyi Qiang

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace



On 8/14/2020 3:04 AM, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <[email protected]> wrote:
>>
>> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
>> enabled by setting CR4.PKS when long mode is active. PKS is currently
>> implemented only when EPT is enabled, and it requires support for
>> VM_{ENTRY,EXIT}_LOAD_IA32_PKRS.
>>
>> Signed-off-by: Chenyi Qiang <[email protected]>
>
>> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>> {
>> unsigned long old_cr4 = kvm_read_cr4(vcpu);
>> unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
>> - X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
>> + X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
>> + X86_CR4_PKS;
>
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
>
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
>
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.
>

Oh, my mistake.

>> if (kvm_valid_cr4(vcpu, cr4))
>> return 1;
>> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
>> MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>> MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>> MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
>> - MSR_IA32_UMWAIT_CONTROL,
>> + MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
>
> Should MSR_IA32_PKRS be added to the switch statement in
> kvm_init_msr_list()? Something like...
>
> 	case MSR_IA32_PKRS:
> 		if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
> 			continue;
> 		break;
>

Yes, this should be added.

2020-09-30 04:40:58

by Sean Christopherson

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

On Thu, Aug 13, 2020 at 12:04:54PM -0700, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <[email protected]> wrote:
> >
> > Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> > enabled by setting CR4.PKS when long mode is active. PKS is currently
> > implemented only when EPT is enabled, and it requires support for
> > VM_{ENTRY,EXIT}_LOAD_IA32_PKRS.
> >
> > Signed-off-by: Chenyi Qiang <[email protected]>
>
> > @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > {
> > unsigned long old_cr4 = kvm_read_cr4(vcpu);
> > unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> > - X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> > + X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> > + X86_CR4_PKS;
>
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
>
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
>
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.

It does belong in the mmu_role_bits though ;-)

2021-01-27 01:49:20

by Paolo Bonzini

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

On 30/09/20 06:36, Sean Christopherson wrote:
>> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
>> Since it has no effect on PAE paging, I would be surprised if it did
>> result in a PDPTE load.
> It does belong in the mmu_role_bits though ;-)
>

Does it? We don't support PKU/PKS for shadow paging, and it's always
zero for EPT. We only support enough PKU/PKS for emulation.
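
The emulation path in question is permission_fault(), which consults the
precomputed mask when KVM walks guest page tables for an emulated
access. A sketch modeled on the existing PKU logic, with this series'
pkr naming; "pkr" here stands for the PKRU or PKRS value selected by the
page's U/S bit:

	if (unlikely(mmu->pkr_mask)) {
		u32 pkr_bits, offset;

		/* AD/WD pair for this page's protection key */
		pkr_bits = (pkr >> (pte_pkey * 2)) & 3;

		/* index the mask by PFEC plus the page's U/S bit */
		offset = (pfec & ~1) +
			 ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));

		pkr_bits &= mmu->pkr_mask >> offset;
		errcode |= -pkr_bits & PFERR_PK_MASK;
		fault |= (pkr_bits != 0);
	}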

Paolo

2021-01-27 19:51:15

by Paolo Bonzini

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

On 26/01/21 20:56, Sean Christopherson wrote:
>>> It does belong in the mmu_role_bits though ;-)
>>
>> Does it? We don't support PKU/PKS for shadow paging, and it's always zero
>> for EPT. We only support enough PKU/PKS for emulation.
>
> As proposed, yes. The PKU/PKS mask is tracked on a per-mmu basis, e.g.
> computed in update_pkr_bitmask() and consumed in permission_fault()
> during emulation. Omitting CR4.PKS from the extended role could let KVM
> reuse an MMU with the wrong pkr_mask.

Right, not for the hash table key but for reuse.

> IIUC, the logic is PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic.

Not in the patches as submitted, but it's what I suggested indeed (using
one bit of the PFEC to pick one of CR4.PKE and CR4.PKS).

> Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
> per-vCPU and recalculate when CR4 changes. I think that would "just work", even
> when nested VMs are in play?

Yeah, pkr_mask is basically one of four constants (depending on CR4.PKE
and CR4.PKS) so recalculating when CR4 changes would work too. But I'm
okay with doing that later, too.
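
A hypothetical sketch of that per-vCPU alternative (the helper and field
names are invented here, not from the series), to be called from
kvm_set_cr4() once the new value is committed:

	static void kvm_update_pkr_mask(struct kvm_vcpu *vcpu)
	{
		bool pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
		bool pks = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);

		/* pkr_mask is one of four constants, keyed on (PKE, PKS) */
		vcpu->arch.pkr_mask = calc_pkr_mask(pke, pks);
	}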

Paolo

2021-01-27 19:53:13

by Sean Christopherson

Subject: Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

On Tue, Jan 26, 2021, Paolo Bonzini wrote:
> On 30/09/20 06:36, Sean Christopherson wrote:
> > > CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> > > Since it has no effect on PAE paging, I would be surprised if it did
> > > result in a PDPTE load.
> > It does belong in the mmu_role_bits though ;-)
> >
>
> Does it? We don't support PKU/PKS for shadow paging, and it's always zero
> for EPT. We only support enough PKU/PKS for emulation.

As proposed, yes. The PKU/PKS mask is tracked on a per-mmu basis, e.g. computed
in update_pkr_bitmask() and consumed in permission_fault() during emulation.
Omitting CR4.PKS from the extended role could let KVM reuse an MMU with the
wrong pkr_mask.

That being said, I don't think it needs a dedicated bit. IIUC, the logic is
PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic. The role could do
the same and smush the bits together, e.g.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..3bfca34f6ea2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -293,7 +293,7 @@ union kvm_mmu_extended_role {
unsigned int cr0_pg:1;
unsigned int cr4_pae:1;
unsigned int cr4_pse:1;
- unsigned int cr4_pke:1;
+ unsigned int cr4_pkr:1;
unsigned int cr4_smap:1;
unsigned int cr4_smep:1;
unsigned int maxphyaddr:6;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 79166288ed03..2774b85a36d5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4448,7 +4448,8 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu)
ext.cr4_smep = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
ext.cr4_pse = !!is_pse(vcpu);
- ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+ ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
+ !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
ext.maxphyaddr = cpuid_maxphyaddr(vcpu);

ext.valid = 1;


Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
per-vCPU and recalculate when CR4 changes. I think that would "just work", even
when nested VMs are in play?