2021-08-11 10:09:24

by Chenyi Qiang

Subject: [PATCH v5 0/7] KVM: PKS Virtualization support

This patch series is based on top of kernel patchset:
https://lore.kernel.org/lkml/[email protected]/

To help with patch review, one piece of information missing from the SDM is
that PKRS will be cleared on Power-up/INIT/RESET, which should be listed in
Table 9.1 "IA-32 and Intel 64 Processor States Following Power-up, Reset,
or INIT"

---

Protection Keys for Supervisor Pages (PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similarly to the existing PKU feature (which protects user pages).
Both perform an additional check after the normal paging permission checks
are done. Accesses or writes can be disabled via an MSR update, without TLB
flushes, when permissions change. If this additional check is violated, a
#PF occurs and the PFEC.PK bit is set.

PKS introduces the IA32_PKRS MSR to manage supervisor protection key
rights. The MSR contains 16 pairs of ADi (access-disable) and WDi
(write-disable) bits. Each pair governs the group of pages whose key,
stored in the leaf paging-structure entries (bits[62:59]), matches the
pair's index. Currently, IA32_PKRS is not supported by the XSAVES
architecture.
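
As an illustration (a minimal sketch, not part of this series; the helper
names are made up), the rights for protection key i live at bits
[2*i+1:2*i] of the MSR value and can be extracted as:

    /* ADi is bit 2*i, WDi is bit 2*i+1 of the IA32_PKRS value. */
    static inline bool pkrs_access_disabled(u32 pkrs, unsigned int pkey)
    {
            return (pkrs >> (pkey * 2)) & 1;
    }

    static inline bool pkrs_write_disabled(u32 pkrs, unsigned int pkey)
    {
            return (pkrs >> (pkey * 2 + 1)) & 1;
    }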

This patchset adds virtualization of PKS to KVM. It implements PKS CPUID
enumeration, VM-entry/VM-exit configuration, MSR exposure, nested support,
etc. Currently, PKS is not yet supported for shadow paging.

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.


---

Changelogs:

v4->v5
- Make setting of MSR intercept/VMCS control bits not dependent on guest.CR4.PKS,
and set them if PKS is exposed to the guest. (Suggested by Sean)
- Add pkrs to standard register caching mechanism to help update
vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
- Do the real pkrs update in VMCS field in vmx_vcpu_reset and
vmx_sync_vmcs_host_state(). (Sean)
- Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
(Sean & Paolo)
- v4: https://lore.kernel.org/lkml/[email protected]/

v3->v4
- Make the MSR intercept and load-controls setting depend on CR4.PKS value
- shadow the guest pkrs and make it usable in PKS emulation
- add the cr4_pke and cr4_pks check in pkr_mask update
- squash PATCH 2 and PATCH 5 to make the dependencies read more clearly
- v3: https://lore.kernel.org/lkml/[email protected]/

v2->v3:
- No function changes since last submit
- rebase on the latest PKS kernel support:
https://lore.kernel.org/lkml/[email protected]/
- add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
- RFC v2: https://lore.kernel.org/lkml/[email protected]/

v1->v2:
- rebase on the latest PKS kernel support:
https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
- add a kvm-unit-tests for PKS
- add the check in kvm_init_msr_list for PKRS
- place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
- add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
VMX MSR
- RFC v1: https://lore.kernel.org/lkml/[email protected]/

---

Chenyi Qiang (7):
KVM: VMX: Introduce PKS VMCS fields
KVM: VMX: Add proper cache tracking for PKRS
KVM: X86: Expose IA32_PKRS MSR
KVM: MMU: Rename the pkru to pkr
KVM: MMU: Add support for PKS emulation
KVM: VMX: Expose PKS to guest
KVM: VMX: Enable PKS for nested VM

arch/x86/include/asm/kvm_host.h | 17 ++++---
arch/x86/include/asm/vmx.h | 6 +++
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/kvm_cache_regs.h | 7 +++
arch/x86/kvm/mmu.h | 25 +++++----
arch/x86/kvm/mmu/mmu.c | 68 ++++++++++++++-----------
arch/x86/kvm/vmx/capabilities.h | 6 +++
arch/x86/kvm/vmx/nested.c | 41 ++++++++++++++-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmcs12.c | 2 +
arch/x86/kvm/vmx/vmcs12.h | 4 ++
arch/x86/kvm/vmx/vmx.c | 89 ++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 7 ++-
arch/x86/kvm/x86.c | 6 ++-
arch/x86/kvm/x86.h | 8 +++
arch/x86/mm/pkeys.c | 6 +++
include/linux/pkeys.h | 5 ++
17 files changed, 243 insertions(+), 57 deletions(-)

--
2.17.1


2021-08-11 10:09:47

by Chenyi Qiang

Subject: [PATCH v5 7/7] KVM: VMX: Enable PKS for nested VM

The PKS MSR is passed through to the guest directly. Configure the MSR to
match the L0/L1 settings so that a nested VM runs PKS properly.

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 41 +++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmcs12.c | 2 ++
arch/x86/kvm/vmx/vmcs12.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 1 +
arch/x86/kvm/vmx/vmx.h | 2 ++
5 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1a52134b0c42..b551e6ec7db5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -251,6 +251,10 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
dest->ds_sel = src->ds_sel;
dest->es_sel = src->es_sel;
#endif
+ if (unlikely(src->pkrs != dest->pkrs)) {
+ vmcs_write64(HOST_IA32_PKRS, src->pkrs);
+ dest->pkrs = src->pkrs;
+ }
}

static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
@@ -660,6 +664,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
MSR_IA32_PRED_CMD,
MSR_TYPE_W);

+ if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PKRS,
+ MSR_TYPE_R | MSR_TYPE_W);
+
kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);

return true;
@@ -2394,6 +2404,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+ if (vmx->nested.nested_run_pending &&
+ (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+ vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
}

if (nested_cpu_has_xsaves(vmcs12))
@@ -2482,6 +2496,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+ if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+ (!vmx->nested.nested_run_pending ||
+ !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+ vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
+
vmx_set_rflags(vcpu, vmcs12->guest_rflags);

/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
@@ -2840,6 +2859,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
vmcs12->host_ia32_perf_global_ctrl)))
return -EINVAL;

+ if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
+ CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
+ return -EINVAL;
+
#ifdef CONFIG_X86_64
ia32e = !!(vcpu->arch.efer & EFER_LMA);
#else
@@ -2990,6 +3013,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
if (nested_check_guest_non_reg_state(vmcs12))
return -EINVAL;

+ if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
+ CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
+ return -EINVAL;
+
return 0;
}

@@ -3325,6 +3352,9 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
if (kvm_mpx_supported() &&
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+ if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+ !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+ vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);

/*
* Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
@@ -3964,6 +3994,7 @@ static bool is_vmcs12_ext_field(unsigned long field)
case GUEST_IDTR_BASE:
case GUEST_PENDING_DBG_EXCEPTIONS:
case GUEST_BNDCFGS:
+ case GUEST_IA32_PKRS:
return true;
default:
break;
@@ -4015,6 +4046,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
if (kvm_mpx_supported())
vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+ if (guest_cpuid_has(vcpu, X86_FEATURE_PKS))
+ vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);

vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
}
@@ -4252,6 +4285,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
vmcs12->host_ia32_perf_global_ctrl));

+ if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS)
+ vmcs_write64(GUEST_IA32_PKRS, vmcs12->host_ia32_pkrs);
+
/* Set L1 segment info according to Intel SDM
27.5.2 Loading Host Segment and Descriptor-Table Registers */
seg = (struct kvm_segment) {
@@ -6466,7 +6502,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+ VM_EXIT_LOAD_IA32_PKRS;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6486,7 +6523,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
VM_ENTRY_IA32E_MODE |
#endif
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
- VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+ VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_PKRS;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);

diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index d9f5d7c56ae3..cf088dfa69c6 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -63,9 +63,11 @@ const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(GUEST_PDPTR2, guest_pdptr2),
FIELD64(GUEST_PDPTR3, guest_pdptr3),
FIELD64(GUEST_BNDCFGS, guest_bndcfgs),
+ FIELD64(GUEST_IA32_PKRS, guest_ia32_pkrs),
FIELD64(HOST_IA32_PAT, host_ia32_pat),
FIELD64(HOST_IA32_EFER, host_ia32_efer),
FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+ FIELD64(HOST_IA32_PKRS, host_ia32_pkrs),
FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 5e0e1b39f495..2cf201f19155 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,8 @@ struct __packed vmcs12 {
u16 host_gs_selector;
u16 host_tr_selector;
u16 guest_pml_index;
+ u64 host_ia32_pkrs;
+ u64 guest_ia32_pkrs;
};

/*
@@ -359,6 +361,8 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_gs_selector, 992);
CHECK_OFFSET(host_tr_selector, 994);
CHECK_OFFSET(guest_pml_index, 996);
+ CHECK_OFFSET(host_ia32_pkrs, 998);
+ CHECK_OFFSET(guest_ia32_pkrs, 1006);
}

extern const unsigned short vmcs_field_to_offset_table[];
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71f2aefd6454..a910d9b423eb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7112,6 +7112,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
+ cr4_fixed1_update(X86_CR4_PKS, ecx, feature_bit(PKS));

#undef cr4_fixed1_update
}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 858d9a959c72..fff314dc8356 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -203,6 +203,8 @@ struct nested_vmx {
u64 vmcs01_debugctl;
u64 vmcs01_guest_bndcfgs;

+ u64 vmcs01_guest_pkrs;
+
/* to migrate it to L1 if L2 writes to L1's CR8 directly */
int l1_tpr_threshold;

--
2.17.1

2021-08-11 10:09:48

by Chenyi Qiang

Subject: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation

Up until now, pkr_mask had 0 bits for supervisor pages (the U/S bit in
page tables replaces the PFEC.RSVD in page fault error code).
For PKS support, fill in the bits using the same algorithm used for user
mode pages, but with CR4.PKE replaced by CR4.PKS. Because of this
change, CR4.PKS must also be included in the MMU role.

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 +++---
arch/x86/kvm/mmu.h | 15 +++++----
arch/x86/kvm/mmu/mmu.c | 58 ++++++++++++++++++++-------------
3 files changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d55aca9167b..f31d19e851de 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -359,6 +359,7 @@ union kvm_mmu_extended_role {
unsigned int cr4_smap:1;
unsigned int cr4_smep:1;
unsigned int cr4_la57:1;
+ unsigned int cr4_pks:1;
};
};

@@ -439,10 +440,11 @@ struct kvm_mmu {
u8 permissions[16];

/*
- * The pkru_mask indicates if protection key checks are needed. It
- * consists of 16 domains indexed by page fault error code bits [4:1],
- * with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
- * Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+ * The pkr_mask indicates if protection key checks are needed.
+ * It consists of 16 domains indexed by page fault error code
+ * bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
+ * page tables. Each domain has 2 bits which are ANDed with AD
+ * and WD from PKRU/PKRS.
*/
u32 pkr_mask;

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5e94f6a90e80..5586c0341d28 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -46,7 +46,7 @@

#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | \
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
- X86_CR4_LA57)
+ X86_CR4_LA57 | X86_CR4_PKS)

#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)

@@ -202,14 +202,17 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
if (unlikely(mmu->pkr_mask)) {
u32 pkr_bits, offset;
+ u64 pkr;

/*
- * PKRU defines 32 bits, there are 16 domains and 2
- * attribute bits per domain in pkru. pte_pkey is the
- * index of the protection domain, so pte_pkey * 2 is
- * is the index of the first bit for the domain.
+ * PKRU and PKRS both define 32 bits. There are 16 domains
+ * and 2 attribute bits per domain in them. pte_pkey is the
+ * index of the protection domain, so pte_pkey * 2 is the
+ * index of the first bit for the domain. The choice of
+ * PKRU and PKRS is determined by the accessed pages.
*/
- pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+ pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);
+ pkr_bits = (pkr >> pte_pkey * 2) & 3;

/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) +
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 49fd2dc98cc6..ca83ad5f5716 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -205,6 +205,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);

@@ -227,6 +228,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pks);
BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);

static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
@@ -4420,35 +4422,41 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
}

/*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register. Protection
-* key violations are reported through a bit in the page fault error code.
+* Protection Keys (PKEY) is an additional mechanism by which
+* the paging controls access to user-mode/supervisor-mode addresses
+* based on the values in PKEY registers (PKRU/PKRS). Protection key
+* violations are reported through a bit in the page fault error code.
* Unlike other bits of the error code, the PK bit is not known at the
* call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
+* permission_fault based on two bits of PKRU/PKRS, on some machine
+* state (CR4, CR0, EFER, CPL), and on other bits of the error code
+* and the page tables.
*
* In particular the following conditions come from the error code, the
* page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
+* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
* - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
+* - PK is always zero if
+* - U=0 in the page tables and CR4.PKS=0
+* - U=1 in the page tables and CR4.PKE=0
+* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.
*
-* The PKRU bitmask caches the result of these four conditions. The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask. Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD. For all reads or if the last condition holds, WD
-* only will be masked away.
+* The pkr_mask caches the result of these three conditions. The error
+* code (minus the P bit) and the page table's U bit form an index into
+* the pkr_mask. Two bits of the pkr_mask are then extracted and ANDed with
+* the two bits of the PKEY register corresponding to the protection key.
+* For the first three conditions above the bits will be 00, thus masking away
+* both AD and WD. For all reads or if the last condition holds, WD only will be
+* masked away.
*/
static void update_pkr_bitmask(struct kvm_mmu *mmu)
{
unsigned bit;
bool wp;
+ bool cr4_pke = is_cr4_pke(mmu);
+ bool cr4_pks = is_cr4_pks(mmu);

- if (!is_cr4_pke(mmu)) {
+ if (!cr4_pke && !cr4_pks) {
mmu->pkr_mask = 0;
return;
}
@@ -4468,19 +4476,22 @@ static void update_pkr_bitmask(struct kvm_mmu *mmu)
pte_user = pfec & PFERR_RSVD_MASK;

/*
- * Only need to check the access which is not an
- * instruction fetch and is to a user page.
+ * Checks are needed only for accesses that are not
+ * instruction fetches and for which
+ * - CR4.PKE=1 when accessing a user page, or
+ * - CR4.PKS=1 when accessing a supervisor page.
*/
- check_pkey = (!ff && pte_user);
+ check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
+
/*
- * write access is controlled by PKRU if it is a
- * user access or CR0.WP = 1.
+ * write access is controlled by PKRU/PKRS if
+ * it is a user access or CR0.WP = 1.
*/
check_write = check_pkey && wf && (uf || wp);

- /* PKRU.AD stops both read and write access. */
+ /* PKRU/PKRS.AD stops both read and write access. */
pkey_bits = !!check_pkey;
- /* PKRU.WD stops write access. */
+ /* PKRU/PKRS.WD stops write access. */
pkey_bits |= (!!check_write) << 1;

mmu->pkr_mask |= (pkey_bits & 3) << pfec;
@@ -4531,6 +4542,7 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
/* PKEY and LA57 are active iff long mode is active. */
ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
+ ext.cr4_pks = ____is_efer_lma(regs) && ____is_cr4_pks(regs);
}

ext.valid = 1;
--
2.17.1

2021-08-11 10:10:09

by Chenyi Qiang

Subject: [PATCH v5 6/7] KVM: VMX: Expose PKS to guest

Existence of PKS is enumerated via CPUID.(EAX=7,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. Currently, PKS is
only implemented when EPT is enabled, and it requires support for the
VM_{ENTRY,EXIT}_LOAD_IA32_PKRS VMCS controls.
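
For context, a guest kernel would discover and enable the feature roughly
as follows (an illustrative sketch only, not part of this patch; it reuses
the Linux helpers cpuid_count() and cr4_set_bits(), and assumes the
X86_CR4_PKS definition from the base PKS kernel series):

    unsigned int eax, ebx, ecx, edx;

    /* PKS is enumerated via CPUID.(EAX=7,ECX=0):ECX[31]. */
    cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    if (ecx & (1U << 31))
            cr4_set_bits(X86_CR4_PKS); /* takes effect when long mode is active */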

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx/capabilities.h | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
arch/x86/kvm/x86.h | 2 ++
5 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f31d19e851de..9abd9a4c2174 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -103,7 +103,8 @@
| X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
- | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+ | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+ | X86_CR4_PKS))

#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 739be5da3bca..dbee0d639db3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -458,7 +458,7 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | 0 /*PKS*/
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..3f6122fd8f65 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -104,6 +104,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
(vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
}

+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+ return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+ (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
static inline bool cpu_has_vmx_mpx(void)
{
return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f3ca6a07a21..71f2aefd6454 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3218,7 +3218,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
}

/*
- * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+ * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
* hardware. To emulate this behavior, SMEP/SMAP/PKU needs
* to be manually disabled when guest switches to non-paging
* mode.
@@ -3226,10 +3226,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
* If !enable_unrestricted_guest, the CPU is always running
* with CR0.PG=1 and CR4 needs to be modified.
* If enable_unrestricted_guest, the CPU automatically
- * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+ * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
*/
if (!is_paging(vcpu))
- hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+ hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+ X86_CR4_PKS);
}

vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7311,6 +7312,14 @@ static __init void vmx_set_cpu_caps(void)

if (cpu_has_vmx_waitpkg())
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+ /*
+ * PKS is not yet implemented for shadow paging.
+ * If the VM_{ENTRY,EXIT}_LOAD_IA32_PKRS controls are
+ * not supported, don't expose PKS either.
+ */
+ if (enable_ept && cpu_has_load_ia32_pkrs())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
}

static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8aaf89e6dc5..a7040c6ef524 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -481,6 +481,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
__reserved_bits |= X86_CR4_VMXE; \
if (!__cpu_has(__c, X86_FEATURE_PCID)) \
__reserved_bits |= X86_CR4_PCIDE; \
+ if (!__cpu_has(__c, X86_FEATURE_PKS)) \
+ __reserved_bits |= X86_CR4_PKS; \
__reserved_bits; \
})

--
2.17.1

2021-08-26 02:06:27

by Chenyi Qiang

Subject: Re: [PATCH v5 0/7] KVM: PKS Virtualization support

Ping for comments.

One thing to pay attention to:
The previous statement was incorrect – PKRS should be preserved on INIT,
not cleared.


On 8/11/2021 6:11 PM, Chenyi Qiang wrote:
> This patch series is based on top of kernel patchset:
> https://lore.kernel.org/lkml/[email protected]/
>
> To help with patch review, one piece of information missing from the SDM is
> that PKRS will be cleared on Power-up/INIT/RESET, which should be listed in
> Table 9.1 "IA-32 and Intel 64 Processor States Following Power-up, Reset,
> or INIT"
> [...]

2021-10-25 15:14:56

by Paolo Bonzini

Subject: Re: [PATCH v5 0/7] KVM: PKS Virtualization support

On 11/08/21 12:11, Chenyi Qiang wrote:
> This patch series is based on top of kernel patchset:
> https://lore.kernel.org/lkml/[email protected]/
> [...]

Hi Chenyi,

pkrs_cache does not yet exist in Linux 5.15. What is the state of the
bare-metal support for PKS?

Thanks,

Paolo


2021-10-26 08:34:07

by Chenyi Qiang

Subject: Re: [PATCH v5 0/7] KVM: PKS Virtualization support



On 10/25/2021 11:12 PM, Paolo Bonzini wrote:
> On 11/08/21 12:11, Chenyi Qiang wrote:
>> This patch series is based on top of kernel patchset:
>> https://lore.kernel.org/lkml/[email protected]/
>> [...]
>
> Hi Chenyi,
>
> pkrs_cache does not yet exist in Linux 5.15.  What is the state of the
> bare-metal support for PKS?
>
> Thanks,
>
> Paolo
>

Hi Paolo,

The latest version is still at
https://lore.kernel.org/lkml/[email protected]/

Ira is working on the next version but doesn't have a concrete schedule yet.

Thanks
Chenyi


2021-11-09 01:03:01

by Sean Christopherson

Subject: Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> * In particular the following conditions come from the error code, the
> * page tables and the machine state:
> -* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
> +* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
> * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
> -* - PK is always zero if U=0 in the page tables
> -* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
> +* - PK is always zero if
> +* - U=0 in the page tables and CR4.PKS=0
> +* - U=1 in the page tables and CR4.PKE=0

I think it makes sense to completely rewrite this "table" or drop it altogether.
The "always zero" wording is nonsensical when there are multiple conditions for
"always". And IMO the whole "PK is ... zero" thing is a bit awkward because it
leaves the uninitiated wondering what PK=0 even means ('1' == disabled is not the
most intuitive thing since most PTE bits are '1' = allowed). Ugh, and re-reading
with context, that's not even what "PK" means here, this is actually referring to
PFEC.PK, which is all kinds of confusing because PFEC.PK is merely a "symptom" of
a #PF due to a protection key violation, not the other way 'round.

IMO this entire comment could use a good overhaul. It never explicitly documents
the "access-disable" and "write-disable" behavior. More below.

> +* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.

Hrm. The SDM contradicts itself.

Section 4.6.1 "Determination of Access Rights" says this for supervisor-mode accesses:

If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
key for which write access is permitted.

but section 4.6.2 "Protection Keys" says:

If WDi = 1, write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0,
IA32_PKRS.WDi does not affect write accesses to supervisor-mode addresses with
protection key i.)

I believe 4.6.1 is subtly wrong and should be "data access", not "write access".

If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
key for which data access is permitted.
^^^^

Can you follow-up with someone to get the SDM fixed? This stuff is subtle and
confusing enough as it is :-)

And on a very related topic, it would be helpful to clarify user-mode vs. supervisor-mode
and access vs. address.

How about this for a comment?

/*
* Protection Key Rights (PKR) is an additional mechanism by which data accesses
* with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
* Protection Key Rights Userspace (PKRU) or Protection Key Rights Supervisor
* (PKRS) registers. The Protection Key (PK) used for an access is a 4-bit
* value specified in bits 62:59 of the leaf PTE used to translate the address.
*
* PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
* access-disable (AD) and write-disable (WD) bit. The PK from the leaf PTE is
* used to index the appropriate PKR (see below), e.g. PK=1 would consume bits
* 3:2 (bit 3 == write-disable, bit 2 == access-disable).
*
* The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
* _address_ (not access type!). For a user-mode address, PKRU is used; for a
* supervisor-mode address, PKRS is used. An address is supervisor-mode if the
* U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
* an address is user-mode if the U/S flag is 0 in _all_ entries. Again, this
* is the address type, not the the access type, e.g. a supervisor-mode _access_
* will consume PKRU if the _address_ is a user-mode address.
*
* As alluded to above, PKR checks are only performed for data accesses; code
* fetches are not subject to PKR checks. Terminal page faults (!PRESENT or
* PFEC.RSVD=1) are also not subject to PKR checks.
*
* PKR write-disable checks for supervisor-mode _accesses_ are performed if and
* only if CR0.WP=1 (though access-disable checks still apply).
*
* In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
* (c) CR0.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
* PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
* (g) the address type (retrieved from the paging-structure entries).
*
* To avoid conditional branches in permission_fault(), the PKR bitmask caches
* the above inputs, except for (e) the PKR{S,U} entry. The FETCH, USER, and
* WRITE bits of the PFEC and the effective value of the paging-structures' U/S
* bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
* PKR bitmask (similar to the 4-bit Protection Key itself). The two bits of
* the PKR bitmask "entry" are then extracted and ANDed with the two bits of
* the PKR{S,U} register corresponding to the address type and protection key.
*
* E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
* will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
* to suppress PKR checks on code fetches.
*/
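
For reference, a sketch of how permission_fault() would consume the bitmask
described above, using the names from the mmu.h hunk in this patch (pkr is
u64 here only because the patch declares it that way):

	u32 pkr_bits, offset;
	u64 pkr;

	/* The address type selects the register: user page => PKRU, else PKRS. */
	pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);
	pkr_bits = (pkr >> pte_pkey * 2) & 3;

	/* Drop PFEC.P; slot the U/S bit into the PFEC.RSVD position (bit 3). */
	offset = (pfec & ~1) +
		 ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));

	/* AND the indexed bitmask entry with the AD/WD bits for the key. */
	pkr_bits &= mmu->pkr_mask >> offset;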

2021-11-09 03:26:17

by Sean Christopherson

Subject: Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> @@ -202,14 +202,17 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
> if (unlikely(mmu->pkr_mask)) {
> u32 pkr_bits, offset;
> + u64 pkr;

Heh, MSR_IA32_PKRS strikes again. This should be a u32.

>
> /*
> - * PKRU defines 32 bits, there are 16 domains and 2
> - * attribute bits per domain in pkru. pte_pkey is the
> - * index of the protection domain, so pte_pkey * 2 is
> - * is the index of the first bit for the domain.
> + * PKRU and PKRS both define 32 bits. There are 16 domains
> + * and 2 attribute bits per domain in them. pte_pkey is the
> + * index of the protection domain, so pte_pkey * 2 is the
> + * index of the first bit for the domain. The choice of
> + * PKRU and PKRS is determined by the accessed pages.

Please replace "accessed pages" with something along the lines of

The use of PKRU versus PKRS is selected by the address type, as determined by
the U/S bit in the paging-structure entries.

I.e. try to avoid "access" in favor of "address" to follow the SDM's wording.

2021-11-09 05:09:12

by Sean Christopherson

Subject: Re: [PATCH v5 6/7] KVM: VMX: Expose PKS to guest

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 739be5da3bca..dbee0d639db3 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -458,7 +458,7 @@ void kvm_set_cpu_caps(void)
> F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
> F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
> F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> - F(SGX_LC) | F(BUS_LOCK_DETECT)
> + F(SGX_LC) | F(BUS_LOCK_DETECT) | 0 /*PKS*/

...

> );
> /* Set LA57 based on hardware capability. */
> if (cpuid_ecx(7) & F(LA57))

...

> @@ -7311,6 +7312,14 @@ static __init void vmx_set_cpu_caps(void)
>
> if (cpu_has_vmx_waitpkg())
> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> + /*
> + * PKS is not yet implemented for shadow paging.
> + * If the VM_{ENTRY,EXIT}_LOAD_IA32_PKRS controls are
> + * not supported, don't expose PKS either.
> + */
> + if (enable_ept && cpu_has_load_ia32_pkrs())
> + kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);

I would rather handle the !TDP case in cpuid.c alongside the PKU. The decision
to not support Protection Keys with legacy shadow paging is an x86 decision, not
a VMX decision.

And VMX's extra restriction on the VMCS support should not bleed into common x86.

Can you also opportunistically update the comment (see below) to explain _why_
OSPKE needs to be enabled in order to advertise PKU?

Thanks!

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 2d70edb0f323..c4ed6881857c 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -439,18 +439,23 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | F(PKS)
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
kvm_cpu_cap_set(X86_FEATURE_LA57);

/*
- * PKU not yet implemented for shadow paging and requires OSPKE
- * to be set on the host. Clear it if that is not the case
+ * Protection Keys are not supported for shadow paging. PKU further
+ * requires OSPKE to be set on the host in order to use {RD,WR}PKRU to
+ * save/restore the guest's PKRU.
*/
- if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+ if (!tdp_enabled) {
kvm_cpu_cap_clear(X86_FEATURE_PKU);
+ kvm_cpu_cap_clear(X86_FEATURE_PKS);
+ } else if (!boot_cpu_has(X86_FEATURE_OSPKE)) {
+ kvm_cpu_cap_clear(X86_FEATURE_PKU);
+ }

kvm_cpu_cap_mask(CPUID_7_EDX,
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |


and then vmx.c only needs to handle clearing PKS when the VMCS controls aren't
available.

2021-11-09 14:55:08

by Chenyi Qiang

[permalink] [raw]
Subject: Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation



On 11/9/2021 3:46 AM, Sean Christopherson wrote:
> On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> [...]
>
> I believe 4.6.1 is subtly wrong and should be "data access", not "write access".
>
> Can you follow-up with someone to get the SDM fixed? This stuff is subtle and
> confusing enough as it is :-)
>

Nice catch. I'll mention it internally to fix it.

> And on a very related topic, it would be helpful to clarify user-mode vs. supervisor-mode
> and access vs. address.
>
> How about this for a comment?
> [...]

Very clear comment. I'll clean it up and apply it in the next version.

>