2022-02-21 09:07:16

by Chenyi Qiang

Subject: [PATCH v6 0/7] KVM: PKS Virtualization support

This patch series is based on top of v8 PKS core support kernel patchset:
https://lore.kernel.org/lkml/[email protected]/

Note: If you read SDM section 4.6.1, you may be confused by the
statement about data writes to supervisor-mode addresses:

If CR0.WP = 0, data may be written to any supervisor-mode address with
a protection key for which write access is permitted.

This may seem to conflict slightly with 4.6.2:

if WDi = 1, write accesses are not permitted if CR0.WP = 1. (If CR0.WP
= 0, IA32_PKRS.WDi does not affect write accesses to supervisor-mode
address with protection key i.)

In fact, the statement in 4.6.1 doesn't say "a protection key with the
appropriate WDi bit set." The reader should instead refer to Section
4.6.2 for the definition of when write access is permitted. We will
follow up on this internally to make it clearer in the SDM.

---

Protection Keys for Supervisor Pages (PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similarly to the existing PKU feature (protection keys for
user pages). Both perform an additional check after the normal paging
permission checks. Accesses or writes can be disabled via an MSR
update, without TLB flushes, when permissions change. If this
additional check is violated, a #PF occurs and the PFEC.PK bit is set.

PKS introduces the MSR IA32_PKRS to manage supervisor protection key
rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
governs the group of pages tagged with the corresponding key, which is
set in the leaf paging-structure entries (bits[62:59]). Currently,
IA32_PKRS is not supported by the XSAVES architecture.
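
As a rough illustration (not part of any patch in this series), the
AD/WD pair for protection key i could be extracted from the MSR value
like this:

	/* illustrative helpers only, not taken from the kernel */
	static inline bool pkrs_access_disabled(u32 pkrs, int key)
	{
		return pkrs & (1u << (key * 2));	/* ADi */
	}

	static inline bool pkrs_write_disabled(u32 pkrs, int key)
	{
		return pkrs & (1u << (key * 2 + 1));	/* WDi */
	}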

This patchset aims to add virtualization of PKS in KVM. It implements
PKS CPUID enumeration, VM-entry/VM-exit configuration, MSR exposure,
nested support, etc. Currently, PKS is not yet supported for shadow
paging.

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.

---

Changelogs:

v5->v6
- PKRS is preserved on INIT. Add the PKRS reset operation in kvm_vcpu_reset.
(Sean)
- Track the pkrs as u32. Add code to WARN on bits 63:32 being set in the VMCS field.
(Sean)
- Adjust the MSR intercept and entry/exit control in VMCS according to
guest CPUID. This resolves the issue when userspace re-enables this feature.
(Sean)
- Split the VMX restriction on PKS support (entry/exit load controls) out of
common x86, and put the TDP restriction together with PKU in common x86.
(Sean)
- Thanks to Sean for revising the comments in mmu.c related to
update_pkr_bitmask, which make the pkr bitmask cache usage clearer.
- v5: https://lore.kernel.org/lkml/[email protected]/

v4->v5
- Make setting of the MSR intercept/VMCS control bits independent of guest
CR4.PKS, and set them if PKS is exposed to the guest. (Suggested by Sean)
- Add pkrs to the standard register caching mechanism to help update
vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
- Do the real pkrs update of the VMCS field in vmx_vcpu_reset() and
vmx_sync_vmcs_host_state(). (Sean)
- Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
(Sean & Paolo)
- v4: https://lore.kernel.org/lkml/[email protected]/

v3->v4
- Make the MSR intercept and load-controls setting depend on CR4.PKS value
- shadow the guest pkrs and make it usable in PKS emulation
- add the cr4_pke and cr4_pks check in pkr_mask update
- squash PATCH 2 and PATCH 5 to make the dependencies clearer
- v3: https://lore.kernel.org/lkml/[email protected]/

v2->v3:
- No functional changes since the last submission
- rebase on the latest PKS kernel support:
https://lore.kernel.org/lkml/[email protected]/
- add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
- RFC v2: https://lore.kernel.org/lkml/[email protected]/

v1->v2:
- rebase on the latest PKS kernel support:
https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
- add a kvm-unit-tests for PKS
- add the check in kvm_init_msr_list for PKRS
- place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
- add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
VMX MSR
- RFC v1: https://lore.kernel.org/lkml/[email protected]/

---

Chenyi Qiang (7):
KVM: VMX: Introduce PKS VMCS fields
KVM: VMX: Add proper cache tracking for PKRS
KVM: X86: Expose IA32_PKRS MSR
KVM: MMU: Rename the pkru to pkr
KVM: MMU: Add support for PKS emulation
KVM: VMX: Expose PKS to guest
KVM: VMX: Enable PKS for nested VM

arch/x86/include/asm/kvm_host.h | 17 ++++--
arch/x86/include/asm/vmx.h | 6 ++
arch/x86/kvm/cpuid.c | 13 ++--
arch/x86/kvm/kvm_cache_regs.h | 7 +++
arch/x86/kvm/mmu.h | 27 +++++----
arch/x86/kvm/mmu/mmu.c | 101 ++++++++++++++++++++------------
arch/x86/kvm/vmx/capabilities.h | 6 ++
arch/x86/kvm/vmx/nested.c | 38 +++++++++++-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmcs12.c | 2 +
arch/x86/kvm/vmx/vmcs12.h | 4 ++
arch/x86/kvm/vmx/vmx.c | 92 ++++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 7 ++-
arch/x86/kvm/x86.c | 10 +++-
arch/x86/kvm/x86.h | 8 +++
arch/x86/mm/pkeys.c | 6 ++
include/linux/pkeys.h | 6 ++
17 files changed, 280 insertions(+), 71 deletions(-)

--
2.17.1


2022-02-21 09:10:08

by Chenyi Qiang

Subject: [PATCH v6 1/7] KVM: VMX: Introduce PKS VMCS fields

PKS (Protection Keys for Supervisor Pages) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

A new PKS MSR (PKRS) is defined in the kernel to support PKS, which holds
a set of permissions associated with each protection domain.

Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
{host,guest}-state area to store the respective values of PKRS.

Every VM exit saves PKRS into guest-state area.
If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
area.
If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
area.
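
As a rough sketch of how the rest of this series is expected to use the
new fields and controls (illustrative only; the real hunks follow in
later patches):

	/* guest PKRS is loaded on VM entry when the entry control is set */
	vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
	vmcs_write64(GUEST_IA32_PKRS, vcpu->arch.pkrs);

	/* host PKRS is loaded on VM exit when the exit control is set */
	vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
	vmcs_write64(HOST_IA32_PKRS, host_pkrs);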

Signed-off-by: Chenyi Qiang <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
---
arch/x86/include/asm/vmx.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..7962d506ba91 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -95,6 +95,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_LOAD_IA32_PKRS 0x20000000

#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff

@@ -108,6 +109,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_IA32_PKRS 0x00400000

#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff

@@ -245,12 +247,16 @@ enum vmcs_field {
GUEST_BNDCFGS_HIGH = 0x00002813,
GUEST_IA32_RTIT_CTL = 0x00002814,
GUEST_IA32_RTIT_CTL_HIGH = 0x00002815,
+ GUEST_IA32_PKRS = 0x00002818,
+ GUEST_IA32_PKRS_HIGH = 0x00002819,
HOST_IA32_PAT = 0x00002c00,
HOST_IA32_PAT_HIGH = 0x00002c01,
HOST_IA32_EFER = 0x00002c02,
HOST_IA32_EFER_HIGH = 0x00002c03,
HOST_IA32_PERF_GLOBAL_CTRL = 0x00002c04,
HOST_IA32_PERF_GLOBAL_CTRL_HIGH = 0x00002c05,
+ HOST_IA32_PKRS = 0x00002c06,
+ HOST_IA32_PKRS_HIGH = 0x00002c07,
PIN_BASED_VM_EXEC_CONTROL = 0x00004000,
CPU_BASED_VM_EXEC_CONTROL = 0x00004002,
EXCEPTION_BITMAP = 0x00004004,
--
2.17.1

2022-02-21 09:13:27

by Chenyi Qiang

Subject: [PATCH v6 5/7] KVM: MMU: Add support for PKS emulation

Up until now, pkr_mask has had 0 bits for supervisor pages (the U/S bit
from the page tables replaces PFEC.RSVD in the page fault error code
when indexing the mask).
For PKS support, fill in those bits using the same algorithm used for
user-mode pages, but with CR4.PKE replaced by CR4.PKS. Because of this
change, CR4.PKS must also be included in the MMU role.

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 ++--
arch/x86/kvm/mmu.h | 17 +++---
arch/x86/kvm/mmu/mmu.c | 91 +++++++++++++++++++++------------
3 files changed, 76 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c37cd23b6764..5c53efe0012e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -370,6 +370,7 @@ union kvm_mmu_extended_role {
unsigned int cr4_smap:1;
unsigned int cr4_smep:1;
unsigned int cr4_la57:1;
+ unsigned int cr4_pks:1;
unsigned int efer_lma:1;
};
};
@@ -450,10 +451,11 @@ struct kvm_mmu {
u8 permissions[16];

/*
- * The pkru_mask indicates if protection key checks are needed. It
- * consists of 16 domains indexed by page fault error code bits [4:1],
- * with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
- * Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+ * The pkr_mask indicates if protection key checks are needed.
+ * It consists of 16 domains indexed by page fault error code
+ * bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
+ * page tables. Each domain has 2 bits which are ANDed with AD
+ * and WD from PKRU/PKRS.
*/
u32 pkr_mask;

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9e216d205c8d..0dacd3bb8602 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -45,7 +45,8 @@
#define PT32E_ROOT_LEVEL 3

#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
- X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
+ X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
+ X86_CR4_PKS)

#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)

@@ -277,14 +278,18 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
if (unlikely(mmu->pkr_mask)) {
u32 pkr_bits, offset;
+ u32 pkr;

/*
- * PKRU defines 32 bits, there are 16 domains and 2
- * attribute bits per domain in pkru. pte_pkey is the
- * index of the protection domain, so pte_pkey * 2 is
- * is the index of the first bit for the domain.
+ * PKRU and PKRS both define 32 bits. There are 16 domains
+ * and 2 attribute bits per domain in them. pte_pkey is the
+ * index of the protection domain, so pte_pkey * 2 is the
+ * index of the first bit for the domain. The use of PKRU
+ * versus PKRS is selected by the address type, as determined
+ * by the U/S bit in the paging-structure entries.
*/
- pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+ pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);
+ pkr_bits = (pkr >> pte_pkey * 2) & 3;

/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) +
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 83597161b5f9..ea19ccfa4da3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -220,6 +220,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);

@@ -242,6 +243,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pks);
BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);

static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
@@ -4597,37 +4599,58 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
}

/*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register. Protection
-* key violations are reported through a bit in the page fault error code.
-* Unlike other bits of the error code, the PK bit is not known at the
-* call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
-*
-* In particular the following conditions come from the error code, the
-* page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
-* - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
-*
-* The PKRU bitmask caches the result of these four conditions. The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask. Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD. For all reads or if the last condition holds, WD
-* only will be masked away.
-*/
+ * Protection Key Rights (PKR) is an additional mechanism by which data accesses
+ * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
+ * Protection Key Rights Userspace (PKRU) or Protection Key Rights Supervisor
+ * (PKRS) registers. The Protection Key (PK) used for an access is a 4-bit
+ * value specified in bits 62:59 of the leaf PTE used to translate the address.
+ *
+ * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
+ * access-disable (AD) and write-disable (WD) bit. The PK from the leaf PTE is
+ * used to index the appropriate PKR (see below), e.g. PK=1 would consume bits
+ * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
+ *
+ * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
+ * _address_ (not access type!). For a user-mode address, PKRU is used; for a
+ * supervisor-mode address, PKRS is used. An address is supervisor-mode if the
+ * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
+ * an address is user-mode if the U/S flag is 1 in _all_ entries. Again, this
+ * is the address type, not the access type, e.g. a supervisor-mode _access_
+ * will consume PKRU if the _address_ is a user-mode address.
+ *
+ * As alluded to above, PKR checks are only performed for data accesses; code
+ * fetches are not subject to PKR checks. Terminal page faults (!PRESENT or
+ * PFEC.RSVD=1) are also not subject to PKR checks.
+ *
+ * PKR write-disable checks for supervisor-mode _accesses_ are performed if and
+ * only if CR0.WP=1 (though access-disable checks still apply).
+ *
+ * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
+ * (c) CR0.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
+ * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
+ * (g) the address type (retrieved from the paging-structure entries).
+ *
+ * To avoid conditional branches in permission_fault(), the PKR bitmask caches
+ * the above inputs, except for (e) the PKR{S,U} entry. The FETCH, USER, and
+ * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
+ * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
+ * PKR bitmask (similar to the 4-bit Protection Key itself). The two bits of
+ * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
+ * the PKR{S,U} register corresponding to the address type and protection key.
+ *
+ * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
+ * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
+ * to suppress PKR checks on code fetches.
+ */
static void update_pkr_bitmask(struct kvm_mmu *mmu)
{
unsigned bit;
bool wp;
-
+ bool cr4_pke = is_cr4_pke(mmu);
+ bool cr4_pks = is_cr4_pks(mmu);
mmu->pkr_mask = 0;

- if (!is_cr4_pke(mmu))
+ if (!cr4_pke && !cr4_pks)
return;

wp = is_cr0_wp(mmu);
@@ -4645,19 +4668,22 @@ static void update_pkr_bitmask(struct kvm_mmu *mmu)
pte_user = pfec & PFERR_RSVD_MASK;

/*
- * Only need to check the access which is not an
- * instruction fetch and is to a user page.
+ * Only need to check the access which is not an
+ * instruction fetch and which is either
+ * - to a user page, when CR4.PKE is set, or
+ * - to a supervisor page, when CR4.PKS is set.
*/
- check_pkey = (!ff && pte_user);
+ check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
+
/*
- * write access is controlled by PKRU if it is a
- * user access or CR0.WP = 1.
+ * write access is controlled by PKRU/PKRS if
+ * it is a user access or CR0.WP = 1.
*/
check_write = check_pkey && wf && (uf || wp);

- /* PKRU.AD stops both read and write access. */
+ /* PKRU/PKRS.AD stops both read and write access. */
pkey_bits = !!check_pkey;
- /* PKRU.WD stops write access. */
+ /* PKRU/PKRS.WD stops write access. */
pkey_bits |= (!!check_write) << 1;

mmu->pkr_mask |= (pkey_bits & 3) << pfec;
@@ -4708,6 +4734,7 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
/* PKEY and LA57 are active iff long mode is active. */
ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
+ ext.cr4_pks = ____is_efer_lma(regs) && ____is_cr4_pks(regs);
ext.efer_lma = ____is_efer_lma(regs);
}

--
2.17.1

2022-02-21 09:38:42

by Chenyi Qiang

Subject: [PATCH v6 2/7] KVM: VMX: Add proper cache tracking for PKRS

Add PKRS caching into the standard register caching mechanism in order
to take advantage of the availability checks provided by regs_avail.

This is because vcpu->arch.pkrs will rarely be accessed by KVM, only in
the case of host userspace MSR reads and GVA->GPA translation in the
following patches. It is unnecessary to keep it up-to-date at all times.
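
For illustration, the intended on-demand usage (as added in a later
patch of this series) looks roughly like:

	u32 pkrs;

	/* VMREADs GUEST_IA32_PKRS only if VCPU_EXREG_PKRS is not cached */
	pkrs = kvm_read_pkrs(vcpu);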

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/kvm_cache_regs.h | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
arch/x86/kvm/vmx/vmx.h | 3 ++-
4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1384517d7709..75940aeb5f67 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -177,6 +177,7 @@ enum kvm_reg {
VCPU_EXREG_SEGMENTS,
VCPU_EXREG_EXIT_INFO_1,
VCPU_EXREG_EXIT_INFO_2,
+ VCPU_EXREG_PKRS,
};

enum {
@@ -632,6 +633,7 @@ struct kvm_vcpu_arch {
unsigned long cr8;
u32 host_pkru;
u32 pkru;
+ u32 pkrs;
u32 hflags;
u64 efer;
u64 apic_base;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 3febc342360c..2b2540ca584f 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -177,6 +177,13 @@ static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
| ((u64)(kvm_rdx_read(vcpu) & -1u) << 32);
}

+static inline u32 kvm_read_pkrs(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_register_is_available(vcpu, VCPU_EXREG_PKRS))
+ static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PKRS);
+ return vcpu->arch.pkrs;
+}
+
static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
{
vcpu->arch.hflags |= HF_GUEST_MASK;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4ac676066d60..0496afe786fa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2269,6 +2269,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;
+ u64 ia32_pkrs;

kvm_register_mark_available(vcpu, reg);

@@ -2303,6 +2304,16 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
vcpu->arch.cr4 &= ~guest_owned_bits;
vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
break;
+ case VCPU_EXREG_PKRS:
+ /*
+ * The high 32 bits of PKRS are reserved and attempting to write
+ * non-zero value will cause #GP. KVM intentionally drops those
+ * bits.
+ */
+ ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
+ WARN_ON_ONCE(ia32_pkrs >> 32);
+ vcpu->arch.pkrs = ia32_pkrs;
+ break;
default:
KVM_BUG_ON(1, vcpu->kvm);
break;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7f2c82e7f38f..da5e95a6694c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -500,7 +500,8 @@ BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
(1 << VCPU_EXREG_CR3) | \
(1 << VCPU_EXREG_CR4) | \
(1 << VCPU_EXREG_EXIT_INFO_1) | \
- (1 << VCPU_EXREG_EXIT_INFO_2))
+ (1 << VCPU_EXREG_EXIT_INFO_2) | \
+ (1 << VCPU_EXREG_PKRS))

static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
{
--
2.17.1

2022-02-21 09:41:33

by Chenyi Qiang

Subject: [PATCH v6 6/7] KVM: VMX: Expose PKS to guest

Existence of PKS is enumerated via CPUID.(EAX=7,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS is currently
only exposed when EPT is enabled, and it requires support for the
VM_{ENTRY,EXIT}_LOAD_IA32_PKRS VMCS controls.
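
For reference, a guest kernel would be expected to discover and enable
the feature roughly as follows (a minimal sketch using existing kernel
helpers; not part of this series):

	if (cpuid_ecx(7) & BIT(31)) {		/* CPUID.(EAX=7,ECX=0):ECX[31] = PKS */
		cr4_set_bits(X86_CR4_PKS);	/* valid only when long mode is active */
		wrmsrl(MSR_IA32_PKRS, 0);	/* all keys: access and write permitted */
	}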

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 13 +++++++++----
arch/x86/kvm/vmx/capabilities.h | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 14 +++++++++++---
arch/x86/kvm/x86.h | 2 ++
5 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5c53efe0012e..4de4b9c617e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -114,7 +114,8 @@
| X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
- | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+ | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+ | X86_CR4_PKS))

#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3902c28fb6cb..2457dadad7ef 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -547,18 +547,23 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | F(PKS)
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
kvm_cpu_cap_set(X86_FEATURE_LA57);

/*
- * PKU not yet implemented for shadow paging and requires OSPKE
- * to be set on the host. Clear it if that is not the case
+ * Protection Keys are not supported for shadow paging. PKU further
+ * requires OSPKE to be set on the host in order to use {RD,WR}PKRU to
+ * save/restore the guest's PKRU.
*/
- if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+ if (!tdp_enabled) {
kvm_cpu_cap_clear(X86_FEATURE_PKU);
+ kvm_cpu_cap_clear(X86_FEATURE_PKS);
+ } else if (!boot_cpu_has(X86_FEATURE_OSPKE)) {
+ kvm_cpu_cap_clear(X86_FEATURE_PKU);
+ }

kvm_cpu_cap_mask(CPUID_7_EDX,
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 959b59d13b5a..ab1868a9c177 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -105,6 +105,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
(vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
}

+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+ return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+ (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
static inline bool cpu_has_vmx_mpx(void)
{
return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3d5412b9481..6fc70ddeff5a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3270,7 +3270,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
}

/*
- * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+ * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
* hardware. To emulate this behavior, SMEP/SMAP/PKU needs
* to be manually disabled when guest switches to non-paging
* mode.
@@ -3278,10 +3278,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
* If !enable_unrestricted_guest, the CPU is always running
* with CR0.PG=1 and CR4 needs to be modified.
* If enable_unrestricted_guest, the CPU automatically
- * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+ * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
*/
if (!is_paging(vcpu))
- hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+ hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+ X86_CR4_PKS);
}

vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7454,6 +7455,13 @@ static __init void vmx_set_cpu_caps(void)

if (cpu_has_vmx_waitpkg())
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+ /*
+ * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
+ * don't expose the PKS as well.
+ */
+ if (cpu_has_load_ia32_pkrs())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
}

static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 8b752cebbefc..8de5ea9f8e25 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -497,6 +497,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
__reserved_bits |= X86_CR4_VMXE; \
if (!__cpu_has(__c, X86_FEATURE_PCID)) \
__reserved_bits |= X86_CR4_PCIDE; \
+ if (!__cpu_has(__c, X86_FEATURE_PKS)) \
+ __reserved_bits |= X86_CR4_PKS; \
__reserved_bits; \
})

--
2.17.1

2022-02-21 09:41:38

by Chenyi Qiang

Subject: [PATCH v6 4/7] KVM: MMU: Rename the pkru to pkr

PKRU represents the PKU register used in the protection key rights
check for user pages. Protection Keys for Supervisor Pages (PKS) extends
the protection key architecture to cover supervisor pages.

Rename the *pkru*-related variables and functions to *pkr*, which stands
for both PKRU and PKRS. This makes sense because PKS and PKU each have:
- a single control register (PKRS and PKRU, respectively)
- the same number of keys (16 in total)
- the same format in the control registers (Access and Write disable bits)

PKS and PKU can also share the same bitmap, pkr_mask, to cache the
conditions under which protection key checks are needed, because the
requirements for a PK restriction to cause a fault are almost the same;
they merely apply to different pages (supervisor vs. user pages).

Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu.h | 12 ++++++------
arch/x86/kvm/mmu/mmu.c | 10 +++++-----
3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 75940aeb5f67..c37cd23b6764 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -455,7 +455,7 @@ struct kvm_mmu {
* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
*/
- u32 pkru_mask;
+ u32 pkr_mask;

u64 *pae_root;
u64 *pml4_root;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9fbb2c8bbe2..9e216d205c8d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -275,8 +275,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
u32 errcode = PFERR_PRESENT_MASK;

WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
- if (unlikely(mmu->pkru_mask)) {
- u32 pkru_bits, offset;
+ if (unlikely(mmu->pkr_mask)) {
+ u32 pkr_bits, offset;

/*
* PKRU defines 32 bits, there are 16 domains and 2
@@ -284,15 +284,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
* index of the protection domain, so pte_pkey * 2 is
* is the index of the first bit for the domain.
*/
- pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+ pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;

/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) +
((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));

- pkru_bits &= mmu->pkru_mask >> offset;
- errcode |= -pkru_bits & PFERR_PK_MASK;
- fault |= (pkru_bits != 0);
+ pkr_bits &= mmu->pkr_mask >> offset;
+ errcode |= -pkr_bits & PFERR_PK_MASK;
+ fault |= (pkr_bits != 0);
}

return -(u32)fault & errcode;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 593093b52395..83597161b5f9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4620,12 +4620,12 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
* away both AD and WD. For all reads or if the last condition holds, WD
* only will be masked away.
*/
-static void update_pkru_bitmask(struct kvm_mmu *mmu)
+static void update_pkr_bitmask(struct kvm_mmu *mmu)
{
unsigned bit;
bool wp;

- mmu->pkru_mask = 0;
+ mmu->pkr_mask = 0;

if (!is_cr4_pke(mmu))
return;
@@ -4660,7 +4660,7 @@ static void update_pkru_bitmask(struct kvm_mmu *mmu)
/* PKRU.WD stops write access. */
pkey_bits |= (!!check_write) << 1;

- mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+ mmu->pkr_mask |= (pkey_bits & 3) << pfec;
}
}

@@ -4672,7 +4672,7 @@ static void reset_guest_paging_metadata(struct kvm_vcpu *vcpu,

reset_rsvds_bits_mask(vcpu, mmu);
update_permission_bitmask(mmu, false);
- update_pkru_bitmask(mmu);
+ update_pkr_bitmask(mmu);
}

static void paging64_init_context(struct kvm_mmu *context)
@@ -4946,7 +4946,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
context->direct_map = false;

update_permission_bitmask(context, true);
- context->pkru_mask = 0;
+ context->pkr_mask = 0;
reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
}
--
2.17.1

2022-02-21 09:48:27

by Chenyi Qiang

Subject: [PATCH v6 3/7] KVM: X86: Expose IA32_PKRS MSR

Protection Keys for Supervisor Pages (PKS) uses the IA32_PKRS MSR (PKRS)
at index 0x6E1 to allow software to manage supervisor key rights, i.e. it
can enforce additional permission checks besides the normal paging
protections via an MSR update, without TLB flushes, when permissions
change.

For performance reasons, interception of PKRS is disabled in KVM when
PKS is exposed to the guest, so that the guest can access PKRS without a
VM exit.

PKS introduces dedicated VMCS control fields to switch PKRS, which only
handle the restore part. Every VM exit saves the guest PKRS into the
guest-state area of the VMCS, but the current host value is never saved
automatically, on the expectation that the host won't change the MSR
often. Update the host's value in the VMCS manually if the MSR has been
changed by the kernel since the last time the VMCS was run.

Introduce a function get_current_pkrs() in arch/x86/mm/pkeys.c to export
the per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.

Signed-off-by: Chenyi Qiang <[email protected]>
---
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmx.c | 66 +++++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 10 ++++++-
arch/x86/kvm/x86.h | 6 ++++
arch/x86/mm/pkeys.c | 6 ++++
include/linux/pkeys.h | 6 ++++
7 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a816..ee37741b2b9d 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -42,6 +42,7 @@ struct vmcs_host_state {
#ifdef CONFIG_X86_64
u16 ds_sel, es_sel;
#endif
+ u32 pkrs;
};

struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0496afe786fa..b3d5412b9481 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -28,6 +28,7 @@
#include <linux/tboot.h>
#include <linux/trace_events.h>
#include <linux/entry-kvm.h>
+#include <linux/pkeys.h>

#include <asm/apic.h>
#include <asm/asm.h>
@@ -172,6 +173,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_CORE_C3_RESIDENCY,
MSR_CORE_C6_RESIDENCY,
MSR_CORE_C7_RESIDENCY,
+ MSR_IA32_PKRS,
};

/*
@@ -1121,6 +1123,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
#endif
unsigned long fs_base, gs_base;
u16 fs_sel, gs_sel;
+ u32 host_pkrs;
int i;

vmx->req_immediate_exit = false;
@@ -1156,6 +1159,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
*/
host_state->ldt_sel = kvm_read_ldt();

+ /*
+ * Update the host pkrs vmcs field before vcpu runs.
+ * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
+ * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+ * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
+ */
+ if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
+ host_pkrs = get_current_pkrs();
+ if (unlikely(host_pkrs != host_state->pkrs)) {
+ vmcs_write64(HOST_IA32_PKRS, host_pkrs);
+ host_state->pkrs = host_pkrs;
+ }
+ }
+
#ifdef CONFIG_X86_64
savesegment(ds, host_state->ds_sel);
savesegment(es, host_state->es_sel);
@@ -1912,6 +1929,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
+ case MSR_IA32_PKRS:
+ if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+ (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+ return 1;
+ msr_info->data = kvm_read_pkrs(vcpu);
+ break;
default:
find_uret_msr:
msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -2253,7 +2277,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
ret = kvm_set_msr_common(vcpu, msr_info);
break;
-
+ case MSR_IA32_PKRS:
+ if (!kvm_pkrs_valid(data))
+ return 1;
+ if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+ (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+ return 1;
+ vcpu->arch.pkrs = data;
+ kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
+ vmcs_write64(GUEST_IA32_PKRS, data);
+ break;
default:
find_uret_msr:
msr = vmx_find_uret_msr(vmx, msr_index);
@@ -2544,7 +2578,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
VM_EXIT_LOAD_IA32_EFER |
VM_EXIT_CLEAR_BNDCFGS |
VM_EXIT_PT_CONCEAL_PIP |
- VM_EXIT_CLEAR_IA32_RTIT_CTL;
+ VM_EXIT_CLEAR_IA32_RTIT_CTL |
+ VM_EXIT_LOAD_IA32_PKRS;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
&_vmexit_control) < 0)
return -EIO;
@@ -2568,7 +2603,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_BNDCFGS |
VM_ENTRY_PT_CONCEAL_PIP |
- VM_ENTRY_LOAD_IA32_RTIT_CTL;
+ VM_ENTRY_LOAD_IA32_RTIT_CTL |
+ VM_ENTRY_LOAD_IA32_PKRS;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
&_vmentry_control) < 0)
return -EIO;
@@ -4162,7 +4198,8 @@ static u32 vmx_vmentry_ctrl(void)
VM_ENTRY_LOAD_IA32_RTIT_CTL);
/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
return vmentry_ctrl &
- ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER);
+ ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER |
+ VM_ENTRY_LOAD_IA32_PKRS);
}

static u32 vmx_vmexit_ctrl(void)
@@ -4174,7 +4211,8 @@ static u32 vmx_vmexit_ctrl(void)
VM_EXIT_CLEAR_IA32_RTIT_CTL);
/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
return vmexit_ctrl &
- ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+ ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER |
+ VM_EXIT_LOAD_IA32_PKRS);
}

static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -5887,6 +5925,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
pr_err("BndCfgS = 0x%016llx\n", vmcs_read64(GUEST_BNDCFGS));
+ if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PKRS)
+ pr_err("PKRS = 0x%016llx\n", vmcs_read64(GUEST_IA32_PKRS));
pr_err("Interruptibility = %08x ActivityState = %08x\n",
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO),
vmcs_read32(GUEST_ACTIVITY_STATE));
@@ -5928,6 +5968,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
if (vmcs_read32(VM_EXIT_MSR_LOAD_COUNT) > 0)
vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
+ if (vmexit_ctl & VM_EXIT_LOAD_IA32_PKRS)
+ pr_err("PKRS = 0x%016llx\n", vmcs_read64(HOST_IA32_PKRS));

pr_err("*** Control State ***\n");
pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
@@ -7357,6 +7399,20 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
+ if (guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
+ vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+
+ vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+ vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+ } else {
+ vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+
+ vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+ vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+ }
+ }
}

static __init void vmx_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index da5e95a6694c..d704ba3a4af7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -352,7 +352,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;

/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e43d756312f..8e61373d89d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1356,7 +1356,7 @@ static const u32 msrs_to_save_all[] = {
MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
- MSR_IA32_UMWAIT_CONTROL,
+ MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
MSR_ARCH_PERFMON_FIXED_CTR0 + 2,
@@ -6506,6 +6506,10 @@ static void kvm_init_msr_list(void)
intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
continue;
break;
+ case MSR_IA32_PKRS:
+ if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
+ continue;
+ break;
case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
@@ -11233,6 +11237,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
kvm_rip_write(vcpu, 0xfff0);

+ /* PKRS is preserved on INIT */
+ if (!init_event && kvm_cpu_cap_has(X86_FEATURE_PKS))
+ __kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
+
vcpu->arch.cr3 = 0;
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 635b75f9e145..8b752cebbefc 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -452,6 +452,12 @@ static inline void kvm_machine_check(void)
#endif
}

+static inline bool kvm_pkrs_valid(u64 data)
+{
+ /* bits[63:32] must be zero */
+ return !(data >> 32);
+}
+
void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
int kvm_spec_ctrl_test_value(u64 value);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 50cbb65439a9..4396c4be18cb 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -490,4 +490,10 @@ void pks_update_exception(struct pt_regs *regs, int pkey, u32 protection)
}
EXPORT_SYMBOL_GPL(pks_update_exception);

+u32 get_current_pkrs(void)
+{
+ return this_cpu_read(pkrs_cache);
+}
+EXPORT_SYMBOL_GPL(get_current_pkrs);
+
#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index a642c875a04e..6915b43e2ffc 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -98,6 +98,8 @@ static inline void pks_mk_readwrite(int pkey)
typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
bool write);

+u32 get_current_pkrs(void);
+
#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

static inline bool pks_available(void)
@@ -112,6 +114,10 @@ static inline void pks_update_exception(struct pt_regs *regs,
int pkey,
u32 protection)
{ }
+static inline u32 get_current_pkrs(void)
+{
+ return 0;
+}

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

--
2.17.1

2022-03-07 08:11:05

by Chenyi Qiang

Subject: Re: [PATCH v6 0/7] KVM: PKS Virtualization support

Kindly ping for comments.


2022-03-31 03:00:09

by Sean Christopherson

Subject: Re: [PATCH v6 6/7] KVM: VMX: Expose PKS to guest

On Mon, Feb 21, 2022, Chenyi Qiang wrote:
> @@ -7454,6 +7455,13 @@ static __init void vmx_set_cpu_caps(void)
>
> if (cpu_has_vmx_waitpkg())
> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> + /*
> + * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
> + * don't expose the PKS as well.
> + */

I wouldn't bother with the comment, this pattern is common and the behavior is
self-explanatory.

> + if (cpu_has_load_ia32_pkrs())
> + kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
> }

2022-03-31 03:00:46

by Sean Christopherson

Subject: Re: [PATCH v6 1/7] KVM: VMX: Introduce PKS VMCS fields

On Mon, Feb 21, 2022, Chenyi Qiang wrote:
> PKS (Protection Keys for Supervisor Pages) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
>
> A new PKS MSR (PKRS) is defined in the kernel to support PKS, which holds
> a set of permissions associated with each protection domain.
>
> Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
> {host,guest}-state area to store the respective values of PKRS.
>
> Every VM exit saves PKRS into guest-state area.
> If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
> area.
> If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
> area.
>
> Signed-off-by: Chenyi Qiang <[email protected]>
> Reviewed-by: Jim Mattson <[email protected]>
> ---

Reviewed-by: Sean Christopherson <[email protected]>

2022-03-31 04:17:41

by Sean Christopherson

Subject: Re: [PATCH v6 5/7] KVM: MMU: Add support for PKS emulation

On Mon, Feb 21, 2022, Chenyi Qiang wrote:
> @@ -277,14 +278,18 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
> if (unlikely(mmu->pkr_mask)) {
> u32 pkr_bits, offset;
> + u32 pkr;
>
> /*
> - * PKRU defines 32 bits, there are 16 domains and 2
> - * attribute bits per domain in pkru. pte_pkey is the
> - * index of the protection domain, so pte_pkey * 2 is
> - * is the index of the first bit for the domain.
> + * PKRU and PKRS both define 32 bits. There are 16 domains
> + * and 2 attribute bits per domain in them. pte_pkey is the
> + * index of the protection domain, so pte_pkey * 2 is the
> + * index of the first bit for the domain. The use of PKRU
> + * versus PKRS is selected by the address type, as determined
> + * by the U/S bit in the paging-structure entries.
> */
> - pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> + pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);

Blindly reading PKRU/PKRS is wrong. I think this magic insanity will be functionally
correct due to update_pkr_bitmask() clearing the appropriate bits in pkr_mask based
on CR4.PK*, but the read should never happen. PKRU is benign, but I believe reading
PKRS will result in VMREAD to an invalid field if PKRU is supported and enabled, but
PKRS is not supported.

I believe the easiest solution is:

	if (pte_access & PT_USER_MASK)
		pkr = is_cr4_pke(mmu) ? vcpu->arch.pkru : 0;
	else
		pkr = is_cr4_pks(mmu) ? kvm_read_pkrs(vcpu) : 0;

The is_cr4_pk*() helpers are restricted to mmu.c, but this presents a good
opportunity to extract the PKR stuff to a separate, non-inline helper (as a prep
patch). E.g.


	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
	if (unlikely(mmu->pkr_mask)) {
		u32 pkr_bits = kvm_mmu_pkr_bits(vcpu, mmu, pte_access, pte_pkey);

		errcode |= -pkr_bits & PFERR_PK_MASK;
		fault |= (pkr_bits != 0);
	}

	return -(u32)fault & errcode;

permission_fault() is inline because it's heavily used for shadow paging, but
when using TDP, it's far less performance critical. PKR is TDP-only, so moving
it out-of-line should be totally ok (this is also why this patch is "unlikely").

2022-03-31 04:32:49

by Sean Christopherson

Subject: Re: [PATCH v6 2/7] KVM: VMX: Add proper cache tracking for PKRS

On Mon, Feb 21, 2022, Chenyi Qiang wrote:
> Add PKRS caching into the standard register caching mechanism in order
> to take advantage of the availability checks provided by regs_avail.
>
> This is because vcpu->arch.pkrs will rarely be accessed by KVM, only in
> the case of host userspace MSR reads and GVA->GPA translation in the
> following patches. It is unnecessary to keep it up-to-date at all times.

It might be worth throwing in a blurb that the potential benefits of this caching
are tenous.

Barring userspace weirdness, the MSR read is not a hot path.

permission_fault() is slightly more common, but I would be surprised if caching
actually provides meaningful performance benefit. The PKRS checks are done only
once per virtual access, i.e. only on the final translation, so the cache will get
a hit if and only if there are multiple translations in a single round of emulation,
where a "round of emulation" ends upon entry to the guest. With unrestricted
guest, i.e. for all intents and purposes every VM using PKRS, there aren't _that_
many scenarios where KVM will (a) emulate in the first place and (b) emulate enough
accesses for the caching to be meaningful.

That said, this is basically "free", so I've no objection to adding it. But I do
think it's worth documenting that it's nice-to-have so that we don't hesitate to
rip it out in the future if there's a strong reason to drop the caching.

> Signed-off-by: Chenyi Qiang <[email protected]>
> ---

Reviewed-by: Sean Christopherson <[email protected]>

2022-03-31 05:57:31

by Chenyi Qiang

Subject: Re: [PATCH v6 2/7] KVM: VMX: Add proper cache tracking for PKRS



On 3/31/2022 4:42 AM, Sean Christopherson wrote:
> On Mon, Feb 21, 2022, Chenyi Qiang wrote:
>> Add PKRS caching into the standard register caching mechanism in order
>> to take advantage of the availability checks provided by regs_avail.
>>
>> This is because vcpu->arch.pkrs will rarely be accessed by KVM, only in
>> the case of host userspace MSR reads and GVA->GPA translation in the
>> following patches. It is unnecessary to keep it up-to-date at all times.
>
> It might be worth throwing in a blurb that the potential benefits of this caching
> are tenous.
>
> Barring userspace weirdness, the MSR read is not a hot path.
>
> permission_fault() is slightly more common, but I would be surprised if caching
> actually provides meaningful performance benefit. The PKRS checks are done only
> once per virtual access, i.e. only on the final translation, so the cache will get
> a hit if and only if there are multiple translations in a single round of emulation,
> where a "round of emulation" ends upon entry to the guest. With unrestricted
> guest, i.e. for all intents and purposes every VM using PKRS, there aren't _that_
> many scenarios where KVM will (a) emulate in the first place and (b) emulate enough
> accesses for the caching to be meaningful.
>
> That said, this is basically "free", so I've no objection to adding it. But I do
> think it's worth documenting that it's nice-to-have so that we don't hesitate to
> rip it out in the future if there's a strong reason to drop the caching.
>

OK, will add this note in commit message.

>> Signed-off-by: Chenyi Qiang <[email protected]>
>> ---
>
> Reviewed-by: Sean Christopherson <[email protected]>

2022-03-31 05:59:50

by Chenyi Qiang

Subject: Re: [PATCH v6 5/7] KVM: MMU: Add support for PKS emulation



On 3/31/2022 5:27 AM, Sean Christopherson wrote:
> On Mon, Feb 21, 2022, Chenyi Qiang wrote:
>> @@ -277,14 +278,18 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>> WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>> if (unlikely(mmu->pkr_mask)) {
>> u32 pkr_bits, offset;
>> + u32 pkr;
>>
>> /*
>> - * PKRU defines 32 bits, there are 16 domains and 2
>> - * attribute bits per domain in pkru. pte_pkey is the
>> - * index of the protection domain, so pte_pkey * 2 is
>> - * is the index of the first bit for the domain.
>> + * PKRU and PKRS both define 32 bits. There are 16 domains
>> + * and 2 attribute bits per domain in them. pte_pkey is the
>> + * index of the protection domain, so pte_pkey * 2 is the
>> + * index of the first bit for the domain. The use of PKRU
>> + * versus PKRS is selected by the address type, as determined
>> + * by the U/S bit in the paging-structure entries.
>> */
>> - pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>> + pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);
>
> Blindly reading PKRU/PKRS is wrong. I think this magic insanity will be functionally
> correct due to update_pkr_bitmask() clearing the appropriate bits in pkr_mask based
> on CR4.PK*, but the read should never happen. PKRU is benign, but I believe reading
> PKRS will result in VMREAD to an invalid field if PKRU is supported and enabled, but
> PKRS is not supported.
>

Nice catch.

> I believe the easiest solution is:
>
>	if (pte_access & PT_USER_MASK)
>		pkr = is_cr4_pke(mmu) ? vcpu->arch.pkru : 0;
>	else
>		pkr = is_cr4_pks(mmu) ? kvm_read_pkrs(vcpu) : 0;
>
> The is_cr4_pk*() helpers are restricted to mmu.c, but this presents a good
> opportunity to extract the PKR stuff to a separate, non-inline helper (as a prep
> patch). E.g.
>
>
>	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>	if (unlikely(mmu->pkr_mask)) {
>		u32 pkr_bits = kvm_mmu_pkr_bits(vcpu, mmu, pte_access, pte_pkey);
>
>		errcode |= -pkr_bits & PFERR_PK_MASK;
>		fault |= (pkr_bits != 0);
>	}
>
>	return -(u32)fault & errcode;
>
> permission_fault() is inline because it's heavily used for shadow paging, but
> when using TDP, it's far less performance critical. PKR is TDP-only, so moving
> it out-of-line should be totally ok (this is also why this patch is "unlikely").

Makes sense, will do it.