2021-06-23 23:07:27

by Sean Christopherson

Subject: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

A few fixes centered around enumerating guest MAXPHYADDR and handling the
C-bit in KVM.

DISCLAIMER: I have no idea if patch 03, "Truncate reported guest
MAXPHYADDR to C-bit if SEV is supported", is architecturally correct.
The APM says
the following about the C-bit in the context of SEV, but I can't for the
life of me find anything in the APM that clarifies whether "effectively
reduced" is supposed to apply to _only_ SEV guests, or any guest on an
SEV enabled platform.

Note that because guest physical addresses are always translated through
the nested page tables, the size of the guest physical address space is
not impacted by any physical address space reduction indicated in
CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
the guest physical address space is effectively reduced by 1 bit.

In practice, I have observed that Rome CPUs treat the C-bit as reserved for
non-SEV guests (another disclaimer on this below). Long story short, commit
ef4c9f4f6546 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
exposed the issue by inadvertently causing selftests to start using GPAs
with bit 47 set.
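
(For context, bit 47 is the C-bit location on the Rome systems in
question; the guest PTE dump in patch 03 shows KVM stripping bit 47 when
computing the final GPA, so GPAs with bit 47 set collide with the C-bit.)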

That said, regardless of whether or not the behavior is intended, it needs
to be addressed by KVM. I think the only difference is whether this is
KVM's _only_ behavior, or whether it's gated by an erratum flag.

The second disclaimer is that I haven't tested with memory encryption
disabled in hardware. I wrote the patch assuming/hoping that only CPUs
that report SEV=1 treat the C-bit as reserved, but I haven't actually
tested the SEV=0 case on e.g. CPUs with only SME (we might have these
platforms, but I've no idea how to access/find them), or CPUs with SME/SEV
disabled in BIOS (again, I've no idea how to do this with our BIOS).

Sean Christopherson (7):
KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is
enabled
KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
KVM: x86: Truncate reported guest MAXPHYADDR to C-bit if SEV is
supported
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro
KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h
KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs

arch/x86/kvm/cpuid.c | 38 +++++++++++++++++---
arch/x86/kvm/mmu.h | 11 ++----
arch/x86/kvm/mmu/mmu.c | 63 ++++++++-------------------------
arch/x86/kvm/mmu/mmu_audit.c | 6 ++--
arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++
arch/x86/kvm/mmu/paging_tmpl.h | 52 ++++++++++++++++++++++++++-
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 34 +++++++-----------
arch/x86/kvm/mmu/tdp_iter.c | 6 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/svm/svm.c | 37 ++++++++++++++-----
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 3 ++
arch/x86/kvm/x86.h | 1 +
14 files changed, 170 insertions(+), 101 deletions(-)

--
2.32.0.288.g62a8d224e6-goog


2021-06-23 23:07:38

by Sean Christopherson

Subject: [PATCH 2/7] KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR

Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to
enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest
version is only relevant to NPT/TDP).

When using shadow paging, any reductions to the host's MAXPHYADDR apply
to KVM and its guests as well, i.e. using the raw CPUID info will cause
KVM to misreport the number of PA bits available to the guest.

Unconditionally zero out the "Physical Address bit reduction" field.
For !TDP, the adjustment is already done, and for TDP, enumerating the
host's reduction is wrong as the reduction does not apply to GPAs.
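
As a concrete example, with numbers modeled on an SME-enabled Rome host
(illustrative, not exhaustive):

  raw CPUID 0x80000008.EAX[7:0]  = 48   (bare metal MAXPHYADDR)
  CPUID 0x8000001F.EBX[11:6]     = 5    (PA bit reduction)
  boot_cpu_data.x86_phys_bits    = 43   (adjusted MAXPHYADDR)

With TDP disabled, KVM now reports 43 to the guest; with TDP enabled and
no explicit guest MAXPHYADDR, KVM reports the raw 48, before patch 03's
C-bit truncation is applied.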

Fixes: 9af9b94068fb ("x86/cpu/AMD: Handle SME reduction in physical address size")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/cpuid.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4b2f8c6b41e8..28878671d648 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -941,11 +941,18 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
unsigned phys_as = entry->eax & 0xff;

/*
- * Use bare metal's MAXPHADDR if the CPU doesn't report guest
- * MAXPHYADDR separately, or if TDP (NPT) is disabled, as the
- * guest version "applies only to guests using nested paging".
+ * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as
+ * the guest operates in the same PA space as the host, i.e.
+ * reductions in MAXPHYADDR for memory encryption affect shadow
+ * paging, too.
+ *
+ * If TDP is enabled but an explicit guest MAXPHYADDR is not
+ * provided, use the raw bare metal MAXPHYADDR as reductions to
+ * the HPAs do not affect GPAs.
*/
- if (!g_phys_as || !tdp_enabled)
+ if (!tdp_enabled)
+ g_phys_as = boot_cpu_data.x86_phys_bits;
+ else if (!g_phys_as)
g_phys_as = phys_as;

entry->eax = g_phys_as | (virt_as << 8);
@@ -970,12 +977,18 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
case 0x8000001a:
case 0x8000001e:
break;
- /* Support memory encryption cpuid if host supports it */
case 0x8000001F:
- if (!kvm_cpu_cap_has(X86_FEATURE_SEV))
+ if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
- else
+ } else {
cpuid_entry_override(entry, CPUID_8000_001F_EAX);
+
+ /*
+ * Enumerate '0' for "PA bits reduction", the adjusted
+ * MAXPHYADDR is enumerated directly (see 0x80000008).
+ */
+ entry->ebx &= ~GENMASK(11, 6);
+ }
break;
/*Add support for Centaur's CPUID instruction*/
case 0xC0000000:
--
2.32.0.288.g62a8d224e6-goog

2021-06-23 23:07:40

by Sean Christopherson

Subject: [PATCH 3/7] KVM: x86: Truncate reported guest MAXPHYADDR to C-bit if SEV is supported

Enumerate "C-bit - 1" as the MAXPHYADDR for KVM_GET_SUPPORTED_CPUID if
NPT is enabled, SEV is supported in hardware, and the C-bit overlaps
the legal guest physical address space reported by hardware. I.e.
advertise the C-bit and any PA bits above the C-bit as reserved.
Reuse svm_adjust_mmio_mask()'s C-bit retrieval logic, but
opportunistically switch to the "safe" RDMSR, e.g. the MSR may not be
emulated if KVM is running nested.
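
Concretely, a sketch of the resulting enumeration with Rome-like values
(raw guest MAXPHYADDR = 48, C-bit at bit 47):

  g_phys_as = min(48, 47) = 47

i.e. GPA bits 46:0 are legal, and bit 47 (the C-bit) and everything
above it are advertised as reserved.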

AMD's APM is a bit ambiguous as to the behavior of the C-bit for non-SEV
guests running on SEV-capable hardware. In "15.34.6 Page Table Support",
under "15.34 Secure Encrypted Virtualization", the APM does state that
the guest MAXPHYADDR is not reduced unless the C-bit is stolen from the
legal physical address space:

Note that because guest physical addresses are always translated
through the nested page tables, the size of the guest physical address
space is not impacted by any physical address space reduction indicated
in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
the guest physical address space is effectively reduced by 1 bit.

What it doesn't clarify is whether or not that behavior applies to
non-SEV guests, i.e. whether the C-bit is reserved or a legal GPA bit.

Regardless of what the intended behavior is (more on this later), KVM is
broken because it treats the C-bit as ignored for non-SEV. At first
blush, it would appear that KVM treats the C-bit as a legal GPA bit for
non-SEV guests, but the C-bit is stripped for host _and_ guest page
tables, as PT64_BASE_ADDR_MASK is defined to incorporate the adjusted
physical_mask (CONFIG_DYNAMIC_PHYSICAL_MASK is forced for SME, and the
C-bit is explicitly cleared).

#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
#else

As a result, KVM reports the full guest MAXPHYADDR even though it
(silently) drops the C-bit.

If the APM's intent is that the C-bit is a GPA bit for non-SEV guests,
then the correct fix would be to split PT64_BASE_ADDR_MASK into host
and guest masks, with the guest mask being dynamic since SEV guests
would still need to drop the C-bit.

But, on AMD EPYC 7B12, a.k.a. Rome, the C-bit is reserved for non-SEV
guests. E.g. the demand_paging_test selftest hits an unexpected #PF when
running with NPT due to using the highest possible GPAs. A dump of the
#PF and guest page tables (via #PF interception) shows that the CPU
generates a reserved #PF, and the test passes if it clears the C-bit.
The same dump also captures KVM's clearing of the C-bit in its final GPA
calculation.

SVM: KVM: CPU #PF @ rip = 0x409ca4, cr2 = 0xc0000000, pfec = 0xb
KVM: guest PTE = 0x181023 @ GPA = 0x180000, level = 4
KVM: guest PTE = 0x186023 @ GPA = 0x181000, level = 3
KVM: guest PTE = 0x187023 @ GPA = 0x186000, level = 2
KVM: guest PTE = 0xffffbffff003 @ GPA = 0x187000, level = 1
SVM: KVM: GPA = 0x7fffbffff000
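
For reference, pfec = 0xb has the RSVD bit (bit 3) set along with P and
W, i.e. the reserved #PF noted above, and the level = 1 gpte has bit 47,
the C-bit, set; the final line shows KVM silently clearing that bit
(0xffffbffff000 -> 0x7fffbffff000).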

Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: [email protected]
Cc: Peter Gonda <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/cpuid.c | 11 +++++++++++
arch/x86/kvm/svm/svm.c | 37 +++++++++++++++++++++++++++++--------
arch/x86/kvm/x86.c | 3 +++
arch/x86/kvm/x86.h | 1 +
4 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 28878671d648..e05a9eb0dd03 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -955,6 +955,17 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
else if (!g_phys_as)
g_phys_as = phys_as;

+ /*
+ * The exception to the exception is if hardware supports SEV,
+ * in which case the C-bit is reserved for non-SEV guests and
+ * isn't a GPA bit for SEV guests.
+ *
+ * Note, KVM always reports '0' for the number of reduced PA
+ * bits (see 0x8000001F).
+ */
+ if (tdp_enabled && sev_c_bit)
+ g_phys_as = min(g_phys_as, (unsigned int)sev_c_bit);
+
entry->eax = g_phys_as | (virt_as << 8);
entry->edx = 0;
cpuid_entry_override(entry, CPUID_8000_0008_EBX);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 12c06ea28f5c..2549e80abf05 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -860,6 +860,26 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
}
}

+static __init u8 svm_get_c_bit(bool sev_only)
+{
+ unsigned int eax, ebx, ecx, edx;
+ u64 msr;
+
+ if (cpuid_eax(0x80000000) < 0x8000001f)
+ return 0;
+
+ if (rdmsrl_safe(MSR_K8_SYSCFG, &msr) ||
+ !(msr & MSR_K8_SYSCFG_MEM_ENCRYPT))
+ return 0;
+
+ cpuid(0x8000001f, &eax, &ebx, &ecx, &edx);
+
+ if (sev_only && !(eax & feature_bit(SEV)))
+ return 0;
+
+ return ebx & 0x3f;
+}
+
/*
* The default MMIO mask is a single bit (excluding the present bit),
* which could conflict with the memory encryption bit. Check for
@@ -869,18 +889,13 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
static __init void svm_adjust_mmio_mask(void)
{
unsigned int enc_bit, mask_bit;
- u64 msr, mask;
-
- /* If there is no memory encryption support, use existing mask */
- if (cpuid_eax(0x80000000) < 0x8000001f)
- return;
+ u64 mask;

/* If memory encryption is not enabled, use existing mask */
- rdmsrl(MSR_K8_SYSCFG, msr);
- if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT))
+ enc_bit = svm_get_c_bit(false);
+ if (!enc_bit)
return;

- enc_bit = cpuid_ebx(0x8000001f) & 0x3f;
mask_bit = boot_cpu_data.x86_phys_bits;

/* Increment the mask bit if it is the same as the encryption bit */
@@ -1013,6 +1028,12 @@ static __init int svm_hardware_setup(void)
kvm_configure_mmu(npt_enabled, get_max_npt_level(), PG_LEVEL_1G);
pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis");

+ /*
+ * The SEV C-bit location is needed to correctly enumerate guest
+ * MAXPHYADDR even if SEV is not fully supported.
+ */
+ sev_c_bit = svm_get_c_bit(true);
+
/* Note, SEV setup consumes npt_enabled. */
sev_hardware_setup();

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76dae88cf524..4479e67e5727 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -223,6 +223,9 @@ EXPORT_SYMBOL_GPL(host_xss);
u64 __read_mostly supported_xss;
EXPORT_SYMBOL_GPL(supported_xss);

+u8 __read_mostly sev_c_bit;
+EXPORT_SYMBOL_GPL(sev_c_bit);
+
struct kvm_stats_debugfs_item debugfs_entries[] = {
VCPU_STAT("pf_fixed", pf_fixed),
VCPU_STAT("pf_guest", pf_guest),
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 521f74e5bbf2..a80cc63039c3 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -338,6 +338,7 @@ extern u64 host_xcr0;
extern u64 supported_xcr0;
extern u64 host_xss;
extern u64 supported_xss;
+extern u8 sev_c_bit;

static inline bool kvm_mpx_supported(void)
{
--
2.32.0.288.g62a8d224e6-goog

2021-06-23 23:07:43

by Sean Christopherson

Subject: [PATCH 4/7] KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs

Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".
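
A minimal sketch of the two masks in play, assuming SME steals bit 47
(and so physical_mask drops it on the host side):

  guest PTE mask (bits 51:12)      = 0x000ffffffffff000
  host PTE mask (bit 47 cleared)   = 0x000f7ffffffff000

Using the host mask to extract GPAs from guest PTEs would silently drop
bit 47 from every translation, which is exactly the bug being fixed.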

For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.

For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read an SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.

Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: [email protected]
Cc: Brijesh Singh <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/paging_tmpl.h | 17 +++++++++++++++--
arch/x86/kvm/mmu/spte.h | 6 ------
2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 823a5919f9fa..9df7e4b315a1 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -20,11 +20,24 @@
* so the code in this file is compiled twice, once per pte size.
*/

+/* Shadow paging constants/helpers that don't need to be #undef'd. */
+#ifndef __KVM_X86_PAGING_TMPL_COMMON_H
+#define __KVM_X86_PAGING_TMPL_COMMON_H
+
+#define GUEST_PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#define PT64_LVL_ADDR_MASK(level) \
+ (GUEST_PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
+ * PT64_LEVEL_BITS))) - 1))
+#define PT64_LVL_OFFSET_MASK(level) \
+ (GUEST_PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
+ * PT64_LEVEL_BITS))) - 1))
+#endif /* __KVM_X86_PAGING_TMPL_COMMON_H */
+
#if PTTYPE == 64
#define pt_element_t u64
#define guest_walker guest_walker64
#define FNAME(name) paging##64_##name
- #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
@@ -57,7 +70,7 @@
#define pt_element_t u64
#define guest_walker guest_walkerEPT
#define FNAME(name) ept_##name
- #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bca0ba11cccf..6925dfc38981 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -38,12 +38,6 @@ static_assert(SPTE_TDP_AD_ENABLED_MASK == 0);
#else
#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
#endif
-#define PT64_LVL_ADDR_MASK(level) \
- (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
-#define PT64_LVL_OFFSET_MASK(level) \
- (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))

#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
| shadow_x_mask | shadow_nx_mask | shadow_me_mask)
--
2.32.0.288.g62a8d224e6-goog

2021-06-23 23:08:26

by Sean Christopherson

Subject: [PATCH 5/7] KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro

Compute the number of PTEs to be filled for the 32-bit PSE page tables
using the page size and the size of each entry. While using the MMU's
PT32_ENT_PER_PAGE macro is arguably better in isolation, removing VMX's
usage will allow a future namespacing cleanup to move the guest page
table macros into paging_tmpl.h, out of the reach of code that isn't
directly related to shadow paging.
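
For the 32-bit PSE page table being built, each entry (tmp, a u32) is 4
bytes, so the computation works out to PAGE_SIZE / sizeof(tmp) =
4096 / 4 = 1024, matching PT32_ENT_PER_PAGE (1 << 10).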

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ab6f682645d7..f6fa922ca6e3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3584,7 +3584,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
}

/* Set up identity-mapping pagetable for EPT in real mode */
- for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
+ for (i = 0; i < (PAGE_SIZE / sizeof(tmp)); i++) {
tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
if (__copy_to_user(uaddr + i * sizeof(tmp), &tmp, sizeof(tmp))) {
--
2.32.0.288.g62a8d224e6-goog

2021-06-23 23:08:45

by Sean Christopherson

Subject: [PATCH 6/7] KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h

Move a handful of one-off macros and helpers for 32-bit PSE paging into
paging_tmpl.h. Under no circumstance should anything but shadow paging
care about 32-bit PSE paging.
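
As a refresher on the helpers being moved: per the PT32_DIR_PSE36_*
macros, PSE36 PDEs stash physical address bits 35:32 in PTE bits 16:13,
so pse36_gfn_delta() shifts by 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT =
32 - 13 - 12 = 7, landing those bits at gfn bits 23:20, i.e. PA bits
35:32 once the gfn is shifted back up by PAGE_SHIFT.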

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu.h | 5 -----
arch/x86/kvm/mmu/mmu.c | 13 -------------
arch/x86/kvm/mmu/paging_tmpl.h | 18 ++++++++++++++++++
3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bc11402df83b..2b9d08b080cc 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -34,11 +34,6 @@
#define PT_DIR_PAT_SHIFT 12
#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)

-#define PT32_DIR_PSE36_SIZE 4
-#define PT32_DIR_PSE36_SHIFT 13
-#define PT32_DIR_PSE36_MASK \
- (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
-
#define PT64_ROOT_5LEVEL 5
#define PT64_ROOT_4LEVEL 4
#define PT32_ROOT_LEVEL 2
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 84d48a33e38b..ef92717bff86 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -259,23 +259,10 @@ static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
return gpa;
}

-static int is_cpuid_PSE36(void)
-{
- return 1;
-}
-
static int is_nx(struct kvm_vcpu *vcpu)
{
return vcpu->arch.efer & EFER_NX;
}
-
-static gfn_t pse36_gfn_delta(u32 gpte)
-{
- int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
-
- return (gpte & PT32_DIR_PSE36_MASK) << shift;
-}
-
#ifdef CONFIG_X86_64
static void __set_spte(u64 *sptep, u64 spte)
{
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 9df7e4b315a1..a2dbea70ffda 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -31,6 +31,24 @@
#define PT64_LVL_OFFSET_MASK(level) \
(GUEST_PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
* PT64_LEVEL_BITS))) - 1))
+
+#define PT32_DIR_PSE36_SIZE 4
+#define PT32_DIR_PSE36_SHIFT 13
+#define PT32_DIR_PSE36_MASK \
+ (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
+
+static inline int is_cpuid_PSE36(void)
+{
+ return 1;
+}
+
+static inline gfn_t pse36_gfn_delta(u32 gpte)
+{
+ int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
+
+ return (gpte & PT32_DIR_PSE36_MASK) << shift;
+}
+
#endif /* __KVM_X86_PAGING_TMPL_COMMON_H */

#if PTTYPE == 64
--
2.32.0.288.g62a8d224e6-goog

2021-06-23 23:10:47

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 7/7] KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs

Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
(PT64). SPTE and PT64 are _mostly_ the same, but the few differences are
quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
very much specific to SPTEs.

Add helper macros to deduplicate the 32-bit vs. 64-bit code, and to avoid
additional duplication for SPTEs.

Opportunistically move most guest macros into paging_tmpl.h to clearly
associate them with shadow paging, and to make it more difficult to
unintentionally use a guest macro in the MMU. Sadly, PT32_LEVEL_BITS is
left behind in mmu.h because it's needed for the quadrant calculation in
kvm_mmu_get_page(), which is hot enough that adding a per-context helper
is undesirable, and burying the computation in paging_tmpl.h with a
forward declaration isn't exactly an improvement.
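
For reference, the quadrant calculation relies on 32-bit guest tables
holding 1024 entries (PT32_LEVEL_BITS = 10) while shadow pages hold 512
(SPTE_LEVEL_BITS = 9): at level 1 each guest table is covered by
(1 << ((10 - 9) * 1)) = 2 shadow pages, and at level 2 by 4, hence the
role.quadrant index.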

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu.h | 8 ++----
arch/x86/kvm/mmu/mmu.c | 50 ++++++++++-----------------------
arch/x86/kvm/mmu/mmu_audit.c | 6 ++--
arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++++
arch/x86/kvm/mmu/paging_tmpl.h | 35 +++++++++++++++++------
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 28 ++++++++----------
arch/x86/kvm/mmu/tdp_iter.c | 6 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
9 files changed, 79 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2b9d08b080cc..0199c8c2222d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -6,11 +6,6 @@
#include "kvm_cache_regs.h"
#include "cpuid.h"

-#define PT64_PT_BITS 9
-#define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
-#define PT32_PT_BITS 10
-#define PT32_ENT_PER_PAGE (1 << PT32_PT_BITS)
-
#define PT_WRITABLE_SHIFT 1
#define PT_USER_SHIFT 2

@@ -34,6 +29,9 @@
#define PT_DIR_PAT_SHIFT 12
#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)

+/* The number of bits for 32-bit PTEs is needed to compute the quadrant. :-( */
+#define PT32_LEVEL_BITS 10
+
#define PT64_ROOT_5LEVEL 5
#define PT64_ROOT_4LEVEL 4
#define PT32_ROOT_LEVEL 2
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef92717bff86..cc93649f41cb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -113,26 +113,6 @@ module_param(dbg, bool, 0644);

#define PTE_PREFETCH_NUM 8

-#define PT32_LEVEL_BITS 10
-
-#define PT32_LEVEL_SHIFT(level) \
- (PAGE_SHIFT + (level - 1) * PT32_LEVEL_BITS)
-
-#define PT32_LVL_OFFSET_MASK(level) \
- (PT32_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT32_LEVEL_BITS))) - 1))
-
-#define PT32_INDEX(address, level)\
- (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
-
-
-#define PT32_BASE_ADDR_MASK PAGE_MASK
-#define PT32_DIR_BASE_ADDR_MASK \
- (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
-#define PT32_LVL_ADDR_MASK(level) \
- (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT32_LEVEL_BITS))) - 1))
-
#include <trace/events/kvm.h>

/* make pte_list_desc fit well in cache line */
@@ -675,7 +655,7 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
if (!sp->role.direct)
return sp->gfns[index];

- return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
+ return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
}

static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
@@ -1706,7 +1686,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
continue;
}

- child = to_shadow_page(ent & PT64_BASE_ADDR_MASK);
+ child = to_shadow_page(ent & SPTE_BASE_ADDR_MASK);

if (child->unsync_children) {
if (mmu_pages_add(pvec, child, i))
@@ -1989,8 +1969,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.gpte_is_8_bytes = true;
role.access = access;
if (!direct_mmu && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
- quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
- quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
+ quadrant = gaddr >> (PAGE_SHIFT + (SPTE_LEVEL_BITS * level));
+ quadrant &= (1 << ((PT32_LEVEL_BITS - SPTE_LEVEL_BITS) * level)) - 1;
role.quadrant = quadrant;
}

@@ -2082,7 +2062,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato

iterator->shadow_addr
= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
- iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
+ iterator->shadow_addr &= SPTE_BASE_ADDR_MASK;
--iterator->level;
if (!iterator->shadow_addr)
iterator->level = 0;
@@ -2101,7 +2081,7 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
if (iterator->level < PG_LEVEL_4K)
return false;

- iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
+ iterator->index = SPTE_INDEX(iterator->addr, iterator->level);
iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
return true;
}
@@ -2114,7 +2094,7 @@ static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
return;
}

- iterator->shadow_addr = spte & PT64_BASE_ADDR_MASK;
+ iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK;
--iterator->level;
}

@@ -2153,7 +2133,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
* so we should update the spte at this point to get
* a new sp with the correct access.
*/
- child = to_shadow_page(*sptep & PT64_BASE_ADDR_MASK);
+ child = to_shadow_page(*sptep & SPTE_BASE_ADDR_MASK);
if (child->role.access == direct_access)
return;

@@ -2176,7 +2156,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
if (is_large_pte(pte))
--kvm->stat.lpages;
} else {
- child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
+ child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
drop_parent_pte(child, spte);

/*
@@ -2202,7 +2182,7 @@ static int kvm_mmu_page_unlink_children(struct kvm *kvm,
int zapped = 0;
unsigned i;

- for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+ for (i = 0; i < SPTE_ENT_PER_PAGE; ++i)
zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list);

return zapped;
@@ -2580,7 +2560,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
struct kvm_mmu_page *child;
u64 pte = *sptep;

- child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
+ child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
drop_parent_pte(child, sptep);
flush = true;
} else if (pfn != spte_to_pfn(*sptep)) {
@@ -3134,7 +3114,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
if (!VALID_PAGE(*root_hpa))
return;

- sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
+ sp = to_shadow_page(*root_hpa & SPTE_BASE_ADDR_MASK);

if (is_tdp_mmu_page(sp))
kvm_tdp_mmu_put_root(kvm, sp, false);
@@ -3494,7 +3474,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu->pae_root[i];

if (IS_VALID_PAE_ROOT(root)) {
- root &= PT64_BASE_ADDR_MASK;
+ root &= SPTE_BASE_ADDR_MASK;
sp = to_shadow_page(root);
mmu_sync_children(vcpu, sp);
}
@@ -4927,11 +4907,11 @@ static bool need_remote_flush(u64 old, u64 new)
return false;
if (!is_shadow_present_pte(new))
return true;
- if ((old ^ new) & PT64_BASE_ADDR_MASK)
+ if ((old ^ new) & SPTE_BASE_ADDR_MASK)
return true;
old ^= shadow_nx_mask;
new ^= shadow_nx_mask;
- return (old & ~new & PT64_PERM_MASK) != 0;
+ return (old & ~new & SPTE_PERM_MASK) != 0;
}

static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index cedc17b2f60e..4b5335188d01 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -45,7 +45,7 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
!is_last_spte(ent[i], level)) {
struct kvm_mmu_page *child;

- child = to_shadow_page(ent[i] & PT64_BASE_ADDR_MASK);
+ child = to_shadow_page(ent[i] & SPTE_BASE_ADDR_MASK);
__mmu_spte_walk(vcpu, child, fn, level - 1);
}
}
@@ -71,7 +71,7 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
hpa_t root = vcpu->arch.mmu->pae_root[i];

if (IS_VALID_PAE_ROOT(root)) {
- root &= PT64_BASE_ADDR_MASK;
+ root &= SPTE_BASE_ADDR_MASK;
sp = to_shadow_page(root);
__mmu_spte_walk(vcpu, sp, fn, 2);
}
@@ -117,7 +117,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
return;

hpa = pfn << PAGE_SHIFT;
- if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
+ if ((*sptep & SPTE_BASE_ADDR_MASK) != hpa)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
"ent %llxn", vcpu->arch.mmu->root_level, pfn,
hpa, *sptep);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 18be103df9d5..b9ef013a2202 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -20,6 +20,20 @@ extern bool dbg;
#define MMU_WARN_ON(x) do { } while (0)
#endif

+/* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
+#define __PT_LEVEL_SHIFT(level, bits_per_level) \
+ (PAGE_SHIFT + ((level) - 1) * (bits_per_level))
+#define __PT_INDEX(address, level, bits_per_level) \
+ (((address) >> __PT_LEVEL_SHIFT(level, bits_per_level)) & ((1 << (bits_per_level)) - 1))
+
+#define __PT_LVL_ADDR_MASK(base_addr_mask, level, bits_per_level) \
+ ((base_addr_mask) & ~((1ULL << (PAGE_SHIFT + (((level) - 1) * (bits_per_level)))) - 1))
+
+#define __PT_LVL_OFFSET_MASK(base_addr_mask, level, bits_per_level) \
+ ((base_addr_mask) & ((1ULL << (PAGE_SHIFT + (((level) - 1) * (bits_per_level)))) - 1))
+
+#define __PT_ENT_PER_PAGE(bits_per_level) (1 << (bits_per_level))
+
/*
* Unlike regular MMU roots, PAE "roots", a.k.a. PDPTEs/PDPTRs, have a PRESENT
* bit, and thus are guaranteed to be non-zero when valid. And, when a guest
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a2dbea70ffda..caaeec848b12 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -24,13 +24,32 @@
#ifndef __KVM_X86_PAGING_TMPL_COMMON_H
#define __KVM_X86_PAGING_TMPL_COMMON_H

-#define GUEST_PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+/* 64-bit guest PTEs, i.e. PAE paging and EPT. */
+#define PT64_LEVEL_BITS 9
+#define PT64_LEVEL_SHIFT __PT_LEVEL_SHIFT(level, PT64_LEVEL_BITS)
+#define PT64_INDEX(address, level) __PT_INDEX(address, level, PT64_LEVEL_BITS)
+#define PT64_ENT_PER_PAGE __PT_ENT_PER_PAGE(PT64_LEVEL_BITS)
+
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
#define PT64_LVL_ADDR_MASK(level) \
- (GUEST_PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
+ __PT_LVL_ADDR_MASK(PT64_BASE_ADDR_MASK, level, PT64_LEVEL_BITS)
+
#define PT64_LVL_OFFSET_MASK(level) \
- (GUEST_PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
+ __PT_LVL_OFFSET_MASK(PT64_BASE_ADDR_MASK, level, PT64_LEVEL_BITS)
+
+/* 32-bit guest PTEs, i.e. non-PAE IA32 paging. */
+#define PT32_LEVEL_SHIFT(level) __PT_LEVEL_SHIFT(level, PT32_LEVEL_BITS)
+#define PT32_INDEX(address, level) __PT_INDEX(address, level, PT32_LEVEL_BITS)
+#define PT32_ENT_PER_PAGE __PT_ENT_PER_PAGE(PT32_LEVEL_BITS)
+
+#define PT32_BASE_ADDR_MASK PAGE_MASK
+#define PT32_DIR_BASE_ADDR_MASK \
+ (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
+#define PT32_LVL_ADDR_MASK(level) \
+ __PT_LVL_ADDR_MASK(PT32_BASE_ADDR_MASK, level, PT32_LEVEL_BITS)
+
+#define PT32_LVL_OFFSET_MASK(level) \
+ __PT_LVL_OFFSET_MASK(PT32_BASE_ADDR_MASK, level, PT32_LEVEL_BITS)

#define PT32_DIR_PSE36_SIZE 4
#define PT32_DIR_PSE36_SHIFT 13
@@ -55,7 +74,7 @@ static inline gfn_t pse36_gfn_delta(u32 gpte)
#define pt_element_t u64
#define guest_walker guest_walker64
#define FNAME(name) paging##64_##name
- #define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
@@ -88,7 +107,7 @@ static inline gfn_t pse36_gfn_delta(u32 gpte)
#define pt_element_t u64
#define guest_walker guest_walkerEPT
#define FNAME(name) ept_##name
- #define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
@@ -1072,7 +1091,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);

- for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
unsigned pte_access;
pt_element_t gpte;
gpa_t pte_gpa;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 66d43cec0c31..cc7feac12e26 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -200,7 +200,7 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
{
u64 new_spte;

- new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
+ new_spte = old_spte & ~SPTE_BASE_ADDR_MASK;
new_spte |= (u64)new_pfn << PAGE_SHIFT;

new_spte &= ~PT_WRITABLE_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6925dfc38981..719785eea2fe 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -34,12 +34,12 @@
static_assert(SPTE_TDP_AD_ENABLED_MASK == 0);

#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
-#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
#else
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
#endif

-#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
+#define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
| shadow_x_mask | shadow_nx_mask | shadow_me_mask)

#define ACC_EXEC_MASK 1
@@ -48,17 +48,13 @@ static_assert(SPTE_TDP_AD_ENABLED_MASK == 0);
#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/* The mask for the R/X bits in EPT PTEs */
-#define PT64_EPT_READABLE_MASK 0x1ull
-#define PT64_EPT_EXECUTABLE_MASK 0x4ull
+#define SPTE_EPT_READABLE_MASK 0x1ull
+#define SPTE_EPT_EXECUTABLE_MASK 0x4ull

-#define PT64_LEVEL_BITS 9
-
-#define PT64_LEVEL_SHIFT(level) \
- (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
-
-#define PT64_INDEX(address, level)\
- (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
-#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+#define SPTE_LEVEL_BITS 9
+#define SPTE_LEVEL_SHIFT(level) __PT_LEVEL_SHIFT(level, SPTE_LEVEL_BITS)
+#define SPTE_INDEX(address, level) __PT_INDEX(address, level, SPTE_LEVEL_BITS)
+#define SPTE_ENT_PER_PAGE __PT_ENT_PER_PAGE(SPTE_LEVEL_BITS)

/* Bits 9 and 10 are ignored by all non-EPT PTEs. */
#define DEFAULT_SPTE_HOST_WRITEABLE BIT_ULL(9)
@@ -71,8 +67,8 @@ static_assert(SPTE_TDP_AD_ENABLED_MASK == 0);
* restored only when a write is attempted to the page. This mask obviously
* must not overlap the A/D type mask.
*/
-#define SHADOW_ACC_TRACK_SAVED_BITS_MASK (PT64_EPT_READABLE_MASK | \
- PT64_EPT_EXECUTABLE_MASK)
+#define SHADOW_ACC_TRACK_SAVED_BITS_MASK (SPTE_EPT_READABLE_MASK | \
+ SPTE_EPT_EXECUTABLE_MASK)
#define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT 54
#define SHADOW_ACC_TRACK_SAVED_MASK (SHADOW_ACC_TRACK_SAVED_BITS_MASK << \
SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
@@ -269,7 +265,7 @@ static inline bool is_executable_pte(u64 spte)

static inline kvm_pfn_t spte_to_pfn(u64 pte)
{
- return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
+ return (pte & SPTE_BASE_ADDR_MASK) >> PAGE_SHIFT;
}

static inline bool is_accessed_spte(u64 spte)
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b3ed302c1a35..5c002c43cb3c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -11,7 +11,7 @@
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
- SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
}

@@ -113,8 +113,8 @@ static bool try_step_side(struct tdp_iter *iter)
* Check if the iterator is already at the end of the current page
* table.
*/
- if (SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
- (PT64_ENT_PER_PAGE - 1))
+ if (SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
+ (SPTE_ENT_PER_PAGE - 1))
return false;

iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index caac4ddb46df..3720d244e5a2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -335,7 +335,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,

tdp_mmu_unlink_page(kvm, sp, shared);

- for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
sptep = rcu_dereference(pt) + i;
gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);

--
2.32.0.288.g62a8d224e6-goog

2021-06-24 07:44:30

by Paolo Bonzini

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On 24/06/21 01:05, Sean Christopherson wrote:
> A few fixes centered around enumerating guest MAXPHYADDR and handling the
> C-bit in KVM.
>
> DISCLAIMER: I have no idea if patch 03, "Truncate reported guest
> MAXPHYADDR to C-bit if SEV is supported", is architecturally correct.
> The APM says
> the following about the C-bit in the context of SEV, but I can't for the
> life of me find anything in the APM that clarifies whether "effectively
> reduced" is supposed to apply to _only_ SEV guests, or any guest on an
> SEV enabled platform.
>
> Note that because guest physical addresses are always translated through
> the nested page tables, the size of the guest physical address space is
> not impacted by any physical address space reduction indicated in
> CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
> the guest physical address space is effectively reduced by 1 bit.
>
> In practice, I have observed that Rome CPUs treat the C-bit as reserved for
> non-SEV guests (another disclaimer on this below). Long story short, commit
> ef4c9f4f6546 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
> exposed the issue by inadvertently causing selftests to start using GPAs
> with bit 47 set.
>
> That said, regardless of whether or not the behavior is intended, it needs
> to be addressed by KVM. I think the only difference is whether this is
> KVM's _only_ behavior, or whether it's gated by an erratum flag.
>
> The second disclaimer is that I haven't tested with memory encryption
> disabled in hardware. I wrote the patch assuming/hoping that only CPUs
> that report SEV=1 treat the C-bit as reserved, but I haven't actually
> tested the SEV=0 case on e.g. CPUs with only SME (we might have these
> platforms, but I've no idea how to access/find them), or CPUs with SME/SEV
> disabled in BIOS (again, I've no idea how to do this with our BIOS).

I'm merging patches 1-3 right away, though not sending them to Linus
(they will be picked up by stable after the 5.14 merge window); for
patch 4, I'm creating a separate file for the "common" parts of
paging_tmpl.h, called paging.h.

Paolo

> Sean Christopherson (7):
> KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is
> enabled
> KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
> KVM: x86: Truncate reported guest MAXPHYADDR to C-bit if SEV is
> supported
> KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
> KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro
> KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h
> KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs
>
> arch/x86/kvm/cpuid.c | 38 +++++++++++++++++---
> arch/x86/kvm/mmu.h | 11 ++----
> arch/x86/kvm/mmu/mmu.c | 63 ++++++++-------------------------
> arch/x86/kvm/mmu/mmu_audit.c | 6 ++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++
> arch/x86/kvm/mmu/paging_tmpl.h | 52 ++++++++++++++++++++++++++-
> arch/x86/kvm/mmu/spte.c | 2 +-
> arch/x86/kvm/mmu/spte.h | 34 +++++++-----------
> arch/x86/kvm/mmu/tdp_iter.c | 6 ++--
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/svm/svm.c | 37 ++++++++++++++-----
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/x86.c | 3 ++
> arch/x86/kvm/x86.h | 1 +
> 14 files changed, 170 insertions(+), 101 deletions(-)
>

2021-06-24 16:32:05

by Tom Lendacky

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On 6/23/21 6:05 PM, Sean Christopherson wrote:
> A few fixes centered around enumerating guest MAXPHYADDR and handling the
> C-bit in KVM.
>
> DISCLAIMER: I have no idea if patch 03, "Truncate reported guest
> MAXPHYADDR to C-bit if SEV is supported", is architecturally correct.
> The APM says
> the following about the C-bit in the context of SEV, but I can't for the
> life of me find anything in the APM that clarifies whether "effectively
> reduced" is supposed to apply to _only_ SEV guests, or any guest on an
> SEV enabled platform.
>
> Note that because guest physical addresses are always translated through
> the nested page tables, the size of the guest physical address space is
> not impacted by any physical address space reduction indicated in
> CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
> the guest physical address space is effectively reduced by 1 bit.
>
> In practice, I have observed that Rome CPUs treat the C-bit as reserved for
> non-SEV guests (another disclaimer on this below). Long story short, commit
> ef4c9f4f6546 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
> exposed the issue by inadvertently causing selftests to start using GPAs
> with bit 47 set.
>
> That said, regardless of whether or not the behavior is intended, it needs
> to be addressed by KVM. I think the only difference is whether this is
> KVM's _only_ behavior, or whether it's gated by an erratum flag.
>
> The second disclaimer is that I haven't tested with memory encryption
> disabled in hardware. I wrote the patch assuming/hoping that only CPUs
> that report SEV=1 treat the C-bit as reserved, but I haven't actually
> tested the SEV=0 case on e.g. CPUs with only SME (we might have these
> platforms, but I've no idea how to access/find them), or CPUs with SME/SEV
> disabled in BIOS (again, I've no idea how to do this with our BIOS).

Here's an explanation of the physical address reduction for bare-metal and
guest.

With MSR 0xC001_0010[SMEE] = 0:
No reduction in host or guest max physical address.

With MSR 0xC001_0010[SMEE] = 1:
- Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
regardless of whether SME is enabled in the host or not. So, for example
on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
- There is no reduction in physical address in a legacy guest (non-SEV
guest), so the guest can use a 48-bit physical address
- There is a reduction of only the encryption bit in an SEV guest, so
the guest can use up to a 47-bit physical address. This is why the
Qemu command line sev-guest option uses a value of 1 for the
"reduced-phys-bits" parameter.

Thanks,
Tom

>
> Sean Christopherson (7):
> KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is
> enabled
> KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
> KVM: x86: Truncate reported guest MAXPHYADDR to C-bit if SEV is
> supported
> KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
> KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro
> KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h
> KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs
>
> arch/x86/kvm/cpuid.c | 38 +++++++++++++++++---
> arch/x86/kvm/mmu.h | 11 ++----
> arch/x86/kvm/mmu/mmu.c | 63 ++++++++-------------------------
> arch/x86/kvm/mmu/mmu_audit.c | 6 ++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++
> arch/x86/kvm/mmu/paging_tmpl.h | 52 ++++++++++++++++++++++++++-
> arch/x86/kvm/mmu/spte.c | 2 +-
> arch/x86/kvm/mmu/spte.h | 34 +++++++-----------
> arch/x86/kvm/mmu/tdp_iter.c | 6 ++--
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/svm/svm.c | 37 ++++++++++++++-----
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/x86.c | 3 ++
> arch/x86/kvm/x86.h | 1 +
> 14 files changed, 170 insertions(+), 101 deletions(-)
>

2021-06-24 16:38:16

by Tom Lendacky

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

>
> Here's an explanation of the physical address reduction for bare-metal and
> guest.
>
> With MSR 0xC001_0010[SMEE] = 0:
> No reduction in host or guest max physical address.
>
> With MSR 0xC001_0010[SMEE] = 1:
> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
> regardless of whether SME is enabled in the host or not. So, for example
> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
> - There is no reduction in physical address in a legacy guest (non-SEV
> guest), so the guest can use a 48-bit physical address
> - There is a reduction of only the encryption bit in an SEV guest, so
> the guest can use up to a 47-bit physical address. This is why the
> Qemu command line sev-guest option uses a value of 1 for the
> "reduced-phys-bits" parameter.
>

The guest statements all assume that NPT is enabled.

Thanks,
Tom

2021-06-24 17:33:44

by Sean Christopherson

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On Thu, Jun 24, 2021, Tom Lendacky wrote:
> >
> > Here's an explanation of the physical address reduction for bare-metal and
> > guest.
> >
> > With MSR 0xC001_0010[SMEE] = 0:
> > No reduction in host or guest max physical address.
> >
> > With MSR 0xC001_0010[SMEE] = 1:
> > - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
> > regardless of whether SME is enabled in the host or not. So, for example
> > on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
> > - There is no reduction in physical address in a legacy guest (non-SEV
> > guest), so the guest can use a 48-bit physical address

So the behavior I'm seeing is either a CPU bug or user error. Can you verify
the unexpected #PF behavior to make sure I'm not doing something stupid?

Thanks!

> > - There is a reduction of only the encryption bit in an SEV guest, so
> > the guest can use up to a 47-bit physical address. This is why the
> > Qemu command line sev-guest option uses a value of 1 for the
> > "reduced-phys-bits" parameter.
> >
>
> The guest statements all assume that NPT is enabled.
>
> Thanks,
> Tom

2021-06-24 17:41:14

by Tom Lendacky

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes



On 6/24/21 12:31 PM, Sean Christopherson wrote:
> On Thu, Jun 24, 2021, Tom Lendacky wrote:
>>>
>>> Here's an explanation of the physical address reduction for bare-metal and
>>> guest.
>>>
>>> With MSR 0xC001_0010[SMEE] = 0:
>>> No reduction in host or guest max physical address.
>>>
>>> With MSR 0xC001_0010[SMEE] = 1:
>>> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
>>> regardless of whether SME is enabled in the host or not. So, for example
>>> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
>>> - There is no reduction in physical address in a legacy guest (non-SEV
>>> guest), so the guest can use a 48-bit physical address
>
> So the behavior I'm seeing is either a CPU bug or user error. Can you verify
> the unexpected #PF behavior to make sure I'm not doing something stupid?

Yeah, I saw that in patch #3. Let me see what I can find out. I could just
be wrong on that myself - it wouldn't be the first time.

Thanks,
Tom

>
> Thanks!
>
>>> - There is a reduction of only the encryption bit in an SEV guest, so
>>> the guest can use up to a 47-bit physical address. This is why the
>>> Qemu command line sev-guest option uses a value of 1 for the
>>> "reduced-phys-bits" parameter.
>>>
>>
>> The guest statements all assume that NPT is enabled.
>>
>> Thanks,
>> Tom

2021-06-24 22:09:16

by Tom Lendacky

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On 6/24/21 12:39 PM, Tom Lendacky wrote:
>
>
> On 6/24/21 12:31 PM, Sean Christopherson wrote:
>> On Thu, Jun 24, 2021, Tom Lendacky wrote:
>>>>
>>>> Here's an explanation of the physical address reduction for bare-metal and
>>>> guest.
>>>>
>>>> With MSR 0xC001_0010[SMEE] = 0:
>>>> No reduction in host or guest max physical address.
>>>>
>>>> With MSR 0xC001_0010[SMEE] = 1:
>>>> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
>>>> regardless of whether SME is enabled in the host or not. So, for example
>>>> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
>>>> - There is no reduction in physical address in a legacy guest (non-SEV
>>>> guest), so the guest can use a 48-bit physical address
>>
>> So the behavior I'm seeing is either a CPU bug or user error. Can you verify
>> the unexpected #PF behavior to make sure I'm not doing something stupid?
>
> Yeah, I saw that in patch #3. Let me see what I can find out. I could just
> be wrong on that myself - it wouldn't be the first time.

From patch #3:
SVM: KVM: CPU #PF @ rip = 0x409ca4, cr2 = 0xc0000000, pfec = 0xb
KVM: guest PTE = 0x181023 @ GPA = 0x180000, level = 4
KVM: guest PTE = 0x186023 @ GPA = 0x181000, level = 3
KVM: guest PTE = 0x187023 @ GPA = 0x186000, level = 2
KVM: guest PTE = 0xffffbffff003 @ GPA = 0x187000, level = 1
SVM: KVM: GPA = 0x7fffbffff000

I think you may be hitting a special HT region that is at the top 12GB of
the 48-bit memory range and is reserved, even for GPAs. Can you somehow
get the test to use an address below 0xfffd_0000_0000? That would show
that bit 47 is valid for the legacy guest while staying out of the HT region.
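
(For the arithmetic: the HT hole is the top 12GB of the 48-bit space,
i.e. 0x1_0000_0000_0000 - 0x3_0000_0000 = 0xfffd_0000_0000 and up, so an
address just below that bound, e.g. 0xfffc_ffff_f000, still has bit 47
set without touching the reserved region.)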

Thanks,
Tom

>
> Thanks,
> Tom
>
>>
>> Thanks!
>>
>>>> - There is a reduction of only the encryption bit in an SEV guest, so
>>>> the guest can use up to a 47-bit physical address. This is why the
>>>> Qemu command line sev-guest option uses a value of 1 for the
>>>> "reduced-phys-bits" parameter.
>>>>
>>>
>>> The guest statements all assume that NPT is enabled.
>>>
>>> Thanks,
>>> Tom

2021-06-24 22:17:27

by Sean Christopherson

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On Thu, Jun 24, 2021, Tom Lendacky wrote:
> On 6/24/21 12:39 PM, Tom Lendacky wrote:
> >
> >
> > On 6/24/21 12:31 PM, Sean Christopherson wrote:
> >> On Thu, Jun 24, 2021, Tom Lendacky wrote:
> >>>>
> >>>> Here's an explanation of the physical address reduction for bare-metal and
> >>>> guest.
> >>>>
> >>>> With MSR 0xC001_0010[SMEE] = 0:
> >>>> No reduction in host or guest max physical address.
> >>>>
> >>>> With MSR 0xC001_0010[SMEE] = 1:
> >>>> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
> >>>> regardless of whether SME is enabled in the host or not. So, for example
> >>>> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
> >>>> - There is no reduction in physical address in a legacy guest (non-SEV
> >>>> guest), so the guest can use a 48-bit physical address
> >>
> >> So the behavior I'm seeing is either a CPU bug or user error. Can you verify
> >> the unexpected #PF behavior to make sure I'm not doing something stupid?
> >
> > Yeah, I saw that in patch #3. Let me see what I can find out. I could just
> > be wrong on that myself - it wouldn't be the first time.
>
> From patch #3:
> SVM: KVM: CPU #PF @ rip = 0x409ca4, cr2 = 0xc0000000, pfec = 0xb
> KVM: guest PTE = 0x181023 @ GPA = 0x180000, level = 4
> KVM: guest PTE = 0x186023 @ GPA = 0x181000, level = 3
> KVM: guest PTE = 0x187023 @ GPA = 0x186000, level = 2
> KVM: guest PTE = 0xffffbffff003 @ GPA = 0x187000, level = 1
> SVM: KVM: GPA = 0x7fffbffff000
>
> I think you may be hitting a special HT region that is at the top 12GB of
> the 48-bit memory range and is reserved, even for GPAs. Can you somehow
> get the test to use an address below 0xfffd_0000_0000? That would show
> that bit 47 is valid for the legacy guest while staying out of the HT region.

I can make that happen.

I assume "HT" is HyperTransport?

2021-06-24 22:21:39

by Tom Lendacky

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On 6/24/21 5:15 PM, Sean Christopherson wrote:
> On Thu, Jun 24, 2021, Tom Lendacky wrote:
>> On 6/24/21 12:39 PM, Tom Lendacky wrote:
>>>
>>>
>>> On 6/24/21 12:31 PM, Sean Christopherson wrote:
>>>> On Thu, Jun 24, 2021, Tom Lendacky wrote:
>>>>>>
>>>>>> Here's an explanation of the physical address reduction for bare-metal and
>>>>>> guest.
>>>>>>
>>>>>> With MSR 0xC001_0010[SMEE] = 0:
>>>>>> No reduction in host or guest max physical address.
>>>>>>
>>>>>> With MSR 0xC001_0010[SMEE] = 1:
>>>>>> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
>>>>>> regardless of whether SME is enabled in the host or not. So, for example
>>>>>> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
>>>>>> - There is no reduction in physical address in a legacy guest (non-SEV
>>>>>> guest), so the guest can use a 48-bit physical address
>>>>
>>>> So the behavior I'm seeing is either a CPU bug or user error. Can you verify
>>>> the unexpected #PF behavior to make sure I'm not doing something stupid?
>>>
>>> Yeah, I saw that in patch #3. Let me see what I can find out. I could just
>>> be wrong on that myself - it wouldn't be the first time.
>>
>> From patch #3:
>> SVM: KVM: CPU #PF @ rip = 0x409ca4, cr2 = 0xc0000000, pfec = 0xb
>> KVM: guest PTE = 0x181023 @ GPA = 0x180000, level = 4
>> KVM: guest PTE = 0x186023 @ GPA = 0x181000, level = 3
>> KVM: guest PTE = 0x187023 @ GPA = 0x186000, level = 2
>> KVM: guest PTE = 0xffffbffff003 @ GPA = 0x187000, level = 1
>> SVM: KVM: GPA = 0x7fffbffff000
>>
>> I think you may be hitting a special HT region that is at the top 12GB of
>> the 48-bit memory range and is reserved, even for GPAs. Can you somehow
>> get the test to use an address below 0xfffd_0000_0000? That would show
>> that bit 47 is valid for the legacy guest while staying out of the HT region.
>
> I can make that happen.
>
> I assume "HT" is HyperTransport?

Yep.

Thanks,
Tom

>

2021-06-24 23:51:43

by Sean Christopherson

Subject: Re: [PATCH 0/7] KVM: x86: guest MAXPHYADDR and C-bit fixes

On Thu, Jun 24, 2021, Sean Christopherson wrote:
> On Thu, Jun 24, 2021, Tom Lendacky wrote:
> > On 6/24/21 12:39 PM, Tom Lendacky wrote:
> > >
> > >
> > > On 6/24/21 12:31 PM, Sean Christopherson wrote:
> > >> On Thu, Jun 24, 2021, Tom Lendacky wrote:
> > >>>>
> > >>>> Here's an explanation of the physical address reduction for bare-metal and
> > >>>> guest.
> > >>>>
> > >>>> With MSR 0xC001_0010[SMEE] = 0:
> > >>>> No reduction in host or guest max physical address.
> > >>>>
> > >>>> With MSR 0xC001_0010[SMEE] = 1:
> > >>>> - Reduction in the host is enumerated by CPUID 0x8000_001F_EBX[11:6],
> > >>>> regardless of whether SME is enabled in the host or not. So, for example
> > >>>> on EPYC generation 2 (Rome) you would see a reduction from 48 to 43.
> > >>>> - There is no reduction in physical address in a legacy guest (non-SEV
> > >>>> guest), so the guest can use a 48-bit physical address
> > >>
> > >> So the behavior I'm seeing is either a CPU bug or user error. Can you verify
> > >> the unexpected #PF behavior to make sure I'm not doing something stupid?
> > >
> > > Yeah, I saw that in patch #3. Let me see what I can find out. I could just
> > > be wrong on that myself - it wouldn't be the first time.
> >
> > From patch #3:
> > SVM: KVM: CPU #PF @ rip = 0x409ca4, cr2 = 0xc0000000, pfec = 0xb
> > KVM: guest PTE = 0x181023 @ GPA = 0x180000, level = 4
> > KVM: guest PTE = 0x186023 @ GPA = 0x181000, level = 3
> > KVM: guest PTE = 0x187023 @ GPA = 0x186000, level = 2
> > KVM: guest PTE = 0xffffbffff003 @ GPA = 0x187000, level = 1
> > SVM: KVM: GPA = 0x7fffbffff000
> >
> > I think you may be hitting a special HT region that is at the top 12GB of
> > the 48-bit memory range and is reserved, even for GPAs. Can you somehow
> > get the test to use an address below 0xfffd_0000_0000? That would show
> > that bit 47 is valid for the legacy guest while staying out of the HT region.
>
> I can make that happen.

Ah, hilarious. That indeed does the trick. 0xfffd00000000 = #PF,
0xfffcfffff000 = good.

I'll send a revert shortly. There's another C-bit bug that needs fixing, too :-/
The unconditional __sme_clr() in npf_interception() is wrong and breaks non-SEV
guests. Based on this from the APM

If the C-bit is an address bit, this bit is masked from the guest
physical address when it is translated through the nested page tables.
Consequently, the hypervisor does not need to be aware of which pages
the guest has chosen to mark private.

I assume it's not needed for SEV either? I'm about to find out shortly, but if
you happen to know for sure... :-)