This is a combination of prep work for TDX and SNP, and a cleanup of the
page fault path to (hopefully) make it easier to follow the rules for
private memory, noslot faults, writes to read-only slots, etc.
Paolo, this is the series I mentioned in your TDX/SNP prep work series.
Stating the obvious, these
KVM: x86/mmu: Pass full 64-bit error code when handling page faults
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
are the drop-in replacements.
Isaku Yamahata (1):
KVM: x86/mmu: Pass full 64-bit error code when handling page faults
Sean Christopherson (15):
KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits
emulation
KVM: x86: Remove separate "bit" defines for page fault error code
masks
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
KVM: x86/mmu: Use synthetic page fault error code to indicate private
faults
KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are
non-zero
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page
faults
KVM: x86/mmu: Move private vs. shared check above slot validity checks
KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC
internal slots
KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO
KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to
kvm_faultin_pfn()
KVM: x86/mmu: Handle no-slot faults at the beginning of
kvm_faultin_pfn()
KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot"
faults
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create
noslot pfns
arch/x86/include/asm/kvm_host.h | 45 ++++-----
arch/x86/kvm/mmu.h | 4 +-
arch/x86/kvm/mmu/mmu.c | 159 +++++++++++++++++++-------------
arch/x86/kvm/mmu/mmu_internal.h | 24 ++++-
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/svm/svm.c | 9 ++
6 files changed, 151 insertions(+), 92 deletions(-)
base-commit: ec1e3d33557babed2c2c2c7da6e84293c2f56f58
--
2.44.0.278.ge034bb2e1d-goog
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
triggers emulation of any kind, as KVM doesn't currently support emulating
access to guest private memory. Practically speaking, private faults and
emulation are already mutually exclusive, but there are edge cases upon
edge cases where KVM can return RET_PF_EMULATE, and adding one last check
to harden against weird, unexpected combinations is inexpensive.
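For reference, on the userspace side this shows up as KVM_RUN failing with
-EFAULT, with kvm_run::exit_reason set to KVM_EXIT_MEMORY_FAULT and
kvm_run::memory_fault filled in.  A rough, untested sketch of the VMM side
(handle_memory_fault() is a stand-in for whatever the VMM does with the
reported range, e.g. converting it via KVM_SET_MEMORY_ATTRIBUTES; the usual
<linux/kvm.h>, <sys/ioctl.h> and <errno.h> boilerplate is assumed):

  static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
  {
  	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

  	if (ret < 0 && errno == EFAULT &&
  	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
  		bool is_private = run->memory_fault.flags &
  				  KVM_MEMORY_EXIT_FLAG_PRIVATE;

  		/* E.g. flip the range's attributes and re-enter the guest. */
  		handle_memory_fault(run->memory_fault.gpa,
  				    run->memory_fault.size, is_private);
  	}
  }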
Suggested-by: Yan Zhao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 --------
arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++++++++
2 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..e2fd74e06ff8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4309,14 +4309,6 @@ static inline u8 kvm_max_level_for_order(int order)
return PG_LEVEL_4K;
}
-static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
-{
- kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
- PAGE_SIZE, fault->write, fault->exec,
- fault->is_private);
-}
-
static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..0eea6c5a824d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -279,6 +279,14 @@ enum {
RET_PF_SPURIOUS,
};
+static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+ PAGE_SIZE, fault->write, fault->exec,
+ fault->is_private);
+}
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch, int *emulation_type)
{
@@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
else
r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+ if (r == RET_PF_EMULATE && fault.is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
+ return -EFAULT;
+ }
+
if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
--
2.44.0.278.ge034bb2e1d-goog
Open code the bit number directly in the PFERR_* masks and drop the
intermediate PFERR_*_BIT defines, as having to bounce through two macros
just to see which flag corresponds to which bit is quite annoying, as is
having to define two macros just to add recognition of a new flag.
Use ilog2() to derive the bit in permission_fault(), the one function that
actually needs the bit number (it does clever shifting to manipulate flags
in order to avoid conditional branches).
No functional change intended.
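To make that concrete: PFERR_RSVD_MASK is BIT(3), so ilog2(PFERR_RSVD_MASK)
is folded to '3' at compile time and the shift amounts in permission_fault()
are unchanged, e.g.

  (not_smap << ilog2(PFERR_RSVD_MASK)) == (not_smap << 3)
                                       == (not_smap << PFERR_RSVD_BIT)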
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 32 ++++++++++----------------------
arch/x86/kvm/mmu.h | 4 ++--
2 files changed, 12 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aaf5a25ea7ed..88cc523bafa8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -254,28 +254,16 @@ enum x86_intercept_stage;
KVM_GUESTDBG_INJECT_DB | \
KVM_GUESTDBG_BLOCKIRQ)
-
-#define PFERR_PRESENT_BIT 0
-#define PFERR_WRITE_BIT 1
-#define PFERR_USER_BIT 2
-#define PFERR_RSVD_BIT 3
-#define PFERR_FETCH_BIT 4
-#define PFERR_PK_BIT 5
-#define PFERR_SGX_BIT 15
-#define PFERR_GUEST_FINAL_BIT 32
-#define PFERR_GUEST_PAGE_BIT 33
-#define PFERR_IMPLICIT_ACCESS_BIT 48
-
-#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
-#define PFERR_WRITE_MASK BIT(PFERR_WRITE_BIT)
-#define PFERR_USER_MASK BIT(PFERR_USER_BIT)
-#define PFERR_RSVD_MASK BIT(PFERR_RSVD_BIT)
-#define PFERR_FETCH_MASK BIT(PFERR_FETCH_BIT)
-#define PFERR_PK_MASK BIT(PFERR_PK_BIT)
-#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
-#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
-#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
-#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_PRESENT_MASK BIT(0)
+#define PFERR_WRITE_MASK BIT(1)
+#define PFERR_USER_MASK BIT(2)
+#define PFERR_RSVD_MASK BIT(3)
+#define PFERR_FETCH_MASK BIT(4)
+#define PFERR_PK_MASK BIT(5)
+#define PFERR_SGX_MASK BIT(15)
+#define PFERR_GUEST_FINAL_MASK BIT_ULL(32)
+#define PFERR_GUEST_PAGE_MASK BIT_ULL(33)
+#define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
PFERR_WRITE_MASK | \
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..e8b620a85627 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -213,7 +213,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
*/
u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
- int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
+ int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
u32 errcode = PFERR_PRESENT_MASK;
bool fault;
@@ -235,7 +235,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) +
- ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
+ ((pte_access & PT_USER_MASK) << (ilog2(PFERR_RSVD_MASK) - PT_USER_SHIFT));
pkru_bits &= mmu->pkru_mask >> offset;
errcode |= -pkru_bits & PFERR_PK_MASK;
--
2.44.0.278.ge034bb2e1d-goog
From: Isaku Yamahata <[email protected]>
Plumb the full 64-bit error code throughout the page fault handling code
so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
will be used to determine whether or not a fault is private vs. shared.
Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
the behavior of permission_fault() when invoked in the page fault path, as
KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
Continue passing '0' from the async #PF worker, as guest_memfd() and thus
private memory doesn't support async page faults.
Signed-off-by: Isaku Yamahata <[email protected]>
[mdr: drop references/changes on rebase, update commit message]
Signed-off-by: Michael Roth <[email protected]>
[sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2fd74e06ff8..408969ac1291 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5860,8 +5860,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}
if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false,
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
&emulation_type);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0eea6c5a824d..1fab1f2359b5 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;
/* Derived from error_code. */
@@ -288,7 +288,7 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
}
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..195d98bc8de8 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -260,7 +260,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
--
2.44.0.278.ge034bb2e1d-goog
Add and use a synthetic, KVM-defined page fault error code to indicate
whether a fault is to private vs. shared memory. TDX and SNP have
different mechanisms for reporting private vs. shared, and KVM's
software-protected VMs have no mechanism at all. Usurp an error code
flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
and friends.
Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
for TDX and software-protected VMs as appropriate, but that would require
*clearing* the flag for SEV and SEV-ES VMs, which support encrypted
memory at the hardware layer, but don't utilize private memory at the
KVM layer.
Opportunistically add a comment to call out that the logic for software-
protected VMs is (and was before this commit) broken for nested MMUs, i.e.
for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with
nested MMUs as there is a _lot_ of functionality that simply doesn't work
for software-protected VMs, e.g. all of the paths where KVM accesses guest
memory need to be updated to be aware of private vs. shared memory.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 +++++++++++
arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
3 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1e69743ef0fb..4077c46c61ab 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -267,7 +267,18 @@ enum x86_intercept_stage;
#define PFERR_GUEST_ENC_MASK BIT_ULL(34)
#define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
#define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
+
+/*
+ * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP checks
+ * when emulating instructions that triggers implicit access.
+ */
#define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
+/*
+ * PRIVATE_ACCESS is a KVM-defined flag used to indicate that a fault occurred
+ * when the guest was accessing private memory.
+ */
+#define PFERR_PRIVATE_ACCESS BIT_ULL(49)
+#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
PFERR_WRITE_MASK | \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 408969ac1291..7807bdcd87e8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5839,19 +5839,31 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
bool direct = vcpu->arch.mmu->root_role.direct;
/*
- * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
- * checks when emulating instructions that triggers implicit access.
* WARN if hardware generates a fault with an error code that collides
- * with the KVM-defined value. Clear the flag and continue on, i.e.
- * don't terminate the VM, as KVM can't possibly be relying on a flag
- * that KVM doesn't know about.
+ * with KVM-defined synthetic flags. Clear the flags and continue on,
+ * i.e. don't terminate the VM, as KVM can't possibly be relying on a
+ * flag that KVM doesn't know about.
*/
- if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
- error_code &= ~PFERR_IMPLICIT_ACCESS;
+ if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
+ error_code &= ~PFERR_SYNTHETIC_MASK;
if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;
+ /*
+ * Except for reserved faults (emulated MMIO is shared-only), set the
+ * private flag for software-protected VMs based on the gfn's current
+ * attributes, which are the source of truth for such VMs. Note, this
+ * is wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
+ * currently support nested virtualization (among many other things)
+ * for software-protected VMs.
+ */
+ if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+ !(error_code & PFERR_RSVD_MASK) &&
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1fab1f2359b5..d7c10d338f14 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -306,7 +306,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+ .is_private = err & PFERR_PRIVATE_ACCESS,
};
int r;
--
2.44.0.278.ge034bb2e1d-goog
WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
as the error code for #PF is limited to 32 bits (and in practice, 16 bits
on Intel CPUs). This behavior is architectural, is part of KVM's ABI
(see kvm_vcpu_events.error_code), and is explicitly documented as being
preserved for intercepted #PF in both the APM:
The error code saved in EXITINFO1 is the same as would be pushed onto
the stack by a non-intercepted #PF exception in protected mode.
and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
32-bit field.
Simply drop the upper bits if hardware provides garbage, as spurious
information should do no harm (though in all likelihood hardware is buggy
and the kernel is doomed).
Handling the upper 32 bits in the #PF path will allow moving the sanity
check on synthetic flags from kvm_mmu_page_fault() to npf_interception(),
which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
bit minimizes the probability of a collision with the "other" vendor,
without needing to plumb more bits through microcode.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7807bdcd87e8..5d892bd59c97 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
if (WARN_ON_ONCE(fault_address >> 32))
return -EFAULT;
#endif
+ /*
+ * Legacy #PF exceptions only have a 32-bit error code. Simply drop the
+ * upper bits as KVM doesn't use them for #PF (because they are never
+ * set), and to ensure there are no collisions with KVM-defined bits.
+ */
+ if (WARN_ON_ONCE(error_code >> 32))
+ error_code = lower_32_bits(error_code);
vcpu->arch.l1tf_flush_l1d = true;
if (!flags) {
--
2.44.0.278.ge034bb2e1d-goog
WARN and skip the emulated MMIO fastpath if a private, reserved page fault
is encountered, as private+reserved should be an impossible combination
(KVM should never create an MMIO SPTE for a private access).
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd342ebd0809..9206cfa58feb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
error_code |= PFERR_PRIVATE_ACCESS;
r = RET_PF_INVALID;
- if (unlikely(error_code & PFERR_RSVD_MASK)) {
+ if (unlikely((error_code & PFERR_RSVD_MASK) &&
+ !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
if (r == RET_PF_EMULATE)
goto emulate;
--
2.44.0.278.ge034bb2e1d-goog
Prioritize private vs. shared gfn attribute checks above slot validity
checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
userspace if there is no memslot, but emulate accesses to the APIC access
page even if the attributes mismatch.
Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: Yu Zhang <[email protected]>
Cc: Chao Peng <[email protected]>
Cc: Fuad Tabba <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Isaku Yamahata <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9206cfa58feb..58c5ae8be66c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
- kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
- return -EFAULT;
- }
-
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);
@@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();
+ /*
+ * Check for a private vs. shared mismatch *after* taking a snapshot of
+ * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
+ * invalidation notifier.
+ */
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
/*
* Check for a relevant mmu_notifier invalidation event before getting
* the pfn from the primary MMU, and before acquiring mmu_lock.
--
2.44.0.278.ge034bb2e1d-goog
Explicitly detect and disallow private accesses to emulated MMIO in
kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
to perform the check. This will allow the page fault path to go straight
to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5c8caab64ba2..ebdb3fcce3dc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
{
gva_t gva = fault->is_tdp ? 0 : fault->addr;
+ if (fault->is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
access & shadow_mmio_access_mask);
--
2.44.0.278.ge034bb2e1d-goog
Move the checks related to the validity of an access to a memslot from the
inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This
allows emulating accesses to the APIC access page, which don't need to
resolve a pfn, even if there is a relevant in-progress mmu_notifier
invalidation. Ditto for accesses to KVM internal memslots from L2, which
KVM also treats as emulated MMIO.
More importantly, this will allow for future cleanup by having the
"no memslot" case bail from kvm_faultin_pfn() very early on.
Go to rather extreme and gross lengths to make the change a glorified
nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the
related code is very subtle. E.g. fault->slot can be nullified if it
points at the APIC access page, some flows in KVM x86 expect fault->pfn
to be KVM_PFN_NOSLOT, while others check only fault->slot, etc.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 105 +++++++++++++++++++++--------------------
1 file changed, 53 insertions(+), 52 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ebdb3fcce3dc..8aa957f0a717 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4340,9 +4340,59 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_memory_slot *slot = fault->slot;
bool async;
+ if (fault->is_private)
+ return kvm_faultin_pfn_private(vcpu, fault);
+
+ async = false;
+ fault->pfn = __gfn_to_pfn_memslot(fault->slot, fault->gfn, false, false,
+ &async, fault->write,
+ &fault->map_writable, &fault->hva);
+ if (!async)
+ return RET_PF_CONTINUE; /* *pfn has correct page already */
+
+ if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn);
+ if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return RET_PF_RETRY;
+ } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ return RET_PF_RETRY;
+ }
+ }
+
+ /*
+ * Allow gup to bail on pending non-fatal signals when it's also allowed
+ * to wait for IO. Note, gup always bails if it is unable to quickly
+ * get a page and a fatal signal, i.e. SIGKILL, is pending.
+ */
+ fault->pfn = __gfn_to_pfn_memslot(fault->slot, fault->gfn, false, true,
+ NULL, fault->write,
+ &fault->map_writable, &fault->hva);
+ return RET_PF_CONTINUE;
+}
+
+static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access)
+{
+ struct kvm_memory_slot *slot = fault->slot;
+ int ret;
+
+ fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+ smp_rmb();
+
+ /*
+ * Check for a private vs. shared mismatch *after* taking a snapshot of
+ * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
+ * invalidation notifier.
+ */
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved. This ensures any existing SPTEs for the old memslot will
@@ -4367,7 +4417,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
- return RET_PF_CONTINUE;
+ goto faultin_done;
}
/*
* If the APIC access page exists but is disabled, go directly
@@ -4379,56 +4429,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
- if (fault->is_private)
- return kvm_faultin_pfn_private(vcpu, fault);
-
- async = false;
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva);
- if (!async)
- return RET_PF_CONTINUE; /* *pfn has correct page already */
-
- if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
- trace_kvm_try_async_get_page(fault->addr, fault->gfn);
- if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
- trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
- kvm_make_request(KVM_REQ_APF_HALT, vcpu);
- return RET_PF_RETRY;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
- return RET_PF_RETRY;
- }
- }
-
- /*
- * Allow gup to bail on pending non-fatal signals when it's also allowed
- * to wait for IO. Note, gup always bails if it is unable to quickly
- * get a page and a fatal signal, i.e. SIGKILL, is pending.
- */
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
- fault->write, &fault->map_writable,
- &fault->hva);
- return RET_PF_CONTINUE;
-}
-
-static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
- unsigned int access)
-{
- int ret;
-
- fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
- smp_rmb();
-
- /*
- * Check for a private vs. shared mismatch *after* taking a snapshot of
- * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
- * invalidation notifier.
- */
- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
- kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
- return -EFAULT;
- }
-
/*
* Check for a relevant mmu_notifier invalidation event before getting
* the pfn from the primary MMU, and before acquiring mmu_lock.
@@ -4458,6 +4458,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (ret != RET_PF_CONTINUE)
return ret;
+faultin_done:
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);
--
2.44.0.278.ge034bb2e1d-goog
Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just
after the private versus shared check, so that there's no need to
repeatedly query whether or not a slot exists. This also makes it more
obvious that, except for private vs. shared attributes, the process of
faulting in a pfn simply doesn't apply to gfns without a slot.
Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so
that it doesn't need to be duplicated in all paths that invoke
kvm_handle_noslot_fault(), and to minimize the probability of not stuffing
the right fields.
Keep the existing check, but convert it to a WARN, to guard
against __kvm_faultin_pfn() unexpectedly nullifying fault->slot.
Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
2 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8aa957f0a717..4dee0999a66e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3322,6 +3322,10 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
access & shadow_mmio_access_mask);
+ fault->slot = NULL;
+ fault->pfn = KVM_PFN_NOSLOT;
+ fault->map_writable = false;
+
/*
* If MMIO caching is disabled, emulate immediately without
* touching the shadow page tables as attempting to install an
@@ -4393,15 +4397,18 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return -EFAULT;
}
+ if (unlikely(!slot))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved. This ensures any existing SPTEs for the old memslot will
* be zapped before KVM inserts a new MMIO SPTE for the gfn.
*/
- if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
+ if (slot->flags & KVM_MEMSLOT_INVALID)
return RET_PF_RETRY;
- if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
+ if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
/*
* Don't map L1's APIC access page into L2, KVM doesn't support
* using APICv/AVIC to accelerate L2 accesses to L1's APIC,
@@ -4413,12 +4420,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* uses different roots for L1 vs. L2, i.e. there is no danger
* of breaking APICv/AVIC for L1.
*/
- if (is_guest_mode(vcpu)) {
- fault->slot = NULL;
- fault->pfn = KVM_PFN_NOSLOT;
- fault->map_writable = false;
- goto faultin_done;
- }
+ if (is_guest_mode(vcpu))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
/*
* If the APIC access page exists but is disabled, go directly
* to emulation without caching the MMIO access or creating a
@@ -4429,6 +4433,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return RET_PF_EMULATE;
}
+ fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+ smp_rmb();
+
/*
* Check for a relevant mmu_notifier invalidation event before getting
* the pfn from the primary MMU, and before acquiring mmu_lock.
@@ -4450,19 +4457,17 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
* to detect retry guarantees the worst case latency for the vCPU.
*/
- if (fault->slot &&
- mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+ if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
return RET_PF_RETRY;
ret = __kvm_faultin_pfn(vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;
-faultin_done:
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);
- if (unlikely(!fault->slot))
+ if (WARN_ON_ONCE(!fault->slot))
return kvm_handle_noslot_fault(vcpu, fault, access);
/*
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d7c10d338f14..74736d517e74 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -235,7 +235,7 @@ struct kvm_page_fault {
/* The memslot containing gfn. May be NULL. */
struct kvm_memory_slot *slot;
- /* Outputs of kvm_faultin_pfn. */
+ /* Outputs of kvm_faultin_pfn. */
unsigned long mmu_seq;
kvm_pfn_t pfn;
hva_t hva;
--
2.44.0.278.ge034bb2e1d-goog
Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot"
fault to ensure that KVM doesn't use a bogus virtual address, e.g. if
there *was* a slot but it's unusable (APIC access page), or if there
really was no slot, in which case fault->hva will be '0' (which is a
legal address for x86).
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4dee0999a66e..43f24a74571a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3325,6 +3325,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
+ fault->hva = KVM_HVA_ERR_BAD;
/*
* If MMIO caching is disabled, emulate immediately without
--
2.44.0.278.ge034bb2e1d-goog
WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully
handle the unexpected behavior instead of continuing on with dangerous
state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot,
and so could install a bogus PFN into the guest.
The existing code is functionally ok, because kvm_faultin_pfn() pre-checks
all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily
unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same
memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set.
And checking only fault->slot would fall apart if KVM ever added a flag or
condition that forced emulation, similar to how KVM handles writes to
read-only memslots.
Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 43f24a74571a..cedacb1b89c5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4468,7 +4468,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);
- if (WARN_ON_ONCE(!fault->slot))
+ if (WARN_ON_ONCE(!fault->slot || is_noslot_pfn(fault->pfn)))
return kvm_handle_noslot_fault(vcpu, fault, access);
/*
--
2.44.0.278.ge034bb2e1d-goog
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault()
to harden KVM against using "uninitialized" values. In quotes because the
fields are actually zero-initialized, and zero is a legal value for both
page frame numbers and virtual addresses. E.g. failure to set "pfn" prior
to creating an SPTE could result in KVM pointing at physical address '0',
which is far less desirable than KVM generating a SPTE with reserved PA
bits set and thus effectively killing the VM.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 74736d517e74..67e32dec9424 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -307,6 +307,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
.is_private = err & PFERR_PRIVATE_ACCESS,
+
+ .pfn = KVM_PFN_ERR_FAULT,
+ .hva = KVM_HVA_ERR_BAD,
};
int r;
--
2.44.0.278.ge034bb2e1d-goog
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
guests, as specified by the APM:
* Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
* Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
* Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission
check failure, 0 otherwise.
* Bit 37 (SSS): Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
enabled.
Note, the APM is *extremely* misleading, and strongly implies that the
above flags can _only_ be set for #NPF exits from SNP guests. That is a
lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
flavor of SEV guest on SNP capable hardware.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88cc523bafa8..1e69743ef0fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,8 +261,12 @@ enum x86_intercept_stage;
#define PFERR_FETCH_MASK BIT(4)
#define PFERR_PK_MASK BIT(5)
#define PFERR_SGX_MASK BIT(15)
+#define PFERR_GUEST_RMP_MASK BIT_ULL(31)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(32)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(33)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(34)
+#define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
+#define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
--
2.44.0.278.ge034bb2e1d-goog
Move the sanity check that hardware never sets bits that collide with KVM-
defined synthetic bits from kvm_mmu_page_fault() to npf_interception(),
i.e. make the sanity check #NPF specific. The legacy #PF path already
WARNs if _any_ of bits 63:32 are set, and the error code that comes from
VMX's EPT Violation and Misconfig is 100% synthesized (KVM morphs VMX's
EXIT_QUALIFICATION into error code flags).
Add a compile-time assert in the legacy #PF handler to make sure that KVM-
defined flags are covered by its existing sanity check on the upper bits.
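Looking ahead (not part of this series), doing the sanity check in
npf_interception() is also what will allow SNP to derive the synthetic
private flag from the hardware-provided encryption bit, e.g. something like
the below, completely untested sketch (sev_snp_guest() is assumed from the
in-flight SNP host patches):

  	/*
  	 * Untested sketch: on SNP-capable hardware, the #NPF error code
  	 * reports the guest's effective C-bit, which is the source of
  	 * truth for private vs. shared for SNP guests.
  	 */
  	if (sev_snp_guest(vcpu->kvm) && (error_code & PFERR_GUEST_ENC_MASK))
  		error_code |= PFERR_PRIVATE_ACCESS;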
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 +++---------
arch/x86/kvm/svm/svm.c | 9 +++++++++
2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5d892bd59c97..bd342ebd0809 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4561,6 +4561,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
if (WARN_ON_ONCE(error_code >> 32))
error_code = lower_32_bits(error_code);
+ /* Ensure the above sanity check also covers KVM-defined flags. */
+ BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
+
vcpu->arch.l1tf_flush_l1d = true;
if (!flags) {
trace_kvm_page_fault(vcpu, fault_address, error_code);
@@ -5845,15 +5848,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
int r, emulation_type = EMULTYPE_PF;
bool direct = vcpu->arch.mmu->root_role.direct;
- /*
- * WARN if hardware generates a fault with an error code that collides
- * with KVM-defined synthetic flags. Clear the flags and continue on,
- * i.e. don't terminate the VM, as KVM can't possibly be relying on a
- * flag that KVM doesn't know about.
- */
- if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
- error_code &= ~PFERR_SYNTHETIC_MASK;
-
if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e90b429c84f1..199c4dd8d214 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2055,6 +2055,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
+ /*
+ * WARN if hardware generates a fault with an error code that collides
+ * with KVM-defined synthetic flags. Clear the flags and continue on,
+ * i.e. don't terminate the VM, as KVM can't possibly be relying on a
+ * flag that KVM doesn't know about.
+ */
+ if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
+ error_code &= ~PFERR_SYNTHETIC_MASK;
+
trace_kvm_page_fault(vcpu, fault_address, error_code);
return kvm_mmu_page_fault(vcpu, fault_address, error_code,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
--
2.44.0.278.ge034bb2e1d-goog
Allow mapping KVM's internal memslots used for EPT without unrestricted
guest into L2, i.e. allow mapping the hidden TSS and the identity mapped
page tables into L2. Unlike the APIC access page, there is no correctness
issue with letting L2 access the "hidden" memory. Allowing these memslots
to be mapped into L2 fixes a largely theoretical bug where KVM could
incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures
consistent KVM behavior for L2.
If KVM is using TDP, but L1 is using shadow paging for L2, then routing
through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because
kvm_mmu_page_role.guest_mode ensures KVM uses different roots for L1 vs.
L2. But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to
incorrectly treat an L1 access to the hidden TSS or identity mapped page
tables as MMIO.
Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't
actually prevent exposing KVM's internal memslots to L2, it simply forces
KVM to emulate the access. In most cases, that will trigger MMIO,
amusingly due to filling vcpu->arch.mmio_gfn, but also because
vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC
accesses are ok. But the hidden TSS and identity mapped page tables could
go either way (MMIO or access the private memslot's backing memory).
Alternatively, the inconsistent emulator behavior could be addressed by
forcing MMIO emulation for L2 access to all internal memslots, not just to
the APIC. But that's arguably less correct than letting L2 access the
hidden TSS and identity mapped page tables, not to mention that it's
*extremely* unlikely anyone cares what KVM does in this case. From L1's
perspective there is R/W memory at those memslots, the memory just happens
to be initialized with non-zero data. Making the memory disappear when it
is accessed by L2 is far more magical and arbitrary than the memory
existing in the first place.
The APIC access page is special because KVM _must_ emulate the access to
do the right thing (emulate an APIC access instead of reading/writing the
APIC access page). And despite what commit 3a2936dedd20 ("kvm: mmu: Don't
expose private memslots to L2") said, it's not just necessary when L1 is
accelerating L2's virtual APIC, it's just as important (likely *more*
important) for correctness when L1 is passing through its own APIC to L2.
Fixes: 3a2936dedd20 ("kvm: mmu: Don't expose private memslots to L2")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 58c5ae8be66c..5c8caab64ba2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4346,8 +4346,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
return RET_PF_RETRY;
- if (!kvm_is_visible_memslot(slot)) {
- /* Don't expose private memslots to L2. */
+ if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
+ /*
+ * Don't map L1's APIC access page into L2, KVM doesn't support
+ * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
+ * i.e. the access needs to be emulated. Emulating access to
+ * L1's APIC is also correct if L1 is accelerating L2's own
+ * virtual APIC, but for some reason L1 also maps _L1's_ APIC
+ * into L2. Note, vcpu_is_mmio_gpa() always treats access to
+ * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
+ * uses different roots for L1 vs. L2, i.e. there is no danger
+ * of breaking APICv/AVIC for L1.
+ */
if (is_guest_mode(vcpu)) {
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
@@ -4360,8 +4370,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* MMIO SPTE. That way the cache doesn't need to be purged
* when the AVIC is re-enabled.
*/
- if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
- !kvm_apicv_activated(vcpu->kvm))
+ if (!kvm_apicv_activated(vcpu->kvm))
return RET_PF_EMULATE;
}
--
2.44.0.278.ge034bb2e1d-goog
On 2/27/24 18:41, Sean Christopherson wrote:
> Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
> guests, as specified by the APM:
>
> * Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
> * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
> PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
> * Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission
> check failure, 0 otherwise.
> * Bit 37 (SSS): Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
> enabled.
The above bits 34-37 do not match with the bits 31,34-36 in the patch.
Dongli Zhang
>
> Note, the APM is *extremely* misleading, and strongly implies that the
> above flags can _only_ be set for #NPF exits from SNP guests. That is a
> lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
> flavor of SEV guest on SNP capable hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88cc523bafa8..1e69743ef0fb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -261,8 +261,12 @@ enum x86_intercept_stage;
> #define PFERR_FETCH_MASK BIT(4)
> #define PFERR_PK_MASK BIT(5)
> #define PFERR_SGX_MASK BIT(15)
> +#define PFERR_GUEST_RMP_MASK BIT_ULL(31)
> #define PFERR_GUEST_FINAL_MASK BIT_ULL(32)
> #define PFERR_GUEST_PAGE_MASK BIT_ULL(33)
> +#define PFERR_GUEST_ENC_MASK BIT_ULL(34)
> +#define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
> +#define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
> #define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
>
> #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
On 2/27/24 18:41, Sean Christopherson wrote:
> From: Isaku Yamahata <[email protected]>
>
> Plumb the full 64-bit error code throughout the page fault handling code
> so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
> will be used to determine whether or not a fault is private vs. shared.
>
> Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
> the behavior of permission_fault() when invoked in the page fault path, as
> KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
May this lead to a WARN_ON_ONCE?
5843 int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u64 error_code,
5844 void *insn, int insn_len)
5845 {
.. ...
5856 */
5857 if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
5858 error_code &= ~PFERR_IMPLICIT_ACCESS;
>
> Continue passing '0' from the async #PF worker, as guest_memfd() and thus
:s/guest_memfd()/guest_memfd/ ?
Dongli Zhang
> private memory doesn't support async page faults.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> [mdr: drop references/changes on rebase, update commit message]
> Signed-off-by: Michael Roth <[email protected]>
> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 3 +--
> arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
> arch/x86/kvm/mmu/mmutrace.h | 2 +-
> 3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2fd74e06ff8..408969ac1291 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5860,8 +5860,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> }
>
> if (r == RET_PF_INVALID) {
> - r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> - lower_32_bits(error_code), false,
> + r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> &emulation_type);
> if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0eea6c5a824d..1fab1f2359b5 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
> struct kvm_page_fault {
> /* arguments to kvm_mmu_do_page_fault. */
> const gpa_t addr;
> - const u32 error_code;
> + const u64 error_code;
> const bool prefetch;
>
> /* Derived from error_code. */
> @@ -288,7 +288,7 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> }
>
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - u32 err, bool prefetch, int *emulation_type)
> + u64 err, bool prefetch, int *emulation_type)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..195d98bc8de8 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -260,7 +260,7 @@ TRACE_EVENT(
> TP_STRUCT__entry(
> __field(int, vcpu_id)
> __field(gpa_t, cr2_or_gpa)
> - __field(u32, error_code)
> + __field(u64, error_code)
> __field(u64 *, sptep)
> __field(u64, old_spte)
> __field(u64, new_spte)
On Tue, Feb 27, 2024, Dongli Zhang wrote:
>
>
> On 2/27/24 18:41, Sean Christopherson wrote:
> > Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
> > guests, as specified by the APM:
> >
> > * Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
> > * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
> > PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
> > * Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission
> > check failure, 0 otherwise.
> > * Bit 37 (SSS): Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
> > enabled.
>
> The above bits 34-37 do not match with the bits 31,34-36 in the patch.
Doh, good catch. I copy+pasted this from the APM, but the RMP bit is defined
slightly earlier in the APM, and I missed SSS. I'll fixup the changelog to talk
about RMO, and I think I'll add SSS in v2; at the very least, having the #define
will make it clear which bits are used.
Thanks!
On Tue, Feb 27, 2024, Dongli Zhang wrote:
>
>
> On 2/27/24 18:41, Sean Christopherson wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Plumb the full 64-bit error code throughout the page fault handling code
> > so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
> > will be used to determine whether or not a fault is private vs. shared.
> >
> > Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
> > the behavior of permission_fault() when invoked in the page fault path, as
> > KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
>
> May this lead to a WARN_ON_ONCE?
>
> 5843 int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> u64 error_code,
> 5844 void *insn, int insn_len)
> 5845 {
> ... ...
> 5856 */
> 5857 if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
> 5858 error_code &= ~PFERR_IMPLICIT_ACCESS;
Nope, it shouldn't. PFERR_IMPLICIT_ACCESS is a synthetic, KVM-defined flag, and
should never be in the error code passed to kvm_mmu_page_fault(). If the WARN
fires, it means hardware (specifically, AMD CPUs for #NPF) has started using the
bit for something, and that we need to update KVM to use a different bit.
> > Continue passing '0' from the async #PF worker, as guest_memfd() and thus
>
> :s/guest_memfd()/guest_memfd/ ?
I've been styling it as guest_memfd() to make it look like a syscall, e.g. like
memfd_create(), when I'm talking about a file that was created by userspace, as
opposed to GUEST_MEMFD when I'm talking about the ioctl() itself.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 408969ac1291..7807bdcd87e8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5839,19 +5839,31 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> /*
> - * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
> - * checks when emulating instructions that triggers implicit access.
> * WARN if hardware generates a fault with an error code that collides
> - * with the KVM-defined value. Clear the flag and continue on, i.e.
> - * don't terminate the VM, as KVM can't possibly be relying on a flag
> - * that KVM doesn't know about.
> + * with KVM-defined synthetic flags. Clear the flags and continue on,
> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> + * flag that KVM doesn't know about.
> */
> - if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
> - error_code &= ~PFERR_IMPLICIT_ACCESS;
> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> + error_code &= ~PFERR_SYNTHETIC_MASK;
>
Hmm.. I thought for TDX the caller -- handle_ept_violation() -- should
explicitly set the PFERR_PRIVATE_ACCESS so that here the fault handler can
figure out the fault is private.
Now it seems the caller should never pass PFERR_PRIVATE_ACCESS, then ...
> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> + /*
> + * Except for reserved faults (emulated MMIO is shared-only), set the
> + * private flag for software-protected VMs based on the gfn's current
> + * attributes, which are the source of truth for such VMs. Note, this
> + * is wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
> + * currently support nested virtualization (among many other things)
> + * for software-protected VMs.
> + */
> + if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> + !(error_code & PFERR_RSVD_MASK) &&
> + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + error_code |= PFERR_PRIVATE_ACCESS;
> +
>
... I am wondering how we figure out whether a fault is private for TDX?
On Wed, Feb 28, 2024 at 3:46 AM Sean Christopherson <[email protected]> wrote:
>
> Open code the bit number directly in the PFERR_* masks and drop the
> intermediate PFERR_*_BIT defines, as having to bounce through two macros
> just to see which flag corresponds to which bit is quite annoying, as is
> having to define two macros just to add recognition of a new flag.
>
> Use ilog2() to derive the bit in permission_fault(), the one function that
> actually needs the bit number (it does clever shifting to manipulate flags
> in order to avoid conditional branches).
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 32 ++++++++++----------------------
> arch/x86/kvm/mmu.h | 4 ++--
> 2 files changed, 12 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aaf5a25ea7ed..88cc523bafa8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -254,28 +254,16 @@ enum x86_intercept_stage;
> KVM_GUESTDBG_INJECT_DB | \
> KVM_GUESTDBG_BLOCKIRQ)
>
> -
> -#define PFERR_PRESENT_BIT 0
> -#define PFERR_WRITE_BIT 1
> -#define PFERR_USER_BIT 2
> -#define PFERR_RSVD_BIT 3
> -#define PFERR_FETCH_BIT 4
> -#define PFERR_PK_BIT 5
> -#define PFERR_SGX_BIT 15
> -#define PFERR_GUEST_FINAL_BIT 32
> -#define PFERR_GUEST_PAGE_BIT 33
> -#define PFERR_IMPLICIT_ACCESS_BIT 48
> -
> -#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
> -#define PFERR_WRITE_MASK BIT(PFERR_WRITE_BIT)
> -#define PFERR_USER_MASK BIT(PFERR_USER_BIT)
> -#define PFERR_RSVD_MASK BIT(PFERR_RSVD_BIT)
> -#define PFERR_FETCH_MASK BIT(PFERR_FETCH_BIT)
> -#define PFERR_PK_MASK BIT(PFERR_PK_BIT)
> -#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
> -#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
> -#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
> -#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
> +#define PFERR_PRESENT_MASK BIT(0)
> +#define PFERR_WRITE_MASK BIT(1)
> +#define PFERR_USER_MASK BIT(2)
> +#define PFERR_RSVD_MASK BIT(3)
> +#define PFERR_FETCH_MASK BIT(4)
> +#define PFERR_PK_MASK BIT(5)
> +#define PFERR_SGX_MASK BIT(15)
> +#define PFERR_GUEST_FINAL_MASK BIT_ULL(32)
> +#define PFERR_GUEST_PAGE_MASK BIT_ULL(33)
> +#define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
>
> #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
> PFERR_WRITE_MASK | \
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..e8b620a85627 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -213,7 +213,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> */
> u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
> bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
> - int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
> + int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
Just use "(pfec + (not_smap ? PFERR_RSVD_MASK : 0)) >> 1".
Likewise below, "pte_access & PT_USER_MASK ? PFERR_RSVD_MASK : 0".
No need to even check what the compiler produces, it will be either
exactly the same code or a bunch of cmov instructions.
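I.e., in context, something like (untested):

  	int index = (pfec + (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;

and

  	offset = (pfec & ~1) +
  		 (pte_access & PT_USER_MASK ? PFERR_RSVD_MASK : 0);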
Paolo
> u32 errcode = PFERR_PRESENT_MASK;
> bool fault;
>
> @@ -235,7 +235,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>
> /* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
> offset = (pfec & ~1) +
> - ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
> + ((pte_access & PT_USER_MASK) << (ilog2(PFERR_RSVD_MASK) - PT_USER_SHIFT));
>
> pkru_bits &= mmu->pkru_mask >> offset;
> errcode |= -pkru_bits & PFERR_PK_MASK;
> --
> 2.44.0.278.ge034bb2e1d-goog
>
On 2/28/24 08:22, Sean Christopherson wrote:
> On Tue, Feb 27, 2024, Dongli Zhang wrote:
>>
>>
>> On 2/27/24 18:41, Sean Christopherson wrote:
>>> From: Isaku Yamahata <[email protected]>
>>>
>>> Plumb the full 64-bit error code throughout the page fault handling code
>>> so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
>>> will be used to determine whether or not a fault is private vs. shared.
>>>
>>> Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
>>> the behavior of permission_fault() when invoked in the page fault path, as
>>> KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
>>
>> May this lead to a WARN_ON_ONCE?
>>
>> 5843 int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> u64 error_code,
>> 5844 void *insn, int insn_len)
>> 5845 {
>> ... ...
>> 5856 */
>> 5857 if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
>> 5858 error_code &= ~PFERR_IMPLICIT_ACCESS;
>
> Nope, it shouldn't. PFERR_IMPLICIT_ACCESS is a synthetic, KVM-defined flag, and
> should never be in the error code passed to kvm_mmu_page_fault(). If the WARN
> fires, it means hardware (specifically, AMD CPUs for #NPF) has started using the
> bit for something, and that we need to update KVM to use a different bit.
Thank you very much for the explanation.
I see it is impossible to have PFERR_IMPLICIT_ACCESS set here, unless there is
an AMD hardware issue or the Intel page fault handler morphs the error_code erroneously.
I meant that the above commit message confused me when I was reading it. E.g., how
about something like:
"Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
the behavior of permission_fault() when invoked in the page fault path, as it
should never be in the error code because ...."
Thank you very much!
Dongli Zhang
>
>>> Continue passing '0' from the async #PF worker, as guest_memfd() and thus
>>
>> :s/guest_memfd()/guest_memfd/ ?
>
> I've been styling it as guest_memfd() to make it look like a syscall, e.g. like
> memfd_create(), when I'm talking about a file that was created by userspace, as
> opposed to GUEST_MEMFD when I'm talking about the ioctl() itself.
I remember I read somewhere suggesting not to change the headers in selftest.
Just double-check if there is requirement to edit
tools/testing/selftests/kvm/include/x86_64/processor.h.
Dongli Zhang
On 2/27/24 18:41, Sean Christopherson wrote:
> Open code the bit number directly in the PFERR_* masks and drop the
> intermediate PFERR_*_BIT defines, as having to bounce through two macros
> just to see which flag corresponds to which bit is quite annoying, as is
> having to define two macros just to add recognition of a new flag.
>
> Use ilog2() to derive the bit in permission_fault(), the one function that
> actually needs the bit number (it does clever shifting to manipulate flags
> in order to avoid conditional branches).
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 32 ++++++++++----------------------
> arch/x86/kvm/mmu.h | 4 ++--
> 2 files changed, 12 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aaf5a25ea7ed..88cc523bafa8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -254,28 +254,16 @@ enum x86_intercept_stage;
> KVM_GUESTDBG_INJECT_DB | \
> KVM_GUESTDBG_BLOCKIRQ)
>
> -
> -#define PFERR_PRESENT_BIT 0
> -#define PFERR_WRITE_BIT 1
> -#define PFERR_USER_BIT 2
> -#define PFERR_RSVD_BIT 3
> -#define PFERR_FETCH_BIT 4
> -#define PFERR_PK_BIT 5
> -#define PFERR_SGX_BIT 15
> -#define PFERR_GUEST_FINAL_BIT 32
> -#define PFERR_GUEST_PAGE_BIT 33
> -#define PFERR_IMPLICIT_ACCESS_BIT 48
> -
> -#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
> -#define PFERR_WRITE_MASK BIT(PFERR_WRITE_BIT)
> -#define PFERR_USER_MASK BIT(PFERR_USER_BIT)
> -#define PFERR_RSVD_MASK BIT(PFERR_RSVD_BIT)
> -#define PFERR_FETCH_MASK BIT(PFERR_FETCH_BIT)
> -#define PFERR_PK_MASK BIT(PFERR_PK_BIT)
> -#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
> -#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
> -#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
> -#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
> +#define PFERR_PRESENT_MASK BIT(0)
> +#define PFERR_WRITE_MASK BIT(1)
> +#define PFERR_USER_MASK BIT(2)
> +#define PFERR_RSVD_MASK BIT(3)
> +#define PFERR_FETCH_MASK BIT(4)
> +#define PFERR_PK_MASK BIT(5)
> +#define PFERR_SGX_MASK BIT(15)
> +#define PFERR_GUEST_FINAL_MASK BIT_ULL(32)
> +#define PFERR_GUEST_PAGE_MASK BIT_ULL(33)
> +#define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
>
> #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
> PFERR_WRITE_MASK | \
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..e8b620a85627 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -213,7 +213,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> */
> u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
> bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
> - int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
> + int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
> u32 errcode = PFERR_PRESENT_MASK;
> bool fault;
>
> @@ -235,7 +235,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>
> /* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
> offset = (pfec & ~1) +
> - ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
> + ((pte_access & PT_USER_MASK) << (ilog2(PFERR_RSVD_MASK) - PT_USER_SHIFT));
>
> pkru_bits &= mmu->pkru_mask >> offset;
> errcode |= -pkru_bits & PFERR_PK_MASK;
On Thu, Feb 29, 2024, Kai Huang wrote:
>
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 408969ac1291..7807bdcd87e8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5839,19 +5839,31 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > bool direct = vcpu->arch.mmu->root_role.direct;
> >
> > /*
> > - * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
> > - * checks when emulating instructions that triggers implicit access.
> > * WARN if hardware generates a fault with an error code that collides
> > - * with the KVM-defined value. Clear the flag and continue on, i.e.
> > - * don't terminate the VM, as KVM can't possibly be relying on a flag
> > - * that KVM doesn't know about.
> > + * with KVM-defined sythentic flags. Clear the flags and continue on,
> > + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> > + * flag that KVM doesn't know about.
> > */
> > - if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
> > - error_code &= ~PFERR_IMPLICIT_ACCESS;
> > + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> > + error_code &= ~PFERR_SYNTHETIC_MASK;
> >
>
> Hmm.. I thought for TDX the caller -- handle_ept_violation() -- should
> explicitly set the PFERR_PRIVATE_ACCESS so that here the fault handler can
> figure out the fault is private.
>
> Now it seems the caller should never pass PFERR_PRIVATE_ACCESS, then ...
>
> > if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> > return RET_PF_RETRY;
> >
> > + /*
> > + * Except for reserved faults (emulated MMIO is shared-only), set the
> > + * private flag for software-protected VMs based on the gfn's current
> > + * attributes, which are the source of truth for such VMs. Note, this
> > + * wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
> > + * currently supported nested virtualization (among many other things)
> > + * for software-protected VMs.
> > + */
> > + if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > + !(error_code & PFERR_RSVD_MASK) &&
> > + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
> > + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> > + error_code |= PFERR_PRIVATE_ACCESS;
> > +
> >
>
> ... I am wondering how we figure out whether a fault is private for TDX?
Read the next few patches :-)
The sanity check gets moved to the legacy #PF handler (any error code with bits
63:32!=0 yells) and SVM's #NPF handler (error code with synthetic bits set yells),
leaving VMX free and clear to stuff PFERR_PRIVATE_ACCESS as appropriate.
On Thu, Feb 29, 2024, Dongli Zhang wrote:
> I remember I read somewhere suggesting not to change the headers in selftest.
The "suggestion" is to not update the headers that perf tooling copies verbatim
from the kernel, e.g. tools/include/uapi/linux/kvm.h. The duplicates in tools/
aren't used by KVM selftests, it's purely perf that needs identical copies from
the kernel tree, so I strongly prefer to leave it to the perf folks to deal with
synchronizing the headers as needed.
> Just double-check if there is requirement to edit
> tools/testing/selftests/kvm/include/x86_64/processor.h.
This header is a KVM selftests specific header that is independent from the kernel
headers. It does have _some_ copy+paste, mostly for architecturally defined
bits and bobs, but it's not a straight copy of any kernel header.
That said, yes, I think we should also clean up x86_64/processor.h. That can be
done in a one-off standalone patch though.
On Thu, Feb 29, 2024, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 3:46 AM Sean Christopherson <[email protected]> wrote:
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 60f21bb4c27b..e8b620a85627 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -213,7 +213,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > */
> > u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
> > bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
> > - int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
> > + int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
>
> Just use "(pfec + (not_smap ? PFERR_RSVD_MASK : 0)) >> 1".
>
> Likewise below, "pte_access & PT_USER_MASK ? PFERR_RSVD_MASK : 0"/
>
> No need to even check what the compiler produces, it will be either
> exactly the same code or a bunch of cmov instructions.
I couldn't resist :-)
The second one generates identical code, but for this one:
int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
gcc generates almost bizarrely different code in the call from vcpu_mmio_gva_to_gpa().
clang is clever enough to realize "pfec" can only contain USER_MASK and/or WRITE_MASK,
and so does a ton of dead code elimination and other optimizations. But for some
reason, gcc doesn't appear to realize that, and generates a MOVSX when computing
"index", i.e. sign-extends the result of the ADD (at least, I think that's what it's
doing).
There's no actual bug today, and the vcpu_mmio_gva_to_gpa() path is super safe
since KVM fully controls the error code. But the call from FNAME(walk_addr_generic)
uses a _much_ more dynamic error code.
If an error code with unexpected bits set managed to get into permission_fault(),
I'm pretty sure we'd end up with out-of-bounds accesses. KVM sanity checks that
PK and RSVD aren't set,
WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
but KVM unnecessarily uses an ADD instead of OR, here
int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
and here
/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) +
((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
i.e. if the WARN fired, KVM would generate completely unexpected values due to
adding two RSVD bit flags.
And if _really_ unexpected flags make their way into permission_fault(), e.g. the
upcoming RMP flag (bit 31) or Intel's SGX flag (bit 15), then the use of index
fault = (mmu->permissions[index] >> pte_access) & 1;
could generate a read waaay outside of the array.  It can't/shouldn't happen in
practice since KVM shouldn't be trying to emulate RMP violations or faults in SGX
enclaves, but it's unnecessarily dangerous.
Long story short, I think we should get to the below (I'll post a separate series,
assuming I'm not missing something).
unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
unsigned int pfec = access & (PFERR_PRESENT_MASK |
PFERR_WRITE_MASK |
PFERR_USER_MASK |
PFERR_FETCH_MASK);
/*
* For explicit supervisor accesses, SMAP is disabled if EFLAGS.AC = 1.
* For implicit supervisor accesses, SMAP cannot be overridden.
*
* SMAP works on supervisor accesses only, and not_smap can
* be set or not set when user access with neither has any bearing
* on the result.
*
* We put the SMAP checking bit in place of the PFERR_RSVD_MASK bit;
* this bit will always be zero in pfec, but it will be one in index
* if SMAP checks are being disabled.
*/
u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
u32 errcode = PFERR_PRESENT_MASK;
bool fault;
kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
fault = (mmu->permissions[index] >> pte_access) & 1;
/*
* Sanity check that no bits are set in the legacy #PF error code
* (bits 31:0) other than the supported permission bits (see above).
*/
WARN_ON_ONCE(pfec != (unsigned int)access);
if (unlikely(mmu->pkru_mask)) {
u32 pkru_bits, offset;
/*
* PKRU defines 32 bits, there are 16 domains and 2
* attribute bits per domain in pkru. pte_pkey is the
* index of the protection domain, so pte_pkey * 2 is
* is the index of the first bit for the domain.
*/
pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
offset = (pfec & ~1) | (pte_access & PT_USER_MASK ? PFERR_RSVD_MASK : 0);
pkru_bits &= mmu->pkru_mask >> offset;
errcode |= -pkru_bits & PFERR_PK_MASK;
fault |= (pkru_bits != 0);
}
return -(u32)fault & errcode;
On Thu, Feb 29, 2024 at 7:40 PM Sean Christopherson <[email protected]> wrote:
> Long story short, I think we should get to the below (I'll post a separate series,
> assuming I'm not missing something).
>
> unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
> unsigned int pfec = access & (PFERR_PRESENT_MASK |
> PFERR_WRITE_MASK |
> PFERR_USER_MASK |
> PFERR_FETCH_MASK);
>
> /*
> * For explicit supervisor accesses, SMAP is disabled if EFLAGS.AC = 1.
> * For implicit supervisor accesses, SMAP cannot be overridden.
> *
> * SMAP works on supervisor accesses only, and not_smap can
> * be set or not set when user access with neither has any bearing
> * on the result.
> *
> * We put the SMAP checking bit in place of the PFERR_RSVD_MASK bit;
> * this bit will always be zero in pfec, but it will be one in index
> * if SMAP checks are being disabled.
> */
> u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
> bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
> int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
> u32 errcode = PFERR_PRESENT_MASK;
> bool fault;
Sounds good. The whole series is
Reviewed-by: Paolo Bonzini <[email protected]>
apart from the small nits that were pointed out here and there.
Paolo
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
I found "legacy #PF" is a little bit confusing but I couldn't figure out
a better name either :-)
> as the error code for #PF is limited to 32 bits (and in practice, 16 bits
> on Intel CPUS). This behavior is architectural, is part of KVM's ABI
> (see kvm_vcpu_events.error_code), and is explicitly documented as being
> preserved for intecerpted #PF in both the APM:
>
> The error code saved in EXITINFO1 is the same as would be pushed onto
> the stack by a non-intercepted #PF exception in protected mode.
>
> and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
> 32-bit field.
>
> Simply drop the upper bits of hardware provides garbage, as spurious
"of" -> "if" ?
> information should do no harm (though in all likelihood hardware is buggy
> and the kernel is doomed).
>
> Handling all upper 32 bits in the #PF path will allow moving the sanity
> check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
> which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
> PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
>
> Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
"this" -> "this is" ?
> and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
> bit minimizes the probability of a collision with the "other" vendor,
> without needing to plumb more bits through microcode.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7807bdcd87e8..5d892bd59c97 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> if (WARN_ON_ONCE(fault_address >> 32))
> return -EFAULT;
> #endif
> + /*
> + * Legacy #PF exception only have a 32-bit error code. Simply drop the
"have" -> "has" ?
> + * upper bits as KVM doesn't use them for #PF (because they are never
> + * set), and to ensure there are no collisions with KVM-defined bits.
> + */
> + if (WARN_ON_ONCE(error_code >> 32))
> + error_code = lower_32_bits(error_code);
>
> vcpu->arch.l1tf_flush_l1d = true;
> if (!flags) {
Reviewed-by: Kai Huang <[email protected]>
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Move the sanity check that hardware never sets bits that collide with KVM-
> define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
> i.e. make the sanity check #NPF specific. The legacy #PF path already
> WARNs if _any_ of bits 63:32 are set, and the error code that comes from
> VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
> EXIT_QUALIFICATION into error code flags).
>
> Add a compile-time assert in the legacy #PF handler to make sure that KVM-
> define flags are covered by its existing sanity check on the upper bits.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 12 +++---------
> arch/x86/kvm/svm/svm.c | 9 +++++++++
> 2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5d892bd59c97..bd342ebd0809 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4561,6 +4561,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> if (WARN_ON_ONCE(error_code >> 32))
> error_code = lower_32_bits(error_code);
>
> + /* Ensure the above sanity check also covers KVM-defined flags. */
> + BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
> +
Could you explain why adding this BUILD_BUG_ON() here, but not ...
> vcpu->arch.l1tf_flush_l1d = true;
> if (!flags) {
> trace_kvm_page_fault(vcpu, fault_address, error_code);
> @@ -5845,15 +5848,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> int r, emulation_type = EMULTYPE_PF;
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> - /*
> - * WARN if hardware generates a fault with an error code that collides
> - * with KVM-defined sythentic flags. Clear the flags and continue on,
> - * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> - * flag that KVM doesn't know about.
> - */
> - if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> - error_code &= ~PFERR_SYNTHETIC_MASK;
> -
> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e90b429c84f1..199c4dd8d214 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2055,6 +2055,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> u64 fault_address = svm->vmcb->control.exit_info_2;
> u64 error_code = svm->vmcb->control.exit_info_1;
>
> + /*
> + * WARN if hardware generates a fault with an error code that collides
> + * with KVM-defined sythentic flags. Clear the flags and continue on,
> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> + * flag that KVM doesn't know about.
> + */
> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> + error_code &= ~PFERR_SYNTHETIC_MASK;
> +
> trace_kvm_page_fault(vcpu, fault_address, error_code);
> return kvm_mmu_page_fault(vcpu, fault_address, error_code,
> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
.. in npf_interception() or some common place like in
kvm_mmu_page_fault()?
Otherwise,
Reviewed-by: Kai Huang <[email protected]>
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> WARN and skip the emulated MMIO fastpath if a private, reserved page fault
> is encountered, as private+reserved should be an impossible combination
> (KVM should never create an MMIO SPTE for a private access).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd342ebd0809..9206cfa58feb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> error_code |= PFERR_PRIVATE_ACCESS;
>
> r = RET_PF_INVALID;
> - if (unlikely(error_code & PFERR_RSVD_MASK)) {
> + if (unlikely((error_code & PFERR_RSVD_MASK) &&
> + !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
> if (r == RET_PF_EMULATE)
> goto emulate;
It seems this will make KVM continue to call kvm_mmu_do_page_fault()
when such a private+reserved error code actually happens (e.g., due to
a bug), because @r is still RET_PF_INVALID in that case.
Is it better to just return an error, e.g., -EINVAL, and give up?
On Fri, Mar 01, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > Move the sanity check that hardware never sets bits that collide with KVM-
> > define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
> > i.e. make the sanity check #NPF specific. The legacy #PF path already
> > WARNs if _any_ of bits 63:32 are set, and the error code that comes from
> > VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
> > EXIT_QUALIFICATION into error code flags).
> >
> > Add a compile-time assert in the legacy #PF handler to make sure that KVM-
> > define flags are covered by its existing sanity check on the upper bits.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 12 +++---------
> > arch/x86/kvm/svm/svm.c | 9 +++++++++
> > 2 files changed, 12 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5d892bd59c97..bd342ebd0809 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4561,6 +4561,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > if (WARN_ON_ONCE(error_code >> 32))
> > error_code = lower_32_bits(error_code);
> > + /* Ensure the above sanity check also covers KVM-defined flags. */
> > + BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
> > +
>
> Could you explain why adding this BUILD_BUG_ON() here, but not ...
>
> > vcpu->arch.l1tf_flush_l1d = true;
> > if (!flags) {
> > trace_kvm_page_fault(vcpu, fault_address, error_code);
> > @@ -5845,15 +5848,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > int r, emulation_type = EMULTYPE_PF;
> > bool direct = vcpu->arch.mmu->root_role.direct;
> > - /*
> > - * WARN if hardware generates a fault with an error code that collides
> > - * with KVM-defined sythentic flags. Clear the flags and continue on,
> > - * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> > - * flag that KVM doesn't know about.
> > - */
> > - if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> > - error_code &= ~PFERR_SYNTHETIC_MASK;
> > -
> > if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> > return RET_PF_RETRY;
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index e90b429c84f1..199c4dd8d214 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2055,6 +2055,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> > u64 fault_address = svm->vmcb->control.exit_info_2;
> > u64 error_code = svm->vmcb->control.exit_info_1;
> > + /*
> > + * WARN if hardware generates a fault with an error code that collides
> > + * with KVM-defined sythentic flags. Clear the flags and continue on,
> > + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> > + * flag that KVM doesn't know about.
> > + */
> > + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> > + error_code &= ~PFERR_SYNTHETIC_MASK;
> > +
> > trace_kvm_page_fault(vcpu, fault_address, error_code);
> > return kvm_mmu_page_fault(vcpu, fault_address, error_code,
> > static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>
> ... in npf_interception() or
The intent of the BUILD_BUG_ON() is to ensure that kvm_handle_page_fault()'s
sanity check on bits 63:32 also serves as a sanity check that hardware doesn't
generate an error code that collides with any of KVM's synthetic flags.
E.g. if we were to add a KVM-defined flag in the lower 32 bits, then the #NPF
path would Just Work, because it already sanity checks all synthetic bits. But
the #PF path would need new code, thus the BUILD_BUG_ON() to scream that new code
is needed.
> some common place like in kvm_mmu_page_fault()?
Because again, the logic being enforced is very specific to intercepted #PFs.
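To make that concrete, a purely hypothetical example (PFERR_FOO is invented for
illustration and is not part of the series): defining a new synthetic flag in
the lower 32 bits would trip the assert at compile time, flagging that the
legacy #PF path needs new code.

	#define PFERR_FOO		BIT_ULL(30)	/* hypothetical low-bit synthetic flag */
	#define PFERR_SYNTHETIC_MASK	(PFERR_IMPLICIT_ACCESS | \
					 PFERR_PRIVATE_ACCESS | PFERR_FOO)

	/* lower_32_bits(PFERR_SYNTHETIC_MASK) is now non-zero => build error. */
	BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));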
On Fri, Mar 01, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > WARN and skip the emulated MMIO fastpath if a private, reserved page fault
> > is encountered, as private+reserved should be an impossible combination
> > (KVM should never create an MMIO SPTE for a private access).
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index bd342ebd0809..9206cfa58feb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > error_code |= PFERR_PRIVATE_ACCESS;
> > r = RET_PF_INVALID;
> > - if (unlikely(error_code & PFERR_RSVD_MASK)) {
> > + if (unlikely((error_code & PFERR_RSVD_MASK) &&
> > + !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> > r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
> > if (r == RET_PF_EMULATE)
> > goto emulate;
>
> It seems this will make KVM continue to call kvm_mmu_do_page_fault() when
> such a private+reserved error code actually happens (e.g., due to a bug), because
> @r is still RET_PF_INVALID in that case.
Yep.
> Is it better to just return an error, e.g., -EINVAL, and give up?
As long as there is no obvious/immediate danger to the host, no obvious way for
the "bad" behavior to cause data corruption for the guest, and continuing on has
a plausible chance of working, then KVM should generally try to continue on and
not terminate the VM.
E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
and treat the fault like a PRIVATE fault. Hmm, but page_fault_handle_page_track()
would skip write tracking, which could theoretically cause data corruption, so I
guess arguably it would be safer to bail?
Anyone else have an opinion? This type of bug should never escape development,
so I'm a-ok effectively killing the VM. Unless someone has a good argument for
continuing on, I'll go with Kai's suggestion and squash this:
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cedacb1b89c5..d796a162b2da 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
error_code |= PFERR_PRIVATE_ACCESS;
r = RET_PF_INVALID;
- if (unlikely((error_code & PFERR_RSVD_MASK) &&
- !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
+ if (unlikely(error_code & PFERR_RSVD_MASK)) {
+ if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
+ return -EFAULT;
+
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
if (r == RET_PF_EMULATE)
goto emulate;
On Fri, Mar 01, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
>
> I found "legacy #PF" is a little bit confusing but I couldn't figure out a
> better name either :-)
>
> > as the error code for #PF is limited to 32 bits (and in practice, 16 bits
> > on Intel CPUS). This behavior is architectural, is part of KVM's ABI
> > (see kvm_vcpu_events.error_code), and is explicitly documented as being
> > preserved for intecerpted #PF in both the APM:
> >
> > The error code saved in EXITINFO1 is the same as would be pushed onto
> > the stack by a non-intercepted #PF exception in protected mode.
> >
> > and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
> > 32-bit field.
> >
> > Simply drop the upper bits of hardware provides garbage, as spurious
>
> "of" -> "if" ?
>
> > information should do no harm (though in all likelihood hardware is buggy
> > and the kernel is doomed).
> >
> > Handling all upper 32 bits in the #PF path will allow moving the sanity
> > check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
> > which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
> > PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
> >
> > Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
>
> "this" -> "this is" ?
>
> > and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
> > bit minimizes the probability of a collision with the "other" vendor,
> > without needing to plumb more bits through microcode.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 7807bdcd87e8..5d892bd59c97 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > if (WARN_ON_ONCE(fault_address >> 32))
> > return -EFAULT;
> > #endif
> > + /*
> > + * Legacy #PF exception only have a 32-bit error code. Simply drop the
>
> "have" -> "has" ?
This one I'll fix by making "exception" plural.
Thanks much for the reviews!
>
> > + * upper bits as KVM doesn't use them for #PF (because they are never
> > + * set), and to ensure there are no collisions with KVM-defined bits.
> > + */
> > + if (WARN_ON_ONCE(error_code >> 32))
> > + error_code = lower_32_bits(error_code);
> > vcpu->arch.l1tf_flush_l1d = true;
> > if (!flags) {
> Reviewed-by: Kai Huang <[email protected]>
On 1/03/2024 11:52 am, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> Move the sanity check that hardware never sets bits that collide with KVM-
>>> define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
>>> i.e. make the sanity check #NPF specific. The legacy #PF path already
>>> WARNs if _any_ of bits 63:32 are set, and the error code that comes from
>>> VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
>>> EXIT_QUALIFICATION into error code flags).
>>>
>>> Add a compile-time assert in the legacy #PF handler to make sure that KVM-
>>> define flags are covered by its existing sanity check on the upper bits.
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 12 +++---------
>>> arch/x86/kvm/svm/svm.c | 9 +++++++++
>>> 2 files changed, 12 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 5d892bd59c97..bd342ebd0809 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -4561,6 +4561,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>>> if (WARN_ON_ONCE(error_code >> 32))
>>> error_code = lower_32_bits(error_code);
>>> + /* Ensure the above sanity check also covers KVM-defined flags. */
>>> + BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
>>> +
>>
>> Could you explain why adding this BUILD_BUG_ON() here, but not ...
>>
>>> vcpu->arch.l1tf_flush_l1d = true;
>>> if (!flags) {
>>> trace_kvm_page_fault(vcpu, fault_address, error_code);
>>> @@ -5845,15 +5848,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>>> int r, emulation_type = EMULTYPE_PF;
>>> bool direct = vcpu->arch.mmu->root_role.direct;
>>> - /*
>>> - * WARN if hardware generates a fault with an error code that collides
>>> - * with KVM-defined sythentic flags. Clear the flags and continue on,
>>> - * i.e. don't terminate the VM, as KVM can't possibly be relying on a
>>> - * flag that KVM doesn't know about.
>>> - */
>>> - if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
>>> - error_code &= ~PFERR_SYNTHETIC_MASK;
>>> -
>>> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
>>> return RET_PF_RETRY;
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index e90b429c84f1..199c4dd8d214 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -2055,6 +2055,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>>> u64 fault_address = svm->vmcb->control.exit_info_2;
>>> u64 error_code = svm->vmcb->control.exit_info_1;
>>> + /*
>>> + * WARN if hardware generates a fault with an error code that collides
>>> + * with KVM-defined sythentic flags. Clear the flags and continue on,
>>> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
>>> + * flag that KVM doesn't know about.
>>> + */
>>> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
>>> + error_code &= ~PFERR_SYNTHETIC_MASK;
>>> +
>>> trace_kvm_page_fault(vcpu, fault_address, error_code);
>>> return kvm_mmu_page_fault(vcpu, fault_address, error_code,
>>> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>>
>> ... in npf_interception() or
>
> The intent of the BUILD_BUG_ON() is to ensure that kvm_handle_page_fault()'s
> sanity check on bits 63:32 also serves as a sanity check that hardware doesn't
> generate an error code that collides with any of KVM's synthetic flags.
>
> E.g. if we were to add a KVM-defined flag in the lower 32 bits, then the #NPF
> path would Just Work, because it already sanity checks all synthetic bits. But
> the #PF path would need new code, thus the BUILD_BUG_ON() to scream that new code
> is needed.
Ah, right. Thanks for explaining :-)
Reviewed-by: Kai Huang <[email protected]>
On 1/03/2024 12:06 pm, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> WARN and skip the emulated MMIO fastpath if a private, reserved page fault
>>> is encountered, as private+reserved should be an impossible combination
>>> (KVM should never create an MMIO SPTE for a private access).
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index bd342ebd0809..9206cfa58feb 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>>> error_code |= PFERR_PRIVATE_ACCESS;
>>> r = RET_PF_INVALID;
>>> - if (unlikely(error_code & PFERR_RSVD_MASK)) {
>>> + if (unlikely((error_code & PFERR_RSVD_MASK) &&
>>> + !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
>>> r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
>>> if (r == RET_PF_EMULATE)
>>> goto emulate;
>>
>> It seems this will make KVM continue to call kvm_mmu_do_page_fault() when
>> such a private+reserved error code actually happens (e.g., due to a bug), because
>> @r is still RET_PF_INVALID in that case.
>
> Yep.
>
>> Is it better to just return an error, e.g., -EINVAL, and give up?
>
> As long as there is no obvious/immediate danger to the host, no obvious way for
> the "bad" behavior to cause data corruption for the guest, and continuing on has
> a plausible chance of working, then KVM should generally try to continue on and
> not terminate the VM.
Agreed. But I think sometimes it is hard to tell whether there's anything
dangerous waiting to happen, because that means we have to sanity check a
lot of code, and when new patches arrive we need to keep that in mind too,
which could be a nightmare in terms of maintenance.
>
> E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
> and treat the fault like a PRIVATE fault. Hmm, but page_fault_handle_page_track()
> would skip write tracking, which could theoretically cause data corruption, so I
> guess arguably it would be safer to bail?
>
> Anyone else have an opinion? This type of bug should never escape development,
> so I'm a-ok effectively killing the VM. Unless someone has a good argument for
> continuing on, I'll go with Kai's suggestion and squash this:
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cedacb1b89c5..d796a162b2da 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> error_code |= PFERR_PRIVATE_ACCESS;
>
> r = RET_PF_INVALID;
> - if (unlikely((error_code & PFERR_RSVD_MASK) &&
> - !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> + if (unlikely(error_code & PFERR_RSVD_MASK)) {
> + if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
> + return -EFAULT;
-EFAULT is part of guest_memfd() memory fault ABI. I didn't think over
this thoroughly but do you want to return -EFAULT here?
On 2/28/2024 10:41 AM, Sean Christopherson wrote:
> Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
> triggers emulation of any kind, as KVM doesn't currently support emulating
> access to guest private memory. Practically speaking, private faults and
> emulation are already mutually exclusive, but there are edge cases upon
> edge cases where KVM can return RET_PF_EMULATE, and adding one last check
> to harden against weird, unexpected combinations is inexpensive.
>
> Suggested-by: Yan Zhao <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 --------
> arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++++++++
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..e2fd74e06ff8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4309,14 +4309,6 @@ static inline u8 kvm_max_level_for_order(int order)
> return PG_LEVEL_4K;
> }
>
> -static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> - struct kvm_page_fault *fault)
> -{
> - kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> - PAGE_SIZE, fault->write, fault->exec,
> - fault->is_private);
> -}
> -
> static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0669a8a668ca..0eea6c5a824d 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -279,6 +279,14 @@ enum {
> RET_PF_SPURIOUS,
> };
>
> +static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> + PAGE_SIZE, fault->write, fault->exec,
> + fault->is_private);
> +}
> +
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> u32 err, bool prefetch, int *emulation_type)
> {
> @@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> else
> r = vcpu->arch.mmu->page_fault(vcpu, &fault);
Beg for some comment to explain the paranoia.
> + if (r == RET_PF_EMULATE && fault.is_private) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> + return -EFAULT;
> + }
> +
> if (fault.write_fault_to_shadow_pgtable && emulation_type)
> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>
On Fri, Mar 01, 2024, Kai Huang wrote:
> On 1/03/2024 12:06 pm, Sean Christopherson wrote:
> > E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
> > and treat the fault like a PRIVATE fault. Hmm, but page_fault_handle_page_track()
> > would skip write tracking, which could theoretically cause data corruption, so I
> > guess arguably it would be safer to bail?
> >
> > Anyone else have an opinion? This type of bug should never escape development,
> > so I'm a-ok effectively killing the VM. Unless someone has a good argument for
> > continuing on, I'll go with Kai's suggestion and squash this:
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index cedacb1b89c5..d796a162b2da 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > error_code |= PFERR_PRIVATE_ACCESS;
> > r = RET_PF_INVALID;
> > - if (unlikely((error_code & PFERR_RSVD_MASK) &&
> > - !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> > + if (unlikely(error_code & PFERR_RSVD_MASK)) {
> > + if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
> > + return -EFAULT;
>
> -EFAULT is part of guest_memfd() memory fault ABI. I didn't think over this
> thoroughly but do you want to return -EFAULT here?
Yes, I/we do. There are many existing paths that can return -EFAULT from KVM_RUN
without setting run->exit_reason to KVM_EXIT_MEMORY_FAULT. Userspace is responsible
for checking run->exit_reason on -EFAULT (and -EHWPOISON), i.e. must be prepared
to handle a "bare" -EFAULT, where for all intents and purposes "handle" means
"terminate the guest".
That's actually one of the reasons why KVM_EXIT_MEMORY_FAULT exists, it'd require
an absurd amount of work and churn in KVM to *safely* return useful information
on *all* -EFAULTs. FWIW, I had hopes and dreams of actually doing exactly this,
but have long since abandoned those dreams.
In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
(a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
KVM is in a state where fixing the underlying problem and resuming the guest is
safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
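For what it's worth, a minimal userspace-side sketch of that contract (assuming
uapi headers new enough to define KVM_EXIT_MEMORY_FAULT; the helper name is
arbitrary):

	#include <errno.h>
	#include <stdbool.h>
	#include <linux/kvm.h>

	/*
	 * A -EFAULT/-EHWPOISON from KVM_RUN is only actionable if KVM also
	 * filled in run->memory_fault (flags/gpa/size) and set the exit
	 * reason to KVM_EXIT_MEMORY_FAULT; a "bare" error means there is
	 * nothing to fix and the guest should be terminated.
	 */
	static bool kvm_run_error_is_fixable(int err, const struct kvm_run *run)
	{
		if (err != -EFAULT && err != -EHWPOISON)
			return false;

		return run->exit_reason == KVM_EXIT_MEMORY_FAULT;
	}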
On 2/28/2024 10:41 AM, Sean Christopherson wrote:
> From: Isaku Yamahata <[email protected]>
>
> Plumb the full 64-bit error code throughout the page fault handling code
> so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
> will be used to determine whether or not a fault is private vs. shared.
>
> Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
> the behavior of permission_fault() when invoked in the page fault path, as
> KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
>
> Continue passing '0' from the async #PF worker, as guest_memfd() and thus
> private memory doesn't support async page faults.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> [mdr: drop references/changes on rebase, update commit message]
> Signed-off-by: Michael Roth <[email protected]>
> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 3 +--
> arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
> arch/x86/kvm/mmu/mmutrace.h | 2 +-
> 3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2fd74e06ff8..408969ac1291 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5860,8 +5860,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> }
>
> if (r == RET_PF_INVALID) {
> - r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> - lower_32_bits(error_code), false,
> + r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> &emulation_type);
> if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0eea6c5a824d..1fab1f2359b5 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
> struct kvm_page_fault {
> /* arguments to kvm_mmu_do_page_fault. */
> const gpa_t addr;
> - const u32 error_code;
> + const u64 error_code;
> const bool prefetch;
>
> /* Derived from error_code. */
> @@ -288,7 +288,7 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> }
>
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - u32 err, bool prefetch, int *emulation_type)
> + u64 err, bool prefetch, int *emulation_type)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..195d98bc8de8 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -260,7 +260,7 @@ TRACE_EVENT(
> TP_STRUCT__entry(
> __field(int, vcpu_id)
> __field(gpa_t, cr2_or_gpa)
> - __field(u32, error_code)
> + __field(u64, error_code)
> __field(u64 *, sptep)
> __field(u64, old_spte)
> __field(u64, new_spte)
On 5/03/2024 4:51 am, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>> On 1/03/2024 12:06 pm, Sean Christopherson wrote:
>>> E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
>>> and treat the fault like a PRIVATE fault. Hmm, but page_fault_handle_page_track()
>>> would skip write tracking, which could theoretically cause data corruption, so I
>>> guess arguably it would be safer to bail?
>>>
>>> Anyone else have an opinion? This type of bug should never escape development,
>>> so I'm a-ok effectively killing the VM. Unless someone has a good argument for
>>> continuing on, I'll go with Kai's suggestion and squash this:
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index cedacb1b89c5..d796a162b2da 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>>> error_code |= PFERR_PRIVATE_ACCESS;
>>> r = RET_PF_INVALID;
>>> - if (unlikely((error_code & PFERR_RSVD_MASK) &&
>>> - !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
>>> + if (unlikely(error_code & PFERR_RSVD_MASK)) {
>>> + if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
>>> + return -EFAULT;
>>
>> -EFAULT is part of guest_memfd() memory fault ABI. I didn't think over this
>> thoroughly but do you want to return -EFAULT here?
>
> Yes, I/we do. There are many existing paths that can return -EFAULT from KVM_RUN
> without setting run->exit_reason to KVM_EXIT_MEMORY_FAULT. Userspace is responsible
> for checking run->exit_reason on -EFAULT (and -EHWPOISON), i.e. must be prepared
> to handle a "bare" -EFAULT, where for all intents and purposes "handle" means
> "terminate the guest".
Right.
>
> That's actually one of the reasons why KVM_EXIT_MEMORY_FAULT exists, it'd require
> an absurd amount of work and churn in KVM to *safely* return useful information
> on *all* -EFAULTs. FWIW, I had hopes and dreams of actually doing exactly this,
> but have long since abandoned those dreams.
I am not sure whether we need to do that. Perhaps it only feels that way
because we changed to using -EFAULT to carry KVM_EXIT_MEMORY_FAULT. :-)
>
> In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
> (a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
> KVM is in a state where fixing the underlying problem and resuming the guest is
> safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
>
Sure. One small issue might be that, in a later code check, we actually
return KVM_EXIT_MEMORY_FAULT when a private fault hits RET_PF_EMULATE --
see your patch:
[PATCH 01/16] KVM: x86/mmu: Exit to userspace with -EFAULT if private
fault hits emulation
So here if we just return -EFAULT w/o reporting KVM_EXIT_MEMORY_FAULT
when private+reserved is hit, it seems there's a little bit of
inconsistency here.
But you may have concerns about corrupting the guest here, as you mentioned above.
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Prioritize private vs. shared gfn attribute checks above slot validity
> checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> userspace if there is no memslot, but emulate accesses to the APIC access
> page even if the attributes mismatch.
IMHO, it would be helpful to explicitly say that, in the latter case
(emulate APIC access page) we still want to report a MEMORY_FAULT error
first (so that userspace can have a chance to fix it up, IIUC) instead of
emulating directly, which is unlikely to work.
Reviewed-by: Kai Huang <[email protected]>
>
> Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> Cc: Yu Zhang <[email protected]>
> Cc: Chao Peng <[email protected]>
> Cc: Fuad Tabba <[email protected]>
> Cc: Michael Roth <[email protected]>
> Cc: Isaku Yamahata <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9206cfa58feb..58c5ae8be66c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> - if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> - kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> - return -EFAULT;
> - }
> -
> if (fault->is_private)
> return kvm_faultin_pfn_private(vcpu, fault);
>
> @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> smp_rmb();
>
> + /*
> + * Check for a private vs. shared mismatch *after* taking a snapshot of
> + * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> + * invalidation notifier.
> + */
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> + return -EFAULT;
> + }
> +
> /*
> * Check for a relevant mmu_notifier invalidation event before getting
> * the pfn from the primary MMU, and before acquiring mmu_lock.
On Wed, Mar 06, 2024, Kai Huang wrote:
> On 5/03/2024 4:51 am, Sean Christopherson wrote:
> > In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
> > (a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
> > KVM is in a state where fixing the underlying problem and resuming the guest is
> > safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
> >
>
> Sure. One small issue might be that, in a later code check, we actually
> return KVM_EXIT_MEMORY_FAULT when a private fault hits RET_PF_EMULATE -- see
> your patch:
>
> [PATCH 01/16] KVM: x86/mmu: Exit to userspace with -EFAULT if private fault
> hits emulation
>
> So here if we just return -EFAULT w/o reporting KVM_EXIT_MEMORY_FAULT when
> private+reserved is hit, it seems there's a little bit of inconsistency here.
It's intentionally inconsistent. -EFAULT without KVM_EXIT_MEMORY_FAULT is
essentially KVM saying "something bad happened, and it can't be fixed", whereas
exiting with KVM_EXIT_MEMORY_FAULT says "there's an issue, but you may be able
to resolve it".
The ABI is a bit messy, e.g. in some ways it would be cleaner if KVM returned '0'.
But doing that in a backwards compatible way would have required a rather ugly
opt-in, and it would also make it more tedious to extend KVM_EXIT_MEMORY_FAULT,
e.g. pairing it with -EHWPOISON didn't require any new flags.
On Wed, Mar 06, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > Prioritize private vs. shared gfn attribute checks above slot validity
> > checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> > userspace if there is no memslot, but emulate accesses to the APIC access
> > page even if the attributes mismatch.
>
> IMHO, it would be helpful to explicitly say that, in the latter case (emulate
> APIC access page) we still want to report a MEMORY_FAULT error first (so that
> userspace can have a chance to fix it up, IIUC) instead of emulating directly,
> which is unlikely to work.
Hmm, it's not so much that emulating directly won't work, it's that KVM would be
violating its ABI. Emulating APIC accesses after userspace converted the APIC
gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
is shared-only.
FWIW, I doubt there's a legitimate use case for converting the APIC gfn to private,
this is purely to ensure KVM has simple, consistent rules for how private vs.
shared access work.
On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>> checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>> page even if the attributes mismatch.
>>
>> IMHO, it would be helpful to explicitly say that, in the latter case (emulate
>> APIC access page) we still want to report a MEMORY_FAULT error first (so that
>> userspace can have a chance to fix it up, IIUC) instead of emulating directly,
>> which is unlikely to work.
>
> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> violating its ABI. Emulating APIC accesses after userspace converted the APIC
> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> is shared-only.
But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
mapped as private, right? The guest is supposed to get a #VE anyway?
Perhaps I am missing something -- I apologize if this has already been
discussed.
>
> FWIW, I doubt there's a legitimate use case for converting the APIC gfn to private,
> this is purely to ensure KVM has simple, consistent rules for how private vs.
> shared access work.
Again I _think_ for TDX APIC gfn can be private? IIUC virtualizing APIC
is done by the TDX module, which injects #VE to guest when emulation is
required.
On Wed, Mar 06, 2024, Kai Huang wrote:
>
>
> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> > On Wed, Mar 06, 2024, Kai Huang wrote:
> > >
> > >
> > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > > checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> > > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > > page even if the attributes mismatch.
> > >
> > > IMHO, it would be helpful to explicitly say that, in the latter case (emulate
> > > APIC access page) we still want to report a MEMORY_FAULT error first (so that
> > > userspace can have a chance to fix it up, IIUC) instead of emulating directly,
> > > which is unlikely to work.
> >
> > Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> > violating its ABI. Emulating APIC accesses after userspace converted the APIC
> > gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> > is shared-only.
>
> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
> mapped as private, right? The guest is supposed to get a #VE anyway?
Not really. KVM can't _map_ emulated MMIO as private memory, because S-EPT
entries can only point at convertible memory. KVM _could_ emulate in response
to a !PRESENT EPT violation, but KVM is not going to do that.
https://lore.kernel.org/all/[email protected]
> Perhaps I am missing something -- I apologize if this has already been
> discussed.
>
> >
> > FWIW, I doubt there's a legitimate use case for converting the APIC gfn to private,
> > this is purely to ensure KVM has simple, consistent rules for how private vs.
> > shared access work.
>
> Again I _think_ for TDX APIC gfn can be private? IIUC virtualizing APIC is
> done by the TDX module, which injects #VE to guest when emulation is
> required.
It's a moot point for TDX, as x2APIC is mandatory.
On Tue, Feb 27, 2024 at 06:41:36PM -0800, Sean Christopherson wrote:
> Add and use a synthetic, KVM-defined page fault error code to indicate
> whether a fault is to private vs. shared memory. TDX and SNP have
> different mechanisms for reporting private vs. shared, and KVM's
> software-protected VMs have no mechanism at all. Usurp an error code
> flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> and friends.
>
> Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> for TDX and software-protected VMs as appropriate, but that would require
> *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> memory at the hardware layer, but don't utilize private memory at the
> KVM layer.
I saw this alternative in another patchset, and after reading patches #5-7 I
still don't understand why the synthetic approach is better. I assume for
SEV(-ES), if a spurious PFERR_GUEST_ENC flag is set in the error code when
private memory is not used in KVM, then it is a HW issue or SW bug that needs
to be caught and warned about, and cleared along the way.
Thanks,
Yilun
>
> Opportunistically add a comment to call out that the logic for software-
> protected VMs is (and was before this commit) broken for nested MMUs, i.e.
> for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with
> nested MMUs as there is a _lot_ of functionality that simply doesn't work
> for software-protected VMs, e.g. all of the paths where KVM accesses guest
> memory need to be updated to be aware of private vs. shared memory.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 11 +++++++++++
> arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> 3 files changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1e69743ef0fb..4077c46c61ab 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -267,7 +267,18 @@ enum x86_intercept_stage;
> #define PFERR_GUEST_ENC_MASK BIT_ULL(34)
> #define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
> #define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
> +
> +/*
> + * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP checks
> + * when emulating instructions that triggers implicit access.
> + */
> #define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
> +/*
> + * PRIVATE_ACCESS is a KVM-defined flag used to indicate that a fault occurred
> + * when the guest was accessing private memory.
> + */
> +#define PFERR_PRIVATE_ACCESS BIT_ULL(49)
> +#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
>
> #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
> PFERR_WRITE_MASK | \
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 408969ac1291..7807bdcd87e8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5839,19 +5839,31 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> /*
> - * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
> - * checks when emulating instructions that triggers implicit access.
> * WARN if hardware generates a fault with an error code that collides
> - * with the KVM-defined value. Clear the flag and continue on, i.e.
> - * don't terminate the VM, as KVM can't possibly be relying on a flag
> - * that KVM doesn't know about.
> + * with KVM-defined synthetic flags. Clear the flags and continue on,
> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> + * flag that KVM doesn't know about.
> */
> - if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
> - error_code &= ~PFERR_IMPLICIT_ACCESS;
> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> + error_code &= ~PFERR_SYNTHETIC_MASK;
>
> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> + /*
> + * Except for reserved faults (emulated MMIO is shared-only), set the
> + * private flag for software-protected VMs based on the gfn's current
> + * attributes, which are the source of truth for such VMs. Note, this
> + * wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
> + * currently supported nested virtualization (among many other things)
> + * for software-protected VMs.
> + */
> + if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> + !(error_code & PFERR_RSVD_MASK) &&
> + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + error_code |= PFERR_PRIVATE_ACCESS;
> +
> r = RET_PF_INVALID;
> if (unlikely(error_code & PFERR_RSVD_MASK)) {
> r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1fab1f2359b5..d7c10d338f14 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -306,7 +306,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> + .is_private = err & PFERR_PRIVATE_ACCESS,
> };
> int r;
>
> --
> 2.44.0.278.ge034bb2e1d-goog
>
>
On Wed, Mar 06, 2024, Xu Yilun wrote:
> On Tue, Feb 27, 2024 at 06:41:36PM -0800, Sean Christopherson wrote:
> > Add and use a synthetic, KVM-defined page fault error code to indicate
> > whether a fault is to private vs. shared memory. TDX and SNP have
> > different mechanisms for reporting private vs. shared, and KVM's
> > software-protected VMs have no mechanism at all. Usurp an error code
> > flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> > and friends.
> >
> > Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> > for TDX and software-protected VMs as appropriate, but that would require
> > *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> > memory at the hardware layer, but don't utilize private memory at the
> > KVM layer.
>
> I see this alternative in other patchset. And I still don't understand why
> synthetic way is better after reading patch #5-7. I assume for SEV(-ES) if
> there is spurious PFERR_GUEST_ENC flag set in error code when private memory
> is not used in KVM, then it is a HW issue or SW bug that needs to be caught
> and warned, and by the way cleared.
The conundrum is that SEV(-ES) support _encrypted_ memory, but cannot support
what KVM calls "private" memory. In quotes because SEV(-ES) memory encryption
does provide confidentiality/privacy, but in KVM they don't support memslots that
can be switched between private and shared, e.g. will return false for
kvm_arch_has_private_mem().
And KVM _can't_ sanely use private/shared memslots for SEV(-ES), because it's
impossible to intercept implicit conversions by the guest, i.e. KVM can't prevent
the guest from encrypting a page that KVM thinks is private, and vice versa.
But from hardware's perspective, while the APM is a bit misleading and I don't
love the behavior, I can't really argue that it's a hardware bug if the CPU sets
PFERR_GUEST_ENC on a fault where the guest access had C-bit=1, because the access
_was_ indeed to encrypted memory.
Which is a long-winded way of saying the unwanted PFERR_GUEST_ENC faults aren't
really spurious, nor are they hardware or software bugs, just another unfortunate
side effect of the fact that the hypervisor can't intercept shared<->encrypted
conversions for SEV(-ES) guests. And that means that KVM can't WARN, because
literally every SNP-capable CPU would yell when running SEV(-ES) guests.
I also don't like the idea of usurping PFERR_GUEST_ENC to have it mean something
different in KVM as compared to how hardware defines/uses the flag.
Lastly, the approach in Paolo's series to propagate PFERR_GUEST_ENC to .is_private
iff the VM has private memory is brittle, in that it would be too easy for KVM
code that has access to the error code to do the wrong thing and interpret
PFERR_GUEST_ENC as meaning "private".
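For illustration only, here is a rough sketch of how a vendor #NPF handler
could derive the synthetic flag from the hardware-reported encrypted bit, so
that only VMs which actually have KVM-managed private memory ever see
.is_private = true; the helper name and its exact placement are made up and
differ from this series:

/*
 * Hypothetical sketch: sanitize the hardware error code and derive the
 * KVM-defined PFERR_PRIVATE_ACCESS from PFERR_GUEST_ENC_MASK, but only
 * for VMs that support private memory.  SEV/SEV-ES faults (encrypted,
 * but not "private" in KVM's model) are left untouched.
 */
static u64 kvm_sanitize_npf_error_code(struct kvm *kvm, u64 error_code)
{
	/* Hardware should never generate KVM's synthetic bits. */
	if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
		error_code &= ~PFERR_SYNTHETIC_MASK;

	if (kvm_arch_has_private_mem(kvm) &&
	    (error_code & PFERR_GUEST_ENC_MASK))
		error_code |= PFERR_PRIVATE_ACCESS;

	return error_code;
}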
On 6/03/2024 3:02 pm, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Kai Huang wrote:
>>
>>
>> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>>>> checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
>>>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>>>> page even if the attributes mismatch.
>>>>
>>>> IMHO, it would be helpful to explicitly say that, in the later case (emulate
>>>> APIC access page) we still want to report MEMORY_FAULT error first (so that
>>>> userspace can have chance to fixup, IIUC) instead of emulating directly,
>>>> which will unlikely work.
>>>
>>> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
>>> violating its ABI. Emulating APIC accesses after userspace converted the APIC
>>> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
>>> is shared-only.
>>
>> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
>> mapped as private, right? The guest is supposed to get a #VE anyway?
>
> Not really. KVM can't _map_ emulated MMIO as private memory, because S-EPT
> entries can only point at convertible memory.
Right. I was talking about the MMIO mapping in the guest, which can be
private I suppose.
KVM _could_ emulate in response
> to a !PRESENT EPT violation, but KVM is not going to do that.
>
> https://lore.kernel.org/all/[email protected]
>
Right, agreed KVM shouldn't "emulate" a !PRESENT fault.
I am talking about this -- for a TDX guest, if I recall correctly, for the
guest's MMIO gfn KVM still needs to take the EPT violation on the _first_
access, in which KVM can configure the EPT entry to make sure the "suppress
#VE" bit is cleared so that later accesses can trigger #VE directly.
I suppose this is still the way we want to implement this?
I am afraid for this case, we will still see the MMIO GFN as private,
while by default I believe the "guest memory attributes" for that MMIO
GFN should be shared? AFAICT, it seems the "guest memory attributes"
code doesn't check whether a GFN is MMIO or truly RAM.
So I agree making KVM only "emulate" shared MMIO makes sense, and we
need this patch to cover the APIC access page case.
Anyway:
Reviewed-by: Kai Huang <[email protected]>
>> Perhaps I am missing something -- I apologize if this has already been
>> discussed.
>>
>>>
>>> FWIW, I doubt there's a legitimate use case for converting the APIC gfn to private,
>>> this is purely to ensure KVM has simple, consistent rules for how private vs.
>>> shared access work.
>>
>> Again I _think_ for TDX APIC gfn can be private? IIUC virtualizing APIC is
>> done by the TDX module, which injects #VE to guest when emulation is
>> required.
>
> It's a moot point for TDX, as x2APIC is mandatory.
Right. My bad for bringing it up.
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Explicitly detect and disallow private accesses to emulated MMIO in
> kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
> to perform the check. This will allow the page fault path to go straight
> to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5c8caab64ba2..ebdb3fcce3dc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> {
> gva_t gva = fault->is_tdp ? 0 : fault->addr;
>
> + if (fault->is_private) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> + return -EFAULT;
> + }
> +
As mentioned in another reply in this series, unless I am mistaken, for
TDX guest the _first_ MMIO access would still cause EPT violation with
MMIO GFN being private.
Returning to userspace cannot really help here because the MMIO mapping
is inside the guest.
I am hoping I am missing something here?
On Thu, Mar 07, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > Explicitly detect and disallow private accesses to emulated MMIO in
> > kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
> > to perform the check. This will allow the page fault path to go straight
> > to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5c8caab64ba2..ebdb3fcce3dc 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > {
> > gva_t gva = fault->is_tdp ? 0 : fault->addr;
> > + if (fault->is_private) {
> > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > + return -EFAULT;
> > + }
> > +
>
> As mentioned in another reply in this series, unless I am mistaken, for TDX
> guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
> being private.
>
> Returning to userspace cannot really help here because the MMIO mapping is
> inside the guest.
That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
enable paging and setup MMIO regions as shared, or go straight to TDCALL.
>
> I am hoping I am missing something here?
On 7/03/2024 11:43 am, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> Explicitly detect and disallow private accesses to emulated MMIO in
>>> kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
>>> to perform the check. This will allow the page fault path to go straight
>>> to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 5c8caab64ba2..ebdb3fcce3dc 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>>> {
>>> gva_t gva = fault->is_tdp ? 0 : fault->addr;
>>> + if (fault->is_private) {
>>> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>>> + return -EFAULT;
>>> + }
>>> +
>>
>> As mentioned in another reply in this series, unless I am mistaken, for TDX
>> guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
>> being private.
>>
>> Returning to userspace cannot really help here because the MMIO mapping is
>> inside the guest.
>
> That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
> emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
> enable paging and setup MMIO regions as shared, or go straight to TDCALL.
+Kirill,
I kinda forgot the details, but what I am afraid of is that there might be a
bunch of existing TDX guests (since the TDX guest code is upstreamed) using
unmodified drivers, which I suppose don't map MMIO regions as shared.
Kirill,
Could you clarify whether the TDX guest code has mapped MMIO regions as shared
since the beginning?
On Thu, Mar 07, 2024, Kai Huang wrote:
>
>
> On 7/03/2024 11:43 am, Sean Christopherson wrote:
> > On Thu, Mar 07, 2024, Kai Huang wrote:
> > >
> > >
> > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > Explicitly detect and disallow private accesses to emulated MMIO in
> > > > kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
> > > > to perform the check. This will allow the page fault path to go straight
> > > > to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
> > > >
> > > > Signed-off-by: Sean Christopherson <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 5c8caab64ba2..ebdb3fcce3dc 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > > > {
> > > > gva_t gva = fault->is_tdp ? 0 : fault->addr;
> > > > + if (fault->is_private) {
> > > > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > + return -EFAULT;
> > > > + }
> > > > +
> > >
> > > As mentioned in another reply in this series, unless I am mistaken, for TDX
> > > guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
> > > being private.
> > >
> > > Returning to userspace cannot really help here because the MMIO mapping is
> > > inside the guest.
> >
> > That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
> > emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
> > enable paging and setup MMIO regions as shared, or go straight to TDCALL.
>
> +Kirill,
>
> I kinda forgot the detail, but what I am afraid is there might be bunch of
> existing TDX guests (since TDX guest code is upstream-ed) using unmodified
> drivers, which doesn't map MMIO regions as shared I suppose.
>
> Kirill,
>
> Could you clarify whether TDX guest code maps MMIO regions as shared since
> beginning?
Y'all get the same answer we gave the SNP folks: KVM does not yet support TDX,
so as far as KVM is concerned, there is no existing functionality to support.
s/firmware/Linux if this is a Linux kernel problem.
On Thu, Feb 08, 2024, Paolo Bonzini wrote:
> On Thu, Feb 8, 2024 at 6:27 PM Sean Christopherson <[email protected]> wrote:
> > No. KVM does not yet support SNP, so as far as KVM's ABI goes, there are no
> > existing guests. Yes, I realize that I am burying my head in the sand to some
> > extent, but it is simply not sustainable for KVM to keep trying to pick up the
> > pieces of poorly defined hardware specs and broken guest firmware.
>
> 101% agreed. There are cases in which we have to and should bend
> together backwards for guests (e.g. older Linux kernels), but not for
> code that---according to current practices---is chosen by the host
> admin.
>
> (I am of the opinion that "bring your own firmware" is the only sane
> way to handle attestation/measurement, but that's not how things are
> done currently).
On 7/03/2024 12:01 pm, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Kai Huang wrote:
>>
>>
>> On 7/03/2024 11:43 am, Sean Christopherson wrote:
>>> On Thu, Mar 07, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>> Explicitly detect and disallow private accesses to emulated MMIO in
>>>>> kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
>>>>> to perform the check. This will allow the page fault path to go straight
>>>>> to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
>>>>>
>>>>> Signed-off-by: Sean Christopherson <[email protected]>
>>>>> ---
>>>>> arch/x86/kvm/mmu/mmu.c | 5 +++++
>>>>> 1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>>> index 5c8caab64ba2..ebdb3fcce3dc 100644
>>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>>> @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>>>>> {
>>>>> gva_t gva = fault->is_tdp ? 0 : fault->addr;
>>>>> + if (fault->is_private) {
>>>>> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>>>>> + return -EFAULT;
>>>>> + }
>>>>> +
>>>>
>>>> As mentioned in another reply in this series, unless I am mistaken, for TDX
>>>> guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
>>>> being private.
>>>>
>>>> Returning to userspace cannot really help here because the MMIO mapping is
>>>> inside the guest.
>>>
>>> That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
>>> emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
>>> enable paging and setup MMIO regions as shared, or go straight to TDCALL.
>>
>> +Kirill,
>>
>> I kinda forgot the detail, but what I am afraid is there might be bunch of
>> existing TDX guests (since TDX guest code is upstream-ed) using unmodified
>> drivers, which doesn't map MMIO regions as shared I suppose.
>>
>> Kirill,
>>
>> Could you clarify whether TDX guest code maps MMIO regions as shared since
>> beginning?
>
> Y'all get the same answer we gave the SNP folks: KVM does not yet support TDX,
> so as far as KVM is concerned, there is no existing functionality to support.
>
> s/firmware/Linux if this is a Linux kernel problem.
>
> On Thu, Feb 08, 2024, Paolo Bonzini wrote:
> > On Thu, Feb 8, 2024 at 6:27 PM Sean Christopherson <[email protected]> wrote:
> > > No. KVM does not yet support SNP, so as far as KVM's ABI goes, there are no
> > > existing guests. Yes, I realize that I am burying my head in the sand to some
> > > extent, but it is simply not sustainable for KVM to keep trying to pick up the
> > > pieces of poorly defined hardware specs and broken guest firmware.
> >
> > 101% agreed. There are cases in which we have to and should bend
> > together backwards for guests (e.g. older Linux kernels), but not for
> > code that---according to current practices---is chosen by the host
> > admin.
> >
> > (I am of the opinion that "bring your own firmware" is the only sane
> > way to handle attestation/measurement, but that's not how things are
> > done currently).
Fair enough, and good to know. :-)
(Still better to hear from Kirill, though.)
On Thu, Mar 07, 2024, Kai Huang wrote:
>
>
> On 6/03/2024 3:02 pm, Sean Christopherson wrote:
> > On Wed, Mar 06, 2024, Kai Huang wrote:
> > >
> > >
> > > On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> > > > On Wed, Mar 06, 2024, Kai Huang wrote:
> > > > >
> > > > >
> > > > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > > > > checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> > > > > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > > > > page even if the attributes mismatch.
> > > > >
> > > > > IMHO, it would be helpful to explicitly say that, in the later case (emulate
> > > > > APIC access page) we still want to report MEMORY_FAULT error first (so that
> > > > > userspace can have chance to fixup, IIUC) instead of emulating directly,
> > > > > which will unlikely work.
> > > >
> > > > Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> > > > violating its ABI. Emulating APIC accesses after userspace converted the APIC
> > > > gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> > > > is shared-only.
> > >
> > > But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
> > > mapped as private, right? The guest is supposed to get a #VE anyway?
> >
> > Not really. KVM can't _map_ emulated MMIO as private memory, because S-EPT
> > entries can only point at convertible memory.
>
> Right. I was talking about the MMIO mapping in the guest, which can be
> private I suppose.
>
> > KVM _could_ emulate in response to a !PRESENT EPT violation, but KVM is not
> > going to do that.
> >
> > https://lore.kernel.org/all/[email protected]
> >
>
> Right agreed KVM shouldn't "emulate" !PRESENT fault.
One clarification. KVM *does* emulate !PRESENT faults. And that's not optional,
as caching MMIO info in SPTEs is purely an optimization and isn't possible on all
CPUs, e.g. AMD CPUs with MAXPHYADDR=52 don't have reserved bits to set.
My point above was specifically about emulating *private* !PRESENT faults as MMIO.
> I am talking about this -- for TDX guest, if I recall correctly, for guest's
> MMIO gfn KVM still needs to get the EPT violation for the _first_ access, in
> which KVM can configure the EPT entry to make sure "suppress #VE" bit is
> cleared so the later accesses can trigger #VE directly.
That's totally fine, so long as the access is shared, not private.
> I suppose this is still the way we want to implement?
>
> I am afraid for this case, we will still see the MMIO GFN as private, while
> by default I believe the "guest memory attributes" for that MMIO GFN should
> be shared?
No, the guest should know it's (emulated) MMIO and access the gfn as shared. That
might generate a !PRESENT fault, but that's again a-ok.
> AFAICT, it seems the "guest memory attributes" code doesn't check whether a
> GFN is MMIO or truly RAM.
That's userspace's responsibility, not KVM's responsibility. And if userspace
doesn't proactively make emulated MMIO regions shared, then the memory_fault exit
that results from this patch will give userspace the hook it needs to convert the
gfn to shared on-demand.
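To make the on-demand flow concrete, a minimal userspace sketch of reacting
to KVM_EXIT_MEMORY_FAULT by clearing the PRIVATE attribute for the faulting
range; the helper and its policy are hypothetical, error handling is elided,
and a kernel with KVM_SET_MEMORY_ATTRIBUTES support is assumed:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical VMM helper: convert the faulting range to shared, then re-enter. */
static int handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address    = run->memory_fault.gpa,
		.size       = run->memory_fault.size,
		.attributes = 0,	/* i.e. clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	/* Only bother converting if the guest accessed the range as private. */
	if (!(run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE))
		return -1;

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}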
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Allow mapping KVM's internal memslots used for EPT without unrestricted
> guest into L2, i.e. allow mapping the hidden TSS and the identity mapped
> page tables into L2. Unlike the APIC access page, there is no correctness
> issue with letting L2 access the "hidden" memory. Allowing these memslots
> to be mapped into L2 fixes a largely theoretical bug where KVM could
> incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures
> consistent KVM behavior for L2.
>
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing
> through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
> and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because
> kvm_mmu_page_role.guest_mode ensure KVM uses different roots for L1 vs.
> L2. But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to
> incorrectly treat an L1 access to the hidden TSS or identity mapped page
> tables as MMIO.
>
> Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't
> actually prevent exposing KVM's internal memslots to L2, it simply forces
> KVM to emulate the access. In most cases, that will trigger MMIO,
> amusingly due to filling vcpu->arch.mmio_gfn, but also because
> vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC
> accesses are ok. But the hidden TSS and identity mapped page tables could
> go either way (MMIO or access the private memslot's backing memory).
>
> Alternatively, the inconsistent emulator behavior could be addressed by
> forcing MMIO emulation for L2 access to all internal memslots, not just to
> the APIC. But that's arguably less correct than letting L2 access the
> hidden TSS and identity mapped page tables, not to mention that it's
> *extremely* unlikely anyone cares what KVM does in this case. From L1's
> perspective there is R/W memory at those memslots, the memory just happens
> to be initialized with non-zero data. Making the memory disappear when it
> is accessed by L2 is far more magical and arbitrary than the memory
> existing in the first place.
>
> The APIC access page is special because KVM _must_ emulate the access to
> do the right thing (emulate an APIC access instead of reading/writing the
> APIC access page). And despite what commit 3a2936dedd20 ("kvm: mmu: Don't
> expose private memslots to L2") said, it's not just necessary when L1 is
> accelerating L2's virtual APIC, it's just as important (likely *more*
> important for correctness when L1 is passing through its own APIC to L2).
Reviewed-by: Kai Huang <[email protected]>
> Ditto for accesses to KVM internal memslots from L2, which
> KVM also treats as emulated MMIO.
Nit:
This is not accurate anymore due to your previous patch ("KVM: x86/mmu:
Don't force emulation of L2 accesses to non-APIC internal slots").
>
> More importantly, this will allow for future cleanup by having the
> "no memslot" case bail from kvm_faultin_pfn() very early on.
>
> Go to rather extreme and gross lengths to make the change a glorified
> nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the
> related code is very subtle. E.g. fault->slot can be nullified if it
> points at the APIC access page, some flows in KVM x86 expect fault->pfn
> to be KVM_PFN_NOSLOT, while others check only fault->slot, etc.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Feel free to add:
Reviewed-by: Kai Huang <[email protected]>
On 7/03/2024 12:49 pm, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Kai Huang wrote:
>>
>>
>> On 6/03/2024 3:02 pm, Sean Christopherson wrote:
>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
>>>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>>>
>>>>>>
>>>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>>>>>> checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
>>>>>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>>>>>> page even if the attributes mismatch.
>>>>>>
>>>>>> IMHO, it would be helpful to explicitly say that, in the later case (emulate
>>>>>> APIC access page) we still want to report MEMORY_FAULT error first (so that
>>>>>> userspace can have chance to fixup, IIUC) instead of emulating directly,
>>>>>> which will unlikely work.
>>>>>
>>>>> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
>>>>> violating its ABI. Emulating APIC accesses after userspace converted the APIC
>>>>> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
>>>>> is shared-only.
>>>>
>>>> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
>>>> mapped as private, right? The guest is supposed to get a #VE anyway?
>>>
>>> Not really. KVM can't _map_ emulated MMIO as private memory, because S-EPT
>>> entries can only point at convertible memory.
>>
>> Right. I was talking about the MMIO mapping in the guest, which can be
>> private I suppose.
>>
>>> KVM _could_ emulate in response to a !PRESENT EPT violation, but KVM is not
>>> going to do that.
>>>
>>> https://lore.kernel.org/all/[email protected]
>>>
>>
>> Right agreed KVM shouldn't "emulate" !PRESENT fault.
>
> One clarification. KVM *does* emulate !PRESENT faults. And that's not optional,
> as caching MMIO info in SPTEs is purely an optimization and isn't possible on all
> CPUs, e.g. AMD CPUs with MAXPHYADDR=52 don't have reserved bits to set.
Sorry I forgot to add "private".
>
> My point above was specifically about emulating *private* !PRESENT faults as MMIO.
>
>> I am talking about this -- for TDX guest, if I recall correctly, for guest's
>> MMIO gfn KVM still needs to get the EPT violation for the _first_ access, in
>> which KVM can configure the EPT entry to make sure "suppress #VE" bit is
>> cleared so the later accesses can trigger #VE directly.
>
> That's totally fine, so long as the access is shared, not private.
OK as you already replied in the later patch.
>
>> I suppose this is still the way we want to implement?
>>
>> I am afraid for this case, we will still see the MMIO GFN as private, while
>> by default I believe the "guest memory attributes" for that MMIO GFN should
>> be shared?
>
> No, the guest should know it's (emulated) MMIO and access the gfn as shared. That
> might generate a !PRESENT fault, but that's again a-ok.
Ditto.
>
>> AFAICT, it seems the "guest memory attributes" code doesn't check whether a
>> GFN is MMIO or truly RAM.
>
> That's userspace's responsibility, not KVM's responsibility. And if userspace
> doesn't proactively make emulated MMIO regions shared, then the memory_fault exit
> that results from this patch will give userspace the hook it needs to convert the
> gfn to shared on-demand.
I mean, would it be better to just make kvm_mem_is_private() check whether
the given GFN has a slot, and always return "shared" if it doesn't?
In kvm_vm_set_mem_attributes() we also ignore NULL-slot GFNs.
(The APIC access page is a special case that would need handling.)
Allowing userspace to track whether an MMIO GFN is shared or private doesn't
make a lot of sense, because it doesn't help much: the MMIO mapping is
actually controlled by the guest kernel itself.
The (buggy) guest may still generate private !PRESENT faults, and KVM can
still return to userspace with MEMORY_FAULT, but userspace can just kill the
VM if the faulting address is MMIO.
But this would complicate the code, so I guess it may not be worth doing.
Thanks for your time. :-)
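For reference, the not-adopted idea above would look roughly like the sketch
below; the helper name is invented, and the downsides called out above still
apply:

/*
 * Sketch: treat slot-less GFNs (e.g. emulated MMIO) as shared so userspace
 * never has to track attributes for MMIO ranges.  Costs a memslot lookup on
 * every attribute query, and the APIC access page still needs special casing.
 */
static bool vcpu_mem_is_private(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	if (!kvm_vcpu_gfn_to_memslot(vcpu, gfn))
		return false;

	return kvm_mem_is_private(vcpu->kvm, gfn);
}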
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully
> handle the unexpected behavior instead of continuing on with dangerous
> state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot,
> and so could install a bogus PFN into the guest.
>
> The existing code is functionally ok, because kvm_faultin_pfn() pre-checks
> all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily
> unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same
> memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set.
> And checking only fault->slot would fall apart if KVM ever added a flag or
> condition that forced emulation, similar to how KVM handles writes to
> read-only memslots.
>
> Cc: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 43f24a74571a..cedacb1b89c5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4468,7 +4468,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> if (unlikely(is_error_pfn(fault->pfn)))
> return kvm_handle_error_pfn(vcpu, fault);
>
> - if (WARN_ON_ONCE(!fault->slot))
> + if (WARN_ON_ONCE(!fault->slot || is_noslot_pfn(fault->pfn)))
> return kvm_handle_noslot_fault(vcpu, fault, access);
>
> /*
Reviewed-by: Kai Huang <[email protected]>
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault()
> to harden KVM against using "uninitialized" values. In quotes because the
> fields are actually zero-initialized, and zero is a legal value for both
> page frame numbers and virtual addresses. E.g. failure to set "pfn" prior
> to creating an SPTE could result in KVM pointing at physical address '0',
> which is far less desirable than KVM generating a SPTE with reserved PA
> bits set and thus effectively killing the VM.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 74736d517e74..67e32dec9424 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -307,6 +307,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> .is_private = err & PFERR_PRIVATE_ACCESS,
> +
> + .pfn = KVM_PFN_ERR_FAULT,
> + .hva = KVM_HVA_ERR_BAD,
> };
> int r;
>
Reviewed-by: Kai Huang <[email protected]>
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just
> after the private versus shared check, so that there's no need to
> repeatedly query whether or not a slot exists. This also makes it more
> obvious that, except for private vs. shared attributes, the process of
> faulting in a pfn simply doesn't apply to gfns without a slot.
>
> Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so
> that it doesn't need to be duplicated in all paths that invoke
> kvm_handle_noslot_fault(), and to minimize the probability of not stuffing
> the right fields.
>
> Leave the existing handling behind, but convert it to a WARN, to guard
> against __kvm_faultin_pfn() unexpectedly nullifying fault->slot.
>
> Cc: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
One nit ...
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -235,7 +235,7 @@ struct kvm_page_fault {
> /* The memslot containing gfn. May be NULL. */
> struct kvm_memory_slot *slot;
>
> - /* Outputs of kvm_faultin_pfn. */
> + /* Outputs of kvm_faultin_pfn. */
> unsigned long mmu_seq;
> kvm_pfn_t pfn;
> hva_t hva;
... how about getting rid of this unrelated fix?
Yeah, it's annoying, but do it in a separate patch?
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot"
> fault to ensure that KVM doesn't use a bogus virtual address, e.g. if
> there *was* a slot but it's unusable (APIC access page), or if there
> really was no slot, in which case fault->hva will be '0' (which is a
> legal address for x86).
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4dee0999a66e..43f24a74571a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3325,6 +3325,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> fault->slot = NULL;
> fault->pfn = KVM_PFN_NOSLOT;
> fault->map_writable = false;
> + fault->hva = KVM_HVA_ERR_BAD;
>
> /*
> * If MMIO caching is disabled, emulate immediately without
Not sure why this cannot be merged to the previous one?
Anyway,
Reviewed-by: Kai Huang <[email protected]>
On Thu, Mar 07, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -235,7 +235,7 @@ struct kvm_page_fault {
> > /* The memslot containing gfn. May be NULL. */
> > struct kvm_memory_slot *slot;
> > - /* Outputs of kvm_faultin_pfn. */
> > + /* Outputs of kvm_faultin_pfn. */
> > unsigned long mmu_seq;
> > kvm_pfn_t pfn;
> > hva_t hva;
>
> ... how about get rid of this non-related fix?
Eww, yeah, I'll drop it. I originally had code that made the outputs valid if
and only if @slot was non-null, and had updated the comment accordingly, but
ended up going a different route and cleaned up the comment a little too well :-)
On Thu, Mar 07, 2024, Kai Huang wrote:
>
>
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot"
> > fault to ensure that KVM doesn't use a bogus virtual address, e.g. if
> > there *was* a slot but it's unusable (APIC access page), or if there
> > really was no slot, in which case fault->hva will be '0' (which is a
> > legal address for x86).
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4dee0999a66e..43f24a74571a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3325,6 +3325,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > fault->slot = NULL;
> > fault->pfn = KVM_PFN_NOSLOT;
> > fault->map_writable = false;
> > + fault->hva = KVM_HVA_ERR_BAD;
> > /*
> > * If MMIO caching is disabled, emulate immediately without
>
> Not sure why this cannot be merged to the previous one?
Purely because (before the previous patch) kvm_faultin_pfn() only paved over pfn,
slot, and map_writable. I highly doubt clobbering hva will break anything, but
just in case...
> Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
> triggers emulation of any kind, as KVM doesn't currently support emulating
> access to guest private memory. Practically speaking, private faults and
> emulation are already mutually exclusive, but there are edge cases upon
> edge cases where KVM can return RET_PF_EMULATE, and adding one last check
Edge cases upon edge cases?
Just curious, what are the details of these edge case scenarios?
> to harden against weird, unexpected combinations is inexpensive.
>
> Suggested-by: Yan Zhao <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 --------
> arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++++++++
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..e2fd74e06ff8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4309,14 +4309,6 @@ static inline u8 kvm_max_level_for_order(int order)
> return PG_LEVEL_4K;
> }
>
> -static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> - struct kvm_page_fault *fault)
> -{
> - kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> - PAGE_SIZE, fault->write, fault->exec,
> - fault->is_private);
> -}
> -
> static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0669a8a668ca..0eea6c5a824d 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -279,6 +279,14 @@ enum {
> RET_PF_SPURIOUS,
> };
>
> +static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> + PAGE_SIZE, fault->write, fault->exec,
> + fault->is_private);
> +}
> +
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> u32 err, bool prefetch, int *emulation_type)
> {
> @@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> else
> r = vcpu->arch.mmu->page_fault(vcpu, &fault);
>
> + if (r == RET_PF_EMULATE && fault.is_private) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> + return -EFAULT;
> + }
> +
> if (fault.write_fault_to_shadow_pgtable && emulation_type)
> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>
On Wed, Mar 06, 2024 at 06:45:30AM -0800, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Xu Yilun wrote:
> > On Tue, Feb 27, 2024 at 06:41:36PM -0800, Sean Christopherson wrote:
> > > Add and use a synthetic, KVM-defined page fault error code to indicate
> > > whether a fault is to private vs. shared memory. TDX and SNP have
> > > different mechanisms for reporting private vs. shared, and KVM's
> > > software-protected VMs have no mechanism at all. Usurp an error code
> > > flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> > > and friends.
> > >
> > > Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> > > for TDX and software-protected VMs as appropriate, but that would require
> > > *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> > > memory at the hardware layer, but don't utilize private memory at the
> > > KVM layer.
> >
> > I see this alternative in other patchset. And I still don't understand why
> > synthetic way is better after reading patch #5-7. I assume for SEV(-ES) if
> > there is spurious PFERR_GUEST_ENC flag set in error code when private memory
> > is not used in KVM, then it is a HW issue or SW bug that needs to be caught
> > and warned, and by the way cleared.
>
> The conundrum is that SEV(-ES) support _encrypted_ memory, but cannot support
> what KVM calls "private" memory. In quotes because SEV(-ES) memory encryption
> does provide confidentiality/privacy, but in KVM they don't support memslots that
I see. I read up on the basics of SEV(-ES/SNP) and finally understand the
difference between encrypted and private. SEV(-ES) is encrypted only; SEV-SNP
supports both encrypted and private (ownership). But it seems we are now
treating encrypted and private as equivalent, i.e. there is no "encrypted but
shared" or "plain but private" memory from KVM's perspective.
> can be switched between private and shared, e.g. will return false for
> kvm_arch_has_private_mem().
>
> And KVM _can't_ sanely use private/shared memslots for SEV(-ES), because it's
> impossible to intercept implicit conversions by the guest, i.e. KVM can't prevent
> the guest from encrypting a page that KVM thinks is private, and vice versa.
Is it because there is no #NPF for RMP violation?
>
> But from hardware's perspective, while the APM is a bit misleading and I don't
> love the behavior, I can't really argue that it's a hardware bug if the CPU sets
> PFERR_GUEST_ENC on a fault where the guest access had C-bit=1, because the access
> _was_ indeed to encrypted memory.
>
> Which is a long-winded way of saying the unwanted PFERR_GUEST_ENC faults aren't
> really spurious, nor are they hardware or software bugs, just another unfortunate
> side effect of the fact that the hypervisor can't intercept shared<->encrypted
> conversions for SEV(-ES) guests. And that means that KVM can't WARN, because
> literally every SNP-capable CPU would yell when running SEV(-ES) guests.
>
> I also don't like the idea of usurping PFERR_GUEST_ENC to have it mean something
> different in KVM as compared to how hardware defines/uses the flag.
Thanks for the explanation. I agree: PFERR_GUEST_ENC just for encrypted, and
a synthetic flag for private.
Yilun
>
> Lastly, the approach in Paolo's series to propagate PFERR_GUEST_ENC to .is_private
> iff the VM has private memory is brittle, in that it would be too easy for KVM
> code that has access to the error code to do the wrong thing and interpret
> PFERR_GUEST_ENC as meaning "private".
On Thu, Mar 07, 2024, Xu Yilun wrote:
> On Wed, Mar 06, 2024 at 06:45:30AM -0800, Sean Christopherson wrote:
> > can be switched between private and shared, e.g. will return false for
> > kvm_arch_has_private_mem().
> >
> > And KVM _can't_ sanely use private/shared memslots for SEV(-ES), because it's
> > impossible to intercept implicit conversions by the guest, i.e. KVM can't prevent
> > the guest from encrypting a page that KVM thinks is private, and vice versa.
>
> Is it because there is no #NPF for RMP violation?
Yep, there is no RMP, thus no way for the host to express its view of shared vs.
private to hardware. As a result, KVM can't block conversions, and the given
state of a page is completely unknown at any given time. E.g. when memory is
reclaimed from an SEV(-ES) guest, KVM has to assume that the page is encrypted
and thus needs to be flushed (see sev_guest_memory_reclaimed()).
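For context, the reclaim hook mentioned above is tiny; roughly (from memory,
so treat the exact body as an approximation):

/*
 * Because the host can't know whether the guest had the page mapped
 * encrypted, assume it did and flush caches before the page is reused.
 */
static void sev_guest_memory_reclaimed(struct kvm *kvm)
{
	if (!sev_guest(kvm))
		return;

	wbinvd_on_all_cpus();
}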
On Thu, Mar 07, 2024 at 11:49:11AM +1300, Huang, Kai wrote:
>
>
> On 7/03/2024 11:43 am, Sean Christopherson wrote:
> > On Thu, Mar 07, 2024, Kai Huang wrote:
> > >
> > >
> > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > Explicitly detect and disallow private accesses to emulated MMIO in
> > > > kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
> > > > to perform the check. This will allow the page fault path to go straight
> > > > to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
> > > >
> > > > Signed-off-by: Sean Christopherson <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 5c8caab64ba2..ebdb3fcce3dc 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> > > > {
> > > > gva_t gva = fault->is_tdp ? 0 : fault->addr;
> > > > + if (fault->is_private) {
> > > > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > + return -EFAULT;
> > > > + }
> > > > +
> > >
> > > As mentioned in another reply in this series, unless I am mistaken, for TDX
> > > guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
> > > being private.
> > >
> > > Returning to userspace cannot really help here because the MMIO mapping is
> > > inside the guest.
> >
> > That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
> > emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
> > enable paging and setup MMIO regions as shared, or go straight to TDCALL.
>
> +Kirill,
>
> I kinda forgot the detail, but what I am afraid is there might be bunch of
> existing TDX guests (since TDX guest code is upstream-ed) using unmodified
> drivers, which doesn't map MMIO regions as shared I suppose.
Unmodified drivers get their MMIO regions mapped with ioremap(), which sets
the shared bit, unless explicitly asked to make the mapping private (encrypted).
--
Kiryl Shutsemau / Kirill A. Shutemov
On 8/03/2024 6:10 am, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 11:49:11AM +1300, Huang, Kai wrote:
>>
>>
>> On 7/03/2024 11:43 am, Sean Christopherson wrote:
>>> On Thu, Mar 07, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>> Explicitly detect and disallow private accesses to emulated MMIO in
>>>>> kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
>>>>> to perform the check. This will allow the page fault path to go straight
>>>>> to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
>>>>>
>>>>> Signed-off-by: Sean Christopherson <[email protected]>
>>>>> ---
>>>>> arch/x86/kvm/mmu/mmu.c | 5 +++++
>>>>> 1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>>> index 5c8caab64ba2..ebdb3fcce3dc 100644
>>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>>> @@ -3314,6 +3314,11 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>>>>> {
>>>>> gva_t gva = fault->is_tdp ? 0 : fault->addr;
>>>>> + if (fault->is_private) {
>>>>> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>>>>> + return -EFAULT;
>>>>> + }
>>>>> +
>>>>
>>>> As mentioned in another reply in this series, unless I am mistaken, for TDX
>>>> guest the _first_ MMIO access would still cause EPT violation with MMIO GFN
>>>> being private.
>>>>
>>>> Returning to userspace cannot really help here because the MMIO mapping is
>>>> inside the guest.
>>>
>>> That's a guest bug. The guest *knows* it's a TDX VM, it *has* to know. Accessing
>>> emulated MMIO and thus taking a #VE before enabling paging is nonsensical. Either
>>> enable paging and setup MMIO regions as shared, or go straight to TDCALL.
>>
>> +Kirill,
>>
>> I kinda forgot the detail, but what I am afraid is there might be bunch of
>> existing TDX guests (since TDX guest code is upstream-ed) using unmodified
>> drivers, which doesn't map MMIO regions as shared I suppose.
>
> Unmodified drivers gets their MMIO regions mapped with ioremap() that sets
> shared bit, unless asked explicitly to make it private (encrypted).
>
Thanks for the clarification. Obviously my memory failed me.
On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> Prioritize private vs. shared gfn attribute checks above slot validity
> checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> userspace if there is no memslot, but emulate accesses to the APIC access
> page even if the attributes mismatch.
>
> Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> Cc: Yu Zhang <[email protected]>
> Cc: Chao Peng <[email protected]>
> Cc: Fuad Tabba <[email protected]>
> Cc: Michael Roth <[email protected]>
> Cc: Isaku Yamahata <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9206cfa58feb..58c5ae8be66c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> - if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> - kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> - return -EFAULT;
> - }
> -
> if (fault->is_private)
> return kvm_faultin_pfn_private(vcpu, fault);
>
> @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> smp_rmb();
>
> + /*
> + * Check for a private vs. shared mismatch *after* taking a snapshot of
> + * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> + * invalidation notifier.
I don't see how mmu_invalidate_seq influences the gfn attribute judgement.
And there is no synchronization between the check below and
kvm_vm_set_mem_attributes(); the gfn attribute could still be changing
after the snapshot. So why this comment?
Thanks,
Yilun
> + */
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> + return -EFAULT;
> + }
> +
> /*
> * Check for a relevant mmu_notifier invalidation event before getting
> * the pfn from the primary MMU, and before acquiring mmu_lock.
> --
> 2.44.0.278.ge034bb2e1d-goog
>
>
On Tue, Feb 27, 2024 at 06:41:32PM -0800, Sean Christopherson wrote:
> Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
> triggers emulation of any kind, as KVM doesn't currently support emulating
> access to guest private memory. Practically speaking, private faults and
> emulation are already mutually exclusive, but there are edge cases upon
> edge cases where KVM can return RET_PF_EMULATE, and adding one last check
> to harden against weird, unexpected combinations is inexpensive.
>
> Suggested-by: Yan Zhao <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 --------
> arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++++++++
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..e2fd74e06ff8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4309,14 +4309,6 @@ static inline u8 kvm_max_level_for_order(int order)
> return PG_LEVEL_4K;
> }
>
> -static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> - struct kvm_page_fault *fault)
> -{
> - kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> - PAGE_SIZE, fault->write, fault->exec,
> - fault->is_private);
> -}
> -
> static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0669a8a668ca..0eea6c5a824d 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -279,6 +279,14 @@ enum {
> RET_PF_SPURIOUS,
> };
>
> +static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> + PAGE_SIZE, fault->write, fault->exec,
> + fault->is_private);
> +}
> +
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> u32 err, bool prefetch, int *emulation_type)
> {
> @@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> else
> r = vcpu->arch.mmu->page_fault(vcpu, &fault);
>
> + if (r == RET_PF_EMULATE && fault.is_private) {
Should we just check VM type + RET_PF_EMULATE, and abort?
If r is RET_PF_EMULATE, and the fault is caused by accessing a shared address,
the emulation code could still hit an error if the guest page table pages are
in private memory, right?
> + kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> + return -EFAULT;
> + }
> +
> if (fault.write_fault_to_shadow_pgtable && emulation_type)
> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>
> --
> 2.44.0.278.ge034bb2e1d-goog
>
On Fri, Mar 08, 2024, Xu Yilun wrote:
> On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > Prioritize private vs. shared gfn attribute checks above slot validity
> > checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> > userspace if there is no memslot, but emulate accesses to the APIC access
> > page even if the attributes mismatch.
> >
> > Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> > Cc: Yu Zhang <[email protected]>
> > Cc: Chao Peng <[email protected]>
> > Cc: Fuad Tabba <[email protected]>
> > Cc: Michael Roth <[email protected]>
> > Cc: Isaku Yamahata <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> > 1 file changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9206cfa58feb..58c5ae8be66c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > return RET_PF_EMULATE;
> > }
> >
> > - if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > - kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > - return -EFAULT;
> > - }
> > -
> > if (fault->is_private)
> > return kvm_faultin_pfn_private(vcpu, fault);
> >
> > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > smp_rmb();
> >
> > + /*
> > + * Check for a private vs. shared mismatch *after* taking a snapshot of
> > + * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > + * invalidation notifier.
>
> I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> And there is no synchronization between the below check and
> kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> after the snapshot.
There is synchronization. If kvm_vm_set_mem_attributes() changes the attributes,
and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
then is_page_fault_stale() will detect that an invalidation related to the gfn
occurred and resume the guest *without* installing a mapping in KVM's page tables.
I.e. KVM may read the old, stale gfn attributes, but it will never actually
expose the stale attributes to the guest.
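
To make the ordering concrete, a condensed sketch of the sequence (not the
literal mmu.c code, just the ordering; is_page_fault_stale() actually runs
later, under mmu_lock, in the fault handlers):

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/* This may observe a stale attribute value... */
	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}

	/* ...fault in the pfn, build the mapping, etc... */

	/*
	 * ...but before any SPTE is installed, is_page_fault_stale() compares
	 * fault->mmu_seq against the current sequence.  Because
	 * kvm_vm_set_mem_attributes() bumps the sequence, a racing attribute
	 * change forces a retry instead of exposing stale attributes.
	 */
	if (is_page_fault_stale(vcpu, fault))
		return RET_PF_RETRY;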
On Fri, Mar 08, 2024 at 03:28:08PM -0800, Sean Christopherson wrote:
> On Fri, Mar 08, 2024, Xu Yilun wrote:
> > On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
> > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > page even if the attributes mismatch.
> > >
> > > Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> > > Cc: Yu Zhang <[email protected]>
> > > Cc: Chao Peng <[email protected]>
> > > Cc: Fuad Tabba <[email protected]>
> > > Cc: Michael Roth <[email protected]>
> > > Cc: Isaku Yamahata <[email protected]>
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> > > 1 file changed, 10 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 9206cfa58feb..58c5ae8be66c 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > > return RET_PF_EMULATE;
> > > }
> > >
> > > - if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > - kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > - return -EFAULT;
> > > - }
> > > -
> > > if (fault->is_private)
> > > return kvm_faultin_pfn_private(vcpu, fault);
> > >
> > > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > smp_rmb();
> > >
> > > + /*
> > > + * Check for a private vs. shared mismatch *after* taking a snapshot of
> > > + * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > > + * invalidation notifier.
> >
> > I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> > And there is no synchronization between the below check and
> > kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> > after the snapshot.
>
> There is synchronization. If kvm_vm_set_mem_attributes() changes the attributes,
> and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
> then is_page_fault_stale() will detect that an invalidation related to the gfn
> occurred and resume the guest *without* installing a mapping in KVM's page tables.
>
> I.e. KVM may read the old, stale gfn attributes, but it will never actually
> expose the stale attributes to the guest.
That makes sense! I was just thinking about the racing for the few lines below,
if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return -EFAULT;
}
But the guarding is actually for the whole kvm_faultin_pfn(). It is the
same mechanism for getting old gfn attributes as for getting an old pfn.
I wonder if we could instead add some general comments at
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
about the snapshot and is_page_fault_stale() thing.
Thanks,
Yilun
>
On Mon, Mar 11, 2024, Xu Yilun wrote:
> On Fri, Mar 08, 2024 at 03:28:08PM -0800, Sean Christopherson wrote:
> > On Fri, Mar 08, 2024, Xu Yilun wrote:
> > > On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > > > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > > smp_rmb();
> > > >
> > > > + /*
> > > > + * Check for a private vs. shared mismatch *after* taking a snapshot of
> > > > + * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > > > + * invalidation notifier.
> > >
> > > I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> > > And there is no synchronization between the below check and
> > > kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> > > after the snapshot.
> >
> > There is synchronization. If kvm_vm_set_mem_attributes() changes the attributes,
> > and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
> > then is_page_fault_stale() will detect that an invalidation related to the gfn
> > occurred and resume the guest *without* installing a mapping in KVM's page tables.
> >
> > I.e. KVM may read the old, stale gfn attributes, but it will never actually
> > expose the stale attributes to the guest.
>
> That makes sense! I was just thinking about the racing for the few lines below,
>
> if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> return -EFAULT;
> }
>
> But the guarding is actually for the whole kvm_faultin_pfn(). It is the
> same mechanism for getting old gfn attributes as for getting an old pfn.
>
> I wonder if we could instead add some general comments at
>
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>
> about the snapshot and is_page_fault_stale() thing.
Yeah, I'll add a comment. The only reason not to add a comment is that, ideally,
the comment/documentation would live in common KVM code, not x86. But this code
already has a few big comments about the mmu_notifier retry logic, so one more
definitely won't hurt.
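
Something along these lines at the snapshot site, give or take the exact
wording (illustrative only, not the final patch):

	/*
	 * Snapshot the invalidation sequence *before* reading any gfn state
	 * (attributes, memslot, pfn).  If a relevant invalidation, including
	 * a gfn attribute update, lands after this point, is_page_fault_stale()
	 * will detect it under mmu_lock and the fault will be retried rather
	 * than installing a mapping based on stale state.
	 */
	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();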
On 3/7/2024 8:52 PM, Gupta, Pankaj wrote:
>> Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private
>> fault
>> triggers emulation of any kind, as KVM doesn't currently support
>> emulating
>> access to guest private memory. Practically speaking, private faults
>> and
>> emulation are already mutually exclusive, but there are edge cases upon
>> edge cases where KVM can return RET_PF_EMULATE, and adding one last
>> check
>
> edge cases upon edge cases?
>
> Just curious about the details of the edge cases scenarios?
Same question out of curiosity.
>
>> to harden against weird, unexpected combinations is inexpensive.
>>
>> Suggested-by: Yan Zhao <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 8 --------
>> arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++++++++
>> 2 files changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index e4cc7f764980..e2fd74e06ff8 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4309,14 +4309,6 @@ static inline u8 kvm_max_level_for_order(int
>> order)
>> return PG_LEVEL_4K;
>> }
>> -static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>> - struct kvm_page_fault *fault)
>> -{
>> - kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
>> - PAGE_SIZE, fault->write, fault->exec,
>> - fault->is_private);
>> -}
>> -
>> static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
>> struct kvm_page_fault *fault)
>> {
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h
>> b/arch/x86/kvm/mmu/mmu_internal.h
>> index 0669a8a668ca..0eea6c5a824d 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -279,6 +279,14 @@ enum {
>> RET_PF_SPURIOUS,
>> };
>> +static inline void kvm_mmu_prepare_memory_fault_exit(struct
>> kvm_vcpu *vcpu,
>> + struct kvm_page_fault *fault)
>> +{
>> + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
>> + PAGE_SIZE, fault->write, fault->exec,
>> + fault->is_private);
>> +}
>> +
>> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu,
>> gpa_t cr2_or_gpa,
>> u32 err, bool prefetch, int *emulation_type)
>> {
>> @@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct
>> kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> else
>> r = vcpu->arch.mmu->page_fault(vcpu, &fault);
>> + if (r == RET_PF_EMULATE && fault.is_private) {
>> + kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
>> + return -EFAULT;
>> + }
>> +
>> if (fault.write_fault_to_shadow_pgtable && emulation_type)
>> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>
>
On 2/28/2024 10:41 AM, Sean Christopherson wrote:
> Add and use a synthetic, KVM-defined page fault error code to indicate
> whether a fault is to private vs. shared memory. TDX and SNP have
> different mechanisms for reporting private vs. shared, and KVM's
> software-protected VMs have no mechanism at all. Usurp an error code
> flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> and friends.
>
> Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> for TDX and software-protected VMs as appropriate, but that would require
> *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> memory at the hardware layer, but don't utilize private memory at the
> KVM layer.
>
> Opportunistically add a comment to call out that the logic for software-
> protected VMs is (and was before this commit) broken for nested MMUs, i.e.
> for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with
> nested MMUs as there is a _lot_ of functionality that simply doesn't work
> for software-protected VMs, e.g. all of the paths where KVM accesses guest
> memory need to be updated to be aware of private vs. shared memory.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 11 +++++++++++
> arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> 3 files changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1e69743ef0fb..4077c46c61ab 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -267,7 +267,18 @@ enum x86_intercept_stage;
> #define PFERR_GUEST_ENC_MASK BIT_ULL(34)
> #define PFERR_GUEST_SIZEM_MASK BIT_ULL(35)
> #define PFERR_GUEST_VMPL_MASK BIT_ULL(36)
> +
> +/*
> + * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP checks
> + * when emulating instructions that triggers implicit access.
> + */
> #define PFERR_IMPLICIT_ACCESS BIT_ULL(48)
> +/*
> + * PRIVATE_ACCESS is a KVM-defined flag us to indicate that a fault occurred
Here "us" is a typo? should be used?
> + * when the guest was accessing private memory.
> + */
> +#define PFERR_PRIVATE_ACCESS BIT_ULL(49)
> +#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
>
> #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
> PFERR_WRITE_MASK | \
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 408969ac1291..7807bdcd87e8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5839,19 +5839,31 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> /*
> - * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
> - * checks when emulating instructions that triggers implicit access.
> * WARN if hardware generates a fault with an error code that collides
> - * with the KVM-defined value. Clear the flag and continue on, i.e.
> - * don't terminate the VM, as KVM can't possibly be relying on a flag
> - * that KVM doesn't know about.
> + * with KVM-defined sythentic flags. Clear the flags and continue on,
:s/sythentic/synthetic/
Others,
Reviewed-by: Binbin Wu <[email protected]>
> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> + * flag that KVM doesn't know about.
> */
> - if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
> - error_code &= ~PFERR_IMPLICIT_ACCESS;
> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> + error_code &= ~PFERR_SYNTHETIC_MASK;
>
> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> + /*
> + * Except for reserved faults (emulated MMIO is shared-only), set the
> + * private flag for software-protected VMs based on the gfn's current
> + * attributes, which are the source of truth for such VMs. Note, this is
> + * wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
> + * currently support nested virtualization (among many other things)
> + * for software-protected VMs.
> + */
> + if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> + !(error_code & PFERR_RSVD_MASK) &&
> + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + error_code |= PFERR_PRIVATE_ACCESS;
> +
> r = RET_PF_INVALID;
> if (unlikely(error_code & PFERR_RSVD_MASK)) {
> r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1fab1f2359b5..d7c10d338f14 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -306,7 +306,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> + .is_private = err & PFERR_PRIVATE_ACCESS,
> };
> int r;
>
On 2/28/2024 10:41 AM, Sean Christopherson wrote:
> Move the sanity check that hardware never sets bits that collide with KVM-
> define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
> i.e. make the sanity check #NPF specific. The legacy #PF path already
> WARNs if _any_ of bits 63:32 are set, and the error code that comes from
> VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
"Violatation" -> "Violation"
> EXIT_QUALIFICATION into error code flags).
>
> Add a compile-time assert in the legacy #PF handler to make sure that KVM-
> define flags are covered by its existing sanity check on the upper bits.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 12 +++---------
> arch/x86/kvm/svm/svm.c | 9 +++++++++
> 2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5d892bd59c97..bd342ebd0809 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4561,6 +4561,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> if (WARN_ON_ONCE(error_code >> 32))
> error_code = lower_32_bits(error_code);
>
> + /* Ensure the above sanity check also covers KVM-defined flags. */
> + BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
> +
> vcpu->arch.l1tf_flush_l1d = true;
> if (!flags) {
> trace_kvm_page_fault(vcpu, fault_address, error_code);
> @@ -5845,15 +5848,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> int r, emulation_type = EMULTYPE_PF;
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> - /*
> - * WARN if hardware generates a fault with an error code that collides
> - * with KVM-defined sythentic flags. Clear the flags and continue on,
> - * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> - * flag that KVM doesn't know about.
> - */
> - if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> - error_code &= ~PFERR_SYNTHETIC_MASK;
> -
> if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e90b429c84f1..199c4dd8d214 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2055,6 +2055,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> u64 fault_address = svm->vmcb->control.exit_info_2;
> u64 error_code = svm->vmcb->control.exit_info_1;
>
> + /*
> + * WARN if hardware generates a fault with an error code that collides
> + * with KVM-defined sythentic flags. Clear the flags and continue on,
"sythentic" -> "synthetic"
Two typos.
Others,
Reviewed-by: Binbin Wu <[email protected]>
> + * i.e. don't terminate the VM, as KVM can't possibly be relying on a
> + * flag that KVM doesn't know about.
> + */
> + if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
> + error_code &= ~PFERR_SYNTHETIC_MASK;
> +
> trace_kvm_page_fault(vcpu, fault_address, error_code);
> return kvm_mmu_page_fault(vcpu, fault_address, error_code,
> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
On 3/1/2024 7:07 AM, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
>> I found "legacy #PF" is a little bit confusing but I couldn't figure out a
>> better name either :-)
Me too.
>>
>>> as the error code for #PF is limited to 32 bits (and in practice, 16 bits
>>> on Intel CPUS). This behavior is architectural, is part of KVM's ABI
>>> (see kvm_vcpu_events.error_code), and is explicitly documented as being
>>> preserved for intecerpted #PF in both the APM:
"intecerpted" -> "intercepted"
>>>
>>> The error code saved in EXITINFO1 is the same as would be pushed onto
>>> the stack by a non-intercepted #PF exception in protected mode.
>>>
>>> and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
>>> 32-bit field.
>>>
>>> Simply drop the upper bits of hardware provides garbage, as spurious
>> "of" -> "if" ?
>>
>>> information should do no harm (though in all likelihood hardware is buggy
>>> and the kernel is doomed).
>>>
>>> Handling all upper 32 bits in the #PF path will allow moving the sanity
>>> check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
>>> which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
>>> PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
>>>
>>> Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
>> "this" -> "this is" ?
>>
>>> and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
>>> bit minimizes the probability of a collision with the "other" vendor,
>>> without needing to plumb more bits through microcode.
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 7807bdcd87e8..5d892bd59c97 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>>> if (WARN_ON_ONCE(fault_address >> 32))
>>> return -EFAULT;
>>> #endif
>>> + /*
>>> + * Legacy #PF exception only have a 32-bit error code. Simply drop the
>> "have" -> "has" ?
> This one I'll fix by making "exception" plural.
>
> Thanks much for the reviews!
>
>>> + * upper bits as KVM doesn't use them for #PF (because they are never
>>> + * set), and to ensure there are no collisions with KVM-defined bits.
>>> + */
>>> + if (WARN_ON_ONCE(error_code >> 32))
>>> + error_code = lower_32_bits(error_code);
>>> vcpu->arch.l1tf_flush_l1d = true;
>>> if (!flags) {
>> Reviewed-by: Kai Huang <[email protected]>
On Fri, Mar 08, 2024, Yan Zhao wrote:
> On Tue, Feb 27, 2024 at 06:41:32PM -0800, Sean Christopherson wrote:
> > @@ -320,6 +328,11 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > else
> > r = vcpu->arch.mmu->page_fault(vcpu, &fault);
> >
> > + if (r == RET_PF_EMULATE && fault.is_private) {
> Should we just check VM type + RET_PF_EMULATE, and abort?
No, the goal here is purely to ensure that emulation is never triggered for
private memory. Guarding against attempting emulation for a VM type that doesn't
support emulation at all is something different.
And more concretely, as of this commit, all VM types that support private memory
(i.e. SW_PROTECTED_VM) support emulation, just not for private memory.
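
To make the distinction concrete (the second check below is purely
hypothetical, kvm_vm_type_supports_emulation() doesn't exist):

	/* This series: reject emulation for *private* faults only. */
	if (r == RET_PF_EMULATE && fault.is_private) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
		return -EFAULT;
	}

	/*
	 * A VM-type gate would be a different check solving a different
	 * problem, i.e. "this VM type can't emulate anything, period".
	 */
	if (r == RET_PF_EMULATE && !kvm_vm_type_supports_emulation(vcpu->kvm))
		return -EFAULT;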
> If r is RET_PF_EMULATE, and the fault is caused by accessing a shared address,
> the emulation code could still hit an error if guest page table pages are in
> private memory, right?
Yes, which is why I squeezed in a documentation update for v6.8 to make it super
clear that SW_PROTECTED_VM is a development vehicle, i.e. that trying to use it
to run a real VM is all but guaranteed to cause explosions.
On Tue, Mar 12, 2024, Binbin Wu wrote:
>
> On 3/7/2024 8:52 PM, Gupta, Pankaj wrote:
> > > Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
> > > triggers emulation of any kind, as KVM doesn't currently support
> > > emulating access to guest private memory. Practically speaking, private
> > > faults and emulation are already mutually exclusive, but there are edge
> > > cases upon edge cases where KVM can return RET_PF_EMULATE, and adding one
> > > last check
> >
> > edge cases upon edge cases?
> >
> > Just curious about the details of the edge cases scenarios?
>
> Same question out of curiosity.
Accesses that hit the APIC-access page and gfns that are write-tracked are the
two most likely candidates. Even plain old emulated MMIO falls into this bucket,
e.g. if KVM botched things and generated a RSVD fault on a private mapping. I'll
reword that line to
faults and emulation are already mutually exclusive, but there are many
flows that can result in KVM returning RET_PF_EMULATE, and adding one last
check to harden against weird, unexpected combinations and/or KVM bugs is
inexpensive.
to make it sound less dramatic/hand-wavy.
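
For reference, roughly where those flows come from (based on my reading of
the current code, so treat the call sites as approximate):

	/*
	 * Flows that can surface RET_PF_EMULATE (not exhaustive):
	 *
	 *  - handle_mmio_page_fault(): reserved-bit faults on emulated MMIO.
	 *  - __kvm_faultin_pfn(): accesses that resolve to internal memslots,
	 *    e.g. the APIC-access page.
	 *  - page_fault_handle_page_track(): writes to write-protected
	 *    (page-tracked) gfns, e.g. shadowed guest page tables.
	 *
	 * The new check refuses to enter any of those paths when
	 * fault.is_private is set.
	 */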
On Wed, Feb 28, 2024 at 3:41 AM Sean Christopherson <[email protected]> wrote:
>
> This is a combination of prep work for TDX and SNP, and a clean up of the
> page fault path to (hopefully) make it easier to follow the rules for
> private memory, noslot faults, writes to read-only slots, etc.
>
> Paolo, this is the series I mentioned in your TDX/SNP prep work series.
> Stating the obvious, these
>
> KVM: x86/mmu: Pass full 64-bit error code when handling page faults
> KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
>
> are the drop-in replacements.
Applied to kvm-coco-queue, thanks, and these to kvm/queue as well:
KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
KVM: x86: Remove separate "bit" defines for page fault error code masks
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
KVM: x86/mmu: Pass full 64-bit error code when handling page faults
KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
I have made a little hack for kvm-coco-queue, preserving for now the
usage of PFERR_GUEST_ENC_MASK in case people were relying on the
branch, to limit the rebase pain.
The remaining parts are split into a "[TO SQUASH] KVM: x86/mmu: Use
synthetic page fault error code to indicate private faults" commit at
the end of the branch.
Paolo
> Isaku Yamahata (1):
> KVM: x86/mmu: Pass full 64-bit error code when handling page faults
>
> Sean Christopherson (15):
> KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits
> emulation
> KVM: x86: Remove separate "bit" defines for page fault error code
> masks
> KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
> KVM: x86/mmu: Use synthetic page fault error code to indicate private
> faults
> KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are
> non-zero
> KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
> KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page
> faults
> KVM: x86/mmu: Move private vs. shared check above slot validity checks
> KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC
> internal slots
> KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO
> KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to
> kvm_faultin_pfn()
> KVM: x86/mmu: Handle no-slot faults at the beginning of
> kvm_faultin_pfn()
> KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot"
> faults
> KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
> KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create
> noslot pfns
>
> arch/x86/include/asm/kvm_host.h | 45 ++++-----
> arch/x86/kvm/mmu.h | 4 +-
> arch/x86/kvm/mmu/mmu.c | 159 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 24 ++++-
> arch/x86/kvm/mmu/mmutrace.h | 2 +-
> arch/x86/kvm/svm/svm.c | 9 ++
> 6 files changed, 151 insertions(+), 92 deletions(-)
>
>
> base-commit: ec1e3d33557babed2c2c2c7da6e84293c2f56f58
> --
> 2.44.0.278.ge034bb2e1d-goog
>
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 3:41 AM Sean Christopherson <[email protected]> wrote:
> I have made a little hack for kvm-coco-queue, preserving for now the
> usage of PFERR_GUEST_ENC_MASK in case people were relying on the
> branch, to limit the rebase pain.
>
> The remaining parts are split into a "[TO SQUASH] KVM: x86/mmu: Use
> synthetic page fault error code to indicate private faults" commit at
> the end of the branch.
Ahh, I should have read this before reviewing the other patches. Thanks!
On 4/17/2024 8:48 PM, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 3:41 AM Sean Christopherson <[email protected]> wrote:
>>
>> This is a combination of prep work for TDX and SNP, and a clean up of the
>> page fault path to (hopefully) make it easier to follow the rules for
>> private memory, noslot faults, writes to read-only slots, etc.
>>
>> Paolo, this is the series I mentioned in your TDX/SNP prep work series.
>> Stating the obvious, these
>>
>> KVM: x86/mmu: Pass full 64-bit error code when handling page faults
>> KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
>>
>> are the drop-in replacements.
>
> Applied to kvm-coco-queue, thanks, and these to kvm/queue as well:
>
> KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
> KVM: x86: Remove separate "bit" defines for page fault error code masks
> KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
> KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
> KVM: x86/mmu: Pass full 64-bit error code when handling page faults
> KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
Paolo,
It seems you forgot to incorporate the review comments into the patches
before queueing them to kvm/queue.
e.g., the comment from Dongli to
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
https://lore.kernel.org/all/[email protected]/