2024-02-09 22:29:16

by Sean Christopherson

Subject: [PATCH v4 0/4] KVM: x86/mmu: Pre-check for mmu_notifier retry

Retry page faults without acquiring mmu_lock, and potentially even without
resolving a pfn, if the gfn is covered by an active invalidation. This
avoids resource and lock contention, which can be especially beneficial
for preemptible kernels as KVM can get stuck bouncing mmu_lock between a
vCPU and the invalidation task the vCPU is waiting on to finish.
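
In short, kvm_faultin_pfn() gains a lockless pre-check before the pfn is
resolved (condensed here from patch 1; the series also adds a second check
after the pfn is resolved, and the existing check under mmu_lock remains):

        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();

        /* Bail and let the vCPU retry if the gfn is being invalidated. */
        if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
                return RET_PF_RETRY;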

v4:
- Pre-check for retry before resolving the pfn, too. [Yan]
- Add a patch to fix a private/shared vs. memslot validity check
priority inversion bug.
- Refactor kvm_faultin_pfn() to clean up the handling of noslot faults.

v3:
- https://lkml.kernel.org/r/20240203003518.387220-1-seanjc%40google.com
- Release the pfn, i.e. put the struct page reference if one was held,
as the caller doesn't expect to get a reference on "failure". [Yuan]
- Fix a typo in the comment.

v2:
- Introduce a dedicated helper and collapse to a single patch (because
adding an unused helper would be quite silly).
- Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
- Add Kai's Ack.

v1: https://lore.kernel.org/all/[email protected]

Sean Christopherson (4):
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is
changing
KVM: x86/mmu: Move private vs. shared check above slot validity checks
KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to
kvm_faultin_pfn()
KVM: x86/mmu: Handle no-slot faults at the beginning of
kvm_faultin_pfn()

arch/x86/kvm/mmu/mmu.c | 134 ++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 5 +-
include/linux/kvm_host.h | 26 +++++++
3 files changed, 122 insertions(+), 43 deletions(-)


base-commit: 60eedcfceda9db46f1b333e5e1aa9359793f04fb
--
2.43.0.687.g38aa6559b0-goog



2024-02-09 22:30:01

by Sean Christopherson

Subject: [PATCH v4 2/4] KVM: x86/mmu: Move private vs. shared check above slot validity checks

Prioritize private vs. shared gfn attribute checks above slot validity
checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to
userspace if there is no memslot, but emulate accesses to the APIC access
page even if the attributes mismatch.
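
With this patch, the mismatch check sits at the very top of kvm_faultin_pfn(),
ahead of the mmu_seq snapshot and all memslot handling (condensed from the
diff below):

        struct kvm_memory_slot *slot = fault->slot;
        int ret;

        /* Report the attribute mismatch before any slot validity checks. */
        if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                return -EFAULT;
        }

        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();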

Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: Yu Zhang <[email protected]>
Cc: Chao Peng <[email protected]>
Cc: Fuad Tabba <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Isaku Yamahata <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 166cef0c3ff4..50bfaa53f3f2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4360,11 +4360,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
- kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
- return -EFAULT;
- }
-
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);

@@ -4403,6 +4398,11 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
struct kvm_memory_slot *slot = fault->slot;
int ret;

+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();

--
2.43.0.687.g38aa6559b0-goog


2024-02-09 22:30:15

by Sean Christopherson

Subject: [PATCH v4 1/4] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

Retry page faults without acquiring mmu_lock, and without even faulting
the page into the primary MMU, if the resolved gfn is covered by an active
invalidation. Contending for mmu_lock is especially problematic on
preemptible kernels as the mmu_notifier invalidation task will yield
mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault. And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.

Faulting the page into the primary MMU is similarly problematic, as doing
so may acquire locks that need to be taken for the invalidation to
complete (the primary MMU has finer grained locks than KVM's MMU), and/or
may cause unnecessary churn (getting/putting pages, marking them accessed,
etc).

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockless version of the range-based retry check to avoid
false positives from the start+end sanity check WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.
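
As a purely illustrative userspace analogue of that hazard (the READ_ONCE()
definition below mimics the kernel's; none of this is KVM code):

        #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

        static long mmu_invalidate_in_progress;

        static void wait_for_invalidation(void)
        {
                /*
                 * A plain load could be hoisted out of the loop by the
                 * compiler, spinning forever on a stale value; READ_ONCE()
                 * forces a fresh load on every iteration so the 1=>0
                 * transition is eventually observed.
                 */
                while (READ_ONCE(mmu_invalidate_in_progress))
                        ;
        }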

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock. This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Reported-by: Yan Zhao <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: Friedrich Weber <[email protected]>
Cc: Kai Huang <[email protected]>
Cc: Yan Zhao <[email protected]>
Cc: Yuan Yao <[email protected]>
Cc: Xu Yilun <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++++++++++++++++-
include/linux/kvm_host.h | 26 +++++++++++++++++++++++
2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c193b096b45..166cef0c3ff4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4400,11 +4400,37 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
unsigned int access)
{
+ struct kvm_memory_slot *slot = fault->slot;
int ret;

fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();

+ /*
+ * Check for a relevant mmu_notifier invalidation event before getting
+ * the pfn from the primary MMU, and before acquiring mmu_lock.
+ *
+ * For mmu_lock, if there is an in-progress invalidation and the kernel
+ * allows preemption, the invalidation task may drop mmu_lock and yield
+ * in response to mmu_lock being contended, which is *very* counter-
+ * productive as this vCPU can't actually make forward progress until
+ * the invalidation completes.
+ *
+ * Retrying now can also avoid unnecessary lock contention in the primary
+ * MMU, as the primary MMU doesn't necessarily hold a single lock for
+ * the duration of the invalidation, i.e. faulting in a conflicting pfn
+ * can cause the invalidation to take longer by holding locks that are
+ * needed to complete the invalidation.
+ *
+ * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
+ * will never yield mmu_lock in response to contention, as this vCPU is
+ * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
+ * to detect retry guarantees the worst case latency for the vCPU.
+ */
+ if (!slot &&
+ mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+ return RET_PF_RETRY;
+
ret = __kvm_faultin_pfn(vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;
@@ -4412,9 +4438,21 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);

- if (unlikely(!fault->slot))
+ if (unlikely(!slot))
return kvm_handle_noslot_fault(vcpu, fault, access);

+ /*
+ * Check again for a relevant mmu_notifier invalidation event purely to
+ * avoid contending mmu_lock. Most invalidations will be detected by
+ * the previous check, but checking is extremely cheap relative to the
+ * overall cost of failing to detect the invalidation until after
+ * mmu_lock is acquired.
+ */
+ if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
+ kvm_release_pfn_clean(fault->pfn);
+ return RET_PF_RETRY;
+ }
+
return RET_PF_CONTINUE;
}

@@ -4442,6 +4480,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
return true;

+ /*
+ * Check for a relevant mmu_notifier invalidation event one last time
+ * now that mmu_lock is held, as the "unsafe" checks performed without
+ * holding mmu_lock can get false negatives.
+ */
return fault->slot &&
mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..179df96b20f8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
return 1;
return 0;
}
+
+/*
+ * This lockless version of the range-based retry check *must* be paired with a
+ * call to the locked version after acquiring mmu_lock, i.e. this is safe to
+ * use only as a pre-check to avoid contending mmu_lock. This version *will*
+ * get false negatives and false positives.
+ */
+static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
+ unsigned long mmu_seq,
+ gfn_t gfn)
+{
+ /*
+ * Use READ_ONCE() to ensure the in-progress flag and sequence counter
+ * are always read from memory, e.g. so that checking for retry in a
+ * loop won't result in an infinite retry loop. Don't force loads for
+ * start+end, as the key to avoiding infinite retry loops is observing
+ * the 1=>0 transition of in-progress, i.e. getting false negatives
+ * due to stale start+end values is acceptable.
+ */
+ if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
+ gfn >= kvm->mmu_invalidate_range_start &&
+ gfn < kvm->mmu_invalidate_range_end)
+ return true;
+
+ return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
+}
#endif

#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
--
2.43.0.687.g38aa6559b0-goog


2024-02-09 22:30:31

by Sean Christopherson

Subject: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

Move the checks related to the validity of an access to a memslot from the
inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This
allows emulating accesses to the APIC access page, which don't need to
resolve a pfn, even if there is a relevant in-progress mmu_notifier
invalidation. Ditto for accesses to KVM internal memslots from L2, which
KVM also treats as emulated MMIO.

More importantly, this will allow for future cleanup by having the
"no memslot" case bail from kvm_faultin_pfn() very early on.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 62 ++++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 50bfaa53f3f2..505fc7eef533 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4333,33 +4333,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
struct kvm_memory_slot *slot = fault->slot;
bool async;

- /*
- * Retry the page fault if the gfn hit a memslot that is being deleted
- * or moved. This ensures any existing SPTEs for the old memslot will
- * be zapped before KVM inserts a new MMIO SPTE for the gfn.
- */
- if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
- return RET_PF_RETRY;
-
- if (!kvm_is_visible_memslot(slot)) {
- /* Don't expose private memslots to L2. */
- if (is_guest_mode(vcpu)) {
- fault->slot = NULL;
- fault->pfn = KVM_PFN_NOSLOT;
- fault->map_writable = false;
- return RET_PF_CONTINUE;
- }
- /*
- * If the APIC access page exists but is disabled, go directly
- * to emulation without caching the MMIO access or creating a
- * MMIO SPTE. That way the cache doesn't need to be purged
- * when the AVIC is re-enabled.
- */
- if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
- !kvm_apicv_activated(vcpu->kvm))
- return RET_PF_EMULATE;
- }
-
if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);

@@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();

+ if (!slot)
+ goto faultin_pfn;
+
+ /*
+ * Retry the page fault if the gfn hit a memslot that is being deleted
+ * or moved. This ensures any existing SPTEs for the old memslot will
+ * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+ */
+ if (slot->flags & KVM_MEMSLOT_INVALID)
+ return RET_PF_RETRY;
+
+ if (!kvm_is_visible_memslot(slot)) {
+ /* Don't expose KVM's internal memslots to L2. */
+ if (is_guest_mode(vcpu)) {
+ fault->slot = NULL;
+ fault->pfn = KVM_PFN_NOSLOT;
+ fault->map_writable = false;
+ return RET_PF_CONTINUE;
+ }
+
+ /*
+ * If the APIC access page exists but is disabled, go directly
+ * to emulation without caching the MMIO access or creating a
+ * MMIO SPTE. That way the cache doesn't need to be purged
+ * when the AVIC is re-enabled.
+ */
+ if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
+ !kvm_apicv_activated(vcpu->kvm))
+ return RET_PF_EMULATE;
+ }
+
/*
* Check for a relevant mmu_notifier invalidation event before getting
* the pfn from the primary MMU, and before acquiring mmu_lock.
@@ -4427,10 +4431,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
* to detect retry guarantees the worst case latency for the vCPU.
*/
- if (!slot &&
- mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+ if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
return RET_PF_RETRY;

+faultin_pfn:
ret = __kvm_faultin_pfn(vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;
--
2.43.0.687.g38aa6559b0-goog


2024-02-09 22:31:02

by Sean Christopherson

Subject: [PATCH v4 4/4] KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()

Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just
after the private versus shared check, so that there's no need to
repeatedly query whether or not a slot exists. This also makes it more
obvious that, except for private vs. shared attributes, the process of
faulting in a pfn simply doesn't apply to gfns without a slot.
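
For reference, the resulting order of operations in kvm_faultin_pfn() at the
end of the series (condensed from the diffs; error handling and the memslot
validity/visibility checks are elided):

        /* 1. Private vs. shared attribute mismatch: memory fault exit. */
        if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                return -EFAULT;
        }

        /* 2. No memslot: emulate MMIO (or -EFAULT for private faults). */
        if (unlikely(!slot))
                return kvm_handle_noslot_fault(vcpu, fault, access);

        /* 3. Snapshot the sequence count, then check memslot validity. */
        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();

        /* 4. Lockless mmu_notifier retry pre-check, then resolve the pfn. */
        if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
                return RET_PF_RETRY;

        ret = __kvm_faultin_pfn(vcpu, fault);

        /* 5. A second lockless retry check runs after the pfn is resolved. */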

Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
arch/x86/kvm/mmu/mmu_internal.h | 5 ++++-
2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 505fc7eef533..7a2874756b3f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3278,6 +3278,14 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
}

+static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+ PAGE_SIZE, fault->write, fault->exec,
+ fault->is_private);
+}
+
static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
{
unsigned long hva = gfn_to_hva_memslot(slot, gfn);
@@ -3314,9 +3322,16 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
{
gva_t gva = fault->is_tdp ? 0 : fault->addr;

+ if (fault->is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
access & shadow_mmio_access_mask);

+ fault->pfn = KVM_PFN_NOSLOT;
+
/*
* If MMIO caching is disabled, emulate immediately without
* touching the shadow page tables as attempting to install an
@@ -4296,14 +4311,6 @@ static inline u8 kvm_max_level_for_order(int order)
return PG_LEVEL_4K;
}

-static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
-{
- kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
- PAGE_SIZE, fault->write, fault->exec,
- fault->is_private);
-}
-
static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
@@ -4376,12 +4383,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return -EFAULT;
}

+ if (unlikely(!slot))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();

- if (!slot)
- goto faultin_pfn;
-
/*
* Retry the page fault if the gfn hit a memslot that is being deleted
* or moved. This ensures any existing SPTEs for the old memslot will
@@ -4434,7 +4441,6 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
return RET_PF_RETRY;

-faultin_pfn:
ret = __kvm_faultin_pfn(vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;
@@ -4442,9 +4448,6 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);

- if (unlikely(!slot))
- return kvm_handle_noslot_fault(vcpu, fault, access);
-
/*
* Check again for a relevant mmu_notifier invalidation event purely to
* avoid contending mmu_lock. Most invalidations will be detected by
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..bd7d07e6c697 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -235,7 +235,10 @@ struct kvm_page_fault {
/* The memslot containing gfn. May be NULL. */
struct kvm_memory_slot *slot;

- /* Outputs of kvm_faultin_pfn. */
+ /*
+ * Outputs of kvm_faultin_pfn, guaranteed to be valid if and only if
+ * slot is non-NULL.
+ */
unsigned long mmu_seq;
kvm_pfn_t pfn;
hva_t hva;
--
2.43.0.687.g38aa6559b0-goog


2024-02-14 13:25:10

by Friedrich Weber

Subject: Re: [PATCH v4 0/4] KVM: x86/mmu: Pre-check for mmu_notifier retry

On 09/02/2024 23:28, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock, and potentially even without
> resolving a pfn, if the gfn is covered by an active invalidation. This
> avoids resource and lock contention, which can be especially beneficial
> for preemptible kernels as KVM can get stuck bouncing mmu_lock between a
> vCPU and the invalidation task the vCPU is waiting on to finish.
>
> v4:
> - Pre-check for retry before resolving the pfn, too. [Yan]
> - Add a patch to fix a private/shared vs. memslot validity check
> priority inversion bug.
> - Refactor kvm_faultin_pfn() to clean up the handling of noslot faults.

Can confirm that v4 also fixes the temporary guest hangs [1] I'm seeing
in combination with KSM and NUMA balancing:
* On 60eedcfc, the reproducer [1] triggers temporary hangs
* With the four patches applied on top of 60eedcfc, the reproducer does
not trigger hangs

Thanks a lot for looking into this!

[1]
https://lore.kernel.org/kvm/[email protected]/


2024-02-17 14:49:59

by Xu Yilun

Subject: Re: [PATCH v4 1/4] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> unsigned int access)
> {
> + struct kvm_memory_slot *slot = fault->slot;
> int ret;
>
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> smp_rmb();
>
> + /*
> + * Check for a relevant mmu_notifier invalidation event before getting
> + * the pfn from the primary MMU, and before acquiring mmu_lock.
> + *
> + * For mmu_lock, if there is an in-progress invalidation and the kernel
> + * allows preemption, the invalidation task may drop mmu_lock and yield
> + * in response to mmu_lock being contended, which is *very* counter-
> + * productive as this vCPU can't actually make forward progress until
> + * the invalidation completes.
> + *
> + * Retrying now can also avoid unnecessary lock contention in the primary
> + * MMU, as the primary MMU doesn't necessarily hold a single lock for
> + * the duration of the invalidation, i.e. faulting in a conflicting pfn
> + * can cause the invalidation to take longer by holding locks that are
> + * needed to complete the invalidation.
> + *
> + * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> + * will never yield mmu_lock in response to contention, as this vCPU is
> + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> + * to detect retry guarantees the worst case latency for the vCPU.
> + */
> + if (!slot &&

typo? if (slot &&

Thanks,
Yilun

> + mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> + return RET_PF_RETRY;
> +
> ret = __kvm_faultin_pfn(vcpu, fault);
> if (ret != RET_PF_CONTINUE)
> return ret;

2024-02-19 04:14:19

by Yan Zhao

Subject: Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> Move the checks related to the validity of an access to a memslot from the
> inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This
> allows emulating accesses to the APIC access page, which don't need to
> resolve a pfn, even if there is a relevant in-progress mmu_notifier
> invalidation. Ditto for accesses to KVM internal memslots from L2, which
> KVM also treats as emulated MMIO.
>
> More importantly, this will allow for future cleanup by having the
> "no memslot" case bail from kvm_faultin_pfn() very early on.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 62 ++++++++++++++++++++++--------------------
> 1 file changed, 33 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 50bfaa53f3f2..505fc7eef533 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4333,33 +4333,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> struct kvm_memory_slot *slot = fault->slot;
> bool async;
>
> - /*
> - * Retry the page fault if the gfn hit a memslot that is being deleted
> - * or moved. This ensures any existing SPTEs for the old memslot will
> - * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> - */
> - if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> - return RET_PF_RETRY;
> -
> - if (!kvm_is_visible_memslot(slot)) {
> - /* Don't expose private memslots to L2. */
> - if (is_guest_mode(vcpu)) {
> - fault->slot = NULL;
> - fault->pfn = KVM_PFN_NOSLOT;
> - fault->map_writable = false;
> - return RET_PF_CONTINUE;
> - }
> - /*
> - * If the APIC access page exists but is disabled, go directly
> - * to emulation without caching the MMIO access or creating a
> - * MMIO SPTE. That way the cache doesn't need to be purged
> - * when the AVIC is re-enabled.
> - */
> - if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> - !kvm_apicv_activated(vcpu->kvm))
> - return RET_PF_EMULATE;
> - }
> -
> if (fault->is_private)
> return kvm_faultin_pfn_private(vcpu, fault);
>
> @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> smp_rmb();
>
> + if (!slot)
> + goto faultin_pfn;
> +
> + /*
> + * Retry the page fault if the gfn hit a memslot that is being deleted
> + * or moved. This ensures any existing SPTEs for the old memslot will
> + * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> + */
> + if (slot->flags & KVM_MEMSLOT_INVALID)
> + return RET_PF_RETRY;
> +
> + if (!kvm_is_visible_memslot(slot)) {
> + /* Don't expose KVM's internal memslots to L2. */
> + if (is_guest_mode(vcpu)) {
> + fault->slot = NULL;
> + fault->pfn = KVM_PFN_NOSLOT;
> + fault->map_writable = false;
> + return RET_PF_CONTINUE;
Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?

> + }
> +
> + /*
> + * If the APIC access page exists but is disabled, go directly
> + * to emulation without caching the MMIO access or creating a
> + * MMIO SPTE. That way the cache doesn't need to be purged
> + * when the AVIC is re-enabled.
> + */
> + if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> + !kvm_apicv_activated(vcpu->kvm))
> + return RET_PF_EMULATE;
> + }
> +
> /*
> * Check for a relevant mmu_notifier invalidation event before getting
> * the pfn from the primary MMU, and before acquiring mmu_lock.
> @@ -4427,10 +4431,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> * to detect retry guarantees the worst case latency for the vCPU.
> */
> - if (!slot &&
> - mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> return RET_PF_RETRY;
>
> +faultin_pfn:
> ret = __kvm_faultin_pfn(vcpu, fault);
> if (ret != RET_PF_CONTINUE)
> return ret;
> --
> 2.43.0.687.g38aa6559b0-goog
>

2024-02-19 16:18:43

by Sean Christopherson

Subject: Re: [PATCH v4 1/4] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Sat, Feb 17, 2024, Xu Yilun wrote:
> > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > unsigned int access)
> > {
> > + struct kvm_memory_slot *slot = fault->slot;
> > int ret;
> >
> > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > smp_rmb();
> >
> > + /*
> > + * Check for a relevant mmu_notifier invalidation event before getting
> > + * the pfn from the primary MMU, and before acquiring mmu_lock.
> > + *
> > + * For mmu_lock, if there is an in-progress invalidation and the kernel
> > + * allows preemption, the invalidation task may drop mmu_lock and yield
> > + * in response to mmu_lock being contended, which is *very* counter-
> > + * productive as this vCPU can't actually make forward progress until
> > + * the invalidation completes.
> > + *
> > + * Retrying now can also avoid unnecessary lock contention in the primary
> > + * MMU, as the primary MMU doesn't necessarily hold a single lock for
> > + * the duration of the invalidation, i.e. faulting in a conflicting pfn
> > + * can cause the invalidation to take longer by holding locks that are
> > + * needed to complete the invalidation.
> > + *
> > + * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > + * will never yield mmu_lock in response to contention, as this vCPU is
> > + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > + * to detect retry guarantees the worst case latency for the vCPU.
> > + */
> > + if (!slot &&
>
> typo? if (slot &&

Ugh, and bad testing on my end.

2024-02-19 19:50:02

by Sean Christopherson

Subject: Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

+Jim

On Mon, Feb 19, 2024, Yan Zhao wrote:
> On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > smp_rmb();
> >
> > + if (!slot)
> > + goto faultin_pfn;
> > +
> > + /*
> > + * Retry the page fault if the gfn hit a memslot that is being deleted
> > + * or moved. This ensures any existing SPTEs for the old memslot will
> > + * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> > + */
> > + if (slot->flags & KVM_MEMSLOT_INVALID)
> > + return RET_PF_RETRY;
> > +
> > + if (!kvm_is_visible_memslot(slot)) {
> > + /* Don't expose KVM's internal memslots to L2. */
> > + if (is_guest_mode(vcpu)) {
> > + fault->slot = NULL;
> > + fault->pfn = KVM_PFN_NOSLOT;
> > + fault->map_writable = false;
> > + return RET_PF_CONTINUE;
> Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?

Oof. Yes. But there is a pre-existing bug here too, though it's very theoretical
and unlikely to ever cause problems.

If KVM is using TDP, but L1 is using shadow paging for L2, then routing through
kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an
MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode
ensures KVM uses different roots for L1 vs. L2. But mmio_gfn will remain valid,
and could (quite theoretically) cause KVM to incorrectly treat an L1 access to
the private TSS or identity mapped page tables as MMIO.
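
For reference, the caching in question is the vcpu_cache_mmio_info() call in
kvm_handle_noslot_fault(), visible as context in patch 4's diff (abridged):

        static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
                                           struct kvm_page_fault *fault,
                                           unsigned int access)
        {
                gva_t gva = fault->is_tdp ? 0 : fault->addr;

                /* Fills arch.mmio_gfn, which then lingers for L1. */
                vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
                                     access & shadow_mmio_access_mask);
                ...
        }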

Furthermore, this check doesn't actually prevent exposing KVM's internal memslots
to L2, it simply forces KVM to emulate the access. In most cases, that will trigger
MMIO, amusingly due to filling arch.mmio_gfn. And vcpu_is_mmio_gpa() always
treats APIC accesses as MMIO, so those are fine. But the private TSS and identity
mapped page tables could go either way (MMIO or access the private memslot's backing
memory).

We could "fix" the issue by forcing MMIO emulation for L2 access to all internal
memslots, not just to the APIC. But I think that's actually less correct than
letting L2 access the private TSS and identity mapped page tables (not to mention
that I can't imagine anyone cares what KVM does in this case). From L1's perspective,
there is R/W memory at those memslots, the memory just happens to be initialized
with non-zero data, and I don't see a good argument for hiding that memory from L2.
Making the memory disappear is far more magical than the memory existing in the
first place.

The APIC access page is special because KVM _must_ emulate the access to do the
right thing. And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private
memslots to L2") said, it's not just when L1 is accelerating L2's virtual APIC,
it's just as important (likely *more* important) for correctness when L1 is passing
through its own APIC to L2.

Unless I'm missing something, I think it makes sense to throw in the below before
moving code around.

Ouch, and looking at this further, patch 1 introduced a bug (technically) by caching
fault->slot; in this case KVM will unnecessarily check mmu_notifiers. That's
obviously a very benign bug, as a false positive just means an unnecessary retry,
but yikes.

--
Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
non-APIC internal slots

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 488f522f09c6..4ce824cec5b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
return RET_PF_RETRY;

- if (!kvm_is_visible_memslot(slot)) {
- /* Don't expose private memslots to L2. */
+ if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
+ /*
+ * Don't map L1's APIC access page into L2, KVM doesn't support
+ * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
+ * i.e. the access needs to be emulated. Emulating access to
+ * L1's APIC is also correct if L1 is accelerating L2's own
+ * virtual APIC, but for some reason L1 also maps _L1's_ APIC
+ * into L2. Note, vcpu_is_mmio_gpa() always treats access to
+ * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
+ * uses different roots for L1 vs. L2, i.e. there is no danger
+ * of breaking APICv/AVIC for L1.
+ */
if (is_guest_mode(vcpu)) {
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
@@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* MMIO SPTE. That way the cache doesn't need to be purged
* when the AVIC is re-enabled.
*/
- if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
- !kvm_apicv_activated(vcpu->kvm))
+ if (!kvm_apicv_activated(vcpu->kvm))
return RET_PF_EMULATE;
}


base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
--

2024-02-20 07:30:00

by Yan Zhao

Subject: Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> +Jim
>
> On Mon, Feb 19, 2024, Yan Zhao wrote:
> > On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > > @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > smp_rmb();
> > >
> > > + if (!slot)
> > > + goto faultin_pfn;
> > > +
> > > + /*
> > > + * Retry the page fault if the gfn hit a memslot that is being deleted
> > > + * or moved. This ensures any existing SPTEs for the old memslot will
> > > + * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> > > + */
> > > + if (slot->flags & KVM_MEMSLOT_INVALID)
> > > + return RET_PF_RETRY;
> > > +
> > > + if (!kvm_is_visible_memslot(slot)) {
> > > + /* Don't expose KVM's internal memslots to L2. */
> > > + if (is_guest_mode(vcpu)) {
> > > + fault->slot = NULL;
> > > + fault->pfn = KVM_PFN_NOSLOT;
> > > + fault->map_writable = false;
> > > + return RET_PF_CONTINUE;
> > Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?
>
> Oof. Yes. But there is a pre-existing bug here too, though it's very theoretical
> and unlikely to ever cause problems.
>
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing through
> kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an
> MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode
> ensure KVM uses different roots for L1 vs. L2. But mmio_gfn will remain valid,
> and could (quite theoretically) cause KVM to incorrectly treat an L1 access to
> the private TSS or identity mapped page tables as MMIO.
Why would KVM treat L1 access to the private TSS and identity mapped page
tables as MMIO even though mmio_gfn is valid?
It looks like (on Intel platforms) the EPT for L1 will only install normal SPTEs
(non-MMIO SPTEs) for the two private slots, so there would be no EPT
misconfiguration and KVM would not incorrectly go down the emulation path.
Am I missing something?

> Furthermore, this check doesn't actually prevent exposing KVM's internal memslots
> to L2, it simply forces KVM to emulate the access. In most cases, that will trigger
> MMIO, amusingly due to filling arch.mmio_gfn. And vcpu_is_mmio_gpa() always
> treats APIC accesses as MMIO, so those are fine. But the private TSS and identity
> mapped page tables could go either way (MMIO or access the private memslot's backing
> memory).
Yes, this was also my question when reading that part of the code.
An mmio_gen mismatch may lead to accessing the backing memory directly.

> We could "fix" the issue by forcing MMIO emulation for L2 access to all internal
> memslots, not just to the APIC. But I think that's actually less correct than
> letting L2 access the private TSS and indentity mapped page tables (not to mention
> that I can't imagine anyone cares what KVM does in this case). From L1's perspective,
> there is R/W memory at those memslot, the memory just happens to be initialized
> with non-zero data, and I don't see a good argument for hiding that memory from L2.
> Making the memory disappear is far more magical than the memory existing in the
> first place.
>
> The APIC access page is special because KVM _must_ emulate the access to do the
> right thing. And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private
> memslots to L2") said, it's not just when L1 is accelerating L2's virtual APIC,
> it's just as important (likely *more* imporant* for correctness when L1 is passing
> through its own APIC to L2.
>
> Unless I'm missing someting, I think it makes sense to throw in the below before
> moving code around.
>
> Ouch, and looking at this further, patch 1 introduced a bug (technically) by caching
> fault->slot; in this case KVM will unnecessarily check mmu_notifiers. That's
> obviously a very benign bug, as a false positive just means an unnecessary retry,
> but yikes.
>
Patches 3 & 4 removed the bug immediately :)

> --
> Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
> non-APIC internal slots
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 488f522f09c6..4ce824cec5b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> return RET_PF_RETRY;
>
> - if (!kvm_is_visible_memslot(slot)) {
> - /* Don't expose private memslots to L2. */
> + if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> + /*
> + * Don't map L1's APIC access page into L2, KVM doesn't support
> + * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
> + * i.e. the access needs to be emulated. Emulating access to
> + * L1's APIC is also correct if L1 is accelerating L2's own
> + * virtual APIC, but for some reason L1 also maps _L1's_ APIC
> + * into L2. Note, vcpu_is_mmio_gpa() always treats access to
> + * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
> + * uses different roots for L1 vs. L2, i.e. there is no danger
> + * of breaking APICv/AVIC for L1.
> + */
> if (is_guest_mode(vcpu)) {
> fault->slot = NULL;
> fault->pfn = KVM_PFN_NOSLOT;
Checking fault->is_private before calling kvm_handle_noslot_fault()?
And do we need a centralized check of fault->is_private in kvm_mmu_do_page_fault()
before returning RET_PF_EMULATE?

> @@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * MMIO SPTE. That way the cache doesn't need to be purged
> * when the AVIC is re-enabled.
> */
> - if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> - !kvm_apicv_activated(vcpu->kvm))
> + if (!kvm_apicv_activated(vcpu->kvm))
> return RET_PF_EMULATE;
Otherwise, doesn't this also need a check of fault->is_private?
Maybe also where RET_PF_EMULATE is returned when page_fault_handle_page_track()
is true (though I know it's always false for TDX).

> }
>
>
> base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
> --

2024-02-21 02:17:04

by Sean Christopherson

Subject: Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

On Tue, Feb 20, 2024, Yan Zhao wrote:
> On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> > If KVM is using TDP, but L1 is using shadow paging for L2, then routing through
> > kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an
> > MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode
> > ensure KVM uses different roots for L1 vs. L2. But mmio_gfn will remain valid,
> > and could (quite theoretically) cause KVM to incorrectly treat an L1 access to
> > the private TSS or identity mapped page tables as MMIO.
> Why would KVM treat L1 access to the private TSS and identity mapped page
> tables as MMIO even though mmio_gfn is valid?

Because KVM doesn't need to take an EPT Violation or Misconfig to trigger emulation,
those just happen to be (by far) the most common ways KVM gets into the emulator
on modern CPUs.

> It looks like (on Intel platforms) the EPT for L1 will only install normal SPTEs
> (non-MMIO SPTEs) for the two private slots, so there would be no EPT
> misconfiguration and KVM would not incorrectly go down the emulation path.
> Am I missing something?

..

> > --
> > Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
> > non-APIC internal slots
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
> > 1 file changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 488f522f09c6..4ce824cec5b9 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> > return RET_PF_RETRY;
> >
> > - if (!kvm_is_visible_memslot(slot)) {
> > - /* Don't expose private memslots to L2. */
> > + if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> > + /*
> > + * Don't map L1's APIC access page into L2, KVM doesn't support
> > + * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
> > + * i.e. the access needs to be emulated. Emulating access to
> > + * L1's APIC is also correct if L1 is accelerating L2's own
> > + * virtual APIC, but for some reason L1 also maps _L1's_ APIC
> > + * into L2. Note, vcpu_is_mmio_gpa() always treats access to
> > + * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
> > + * uses different roots for L1 vs. L2, i.e. there is no danger
> > + * of breaking APICv/AVIC for L1.
> > + */
> > if (is_guest_mode(vcpu)) {
> > fault->slot = NULL;
> > fault->pfn = KVM_PFN_NOSLOT;
> Checking fault->is_private before calling kvm_handle_noslot_fault()?

Ya, the actual series will perform that check, this slots in halfway through.

> And do we need a centralized check of fault->is_private in kvm_mmu_do_page_fault()
> before returning RET_PF_EMULATE?

Oof, yes.

> > @@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > * MMIO SPTE. That way the cache doesn't need to be purged
> > * when the AVIC is re-enabled.
> > */
> > - if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> > - !kvm_apicv_activated(vcpu->kvm))
> > + if (!kvm_apicv_activated(vcpu->kvm))
> > return RET_PF_EMULATE;
> Otherwise, here also needs a checking of fault->is_private?
> Maybe also for where RET_PF_EMULATE is returned when page_fault_handle_page_track()
> is true (though I know it's always false for TDX).

Ya, and practically speaking it should always be false for functional setups
(software-protected VMs don't yet play nice with shadow paging or any form of
emulation), but it's easy enough to guard against RET_PF_EMULATE in
kvm_mmu_do_page_fault().

I'm going to post _just_ patch 1 as v5 so that it can land in 6.8 (assuming I
don't screw it up again).

I'll post a separate series to tackle the refactor and is_private cleanups and
fixes as that has ballooned to 17 patches :-/