Fix three fatal TDP MMU bugs introduced in 6.2, harden related code,
and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.
Sean Christopherson (5):
KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
frozen
KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
disallowed
KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
arch/x86/kvm/mmu/mmu.c | 9 +++++++-
arch/x86/kvm/mmu/mmu_internal.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++------------------
3 files changed, 26 insertions(+), 23 deletions(-)
base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0
--
2.39.0.rc1.256.g54fd8350bd-goog
Hoist the is_removed_spte() check above the "level == goal_level" check
when walking SPTEs during a TDP MMU page fault to avoid attempting to map
a leaf entry if said entry is frozen by a different task/vCPU.
------------[ cut here ]------------
WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0
Modules linked in: kvm_intel
CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
FS: 00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 764f7c87286f..b740f38fedcc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1162,9 +1162,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (fault->nx_huge_page_workaround_enabled)
disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
- if (iter.level == fault->goal_level)
- break;
-
/*
* If SPTE has been frozen by another thread, just give up and
* retry, avoiding unnecessary page table allocation and free.
@@ -1172,6 +1169,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_removed_spte(iter.old_spte))
goto retry;
+ if (iter.level == fault->goal_level)
+ break;
+
/* Step down into the lower level page table if it exists. */
if (is_shadow_present_pte(iter.old_spte) &&
!is_large_pte(iter.old_spte))
--
2.39.0.rc1.256.g54fd8350bd-goog
Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b740f38fedcc..e2e197d41780 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1170,7 +1170,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
goto retry;
if (iter.level == fault->goal_level)
- break;
+ goto map_target_level;
/* Step down into the lower level page table if it exists. */
if (is_shadow_present_pte(iter.old_spte) &&
@@ -1192,8 +1192,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
/*
- * Also force the guest to retry the access if the upper level SPTEs
- * aren't in place.
+ * Force the guest to retry if installing an upper level SPTE
+ * failed, e.g. because a different task modified the SPTE.
*/
if (r) {
tdp_mmu_free_sp(sp);
@@ -1208,6 +1208,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
}
+ /*
+ * The walk aborted before reaching the target level, e.g. because the
+ * iterator detected an upper level SPTE was frozen during traversal.
+ */
+ WARN_ON_ONCE(iter.level == fault->goal_level);
+ goto retry;
+
+map_target_level:
ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
retry:
--
2.39.0.rc1.256.g54fd8350bd-goog
Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
when adding a new shadow page in the TDP MMU. To ensure the NX reclaim
kthread can't see a not-yet-linked shadow page, the page fault path links
the new page table prior to adding the page to possible_nx_huge_pages.
If the page is zapped by different task, e.g. because dirty logging is
disabled, between linking the page and adding it to the list, KVM can end
up triggering use-after-free by adding the zapped SP to the aforementioned
list, as the zapped SP's memory is scheduled for removal via RCU callback.
The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
i.e. the below splat is just one possible signature.
------------[ cut here ]------------
list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
Modules linked in: kvm_intel
CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #71
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__list_add_valid+0x79/0xa0
RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
Call Trace:
<TASK>
track_possible_nx_huge_page+0x53/0x80
kvm_tdp_mmu_map+0x242/0x2c0
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
Reported-by: Greg Thelen <[email protected]>
Analyzed-by: David Matlack <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: Mingwei Zhang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e2e197d41780..fd4ae99790d7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (fault->huge_page_disallowed &&
fault->req_level >= iter.level) {
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- track_possible_nx_huge_page(kvm, sp);
+ if (sp->nx_huge_page_disallowed)
+ track_possible_nx_huge_page(kvm, sp);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
}
--
2.39.0.rc1.256.g54fd8350bd-goog
Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
match the target level of the fault, and instead have the vCPU retry the
faulting instruction after warning. Continuing on is completely
unnecessary as the absolute worst case scenario of retrying is DoSing
the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fd4ae99790d7..cc1fb9a65620 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1063,7 +1063,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
int ret = RET_PF_FIXED;
bool wrprot = false;
- WARN_ON(sp->role.level != fault->goal_level);
+ if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
+ return RET_PF_RETRY;
+
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
--
2.39.0.rc1.256.g54fd8350bd-goog
Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
eliminate the gotos used to bounce through rcu_read_unlock() when bailing
from the walk.
Opportunistically mark kvm_mmu_hugepage_adjust() as static as
kvm_tdp_mmu_map() was the only external user.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 9 ++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
3 files changed, 12 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..99c40617d325 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
}
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
kvm_pfn_t mask;
@@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
if (is_page_fault_stale(vcpu, fault))
goto out_unlock;
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ rcu_read_lock();
r = kvm_tdp_mmu_map(vcpu, fault);
+ rcu_read_unlock();
out_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..66c294d67641 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -317,7 +317,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
int max_level);
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc1fb9a65620..78f47eb74544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1150,13 +1150,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
- int ret = RET_PF_RETRY;
-
- kvm_mmu_hugepage_adjust(vcpu, fault);
-
- trace_kvm_mmu_spte_requested(fault);
-
- rcu_read_lock();
tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
int r;
@@ -1169,10 +1162,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* retry, avoiding unnecessary page table allocation and free.
*/
if (is_removed_spte(iter.old_spte))
- goto retry;
+ return RET_PF_RETRY;
if (iter.level == fault->goal_level)
- goto map_target_level;
+ return tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
/* Step down into the lower level page table if it exists. */
if (is_shadow_present_pte(iter.old_spte) &&
@@ -1199,7 +1192,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
*/
if (r) {
tdp_mmu_free_sp(sp);
- goto retry;
+ return RET_PF_RETRY;
}
if (fault->huge_page_disallowed &&
@@ -1216,14 +1209,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* iterator detected an upper level SPTE was frozen during traversal.
*/
WARN_ON_ONCE(iter.level == fault->goal_level);
- goto retry;
-
-map_target_level:
- ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
-
-retry:
- rcu_read_unlock();
- return ret;
+ return RET_PF_RETRY;
}
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
--
2.39.0.rc1.256.g54fd8350bd-goog
On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <[email protected]> wrote:
>
> Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> match the target level of the fault, and instead have the vCPU retry the
> faulting instruction after warning. Continuing on is completely
> unnecessary as the absolute worst case scenario of retrying is DoSing
> the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
Would it make sense to kill the VM instead via KVM_BUG()?
On Tue, Dec 13, 2022, David Matlack wrote:
> On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <[email protected]> wrote:
> >
> > Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> > match the target level of the fault, and instead have the vCPU retry the
> > faulting instruction after warning. Continuing on is completely
> > unnecessary as the absolute worst case scenario of retrying is DoSing
> > the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
>
> Would it make sense to kill the VM instead via KVM_BUG()?
No, because if bug that hits this escapes to a release, odds are quite high that
retrying will succeed. E.g. the fix earlier in this series is for a rare corner
case that I was able to hit consistently only by hacking KVM to effectively
synchronize the page fault and zap. Other than an extra page fault, no harm has
been done to the guest, e.g. there's no need to kill the VM to protect it from
data corruption.
On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Hoist the is_removed_spte() check above the "level == goal_level"
> check
> when walking SPTEs during a TDP MMU page fault to avoid attempting to
> map
> a leaf entry if said entry is frozen by a different task/vCPU.
>
> ------------[ cut here ]------------
> WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653
> kvm_tdp_mmu_map+0x269/0x4b0
> Modules linked in: kvm_intel
> CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0
> 02/06/2015
> RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
> RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
> RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
> RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
> RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
> R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
> FS: 00007f01c942c740(0000) GS:ffff888277cc0000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
> Call Trace:
> <TASK>
> kvm_tdp_page_fault+0x10c/0x130
> kvm_mmu_page_fault+0x103/0x680
> vmx_handle_exit+0x132/0x5a0 [kvm_intel]
> vcpu_enter_guest+0x60c/0x16f0
> kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
> kvm_vcpu_ioctl+0x271/0x660
> __x64_sys_ioctl+0x80/0xb0
> do_syscall_64+0x2b/0x50
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow
> when guest has to retry")
> Cc: Robert Hoo <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..b740f38fedcc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1162,9 +1162,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> if (fault->nx_huge_page_workaround_enabled)
> disallowed_hugepage_adjust(fault,
> iter.old_spte, iter.level);
>
> - if (iter.level == fault->goal_level)
> - break;
> -
> /*
> * If SPTE has been frozen by another thread, just give
> up and
> * retry, avoiding unnecessary page table allocation
> and free.
> @@ -1172,6 +1169,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> if (is_removed_spte(iter.old_spte))
> goto retry;
>
> + if (iter.level == fault->goal_level)
> + break;
> +
> /* Step down into the lower level page table if it
> exists. */
> if (is_shadow_present_pte(iter.old_spte) &&
> !is_large_pte(iter.old_spte))
Reviewed-by: Robert Hoo <[email protected]>
On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock
> spinlock
> when adding a new shadow page in the TDP MMU. To ensure the NX
> reclaim
> kthread can't see a not-yet-linked shadow page, the page fault path
> links
> the new page table prior to adding the page to
> possible_nx_huge_pages.
>
> If the page is zapped by different task, e.g. because dirty logging
> is
> disabled, between linking the page and adding it to the list, KVM can
> end
> up triggering use-after-free by adding the zapped SP to the
> aforementioned
> list, as the zapped SP's memory is scheduled for removal via RCU
> callback.
> The bug is detected by the sanity checks guarded by
> CONFIG_DEBUG_LIST=y,
> i.e. the below splat is just one possible signature.
>
> ------------[ cut here ]------------
> list_add corruption. prev->next should be next (ffffc9000071fa70),
> but was ffff88811125ee38. (prev=ffff88811125ee38).
> WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30
> __list_add_valid+0x79/0xa0
> Modules linked in: kvm_intel
> CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted:
> G W 6.1.0-rc4+ #71
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0
> 02/06/2015
> RIP: 0010:__list_add_valid+0x79/0xa0
> RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
> RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
> RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
> R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
> R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
> FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
> Call Trace:
> <TASK>
> track_possible_nx_huge_page+0x53/0x80
> kvm_tdp_mmu_map+0x242/0x2c0
> kvm_tdp_page_fault+0x10c/0x130
> kvm_mmu_page_fault+0x103/0x680
> vmx_handle_exit+0x132/0x5a0 [kvm_intel]
> vcpu_enter_guest+0x60c/0x16f0
> kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
> kvm_vcpu_ioctl+0x271/0x660
> __x64_sys_ioctl+0x80/0xb0
> do_syscall_64+0x2b/0x50
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in
> TDP MMU before setting SPTE")
> Reported-by: Greg Thelen <[email protected]>
> Analyzed-by: David Matlack <[email protected]>
> Cc: David Matlack <[email protected]>
> Cc: Ben Gardon <[email protected]>
> Cc: Mingwei Zhang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e2e197d41780..fd4ae99790d7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> if (fault->huge_page_disallowed &&
> fault->req_level >= iter.level) {
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - track_possible_nx_huge_page(kvm, sp);
> + if (sp->nx_huge_page_disallowed)
> + track_possible_nx_huge_page(kvm, sp);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> }
> }
Is this possible?
The aforementioned situation happened, i.e. before above hunk
track_possible_nx_huge_page(), the sp is zapped by some other task,
tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):
--kvm->stat.nx_lpage_splits;
But looks like the stat for this sp hasn't been increased yet.
On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Fix three fatal TDP MMU bugs introduced in 6.2,
introduced in 6.1? or earlier?
> harden related code,
> and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.
>
> Sean Christopherson (5):
> KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
> frozen
> KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
> KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
> disallowed
> KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
> KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its
> caller
>
> arch/x86/kvm/mmu/mmu.c | 9 +++++++-
> arch/x86/kvm/mmu/mmu_internal.h | 1 -
> arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++--------------
> ----
> 3 files changed, 26 insertions(+), 23 deletions(-)
>
>
> base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0
I cannot find this base commit in my tree, where I just pulled to
latest queue yesterday. But find this series can be applied to this
latest queue as well.
commit 9d75a3251adfbcf444681474511b58042a364863 (origin/queue, queue)
Author: Sean Christopherson <[email protected]>
Date: Wed Dec 7 00:09:59 2022 +0000
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
On Wed, Dec 14, 2022, Robert Hoo wrote:
> On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > Fix three fatal TDP MMU bugs introduced in 6.2,
>
> introduced in 6.1? or earlier?
6.2, or more precisely, code sitting in kvm/next that will hopefully become part
of 6.2-rc1.
> > harden related code,
> > and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.
> >
> > Sean Christopherson (5):
> > KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
> > frozen
> > KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
> > KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
> > disallowed
> > KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
> > KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its
> > caller
> >
> > arch/x86/kvm/mmu/mmu.c | 9 +++++++-
> > arch/x86/kvm/mmu/mmu_internal.h | 1 -
> > arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++--------------
> > ----
> > 3 files changed, 26 insertions(+), 23 deletions(-)
> >
> >
> > base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0
>
> I cannot find this base commit in my tree, where I just pulled to
> latest queue yesterday. But find this series can be applied to this
> latest queue as well.
Ya, I have an extra commit in my local repo sitting on top of kvm/queue so that
my standard builds don't fail.
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index cc3e8c7d0850..2c7f2a26421e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -898,6 +898,7 @@ bool kvm_hv_assist_page_enabled(struct kvm_vcpu *vcpu)
return false;
return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
}
+EXPORT_SYMBOL_GPL(kvm_hv_assist_page_enabled);
int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu)
{
On Wed, Dec 14, 2022, Robert Hoo wrote:
> On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index e2e197d41780..fd4ae99790d7 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> > struct kvm_page_fault *fault)
> > if (fault->huge_page_disallowed &&
> > fault->req_level >= iter.level) {
> > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > - track_possible_nx_huge_page(kvm, sp);
> > + if (sp->nx_huge_page_disallowed)
> > + track_possible_nx_huge_page(kvm, sp);
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > }
> > }
>
> Is this possible?
> The aforementioned situation happened, i.e. before above hunk
> track_possible_nx_huge_page(), the sp is zapped by some other task,
> tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):
It's possible for untrack_possible_nx_huge_page() to be called before the above
snippet, but the stat won't be decremented in that case since the page won't be on
the list of possible NX huge pages, i.e. list_empty() will be true.
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (list_empty(&sp->possible_nx_huge_page_link))
return;
--kvm->stat.nx_lpage_splits;
And by not calling track_possible_nx_huge_page() (this bug fix), nx_lpage_splits
won't be incorrectly incremented.
>
> --kvm->stat.nx_lpage_splits;
>
> But looks like the stat for this sp hasn't been increased yet.
>
On Thu, 2022-12-15 at 00:11 +0000, Sean Christopherson wrote:
> On Wed, Dec 14, 2022, Robert Hoo wrote:
> > On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > > ---
> > > arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c
> > > b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index e2e197d41780..fd4ae99790d7 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> > > struct kvm_page_fault *fault)
> > > if (fault->huge_page_disallowed &&
> > > fault->req_level >= iter.level) {
> > > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > - track_possible_nx_huge_page(kvm, sp);
> > > + if (sp->nx_huge_page_disallowed)
> > > + track_possible_nx_huge_page(kvm, sp);
> > > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > }
> > > }
> >
> > Is this possible?
> > The aforementioned situation happened, i.e. before above hunk
> > track_possible_nx_huge_page(), the sp is zapped by some other task,
> > tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):
>
> It's possible for untrack_possible_nx_huge_page() to be called before
> the above
> snippet, but the stat won't be decremented in that case since the
> page won't be on
> the list of possible NX huge pages, i.e. list_empty() will be true.
Right, I was fooled by the name of list_empty(), it's actually
list_node_empty(). Thanks for explaining.
>
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct
> kvm_mmu_page *sp)
> {
> if (list_empty(&sp->possible_nx_huge_page_link))
> return;
>
> --kvm->stat.nx_lpage_splits;
>
> And by not calling track_possible_nx_huge_page() (this bug fix),
> nx_lpage_splits
> won't be incorrectly incremented.
>
> >
> > --kvm->stat.nx_lpage_splits;
> >
> > But looks like the stat for this sp hasn't been increased yet.
> >
On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> from the walk.
>
> Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> kvm_tdp_mmu_map() was the only external user.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 9 ++++++++-
> arch/x86/kvm/mmu/mmu_internal.h | 1 -
> arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
> 3 files changed, 12 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..99c40617d325 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> return min(host_level, max_level);
> }
>
> -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> kvm_pfn_t mask;
> @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> if (is_page_fault_stale(vcpu, fault))
> goto out_unlock;
>
> + kvm_mmu_hugepage_adjust(vcpu, fault);
Can you also move the call to kvm_mmu_hugepage_adjust() from
direct_map() to direct_page_fault()? I do think it's worth the
maintenence burden to keep those functions consistent.
> +
> + trace_kvm_mmu_spte_requested(fault);
> +
> + rcu_read_lock();
> r = kvm_tdp_mmu_map(vcpu, fault);
> + rcu_read_unlock();
I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
MMU details that bleed into mmu.c (RCU) and for consistency with other
TDP MMU APIs that don't require the caller to acquire RCU. This will
also be helpful for the Common MMU, as the tracepoint and RCU will be
common.
e.g.
static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
...
}
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
int r;
trace_kvm_mmu_spte_requested(fault);
rcu_read_lock();
r = __kvm_tdp_mmu_map(vcpu, fault);
rcu_read_unlock();
return r;
}
On Tue, Dec 13, 2022 at 06:15:56PM +0000, Sean Christopherson wrote:
> On Tue, Dec 13, 2022, David Matlack wrote:
> > On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> > > match the target level of the fault, and instead have the vCPU retry the
> > > faulting instruction after warning. Continuing on is completely
> > > unnecessary as the absolute worst case scenario of retrying is DoSing
> > > the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
> >
> > Would it make sense to kill the VM instead via KVM_BUG()?
>
> No, because if bug that hits this escapes to a release, odds are quite high that
> retrying will succeed. E.g. the fix earlier in this series is for a rare corner
> case that I was able to hit consistently only by hacking KVM to effectively
> synchronize the page fault and zap. Other than an extra page fault, no harm has
> been done to the guest, e.g. there's no need to kill the VM to protect it from
> data corruption.
Good points, agreed!
On Tue, Dec 20, 2022, David Matlack wrote:
> On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> > Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> > kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> > eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> > from the walk.
> >
> > Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> > kvm_tdp_mmu_map() was the only external user.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 9 ++++++++-
> > arch/x86/kvm/mmu/mmu_internal.h | 1 -
> > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
> > 3 files changed, 12 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..99c40617d325 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > return min(host_level, max_level);
> > }
> >
> > -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > kvm_pfn_t mask;
> > @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > if (is_page_fault_stale(vcpu, fault))
> > goto out_unlock;
> >
> > + kvm_mmu_hugepage_adjust(vcpu, fault);
>
> Can you also move the call to kvm_mmu_hugepage_adjust() from
> direct_map() to direct_page_fault()? I do think it's worth the
> maintenence burden to keep those functions consistent.
Sure.
> > + trace_kvm_mmu_spte_requested(fault);
> > +
> > + rcu_read_lock();
> > r = kvm_tdp_mmu_map(vcpu, fault);
> > + rcu_read_unlock();
>
> I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
> MMU details that bleed into mmu.c (RCU) and for consistency with other
> TDP MMU APIs that don't require the caller to acquire RCU. This will
> also be helpful for the Common MMU, as the tracepoint and RCU will be
> common.
>
> e.g.
>
> static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> ...
> }
>
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> int r;
>
> trace_kvm_mmu_spte_requested(fault);
>
> rcu_read_lock();
> r = __kvm_tdp_mmu_map(vcpu, fault);
> rcu_read_unlock();
>
> return r;
> }
I did that originally, but it felt really silly to have the trivial wrapper, especially
because mmu.c already has TDP MMU details, e.g. kvm_tdp_mmu_page_fault() takes mmu_lock
for read and other flows acquire rcu_read_lock() to protected the TDP MMU.
What about the below (split into multiple patches) instead? kvm_tdp_mmu_page_fault()
really should live in tdp_mmu.c, the only reason it's in mmu.c is to get at various
helpers, e.g. fast_page_fault() and kvm_faultin_pfn().
Or is that doomed to fail because the TDP MMU will want to add code before
kvm_faultin_pfn() (I can't remember what motivated splitting out kvm_tdp_mmu_page_fault()
in the first place).
---
arch/x86/kvm/mmu/mmu.c | 132 ++++++++------------------------
arch/x86/kvm/mmu/mmu_internal.h | 50 ++++++++++++
arch/x86/kvm/mmu/spte.h | 7 --
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 8 +-
5 files changed, 108 insertions(+), 130 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..8203b1dd2753 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1927,16 +1927,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
return true;
}
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- if (sp->role.invalid)
- return true;
-
- /* TDP MMU pages do not use the MMU generation. */
- return !is_tdp_mmu_page(sp) &&
- unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
struct mmu_page_path {
struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
unsigned int idx[PT64_ROOT_MAX_LEVEL];
@@ -3148,9 +3138,6 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
int ret;
gfn_t base_gfn = fault->gfn;
- kvm_mmu_hugepage_adjust(vcpu, fault);
-
- trace_kvm_mmu_spte_requested(fault);
for_each_shadow_entry(vcpu, fault->addr, it) {
/*
* We cannot overwrite existing page tables with an NX
@@ -4270,54 +4257,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return RET_PF_CONTINUE;
}
-/*
- * Returns true if the page fault is stale and needs to be retried, i.e. if the
- * root was invalidated by a memslot update or a relevant mmu_notifier fired.
- */
-static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
+static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
+ int r = RET_PF_RETRY;
- /* Special roots, e.g. pae_root, are not backed by shadow pages. */
- if (sp && is_obsolete_sp(vcpu->kvm, sp))
- return true;
-
- /*
- * Roots without an associated shadow page are considered invalid if
- * there is a pending request to free obsolete roots. The request is
- * only a hint that the current root _may_ be obsolete and needs to be
- * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
- * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
- * to reload even if no vCPU is actively using the root.
- */
- if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
- return true;
-
- return fault->slot &&
- mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
-}
-
-static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
- int r;
-
- if (page_fault_handle_page_track(vcpu, fault))
- return RET_PF_EMULATE;
-
- r = fast_page_fault(vcpu, fault);
- if (r != RET_PF_INVALID)
- return r;
-
- r = mmu_topup_memory_caches(vcpu, false);
- if (r)
- return r;
-
- r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
- if (r != RET_PF_CONTINUE)
- return r;
-
- r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
if (is_page_fault_stale(vcpu, fault))
@@ -4327,6 +4270,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r)
goto out_unlock;
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
r = direct_map(vcpu, fault);
out_unlock:
@@ -4335,6 +4282,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return r;
}
+static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ int r;
+
+ if (page_fault_handle_page_track(vcpu, fault))
+ return RET_PF_EMULATE;
+
+ r = fast_page_fault(vcpu, fault);
+ if (r != RET_PF_INVALID)
+ return r;
+
+ r = mmu_topup_memory_caches(vcpu, false);
+ if (r)
+ return r;
+
+ r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
+ if (r != RET_PF_CONTINUE)
+ return r;
+
+#ifdef CONFIG_X86_64
+ if (tdp_mmu_enabled)
+ return kvm_tdp_mmu_page_fault(vcpu, fault);
+#endif
+ return __direct_page_fault(vcpu, fault);
+}
+
static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
@@ -4378,42 +4351,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
}
EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
-#ifdef CONFIG_X86_64
-static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
-{
- int r;
-
- if (page_fault_handle_page_track(vcpu, fault))
- return RET_PF_EMULATE;
-
- r = fast_page_fault(vcpu, fault);
- if (r != RET_PF_INVALID)
- return r;
-
- r = mmu_topup_memory_caches(vcpu, false);
- if (r)
- return r;
-
- r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
- if (r != RET_PF_CONTINUE)
- return r;
-
- r = RET_PF_RETRY;
- read_lock(&vcpu->kvm->mmu_lock);
-
- if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
-
- r = kvm_tdp_mmu_map(vcpu, fault);
-
-out_unlock:
- read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
- return r;
-}
-#endif
-
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
/*
@@ -4438,11 +4375,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
}
-#ifdef CONFIG_X86_64
- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_page_fault(vcpu, fault);
-#endif
-
return direct_page_fault(vcpu, fault);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..2c7c2b49f719 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -133,6 +133,28 @@ struct kvm_mmu_page {
extern struct kmem_cache *mmu_page_header_cache;
+static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
+{
+ struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+
+ return (struct kvm_mmu_page *)page_private(page);
+}
+
+static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
+{
+ return IS_ENABLED(CONFIG_X86_64) && sp->tdp_mmu_page;
+}
+
+static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ if (sp->role.invalid)
+ return true;
+
+ /* TDP MMU pages do not use the MMU generation. */
+ return !is_tdp_mmu_page(sp) &&
+ unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
{
return role.smm ? 1 : 0;
@@ -314,6 +336,34 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return r;
}
+/*
+ * Returns true if the page fault is stale and needs to be retried, i.e. if the
+ * root was invalidated by a memslot update or a relevant mmu_notifier fired.
+ */
+static inline bool is_page_fault_stale(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
+
+ /* Special roots, e.g. pae_root, are not backed by shadow pages. */
+ if (sp && is_obsolete_sp(vcpu->kvm, sp))
+ return true;
+
+ /*
+ * Roots without an associated shadow page are considered invalid if
+ * there is a pending request to free obsolete roots. The request is
+ * only a hint that the current root _may_ be obsolete and needs to be
+ * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
+ * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
+ * to reload even if no vCPU is actively using the root.
+ */
+ if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+ return true;
+
+ return fault->slot &&
+ mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+}
+
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
int max_level);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1f03701b943a..23e8f8c152b5 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -219,13 +219,6 @@ static inline int spte_index(u64 *sptep)
*/
extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
-static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
-{
- struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
-
- return (struct kvm_mmu_page *)page_private(page);
-}
-
static inline struct kvm_mmu_page *spte_to_child_sp(u64 spte)
{
return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc1fb9a65620..4bb58c0f465b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1144,19 +1144,12 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
* page tables and SPTEs to translate the faulting guest physical address.
*/
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
- int ret = RET_PF_RETRY;
-
- kvm_mmu_hugepage_adjust(vcpu, fault);
-
- trace_kvm_mmu_spte_requested(fault);
-
- rcu_read_lock();
tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
int r;
@@ -1169,10 +1162,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* retry, avoiding unnecessary page table allocation and free.
*/
if (is_removed_spte(iter.old_spte))
- goto retry;
+ return RET_PF_RETRY;
if (iter.level == fault->goal_level)
- goto map_target_level;
+ return tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
/* Step down into the lower level page table if it exists. */
if (is_shadow_present_pte(iter.old_spte) &&
@@ -1199,7 +1192,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
*/
if (r) {
tdp_mmu_free_sp(sp);
- goto retry;
+ return RET_PF_RETRY;
}
if (fault->huge_page_disallowed &&
@@ -1216,14 +1209,30 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* iterator detected an upper level SPTE was frozen during traversal.
*/
WARN_ON_ONCE(iter.level == fault->goal_level);
- goto retry;
+ return RET_PF_RETRY;
+}
-map_target_level:
- ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ int r = RET_PF_RETRY;
-retry:
+ read_lock(&vcpu->kvm->mmu_lock);
+
+ if (is_page_fault_stale(vcpu, fault))
+ goto out_unlock;
+
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ rcu_read_lock();
+ r = kvm_tdp_mmu_map(vcpu, fault);
rcu_read_unlock();
- return ret;
+
+out_unlock:
+ read_unlock(&vcpu->kvm->mmu_lock);
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
}
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..697dca948d0a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -27,7 +27,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush);
@@ -70,10 +70,4 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte);
-#ifdef CONFIG_X86_64
-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
-#else
-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
-#endif
-
#endif /* __KVM_X86_MMU_TDP_MMU_H */
base-commit: ffc4e70d9855921b740aee9bcbc8503cc753aee2
--
On Wed, Dec 21, 2022 at 06:32:05PM +0000, Sean Christopherson wrote:
> On Tue, Dec 20, 2022, David Matlack wrote:
> > On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> > > Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> > > kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> > > eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> > > from the walk.
> > >
> > > Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> > > kvm_tdp_mmu_map() was the only external user.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 9 ++++++++-
> > > arch/x86/kvm/mmu/mmu_internal.h | 1 -
> > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
> > > 3 files changed, 12 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 254bc46234e0..99c40617d325 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > return min(host_level, max_level);
> > > }
> > >
> > > -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault)
> > > {
> > > struct kvm_memory_slot *slot = fault->slot;
> > > kvm_pfn_t mask;
> > > @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > > if (is_page_fault_stale(vcpu, fault))
> > > goto out_unlock;
> > >
> > > + kvm_mmu_hugepage_adjust(vcpu, fault);
> >
> > Can you also move the call to kvm_mmu_hugepage_adjust() from
> > direct_map() to direct_page_fault()? I do think it's worth the
> > maintenence burden to keep those functions consistent.
>
> Sure.
>
> > > + trace_kvm_mmu_spte_requested(fault);
> > > +
> > > + rcu_read_lock();
> > > r = kvm_tdp_mmu_map(vcpu, fault);
> > > + rcu_read_unlock();
> >
> > I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
> > MMU details that bleed into mmu.c (RCU) and for consistency with other
> > TDP MMU APIs that don't require the caller to acquire RCU. This will
> > also be helpful for the Common MMU, as the tracepoint and RCU will be
> > common.
> >
> > e.g.
> >
> > static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > ...
> > }
> >
> > int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > int r;
> >
> > trace_kvm_mmu_spte_requested(fault);
> >
> > rcu_read_lock();
> > r = __kvm_tdp_mmu_map(vcpu, fault);
> > rcu_read_unlock();
> >
> > return r;
> > }
>
> I did that originally, but it felt really silly to have the trivial wrapper, especially
> because mmu.c already has TDP MMU details, e.g. kvm_tdp_mmu_page_fault() takes mmu_lock
> for read and other flows acquire rcu_read_lock() to protected the TDP MMU.
A trivial wrapper is useful in this case. While mmu.c does already have
some TDP MMU RCU details, I'd like to decrease that not increase it.
>
> What about the below (split into multiple patches) instead? kvm_tdp_mmu_page_fault()
> really should live in tdp_mmu.c, the only reason it's in mmu.c is to get at various
> helpers, e.g. fast_page_fault() and kvm_faultin_pfn().
Maybe, I'm not sure. The page fault handling routines have more to do
with mmu.c than tdp_mmu.c. i.e. It's more about integrating with the
rest of KVM/x86 (page fault tracking, MMU notifiers, etc.). We only go
into tdp_mmu.c to manipulate page tables.
>
> Or is that doomed to fail because the TDP MMU will want to add code before
> kvm_faultin_pfn() (I can't remember what motivated splitting out kvm_tdp_mmu_page_fault()
> in the first place).
To improve readability (less conditionals: if (tdp_mmu_enabled)) and
prepare for more divergence.
Your proposal (below) to split out the "lower half" of the page fault
handling routine works now because that's where all the divergence is.
But with the common MMU there's also going to be divergence in the fast
page fault handler. So I prefer to just keep the routines separate to
avoid thrashing down the road.
>
> ---
> arch/x86/kvm/mmu/mmu.c | 132 ++++++++------------------------
> arch/x86/kvm/mmu/mmu_internal.h | 50 ++++++++++++
> arch/x86/kvm/mmu/spte.h | 7 --
> arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 8 +-
> 5 files changed, 108 insertions(+), 130 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..8203b1dd2753 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1927,16 +1927,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
> return true;
> }
>
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> - if (sp->role.invalid)
> - return true;
> -
> - /* TDP MMU pages do not use the MMU generation. */
> - return !is_tdp_mmu_page(sp) &&
> - unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
> struct mmu_page_path {
> struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
> unsigned int idx[PT64_ROOT_MAX_LEVEL];
> @@ -3148,9 +3138,6 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> int ret;
> gfn_t base_gfn = fault->gfn;
>
> - kvm_mmu_hugepage_adjust(vcpu, fault);
> -
> - trace_kvm_mmu_spte_requested(fault);
> for_each_shadow_entry(vcpu, fault->addr, it) {
> /*
> * We cannot overwrite existing page tables with an NX
> @@ -4270,54 +4257,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> return RET_PF_CONTINUE;
> }
>
> -/*
> - * Returns true if the page fault is stale and needs to be retried, i.e. if the
> - * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> - */
> -static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> - struct kvm_page_fault *fault)
> +static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> - struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
> + int r = RET_PF_RETRY;
>
> - /* Special roots, e.g. pae_root, are not backed by shadow pages. */
> - if (sp && is_obsolete_sp(vcpu->kvm, sp))
> - return true;
> -
> - /*
> - * Roots without an associated shadow page are considered invalid if
> - * there is a pending request to free obsolete roots. The request is
> - * only a hint that the current root _may_ be obsolete and needs to be
> - * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> - * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> - * to reload even if no vCPU is actively using the root.
> - */
> - if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> - return true;
> -
> - return fault->slot &&
> - mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> -}
> -
> -static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> -{
> - int r;
> -
> - if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> -
> - r = fast_page_fault(vcpu, fault);
> - if (r != RET_PF_INVALID)
> - return r;
> -
> - r = mmu_topup_memory_caches(vcpu, false);
> - if (r)
> - return r;
> -
> - r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> - if (r != RET_PF_CONTINUE)
> - return r;
> -
> - r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> @@ -4327,6 +4270,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> if (r)
> goto out_unlock;
>
> + kvm_mmu_hugepage_adjust(vcpu, fault);
> +
> + trace_kvm_mmu_spte_requested(fault);
> +
> r = direct_map(vcpu, fault);
>
> out_unlock:
> @@ -4335,6 +4282,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return r;
> }
>
> +static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +{
> + int r;
> +
> + if (page_fault_handle_page_track(vcpu, fault))
> + return RET_PF_EMULATE;
> +
> + r = fast_page_fault(vcpu, fault);
> + if (r != RET_PF_INVALID)
> + return r;
> +
> + r = mmu_topup_memory_caches(vcpu, false);
> + if (r)
> + return r;
> +
> + r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> + if (r != RET_PF_CONTINUE)
> + return r;
> +
> +#ifdef CONFIG_X86_64
> + if (tdp_mmu_enabled)
> + return kvm_tdp_mmu_page_fault(vcpu, fault);
> +#endif
> + return __direct_page_fault(vcpu, fault);
> +}
> +
> static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> @@ -4378,42 +4351,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> }
> EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
>
> -#ifdef CONFIG_X86_64
> -static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> - struct kvm_page_fault *fault)
> -{
> - int r;
> -
> - if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> -
> - r = fast_page_fault(vcpu, fault);
> - if (r != RET_PF_INVALID)
> - return r;
> -
> - r = mmu_topup_memory_caches(vcpu, false);
> - if (r)
> - return r;
> -
> - r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> - if (r != RET_PF_CONTINUE)
> - return r;
> -
> - r = RET_PF_RETRY;
> - read_lock(&vcpu->kvm->mmu_lock);
> -
> - if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> -
> - r = kvm_tdp_mmu_map(vcpu, fault);
> -
> -out_unlock:
> - read_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> - return r;
> -}
> -#endif
> -
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> /*
> @@ -4438,11 +4375,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> }
> }
>
> -#ifdef CONFIG_X86_64
> - if (tdp_mmu_enabled)
> - return kvm_tdp_mmu_page_fault(vcpu, fault);
> -#endif
> -
> return direct_page_fault(vcpu, fault);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..2c7c2b49f719 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -133,6 +133,28 @@ struct kvm_mmu_page {
>
> extern struct kmem_cache *mmu_page_header_cache;
>
> +static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> +{
> + struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
> +
> + return (struct kvm_mmu_page *)page_private(page);
> +}
> +
> +static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
> +{
> + return IS_ENABLED(CONFIG_X86_64) && sp->tdp_mmu_page;
> +}
> +
> +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + if (sp->role.invalid)
> + return true;
> +
> + /* TDP MMU pages do not use the MMU generation. */
> + return !is_tdp_mmu_page(sp) &&
> + unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
> static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
> {
> return role.smm ? 1 : 0;
> @@ -314,6 +336,34 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> return r;
> }
>
> +/*
> + * Returns true if the page fault is stale and needs to be retried, i.e. if the
> + * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> + */
> +static inline bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
> +
> + /* Special roots, e.g. pae_root, are not backed by shadow pages. */
> + if (sp && is_obsolete_sp(vcpu->kvm, sp))
> + return true;
> +
> + /*
> + * Roots without an associated shadow page are considered invalid if
> + * there is a pending request to free obsolete roots. The request is
> + * only a hint that the current root _may_ be obsolete and needs to be
> + * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> + * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> + * to reload even if no vCPU is actively using the root.
> + */
> + if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> + return true;
> +
> + return fault->slot &&
> + mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +}
is_page_fault_stale() is overkill for the TDP MMU and is
KVM/x86-specific. If we do go with your way of splitting things, I'd
prefer to have the TDP MMU not call is_page_fault_stale() so that more
code can be shared across architectures when the TDP MMU is common.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99c40617d325..68db6805072a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3085,8 +3085,7 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
}
-static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
kvm_pfn_t mask;
@@ -4295,8 +4294,28 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
return true;
- return fault->slot &&
- mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+ return mmu_invalidate_retry_fault(vcpu, fault);
+}
+
+static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ int r = RET_PF_RETRY;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ if (is_page_fault_stale(vcpu, fault))
+ goto out_unlock;
+
+ r = make_mmu_pages_available(vcpu);
+ if (r)
+ goto out_unlock;
+
+ r = direct_map(vcpu, fault);
+
+out_unlock:
+ write_unlock(&vcpu->kvm->mmu_lock);
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
}
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4318,22 +4337,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r != RET_PF_CONTINUE)
return r;
- r = RET_PF_RETRY;
- write_lock(&vcpu->kvm->mmu_lock);
-
- if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
-
- r = make_mmu_pages_available(vcpu);
- if (r)
- goto out_unlock;
+#ifdef CONFIG_X86_64
+ if (tdp_mmu_enabled)
+ return kvm_tdp_mmu_page_fault(vcpu, fault);
+#endif
- r = direct_map(vcpu, fault);
+ return __direct_page_fault(vcpu, fault);
-out_unlock:
- write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
- return r;
}
static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
@@ -4379,48 +4389,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
}
EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
-#ifdef CONFIG_X86_64
-static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
-{
- int r;
-
- if (page_fault_handle_page_track(vcpu, fault))
- return RET_PF_EMULATE;
-
- r = fast_page_fault(vcpu, fault);
- if (r != RET_PF_INVALID)
- return r;
-
- r = mmu_topup_memory_caches(vcpu, false);
- if (r)
- return r;
-
- r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
- if (r != RET_PF_CONTINUE)
- return r;
-
- r = RET_PF_RETRY;
- read_lock(&vcpu->kvm->mmu_lock);
-
- if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
-
- kvm_mmu_hugepage_adjust(vcpu, fault);
-
- trace_kvm_mmu_spte_requested(fault);
-
- rcu_read_lock();
- r = kvm_tdp_mmu_map(vcpu, fault);
- rcu_read_unlock();
-
-out_unlock:
- read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
- return r;
-}
-#endif
-
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
/*
@@ -4445,11 +4413,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
}
-#ifdef CONFIG_X86_64
- if (tdp_mmu_enabled)
- return kvm_tdp_mmu_page_fault(vcpu, fault);
-#endif
-
return direct_page_fault(vcpu, fault);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 66c294d67641..776b0ad4e58a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -323,5 +323,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+static inline bool mmu_invalidate_retry_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ return fault->slot &&
+ mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+}
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 78f47eb74544..d4736cb91c9f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1144,7 +1144,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
* page tables and SPTEs to translate the faulting guest physical address.
*/
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
@@ -1212,6 +1212,33 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return RET_PF_RETRY;
}
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *root = to_shadow_page(vcpu->arch.mmu->root.hpa);
+ int r = RET_PF_RETRY;
+
+ read_lock(&vcpu->kvm->mmu_lock);
+
+ if (root->role.invalid)
+ goto out;
+
+ if (mmu_invalidate_retry_fault(vcpu, fault))
+ goto out;
+
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ rcu_read_lock();
+ r = kvm_tdp_mmu_map(vcpu, fault);
+ rcu_read_unlock();
+
+out:
+ read_unlock(&vcpu->kvm->mmu_lock);
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
+}
+
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..849e5886e73b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -27,7 +27,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush);
On 12/29/22 20:51, David Matlack wrote:
> Your proposal (below) to split out the "lower half" of the page fault
> handling routine works now because that's where all the divergence is.
> But with the common MMU there's also going to be divergence in the fast
> page fault handler. So I prefer to just keep the routines separate to
> avoid thrashing down the road.
Can you put the changes at the beginning of the common MMU series?
Large parts of the whole common MMU refactoring can be merged piece by
piece, so they can be taken as soon as they're ready.
Paolo
On Thu, Dec 29, 2022 at 1:06 PM Paolo Bonzini <[email protected]> wrote:
>
> On 12/29/22 20:51, David Matlack wrote:
> > Your proposal (below) to split out the "lower half" of the page fault
> > handling routine works now because that's where all the divergence is.
> > But with the common MMU there's also going to be divergence in the fast
> > page fault handler. So I prefer to just keep the routines separate to
> > avoid thrashing down the road.
>
> Can you put the changes at the beginning of the common MMU series?
Can do. By "the changes" I assume you mean the yet-to-be-written
changes to split out a fast_page_fault() handler for the TDP MMU? Or
do you mean Sean's changes (this series)?
> Large parts of the whole common MMU refactoring can be merged piece by
> piece, so they can be taken as soon as they're ready.
Ack.