2023-02-02 18:28:17

by Ben Gardon

Subject: [PATCH 00/21] KVM: x86/MMU: Formalize the Shadow MMU

This series makes the Shadow MMU a distinct part of the KVM x86 MMU,
implemented in separate files, with a defined interface to common code.

When the TDP (Two Dimensional Paging) MMU was added to x86 KVM, it came in
a separate file with a (reasonably) clear interface. This led to many
points in the KVM MMU like this:

	if (tdp_mmu_on())
		kvm_tdp_mmu_do_stuff()

	if (memslots_have_rmaps())
		/* Do whatever was being done before */

The implementations of various functions which preceded the TDP MMU have
remained scattered around mmu.c with no clear identity or interface. Over the
last couple of years, the KVM x86 community has settled on calling the KVM MMU
implementation which preceded the TDP MMU the "Shadow MMU", as it grew
from shadow paging, which supported virtualization on pre-TDP hardware.
(Note that the Shadow MMU can also build TDP page tables, and doesn't
only do shadow paging, so the meaning is a bit overloaded.)

Splitting it out into separate files will give a clear interface and make it
easier to distinguish common x86 MMU code from the code specific to the two
MMU implementations.
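
As a rough illustration of the kind of interface this split is aiming for, the sketch below models common code dispatching to two implementations through narrow entry points. It is a user-space toy, not code from this series: `struct vm`, `tdp_mmu_zap_all()`, `shadow_mmu_zap_all()` and `mmu_zap_all()` are hypothetical names, and the two flags merely stand in for the `tdp_mmu_on()` / `memslots_have_rmaps()` checks mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-VM state that common code inspects. */
struct vm {
	bool tdp_mmu_enabled;     /* models the tdp_mmu_on() check */
	bool memslots_have_rmaps; /* models the memslots_have_rmaps() check */
	int tdp_pages;            /* pages owned by the TDP MMU */
	int shadow_pages;         /* pages owned by the Shadow MMU */
};

/* Would live in the TDP MMU's file; returns the number of pages zapped. */
static int tdp_mmu_zap_all(struct vm *vm)
{
	int zapped = vm->tdp_pages;

	vm->tdp_pages = 0;
	return zapped;
}

/* Would live in the Shadow MMU's file; returns the number of pages zapped. */
static int shadow_mmu_zap_all(struct vm *vm)
{
	int zapped = vm->shadow_pages;

	vm->shadow_pages = 0;
	return zapped;
}

/* Common code: only decides which implementation(s) to call. */
static int mmu_zap_all(struct vm *vm)
{
	int zapped = 0;

	assert(vm);

	if (vm->tdp_mmu_enabled)
		zapped += tdp_mmu_zap_all(vm);

	if (vm->memslots_have_rmaps)
		zapped += shadow_mmu_zap_all(vm);

	return zapped;
}
```

The point of the shape is that `mmu_zap_all()` never touches implementation-specific state directly; everything behind the two calls could move to its own file without the common code changing.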

Patches 1-3 are cleanups from Sean.

Patches 4-6 prepare for the refactor by adding files and exporting
functions.

Patch 7 is the big move, transferring 3.5K lines from mmu.c to
shadow_mmu.c.
(It may be best if whoever ends up preparing the pull request with
this patch just dumps my version and re-does the move so that no code is
lost.)

Patches 8 and 9 move the includes for paging_tmpl.h to shadow_mmu.c.

Patches 10-17 clean up the interface between the Shadow MMU and
common MMU code.

The last few patches are in response to feedback on the RFC and move
additional code to the Shadow MMU.

Patch 7 is an enormous change, and doing it all at once in a single
commit all but guarantees merge conflicts and makes it hard to review. I
don't have a good answer to this problem as there's no easy way to move
3.5K lines between files. I tried moving the code bit-by-bit but the
intermediate steps added complexity and ultimately the 50+ patches it
created didn't seem any easier to review.
Doing the big move all at once at least makes it easier to get past when
doing Git archeology, and doing it at the beginning of the series allows the
rest of the commits to still show up in Git blame.

I've tested this series on an Intel Skylake host with kvm-unit-tests and
selftests.

RFC -> v1:
- RFC: https://lore.kernel.org/all/[email protected]/
- Moved some more Shadow MMU content to shadow_mmu.c. David Matlack
pointed out some code I'd missed in the first pass. Added commits
to the end of the series to achieve this.
- Dropped is_cpuid_PSE36 and moved all the BUILD_MMU_ROLE*() macros
to mmu_internal, also as suggested by David Matlack.
- Added copyright comments to the tops of shadow_mmu.c and .h
- Tacked some cleanups from Sean onto the beginning of the series.

Ben Gardon (18):
KVM: x86/MMU: Add shadow_mmu.(c|h)
KVM: x86/MMU: Expose functions for the Shadow MMU
KVM: x86/mmu: Get rid of is_cpuid_PSE36()
KVM: x86/MMU: Move the Shadow MMU implementation to shadow_mmu.c
KVM: x86/MMU: Expose functions for paging_tmpl.h
KVM: x86/MMU: Move paging_tmpl.h includes to shadow_mmu.c
KVM: x86/MMU: Clean up Shadow MMU exports
KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU
KVM: x86/MMU: Clean up naming of exported Shadow MMU functions
KVM: x86/MMU: Fix naming on prepare / commit zap page functions
KVM: x86/MMU: Factor Shadow MMU wrprot / clear dirty ops out of mmu.c
KVM: x86/MMU: Remove unneeded exports from shadow_mmu.c
KVM: x86/MMU: Wrap uses of kvm_handle_gfn_range in mmu.c
KVM: x86/MMU: Add kvm_shadow_mmu_ to the last few functions in
shadow_mmu.h
KVM: x86/mmu: Move split cache topup functions to shadow_mmu.c
KVM: x86/mmu: Move Shadow MMU part of kvm_mmu_zap_all() to
shadow_mmu.h
KVM: x86/mmu: Move Shadow MMU init/teardown to shadow_mmu.c
KVM: x86/mmu: Split out Shadow MMU lockless walk begin/end

Sean Christopherson (3):
KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up
code
KVM: x86/mmu: Replace comment with an actual lockdep assertion on
mmu_lock
KVM: x86/mmu: Clean up mmu.c functions that put return type on
separate line

arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/debugfs.c | 1 +
arch/x86/kvm/mmu/mmu.c | 4834 ++++---------------------------
arch/x86/kvm/mmu/mmu_internal.h | 87 +-
arch/x86/kvm/mmu/paging_tmpl.h | 15 +-
arch/x86/kvm/mmu/shadow_mmu.c | 3692 +++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 132 +
7 files changed, 4498 insertions(+), 4265 deletions(-)
create mode 100644 arch/x86/kvm/mmu/shadow_mmu.c
create mode 100644 arch/x86/kvm/mmu/shadow_mmu.h


base-commit: 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f
--
2.39.1.519.gcb327c4b5f-goog



2023-02-02 18:28:20

by Ben Gardon

Subject: [PATCH 01/21] KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up code

From: Sean Christopherson <[email protected]>

Replace "slot_handle_level" with "walk_slot_rmaps" to better capture what
the helpers are doing, and to slightly shorten the function names so that
each function's return type and attributes can be placed on the same line
as the function declaration.

No functional change intended.
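
For readers unfamiliar with the helpers being renamed, the following is a much-simplified user-space sketch of the pattern: a range walker that invokes a handler on each populated rmap and ORs together the "flush needed" results, plus a wrapper covering the whole slot. `struct slot`, `walk_slot_rmaps_range()` (standing in for `__walk_slot_rmaps()`) and `zap_rmap()` are hypothetical; the real helpers also take page-table levels and handle yielding, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified slot: one mapping count per guest frame number. */
struct slot {
	unsigned long base_gfn;
	unsigned long npages;
	int *rmap; /* rmap[i] != 0 means gfn (base_gfn + i) has mappings */
};

/* The return value indicates whether a TLB flush is needed. */
typedef bool (*slot_rmaps_handler)(struct slot *slot, size_t index);

/* Walk [start_gfn, end_gfn] (inclusive), accumulating handler results. */
static bool walk_slot_rmaps_range(struct slot *slot, slot_rmaps_handler fn,
				  unsigned long start_gfn,
				  unsigned long end_gfn, bool flush)
{
	unsigned long gfn;

	for (gfn = start_gfn; gfn <= end_gfn; gfn++) {
		size_t index = gfn - slot->base_gfn;

		/* Skip frames with nothing mapped, as the real walker does. */
		if (slot->rmap[index])
			flush |= fn(slot, index);
	}

	return flush;
}

/* Wrapper: walk the whole slot, starting with no flush pending. */
static bool walk_slot_rmaps(struct slot *slot, slot_rmaps_handler fn)
{
	assert(slot->npages > 0);

	return walk_slot_rmaps_range(slot, fn, slot->base_gfn,
				     slot->base_gfn + slot->npages - 1, false);
}

/* Example handler: drop the mappings and request a flush. */
static bool zap_rmap(struct slot *slot, size_t index)
{
	slot->rmap[index] = 0;
	return true;
}
```

Passing the incoming `flush` value through the range walker lets a caller chain several walks and issue a single flush at the end, which is how the zap path in the diff below uses it.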

Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 66 +++++++++++++++++++++---------------------
1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aeb240b339f54..09a0a2cc76bae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5801,23 +5801,24 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
EXPORT_SYMBOL_GPL(kvm_configure_mmu);

/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_level_handler) (struct kvm *kvm,
+typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
struct kvm_rmap_head *rmap_head,
const struct kvm_memory_slot *slot);

/* The caller should hold mmu-lock before calling this function. */
-static __always_inline bool
-slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot,
- slot_level_handler fn, int start_level, int end_level,
- gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield,
- bool flush)
+static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn,
+ bool flush_on_yield, bool flush)
{
struct slot_rmap_walk_iterator iterator;

- for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
+ for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
end_gfn, &iterator) {
if (iterator.rmap)
- flush |= fn(kvm, iterator.rmap, memslot);
+ flush |= fn(kvm, iterator.rmap, slot);

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
if (flush && flush_on_yield) {
@@ -5833,23 +5834,23 @@ slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot,
return flush;
}

-static __always_inline bool
-slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot,
- slot_level_handler fn, int start_level, int end_level,
- bool flush_on_yield)
+static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ bool flush_on_yield)
{
- return slot_handle_level_range(kvm, memslot, fn, start_level,
- end_level, memslot->base_gfn,
- memslot->base_gfn + memslot->npages - 1,
- flush_on_yield, false);
+ return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
+ slot->base_gfn, slot->base_gfn + slot->npages - 1,
+ flush_on_yield, false);
}

-static __always_inline bool
-slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot,
- slot_level_handler fn, bool flush_on_yield)
+static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ bool flush_on_yield)
{
- return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K,
- PG_LEVEL_4K, flush_on_yield);
+ return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
}

static void free_mmu_pages(struct kvm_mmu *mmu)
@@ -6144,9 +6145,9 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
if (WARN_ON_ONCE(start >= end))
continue;

- flush = slot_handle_level_range(kvm, memslot, __kvm_zap_rmap,
- PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
- start, end - 1, true, flush);
+ flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+ start, end - 1, true, flush);
}
}

@@ -6199,8 +6200,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
{
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- slot_handle_level(kvm, memslot, slot_rmap_write_protect,
- start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+ walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
+ start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
write_unlock(&kvm->mmu_lock);
}

@@ -6435,10 +6436,9 @@ static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
* all the way to the target level. There's no need to split pages
* already at the target level.
*/
- for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
- slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages,
- level, level, start, end - 1, true, false);
- }
+ for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
+ __walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
+ level, level, start, end - 1, true, false);
}

/* Must be called with the mmu_lock held in write-mode. */
@@ -6537,8 +6537,8 @@ static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
* Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
* pages that are already mapped at the maximum hugepage level.
*/
- if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
- PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
+ if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
}

@@ -6582,7 +6582,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
* Clear dirty bits only on 4k SPTEs since the legacy MMU only
* support dirty logging at a 4k granularity.
*/
- slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
+ walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false);
write_unlock(&kvm->mmu_lock);
}

--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:28:24

by Ben Gardon

Subject: [PATCH 02/21] KVM: x86/mmu: Replace comment with an actual lockdep assertion on mmu_lock

From: Sean Christopherson <[email protected]>

Assert that mmu_lock is held for write in __walk_slot_rmaps() instead of
hoping the function comment will magically prevent introducing bugs.
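
The idea of turning a locking comment into a checked precondition can be sketched in user space as follows. This is only an analogy: `struct toy_rwlock`, `toy_assert_held_write()` and `toy_walk_rmaps()` are hypothetical, and a plain `assert()` aborts the program, which is not how the kernel's lockdep assertions report a violation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock that only records whether it is held for write. */
struct toy_rwlock {
	bool held_for_write;
};

static void toy_write_lock(struct toy_rwlock *lock)
{
	lock->held_for_write = true;
}

static void toy_write_unlock(struct toy_rwlock *lock)
{
	lock->held_for_write = false;
}

/* Stand-in for a lockdep-style check: enforce instead of document. */
static void toy_assert_held_write(const struct toy_rwlock *lock)
{
	assert(lock->held_for_write);
}

struct toy_vm {
	struct toy_rwlock mmu_lock;
	int rmaps_walked;
};

/* Before: "caller should hold mmu_lock" lived in a comment. After: it is checked. */
static int toy_walk_rmaps(struct toy_vm *vm)
{
	toy_assert_held_write(&vm->mmu_lock);

	return ++vm->rmaps_walked;
}
```

A caller that forgets `toy_write_lock()` trips the assertion on the first call rather than silently racing, which is the behavior the commit is after.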

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 09a0a2cc76bae..2ea8e58f83256 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5805,7 +5805,6 @@ typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
struct kvm_rmap_head *rmap_head,
const struct kvm_memory_slot *slot);

-/* The caller should hold mmu-lock before calling this function. */
static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
const struct kvm_memory_slot *slot,
slot_rmaps_handler fn,
@@ -5815,6 +5814,8 @@ static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
{
struct slot_rmap_walk_iterator iterator;

+ lockdep_assert_held_write(&kvm->mmu_lock);
+
for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
end_gfn, &iterator) {
if (iterator.rmap)
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:28:29

by Ben Gardon

Subject: [PATCH 03/21] KVM: x86/mmu: Clean up mmu.c functions that put return type on separate line

From: Sean Christopherson <[email protected]>

Adjust a variety of functions in mmu.c to put the function return type on
the same line as the function declaration. As stated in the Linus
specification:

But the "on their own line" is complete garbage to begin with. That
will NEVER be a kernel rule. We should never have a rule that assumes
things are so long that they need to be on multiple lines.

We don't put function return types on their own lines either, even if
some other projects have that rule (just to get function names at the
beginning of lines or some other odd reason).

Leave the functions generated by BUILD_MMU_ROLE_REGS_ACCESSOR() as-is,
that code is basically illegible no matter how it's formatted.

No functional change intended.

Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 59 ++++++++++++++++++++----------------------
1 file changed, 28 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2ea8e58f83256..3674bde2203b2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -876,9 +876,9 @@ static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
untrack_possible_nx_huge_page(kvm, sp);
}

-static struct kvm_memory_slot *
-gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool no_dirty_log)
+static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ bool no_dirty_log)
{
struct kvm_memory_slot *slot;

@@ -938,10 +938,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
return count;
}

-static void
-pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
- struct pte_list_desc *desc, int i,
- struct pte_list_desc *prev_desc)
+static void pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
+ struct pte_list_desc *desc, int i,
+ struct pte_list_desc *prev_desc)
{
int j = desc->spte_count - 1;

@@ -1493,8 +1492,8 @@ struct slot_rmap_walk_iterator {
struct kvm_rmap_head *end_rmap;
};

-static void
-rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level)
+static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator,
+ int level)
{
iterator->level = level;
iterator->gfn = iterator->start_gfn;
@@ -1502,10 +1501,10 @@ rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level)
iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
}

-static void
-slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
- const struct kvm_memory_slot *slot, int start_level,
- int end_level, gfn_t start_gfn, gfn_t end_gfn)
+static void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
+ const struct kvm_memory_slot *slot,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn)
{
iterator->slot = slot;
iterator->start_level = start_level;
@@ -3295,9 +3294,9 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
* Returns true if the SPTE was fixed successfully. Otherwise,
* someone else modified the SPTE from its original value.
*/
-static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
- u64 *sptep, u64 old_spte, u64 new_spte)
+static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ u64 *sptep, u64 old_spte, u64 new_spte)
{
/*
* Theoretically we could also set dirty bit (and flush TLB) here in
@@ -4626,10 +4625,9 @@ static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
#include "paging_tmpl.h"
#undef PTTYPE

-static void
-__reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
- u64 pa_bits_rsvd, int level, bool nx, bool gbpages,
- bool pse, bool amd)
+static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
+ u64 pa_bits_rsvd, int level, bool nx,
+ bool gbpages, bool pse, bool amd)
{
u64 gbpages_bit_rsvd = 0;
u64 nonleaf_bit8_rsvd = 0;
@@ -4742,9 +4740,9 @@ static void reset_guest_rsvds_bits_mask(struct kvm_vcpu *vcpu,
guest_cpuid_is_amd_or_hygon(vcpu));
}

-static void
-__reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
- u64 pa_bits_rsvd, bool execonly, int huge_page_level)
+static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
+ u64 pa_bits_rsvd, bool execonly,
+ int huge_page_level)
{
u64 high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51);
u64 large_1g_rsvd = 0, large_2m_rsvd = 0;
@@ -4844,8 +4842,7 @@ static inline bool boot_cpu_is_amd(void)
* the direct page table on host, use as much mmu features as
* possible, however, kvm currently does not do execution-protection.
*/
-static void
-reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
+static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
{
struct rsvd_bits_validate *shadow_zero_check;
int i;
@@ -5060,8 +5057,8 @@ static void paging32_init_context(struct kvm_mmu *context)
context->invlpg = paging32_invlpg;
}

-static union kvm_cpu_role
-kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
+static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
+ const struct kvm_mmu_role_regs *regs)
{
union kvm_cpu_role role = {0};

@@ -6653,8 +6650,8 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}

-static unsigned long
-mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long mmu_shrink_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct kvm *kvm;
int nr_to_scan = sc->nr_to_scan;
@@ -6712,8 +6709,8 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
return freed;
}

-static unsigned long
-mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long mmu_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc)
{
return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
}
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:28:42

by Ben Gardon

Subject: [PATCH 04/21] KVM: x86/MMU: Add shadow_mmu.(c|h)

As a first step to splitting the Shadow MMU out of KVM MMU common code,
add separate files for it with some of the boilerplate and includes the
Shadow MMU will need.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/mmu/shadow_mmu.c | 23 +++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 21 +++++++++++++++++++++
4 files changed, 46 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/mmu/shadow_mmu.c
create mode 100644 arch/x86/kvm/mmu/shadow_mmu.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17e..d6e94660b006e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ include $(srctree)/virt/kvm/Makefile.kvm
kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
- mmu/spte.o
+ mmu/spte.o mmu/shadow_mmu.o

ifdef CONFIG_HYPERV
kvm-y += kvm_onhyperv.o
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3674bde2203b2..752c38d625a32 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -21,6 +21,7 @@
#include "mmu.h"
#include "mmu_internal.h"
#include "tdp_mmu.h"
+#include "shadow_mmu.h"
#include "x86.h"
#include "kvm_cache_regs.h"
#include "smm.h"
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
new file mode 100644
index 0000000000000..eee5a6796d9b0
--- /dev/null
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM Shadow MMU
+ *
+ * Extracted from mmu.c
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ * Copyright (C) 2023, Google, Inc.
+ *
+ * Original authors:
+ * Yaniv Kamay <[email protected]>
+ * Avi Kivity <[email protected]>
+ */
+#include "mmu.h"
+#include "mmu_internal.h"
+#include "mmutrace.h"
+#include "shadow_mmu.h"
+#include "spte.h"
+
+#include <asm/vmx.h>
+#include <asm/cmpxchg.h>
+#include <trace/events/kvm.h>
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
new file mode 100644
index 0000000000000..2bfba6ad20688
--- /dev/null
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * KVM Shadow MMU
+ *
+ * Extracted from mmu.c
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ * Copyright (C) 2023, Google, Inc.
+ *
+ * Original authors:
+ * Yaniv Kamay <[email protected]>
+ * Avi Kivity <[email protected]>
+ */
+
+#ifndef __KVM_X86_MMU_SHADOW_MMU_H
+#define __KVM_X86_MMU_SHADOW_MMU_H
+
+#include <linux/kvm_host.h>
+
+#endif /* __KVM_X86_MMU_SHADOW_MMU_H */
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:28:47

by Ben Gardon

Subject: [PATCH 06/21] KVM: x86/mmu: Get rid of is_cpuid_PSE36()

is_cpuid_PSE36() always returns 1 and is never overridden, so just get
rid of the function. This saves having to export it in a future commit
in order to move the include of paging_tmpl.h out of mmu.c.

No functional change intended.
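
To make the two masks in the removed branch concrete, here is a small user-space sketch that computes them. The `rsvd_bits()` below is a local helper modeled on what KVM's helper of the same name is assumed to do (set bits s through e inclusive); `pse36_4mb_rsvd_mask()` and `pse32_4mb_rsvd_mask()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits s..e (inclusive) set; local model of KVM's rsvd_bits(). */
static uint64_t rsvd_bits(int s, int e)
{
	assert(s >= 0 && e <= 63);

	if (e < s)
		return 0;

	return ((2ULL << (e - s)) - 1) << s;
}

/* The case that is kept: 36-bit PSE, bits 17-21 reserved in a 4MB PDE. */
static uint64_t pse36_4mb_rsvd_mask(void)
{
	return rsvd_bits(17, 21);
}

/* The dead branch that is removed: 32-bit PSE, bits 13-21 reserved. */
static uint64_t pse32_4mb_rsvd_mask(void)
{
	return rsvd_bits(13, 21);
}
```

With PSE-36, bits 13-16 of a 4MB page directory entry carry extra physical address bits instead of being reserved, so the surviving mask is the narrower of the two.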

Suggested-by: David Matlack <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 13 ++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 12d38a8772a80..35cb59737c0a3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -304,11 +304,6 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
return likely(kvm_gen == spte_gen);
}

-static int is_cpuid_PSE36(void)
-{
- return 1;
-}
-
#ifdef CONFIG_X86_64
static void __set_spte(u64 *sptep, u64 spte)
{
@@ -4661,12 +4656,8 @@ static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
break;
}

- if (is_cpuid_PSE36())
- /* 36bits PSE 4MB page */
- rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
- else
- /* 32 bits PSE 4MB page */
- rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+ /* 36bits PSE 4MB page */
+ rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
break;
case PT32E_ROOT_LEVEL:
rsvd_check->rsvd_bits_mask[0][2] = rsvd_bits(63, 63) |
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c4..730b413eebfde 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -426,7 +426,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;

#if PTTYPE == 32
- if (walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
+ if (walker->level > PG_LEVEL_4K)
gfn += pse36_gfn_delta(pte);
#endif

--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:28:54

by Ben Gardon

Subject: [PATCH 05/21] KVM: x86/MMU: Expose functions for the Shadow MMU

Expose various common MMU functions which the Shadow MMU will need via
mmu_internal.h. This slightly reduces the work needed to move the
Shadow MMU code out of mmu.c, which will already be a massive change.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++-------------------
arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++++++++++++++
2 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 752c38d625a32..12d38a8772a80 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -164,9 +164,9 @@ struct kvm_shadow_walk_iterator {
({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
__shadow_walk_next(&(_walker), spte))

-static struct kmem_cache *pte_list_desc_cache;
+struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
+struct percpu_counter kvm_total_used_mmu_pages;

static void mmu_spte_set(u64 *sptep, u64 spte);

@@ -242,11 +242,6 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
return regs;
}

-static inline bool kvm_available_flush_tlb_with_range(void)
-{
- return kvm_x86_ops.tlb_remote_flush_with_range;
-}
-
static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
struct kvm_tlb_range *range)
{
@@ -270,8 +265,8 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
kvm_flush_remote_tlbs_with_range(kvm, &range);
}

-static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
- unsigned int access)
+void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
+ unsigned int access)
{
u64 spte = make_mmio_spte(vcpu, gfn, access);

@@ -623,7 +618,7 @@ static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
}

-static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
+void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
if (is_tdp_mmu_active(vcpu)) {
kvm_tdp_mmu_walk_lockless_begin();
@@ -642,7 +637,7 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
}
}

-static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
+void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
if (is_tdp_mmu_active(vcpu)) {
kvm_tdp_mmu_walk_lockless_end();
@@ -835,8 +830,8 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
&kvm->arch.possible_nx_huge_pages);
}

-static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool nx_huge_page_possible)
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible)
{
sp->nx_huge_page_disallowed = true;

@@ -870,16 +865,15 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
list_del_init(&sp->possible_nx_huge_page_link);
}

-static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
sp->nx_huge_page_disallowed = false;

untrack_possible_nx_huge_page(kvm, sp);
}

-static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- bool no_dirty_log)
+struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
+ gfn_t gfn, bool no_dirty_log)
{
struct kvm_memory_slot *slot;

@@ -1415,7 +1409,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
return write_protected;
}

-static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
+bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
{
struct kvm_memory_slot *slot;

@@ -1914,9 +1908,8 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
return ret;
}

-static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
- struct list_head *invalid_list,
- bool remote_flush)
+bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list,
+ bool remote_flush)
{
if (!remote_flush && list_empty(invalid_list))
return false;
@@ -1928,7 +1921,7 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
return true;
}

-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (sp->role.invalid)
return true;
@@ -6216,7 +6209,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
}

-static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+bool need_topup_split_caches_or_resched(struct kvm *kvm)
{
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
return true;
@@ -6231,7 +6224,7 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
need_topup(&kvm->arch.split_shadow_page_cache, 1);
}

-static int topup_split_caches(struct kvm *kvm)
+int topup_split_caches(struct kvm *kvm)
{
/*
* Allocating rmap list entries when splitting huge pages for nested
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f67..95f0adfb3b0b4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -131,7 +131,9 @@ struct kvm_mmu_page {
#endif
};

+extern struct kmem_cache *pte_list_desc_cache;
extern struct kmem_cache *mmu_page_header_cache;
+extern struct percpu_counter kvm_total_used_mmu_pages;

static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
{
@@ -323,6 +325,28 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);

+static inline bool kvm_available_flush_tlb_with_range(void)
+{
+ return kvm_x86_ops.tlb_remote_flush_with_range;
+}
+
+void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
+ unsigned int access);
+struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
+ gfn_t gfn, bool no_dirty_log);
+bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn);
+bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list,
+ bool remote_flush);
+bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
+
+void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu);
+void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu);
+
+bool need_topup_split_caches_or_resched(struct kvm *kvm);
+int topup_split_caches(struct kvm *kvm);
#endif /* __KVM_X86_MMU_INTERNAL_H */
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:29:17

by Ben Gardon

Subject: [PATCH 08/21] KVM: x86/MMU: Expose functions for paging_tmpl.h

In preparation for moving the paging_tmpl.h includes to shadow_mmu.c, expose
various functions it needs through mmu_internal.h. This includes moving all the
BUILD_MMU_ROLE_*() macros. Not all of those macros are strictly needed
by paging_tmpl.h, but it is cleaner to keep them together.

No functional change intended.
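
For context on what the BUILD_MMU_ROLE_*() macros generate, the stand-alone sketch below reproduces the register-accessor variant in user space. The struct and macro follow the code being moved in the diff; the bit definitions are restated locally, `__attribute__((unused))` is substituted for the kernel's `__maybe_unused` (an assumption that a GCC/Clang-compatible compiler is used), and only three accessors are instantiated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions, restated locally for this sketch. */
#define X86_CR0_WP  (1UL << 16)
#define X86_CR4_PAE (1UL << 5)
#define EFER_NX     (1ULL << 11)

struct kvm_mmu_role_regs {
	const unsigned long cr0;
	const unsigned long cr4;
	const uint64_t efer;
};

/*
 * Each invocation pastes reg and name into a function name, e.g.
 * (cr0, wp) becomes ____is_cr0_wp(), which tests 'flag' in regs->cr0.
 */
#define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag)			\
static inline bool __attribute__((unused))				\
____is_##reg##_##name(const struct kvm_mmu_role_regs *regs)		\
{									\
	return !!(regs->reg & flag);					\
}
BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, wp, X86_CR0_WP)
BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pae, X86_CR4_PAE)
BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX)
```

Because the accessors are `static inline` functions produced by a macro, moving the macro definitions into a shared header makes them available to every file that includes it, which is why the whole family can be moved together.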

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 68 +++++----------------------------
arch/x86/kvm/mmu/mmu_internal.h | 59 ++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2162dfda9601f..da290bfca0137 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -123,57 +123,9 @@ struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
struct percpu_counter kvm_total_used_mmu_pages;

-struct kvm_mmu_role_regs {
- const unsigned long cr0;
- const unsigned long cr4;
- const u64 efer;
-};
-
#define CREATE_TRACE_POINTS
#include "mmutrace.h"

-/*
- * Yes, lot's of underscores. They're a hint that you probably shouldn't be
- * reading from the role_regs. Once the root_role is constructed, it becomes
- * the single source of truth for the MMU's state.
- */
-#define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag) \
-static inline bool __maybe_unused \
-____is_##reg##_##name(const struct kvm_mmu_role_regs *regs) \
-{ \
- return !!(regs->reg & flag); \
-}
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, pg, X86_CR0_PG);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, wp, X86_CR0_WP);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pse, X86_CR4_PSE);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pae, X86_CR4_PAE);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
-BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
-BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
-BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
-
-/*
- * The MMU itself (with a valid role) is the single source of truth for the
- * MMU. Do not use the regs used to build the MMU/role, nor the vCPU. The
- * regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
- * and the vCPU may be incorrect/irrelevant.
- */
-#define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \
-static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \
-{ \
- return !!(mmu->cpu_role. base_or_ext . reg##_##name); \
-}
-BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
-BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pse);
-BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep);
-BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap);
-BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
-BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
-BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
-BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma);
-
static inline bool is_cr0_pg(struct kvm_mmu *mmu)
{
return mmu->cpu_role.base.level > 0;
@@ -218,7 +170,7 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
kvm_flush_remote_tlbs_with_range(kvm, &range);
}

-static gfn_t get_mmio_spte_gfn(u64 spte)
+gfn_t get_mmio_spte_gfn(u64 spte)
{
u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;

@@ -287,7 +239,7 @@ void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}

-static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;

@@ -828,9 +780,8 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
return -EFAULT;
}

-static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault,
- unsigned int access)
+int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access)
{
gva_t gva = fault->is_tdp ? 0 : fault->addr;

@@ -1284,8 +1235,8 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return RET_PF_RETRY;
}

-static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
+bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
if (unlikely(fault->rsvd))
return false;
@@ -1408,8 +1359,8 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_CONTINUE;
}

-static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
- unsigned int access)
+int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access)
{
int ret;

@@ -1433,8 +1384,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* Returns true if the page fault is stale and needs to be retried, i.e. if the
* root was invalidated by a memslot update or a relevant mmu_notifier fired.
*/
-static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
+bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 9c1399762496b..349d4a300ad34 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -347,6 +347,65 @@ bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu);
void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu);

+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect);
bool need_topup_split_caches_or_resched(struct kvm *kvm);
int topup_split_caches(struct kvm *kvm);
+
+bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault);
+int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access);
+int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access);
+
+gfn_t get_mmio_spte_gfn(u64 spte);
+
+struct kvm_mmu_role_regs {
+ const unsigned long cr0;
+ const unsigned long cr4;
+ const u64 efer;
+};
+
+/*
+ * Yes, lot's of underscores. They're a hint that you probably shouldn't be
+ * reading from the role_regs. Once the root_role is constructed, it becomes
+ * the single source of truth for the MMU's state.
+ */
+#define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag) \
+static inline bool __maybe_unused \
+____is_##reg##_##name(const struct kvm_mmu_role_regs *regs) \
+{ \
+ return !!(regs->reg & flag); \
+}
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, pg, X86_CR0_PG);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, wp, X86_CR0_WP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pse, X86_CR4_PSE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pae, X86_CR4_PAE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
+BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
+
+/*
+ * The MMU itself (with a valid role) is the single source of truth for the
+ * MMU. Do not use the regs used to build the MMU/role, nor the vCPU. The
+ * regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
+ * and the vCPU may be incorrect/irrelevant.
+ */
+#define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \
+static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \
+{ \
+ return !!(mmu->cpu_role. base_or_ext . reg##_##name); \
+}
+BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pse);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
+BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma);
#endif /* __KVM_X86_MMU_INTERNAL_H */
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:29:26

by Ben Gardon

Subject: [PATCH 07/21] KVM: x86/MMU: Move the Shadow MMU implementation to shadow_mmu.c

Cut and paste the implementation of the Shadow MMU to shadow_mmu.(c|h).
This is a monstrously large commit, moving ~3500 lines, and there is no
way to make a move of that size easy to review. Do the move in one
massive step to simplify dealing with merge conflicts and to make the
git history a little easier to dig through. Several cleanup commits
follow this one rather than precede it so that their git history will
remain easy to see.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/debugfs.c | 1 +
arch/x86/kvm/mmu/mmu.c | 4510 ++++---------------------------
arch/x86/kvm/mmu/mmu_internal.h | 4 +-
arch/x86/kvm/mmu/shadow_mmu.c | 3418 +++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 145 +
5 files changed, 4083 insertions(+), 3995 deletions(-)
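
Note for reviewers skimming the moved rmap code: much of it depends on
the rmap_head encoding described in the "About rmap_head encoding"
comment, where bit zero of the value says whether it points at a single
spte or at a descriptor list. Below is a minimal userspace sketch of that
encoding, modelled on pte_list_count(). It is not kernel code. All demo_
and DEMO_ names are hypothetical, and uintptr_t stands in for the
kernel's unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LIST_EXT 14	/* entries per descriptor, as PTE_LIST_EXT */

/* Hypothetical analogue of struct pte_list_desc. */
struct demo_list_desc {
	struct demo_list_desc *more;
	uint64_t spte_count;
	uint64_t *sptes[DEMO_LIST_EXT];
};

/* Hypothetical analogue of struct kvm_rmap_head. */
struct demo_rmap_head {
	uintptr_t val;
};

/* Count the entries in a chain by decoding the three cases. */
static inline unsigned int demo_list_count(const struct demo_rmap_head *head)
{
	const struct demo_list_desc *desc;
	unsigned int count = 0;

	if (!head->val)
		return 0;	/* empty chain */
	if (!(head->val & 1))
		return 1;	/* val points directly at the only entry */

	/* Bit zero is set: mask it off to recover the descriptor pointer. */
	desc = (const struct demo_list_desc *)(head->val & ~(uintptr_t)1);
	for (; desc; desc = desc->more)
		count += desc->spte_count;

	return count;
}

/* Walk through the empty, single-entry and descriptor cases. */
void demo_rmap_selftest(void)
{
	uint64_t a = 0, b = 0;
	struct demo_list_desc desc = {
		.more = NULL, .spte_count = 2, .sptes = { &a, &b },
	};
	struct demo_rmap_head head = { .val = 0 };

	assert(demo_list_count(&head) == 0);

	head.val = (uintptr_t)&a;		/* bit zero clear */
	assert(demo_list_count(&head) == 1);

	head.val = (uintptr_t)&desc | 1;	/* bit zero set */
	assert(demo_list_count(&head) == 2);
}
```

The tag bit is free to use because both a u64 spte pointer and a
descriptor pointer are at least 2-byte aligned, so bit zero of either
address is always clear.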

diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496edd..4825d7a56f39f 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -11,6 +11,7 @@
#include "lapic.h"
#include "mmu.h"
#include "mmu/mmu_internal.h"
+#include "mmu/shadow_mmu.h"

static int vcpu_get_timer_advance_ns(void *data, u64 *val)
{
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 35cb59737c0a3..2162dfda9601f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -117,59 +117,12 @@ bool dbg = 0;
module_param(dbg, bool, 0644);
#endif

-#define PTE_PREFETCH_NUM 8
-
#include <trace/events/kvm.h>

-/* make pte_list_desc fit well in cache lines */
-#define PTE_LIST_EXT 14
-
-/*
- * Slight optimization of cacheline layout, by putting `more' and `spte_count'
- * at the start; then accessing it will only use one single cacheline for
- * either full (entries==PTE_LIST_EXT) case or entries<=6.
- */
-struct pte_list_desc {
- struct pte_list_desc *more;
- /*
- * Stores number of entries stored in the pte_list_desc. No need to be
- * u64 but just for easier alignment. When PTE_LIST_EXT, means full.
- */
- u64 spte_count;
- u64 *sptes[PTE_LIST_EXT];
-};
-
-struct kvm_shadow_walk_iterator {
- u64 addr;
- hpa_t shadow_addr;
- u64 *sptep;
- int level;
- unsigned index;
-};
-
-#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
- for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
- (_root), (_addr)); \
- shadow_walk_okay(&(_walker)); \
- shadow_walk_next(&(_walker)))
-
-#define for_each_shadow_entry(_vcpu, _addr, _walker) \
- for (shadow_walk_init(&(_walker), _vcpu, _addr); \
- shadow_walk_okay(&(_walker)); \
- shadow_walk_next(&(_walker)))
-
-#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \
- for (shadow_walk_init(&(_walker), _vcpu, _addr); \
- shadow_walk_okay(&(_walker)) && \
- ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
- __shadow_walk_next(&(_walker), spte))
-
struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
struct percpu_counter kvm_total_used_mmu_pages;

-static void mmu_spte_set(u64 *sptep, u64 spte);
-
struct kvm_mmu_role_regs {
const unsigned long cr0;
const unsigned long cr4;
@@ -265,15 +218,6 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
kvm_flush_remote_tlbs_with_range(kvm, &range);
}

-void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
- unsigned int access)
-{
- u64 spte = make_mmio_spte(vcpu, gfn, access);
-
- trace_mark_mmio_spte(sptep, gfn, spte);
- mmu_spte_set(sptep, spte);
-}
-
static gfn_t get_mmio_spte_gfn(u64 spte)
{
u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -304,310 +248,6 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
return likely(kvm_gen == spte_gen);
}

-#ifdef CONFIG_X86_64
-static void __set_spte(u64 *sptep, u64 spte)
-{
- WRITE_ONCE(*sptep, spte);
-}
-
-static void __update_clear_spte_fast(u64 *sptep, u64 spte)
-{
- WRITE_ONCE(*sptep, spte);
-}
-
-static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
-{
- return xchg(sptep, spte);
-}
-
-static u64 __get_spte_lockless(u64 *sptep)
-{
- return READ_ONCE(*sptep);
-}
-#else
-union split_spte {
- struct {
- u32 spte_low;
- u32 spte_high;
- };
- u64 spte;
-};
-
-static void count_spte_clear(u64 *sptep, u64 spte)
-{
- struct kvm_mmu_page *sp = sptep_to_sp(sptep);
-
- if (is_shadow_present_pte(spte))
- return;
-
- /* Ensure the spte is completely set before we increase the count */
- smp_wmb();
- sp->clear_spte_count++;
-}
-
-static void __set_spte(u64 *sptep, u64 spte)
-{
- union split_spte *ssptep, sspte;
-
- ssptep = (union split_spte *)sptep;
- sspte = (union split_spte)spte;
-
- ssptep->spte_high = sspte.spte_high;
-
- /*
- * If we map the spte from nonpresent to present, We should store
- * the high bits firstly, then set present bit, so cpu can not
- * fetch this spte while we are setting the spte.
- */
- smp_wmb();
-
- WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
-}
-
-static void __update_clear_spte_fast(u64 *sptep, u64 spte)
-{
- union split_spte *ssptep, sspte;
-
- ssptep = (union split_spte *)sptep;
- sspte = (union split_spte)spte;
-
- WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
-
- /*
- * If we map the spte from present to nonpresent, we should clear
- * present bit firstly to avoid vcpu fetch the old high bits.
- */
- smp_wmb();
-
- ssptep->spte_high = sspte.spte_high;
- count_spte_clear(sptep, spte);
-}
-
-static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
-{
- union split_spte *ssptep, sspte, orig;
-
- ssptep = (union split_spte *)sptep;
- sspte = (union split_spte)spte;
-
- /* xchg acts as a barrier before the setting of the high bits */
- orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
- orig.spte_high = ssptep->spte_high;
- ssptep->spte_high = sspte.spte_high;
- count_spte_clear(sptep, spte);
-
- return orig.spte;
-}
-
-/*
- * The idea using the light way get the spte on x86_32 guest is from
- * gup_get_pte (mm/gup.c).
- *
- * An spte tlb flush may be pending, because kvm_set_pte_rmap
- * coalesces them and we are running out of the MMU lock. Therefore
- * we need to protect against in-progress updates of the spte.
- *
- * Reading the spte while an update is in progress may get the old value
- * for the high part of the spte. The race is fine for a present->non-present
- * change (because the high part of the spte is ignored for non-present spte),
- * but for a present->present change we must reread the spte.
- *
- * All such changes are done in two steps (present->non-present and
- * non-present->present), hence it is enough to count the number of
- * present->non-present updates: if it changed while reading the spte,
- * we might have hit the race. This is done using clear_spte_count.
- */
-static u64 __get_spte_lockless(u64 *sptep)
-{
- struct kvm_mmu_page *sp = sptep_to_sp(sptep);
- union split_spte spte, *orig = (union split_spte *)sptep;
- int count;
-
-retry:
- count = sp->clear_spte_count;
- smp_rmb();
-
- spte.spte_low = orig->spte_low;
- smp_rmb();
-
- spte.spte_high = orig->spte_high;
- smp_rmb();
-
- if (unlikely(spte.spte_low != orig->spte_low ||
- count != sp->clear_spte_count))
- goto retry;
-
- return spte.spte;
-}
-#endif
-
-/* Rules for using mmu_spte_set:
- * Set the sptep from nonpresent to present.
- * Note: the sptep being assigned *must* be either not present
- * or in a state where the hardware will not attempt to update
- * the spte.
- */
-static void mmu_spte_set(u64 *sptep, u64 new_spte)
-{
- WARN_ON(is_shadow_present_pte(*sptep));
- __set_spte(sptep, new_spte);
-}
-
-/*
- * Update the SPTE (excluding the PFN), but do not track changes in its
- * accessed/dirty status.
- */
-static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
-{
- u64 old_spte = *sptep;
-
- WARN_ON(!is_shadow_present_pte(new_spte));
- check_spte_writable_invariants(new_spte);
-
- if (!is_shadow_present_pte(old_spte)) {
- mmu_spte_set(sptep, new_spte);
- return old_spte;
- }
-
- if (!spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, new_spte);
- else
- old_spte = __update_clear_spte_slow(sptep, new_spte);
-
- WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
-
- return old_spte;
-}
-
-/* Rules for using mmu_spte_update:
- * Update the state bits, it means the mapped pfn is not changed.
- *
- * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
- * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
- * spte, even though the writable spte might be cached on a CPU's TLB.
- *
- * Returns true if the TLB needs to be flushed
- */
-static bool mmu_spte_update(u64 *sptep, u64 new_spte)
-{
- bool flush = false;
- u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
-
- if (!is_shadow_present_pte(old_spte))
- return false;
-
- /*
- * For the spte updated out of mmu-lock is safe, since
- * we always atomically update it, see the comments in
- * spte_has_volatile_bits().
- */
- if (is_mmu_writable_spte(old_spte) &&
- !is_writable_pte(new_spte))
- flush = true;
-
- /*
- * Flush TLB when accessed/dirty states are changed in the page tables,
- * to guarantee consistency between TLB and page tables.
- */
-
- if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
- flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
- }
-
- if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
- flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
- }
-
- return flush;
-}
-
-/*
- * Rules for using mmu_spte_clear_track_bits:
- * It sets the sptep from present to nonpresent, and track the
- * state bits, it is used to clear the last level sptep.
- * Returns the old PTE.
- */
-static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
-{
- kvm_pfn_t pfn;
- u64 old_spte = *sptep;
- int level = sptep_to_sp(sptep)->role.level;
- struct page *page;
-
- if (!is_shadow_present_pte(old_spte) ||
- !spte_has_volatile_bits(old_spte))
- __update_clear_spte_fast(sptep, 0ull);
- else
- old_spte = __update_clear_spte_slow(sptep, 0ull);
-
- if (!is_shadow_present_pte(old_spte))
- return old_spte;
-
- kvm_update_page_stats(kvm, level, -1);
-
- pfn = spte_to_pfn(old_spte);
-
- /*
- * KVM doesn't hold a reference to any pages mapped into the guest, and
- * instead uses the mmu_notifier to ensure that KVM unmaps any pages
- * before they are reclaimed. Sanity check that, if the pfn is backed
- * by a refcounted page, the refcount is elevated.
- */
- page = kvm_pfn_to_refcounted_page(pfn);
- WARN_ON(page && !page_count(page));
-
- if (is_accessed_spte(old_spte))
- kvm_set_pfn_accessed(pfn);
-
- if (is_dirty_spte(old_spte))
- kvm_set_pfn_dirty(pfn);
-
- return old_spte;
-}
-
-/*
- * Rules for using mmu_spte_clear_no_track:
- * Directly clear spte without caring the state bits of sptep,
- * it is used to set the upper level spte.
- */
-static void mmu_spte_clear_no_track(u64 *sptep)
-{
- __update_clear_spte_fast(sptep, 0ull);
-}
-
-static u64 mmu_spte_get_lockless(u64 *sptep)
-{
- return __get_spte_lockless(sptep);
-}
-
-/* Returns the Accessed status of the PTE and resets it at the same time. */
-static bool mmu_spte_age(u64 *sptep)
-{
- u64 spte = mmu_spte_get_lockless(sptep);
-
- if (!is_accessed_spte(spte))
- return false;
-
- if (spte_ad_enabled(spte)) {
- clear_bit((ffs(shadow_accessed_mask) - 1),
- (unsigned long *)sptep);
- } else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
- if (is_writable_pte(spte))
- kvm_set_pfn_dirty(spte_to_pfn(spte));
-
- spte = mark_spte_for_access_track(spte);
- mmu_spte_update_no_track(sptep, spte);
- }
-
- return true;
-}
-
static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
{
return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
@@ -678,77 +318,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

-static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
-{
- kmem_cache_free(pte_list_desc_cache, pte_list_desc);
-}
-
-static bool sp_has_gptes(struct kvm_mmu_page *sp);
-
-static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
-{
- if (sp->role.passthrough)
- return sp->gfn;
-
- if (!sp->role.direct)
- return sp->shadowed_translation[index] >> PAGE_SHIFT;
-
- return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
-}
-
-/*
- * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
- * that the SPTE itself may have a more constrained access permissions that
- * what the guest enforces. For example, a guest may create an executable
- * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
- */
-static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
-{
- if (sp_has_gptes(sp))
- return sp->shadowed_translation[index] & ACC_ALL;
-
- /*
- * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
- * KVM is not shadowing any guest page tables, so the "guest access
- * permissions" are just ACC_ALL.
- *
- * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
- * is shadowing a guest huge page with small pages, the guest access
- * permissions being shadowed are the access permissions of the huge
- * page.
- *
- * In both cases, sp->role.access contains the correct access bits.
- */
- return sp->role.access;
-}
-
-static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
- gfn_t gfn, unsigned int access)
-{
- if (sp_has_gptes(sp)) {
- sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
- return;
- }
-
- WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
- "access mismatch under %s page %llx (expected %u, got %u)\n",
- sp->role.passthrough ? "passthrough" : "direct",
- sp->gfn, kvm_mmu_page_get_access(sp, index), access);
-
- WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
- "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
- sp->role.passthrough ? "passthrough" : "direct",
- sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
-}
-
-static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
- unsigned int access)
-{
- gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
-
- kvm_mmu_page_set_translation(sp, index, gfn, access);
-}
-
/*
* Return the pointer to the large page information for a given gfn,
* handling slots that are not large page aligned.
@@ -785,28 +354,6 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
update_gfn_disallow_lpage_count(slot, gfn, -1);
}

-static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- struct kvm_memslots *slots;
- struct kvm_memory_slot *slot;
- gfn_t gfn;
-
- kvm->arch.indirect_shadow_pages++;
- gfn = sp->gfn;
- slots = kvm_memslots_for_spte_role(kvm, sp->role);
- slot = __gfn_to_memslot(slots, gfn);
-
- /* the non-leaf shadow pages are keeping readonly. */
- if (sp->role.level > PG_LEVEL_4K)
- return kvm_slot_page_track_add_page(kvm, slot, gfn,
- KVM_PAGE_TRACK_WRITE);
-
- kvm_mmu_gfn_disallow_lpage(slot, gfn);
-
- if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
- kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
-}
-
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
/*
@@ -834,23 +381,6 @@ void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
track_possible_nx_huge_page(kvm, sp);
}

-static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- struct kvm_memslots *slots;
- struct kvm_memory_slot *slot;
- gfn_t gfn;
-
- kvm->arch.indirect_shadow_pages--;
- gfn = sp->gfn;
- slots = kvm_memslots_for_spte_role(kvm, sp->role);
- slot = __gfn_to_memslot(slots, gfn);
- if (sp->role.level > PG_LEVEL_4K)
- return kvm_slot_page_track_remove_page(kvm, slot, gfn,
- KVM_PAGE_TRACK_WRITE);
-
- kvm_mmu_gfn_allow_lpage(slot, gfn);
-}
-
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (list_empty(&sp->possible_nx_huge_page_link))
@@ -881,436 +411,51 @@ struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
return slot;
}

-/*
- * About rmap_head encoding:
+/**
+ * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to protect
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should protect
*
- * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
- * pte_list_desc containing more mappings.
- */
-
-/*
- * Returns the number of pointers in the rmap chain, not counting the new one.
+ * Used when we do not need to care about huge page mappings.
*/
-static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
- struct kvm_rmap_head *rmap_head)
-{
- struct pte_list_desc *desc;
- int count = 0;
-
- if (!rmap_head->val) {
- rmap_printk("%p %llx 0->1\n", spte, *spte);
- rmap_head->val = (unsigned long)spte;
- } else if (!(rmap_head->val & 1)) {
- rmap_printk("%p %llx 1->many\n", spte, *spte);
- desc = kvm_mmu_memory_cache_alloc(cache);
- desc->sptes[0] = (u64 *)rmap_head->val;
- desc->sptes[1] = spte;
- desc->spte_count = 2;
- rmap_head->val = (unsigned long)desc | 1;
- ++count;
- } else {
- rmap_printk("%p %llx many->many\n", spte, *spte);
- desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
- while (desc->spte_count == PTE_LIST_EXT) {
- count += PTE_LIST_EXT;
- if (!desc->more) {
- desc->more = kvm_mmu_memory_cache_alloc(cache);
- desc = desc->more;
- desc->spte_count = 0;
- break;
- }
- desc = desc->more;
- }
- count += desc->spte_count;
- desc->sptes[desc->spte_count++] = spte;
- }
- return count;
-}
-
-static void pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
- struct pte_list_desc *desc, int i,
- struct pte_list_desc *prev_desc)
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
{
- int j = desc->spte_count - 1;
+ struct kvm_rmap_head *rmap_head;
+
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+ slot->base_gfn + gfn_offset, mask, true);

- desc->sptes[i] = desc->sptes[j];
- desc->sptes[j] = NULL;
- desc->spte_count--;
- if (desc->spte_count)
+ if (!kvm_memslots_have_rmaps(kvm))
return;
- if (!prev_desc && !desc->more)
- rmap_head->val = 0;
- else
- if (prev_desc)
- prev_desc->more = desc->more;
- else
- rmap_head->val = (unsigned long)desc->more | 1;
- mmu_free_pte_list_desc(desc);
-}

-static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
-{
- struct pte_list_desc *desc;
- struct pte_list_desc *prev_desc;
- int i;
+ while (mask) {
+ rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ rmap_write_protect(rmap_head, false);

- if (!rmap_head->val) {
- pr_err("%s: %p 0->BUG\n", __func__, spte);
- BUG();
- } else if (!(rmap_head->val & 1)) {
- rmap_printk("%p 1->0\n", spte);
- if ((u64 *)rmap_head->val != spte) {
- pr_err("%s: %p 1->BUG\n", __func__, spte);
- BUG();
- }
- rmap_head->val = 0;
- } else {
- rmap_printk("%p many->many\n", spte);
- desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
- prev_desc = NULL;
- while (desc) {
- for (i = 0; i < desc->spte_count; ++i) {
- if (desc->sptes[i] == spte) {
- pte_list_desc_remove_entry(rmap_head,
- desc, i, prev_desc);
- return;
- }
- }
- prev_desc = desc;
- desc = desc->more;
- }
- pr_err("%s: %p many->many\n", __func__, spte);
- BUG();
+ /* clear the first set bit */
+ mask &= mask - 1;
}
}

-static void kvm_zap_one_rmap_spte(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head, u64 *sptep)
-{
- mmu_spte_clear_track_bits(kvm, sptep);
- pte_list_remove(sptep, rmap_head);
-}
-
-/* Return true if at least one SPTE was zapped, false otherwise */
-static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head)
-{
- struct pte_list_desc *desc, *next;
- int i;
-
- if (!rmap_head->val)
- return false;
-
- if (!(rmap_head->val & 1)) {
- mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
- goto out;
- }
-
- desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
- for (; desc; desc = next) {
- for (i = 0; i < desc->spte_count; i++)
- mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
- next = desc->more;
- mmu_free_pte_list_desc(desc);
- }
-out:
- /* rmap_head is meaningless now, remember to reset it */
- rmap_head->val = 0;
- return true;
-}
-
-unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
-{
- struct pte_list_desc *desc;
- unsigned int count = 0;
-
- if (!rmap_head->val)
- return 0;
- else if (!(rmap_head->val & 1))
- return 1;
-
- desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
- while (desc) {
- count += desc->spte_count;
- desc = desc->more;
- }
-
- return count;
-}
-
-static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
- const struct kvm_memory_slot *slot)
-{
- unsigned long idx;
-
- idx = gfn_to_index(gfn, slot->base_gfn, level);
- return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
-}
-
-static bool rmap_can_add(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu_memory_cache *mc;
-
- mc = &vcpu->arch.mmu_pte_list_desc_cache;
- return kvm_mmu_memory_cache_nr_free_objects(mc);
-}
-
-static void rmap_remove(struct kvm *kvm, u64 *spte)
-{
- struct kvm_memslots *slots;
- struct kvm_memory_slot *slot;
- struct kvm_mmu_page *sp;
- gfn_t gfn;
- struct kvm_rmap_head *rmap_head;
-
- sp = sptep_to_sp(spte);
- gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte));
-
- /*
- * Unlike rmap_add, rmap_remove does not run in the context of a vCPU
- * so we have to determine which memslots to use based on context
- * information in sp->role.
- */
- slots = kvm_memslots_for_spte_role(kvm, sp->role);
-
- slot = __gfn_to_memslot(slots, gfn);
- rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-
- pte_list_remove(spte, rmap_head);
-}
-
-/*
- * Used by the following functions to iterate through the sptes linked by a
- * rmap. All fields are private and not assumed to be used outside.
- */
-struct rmap_iterator {
- /* private fields */
- struct pte_list_desc *desc; /* holds the sptep if not NULL */
- int pos; /* index of the sptep */
-};
-
-/*
- * Iteration must be started by this function. This should also be used after
- * removing/dropping sptes from the rmap link because in such cases the
- * information in the iterator may not be valid.
- *
- * Returns sptep if found, NULL otherwise.
- */
-static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
- struct rmap_iterator *iter)
-{
- u64 *sptep;
-
- if (!rmap_head->val)
- return NULL;
-
- if (!(rmap_head->val & 1)) {
- iter->desc = NULL;
- sptep = (u64 *)rmap_head->val;
- goto out;
- }
-
- iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
- iter->pos = 0;
- sptep = iter->desc->sptes[iter->pos];
-out:
- BUG_ON(!is_shadow_present_pte(*sptep));
- return sptep;
-}
-
-/*
- * Must be used with a valid iterator: e.g. after rmap_get_first().
- *
- * Returns sptep if found, NULL otherwise.
- */
-static u64 *rmap_get_next(struct rmap_iterator *iter)
-{
- u64 *sptep;
-
- if (iter->desc) {
- if (iter->pos < PTE_LIST_EXT - 1) {
- ++iter->pos;
- sptep = iter->desc->sptes[iter->pos];
- if (sptep)
- goto out;
- }
-
- iter->desc = iter->desc->more;
-
- if (iter->desc) {
- iter->pos = 0;
- /* desc->sptes[0] cannot be NULL */
- sptep = iter->desc->sptes[iter->pos];
- goto out;
- }
- }
-
- return NULL;
-out:
- BUG_ON(!is_shadow_present_pte(*sptep));
- return sptep;
-}
-
-#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \
- for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \
- _spte_; _spte_ = rmap_get_next(_iter_))
-
-static void drop_spte(struct kvm *kvm, u64 *sptep)
-{
- u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep);
-
- if (is_shadow_present_pte(old_spte))
- rmap_remove(kvm, sptep);
-}
-
-static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
-{
- struct kvm_mmu_page *sp;
-
- sp = sptep_to_sp(sptep);
- WARN_ON(sp->role.level == PG_LEVEL_4K);
-
- drop_spte(kvm, sptep);
-
- if (flush)
- kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
- KVM_PAGES_PER_HPAGE(sp->role.level));
-}
-
-/*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
- *
- * Note: write protection is difference between dirty logging and spte
- * protection:
- * - for dirty logging, the spte can be set to writable at anytime if
- * its dirty bitmap is properly set.
- * - for spte protection, the spte can be writable only after unsync-ing
- * shadow page.
- *
- * Return true if tlb need be flushed.
- */
-static bool spte_write_protect(u64 *sptep, bool pt_protect)
-{
- u64 spte = *sptep;
-
- if (!is_writable_pte(spte) &&
- !(pt_protect && is_mmu_writable_spte(spte)))
- return false;
-
- rmap_printk("spte %p %llx\n", sptep, *sptep);
-
- if (pt_protect)
- spte &= ~shadow_mmu_writable_mask;
- spte = spte & ~PT_WRITABLE_MASK;
-
- return mmu_spte_update(sptep, spte);
-}
-
-static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,
- bool pt_protect)
-{
- u64 *sptep;
- struct rmap_iterator iter;
- bool flush = false;
-
- for_each_rmap_spte(rmap_head, &iter, sptep)
- flush |= spte_write_protect(sptep, pt_protect);
-
- return flush;
-}
-
-static bool spte_clear_dirty(u64 *sptep)
-{
- u64 spte = *sptep;
-
- rmap_printk("spte %p %llx\n", sptep, *sptep);
-
- MMU_WARN_ON(!spte_ad_enabled(spte));
- spte &= ~shadow_dirty_mask;
- return mmu_spte_update(sptep, spte);
-}
-
-static bool spte_wrprot_for_clear_dirty(u64 *sptep)
-{
- bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
- (unsigned long *)sptep);
- if (was_writable && !spte_ad_enabled(*sptep))
- kvm_set_pfn_dirty(spte_to_pfn(*sptep));
-
- return was_writable;
-}
-
-/*
- * Gets the GFN ready for another round of dirty logging by clearing the
- * - D bit on ad-enabled SPTEs, and
- * - W bit on ad-disabled SPTEs.
- * Returns true iff any D or W bits were cleared.
- */
-static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
-{
- u64 *sptep;
- struct rmap_iterator iter;
- bool flush = false;
-
- for_each_rmap_spte(rmap_head, &iter, sptep)
- if (spte_ad_need_write_protect(*sptep))
- flush |= spte_wrprot_for_clear_dirty(sptep);
- else
- flush |= spte_clear_dirty(sptep);
-
- return flush;
-}
-
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
- * @kvm: kvm instance
- * @slot: slot to protect
- * @gfn_offset: start of the BITS_PER_LONG pages we care about
- * @mask: indicates which pages we should protect
- *
- * Used when we do not need to care about huge page mappings.
- */
-static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
-{
- struct kvm_rmap_head *rmap_head;
-
- if (tdp_mmu_enabled)
- kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
- slot->base_gfn + gfn_offset, mask, true);
-
- if (!kvm_memslots_have_rmaps(kvm))
- return;
-
- while (mask) {
- rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PG_LEVEL_4K, slot);
- rmap_write_protect(rmap_head, false);
-
- /* clear the first set bit */
- mask &= mask - 1;
- }
-}
-
-/**
- * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write
- * protect the page if the D-bit isn't supported.
- * @kvm: kvm instance
- * @slot: slot to clear D-bit
- * @gfn_offset: start of the BITS_PER_LONG pages we care about
- * @mask: indicates which pages we should clear D-bit
- *
- * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap.
- */
-static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
+/**
+ * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write
+ * protect the page if the D-bit isn't supported.
+ * @kvm: kvm instance
+ * @slot: slot to clear D-bit
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should clear D-bit
+ *
+ * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap.
+ */
+static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
{
struct kvm_rmap_head *rmap_head;

@@ -1412,147 +557,6 @@ bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
}

-static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
-{
- return kvm_zap_all_rmap_sptes(kvm, rmap_head);
-}
-
-static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused)
-{
- return __kvm_zap_rmap(kvm, rmap_head, slot);
-}
-
-static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t pte)
-{
- u64 *sptep;
- struct rmap_iterator iter;
- bool need_flush = false;
- u64 new_spte;
- kvm_pfn_t new_pfn;
-
- WARN_ON(pte_huge(pte));
- new_pfn = pte_pfn(pte);
-
-restart:
- for_each_rmap_spte(rmap_head, &iter, sptep) {
- rmap_printk("spte %p %llx gfn %llx (%d)\n",
- sptep, *sptep, gfn, level);
-
- need_flush = true;
-
- if (pte_write(pte)) {
- kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
- goto restart;
- } else {
- new_spte = kvm_mmu_changed_pte_notifier_make_spte(
- *sptep, new_pfn);
-
- mmu_spte_clear_track_bits(kvm, sptep);
- mmu_spte_set(sptep, new_spte);
- }
- }
-
- if (need_flush && kvm_available_flush_tlb_with_range()) {
- kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
- return false;
- }
-
- return need_flush;
-}
-
-struct slot_rmap_walk_iterator {
- /* input fields. */
- const struct kvm_memory_slot *slot;
- gfn_t start_gfn;
- gfn_t end_gfn;
- int start_level;
- int end_level;
-
- /* output fields. */
- gfn_t gfn;
- struct kvm_rmap_head *rmap;
- int level;
-
- /* private field. */
- struct kvm_rmap_head *end_rmap;
-};
-
-static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator,
- int level)
-{
- iterator->level = level;
- iterator->gfn = iterator->start_gfn;
- iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot);
- iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
-}
-
-static void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
- const struct kvm_memory_slot *slot,
- int start_level, int end_level,
- gfn_t start_gfn, gfn_t end_gfn)
-{
- iterator->slot = slot;
- iterator->start_level = start_level;
- iterator->end_level = end_level;
- iterator->start_gfn = start_gfn;
- iterator->end_gfn = end_gfn;
-
- rmap_walk_init_level(iterator, iterator->start_level);
-}
-
-static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator)
-{
- return !!iterator->rmap;
-}
-
-static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
-{
- while (++iterator->rmap <= iterator->end_rmap) {
- iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
-
- if (iterator->rmap->val)
- return;
- }
-
- if (++iterator->level > iterator->end_level) {
- iterator->rmap = NULL;
- return;
- }
-
- rmap_walk_init_level(iterator, iterator->level);
-}
-
-#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \
- _start_gfn, _end_gfn, _iter_) \
- for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \
- _end_level_, _start_gfn, _end_gfn); \
- slot_rmap_walk_okay(_iter_); \
- slot_rmap_walk_next(_iter_))
-
-typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, pte_t pte);
-
-static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
- struct kvm_gfn_range *range,
- rmap_handler_t handler)
-{
- struct slot_rmap_walk_iterator iterator;
- bool ret = false;
-
- for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
- range->start, range->end - 1, &iterator)
- ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
- iterator.level, range->pte);
-
- return ret;
-}
-
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
@@ -1579,68 +583,6 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return flush;
}

-static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused)
-{
- u64 *sptep;
- struct rmap_iterator iter;
- int young = 0;
-
- for_each_rmap_spte(rmap_head, &iter, sptep)
- young |= mmu_spte_age(sptep);
-
- return young;
-}
-
-static bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, pte_t unused)
-{
- u64 *sptep;
- struct rmap_iterator iter;
-
- for_each_rmap_spte(rmap_head, &iter, sptep)
- if (is_accessed_spte(*sptep))
- return true;
- return false;
-}
-
-#define RMAP_RECYCLE_THRESHOLD 1000
-
-static void __rmap_add(struct kvm *kvm,
- struct kvm_mmu_memory_cache *cache,
- const struct kvm_memory_slot *slot,
- u64 *spte, gfn_t gfn, unsigned int access)
-{
- struct kvm_mmu_page *sp;
- struct kvm_rmap_head *rmap_head;
- int rmap_count;
-
- sp = sptep_to_sp(spte);
- kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access);
- kvm_update_page_stats(kvm, sp->role.level, 1);
-
- rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
- rmap_count = pte_list_add(cache, spte, rmap_head);
-
- if (rmap_count > kvm->stat.max_mmu_rmap_size)
- kvm->stat.max_mmu_rmap_size = rmap_count;
- if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
- kvm_zap_all_rmap_sptes(kvm, rmap_head);
- kvm_flush_remote_tlbs_with_address(
- kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
- }
-}
-
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
- u64 *spte, gfn_t gfn, unsigned int access)
-{
- struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;
-
- __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
-}
-
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -1667,2315 +609,571 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return young;
}

-#ifdef MMU_DEBUG
-static int is_empty_shadow_page(u64 *spt)
+bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list,
+ bool remote_flush)
{
- u64 *pos;
- u64 *end;
+ if (!remote_flush && list_empty(invalid_list))
+ return false;

- for (pos = spt, end = pos + SPTE_ENT_PER_PAGE; pos != end; pos++)
- if (is_shadow_present_pte(*pos)) {
- printk(KERN_ERR "%s: %p %llx\n", __func__,
- pos, *pos);
- return 0;
- }
- return 1;
+ if (!list_empty(invalid_list))
+ kvm_mmu_commit_zap_page(kvm, invalid_list);
+ else
+ kvm_flush_remote_tlbs(kvm);
+ return true;
}
-#endif

-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values. We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
+bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm->arch.n_used_mmu_pages += nr;
- percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
+ if (sp->role.invalid)
+ return true;

-static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- kvm_mod_used_mmu_pages(kvm, +1);
- kvm_account_pgtable_pages((void *)sp->spt, +1);
+ /* TDP MMU pages do not use the MMU generation. */
+ return !is_tdp_mmu_page(sp) &&
+ unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

-static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+/*
+ * Lookup the mapping level for @gfn in the current mm.
+ *
+ * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
+ * consumer to be tied into KVM's handlers for MMU notifier events!
+ *
+ * There are several ways to safely use this helper:
+ *
+ * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * consuming it. In this case, mmu_lock doesn't need to be held during the
+ * lookup, but it does need to be held while checking the MMU notifier.
+ *
+ * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
+ * event for the hva. This can be done by explicitly checking the MMU notifier
+ * or by ensuring that KVM already has a valid mapping that covers the hva.
+ *
+ * - Do not use the result to install new mappings, e.g. use the host mapping
+ * level only to decide whether or not to zap an entry. In this case, it's
+ * not required to hold mmu_lock (though it's highly likely the caller will
+ * want to hold mmu_lock anyways, e.g. to modify SPTEs).
+ *
+ * Note! The lookup can still race with modifications to host page tables, but
+ * the above "rules" ensure KVM will not _consume_ the result of the walk if a
+ * race with the primary MMU occurs.
+ */
+static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
+ const struct kvm_memory_slot *slot)
{
- kvm_mod_used_mmu_pages(kvm, -1);
- kvm_account_pgtable_pages((void *)sp->spt, -1);
-}
+ int level = PG_LEVEL_4K;
+ unsigned long hva;
+ unsigned long flags;
+ pgd_t pgd;
+ p4d_t p4d;
+ pud_t pud;
+ pmd_t pmd;

-static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
-{
- MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
- hlist_del(&sp->hash_link);
- list_del(&sp->link);
- free_page((unsigned long)sp->spt);
- if (!sp->role.direct)
- free_page((unsigned long)sp->shadowed_translation);
- kmem_cache_free(mmu_page_header_cache, sp);
-}
+ /*
+ * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
+ * is not solely for performance, it's also necessary to avoid the
+ * "writable" check in __gfn_to_hva_many(), which will always fail on
+ * read-only memslots due to gfn_to_hva() assuming writes. Earlier
+ * page fault steps have already verified the guest isn't writing a
+ * read-only memslot.
+ */
+ hva = __gfn_to_hva_memslot(slot, gfn);

-static unsigned kvm_page_table_hashfn(gfn_t gfn)
-{
- return hash_64(gfn, KVM_MMU_HASH_SHIFT);
-}
+ /*
+ * Disable IRQs to prevent concurrent tear down of host page tables,
+ * e.g. if the primary MMU promotes a P*D to a huge page and then frees
+ * the original page table.
+ */
+ local_irq_save(flags);

-static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
- struct kvm_mmu_page *sp, u64 *parent_pte)
-{
- if (!parent_pte)
- return;
+ /*
+ * Read each entry once. As above, a non-leaf entry can be promoted to
+ * a huge page _during_ this walk. Re-reading the entry could send the
+ * walk into the weeds, e.g. p*d_large() returns false (sees the old
+ * value) and then p*d_offset() walks into the target huge page instead
+ * of the old page table (sees the new value).
+ */
+ pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
+ if (pgd_none(pgd))
+ goto out;

- pte_list_add(cache, parent_pte, &sp->parent_ptes);
-}
+ p4d = READ_ONCE(*p4d_offset(&pgd, hva));
+ if (p4d_none(p4d) || !p4d_present(p4d))
+ goto out;

-static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
- u64 *parent_pte)
-{
- pte_list_remove(parent_pte, &sp->parent_ptes);
-}
+ pud = READ_ONCE(*pud_offset(&p4d, hva));
+ if (pud_none(pud) || !pud_present(pud))
+ goto out;

-static void drop_parent_pte(struct kvm_mmu_page *sp,
- u64 *parent_pte)
-{
- mmu_page_remove_parent_pte(sp, parent_pte);
- mmu_spte_clear_no_track(parent_pte);
+ if (pud_large(pud)) {
+ level = PG_LEVEL_1G;
+ goto out;
+ }
+
+ pmd = READ_ONCE(*pmd_offset(&pud, hva));
+ if (pmd_none(pmd) || !pmd_present(pmd))
+ goto out;
+
+ if (pmd_large(pmd))
+ level = PG_LEVEL_2M;
+
+out:
+ local_irq_restore(flags);
+ return level;
}

-static void mark_unsync(u64 *spte);
-static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ int max_level)
{
- u64 *sptep;
- struct rmap_iterator iter;
+ struct kvm_lpage_info *linfo;
+ int host_level;

- for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
- mark_unsync(sptep);
+ max_level = min(max_level, max_huge_page_level);
+ for ( ; max_level > PG_LEVEL_4K; max_level--) {
+ linfo = lpage_info_slot(gfn, slot, max_level);
+ if (!linfo->disallow_lpage)
+ break;
}
-}

-static void mark_unsync(u64 *spte)
-{
- struct kvm_mmu_page *sp;
+ if (max_level == PG_LEVEL_4K)
+ return PG_LEVEL_4K;

- sp = sptep_to_sp(spte);
- if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
- return;
- if (sp->unsync_children++)
- return;
- kvm_mmu_mark_parents_unsync(sp);
+ host_level = host_pfn_mapping_level(kvm, gfn, slot);
+ return min(host_level, max_level);
}

-static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp)
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- return -1;
-}
+ struct kvm_memory_slot *slot = fault->slot;
+ kvm_pfn_t mask;

-#define KVM_PAGE_ARRAY_NR 16
+ fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled;

-struct kvm_mmu_pages {
- struct mmu_page_and_offset {
- struct kvm_mmu_page *sp;
- unsigned int idx;
- } page[KVM_PAGE_ARRAY_NR];
- unsigned int nr;
-};
+ if (unlikely(fault->max_level == PG_LEVEL_4K))
+ return;

-static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
- int idx)
-{
- int i;
+ if (is_error_noslot_pfn(fault->pfn))
+ return;

- if (sp->unsync)
- for (i=0; i < pvec->nr; i++)
- if (pvec->page[i].sp == sp)
- return 0;
+ if (kvm_slot_dirty_track_enabled(slot))
+ return;

- pvec->page[pvec->nr].sp = sp;
- pvec->page[pvec->nr].idx = idx;
- pvec->nr++;
- return (pvec->nr == KVM_PAGE_ARRAY_NR);
-}
+ /*
+ * Enforce the iTLB multihit workaround after capturing the requested
+ * level, which will be used to do precise, accurate accounting.
+ */
+ fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
+ fault->gfn, fault->max_level);
+ if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
+ return;

-static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
-{
- --sp->unsync_children;
- WARN_ON((int)sp->unsync_children < 0);
- __clear_bit(idx, sp->unsync_child_bitmap);
+ /*
+ * mmu_invalidate_retry() was successful and mmu_lock is held, so
+ * the pmd can't be split from under us.
+ */
+ fault->goal_level = fault->req_level;
+ mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1;
+ VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask));
+ fault->pfn &= ~mask;
}

-static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
- struct kvm_mmu_pages *pvec)
+void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
{
- int i, ret, nr_unsync_leaf = 0;
-
- for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
- struct kvm_mmu_page *child;
- u64 ent = sp->spt[i];
-
- if (!is_shadow_present_pte(ent) || is_large_pte(ent)) {
- clear_unsync_child_bit(sp, i);
- continue;
- }
-
- child = spte_to_child_sp(ent);
-
- if (child->unsync_children) {
- if (mmu_pages_add(pvec, child, i))
- return -ENOSPC;
-
- ret = __mmu_unsync_walk(child, pvec);
- if (!ret) {
- clear_unsync_child_bit(sp, i);
- continue;
- } else if (ret > 0) {
- nr_unsync_leaf += ret;
- } else
- return ret;
- } else if (child->unsync) {
- nr_unsync_leaf++;
- if (mmu_pages_add(pvec, child, i))
- return -ENOSPC;
- } else
- clear_unsync_child_bit(sp, i);
+ if (cur_level > PG_LEVEL_4K &&
+ cur_level == fault->goal_level &&
+ is_shadow_present_pte(spte) &&
+ !is_large_pte(spte) &&
+ spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+ /*
+ * A small SPTE exists for this pfn, but FNAME(fetch),
+ * direct_map(), or kvm_tdp_mmu_map() would like to create a
+ * large PTE instead: just force them to go down another level,
+ * patching back for them into pfn the next 9 bits of the
+ * address.
+ */
+ u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+ KVM_PAGES_PER_HPAGE(cur_level - 1);
+ fault->pfn |= fault->gfn & page_mask;
+ fault->goal_level--;
}
-
- return nr_unsync_leaf;
}

-#define INVALID_INDEX (-1)
-
-static int mmu_unsync_walk(struct kvm_mmu_page *sp,
- struct kvm_mmu_pages *pvec)
+static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
{
- pvec->nr = 0;
- if (!sp->unsync_children)
- return 0;
+ unsigned long hva = gfn_to_hva_memslot(slot, gfn);

- mmu_pages_add(pvec, sp, INVALID_INDEX);
- return __mmu_unsync_walk(sp, pvec);
+ send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
}

-static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- WARN_ON(!sp->unsync);
- trace_kvm_mmu_sync_page(sp);
- sp->unsync = 0;
- --kvm->stat.mmu_unsync;
-}
-
-static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list);
-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
- struct list_head *invalid_list);
+ if (is_sigpending_pfn(fault->pfn)) {
+ kvm_handle_signal_exit(vcpu);
+ return -EINTR;
+ }

-static bool sp_has_gptes(struct kvm_mmu_page *sp)
-{
- if (sp->role.direct)
- return false;
+ /*
+ * Do not cache the mmio info caused by writing the readonly gfn
+ * into the spte, otherwise a read access on the readonly gfn could
+ * also cause an mmio page fault and be treated as an mmio access.
+ */
+ if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+ return RET_PF_EMULATE;

- if (sp->role.passthrough)
- return false;
+ if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
+ kvm_send_hwpoison_signal(fault->slot, fault->gfn);
+ return RET_PF_RETRY;
+ }

- return true;
+ return -EFAULT;
}

-#define for_each_valid_sp(_kvm, _sp, _list) \
- hlist_for_each_entry(_sp, _list, hash_link) \
- if (is_obsolete_sp((_kvm), (_sp))) { \
- } else
+static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ unsigned int access)
+{
+ gva_t gva = fault->is_tdp ? 0 : fault->addr;

-#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
- for_each_valid_sp(_kvm, _sp, \
- &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
- if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
+ vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
+ access & shadow_mmio_access_mask);

-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
-{
- int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
+ /*
+ * If MMIO caching is disabled, emulate immediately without
+ * touching the shadow page tables as attempting to install an
+ * MMIO SPTE will just be an expensive nop.
+ */
+ if (unlikely(!enable_mmio_caching))
+ return RET_PF_EMULATE;

- if (ret < 0)
- kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
- return ret;
+ /*
+ * Do not create an MMIO SPTE for a gfn greater than host.MAXPHYADDR,
+ * any guest that generates such gfns is running nested and is being
+ * tricked by L0 userspace (you can observe gfn > L1.MAXPHYADDR if and
+ * only if L1's MAXPHYADDR is inaccurate with respect to the
+ * hardware's).
+ */
+ if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+ return RET_PF_EMULATE;
+
+ return RET_PF_CONTINUE;
}

-bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list,
- bool remote_flush)
+static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
{
- if (!remote_flush && list_empty(invalid_list))
+ /*
+ * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
+ * reach the common page fault handler if the SPTE has an invalid MMIO
+ * generation number. Refreshing the MMIO generation needs to go down
+ * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag!
+ */
+ if (fault->rsvd)
return false;

- if (!list_empty(invalid_list))
- kvm_mmu_commit_zap_page(kvm, invalid_list);
- else
- kvm_flush_remote_tlbs(kvm);
- return true;
-}
-
-bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- if (sp->role.invalid)
- return true;
-
- /* TDP MMU pages do not use the MMU generation. */
- return !is_tdp_mmu_page(sp) &&
- unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
-struct mmu_page_path {
- struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
- unsigned int idx[PT64_ROOT_MAX_LEVEL];
-};
-
-#define for_each_sp(pvec, sp, parents, i) \
- for (i = mmu_pages_first(&pvec, &parents); \
- i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \
- i = mmu_pages_next(&pvec, &parents, i))
-
-static int mmu_pages_next(struct kvm_mmu_pages *pvec,
- struct mmu_page_path *parents,
- int i)
-{
- int n;
-
- for (n = i+1; n < pvec->nr; n++) {
- struct kvm_mmu_page *sp = pvec->page[n].sp;
- unsigned idx = pvec->page[n].idx;
- int level = sp->role.level;
-
- parents->idx[level-1] = idx;
- if (level == PG_LEVEL_4K)
- break;
-
- parents->parent[level-2] = sp;
- }
+ /*
+ * #PF can be fast if:
+ *
+ * 1. The shadow page table entry is not present and A/D bits are
+ * disabled _by KVM_, which could mean that the fault is potentially
+ * caused by access tracking (if enabled). If A/D bits are enabled
+ * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
+ * bits for L2 and employ access tracking, but the fast page fault
+ * mechanism only supports direct MMUs.
+ * 2. The shadow page table entry is present, the access is a write,
+ * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
+ * the fault was caused by a write-protection violation. If the
+ * SPTE is MMU-writable (determined later), the fault can be fixed
+ * by setting the Writable bit, which can be done out of mmu_lock.
+ */
+ if (!fault->present)
+ return !kvm_ad_enabled();

- return n;
+ /*
+ * Note, instruction fetches and writes are mutually exclusive, ignore
+ * the "exec" flag.
+ */
+ return fault->write;
}

-static int mmu_pages_first(struct kvm_mmu_pages *pvec,
- struct mmu_page_path *parents)
+/*
+ * Returns true if the SPTE was fixed successfully. Otherwise,
+ * someone else modified the SPTE from its original value.
+ */
+static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ u64 *sptep, u64 old_spte, u64 new_spte)
{
- struct kvm_mmu_page *sp;
- int level;
-
- if (pvec->nr == 0)
- return 0;
-
- WARN_ON(pvec->page[0].idx != INVALID_INDEX);
-
- sp = pvec->page[0].sp;
- level = sp->role.level;
- WARN_ON(level == PG_LEVEL_4K);
+ /*
+ * Theoretically we could also set dirty bit (and flush TLB) here in
+ * order to eliminate unnecessary PML logging. See comments in
+ * set_spte. But fast_page_fault is very unlikely to happen with PML
+ * enabled, so we do not do this. This might result in the same GPA
+ * being logged in the PML buffer again when the write really happens,
+ * and eventually being passed to mark_page_dirty twice, but that does
+ * no harm. This also avoids the TLB flush needed after setting the
+ * dirty bit, so non-PML cases won't be impacted.
+ *
+ * Compare with set_spte where instead shadow_dirty_mask is set.
+ */
+ if (!try_cmpxchg64(sptep, &old_spte, new_spte))
+ return false;

- parents->parent[level-2] = sp;
+ if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+ mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);

- /* Also set up a sentinel. Further entries in pvec are all
- * children of sp, so this element is never overwritten.
- */
- parents->parent[level-1] = NULL;
- return mmu_pages_next(pvec, parents, 0);
+ return true;
}

-static void mmu_pages_clear_parents(struct mmu_page_path *parents)
+static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
{
- struct kvm_mmu_page *sp;
- unsigned int level = 0;
+ if (fault->exec)
+ return is_executable_pte(spte);

- do {
- unsigned int idx = parents->idx[level];
- sp = parents->parent[level];
- if (!sp)
- return;
+ if (fault->write)
+ return is_writable_pte(spte);

- WARN_ON(idx == INVALID_INDEX);
- clear_unsync_child_bit(sp, idx);
- level++;
- } while (!sp->unsync_children);
+ /* Fault was on Read access */
+ return spte & PT_PRESENT_MASK;
}

-static int mmu_sync_children(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *parent, bool can_yield)
+/*
+ * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
+ */
+static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- int i;
struct kvm_mmu_page *sp;
- struct mmu_page_path parents;
- struct kvm_mmu_pages pages;
- LIST_HEAD(invalid_list);
- bool flush = false;
-
- while (mmu_unsync_walk(parent, &pages)) {
- bool protected = false;
-
- for_each_sp(pages, sp, parents, i)
- protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
-
- if (protected) {
- kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
- flush = false;
- }
+ int ret = RET_PF_INVALID;
+ u64 spte = 0ull;
+ u64 *sptep = NULL;
+ uint retry_count = 0;

- for_each_sp(pages, sp, parents, i) {
- kvm_unlink_unsync_page(vcpu->kvm, sp);
- flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0;
- mmu_pages_clear_parents(&parents);
- }
- if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
- kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
- if (!can_yield) {
- kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
- return -EINTR;
- }
+ if (!page_fault_can_be_fast(fault))
+ return ret;

- cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
- flush = false;
- }
- }
+ walk_shadow_page_lockless_begin(vcpu);

- kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
- return 0;
-}
+ do {
+ u64 new_spte;

-static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
-{
- atomic_set(&sp->write_flooding_count, 0);
-}
+ if (tdp_mmu_enabled)
+ sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
+ else
+ sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);

-static void clear_sp_write_flooding_count(u64 *spte)
-{
- __clear_sp_write_flooding_count(sptep_to_sp(spte));
-}
+ if (!is_shadow_present_pte(spte))
+ break;

-/*
- * The vCPU is required when finding indirect shadow pages; the shadow
- * page may already exist and syncing it needs the vCPU pointer in
- * order to read guest page tables. Direct shadow pages are never
- * unsync, thus @vcpu can be NULL if @role.direct is true.
- */
-static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
- struct kvm_vcpu *vcpu,
- gfn_t gfn,
- struct hlist_head *sp_list,
- union kvm_mmu_page_role role)
-{
- struct kvm_mmu_page *sp;
- int ret;
- int collisions = 0;
- LIST_HEAD(invalid_list);
+ sp = sptep_to_sp(sptep);
+ if (!is_last_spte(spte, sp->role.level))
+ break;

- for_each_valid_sp(kvm, sp, sp_list) {
- if (sp->gfn != gfn) {
- collisions++;
- continue;
+ /*
+ * Check whether the memory access that caused the fault would
+ * still cause it if it were to be performed right now. If not,
+ * then this is a spurious fault caused by a lazily flushed TLB,
+ * or some other CPU has already fixed the PTE after the
+ * current CPU took the fault.
+ *
+ * Need not check the access of upper level table entries since
+ * they are always ACC_ALL.
+ */
+ if (is_access_allowed(fault, spte)) {
+ ret = RET_PF_SPURIOUS;
+ break;
}

- if (sp->role.word != role.word) {
- /*
- * If the guest is creating an upper-level page, zap
- * unsync pages for the same gfn. While it's possible
- * the guest is using recursive page tables, in all
- * likelihood the guest has stopped using the unsync
- * page and is installing a completely unrelated page.
- * Unsync pages must not be left as is, because the new
- * upper-level page will be write-protected.
- */
- if (role.level > PG_LEVEL_4K && sp->unsync)
- kvm_mmu_prepare_zap_page(kvm, sp,
- &invalid_list);
- continue;
- }
+ new_spte = spte;

- /* unsync and write-flooding only apply to indirect SPs. */
- if (sp->role.direct)
- goto out;
+ /*
+ * KVM only supports fixing page faults outside of MMU lock for
+ * direct MMUs, nested MMUs are always indirect, and KVM always
+ * uses A/D bits for non-nested MMUs. Thus, if A/D bits are
+ * enabled, the SPTE can't be an access-tracked SPTE.
+ */
+ if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
+ new_spte = restore_acc_track_spte(new_spte);

- if (sp->unsync) {
- if (KVM_BUG_ON(!vcpu, kvm))
- break;
+ /*
+ * To keep things simple, only SPTEs that are MMU-writable can
+ * be made fully writable outside of mmu_lock, e.g. only SPTEs
+ * that were write-protected for dirty-logging or access
+ * tracking are handled here. Don't bother checking if the
+ * SPTE is writable to prioritize running with A/D bits enabled.
+ * The is_access_allowed() check above handles the common case
+ * of the fault being spurious, and the SPTE is known to be
+ * shadow-present, i.e. except for access tracking restoration
+ * making the new SPTE writable, the check is wasteful.
+ */
+ if (fault->write && is_mmu_writable_spte(spte)) {
+ new_spte |= PT_WRITABLE_MASK;

/*
- * The page is good, but is stale. kvm_sync_page does
- * get the latest guest state, but (unlike mmu_unsync_children)
- * it doesn't write-protect the page or mark it synchronized!
- * This way the validity of the mapping is ensured, but the
- * overhead of write protection is not incurred until the
- * guest invalidates the TLB mapping. This allows multiple
- * SPs for a single gfn to be unsync.
+ * Do not fix write-permission on the large spte when
+ * dirty logging is enabled. Since we only dirty the
+ * first page into the dirty-bitmap in
+ * fast_pf_fix_direct_spte(), other pages are missed
+ * if its slot has dirty logging enabled.
*
- * If the sync fails, the page is zapped. If so, break
- * in order to rebuild it.
+ * Instead, we let the slow page fault path create a
+ * normal spte to fix the access.
*/
- ret = kvm_sync_page(vcpu, sp, &invalid_list);
- if (ret < 0)
+ if (sp->role.level > PG_LEVEL_4K &&
+ kvm_slot_dirty_track_enabled(fault->slot))
break;
-
- WARN_ON(!list_empty(&invalid_list));
- if (ret > 0)
- kvm_flush_remote_tlbs(kvm);
}

- __clear_sp_write_flooding_count(sp);
-
- goto out;
- }
-
- sp = NULL;
- ++kvm->stat.mmu_cache_miss;
-
-out:
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
-
- if (collisions > kvm->stat.max_mmu_page_hash_collisions)
- kvm->stat.max_mmu_page_hash_collisions = collisions;
- return sp;
-}
-
-/* Caches used when allocating a new shadow page. */
-struct shadow_page_caches {
- struct kvm_mmu_memory_cache *page_header_cache;
- struct kvm_mmu_memory_cache *shadow_page_cache;
- struct kvm_mmu_memory_cache *shadowed_info_cache;
-};
-
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
- struct shadow_page_caches *caches,
- gfn_t gfn,
- struct hlist_head *sp_list,
- union kvm_mmu_page_role role)
-{
- struct kvm_mmu_page *sp;
-
- sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
- if (!role.direct)
- sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
-
- set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
- INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
-
- /*
- * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
- * depends on valid pages being added to the head of the list. See
- * comments in kvm_zap_obsolete_pages().
- */
- sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
- list_add(&sp->link, &kvm->arch.active_mmu_pages);
- kvm_account_mmu_page(kvm, sp);
-
- sp->gfn = gfn;
- sp->role = role;
- hlist_add_head(&sp->hash_link, sp_list);
- if (sp_has_gptes(sp))
- account_shadowed(kvm, sp);
-
- return sp;
-}
-
-/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
-static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
- struct kvm_vcpu *vcpu,
- struct shadow_page_caches *caches,
- gfn_t gfn,
- union kvm_mmu_page_role role)
-{
- struct hlist_head *sp_list;
- struct kvm_mmu_page *sp;
- bool created = false;
-
- sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
-
- sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
- if (!sp) {
- created = true;
- sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
- }
-
- trace_kvm_mmu_get_page(sp, created);
- return sp;
-}
-
-static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- union kvm_mmu_page_role role)
-{
- struct shadow_page_caches caches = {
- .page_header_cache = &vcpu->arch.mmu_page_header_cache,
- .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
- .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
- };
-
- return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
-}
-
-static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
- unsigned int access)
-{
- struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
- union kvm_mmu_page_role role;
-
- role = parent_sp->role;
- role.level--;
- role.access = access;
- role.direct = direct;
- role.passthrough = 0;
-
- /*
- * If the guest has 4-byte PTEs then that means it's using 32-bit,
- * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
- * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
- * shadow each guest page table with multiple shadow page tables, which
- * requires extra bookkeeping in the role.
- *
- * Specifically, to shadow the guest's page directory (which covers a
- * 4GiB address space), KVM uses 4 PAE page directories, each mapping
- * 1GiB of the address space. @role.quadrant encodes which quarter of
- * the address space each maps.
- *
- * To shadow the guest's page tables (which each map a 4MiB region), KVM
- * uses 2 PAE page tables, each mapping a 2MiB region. For these,
- * @role.quadrant encodes which half of the region they map.
- *
- * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
- * consumes bits 29:21. To consume bits 31:30, KVM uses 4 shadow
- * PDPTEs; those 4 PAE page directories are pre-allocated and their
- * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes
- * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume
- * bit 21 in the PTE (the child here), KVM propagates that bit to the
- * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE
- * covers bit 21 (see above), thus the quadrant is calculated from the
- * _least_ significant bit of the PDE index.
- */
- if (role.has_4_byte_gpte) {
- WARN_ON_ONCE(role.level != PG_LEVEL_4K);
- role.quadrant = spte_index(sptep) & 1;
- }
-
- return role;
-}
-
-static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
- u64 *sptep, gfn_t gfn,
- bool direct, unsigned int access)
-{
- union kvm_mmu_page_role role;
-
- if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
- return ERR_PTR(-EEXIST);
-
- role = kvm_mmu_child_role(sptep, direct, access);
- return kvm_mmu_get_shadow_page(vcpu, gfn, role);
-}
-
-static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, hpa_t root,
- u64 addr)
-{
- iterator->addr = addr;
- iterator->shadow_addr = root;
- iterator->level = vcpu->arch.mmu->root_role.level;
-
- if (iterator->level >= PT64_ROOT_4LEVEL &&
- vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
- !vcpu->arch.mmu->root_role.direct)
- iterator->level = PT32E_ROOT_LEVEL;
+ /* Verify that the fault can be handled in the fast path */
+ if (new_spte == spte ||
+ !is_access_allowed(fault, new_spte))
+ break;

- if (iterator->level == PT32E_ROOT_LEVEL) {
/*
- * prev_root is currently only used for 64-bit hosts. So only
- * the active root_hpa is valid here.
+ * Currently, fast page fault only works for direct mapping
+ * since the gfn is not stable for indirect shadow page. See
+ * Documentation/virt/kvm/locking.rst to get more detail.
*/
- BUG_ON(root != vcpu->arch.mmu->root.hpa);
-
- iterator->shadow_addr
- = vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
- iterator->shadow_addr &= SPTE_BASE_ADDR_MASK;
- --iterator->level;
- if (!iterator->shadow_addr)
- iterator->level = 0;
- }
-}
+ if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) {
+ ret = RET_PF_FIXED;
+ break;
+ }

-static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, u64 addr)
-{
- shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
- addr);
-}
-
-static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
-{
- if (iterator->level < PG_LEVEL_4K)
- return false;
-
- iterator->index = SPTE_INDEX(iterator->addr, iterator->level);
- iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
- return true;
-}
-
-static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
- u64 spte)
-{
- if (!is_shadow_present_pte(spte) || is_last_spte(spte, iterator->level)) {
- iterator->level = 0;
- return;
- }
-
- iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK;
- --iterator->level;
-}
-
-static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
-{
- __shadow_walk_next(iterator, *iterator->sptep);
-}
-
-static void __link_shadow_page(struct kvm *kvm,
- struct kvm_mmu_memory_cache *cache, u64 *sptep,
- struct kvm_mmu_page *sp, bool flush)
-{
- u64 spte;
-
- BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
-
- /*
- * If an SPTE is present already, it must be a leaf and therefore
- * a large one. Drop it, and flush the TLB if needed, before
- * installing sp.
- */
- if (is_shadow_present_pte(*sptep))
- drop_large_spte(kvm, sptep, flush);
-
- spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
-
- mmu_spte_set(sptep, spte);
-
- mmu_page_add_parent_pte(cache, sp, sptep);
-
- /*
- * The non-direct sub-pagetable must be updated before linking. For
- * L1 sp, the pagetable is updated via kvm_sync_page() in
- * kvm_mmu_find_shadow_page() without write-protecting the gfn,
- * so sp->unsync can be true or false. For higher level non-direct
- * sp, the pagetable is updated/synced via mmu_sync_children() in
- * FNAME(fetch)(), so sp->unsync_children can only be false.
- * WARN_ON_ONCE() if anything happens unexpectedly.
- */
- if (WARN_ON_ONCE(sp->unsync_children) || sp->unsync)
- mark_unsync(sptep);
-}
-
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
- struct kvm_mmu_page *sp)
-{
- __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true);
-}
-
-static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned direct_access)
-{
- if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
- struct kvm_mmu_page *child;
-
- /*
- * For the direct sp, if the guest pte's dirty bit
- * changed from clean to dirty, it will corrupt the
- * sp's access: allow writable in the read-only sp,
- * so we should update the spte at this point to get
- * a new sp with the correct access.
- */
- child = spte_to_child_sp(*sptep);
- if (child->role.access == direct_access)
- return;
-
- drop_parent_pte(child, sptep);
- kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
- }
-}
-
-/* Returns the number of zapped non-leaf child shadow pages. */
-static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
- u64 *spte, struct list_head *invalid_list)
-{
- u64 pte;
- struct kvm_mmu_page *child;
-
- pte = *spte;
- if (is_shadow_present_pte(pte)) {
- if (is_last_spte(pte, sp->role.level)) {
- drop_spte(kvm, spte);
- } else {
- child = spte_to_child_sp(pte);
- drop_parent_pte(child, spte);
-
- /*
- * Recursively zap nested TDP SPs, parentless SPs are
- * unlikely to be used again in the near future. This
- * avoids retaining a large number of stale nested SPs.
- */
- if (tdp_enabled && invalid_list &&
- child->role.guest_mode && !child->parent_ptes.val)
- return kvm_mmu_prepare_zap_page(kvm, child,
- invalid_list);
- }
- } else if (is_mmio_spte(pte)) {
- mmu_spte_clear_no_track(spte);
- }
- return 0;
-}
-
-static int kvm_mmu_page_unlink_children(struct kvm *kvm,
- struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
-{
- int zapped = 0;
- unsigned i;
-
- for (i = 0; i < SPTE_ENT_PER_PAGE; ++i)
- zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list);
-
- return zapped;
-}
-
-static void kvm_mmu_unlink_parents(struct kvm_mmu_page *sp)
-{
- u64 *sptep;
- struct rmap_iterator iter;
-
- while ((sptep = rmap_get_first(&sp->parent_ptes, &iter)))
- drop_parent_pte(sp, sptep);
-}
-
-static int mmu_zap_unsync_children(struct kvm *kvm,
- struct kvm_mmu_page *parent,
- struct list_head *invalid_list)
-{
- int i, zapped = 0;
- struct mmu_page_path parents;
- struct kvm_mmu_pages pages;
-
- if (parent->role.level == PG_LEVEL_4K)
- return 0;
-
- while (mmu_unsync_walk(parent, &pages)) {
- struct kvm_mmu_page *sp;
-
- for_each_sp(pages, sp, parents, i) {
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
- mmu_pages_clear_parents(&parents);
- zapped++;
- }
- }
-
- return zapped;
-}
-
-static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
- struct kvm_mmu_page *sp,
- struct list_head *invalid_list,
- int *nr_zapped)
-{
- bool list_unstable, zapped_root = false;
-
- lockdep_assert_held_write(&kvm->mmu_lock);
- trace_kvm_mmu_prepare_zap_page(sp);
- ++kvm->stat.mmu_shadow_zapped;
- *nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
- *nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
- kvm_mmu_unlink_parents(sp);
-
- /* Zapping children means active_mmu_pages has become unstable. */
- list_unstable = *nr_zapped;
-
- if (!sp->role.invalid && sp_has_gptes(sp))
- unaccount_shadowed(kvm, sp);
-
- if (sp->unsync)
- kvm_unlink_unsync_page(kvm, sp);
- if (!sp->root_count) {
- /* Count self */
- (*nr_zapped)++;
-
- /*
- * Already invalid pages (previously active roots) are not on
- * the active page list. See list_del() in the "else" case of
- * !sp->root_count.
- */
- if (sp->role.invalid)
- list_add(&sp->link, invalid_list);
- else
- list_move(&sp->link, invalid_list);
- kvm_unaccount_mmu_page(kvm, sp);
- } else {
- /*
- * Remove the active root from the active page list, the root
- * will be explicitly freed when the root_count hits zero.
- */
- list_del(&sp->link);
-
- /*
- * Obsolete pages cannot be used on any vCPUs, see the comment
- * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
- * treats invalid shadow pages as being obsolete.
- */
- zapped_root = !is_obsolete_sp(kvm, sp);
- }
-
- if (sp->nx_huge_page_disallowed)
- unaccount_nx_huge_page(kvm, sp);
-
- sp->role.invalid = 1;
-
- /*
- * Make the request to free obsolete roots after marking the root
- * invalid, otherwise other vCPUs may not see it as invalid.
- */
- if (zapped_root)
- kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
- return list_unstable;
-}
-
-static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
-{
- int nr_zapped;
-
- __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
- return nr_zapped;
-}
-
-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
- struct list_head *invalid_list)
-{
- struct kvm_mmu_page *sp, *nsp;
-
- if (list_empty(invalid_list))
- return;
-
- /*
- * We need to make sure everyone sees our modifications to
- * the page tables and see changes to vcpu->mode here. The barrier
- * in the kvm_flush_remote_tlbs() achieves this. This pairs
- * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.
- *
- * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit
- * guest mode and/or lockless shadow page table walks.
- */
- kvm_flush_remote_tlbs(kvm);
-
- list_for_each_entry_safe(sp, nsp, invalid_list, link) {
- WARN_ON(!sp->role.invalid || sp->root_count);
- kvm_mmu_free_shadow_page(sp);
- }
-}
-
-static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
- unsigned long nr_to_zap)
-{
- unsigned long total_zapped = 0;
- struct kvm_mmu_page *sp, *tmp;
- LIST_HEAD(invalid_list);
- bool unstable;
- int nr_zapped;
-
- if (list_empty(&kvm->arch.active_mmu_pages))
- return 0;
-
-restart:
- list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) {
- /*
- * Don't zap active root pages, the page itself can't be freed
- * and zapping it will just force vCPUs to realloc and reload.
- */
- if (sp->root_count)
- continue;
-
- unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
- &nr_zapped);
- total_zapped += nr_zapped;
- if (total_zapped >= nr_to_zap)
- break;
-
- if (unstable)
- goto restart;
- }
-
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
-
- kvm->stat.mmu_recycled += total_zapped;
- return total_zapped;
-}
-
-static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
-{
- if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
- return kvm->arch.n_max_mmu_pages -
- kvm->arch.n_used_mmu_pages;
-
- return 0;
-}
-
-static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
-{
- unsigned long avail = kvm_mmu_available_pages(vcpu->kvm);
-
- if (likely(avail >= KVM_MIN_FREE_MMU_PAGES))
- return 0;
-
- kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail);
-
- /*
- * Note, this check is intentionally soft, it only guarantees that one
- * page is available, while the caller may end up allocating as many as
- * four pages, e.g. for PAE roots or for 5-level paging. Temporarily
- * exceeding the (arbitrary by default) limit will not harm the host,
- * being too aggressive may unnecessarily kill the guest, and getting an
- * exact count is far more trouble than it's worth, especially in the
- * page fault paths.
- */
- if (!kvm_mmu_available_pages(vcpu->kvm))
- return -ENOSPC;
- return 0;
-}
-
-/*
- * Changing the number of mmu pages allocated to the vm
- * Note: if goal_nr_mmu_pages is too small, you will get a deadlock
- */
-void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
-{
- write_lock(&kvm->mmu_lock);
-
- if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
- kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
- goal_nr_mmu_pages);
-
- goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
- }
-
- kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
-
- write_unlock(&kvm->mmu_lock);
-}
-
-int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
-{
- struct kvm_mmu_page *sp;
- LIST_HEAD(invalid_list);
- int r;
-
- pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
- r = 0;
- write_lock(&kvm->mmu_lock);
- for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
- pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
- sp->role.word);
- r = 1;
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- }
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- write_unlock(&kvm->mmu_lock);
-
- return r;
-}
-
-static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
-{
- gpa_t gpa;
- int r;
-
- if (vcpu->arch.mmu->root_role.direct)
- return 0;
-
- gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
-
- r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
-
- return r;
-}
-
-static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- trace_kvm_mmu_unsync_page(sp);
- ++kvm->stat.mmu_unsync;
- sp->unsync = 1;
-
- kvm_mmu_mark_parents_unsync(sp);
-}
-
-/*
- * Attempt to unsync any shadow pages that can be reached by the specified gfn,
- * KVM is creating a writable mapping for said gfn. Returns 0 if all pages
- * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
- * be write-protected.
- */
-int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
- gfn_t gfn, bool can_unsync, bool prefetch)
-{
- struct kvm_mmu_page *sp;
- bool locked = false;
-
- /*
- * Force write-protection if the page is being tracked. Note, the page
- * track machinery is used to write-protect upper-level shadow pages,
- * i.e. this guards the role.level == 4K assertion below!
- */
- if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
- return -EPERM;
-
- /*
- * The page is not write-tracked, mark existing shadow pages unsync
- * unless KVM is synchronizing an unsync SP (can_unsync = false). In
- * that case, KVM must complete emulation of the guest TLB flush before
- * allowing shadow pages to become unsync (writable by the guest).
- */
- for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
- if (!can_unsync)
- return -EPERM;
-
- if (sp->unsync)
- continue;
-
- if (prefetch)
- return -EEXIST;
-
- /*
- * TDP MMU page faults require an additional spinlock as they
- * run with mmu_lock held for read, not write, and the unsync
- * logic is not thread safe. Take the spinlock regardless of
- * the MMU type to avoid extra conditionals/parameters, there's
- * no meaningful penalty if mmu_lock is held for write.
- */
- if (!locked) {
- locked = true;
- spin_lock(&kvm->arch.mmu_unsync_pages_lock);
-
- /*
- * Recheck after taking the spinlock, a different vCPU
- * may have since marked the page unsync. A false
- * positive on the unprotected check above is not
- * possible as clearing sp->unsync _must_ hold mmu_lock
- * for write, i.e. unsync cannot transition from 1->0
- * while this CPU holds mmu_lock for read (or write).
- */
- if (READ_ONCE(sp->unsync))
- continue;
- }
-
- WARN_ON(sp->role.level != PG_LEVEL_4K);
- kvm_unsync_page(kvm, sp);
- }
- if (locked)
- spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
-
- /*
- * We need to ensure that the marking of unsync pages is visible
- * before the SPTE is updated to allow writes because
- * kvm_mmu_sync_roots() checks the unsync flags without holding
- * the MMU lock and so can race with this. If the SPTE was updated
- * before the page had been marked as unsync-ed, something like the
- * following could happen:
- *
- * CPU 1                   CPU 2
- * ---------------------------------------------------------------------
- * 1.2 Host updates SPTE
- *     to be writable
- *                         2.1 Guest writes a GPTE for GVA X.
- *                             (GPTE being in the guest page table shadowed
- *                              by the SP from CPU 1.)
- *                             This reads SPTE during the page table walk.
- *                             Since SPTE.W is read as 1, there is no
- *                             fault.
- *
- *                         2.2 Guest issues TLB flush.
- *                             That causes a VM Exit.
- *
- *                         2.3 Walking of unsync pages sees sp->unsync is
- *                             false and skips the page.
- *
- *                         2.4 Guest accesses GVA X.
- *                             Since the mapping in the SP was not updated,
- *                             the old mapping for GVA X incorrectly
- *                             gets used.
- * 1.1 Host marks SP
- *     as unsync
- *     (sp->unsync = true)
- *
- * The write barrier below ensures that 1.1 happens before 1.2 and thus
- * the situation in 2.4 does not arise. It pairs with the read barrier
- * in is_unsync_root(), placed between 2.1's load of SPTE.W and 2.3.
- */
- smp_wmb();
-
- return 0;
-}
-
-static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
- u64 *sptep, unsigned int pte_access, gfn_t gfn,
- kvm_pfn_t pfn, struct kvm_page_fault *fault)
-{
- struct kvm_mmu_page *sp = sptep_to_sp(sptep);
- int level = sp->role.level;
- int was_rmapped = 0;
- int ret = RET_PF_FIXED;
- bool flush = false;
- bool wrprot;
- u64 spte;
-
- /* Prefetching always gets a writable pfn. */
- bool host_writable = !fault || fault->map_writable;
- bool prefetch = !fault || fault->prefetch;
- bool write_fault = fault && fault->write;
-
- pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
- *sptep, write_fault, gfn);
-
- if (unlikely(is_noslot_pfn(pfn))) {
- vcpu->stat.pf_mmio_spte_created++;
- mark_mmio_spte(vcpu, sptep, gfn, pte_access);
- return RET_PF_EMULATE;
- }
-
- if (is_shadow_present_pte(*sptep)) {
- /*
- * If we overwrite a PTE page pointer with a 2MB PMD, unlink
- * the parent of the now unreachable PTE.
- */
- if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
- struct kvm_mmu_page *child;
- u64 pte = *sptep;
-
- child = spte_to_child_sp(pte);
- drop_parent_pte(child, sptep);
- flush = true;
- } else if (pfn != spte_to_pfn(*sptep)) {
- pgprintk("hfn old %llx new %llx\n",
- spte_to_pfn(*sptep), pfn);
- drop_spte(vcpu->kvm, sptep);
- flush = true;
- } else
- was_rmapped = 1;
- }
-
- wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
- true, host_writable, &spte);
-
- if (*sptep == spte) {
- ret = RET_PF_SPURIOUS;
- } else {
- flush |= mmu_spte_update(sptep, spte);
- trace_kvm_mmu_set_spte(level, gfn, sptep);
- }
-
- if (wrprot) {
- if (write_fault)
- ret = RET_PF_EMULATE;
- }
-
- if (flush)
- kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn,
- KVM_PAGES_PER_HPAGE(level));
-
- pgprintk("%s: setting spte %llx\n", __func__, *sptep);
-
- if (!was_rmapped) {
- WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
- rmap_add(vcpu, slot, sptep, gfn, pte_access);
- } else {
- /* Already rmapped but the pte_access bits may have changed. */
- kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access);
- }
-
- return ret;
-}
-
-static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp,
- u64 *start, u64 *end)
-{
- struct page *pages[PTE_PREFETCH_NUM];
- struct kvm_memory_slot *slot;
- unsigned int access = sp->role.access;
- int i, ret;
- gfn_t gfn;
-
- gfn = kvm_mmu_page_get_gfn(sp, spte_index(start));
- slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
- if (!slot)
- return -1;
-
- ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
- if (ret <= 0)
- return -1;
-
- for (i = 0; i < ret; i++, gfn++, start++) {
- mmu_set_spte(vcpu, slot, start, access, gfn,
- page_to_pfn(pages[i]), NULL);
- put_page(pages[i]);
- }
-
- return 0;
-}
-
-static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *sptep)
-{
- u64 *spte, *start = NULL;
- int i;
-
- WARN_ON(!sp->role.direct);
-
- i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1);
- spte = sp->spt + i;
-
- for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
- if (is_shadow_present_pte(*spte) || spte == sptep) {
- if (!start)
- continue;
- if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
- return;
- start = NULL;
- } else if (!start)
- start = spte;
- }
- if (start)
- direct_pte_prefetch_many(vcpu, sp, start, spte);
-}
-
-static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
-{
- struct kvm_mmu_page *sp;
-
- sp = sptep_to_sp(sptep);
-
- /*
- * Without accessed bits, there's no way to distinguish between
- * actually accessed translations and prefetched, so disable pte
- * prefetch if accessed bits aren't available.
- */
- if (sp_ad_disabled(sp))
- return;
-
- if (sp->role.level > PG_LEVEL_4K)
- return;
-
- /*
- * If addresses are being invalidated, skip prefetching to avoid
- * accidentally prefetching those addresses.
- */
- if (unlikely(vcpu->kvm->mmu_invalidate_in_progress))
- return;
-
- __direct_pte_prefetch(vcpu, sp, sptep);
-}
-
-/*
- * Lookup the mapping level for @gfn in the current mm.
- *
- * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
- * consumer to be tied into KVM's handlers for MMU notifier events!
- *
- * There are several ways to safely use this helper:
- *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
- * consuming it. In this case, mmu_lock doesn't need to be held during the
- * lookup, but it does need to be held while checking the MMU notifier.
- *
- * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
- * event for the hva. This can be done by explicitly checking the MMU notifier
- * or by ensuring that KVM already has a valid mapping that covers the hva.
- *
- * - Do not use the result to install new mappings, e.g. use the host mapping
- * level only to decide whether or not to zap an entry. In this case, it's
- * not required to hold mmu_lock (though it's highly likely the caller will
- * want to hold mmu_lock anyways, e.g. to modify SPTEs).
- *
- * Note! The lookup can still race with modifications to host page tables, but
- * the above "rules" ensure KVM will not _consume_ the result of the walk if a
- * race with the primary MMU occurs.
- */
-static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
- const struct kvm_memory_slot *slot)
-{
- int level = PG_LEVEL_4K;
- unsigned long hva;
- unsigned long flags;
- pgd_t pgd;
- p4d_t p4d;
- pud_t pud;
- pmd_t pmd;
-
- /*
- * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
- * is not solely for performance, it's also necessary to avoid the
- * "writable" check in __gfn_to_hva_many(), which will always fail on
- * read-only memslots due to gfn_to_hva() assuming writes. Earlier
- * page fault steps have already verified the guest isn't writing a
- * read-only memslot.
- */
- hva = __gfn_to_hva_memslot(slot, gfn);
-
- /*
- * Disable IRQs to prevent concurrent tear down of host page tables,
- * e.g. if the primary MMU promotes a P*D to a huge page and then frees
- * the original page table.
- */
- local_irq_save(flags);
-
- /*
- * Read each entry once. As above, a non-leaf entry can be promoted to
- * a huge page _during_ this walk. Re-reading the entry could send the
- * walk into the weeds, e.g. p*d_large() returns false (sees the old
- * value) and then p*d_offset() walks into the target huge page instead
- * of the old page table (sees the new value).
- */
- pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
- if (pgd_none(pgd))
- goto out;
-
- p4d = READ_ONCE(*p4d_offset(&pgd, hva));
- if (p4d_none(p4d) || !p4d_present(p4d))
- goto out;
-
- pud = READ_ONCE(*pud_offset(&p4d, hva));
- if (pud_none(pud) || !pud_present(pud))
- goto out;
-
- if (pud_large(pud)) {
- level = PG_LEVEL_1G;
- goto out;
- }
-
- pmd = READ_ONCE(*pmd_offset(&pud, hva));
- if (pmd_none(pmd) || !pmd_present(pmd))
- goto out;
-
- if (pmd_large(pmd))
- level = PG_LEVEL_2M;
-
-out:
- local_irq_restore(flags);
- return level;
-}
-
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
- const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
-{
- struct kvm_lpage_info *linfo;
- int host_level;
-
- max_level = min(max_level, max_huge_page_level);
- for ( ; max_level > PG_LEVEL_4K; max_level--) {
- linfo = lpage_info_slot(gfn, slot, max_level);
- if (!linfo->disallow_lpage)
- break;
- }
-
- if (max_level == PG_LEVEL_4K)
- return PG_LEVEL_4K;
-
- host_level = host_pfn_mapping_level(kvm, gfn, slot);
- return min(host_level, max_level);
-}
-
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
- struct kvm_memory_slot *slot = fault->slot;
- kvm_pfn_t mask;
-
- fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled;
-
- if (unlikely(fault->max_level == PG_LEVEL_4K))
- return;
-
- if (is_error_noslot_pfn(fault->pfn))
- return;
-
- if (kvm_slot_dirty_track_enabled(slot))
- return;
-
- /*
- * Enforce the iTLB multihit workaround after capturing the requested
- * level, which will be used to do precise, accurate accounting.
- */
- fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
- if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
- return;
-
- /*
- * mmu_invalidate_retry() was successful and mmu_lock is held, so
- * the pmd can't be split from under us.
- */
- fault->goal_level = fault->req_level;
- mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1;
- VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask));
- fault->pfn &= ~mask;
-}
-
-void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
-{
- if (cur_level > PG_LEVEL_4K &&
- cur_level == fault->goal_level &&
- is_shadow_present_pte(spte) &&
- !is_large_pte(spte) &&
- spte_to_child_sp(spte)->nx_huge_page_disallowed) {
- /*
- * A small SPTE exists for this pfn, but FNAME(fetch),
- * direct_map(), or kvm_tdp_mmu_map() would like to create a
- * large PTE instead: just force them to go down another level,
- * patching back for them into pfn the next 9 bits of the
- * address.
- */
- u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
- KVM_PAGES_PER_HPAGE(cur_level - 1);
- fault->pfn |= fault->gfn & page_mask;
- fault->goal_level--;
- }
-}
-
-static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
- struct kvm_shadow_walk_iterator it;
- struct kvm_mmu_page *sp;
- int ret;
- gfn_t base_gfn = fault->gfn;
-
- kvm_mmu_hugepage_adjust(vcpu, fault);
-
- trace_kvm_mmu_spte_requested(fault);
- for_each_shadow_entry(vcpu, fault->addr, it) {
- /*
- * We cannot overwrite existing page tables with an NX
- * large page, as the leaf could be executable.
- */
- if (fault->nx_huge_page_workaround_enabled)
- disallowed_hugepage_adjust(fault, *it.sptep, it.level);
-
- base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
- if (it.level == fault->goal_level)
- break;
-
- sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
- if (sp == ERR_PTR(-EEXIST))
- continue;
-
- link_shadow_page(vcpu, it.sptep, sp);
- if (fault->huge_page_disallowed)
- account_nx_huge_page(vcpu->kvm, sp,
- fault->req_level >= it.level);
- }
-
- if (WARN_ON_ONCE(it.level != fault->goal_level))
- return -EFAULT;
-
- ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
- base_gfn, fault->pfn, fault);
- if (ret == RET_PF_SPURIOUS)
- return ret;
-
- direct_pte_prefetch(vcpu, it.sptep);
- return ret;
-}
-
-static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
-{
- unsigned long hva = gfn_to_hva_memslot(slot, gfn);
-
- send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
-}
-
-static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
- if (is_sigpending_pfn(fault->pfn)) {
- kvm_handle_signal_exit(vcpu);
- return -EINTR;
- }
-
- /*
- * Do not cache the mmio info caused by writing the readonly gfn
- * into the spte, otherwise a read access on the readonly gfn can also
- * cause an mmio page fault and be treated as an mmio access.
- */
- if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
- return RET_PF_EMULATE;
-
- if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
- kvm_send_hwpoison_signal(fault->slot, fault->gfn);
- return RET_PF_RETRY;
- }
-
- return -EFAULT;
-}
-
-static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault,
- unsigned int access)
-{
- gva_t gva = fault->is_tdp ? 0 : fault->addr;
-
- vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
- access & shadow_mmio_access_mask);
-
- /*
- * If MMIO caching is disabled, emulate immediately without
- * touching the shadow page tables as attempting to install an
- * MMIO SPTE will just be an expensive nop.
- */
- if (unlikely(!enable_mmio_caching))
- return RET_PF_EMULATE;
-
- /*
- * Do not create an MMIO SPTE for a gfn greater than host.MAXPHYADDR,
- * any guest that generates such gfns is running nested and is being
- * tricked by L0 userspace (you can observe gfn > L1.MAXPHYADDR if and
- * only if L1's MAXPHYADDR is inaccurate with respect to the
- * hardware's).
- */
- if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
- return RET_PF_EMULATE;
-
- return RET_PF_CONTINUE;
-}
-
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
-{
- /*
- * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
- * reach the common page fault handler if the SPTE has an invalid MMIO
- * generation number. Refreshing the MMIO generation needs to go down
- * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag!
- */
- if (fault->rsvd)
- return false;
-
- /*
- * #PF can be fast if:
- *
- * 1. The shadow page table entry is not present and A/D bits are
- * disabled _by KVM_, which could mean that the fault is potentially
- * caused by access tracking (if enabled). If A/D bits are enabled
- * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
- * bits for L2 and employ access tracking, but the fast page fault
- * mechanism only supports direct MMUs.
- * 2. The shadow page table entry is present, the access is a write,
- * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
- * the fault was caused by a write-protection violation. If the
- * SPTE is MMU-writable (determined later), the fault can be fixed
- * by setting the Writable bit, which can be done out of mmu_lock.
- */
- if (!fault->present)
- return !kvm_ad_enabled();
-
- /*
- * Note, instruction fetches and writes are mutually exclusive, ignore
- * the "exec" flag.
- */
- return fault->write;
-}
-
-/*
- * Returns true if the SPTE was fixed successfully. Otherwise,
- * someone else modified the SPTE from its original value.
- */
-static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault,
- u64 *sptep, u64 old_spte, u64 new_spte)
-{
- /*
- * Theoretically we could also set dirty bit (and flush TLB) here in
- * order to eliminate unnecessary PML logging. See comments in
- * set_spte. But fast_page_fault is very unlikely to happen with PML
- * enabled, so we do not do this. This might result in the same GPA
- * to be logged in PML buffer again when the write really happens, and
- * eventually to be called by mark_page_dirty twice. But it's also no
- * harm. This also avoids the TLB flush needed after setting dirty bit
- * so non-PML cases won't be impacted.
- *
- * Compare with set_spte where instead shadow_dirty_mask is set.
- */
- if (!try_cmpxchg64(sptep, &old_spte, new_spte))
- return false;
-
- if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
- mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);
-
- return true;
-}
-
-static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
-{
- if (fault->exec)
- return is_executable_pte(spte);
-
- if (fault->write)
- return is_writable_pte(spte);
-
- /* Fault was on Read access */
- return spte & PT_PRESENT_MASK;
-}
-
-/*
- * Returns the last level spte pointer of the shadow page walk for the given
- * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
- * walk could be performed, returns NULL and *spte does not contain valid data.
- *
- * Contract:
- * - Must be called between walk_shadow_page_lockless_{begin,end}.
- * - The returned sptep must not be used after walk_shadow_page_lockless_end.
- */
-static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
-{
- struct kvm_shadow_walk_iterator iterator;
- u64 old_spte;
- u64 *sptep = NULL;
-
- for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
- sptep = iterator.sptep;
- *spte = old_spte;
- }
-
- return sptep;
-}
-
-/*
- * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
- */
-static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
- struct kvm_mmu_page *sp;
- int ret = RET_PF_INVALID;
- u64 spte = 0ull;
- u64 *sptep = NULL;
- uint retry_count = 0;
-
- if (!page_fault_can_be_fast(fault))
- return ret;
-
- walk_shadow_page_lockless_begin(vcpu);
-
- do {
- u64 new_spte;
-
- if (tdp_mmu_enabled)
- sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
- else
- sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
-
- if (!is_shadow_present_pte(spte))
- break;
-
- sp = sptep_to_sp(sptep);
- if (!is_last_spte(spte, sp->role.level))
- break;
-
- /*
- * Check whether the memory access that caused the fault would
- * still cause it if it were to be performed right now. If not,
- * then this is a spurious fault caused by TLB lazily flushed,
- * or some other CPU has already fixed the PTE after the
- * current CPU took the fault.
- *
- * Need not check the access of upper level table entries since
- * they are always ACC_ALL.
- */
- if (is_access_allowed(fault, spte)) {
- ret = RET_PF_SPURIOUS;
- break;
- }
-
- new_spte = spte;
-
- /*
- * KVM only supports fixing page faults outside of MMU lock for
- * direct MMUs, nested MMUs are always indirect, and KVM always
- * uses A/D bits for non-nested MMUs. Thus, if A/D bits are
- * enabled, the SPTE can't be an access-tracked SPTE.
- */
- if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
- new_spte = restore_acc_track_spte(new_spte);
-
- /*
- * To keep things simple, only SPTEs that are MMU-writable can
- * be made fully writable outside of mmu_lock, e.g. only SPTEs
- * that were write-protected for dirty-logging or access
- * tracking are handled here. Don't bother checking if the
- * SPTE is writable to prioritize running with A/D bits enabled.
- * The is_access_allowed() check above handles the common case
- * of the fault being spurious, and the SPTE is known to be
- * shadow-present, i.e. except for access tracking restoration
- * making the new SPTE writable, the check is wasteful.
- */
- if (fault->write && is_mmu_writable_spte(spte)) {
- new_spte |= PT_WRITABLE_MASK;
-
- /*
- * Do not fix write-permission on the large spte when
- * dirty logging is enabled. Since we only dirty the
- * first page into the dirty-bitmap in
- * fast_pf_fix_direct_spte(), other pages are missed
- * if its slot has dirty logging enabled.
- *
- * Instead, we let the slow page fault path create a
- * normal spte to fix the access.
- */
- if (sp->role.level > PG_LEVEL_4K &&
- kvm_slot_dirty_track_enabled(fault->slot))
- break;
- }
-
- /* Verify that the fault can be handled in the fast path */
- if (new_spte == spte ||
- !is_access_allowed(fault, new_spte))
- break;
-
- /*
- * Currently, fast page fault only works for direct mapping
- * since the gfn is not stable for indirect shadow page. See
- * Documentation/virt/kvm/locking.rst to get more detail.
- */
- if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) {
- ret = RET_PF_FIXED;
- break;
- }
-
- if (++retry_count > 4) {
- pr_warn_once("Fast #PF retrying more than 4 times.\n");
- break;
- }
-
- } while (true);
-
- trace_fast_page_fault(vcpu, fault, sptep, spte, ret);
- walk_shadow_page_lockless_end(vcpu);
-
- if (ret != RET_PF_INVALID)
- vcpu->stat.pf_fast++;
-
- return ret;
-}
-
-static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
- struct list_head *invalid_list)
-{
- struct kvm_mmu_page *sp;
-
- if (!VALID_PAGE(*root_hpa))
- return;
-
- /*
- * The "root" may be a special root, e.g. a PAE entry, treat it as a
- * SPTE to ensure any non-PA bits are dropped.
- */
- sp = spte_to_child_sp(*root_hpa);
- if (WARN_ON(!sp))
- return;
-
- if (is_tdp_mmu_page(sp))
- kvm_tdp_mmu_put_root(kvm, sp, false);
- else if (!--sp->root_count && sp->role.invalid)
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
-
- *root_hpa = INVALID_PAGE;
-}
-
-/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
-void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
- ulong roots_to_free)
-{
- int i;
- LIST_HEAD(invalid_list);
- bool free_active_root;
-
- BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
-
- /* Before acquiring the MMU lock, see if we need to do any real work. */
- free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
- && VALID_PAGE(mmu->root.hpa);
-
- if (!free_active_root) {
- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
- VALID_PAGE(mmu->prev_roots[i].hpa))
- break;
-
- if (i == KVM_MMU_NUM_PREV_ROOTS)
- return;
- }
-
- write_lock(&kvm->mmu_lock);
-
- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
- mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa,
- &invalid_list);
-
- if (free_active_root) {
- if (to_shadow_page(mmu->root.hpa)) {
- mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
- } else if (mmu->pae_root) {
- for (i = 0; i < 4; ++i) {
- if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
- continue;
-
- mmu_free_root_page(kvm, &mmu->pae_root[i],
- &invalid_list);
- mmu->pae_root[i] = INVALID_PAE_ROOT;
- }
- }
- mmu->root.hpa = INVALID_PAGE;
- mmu->root.pgd = 0;
- }
-
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- write_unlock(&kvm->mmu_lock);
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
-
-void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
-{
- unsigned long roots_to_free = 0;
- hpa_t root_hpa;
- int i;
-
- /*
- * This should not be called while L2 is active, L2 can't invalidate
- * _only_ its own roots, e.g. INVVPID unconditionally exits.
- */
- WARN_ON_ONCE(mmu->root_role.guest_mode);
-
- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
- root_hpa = mmu->prev_roots[i].hpa;
- if (!VALID_PAGE(root_hpa))
- continue;
-
- if (!to_shadow_page(root_hpa) ||
- to_shadow_page(root_hpa)->role.guest_mode)
- roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
- }
-
- kvm_mmu_free_roots(kvm, mmu, roots_to_free);
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
-
-
-static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
-{
- int ret = 0;
-
- if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
- kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
- ret = 1;
- }
-
- return ret;
-}
-
-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
- u8 level)
-{
- union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
- struct kvm_mmu_page *sp;
-
- role.level = level;
- role.quadrant = quadrant;
-
- WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
- WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
-
- sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
- ++sp->root_count;
-
- return __pa(sp->spt);
-}
-
-static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
- u8 shadow_root_level = mmu->root_role.level;
- hpa_t root;
- unsigned i;
- int r;
-
- write_lock(&vcpu->kvm->mmu_lock);
- r = make_mmu_pages_available(vcpu);
- if (r < 0)
- goto out_unlock;
-
- if (tdp_mmu_enabled) {
- root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
- mmu->root.hpa = root;
- } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
- mmu->root.hpa = root;
- } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
- if (WARN_ON_ONCE(!mmu->pae_root)) {
- r = -EIO;
- goto out_unlock;
- }
-
- for (i = 0; i < 4; ++i) {
- WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
- root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
- PT32_ROOT_LEVEL);
- mmu->pae_root[i] = root | PT_PRESENT_MASK |
- shadow_me_value;
- }
- mmu->root.hpa = __pa(mmu->pae_root);
- } else {
- WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
- r = -EIO;
- goto out_unlock;
- }
-
- /* root.pgd is ignored for direct MMUs. */
- mmu->root.pgd = 0;
-out_unlock:
- write_unlock(&vcpu->kvm->mmu_lock);
- return r;
-}
-
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
-{
- struct kvm_memslots *slots;
- struct kvm_memory_slot *slot;
- int r = 0, i, bkt;
-
- /*
- * Check if this is the first shadow root being allocated before
- * taking the lock.
- */
- if (kvm_shadow_root_allocated(kvm))
- return 0;
-
- mutex_lock(&kvm->slots_arch_lock);
-
- /* Recheck, under the lock, whether this is the first shadow root. */
- if (kvm_shadow_root_allocated(kvm))
- goto out_unlock;
-
- /*
- * Check if anything actually needs to be allocated, e.g. all metadata
- * will be allocated upfront if TDP is disabled.
- */
- if (kvm_memslots_have_rmaps(kvm) &&
- kvm_page_track_write_tracking_enabled(kvm))
- goto out_success;
-
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(slot, bkt, slots) {
- /*
- * Both of these functions are no-ops if the target is
- * already allocated, so unconditionally calling both
- * is safe. Intentionally do NOT free allocations on
- * failure to avoid having to track which allocations
- * were made now versus when the memslot was created.
- * The metadata is guaranteed to be freed when the slot
- * is freed, and will be kept/used if userspace retries
- * KVM_RUN instead of killing the VM.
- */
- r = memslot_rmap_alloc(slot, slot->npages);
- if (r)
- goto out_unlock;
- r = kvm_page_track_write_tracking_alloc(slot);
- if (r)
- goto out_unlock;
- }
- }
-
- /*
- * Ensure that shadow_root_allocated becomes true strictly after
- * all the related pointers are set.
- */
-out_success:
- smp_store_release(&kvm->arch.shadow_root_allocated, true);
-
-out_unlock:
- mutex_unlock(&kvm->slots_arch_lock);
- return r;
-}
-
-static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
- u64 pdptrs[4], pm_mask;
- gfn_t root_gfn, root_pgd;
- int quadrant, i, r;
- hpa_t root;
-
- root_pgd = mmu->get_guest_pgd(vcpu);
- root_gfn = root_pgd >> PAGE_SHIFT;
-
- if (mmu_check_root(vcpu, root_gfn))
- return 1;
-
- /*
- * On SVM, reading PDPTRs might access guest memory, which might fault
- * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock.
- */
- if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
- for (i = 0; i < 4; ++i) {
- pdptrs[i] = mmu->get_pdptr(vcpu, i);
- if (!(pdptrs[i] & PT_PRESENT_MASK))
- continue;
-
- if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
- return 1;
- }
- }
-
- r = mmu_first_shadow_root_alloc(vcpu->kvm);
- if (r)
- return r;
-
- write_lock(&vcpu->kvm->mmu_lock);
- r = make_mmu_pages_available(vcpu);
- if (r < 0)
- goto out_unlock;
-
- /*
- * Do we shadow a long mode page table? If so we need to
- * write-protect the guests page table root.
- */
- if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->root_role.level);
- mmu->root.hpa = root;
- goto set_root_pgd;
- }
-
- if (WARN_ON_ONCE(!mmu->pae_root)) {
- r = -EIO;
- goto out_unlock;
- }
-
- /*
- * We shadow a 32 bit page table. This may be a legacy 2-level
- * or a PAE 3-level page table. In either case we need to be aware that
- * the shadow page table may be a PAE or a long mode page table.
- */
- pm_mask = PT_PRESENT_MASK | shadow_me_value;
- if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
- pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
-
- if (WARN_ON_ONCE(!mmu->pml4_root)) {
- r = -EIO;
- goto out_unlock;
- }
- mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
-
- if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
- if (WARN_ON_ONCE(!mmu->pml5_root)) {
- r = -EIO;
- goto out_unlock;
- }
- mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
- }
- }
-
- for (i = 0; i < 4; ++i) {
- WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
- if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
- if (!(pdptrs[i] & PT_PRESENT_MASK)) {
- mmu->pae_root[i] = INVALID_PAE_ROOT;
- continue;
- }
- root_gfn = pdptrs[i] >> PAGE_SHIFT;
- }
-
- /*
- * If shadowing 32-bit non-PAE page tables, each PAE page
- * directory maps one quarter of the guest's non-PAE page
- * directory. Othwerise each PAE page direct shadows one guest
- * PAE page directory so that quadrant should be 0.
- */
- quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
-
- root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
- mmu->pae_root[i] = root | pm_mask;
- }
-
- if (mmu->root_role.level == PT64_ROOT_5LEVEL)
- mmu->root.hpa = __pa(mmu->pml5_root);
- else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
- mmu->root.hpa = __pa(mmu->pml4_root);
- else
- mmu->root.hpa = __pa(mmu->pae_root);
-
-set_root_pgd:
- mmu->root.pgd = root_pgd;
-out_unlock:
- write_unlock(&vcpu->kvm->mmu_lock);
-
- return r;
-}
-
-static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *mmu = vcpu->arch.mmu;
- bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
- u64 *pml5_root = NULL;
- u64 *pml4_root = NULL;
- u64 *pae_root;
-
- /*
- * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
- * tables are allocated and initialized at root creation as there is no
- * equivalent level in the guest's NPT to shadow. Allocate the tables
- * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
- */
- if (mmu->root_role.direct ||
- mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
- mmu->root_role.level < PT64_ROOT_4LEVEL)
- return 0;
-
- /*
- * NPT, the only paging mode that uses this horror, uses a fixed number
- * of levels for the shadow page tables, e.g. all MMUs are 4-level or
- * all MMus are 5-level. Thus, this can safely require that pml5_root
- * is allocated if the other roots are valid and pml5 is needed, as any
- * prior MMU would also have required pml5.
- */
- if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
- return 0;
-
- /*
- * The special roots should always be allocated in concert. Yell and
- * bail if KVM ends up in a state where only one of the roots is valid.
- */
- if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
- (need_pml5 && mmu->pml5_root)))
- return -EIO;
-
- /*
- * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
- * doesn't need to be decrypted.
- */
- pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
- if (!pae_root)
- return -ENOMEM;
+ if (++retry_count > 4) {
+ pr_warn_once("Fast #PF retrying more than 4 times.\n");
+ break;
+ }

-#ifdef CONFIG_X86_64
- pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
- if (!pml4_root)
- goto err_pml4;
-
- if (need_pml5) {
- pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
- if (!pml5_root)
- goto err_pml5;
- }
-#endif
+ } while (true);

- mmu->pae_root = pae_root;
- mmu->pml4_root = pml4_root;
- mmu->pml5_root = pml5_root;
+ trace_fast_page_fault(vcpu, fault, sptep, spte, ret);
+ walk_shadow_page_lockless_end(vcpu);

- return 0;
+ if (ret != RET_PF_INVALID)
+ vcpu->stat.pf_fast++;

-#ifdef CONFIG_X86_64
-err_pml5:
- free_page((unsigned long)pml4_root);
-err_pml4:
- free_page((unsigned long)pae_root);
- return -ENOMEM;
-#endif
+ return ret;
}

-static bool is_unsync_root(hpa_t root)
+static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
+ struct list_head *invalid_list)
{
struct kvm_mmu_page *sp;

- if (!VALID_PAGE(root))
- return false;
-
- /*
- * The read barrier orders the CPU's read of SPTE.W during the page table
- * walk before the reads of sp->unsync/sp->unsync_children here.
- *
- * Even if another CPU was marking the SP as unsync-ed simultaneously,
- * any guest page table changes are not guaranteed to be visible anyway
- * until this VCPU issues a TLB flush strictly after those changes are
- * made. We only need to ensure that the other CPU sets these flags
- * before any actual changes to the page tables are made. The comments
- * in mmu_try_to_unsync_pages() describe what could go wrong if this
- * requirement isn't satisfied.
- */
- smp_rmb();
- sp = to_shadow_page(root);
+ if (!VALID_PAGE(*root_hpa))
+ return;

/*
- * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
- * PDPTEs for a given PAE root need to be synchronized individually.
+ * The "root" may be a special root, e.g. a PAE entry, treat it as a
+ * SPTE to ensure any non-PA bits are dropped.
*/
- if (WARN_ON_ONCE(!sp))
- return false;
+ sp = spte_to_child_sp(*root_hpa);
+ if (WARN_ON(!sp))
+ return;

- if (sp->unsync || sp->unsync_children)
- return true;
+ if (is_tdp_mmu_page(sp))
+ kvm_tdp_mmu_put_root(kvm, sp, false);
+ else if (!--sp->root_count && sp->role.invalid)
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);

- return false;
+ *root_hpa = INVALID_PAGE;
}

-void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
+/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
+void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
+ ulong roots_to_free)
{
int i;
- struct kvm_mmu_page *sp;
-
- if (vcpu->arch.mmu->root_role.direct)
- return;
+ LIST_HEAD(invalid_list);
+ bool free_active_root;

- if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
- return;
+ BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);

- vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+ /* Before acquiring the MMU lock, see if we need to do any real work. */
+ free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
+ && VALID_PAGE(mmu->root.hpa);

- if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
- hpa_t root = vcpu->arch.mmu->root.hpa;
- sp = to_shadow_page(root);
+ if (!free_active_root) {
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
+ VALID_PAGE(mmu->prev_roots[i].hpa))
+ break;

- if (!is_unsync_root(root))
+ if (i == KVM_MMU_NUM_PREV_ROOTS)
return;
-
- write_lock(&vcpu->kvm->mmu_lock);
- mmu_sync_children(vcpu, sp, true);
- write_unlock(&vcpu->kvm->mmu_lock);
- return;
}

- write_lock(&vcpu->kvm->mmu_lock);
+ write_lock(&kvm->mmu_lock);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
+ mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa,
+ &invalid_list);

- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu->pae_root[i];
+ if (free_active_root) {
+ if (to_shadow_page(mmu->root.hpa)) {
+ mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
+ } else if (mmu->pae_root) {
+ for (i = 0; i < 4; ++i) {
+ if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
+ continue;

- if (IS_VALID_PAE_ROOT(root)) {
- sp = spte_to_child_sp(root);
- mmu_sync_children(vcpu, sp, true);
+ mmu_free_root_page(kvm, &mmu->pae_root[i],
+ &invalid_list);
+ mmu->pae_root[i] = INVALID_PAE_ROOT;
+ }
}
+ mmu->root.hpa = INVALID_PAGE;
+ mmu->root.pgd = 0;
}

- write_unlock(&vcpu->kvm->mmu_lock);
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);

-void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
+static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
- unsigned long roots_to_free = 0;
- int i;
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u8 shadow_root_level = mmu->root_role.level;
+ hpa_t root;
+ unsigned i;
+ int r;

- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
- roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ write_lock(&vcpu->kvm->mmu_lock);
+ r = make_mmu_pages_available(vcpu);
+ if (r < 0)
+ goto out_unlock;
+
+ if (tdp_mmu_enabled) {
+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+ mmu->root.hpa = root;
+ } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+ root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
+ mmu->root.hpa = root;
+ } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
+ if (WARN_ON_ONCE(!mmu->pae_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ for (i = 0; i < 4; ++i) {
+ WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

- /* sync prev_roots by simply freeing them */
- kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free);
+ root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
+ PT32_ROOT_LEVEL);
+ mmu->pae_root[i] = root | PT_PRESENT_MASK |
+ shadow_me_value;
+ }
+ mmu->root.hpa = __pa(mmu->pae_root);
+ } else {
+ WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ /* root.pgd is ignored for direct MMUs. */
+ mmu->root.pgd = 0;
+out_unlock:
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return r;
}

static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
@@ -4002,31 +1200,6 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return vcpu_match_mmio_gva(vcpu, addr);
}

-/*
- * Return the level of the lowest level SPTE added to sptes.
- * That SPTE may be non-present.
- *
- * Must be called between walk_shadow_page_lockless_{begin,end}.
- */
-static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
-{
- struct kvm_shadow_walk_iterator iterator;
- int leaf = -1;
- u64 spte;
-
- for (shadow_walk_init(&iterator, vcpu, addr),
- *root_level = iterator.level;
- shadow_walk_okay(&iterator);
- __shadow_walk_next(&iterator, spte)) {
- leaf = iterator.level;
- spte = mmu_spte_get_lockless(iterator.sptep);
-
- sptes[leaf] = spte;
- }
-
- return leaf;
-}
-
/* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */
static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
{
@@ -4130,17 +1303,6 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
return false;
}

-static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
-{
- struct kvm_shadow_walk_iterator iterator;
- u64 spte;
-
- walk_shadow_page_lockless_begin(vcpu);
- for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
- clear_sp_write_flooding_count(iterator.sptep);
- walk_shadow_page_lockless_end(vcpu);
-}
-
static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
{
/* make sure the token value is not 0 */
@@ -5356,264 +2518,65 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.nested_mmu.root_role.word = 0;
vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
- vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
- kvm_mmu_reset_context(vcpu);
-
- /*
- * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
- * kvm_arch_vcpu_ioctl().
- */
- KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm);
-}
-
-void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
-{
- kvm_mmu_unload(vcpu);
- kvm_init_mmu(vcpu);
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
-
-int kvm_mmu_load(struct kvm_vcpu *vcpu)
-{
- int r;
-
- r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
- if (r)
- goto out;
- r = mmu_alloc_special_roots(vcpu);
- if (r)
- goto out;
- if (vcpu->arch.mmu->root_role.direct)
- r = mmu_alloc_direct_roots(vcpu);
- else
- r = mmu_alloc_shadow_roots(vcpu);
- if (r)
- goto out;
-
- kvm_mmu_sync_roots(vcpu);
-
- kvm_mmu_load_pgd(vcpu);
-
- /*
- * Flush any TLB entries for the new root, the provenance of the root
- * is unknown. Even if KVM ensures there are no stale TLB entries
- * for a freed root, in theory another hypervisor could have left
- * stale entries. Flushing on alloc also allows KVM to skip the TLB
- * flush when freeing a root (see kvm_tdp_mmu_put_root()).
- */
- static_call(kvm_x86_flush_tlb_current)(vcpu);
-out:
- return r;
-}
-
-void kvm_mmu_unload(struct kvm_vcpu *vcpu)
-{
- struct kvm *kvm = vcpu->kvm;
-
- kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root.hpa));
- kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa));
- vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
-}
-
-static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
-{
- struct kvm_mmu_page *sp;
-
- if (!VALID_PAGE(root_hpa))
- return false;
-
- /*
- * When freeing obsolete roots, treat roots as obsolete if they don't
- * have an associated shadow page. This does mean KVM will get false
- * positives and free roots that don't strictly need to be freed, but
- * such false positives are relatively rare:
- *
- * (a) only PAE paging and nested NPT has roots without shadow pages
- * (b) remote reloads due to a memslot update obsoletes _all_ roots
- * (c) KVM doesn't track previous roots for PAE paging, and the guest
- * is unlikely to zap an in-use PGD.
- */
- sp = to_shadow_page(root_hpa);
- return !sp || is_obsolete_sp(kvm, sp);
-}
-
-static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
-{
- unsigned long roots_to_free = 0;
- int i;
-
- if (is_obsolete_root(kvm, mmu->root.hpa))
- roots_to_free |= KVM_MMU_ROOT_CURRENT;
-
- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
- if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
- roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
- }
-
- if (roots_to_free)
- kvm_mmu_free_roots(kvm, mmu, roots_to_free);
-}
-
-void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
-{
- __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
- __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
-}
-
-static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
- int *bytes)
-{
- u64 gentry = 0;
- int r;
-
- /*
- * Assume that the pte write on a page table of the same type
- * as the current vcpu paging mode since we update the sptes only
- * when they have the same mode.
- */
- if (is_pae(vcpu) && *bytes == 4) {
- /* Handle a 32-bit guest writing two halves of a 64-bit gpte */
- *gpa &= ~(gpa_t)7;
- *bytes = 8;
- }
-
- if (*bytes == 4 || *bytes == 8) {
- r = kvm_vcpu_read_guest_atomic(vcpu, *gpa, &gentry, *bytes);
- if (r)
- gentry = 0;
- }
-
- return gentry;
-}
-
-/*
- * If we're seeing too many writes to a page, it may no longer be a page table,
- * or we may be forking, in which case it is better to unmap the page.
- */
-static bool detect_write_flooding(struct kvm_mmu_page *sp)
-{
- /*
- * Skip write-flooding detected for the sp whose level is 1, because
- * it can become unsync, then the guest page is not write-protected.
- */
- if (sp->role.level == PG_LEVEL_4K)
- return false;
-
- atomic_inc(&sp->write_flooding_count);
- return atomic_read(&sp->write_flooding_count) >= 3;
-}
-
-/*
- * Misaligned accesses are too much trouble to fix up; also, they usually
- * indicate a page is not used as a page table.
- */
-static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
- int bytes)
-{
- unsigned offset, pte_size, misaligned;
-
- pgprintk("misaligned: gpa %llx bytes %d role %x\n",
- gpa, bytes, sp->role.word);
-
- offset = offset_in_page(gpa);
- pte_size = sp->role.has_4_byte_gpte ? 4 : 8;
+ vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
+ kvm_mmu_reset_context(vcpu);

/*
- * Sometimes, the OS only writes the last one bytes to update status
- * bits, for example, in linux, andb instruction is used in clear_bit().
+ * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
+ * kvm_arch_vcpu_ioctl().
*/
- if (!(offset & (pte_size - 1)) && bytes == 1)
- return false;
-
- misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
- misaligned |= bytes < 4;
-
- return misaligned;
+ KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm);
}

-static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
{
- unsigned page_offset, quadrant;
- u64 *spte;
- int level;
-
- page_offset = offset_in_page(gpa);
- level = sp->role.level;
- *nspte = 1;
- if (sp->role.has_4_byte_gpte) {
- page_offset <<= 1; /* 32->64 */
- /*
- * A 32-bit pde maps 4MB while the shadow pdes map
- * only 2MB. So we need to double the offset again
- * and zap two pdes instead of one.
- */
- if (level == PT32_ROOT_LEVEL) {
- page_offset &= ~7; /* kill rounding error */
- page_offset <<= 1;
- *nspte = 2;
- }
- quadrant = page_offset >> PAGE_SHIFT;
- page_offset &= ~PAGE_MASK;
- if (quadrant != sp->role.quadrant)
- return NULL;
- }
-
- spte = &sp->spt[page_offset / sizeof(*spte)];
- return spte;
+ kvm_mmu_unload(vcpu);
+ kvm_init_mmu(vcpu);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);

-static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
- const u8 *new, int bytes,
- struct kvm_page_track_notifier_node *node)
+int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
- gfn_t gfn = gpa >> PAGE_SHIFT;
- struct kvm_mmu_page *sp;
- LIST_HEAD(invalid_list);
- u64 entry, gentry, *spte;
- int npte;
- bool flush = false;
-
- /*
- * If we don't have indirect shadow pages, it means no page is
- * write-protected, so we can exit simply.
- */
- if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
- return;
-
- pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+ int r;

- write_lock(&vcpu->kvm->mmu_lock);
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
+ if (r)
+ goto out;
+ r = mmu_alloc_special_roots(vcpu);
+ if (r)
+ goto out;
+ if (vcpu->arch.mmu->root_role.direct)
+ r = mmu_alloc_direct_roots(vcpu);
+ else
+ r = mmu_alloc_shadow_roots(vcpu);
+ if (r)
+ goto out;

- gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
+ kvm_mmu_sync_roots(vcpu);

- ++vcpu->kvm->stat.mmu_pte_write;
+ kvm_mmu_load_pgd(vcpu);

- for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) {
- if (detect_write_misaligned(sp, gpa, bytes) ||
- detect_write_flooding(sp)) {
- kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
- ++vcpu->kvm->stat.mmu_flooded;
- continue;
- }
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. Even if KVM ensures there are no stale TLB entries
+ * for a freed root, in theory another hypervisor could have left
+ * stale entries. Flushing on alloc also allows KVM to skip the TLB
+ * flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */
+ static_call(kvm_x86_flush_tlb_current)(vcpu);
+out:
+ return r;
+}

- spte = get_written_sptes(sp, gpa, &npte);
- if (!spte)
- continue;
+void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;

- while (npte--) {
- entry = *spte;
- mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
- if (gentry && sp->role.level != PG_LEVEL_4K)
- ++vcpu->kvm->stat.mmu_pde_zapped;
- if (is_shadow_present_pte(entry))
- flush = true;
- ++spte;
- }
- }
- kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
- write_unlock(&vcpu->kvm->mmu_lock);
+ kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root.hpa));
+ kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa));
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
}

int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
@@ -5782,60 +2745,6 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
}
EXPORT_SYMBOL_GPL(kvm_configure_mmu);

-/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot);
-
-static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn,
- int start_level, int end_level,
- gfn_t start_gfn, gfn_t end_gfn,
- bool flush_on_yield, bool flush)
-{
- struct slot_rmap_walk_iterator iterator;
-
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
- end_gfn, &iterator) {
- if (iterator.rmap)
- flush |= fn(kvm, iterator.rmap, slot);
-
- if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
- if (flush && flush_on_yield) {
- kvm_flush_remote_tlbs_with_address(kvm,
- start_gfn,
- iterator.gfn - start_gfn + 1);
- flush = false;
- }
- cond_resched_rwlock_write(&kvm->mmu_lock);
- }
- }
-
- return flush;
-}
-
-static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn,
- int start_level, int end_level,
- bool flush_on_yield)
-{
- return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
- slot->base_gfn, slot->base_gfn + slot->npages - 1,
- flush_on_yield, false);
-}
-
-static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn,
- bool flush_on_yield)
-{
- return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
-}
-
static void free_mmu_pages(struct kvm_mmu *mmu)
{
if (!tdp_enabled && mmu->pae_root)
@@ -5927,63 +2836,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
return ret;
}

-#define BATCH_ZAP_PAGES 10
-static void kvm_zap_obsolete_pages(struct kvm *kvm)
-{
- struct kvm_mmu_page *sp, *node;
- int nr_zapped, batch = 0;
- bool unstable;
-
-restart:
- list_for_each_entry_safe_reverse(sp, node,
- &kvm->arch.active_mmu_pages, link) {
- /*
- * No obsolete valid page exists before a newly created page
- * since active_mmu_pages is a FIFO list.
- */
- if (!is_obsolete_sp(kvm, sp))
- break;
-
- /*
- * Invalid pages should never land back on the list of active
- * pages. Skip the bogus page, otherwise we'll get stuck in an
- * infinite loop if the page gets put back on the list (again).
- */
- if (WARN_ON(sp->role.invalid))
- continue;
-
- /*
- * No need to flush the TLB since we're only zapping shadow
- * pages with an obsolete generation number and all vCPUS have
- * loaded a new root, i.e. the shadow pages being zapped cannot
- * be in active use by the guest.
- */
- if (batch >= BATCH_ZAP_PAGES &&
- cond_resched_rwlock_write(&kvm->mmu_lock)) {
- batch = 0;
- goto restart;
- }
-
- unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
- &kvm->arch.zapped_obsolete_pages, &nr_zapped);
- batch += nr_zapped;
-
- if (unstable)
- goto restart;
- }
-
- /*
- * Kick all vCPUs (via remote TLB flush) before freeing the page tables
- * to ensure KVM is not in the middle of a lockless shadow page table
- * walk, which may reference the pages. The remote TLB flush itself is
- * not required and is simply a convenient way to kick vCPUs as needed.
- * KVM performs a local TLB flush when allocating a new root (see
- * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
- * running with an obsolete MMU.
- */
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
-}
-
/*
* Fast invalidate all shadow pages and use lock-break technique
* to zap obsolete pages.
@@ -6044,11 +2896,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}

-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
- return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
@@ -6106,37 +2953,6 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
mmu_free_vm_memory_caches(kvm);
}

-static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
-{
- const struct kvm_memory_slot *memslot;
- struct kvm_memslots *slots;
- struct kvm_memslot_iter iter;
- bool flush = false;
- gfn_t start, end;
- int i;
-
- if (!kvm_memslots_have_rmaps(kvm))
- return flush;
-
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- slots = __kvm_memslots(kvm, i);
-
- kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
- memslot = iter.slot;
- start = max(gfn_start, memslot->base_gfn);
- end = min(gfn_end, memslot->base_gfn + memslot->npages);
- if (WARN_ON_ONCE(start >= end))
- continue;
-
- flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
- PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
- start, end - 1, true, flush);
- }
- }
-
- return flush;
-}
-
/*
* Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
* (not including it)
@@ -6170,13 +2986,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
write_unlock(&kvm->mmu_lock);
}

-static bool slot_rmap_write_protect(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
-{
- return rmap_write_protect(rmap_head, false);
-}
-
void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
int start_level)
@@ -6248,182 +3057,6 @@ int topup_split_caches(struct kvm *kvm)
return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
}

-static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
-{
- struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
- struct shadow_page_caches caches = {};
- union kvm_mmu_page_role role;
- unsigned int access;
- gfn_t gfn;
-
- gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
- access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
-
- /*
- * Note, huge page splitting always uses direct shadow pages, regardless
- * of whether the huge page itself is mapped by a direct or indirect
- * shadow page, since the huge page region itself is being directly
- * mapped with smaller pages.
- */
- role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
-
- /* Direct SPs do not require a shadowed_info_cache. */
- caches.page_header_cache = &kvm->arch.split_page_header_cache;
- caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
-
- /* Safe to pass NULL for vCPU since requesting a direct SP. */
- return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
-}
-
-static void shadow_mmu_split_huge_page(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- u64 *huge_sptep)
-
-{
- struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
- u64 huge_spte = READ_ONCE(*huge_sptep);
- struct kvm_mmu_page *sp;
- bool flush = false;
- u64 *sptep, spte;
- gfn_t gfn;
- int index;
-
- sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
-
- for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
- sptep = &sp->spt[index];
- gfn = kvm_mmu_page_get_gfn(sp, index);
-
- /*
- * The SP may already have populated SPTEs, e.g. if this huge
- * page is aliased by multiple sptes with the same access
- * permissions. These entries are guaranteed to map the same
- * gfn-to-pfn translation since the SP is direct, so no need to
- * modify them.
- *
- * However, if a given SPTE points to a lower level page table,
- * that lower level page table may only be partially populated.
- * Installing such SPTEs would effectively unmap a potion of the
- * huge page. Unmapping guest memory always requires a TLB flush
- * since a subsequent operation on the unmapped regions would
- * fail to detect the need to flush.
- */
- if (is_shadow_present_pte(*sptep)) {
- flush |= !is_last_spte(*sptep, sp->role.level);
- continue;
- }
-
- spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
- mmu_spte_set(sptep, spte);
- __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
- }
-
- __link_shadow_page(kvm, cache, huge_sptep, sp, flush);
-}
-
-static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- u64 *huge_sptep)
-{
- struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
- int level, r = 0;
- gfn_t gfn;
- u64 spte;
-
- /* Grab information for the tracepoint before dropping the MMU lock. */
- gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
- level = huge_sp->role.level;
- spte = *huge_sptep;
-
- if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
- r = -ENOSPC;
- goto out;
- }
-
- if (need_topup_split_caches_or_resched(kvm)) {
- write_unlock(&kvm->mmu_lock);
- cond_resched();
- /*
- * If the topup succeeds, return -EAGAIN to indicate that the
- * rmap iterator should be restarted because the MMU lock was
- * dropped.
- */
- r = topup_split_caches(kvm) ?: -EAGAIN;
- write_lock(&kvm->mmu_lock);
- goto out;
- }
-
- shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
-
-out:
- trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
- return r;
-}
-
-static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
-{
- struct rmap_iterator iter;
- struct kvm_mmu_page *sp;
- u64 *huge_sptep;
- int r;
-
-restart:
- for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
- sp = sptep_to_sp(huge_sptep);
-
- /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
- if (WARN_ON_ONCE(!sp->role.guest_mode))
- continue;
-
- /* The rmaps should never contain non-leaf SPTEs. */
- if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
- continue;
-
- /* SPs with level >PG_LEVEL_4K should never by unsync. */
- if (WARN_ON_ONCE(sp->unsync))
- continue;
-
- /* Don't bother splitting huge pages on invalid SPs. */
- if (sp->role.invalid)
- continue;
-
- r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);
-
- /*
- * The split succeeded or needs to be retried because the MMU
- * lock was dropped. Either way, restart the iterator to get it
- * back into a consistent state.
- */
- if (!r || r == -EAGAIN)
- goto restart;
-
- /* The split failed and shouldn't be retried (e.g. -ENOMEM). */
- break;
- }
-
- return false;
-}
-
-static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- gfn_t start, gfn_t end,
- int target_level)
-{
- int level;
-
- /*
- * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
- * down to the target level. This ensures pages are recursively split
- * all the way to the target level. There's no need to split pages
- * already at the target level.
- */
- for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
- __walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
- level, level, start, end - 1, true, false);
-}
-
/* Must be called with the mmu_lock held in write-mode. */
void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
@@ -6475,56 +3108,6 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
*/
}

-static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
-{
- u64 *sptep;
- struct rmap_iterator iter;
- int need_tlb_flush = 0;
- struct kvm_mmu_page *sp;
-
-restart:
- for_each_rmap_spte(rmap_head, &iter, sptep) {
- sp = sptep_to_sp(sptep);
-
- /*
- * We cannot do huge page mapping for indirect shadow pages,
- * which are found on the last rmap (level = 1) when not using
- * tdp; such shadow pages are synced with the page table in
- * the guest, and the guest page table is using 4K page size
- * mapping if the indirect sp has level = 1.
- */
- if (sp->role.direct &&
- sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
- PG_LEVEL_NUM)) {
- kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
-
- if (kvm_available_flush_tlb_with_range())
- kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
- KVM_PAGES_PER_HPAGE(sp->role.level));
- else
- need_tlb_flush = 1;
-
- goto restart;
- }
- }
-
- return need_tlb_flush;
-}
-
-static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot)
-{
- /*
- * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
- * pages that are already mapped at the maximum hugepage level.
- */
- if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte,
- PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
- kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
-}
-
void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
{
@@ -6635,65 +3218,6 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}

-static unsigned long mmu_shrink_scan(struct shrinker *shrink,
- struct shrink_control *sc)
-{
- struct kvm *kvm;
- int nr_to_scan = sc->nr_to_scan;
- unsigned long freed = 0;
-
- mutex_lock(&kvm_lock);
-
- list_for_each_entry(kvm, &vm_list, vm_list) {
- int idx;
- LIST_HEAD(invalid_list);
-
- /*
- * Never scan more than sc->nr_to_scan VM instances.
- * Will not hit this condition practically since we do not try
- * to shrink more than one VM and it is very unlikely to see
- * !n_used_mmu_pages so many times.
- */
- if (!nr_to_scan--)
- break;
- /*
- * n_used_mmu_pages is accessed without holding kvm->mmu_lock
- * here. We may skip a VM instance errorneosly, but we do not
- * want to shrink a VM that only started to populate its MMU
- * anyway.
- */
- if (!kvm->arch.n_used_mmu_pages &&
- !kvm_has_zapped_obsolete_pages(kvm))
- continue;
-
- idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
-
- if (kvm_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm,
- &kvm->arch.zapped_obsolete_pages);
- goto unlock;
- }
-
- freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
- write_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
-
- /*
- * unfair on small ones
- * per-vm shrinkers cry out
- * sadness comes quickly
- */
- list_move_tail(&kvm->vm_list, &vm_list);
- break;
- }
-
- mutex_unlock(&kvm_lock);
- return freed;
-}
-
static unsigned long mmu_shrink_count(struct shrinker *shrink,
struct shrink_control *sc)
{
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 95f0adfb3b0b4..9c1399762496b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -44,6 +44,8 @@ extern bool dbg;
#define INVALID_PAE_ROOT 0
#define IS_VALID_PAE_ROOT(x) (!!(x))

+#define PTE_PREFETCH_NUM 8
+
typedef u64 __rcu *tdp_ptep_t;

struct kvm_mmu_page {
@@ -168,8 +170,6 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
int min_level);
void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
u64 start_gfn, u64 pages);
-unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
-
extern int nx_huge_pages;
static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
{
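
The hunk above drops the pte_list_count() declaration from mmu_internal.h. That function walks the rmap_head encoding, in which bit zero of the value distinguishes a direct pointer to a single SPTE from a pointer to a chain of descriptors. A minimal userspace sketch of that encoding follows; the sketch_* names and the descriptor size are hypothetical and chosen only for illustration, they are not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's types or sizes. */
#define SKETCH_LIST_EXT 4

struct sketch_desc {
	struct sketch_desc *more;		/* next descriptor, or NULL */
	unsigned int count;			/* entries used in sptes[] */
	uint64_t *sptes[SKETCH_LIST_EXT];
};

struct sketch_rmap_head {
	uintptr_t val;				/* 0, an spte pointer, or (desc | 1) */
};

/* Bit 0 clear: the head points directly at the only spte in the chain. */
static void sketch_set_single(struct sketch_rmap_head *head, uint64_t *spte)
{
	head->val = (uintptr_t)spte;
}

/* Bit 0 set: the head points at a descriptor holding several sptes. */
static void sketch_set_chain(struct sketch_rmap_head *head,
			     struct sketch_desc *desc)
{
	/* The tag only works if alignment leaves bit 0 of the pointer free. */
	assert(((uintptr_t)desc & 1) == 0);
	head->val = (uintptr_t)desc | 1;
}

/* Count the entries reachable from the head, decoding the tag bit first. */
static unsigned int sketch_count(const struct sketch_rmap_head *head)
{
	const struct sketch_desc *desc;
	unsigned int count = 0;

	if (!head->val)
		return 0;
	if (!(head->val & 1))
		return 1;

	desc = (const struct sketch_desc *)(head->val & ~(uintptr_t)1);
	for (; desc; desc = desc->more)
		count += desc->count;

	return count;
}
```

The single-entry case needs no allocation at all, which is presumably why the encoding keeps it separate from the descriptor chain.
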
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index eee5a6796d9b0..f3e2ed5b675eb 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -21,3 +21,3421 @@
#include <asm/vmx.h>
#include <asm/cmpxchg.h>
#include <trace/events/kvm.h>
+
+#define for_each_shadow_entry(_vcpu, _addr, _walker) \
+ for (shadow_walk_init(&(_walker), _vcpu, _addr); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
+#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \
+ for (shadow_walk_init(&(_walker), _vcpu, _addr); \
+ shadow_walk_okay(&(_walker)) && \
+ ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
+ __shadow_walk_next(&(_walker), spte))
+
+static void mmu_spte_set(u64 *sptep, u64 spte);
+
+void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
+ unsigned int access)
+{
+ u64 spte = make_mmio_spte(vcpu, gfn, access);
+
+ trace_mark_mmio_spte(sptep, gfn, spte);
+ mmu_spte_set(sptep, spte);
+}
+
+#ifdef CONFIG_X86_64
+static void __set_spte(u64 *sptep, u64 spte)
+{
+ WRITE_ONCE(*sptep, spte);
+}
+
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+ WRITE_ONCE(*sptep, spte);
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+ return xchg(sptep, spte);
+}
+
+static u64 __get_spte_lockless(u64 *sptep)
+{
+ return READ_ONCE(*sptep);
+}
+#else
+union split_spte {
+ struct {
+ u32 spte_low;
+ u32 spte_high;
+ };
+ u64 spte;
+};
+
+static void count_spte_clear(u64 *sptep, u64 spte)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+ if (is_shadow_present_pte(spte))
+ return;
+
+ /* Ensure the spte is completely set before we increase the count */
+ smp_wmb();
+ sp->clear_spte_count++;
+}
+
+static void __set_spte(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ ssptep->spte_high = sspte.spte_high;
+
+ /*
+ * If we map the spte from nonpresent to present, we should store
+ * the high bits first, then set the present bit, so the CPU cannot
+ * fetch this spte while we are setting it.
+ */
+ smp_wmb();
+
+ WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
+}
+
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
+
+ /*
+ * If we map the spte from present to nonpresent, we should clear the
+ * present bit first so that the vCPU cannot fetch the old high bits.
+ */
+ smp_wmb();
+
+ ssptep->spte_high = sspte.spte_high;
+ count_spte_clear(sptep, spte);
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte, orig;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ /* xchg acts as a barrier before the setting of the high bits */
+ orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
+ orig.spte_high = ssptep->spte_high;
+ ssptep->spte_high = sspte.spte_high;
+ count_spte_clear(sptep, spte);
+
+ return orig.spte;
+}
+
+/*
+ * The idea of using this lightweight way to get the spte on an x86_32
+ * guest is from gup_get_pte (mm/gup.c).
+ *
+ * An spte tlb flush may be pending, because kvm_set_pte_rmap
+ * coalesces them and we are running outside of the MMU lock. Therefore
+ * we need to protect against in-progress updates of the spte.
+ *
+ * Reading the spte while an update is in progress may get the old value
+ * for the high part of the spte. The race is fine for a present->non-present
+ * change (because the high part of the spte is ignored for non-present spte),
+ * but for a present->present change we must reread the spte.
+ *
+ * All such changes are done in two steps (present->non-present and
+ * non-present->present), hence it is enough to count the number of
+ * present->non-present updates: if it changed while reading the spte,
+ * we might have hit the race. This is done using clear_spte_count.
+ */
+static u64 __get_spte_lockless(u64 *sptep)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ union split_spte spte, *orig = (union split_spte *)sptep;
+ int count;
+
+retry:
+ count = sp->clear_spte_count;
+ smp_rmb();
+
+ spte.spte_low = orig->spte_low;
+ smp_rmb();
+
+ spte.spte_high = orig->spte_high;
+ smp_rmb();
+
+ if (unlikely(spte.spte_low != orig->spte_low ||
+ count != sp->clear_spte_count))
+ goto retry;
+
+ return spte.spte;
+}
+#endif
+
+/* Rules for using mmu_spte_set:
+ * Set the sptep from nonpresent to present.
+ * Note: the sptep being assigned *must* be either not present
+ * or in a state where the hardware will not attempt to update
+ * the spte.
+ */
+static void mmu_spte_set(u64 *sptep, u64 new_spte)
+{
+ WARN_ON(is_shadow_present_pte(*sptep));
+ __set_spte(sptep, new_spte);
+}
+
+/*
+ * Update the SPTE (excluding the PFN), but do not track changes in its
+ * accessed/dirty status.
+ */
+static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+{
+ u64 old_spte = *sptep;
+
+ WARN_ON(!is_shadow_present_pte(new_spte));
+ check_spte_writable_invariants(new_spte);
+
+ if (!is_shadow_present_pte(old_spte)) {
+ mmu_spte_set(sptep, new_spte);
+ return old_spte;
+ }
+
+ if (!spte_has_volatile_bits(old_spte))
+ __update_clear_spte_fast(sptep, new_spte);
+ else
+ old_spte = __update_clear_spte_slow(sptep, new_spte);
+
+ WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
+ return old_spte;
+}
+
+/* Rules for using mmu_spte_update:
+ * Update the state bits only, i.e. the mapped pfn is not changed.
+ *
+ * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
+ * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
+ * spte, even though the writable spte might be cached on a CPU's TLB.
+ *
+ * Returns true if the TLB needs to be flushed
+ */
+bool mmu_spte_update(u64 *sptep, u64 new_spte)
+{
+ bool flush = false;
+ u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
+
+ if (!is_shadow_present_pte(old_spte))
+ return false;
+
+ /*
+ * This is safe even for an spte updated outside of mmu-lock, since
+ * we always update it atomically; see the comments in
+ * spte_has_volatile_bits().
+ */
+ if (is_mmu_writable_spte(old_spte) &&
+ !is_writable_pte(new_spte))
+ flush = true;
+
+ /*
+ * Flush TLB when accessed/dirty states are changed in the page tables,
+ * to guarantee consistency between TLB and page tables.
+ */
+
+ if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+ flush = true;
+ kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ }
+
+ if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
+ flush = true;
+ kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ }
+
+ return flush;
+}
+
+/*
+ * Rules for using mmu_spte_clear_track_bits:
+ * Set the sptep from present to nonpresent and track the state
+ * bits. It is used to clear the last-level sptep.
+ * Returns the old PTE.
+ */
+static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
+{
+ kvm_pfn_t pfn;
+ u64 old_spte = *sptep;
+ int level = sptep_to_sp(sptep)->role.level;
+ struct page *page;
+
+ if (!is_shadow_present_pte(old_spte) ||
+ !spte_has_volatile_bits(old_spte))
+ __update_clear_spte_fast(sptep, 0ull);
+ else
+ old_spte = __update_clear_spte_slow(sptep, 0ull);
+
+ if (!is_shadow_present_pte(old_spte))
+ return old_spte;
+
+ kvm_update_page_stats(kvm, level, -1);
+
+ pfn = spte_to_pfn(old_spte);
+
+ /*
+ * KVM doesn't hold a reference to any pages mapped into the guest, and
+ * instead uses the mmu_notifier to ensure that KVM unmaps any pages
+ * before they are reclaimed. Sanity check that, if the pfn is backed
+ * by a refcounted page, the refcount is elevated.
+ */
+ page = kvm_pfn_to_refcounted_page(pfn);
+ WARN_ON(page && !page_count(page));
+
+ if (is_accessed_spte(old_spte))
+ kvm_set_pfn_accessed(pfn);
+
+ if (is_dirty_spte(old_spte))
+ kvm_set_pfn_dirty(pfn);
+
+ return old_spte;
+}
+
+/*
+ * Rules for using mmu_spte_clear_no_track:
+ * Directly clear the spte without caring about its state bits.
+ * It is used to set the upper-level spte.
+ */
+void mmu_spte_clear_no_track(u64 *sptep)
+{
+ __update_clear_spte_fast(sptep, 0ull);
+}
+
+static u64 mmu_spte_get_lockless(u64 *sptep)
+{
+ return __get_spte_lockless(sptep);
+}
+
+/* Returns the Accessed status of the PTE and resets it at the same time. */
+static bool mmu_spte_age(u64 *sptep)
+{
+ u64 spte = mmu_spte_get_lockless(sptep);
+
+ if (!is_accessed_spte(spte))
+ return false;
+
+ if (spte_ad_enabled(spte)) {
+ clear_bit((ffs(shadow_accessed_mask) - 1),
+ (unsigned long *)sptep);
+ } else {
+ /*
+ * Capture the dirty status of the page, so that it doesn't get
+ * lost when the SPTE is marked for access tracking.
+ */
+ if (is_writable_pte(spte))
+ kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+ spte = mark_spte_for_access_track(spte);
+ mmu_spte_update_no_track(sptep, spte);
+ }
+
+ return true;
+}
+
+static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
+{
+ kmem_cache_free(pte_list_desc_cache, pte_list_desc);
+}
+
+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
+gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
+{
+ if (sp->role.passthrough)
+ return sp->gfn;
+
+ if (!sp->role.direct)
+ return sp->shadowed_translation[index] >> PAGE_SHIFT;
+
+ return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
+}
+
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have more constrained access permissions than
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+ if (sp_has_gptes(sp))
+ return sp->shadowed_translation[index] & ACC_ALL;
+
+ /*
+ * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+ * KVM is not shadowing any guest page tables, so the "guest access
+ * permissions" are just ACC_ALL.
+ *
+ * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
+ * is shadowing a guest huge page with small pages, the guest access
+ * permissions being shadowed are the access permissions of the huge
+ * page.
+ *
+ * In both cases, sp->role.access contains the correct access bits.
+ */
+ return sp->role.access;
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
+ gfn_t gfn, unsigned int access)
+{
+ if (sp_has_gptes(sp)) {
+ sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
+ return;
+ }
+
+ WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
+ "access mismatch under %s page %llx (expected %u, got %u)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+
+ WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
+ "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
+ unsigned int access)
+{
+ gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ kvm_mmu_page_set_translation(sp, index, gfn, access);
+}
+
+static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ gfn_t gfn;
+
+ kvm->arch.indirect_shadow_pages++;
+ gfn = sp->gfn;
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+ slot = __gfn_to_memslot(slots, gfn);
+
+ /* Non-leaf shadow pages are kept read-only. */
+ if (sp->role.level > PG_LEVEL_4K)
+ return kvm_slot_page_track_add_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+
+ kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+ if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+ kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+}
+
+static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ gfn_t gfn;
+
+ kvm->arch.indirect_shadow_pages--;
+ gfn = sp->gfn;
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+ slot = __gfn_to_memslot(slots, gfn);
+ if (sp->role.level > PG_LEVEL_4K)
+ return kvm_slot_page_track_remove_page(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+
+ kvm_mmu_gfn_allow_lpage(slot, gfn);
+}
+
+/*
+ * About rmap_head encoding:
+ *
+ * If bit zero of rmap_head->val is clear, then it points to the only spte
+ * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * pte_list_desc containing more mappings.
+ */
+
+/*
+ * Returns the number of pointers in the rmap chain, not counting the new one.
+ */
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
+ struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc;
+ int count = 0;
+
+ if (!rmap_head->val) {
+ rmap_printk("%p %llx 0->1\n", spte, *spte);
+ rmap_head->val = (unsigned long)spte;
+ } else if (!(rmap_head->val & 1)) {
+ rmap_printk("%p %llx 1->many\n", spte, *spte);
+ desc = kvm_mmu_memory_cache_alloc(cache);
+ desc->sptes[0] = (u64 *)rmap_head->val;
+ desc->sptes[1] = spte;
+ desc->spte_count = 2;
+ rmap_head->val = (unsigned long)desc | 1;
+ ++count;
+ } else {
+ rmap_printk("%p %llx many->many\n", spte, *spte);
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+ while (desc->spte_count == PTE_LIST_EXT) {
+ count += PTE_LIST_EXT;
+ if (!desc->more) {
+ desc->more = kvm_mmu_memory_cache_alloc(cache);
+ desc = desc->more;
+ desc->spte_count = 0;
+ break;
+ }
+ desc = desc->more;
+ }
+ count += desc->spte_count;
+ desc->sptes[desc->spte_count++] = spte;
+ }
+ return count;
+}
+
+static void pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
+ struct pte_list_desc *desc, int i,
+ struct pte_list_desc *prev_desc)
+{
+ int j = desc->spte_count - 1;
+
+ desc->sptes[i] = desc->sptes[j];
+ desc->sptes[j] = NULL;
+ desc->spte_count--;
+ if (desc->spte_count)
+ return;
+ if (!prev_desc && !desc->more)
+ rmap_head->val = 0;
+ else
+ if (prev_desc)
+ prev_desc->more = desc->more;
+ else
+ rmap_head->val = (unsigned long)desc->more | 1;
+ mmu_free_pte_list_desc(desc);
+}
+
+static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc;
+ struct pte_list_desc *prev_desc;
+ int i;
+
+ if (!rmap_head->val) {
+ pr_err("%s: %p 0->BUG\n", __func__, spte);
+ BUG();
+ } else if (!(rmap_head->val & 1)) {
+ rmap_printk("%p 1->0\n", spte);
+ if ((u64 *)rmap_head->val != spte) {
+ pr_err("%s: %p 1->BUG\n", __func__, spte);
+ BUG();
+ }
+ rmap_head->val = 0;
+ } else {
+ rmap_printk("%p many->many\n", spte);
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+ prev_desc = NULL;
+ while (desc) {
+ for (i = 0; i < desc->spte_count; ++i) {
+ if (desc->sptes[i] == spte) {
+ pte_list_desc_remove_entry(rmap_head,
+ desc, i, prev_desc);
+ return;
+ }
+ }
+ prev_desc = desc;
+ desc = desc->more;
+ }
+ pr_err("%s: %p many->many\n", __func__, spte);
+ BUG();
+ }
+}
+
+static void kvm_zap_one_rmap_spte(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head, u64 *sptep)
+{
+ mmu_spte_clear_track_bits(kvm, sptep);
+ pte_list_remove(sptep, rmap_head);
+}
+
+/* Return true if at least one SPTE was zapped, false otherwise */
+static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc, *next;
+ int i;
+
+ if (!rmap_head->val)
+ return false;
+
+ if (!(rmap_head->val & 1)) {
+ mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+ goto out;
+ }
+
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+ for (; desc; desc = next) {
+ for (i = 0; i < desc->spte_count; i++)
+ mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
+ next = desc->more;
+ mmu_free_pte_list_desc(desc);
+ }
+out:
+ /* rmap_head is meaningless now, remember to reset it */
+ rmap_head->val = 0;
+ return true;
+}
+
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc;
+ unsigned int count = 0;
+
+ if (!rmap_head->val)
+ return 0;
+ else if (!(rmap_head->val & 1))
+ return 1;
+
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+ while (desc) {
+ count += desc->spte_count;
+ desc = desc->more;
+ }
+
+ return count;
+}
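
The pte_list helpers above all decode rmap_head->val the same way: zero means empty, a clear bit 0 means val is itself a single spte pointer, and a set bit 0 means val (with that bit masked off) points to a descriptor chain. A minimal userspace sketch of that encoding follows. All demo_* names and the capacity DEMO_LIST_EXT are hypothetical and are not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LIST_EXT 14	/* hypothetical per-descriptor capacity */

struct demo_desc {
	struct demo_desc *more;		/* next descriptor, or NULL */
	unsigned int spte_count;	/* used slots in sptes[] */
	uint64_t *sptes[DEMO_LIST_EXT];
};

struct demo_rmap_head {
	uintptr_t val;			/* 0, spte pointer, or desc pointer | 1 */
};

/* One entry: store the spte pointer directly, bit 0 stays clear. */
static void demo_set_single(struct demo_rmap_head *head, uint64_t *spte)
{
	head->val = (uintptr_t)spte;
}

/* Many entries: store the descriptor pointer and tag it with bit 0. */
static void demo_set_desc(struct demo_rmap_head *head, struct demo_desc *desc)
{
	head->val = (uintptr_t)desc | 1;
}

/* Count entries by decoding the three possible states of val. */
static unsigned int demo_count(const struct demo_rmap_head *head)
{
	const struct demo_desc *desc;
	unsigned int count = 0;

	if (!head->val)
		return 0;
	if (!(head->val & 1))
		return 1;

	desc = (const struct demo_desc *)(head->val & ~(uintptr_t)1);
	for (; desc; desc = desc->more)
		count += desc->spte_count;

	return count;
}
```

The tag bit is free only because both spte pointers and descriptors are aligned to at least two bytes, so bit 0 of a real pointer is always zero.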
+
+struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+ const struct kvm_memory_slot *slot)
+{
+ unsigned long idx;
+
+ idx = gfn_to_index(gfn, slot->base_gfn, level);
+ return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
+}
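
gfn_to_rmap() picks the per-level rmap array and indexes it by the gfn's offset from the slot base, in units of that level's page size. The sketch below shows that index arithmetic in isolation. It assumes levels are numbered 1 (4K), 2 (2M), 3 (1G) and that each level adds 9 gfn bits; the demo_* helpers are hypothetical stand-ins, not the kernel's gfn_to_index().

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_gfn_t;

/* Assumed: each paging level above 4K covers 9 more gfn bits. */
static unsigned int demo_hpage_gfn_shift(int level)
{
	return (unsigned int)(level - 1) * 9;
}

/* Offset of @gfn from @base_gfn, counted in pages of size @level. */
static uint64_t demo_gfn_to_index(demo_gfn_t gfn, demo_gfn_t base_gfn,
				  int level)
{
	unsigned int shift = demo_hpage_gfn_shift(level);

	return (gfn >> shift) - (base_gfn >> shift);
}
```

Shifting both values before subtracting keeps the index correct even when the slot base is not aligned to the huge page size.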
+
+bool rmap_can_add(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_memory_cache *mc;
+
+ mc = &vcpu->arch.mmu_pte_list_desc_cache;
+ return kvm_mmu_memory_cache_nr_free_objects(mc);
+}
+
+static void rmap_remove(struct kvm *kvm, u64 *spte)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ struct kvm_mmu_page *sp;
+ gfn_t gfn;
+ struct kvm_rmap_head *rmap_head;
+
+ sp = sptep_to_sp(spte);
+ gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte));
+
+ /*
+ * Unlike rmap_add, rmap_remove does not run in the context of a vCPU
+ * so we have to determine which memslots to use based on context
+ * information in sp->role.
+ */
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+
+ slot = __gfn_to_memslot(slots, gfn);
+ rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
+
+ pte_list_remove(spte, rmap_head);
+}
+
+/*
+ * Used by the following functions to iterate through the sptes linked by an
+ * rmap. All fields are private and must not be used outside these functions.
+ */
+struct rmap_iterator {
+ /* private fields */
+ struct pte_list_desc *desc; /* holds the sptep if not NULL */
+ int pos; /* index of the sptep */
+};
+
+/*
+ * Iteration must be started by this function. This should also be used after
+ * removing/dropping sptes from the rmap link because in such cases the
+ * information in the iterator may not be valid.
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
+ struct rmap_iterator *iter)
+{
+ u64 *sptep;
+
+ if (!rmap_head->val)
+ return NULL;
+
+ if (!(rmap_head->val & 1)) {
+ iter->desc = NULL;
+ sptep = (u64 *)rmap_head->val;
+ goto out;
+ }
+
+ iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+ iter->pos = 0;
+ sptep = iter->desc->sptes[iter->pos];
+out:
+ BUG_ON(!is_shadow_present_pte(*sptep));
+ return sptep;
+}
+
+/*
+ * Must be used with a valid iterator: e.g. after rmap_get_first().
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_next(struct rmap_iterator *iter)
+{
+ u64 *sptep;
+
+ if (iter->desc) {
+ if (iter->pos < PTE_LIST_EXT - 1) {
+ ++iter->pos;
+ sptep = iter->desc->sptes[iter->pos];
+ if (sptep)
+ goto out;
+ }
+
+ iter->desc = iter->desc->more;
+
+ if (iter->desc) {
+ iter->pos = 0;
+ /* desc->sptes[0] cannot be NULL */
+ sptep = iter->desc->sptes[iter->pos];
+ goto out;
+ }
+ }
+
+ return NULL;
+out:
+ BUG_ON(!is_shadow_present_pte(*sptep));
+ return sptep;
+}
+
+#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \
+ for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \
+ _spte_; _spte_ = rmap_get_next(_iter_))
+
+void drop_spte(struct kvm *kvm, u64 *sptep)
+{
+ u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep);
+
+ if (is_shadow_present_pte(old_spte))
+ rmap_remove(kvm, sptep);
+}
+
+static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(sptep);
+ WARN_ON(sp->role.level == PG_LEVEL_4K);
+
+ drop_spte(kvm, sptep);
+
+ if (flush)
+ kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+ KVM_PAGES_PER_HPAGE(sp->role.level));
+}
+
+/*
+ * Write-protect the specified @sptep. @pt_protect indicates whether the
+ * write-protection is caused by protecting a shadow page table.
+ *
+ * Note: write protection differs between dirty logging and spte
+ * protection:
+ * - for dirty logging, the spte can be set to writable at any time if
+ * its dirty bitmap is properly set.
+ * - for spte protection, the spte can be writable only after unsync-ing
+ * the shadow page.
+ *
+ * Return true if the TLB needs to be flushed.
+ */
+static bool spte_write_protect(u64 *sptep, bool pt_protect)
+{
+ u64 spte = *sptep;
+
+ if (!is_writable_pte(spte) &&
+ !(pt_protect && is_mmu_writable_spte(spte)))
+ return false;
+
+ rmap_printk("spte %p %llx\n", sptep, *sptep);
+
+ if (pt_protect)
+ spte &= ~shadow_mmu_writable_mask;
+ spte = spte & ~PT_WRITABLE_MASK;
+
+ return mmu_spte_update(sptep, spte);
+}
+
+bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool flush = false;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep)
+ flush |= spte_write_protect(sptep, pt_protect);
+
+ return flush;
+}
+
+static bool spte_clear_dirty(u64 *sptep)
+{
+ u64 spte = *sptep;
+
+ rmap_printk("spte %p %llx\n", sptep, *sptep);
+
+ MMU_WARN_ON(!spte_ad_enabled(spte));
+ spte &= ~shadow_dirty_mask;
+ return mmu_spte_update(sptep, spte);
+}
+
+static bool spte_wrprot_for_clear_dirty(u64 *sptep)
+{
+ bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
+ (unsigned long *)sptep);
+ if (was_writable && !spte_ad_enabled(*sptep))
+ kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+
+ return was_writable;
+}
+
+/*
+ * Gets the GFN ready for another round of dirty logging by clearing the
+ * - D bit on ad-enabled SPTEs, and
+ * - W bit on ad-disabled SPTEs.
+ * Returns true iff any D or W bits were cleared.
+ */
+bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool flush = false;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep)
+ if (spte_ad_need_write_protect(*sptep))
+ flush |= spte_wrprot_for_clear_dirty(sptep);
+ else
+ flush |= spte_clear_dirty(sptep);
+
+ return flush;
+}
+
+static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ return kvm_zap_all_rmap_sptes(kvm, rmap_head);
+}
+
+bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused)
+{
+ return __kvm_zap_rmap(kvm, rmap_head, slot);
+}
+
+bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t pte)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool need_flush = false;
+ u64 new_spte;
+ kvm_pfn_t new_pfn;
+
+ WARN_ON(pte_huge(pte));
+ new_pfn = pte_pfn(pte);
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ rmap_printk("spte %p %llx gfn %llx (%d)\n",
+ sptep, *sptep, gfn, level);
+
+ need_flush = true;
+
+ if (pte_write(pte)) {
+ kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
+ goto restart;
+ } else {
+ new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+ *sptep, new_pfn);
+
+ mmu_spte_clear_track_bits(kvm, sptep);
+ mmu_spte_set(sptep, new_spte);
+ }
+ }
+
+ if (need_flush && kvm_available_flush_tlb_with_range()) {
+ kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+ return false;
+ }
+
+ return need_flush;
+}
+
+struct slot_rmap_walk_iterator {
+ /* input fields. */
+ const struct kvm_memory_slot *slot;
+ gfn_t start_gfn;
+ gfn_t end_gfn;
+ int start_level;
+ int end_level;
+
+ /* output fields. */
+ gfn_t gfn;
+ struct kvm_rmap_head *rmap;
+ int level;
+
+ /* private field. */
+ struct kvm_rmap_head *end_rmap;
+};
+
+static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator,
+ int level)
+{
+ iterator->level = level;
+ iterator->gfn = iterator->start_gfn;
+ iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot);
+ iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
+}
+
+static void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
+ const struct kvm_memory_slot *slot,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn)
+{
+ iterator->slot = slot;
+ iterator->start_level = start_level;
+ iterator->end_level = end_level;
+ iterator->start_gfn = start_gfn;
+ iterator->end_gfn = end_gfn;
+
+ rmap_walk_init_level(iterator, iterator->start_level);
+}
+
+static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator)
+{
+ return !!iterator->rmap;
+}
+
+static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
+{
+ while (++iterator->rmap <= iterator->end_rmap) {
+ iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
+
+ if (iterator->rmap->val)
+ return;
+ }
+
+ if (++iterator->level > iterator->end_level) {
+ iterator->rmap = NULL;
+ return;
+ }
+
+ rmap_walk_init_level(iterator, iterator->level);
+}
+
+#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \
+ _start_gfn, _end_gfn, _iter_) \
+ for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \
+ _end_level_, _start_gfn, _end_gfn); \
+ slot_rmap_walk_okay(_iter_); \
+ slot_rmap_walk_next(_iter_))
+
+__always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ rmap_handler_t handler)
+{
+ struct slot_rmap_walk_iterator iterator;
+ bool ret = false;
+
+ for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+ range->start, range->end - 1, &iterator)
+ ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
+ iterator.level, range->pte);
+
+ return ret;
+}
+
+bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ int young = 0;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep)
+ young |= mmu_spte_age(sptep);
+
+ return young;
+}
+
+bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int level, pte_t unused)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep)
+ if (is_accessed_spte(*sptep))
+ return true;
+ return false;
+}
+
+#define RMAP_RECYCLE_THRESHOLD 1000
+
+static void __rmap_add(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache,
+ const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn, unsigned int access)
+{
+ struct kvm_mmu_page *sp;
+ struct kvm_rmap_head *rmap_head;
+ int rmap_count;
+
+ sp = sptep_to_sp(spte);
+ kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access);
+ kvm_update_page_stats(kvm, sp->role.level, 1);
+
+ rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
+ rmap_count = pte_list_add(cache, spte, rmap_head);
+
+ if (rmap_count > kvm->stat.max_mmu_rmap_size)
+ kvm->stat.max_mmu_rmap_size = rmap_count;
+ if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
+ kvm_zap_all_rmap_sptes(kvm, rmap_head);
+ kvm_flush_remote_tlbs_with_address(
+ kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+ }
+}
+
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn, unsigned int access)
+{
+ struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;
+
+ __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
+}
+
+#ifdef MMU_DEBUG
+static int is_empty_shadow_page(u64 *spt)
+{
+ u64 *pos;
+ u64 *end;
+
+ for (pos = spt, end = pos + SPTE_ENT_PER_PAGE; pos != end; pos++)
+ if (is_shadow_present_pte(*pos)) {
+ printk(KERN_ERR "%s: %p %llx\n", __func__,
+ pos, *pos);
+ return 0;
+ }
+ return 1;
+}
+#endif
+
+/*
+ * This value is the sum of all of the KVM instances'
+ * kvm->arch.n_used_mmu_pages values. We need a global,
+ * aggregate version in order to make the slab shrinker
+ * faster.
+ */
+static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
+{
+ kvm->arch.n_used_mmu_pages += nr;
+ percpu_counter_add(&kvm_total_used_mmu_pages, nr);
+}
+
+static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm_mod_used_mmu_pages(kvm, +1);
+ kvm_account_pgtable_pages((void *)sp->spt, +1);
+}
+
+static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm_mod_used_mmu_pages(kvm, -1);
+ kvm_account_pgtable_pages((void *)sp->spt, -1);
+}
+
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
+{
+ MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
+ hlist_del(&sp->hash_link);
+ list_del(&sp->link);
+ free_page((unsigned long)sp->spt);
+ if (!sp->role.direct)
+ free_page((unsigned long)sp->shadowed_translation);
+ kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+static unsigned kvm_page_table_hashfn(gfn_t gfn)
+{
+ return hash_64(gfn, KVM_MMU_HASH_SHIFT);
+}
+
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
+ struct kvm_mmu_page *sp, u64 *parent_pte)
+{
+ if (!parent_pte)
+ return;
+
+ pte_list_add(cache, parent_pte, &sp->parent_ptes);
+}
+
+static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
+ u64 *parent_pte)
+{
+ pte_list_remove(parent_pte, &sp->parent_ptes);
+}
+
+void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte)
+{
+ mmu_page_remove_parent_pte(sp, parent_pte);
+ mmu_spte_clear_no_track(parent_pte);
+}
+
+static void mark_unsync(u64 *spte);
+static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
+ mark_unsync(sptep);
+ }
+}
+
+static void mark_unsync(u64 *spte)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(spte);
+ if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
+ return;
+ if (sp->unsync_children++)
+ return;
+ kvm_mmu_mark_parents_unsync(sp);
+}
+
+int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ return -1;
+}
+
+#define KVM_PAGE_ARRAY_NR 16
+
+struct kvm_mmu_pages {
+ struct mmu_page_and_offset {
+ struct kvm_mmu_page *sp;
+ unsigned int idx;
+ } page[KVM_PAGE_ARRAY_NR];
+ unsigned int nr;
+};
+
+static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
+ int idx)
+{
+ int i;
+
+ if (sp->unsync)
+ for (i=0; i < pvec->nr; i++)
+ if (pvec->page[i].sp == sp)
+ return 0;
+
+ pvec->page[pvec->nr].sp = sp;
+ pvec->page[pvec->nr].idx = idx;
+ pvec->nr++;
+ return (pvec->nr == KVM_PAGE_ARRAY_NR);
+}
+
+static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
+{
+ --sp->unsync_children;
+ WARN_ON((int)sp->unsync_children < 0);
+ __clear_bit(idx, sp->unsync_child_bitmap);
+}
+
+static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
+ struct kvm_mmu_pages *pvec)
+{
+ int i, ret, nr_unsync_leaf = 0;
+
+ for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+ struct kvm_mmu_page *child;
+ u64 ent = sp->spt[i];
+
+ if (!is_shadow_present_pte(ent) || is_large_pte(ent)) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ }
+
+ child = spte_to_child_sp(ent);
+
+ if (child->unsync_children) {
+ if (mmu_pages_add(pvec, child, i))
+ return -ENOSPC;
+
+ ret = __mmu_unsync_walk(child, pvec);
+ if (!ret) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ } else if (ret > 0) {
+ nr_unsync_leaf += ret;
+ } else
+ return ret;
+ } else if (child->unsync) {
+ nr_unsync_leaf++;
+ if (mmu_pages_add(pvec, child, i))
+ return -ENOSPC;
+ } else
+ clear_unsync_child_bit(sp, i);
+ }
+
+ return nr_unsync_leaf;
+}
+
+#define INVALID_INDEX (-1)
+
+static int mmu_unsync_walk(struct kvm_mmu_page *sp,
+ struct kvm_mmu_pages *pvec)
+{
+ pvec->nr = 0;
+ if (!sp->unsync_children)
+ return 0;
+
+ mmu_pages_add(pvec, sp, INVALID_INDEX);
+ return __mmu_unsync_walk(sp, pvec);
+}
+
+static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ WARN_ON(!sp->unsync);
+ trace_kvm_mmu_sync_page(sp);
+ sp->unsync = 0;
+ --kvm->stat.mmu_unsync;
+}
+
+static bool sp_has_gptes(struct kvm_mmu_page *sp)
+{
+ if (sp->role.direct)
+ return false;
+
+ if (sp->role.passthrough)
+ return false;
+
+ return true;
+}
+
+#define for_each_valid_sp(_kvm, _sp, _list) \
+ hlist_for_each_entry(_sp, _list, hash_link) \
+ if (is_obsolete_sp((_kvm), (_sp))) { \
+ } else
+
+#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
+ for_each_valid_sp(_kvm, _sp, \
+ &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
+ if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
+
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
+
+ if (ret < 0)
+ kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+ return ret;
+}
+
+struct mmu_page_path {
+ struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
+ unsigned int idx[PT64_ROOT_MAX_LEVEL];
+};
+
+#define for_each_sp(pvec, sp, parents, i) \
+ for (i = mmu_pages_first(&pvec, &parents); \
+ i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \
+ i = mmu_pages_next(&pvec, &parents, i))
+
+static int mmu_pages_next(struct kvm_mmu_pages *pvec,
+ struct mmu_page_path *parents,
+ int i)
+{
+ int n;
+
+ for (n = i+1; n < pvec->nr; n++) {
+ struct kvm_mmu_page *sp = pvec->page[n].sp;
+ unsigned idx = pvec->page[n].idx;
+ int level = sp->role.level;
+
+ parents->idx[level-1] = idx;
+ if (level == PG_LEVEL_4K)
+ break;
+
+ parents->parent[level-2] = sp;
+ }
+
+ return n;
+}
+
+static int mmu_pages_first(struct kvm_mmu_pages *pvec,
+ struct mmu_page_path *parents)
+{
+ struct kvm_mmu_page *sp;
+ int level;
+
+ if (pvec->nr == 0)
+ return 0;
+
+ WARN_ON(pvec->page[0].idx != INVALID_INDEX);
+
+ sp = pvec->page[0].sp;
+ level = sp->role.level;
+ WARN_ON(level == PG_LEVEL_4K);
+
+ parents->parent[level-2] = sp;
+
+ /* Also set up a sentinel. Further entries in pvec are all
+ * children of sp, so this element is never overwritten.
+ */
+ parents->parent[level-1] = NULL;
+ return mmu_pages_next(pvec, parents, 0);
+}
+
+static void mmu_pages_clear_parents(struct mmu_page_path *parents)
+{
+ struct kvm_mmu_page *sp;
+ unsigned int level = 0;
+
+ do {
+ unsigned int idx = parents->idx[level];
+ sp = parents->parent[level];
+ if (!sp)
+ return;
+
+ WARN_ON(idx == INVALID_INDEX);
+ clear_unsync_child_bit(sp, idx);
+ level++;
+ } while (!sp->unsync_children);
+}
+
+int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
+ bool can_yield)
+{
+ int i;
+ struct kvm_mmu_page *sp;
+ struct mmu_page_path parents;
+ struct kvm_mmu_pages pages;
+ LIST_HEAD(invalid_list);
+ bool flush = false;
+
+ while (mmu_unsync_walk(parent, &pages)) {
+ bool protected = false;
+
+ for_each_sp(pages, sp, parents, i)
+ protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
+
+ if (protected) {
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
+ flush = false;
+ }
+
+ for_each_sp(pages, sp, parents, i) {
+ kvm_unlink_unsync_page(vcpu->kvm, sp);
+ flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0;
+ mmu_pages_clear_parents(&parents);
+ }
+ if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ if (!can_yield) {
+ kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+ return -EINTR;
+ }
+
+ cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
+ flush = false;
+ }
+ }
+
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ return 0;
+}
+
+void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+{
+ atomic_set(&sp->write_flooding_count, 0);
+}
+
+void clear_sp_write_flooding_count(u64 *spte)
+{
+ __clear_sp_write_flooding_count(sptep_to_sp(spte));
+}
+
+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables. Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp;
+ int ret;
+ int collisions = 0;
+ LIST_HEAD(invalid_list);
+
+ for_each_valid_sp(kvm, sp, sp_list) {
+ if (sp->gfn != gfn) {
+ collisions++;
+ continue;
+ }
+
+ if (sp->role.word != role.word) {
+ /*
+ * If the guest is creating an upper-level page, zap
+ * unsync pages for the same gfn. While it's possible
+ * the guest is using recursive page tables, in all
+ * likelihood the guest has stopped using the unsync
+ * page and is installing a completely unrelated page.
+ * Unsync pages must not be left as is, because the new
+ * upper-level page will be write-protected.
+ */
+ if (role.level > PG_LEVEL_4K && sp->unsync)
+ kvm_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list);
+ continue;
+ }
+
+ /* unsync and write-flooding only apply to indirect SPs. */
+ if (sp->role.direct)
+ goto out;
+
+ if (sp->unsync) {
+ if (KVM_BUG_ON(!vcpu, kvm))
+ break;
+
+ /*
+ * The page is good, but is stale. kvm_sync_page does
+ * get the latest guest state, but (unlike mmu_sync_children)
+ * it doesn't write-protect the page or mark it synchronized!
+ * This way the validity of the mapping is ensured, but the
+ * overhead of write protection is not incurred until the
+ * guest invalidates the TLB mapping. This allows multiple
+ * SPs for a single gfn to be unsync.
+ *
+ * If the sync fails, the page is zapped. If so, break
+ * in order to rebuild it.
+ */
+ ret = kvm_sync_page(vcpu, sp, &invalid_list);
+ if (ret < 0)
+ break;
+
+ WARN_ON(!list_empty(&invalid_list));
+ if (ret > 0)
+ kvm_flush_remote_tlbs(kvm);
+ }
+
+ __clear_sp_write_flooding_count(sp);
+
+ goto out;
+ }
+
+ sp = NULL;
+ ++kvm->stat.mmu_cache_miss;
+
+out:
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+ kvm->stat.max_mmu_page_hash_collisions = collisions;
+ return sp;
+}
+
+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+ struct kvm_mmu_memory_cache *page_header_cache;
+ struct kvm_mmu_memory_cache *shadow_page_cache;
+ struct kvm_mmu_memory_cache *shadowed_info_cache;
+};
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
+ struct shadow_page_caches *caches,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
+ if (!role.direct)
+ sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
+
+ set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
+
+ /*
+ * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+ * depends on valid pages being added to the head of the list. See
+ * comments in kvm_zap_obsolete_pages().
+ */
+ sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+ list_add(&sp->link, &kvm->arch.active_mmu_pages);
+ kvm_account_mmu_page(kvm, sp);
+
+ sp->gfn = gfn;
+ sp->role = role;
+ hlist_add_head(&sp->hash_link, sp_list);
+ if (sp_has_gptes(sp))
+ account_shadowed(kvm, sp);
+
+ return sp;
+}
+
+/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ struct shadow_page_caches *caches,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct hlist_head *sp_list;
+ struct kvm_mmu_page *sp;
+ bool created = false;
+
+ sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+ sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
+ if (!sp) {
+ created = true;
+ sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
+ }
+
+ trace_kvm_mmu_get_page(sp, created);
+ return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct shadow_page_caches caches = {
+ .page_header_cache = &vcpu->arch.mmu_page_header_cache,
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+ .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
+ };
+
+ return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
+}
+
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
+ unsigned int access)
+{
+ struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+ union kvm_mmu_page_role role;
+
+ role = parent_sp->role;
+ role.level--;
+ role.access = access;
+ role.direct = direct;
+ role.passthrough = 0;
+
+ /*
+ * If the guest has 4-byte PTEs then that means it's using 32-bit,
+ * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
+ * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
+ * shadow each guest page table with multiple shadow page tables, which
+ * requires extra bookkeeping in the role.
+ *
+ * Specifically, to shadow the guest's page directory (which covers a
+ * 4GiB address space), KVM uses 4 PAE page directories, each mapping
+ * 1GiB of the address space. @role.quadrant encodes which quarter of
+ * the address space each maps.
+ *
+ * To shadow the guest's page tables (which each map a 4MiB region), KVM
+ * uses 2 PAE page tables, each mapping a 2MiB region. For these,
+ * @role.quadrant encodes which half of the region they map.
+ *
+ * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
+ * consumes bits 29:21. To consume bits 31:30, KVM uses 4 shadow
+ * PDPTEs; those 4 PAE page directories are pre-allocated and their
+ * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes
+ * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume
+ * bit 21 in the PTE (the child here), KVM propagates that bit to the
+ * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE
+ * covers bit 21 (see above), thus the quadrant is calculated from the
+ * _least_ significant bit of the PDE index.
+ */
+ if (role.has_4_byte_gpte) {
+ WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+ role.quadrant = spte_index(sptep) & 1;
+ }
+
+ return role;
+}
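
The bit ranges in the comment above can be checked with plain shifts and masks. The sketch below is a standalone illustration with hypothetical demo_* helpers operating on a 32-bit guest virtual address. It shows that the least significant bit of the 8-byte PDE index is address bit 21, which is also the top bit of the 4-byte guest PTE index.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the 4 PAE page directories: address bits 31:30. */
static unsigned int demo_pae_root_index(uint32_t gva)
{
	return (gva >> 30) & 3;
}

/* Index of the 8-byte PDE within a PAE page directory: bits 29:21. */
static unsigned int demo_pae_pde_index(uint32_t gva)
{
	return (gva >> 21) & 0x1ff;
}

/* Index of the 4-byte guest PTE within a guest page table: bits 21:12. */
static unsigned int demo_guest_pte_index(uint32_t gva)
{
	return (gva >> 12) & 0x3ff;
}

/* Quadrant of the child page table: LSB of the PDE index, i.e. bit 21. */
static unsigned int demo_child_quadrant(uint32_t gva)
{
	return demo_pae_pde_index(gva) & 1;
}
```

Because a guest page table has 1024 4-byte entries and a shadow page table has 512 8-byte entries, the quadrant selects which half of the guest table a given shadow page covers.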
+
+struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep,
+ gfn_t gfn, bool direct,
+ unsigned int access)
+{
+ union kvm_mmu_page_role role;
+
+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+ return ERR_PTR(-EEXIST);
+
+ role = kvm_mmu_child_role(sptep, direct, access);
+ return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+}
+
+void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, hpa_t root, u64 addr)
+{
+ iterator->addr = addr;
+ iterator->shadow_addr = root;
+ iterator->level = vcpu->arch.mmu->root_role.level;
+
+ if (iterator->level >= PT64_ROOT_4LEVEL &&
+ vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
+ !vcpu->arch.mmu->root_role.direct)
+ iterator->level = PT32E_ROOT_LEVEL;
+
+ if (iterator->level == PT32E_ROOT_LEVEL) {
+ /*
+ * prev_root is currently only used for 64-bit hosts. So only
+ * the active root_hpa is valid here.
+ */
+ BUG_ON(root != vcpu->arch.mmu->root.hpa);
+
+ iterator->shadow_addr
+ = vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
+ iterator->shadow_addr &= SPTE_BASE_ADDR_MASK;
+ --iterator->level;
+ if (!iterator->shadow_addr)
+ iterator->level = 0;
+ }
+}
+
+void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, u64 addr)
+{
+ shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
+ addr);
+}
+
+bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
+{
+ if (iterator->level < PG_LEVEL_4K)
+ return false;
+
+ iterator->index = SPTE_INDEX(iterator->addr, iterator->level);
+ iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
+ return true;
+}
+
+static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
+ u64 spte)
+{
+ if (!is_shadow_present_pte(spte) || is_last_spte(spte, iterator->level)) {
+ iterator->level = 0;
+ return;
+ }
+
+ iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK;
+ --iterator->level;
+}
+
+void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+{
+ __shadow_walk_next(iterator, *iterator->sptep);
+}
+
+static void __link_shadow_page(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache, u64 *sptep,
+ struct kvm_mmu_page *sp, bool flush)
+{
+ u64 spte;
+
+ BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
+ /*
+ * If an SPTE is present already, it must be a leaf and therefore
+ * a large one. Drop it, and flush the TLB if needed, before
+ * installing sp.
+ */
+ if (is_shadow_present_pte(*sptep))
+ drop_large_spte(kvm, sptep, flush);
+
+ spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+
+ mmu_spte_set(sptep, spte);
+
+ mmu_page_add_parent_pte(cache, sp, sptep);
+
+ /*
+ * The non-direct sub-pagetable must be updated before linking. For
+ * L1 sp, the pagetable is updated via kvm_sync_page() in
+ * kvm_mmu_find_shadow_page() without write-protecting the gfn,
+ * so sp->unsync can be true or false. For higher level non-direct
+ * sp, the pagetable is updated/synced via mmu_sync_children() in
+ * FNAME(fetch)(), so sp->unsync_children can only be false.
+ * WARN_ON_ONCE() if anything happens unexpectedly.
+ */
+ if (WARN_ON_ONCE(sp->unsync_children) || sp->unsync)
+ mark_unsync(sptep);
+}
+
+void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+{
+ __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true);
+}
+
+void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+ unsigned direct_access)
+{
+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+ struct kvm_mmu_page *child;
+
+ /*
+ * For the direct sp, if the guest pte's dirty bit
+ * changed from clean to dirty, it will corrupt the
+ * sp's access: allow writable in the read-only sp,
+ * so we should update the spte at this point to get
+ * a new sp with the correct access.
+ */
+ child = spte_to_child_sp(*sptep);
+ if (child->role.access == direct_access)
+ return;
+
+ drop_parent_pte(child, sptep);
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
+ }
+}
+
+/* Returns the number of zapped non-leaf child shadow pages. */
+int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
+ struct list_head *invalid_list)
+{
+ u64 pte;
+ struct kvm_mmu_page *child;
+
+ pte = *spte;
+ if (is_shadow_present_pte(pte)) {
+ if (is_last_spte(pte, sp->role.level)) {
+ drop_spte(kvm, spte);
+ } else {
+ child = spte_to_child_sp(pte);
+ drop_parent_pte(child, spte);
+
+ /*
+ * Recursively zap nested TDP SPs; parentless SPs are
+ * unlikely to be used again in the near future. This
+ * avoids retaining a large number of stale nested SPs.
+ */
+ if (tdp_enabled && invalid_list &&
+ child->role.guest_mode && !child->parent_ptes.val)
+ return kvm_mmu_prepare_zap_page(kvm, child,
+ invalid_list);
+ }
+ } else if (is_mmio_spte(pte)) {
+ mmu_spte_clear_no_track(spte);
+ }
+ return 0;
+}
+
+static int kvm_mmu_page_unlink_children(struct kvm *kvm,
+ struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int zapped = 0;
+ unsigned i;
+
+ for (i = 0; i < SPTE_ENT_PER_PAGE; ++i)
+ zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list);
+
+ return zapped;
+}
+
+static void kvm_mmu_unlink_parents(struct kvm_mmu_page *sp)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ while ((sptep = rmap_get_first(&sp->parent_ptes, &iter)))
+ drop_parent_pte(sp, sptep);
+}
+
+static int mmu_zap_unsync_children(struct kvm *kvm,
+ struct kvm_mmu_page *parent,
+ struct list_head *invalid_list)
+{
+ int i, zapped = 0;
+ struct mmu_page_path parents;
+ struct kvm_mmu_pages pages;
+
+ if (parent->role.level == PG_LEVEL_4K)
+ return 0;
+
+ while (mmu_unsync_walk(parent, &pages)) {
+ struct kvm_mmu_page *sp;
+
+ for_each_sp(pages, sp, parents, i) {
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ mmu_pages_clear_parents(&parents);
+ zapped++;
+ }
+ }
+
+ return zapped;
+}
+
+bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list,
+ int *nr_zapped)
+{
+ bool list_unstable, zapped_root = false;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ trace_kvm_mmu_prepare_zap_page(sp);
+ ++kvm->stat.mmu_shadow_zapped;
+ *nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
+ *nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
+ kvm_mmu_unlink_parents(sp);
+
+ /* Zapping children means active_mmu_pages has become unstable. */
+ list_unstable = *nr_zapped;
+
+ if (!sp->role.invalid && sp_has_gptes(sp))
+ unaccount_shadowed(kvm, sp);
+
+ if (sp->unsync)
+ kvm_unlink_unsync_page(kvm, sp);
+ if (!sp->root_count) {
+ /* Count self */
+ (*nr_zapped)++;
+
+ /*
+ * Already invalid pages (previously active roots) are not on
+ * the active page list. See list_del() in the "else" case of
+ * !sp->root_count.
+ */
+ if (sp->role.invalid)
+ list_add(&sp->link, invalid_list);
+ else
+ list_move(&sp->link, invalid_list);
+ kvm_unaccount_mmu_page(kvm, sp);
+ } else {
+ /*
+ * Remove the active root from the active page list, the root
+ * will be explicitly freed when the root_count hits zero.
+ */
+ list_del(&sp->link);
+
+ /*
+ * Obsolete pages cannot be used on any vCPUs, see the comment
+ * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
+ * treats invalid shadow pages as being obsolete.
+ */
+ zapped_root = !is_obsolete_sp(kvm, sp);
+ }
+
+ if (sp->nx_huge_page_disallowed)
+ unaccount_nx_huge_page(kvm, sp);
+
+ sp->role.invalid = 1;
+
+ /*
+ * Make the request to free obsolete roots after marking the root
+ * invalid, otherwise other vCPUs may not see it as invalid.
+ */
+ if (zapped_root)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
+ return list_unstable;
+}
+
+bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int nr_zapped;
+
+ __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
+ return nr_zapped;
+}
+
+void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list)
+{
+ struct kvm_mmu_page *sp, *nsp;
+
+ if (list_empty(invalid_list))
+ return;
+
+ /*
+ * We need to make sure everyone sees our modifications to
+ * the page tables and sees changes to vcpu->mode here. The barrier
+ * in the kvm_flush_remote_tlbs() achieves this. This pairs
+ * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.
+ *
+ * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit
+ * guest mode and/or lockless shadow page table walks.
+ */
+ kvm_flush_remote_tlbs(kvm);
+
+ list_for_each_entry_safe(sp, nsp, invalid_list, link) {
+ WARN_ON(!sp->role.invalid || sp->root_count);
+ kvm_mmu_free_shadow_page(sp);
+ }
+}
+
+static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
+ unsigned long nr_to_zap)
+{
+ unsigned long total_zapped = 0;
+ struct kvm_mmu_page *sp, *tmp;
+ LIST_HEAD(invalid_list);
+ bool unstable;
+ int nr_zapped;
+
+ if (list_empty(&kvm->arch.active_mmu_pages))
+ return 0;
+
+restart:
+ list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) {
+ /*
+ * Don't zap active root pages, the page itself can't be freed
+ * and zapping it will just force vCPUs to realloc and reload.
+ */
+ if (sp->root_count)
+ continue;
+
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
+ &nr_zapped);
+ total_zapped += nr_zapped;
+ if (total_zapped >= nr_to_zap)
+ break;
+
+ if (unstable)
+ goto restart;
+ }
+
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ kvm->stat.mmu_recycled += total_zapped;
+ return total_zapped;
+}
+
+static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
+{
+ if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
+ return kvm->arch.n_max_mmu_pages -
+ kvm->arch.n_used_mmu_pages;
+
+ return 0;
+}
+
+int make_mmu_pages_available(struct kvm_vcpu *vcpu)
+{
+ unsigned long avail = kvm_mmu_available_pages(vcpu->kvm);
+
+ if (likely(avail >= KVM_MIN_FREE_MMU_PAGES))
+ return 0;
+
+ kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail);
+
+ /*
+ * Note, this check is intentionally soft, it only guarantees that one
+ * page is available, while the caller may end up allocating as many as
+ * four pages, e.g. for PAE roots or for 5-level paging. Temporarily
+ * exceeding the (arbitrary by default) limit will not harm the host,
+ * being too aggressive may unnecessarily kill the guest, and getting an
+ * exact count is far more trouble than it's worth, especially in the
+ * page fault paths.
+ */
+ if (!kvm_mmu_available_pages(vcpu->kvm))
+ return -ENOSPC;
+ return 0;
+}
+
+/*
+ * Change the number of MMU pages allocated to the VM.
+ * Note: if goal_nr_mmu_pages is too small, you will get a deadlock.
+ */
+void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
+{
+ write_lock(&kvm->mmu_lock);
+
+ if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
+ kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
+ goal_nr_mmu_pages);
+
+ goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
+ }
+
+ kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
+
+ write_unlock(&kvm->mmu_lock);
+}
+
+int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_mmu_page *sp;
+ LIST_HEAD(invalid_list);
+ int r;
+
+ pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
+ r = 0;
+ write_lock(&kvm->mmu_lock);
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
+ pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
+ sp->role.word);
+ r = 1;
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ }
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
+
+ return r;
+}
+
+int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
+{
+ gpa_t gpa;
+ int r;
+
+ if (vcpu->arch.mmu->root_role.direct)
+ return 0;
+
+ gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+
+ r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+
+ return r;
+}
+
+static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ trace_kvm_mmu_unsync_page(sp);
+ ++kvm->stat.mmu_unsync;
+ sp->unsync = 1;
+
+ kvm_mmu_mark_parents_unsync(sp);
+}
+
+/*
+ * Attempt to unsync any shadow pages that can be reached by the specified gfn,
+ * as KVM is creating a writable mapping for said gfn. Returns 0 if all pages
+ * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
+ * be write-protected.
+ */
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
+ gfn_t gfn, bool can_unsync, bool prefetch)
+{
+ struct kvm_mmu_page *sp;
+ bool locked = false;
+
+ /*
+ * Force write-protection if the page is being tracked. Note, the page
+ * track machinery is used to write-protect upper-level shadow pages,
+ * i.e. this guards the role.level == 4K assertion below!
+ */
+ if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ return -EPERM;
+
+ /*
+ * The page is not write-tracked, mark existing shadow pages unsync
+ * unless KVM is synchronizing an unsync SP (can_unsync = false). In
+ * that case, KVM must complete emulation of the guest TLB flush before
+ * allowing shadow pages to become unsync (writable by the guest).
+ */
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
+ if (!can_unsync)
+ return -EPERM;
+
+ if (sp->unsync)
+ continue;
+
+ if (prefetch)
+ return -EEXIST;
+
+ /*
+ * TDP MMU page faults require an additional spinlock as they
+ * run with mmu_lock held for read, not write, and the unsync
+ * logic is not thread safe. Take the spinlock regardless of
+ * the MMU type to avoid extra conditionals/parameters, there's
+ * no meaningful penalty if mmu_lock is held for write.
+ */
+ if (!locked) {
+ locked = true;
+ spin_lock(&kvm->arch.mmu_unsync_pages_lock);
+
+ /*
+ * Recheck after taking the spinlock, a different vCPU
+ * may have since marked the page unsync. A false
+ * positive on the unprotected check above is not
+ * possible as clearing sp->unsync _must_ hold mmu_lock
+ * for write, i.e. unsync cannot transition from 1->0
+ * while this CPU holds mmu_lock for read (or write).
+ */
+ if (READ_ONCE(sp->unsync))
+ continue;
+ }
+
+ WARN_ON(sp->role.level != PG_LEVEL_4K);
+ kvm_unsync_page(kvm, sp);
+ }
+ if (locked)
+ spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
+
+ /*
+ * We need to ensure that the marking of unsync pages is visible
+ * before the SPTE is updated to allow writes because
+ * kvm_mmu_sync_roots() checks the unsync flags without holding
+ * the MMU lock and so can race with this. If the SPTE was updated
+ * before the page had been marked as unsync-ed, something like the
+ * following could happen:
+ *
+ * CPU 1 CPU 2
+ * ---------------------------------------------------------------------
+ * 1.2 Host updates SPTE
+ * to be writable
+ * 2.1 Guest writes a GPTE for GVA X.
+ * (GPTE being in the guest page table shadowed
+ * by the SP from CPU 1.)
+ * This reads SPTE during the page table walk.
+ * Since SPTE.W is read as 1, there is no
+ * fault.
+ *
+ * 2.2 Guest issues TLB flush.
+ * That causes a VM Exit.
+ *
+ * 2.3 Walking of unsync pages sees sp->unsync is
+ * false and skips the page.
+ *
+ * 2.4 Guest accesses GVA X.
+ * Since the mapping in the SP was not updated,
+ * the old mapping for GVA X incorrectly
+ * gets used.
+ * 1.1 Host marks SP
+ * as unsync
+ * (sp->unsync = true)
+ *
+ * The write barrier below ensures that 1.1 happens before 1.2 and thus
+ * the situation in 2.4 does not arise. It pairs with the read barrier
+ * in is_unsync_root(), placed between 2.1's load of SPTE.W and 2.3.
+ */
+ smp_wmb();
+
+ return 0;
+}
+
+int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+ u64 *sptep, unsigned int pte_access, gfn_t gfn,
+ kvm_pfn_t pfn, struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ int level = sp->role.level;
+ int was_rmapped = 0;
+ int ret = RET_PF_FIXED;
+ bool flush = false;
+ bool wrprot;
+ u64 spte;
+
+ /* Prefetching always gets a writable pfn. */
+ bool host_writable = !fault || fault->map_writable;
+ bool prefetch = !fault || fault->prefetch;
+ bool write_fault = fault && fault->write;
+
+ pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
+ *sptep, write_fault, gfn);
+
+ if (unlikely(is_noslot_pfn(pfn))) {
+ vcpu->stat.pf_mmio_spte_created++;
+ mark_mmio_spte(vcpu, sptep, gfn, pte_access);
+ return RET_PF_EMULATE;
+ }
+
+ if (is_shadow_present_pte(*sptep)) {
+ /*
+ * If we overwrite a PTE page pointer with a 2MB PMD, unlink
+ * the parent of the now unreachable PTE.
+ */
+ if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
+ struct kvm_mmu_page *child;
+ u64 pte = *sptep;
+
+ child = spte_to_child_sp(pte);
+ drop_parent_pte(child, sptep);
+ flush = true;
+ } else if (pfn != spte_to_pfn(*sptep)) {
+ pgprintk("hfn old %llx new %llx\n",
+ spte_to_pfn(*sptep), pfn);
+ drop_spte(vcpu->kvm, sptep);
+ flush = true;
+ } else
+ was_rmapped = 1;
+ }
+
+ wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ true, host_writable, &spte);
+
+ if (*sptep == spte) {
+ ret = RET_PF_SPURIOUS;
+ } else {
+ flush |= mmu_spte_update(sptep, spte);
+ trace_kvm_mmu_set_spte(level, gfn, sptep);
+ }
+
+ if (wrprot) {
+ if (write_fault)
+ ret = RET_PF_EMULATE;
+ }
+
+ if (flush)
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn,
+ KVM_PAGES_PER_HPAGE(level));
+
+ pgprintk("%s: setting spte %llx\n", __func__, *sptep);
+
+ if (!was_rmapped) {
+ WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
+ rmap_add(vcpu, slot, sptep, gfn, pte_access);
+ } else {
+ /* Already rmapped but the pte_access bits may have changed. */
+ kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access);
+ }
+
+ return ret;
+}
+
+static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp,
+ u64 *start, u64 *end)
+{
+ struct page *pages[PTE_PREFETCH_NUM];
+ struct kvm_memory_slot *slot;
+ unsigned int access = sp->role.access;
+ int i, ret;
+ gfn_t gfn;
+
+ gfn = kvm_mmu_page_get_gfn(sp, spte_index(start));
+ slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
+ if (!slot)
+ return -1;
+
+ ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
+ if (ret <= 0)
+ return -1;
+
+ for (i = 0; i < ret; i++, gfn++, start++) {
+ mmu_set_spte(vcpu, slot, start, access, gfn,
+ page_to_pfn(pages[i]), NULL);
+ put_page(pages[i]);
+ }
+
+ return 0;
+}
+
+void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep)
+{
+ u64 *spte, *start = NULL;
+ int i;
+
+ WARN_ON(!sp->role.direct);
+
+ i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1);
+ spte = sp->spt + i;
+
+ for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
+ if (is_shadow_present_pte(*spte) || spte == sptep) {
+ if (!start)
+ continue;
+ if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
+ return;
+ start = NULL;
+ } else if (!start)
+ start = spte;
+ }
+ if (start)
+ direct_pte_prefetch_many(vcpu, sp, start, spte);
+}
+
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(sptep);
+
+ /*
+ * Without accessed bits, there's no way to distinguish between
+ * actually accessed translations and prefetched, so disable pte
+ * prefetch if accessed bits aren't available.
+ */
+ if (sp_ad_disabled(sp))
+ return;
+
+ if (sp->role.level > PG_LEVEL_4K)
+ return;
+
+ /*
+ * If addresses are being invalidated, skip prefetching to avoid
+ * accidentally prefetching those addresses.
+ */
+ if (unlikely(vcpu->kvm->mmu_invalidate_in_progress))
+ return;
+
+ __direct_pte_prefetch(vcpu, sp, sptep);
+}
+
+int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_shadow_walk_iterator it;
+ struct kvm_mmu_page *sp;
+ int ret;
+ gfn_t base_gfn = fault->gfn;
+
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+ for_each_shadow_entry(vcpu, fault->addr, it) {
+ /*
+ * We cannot overwrite existing page tables with an NX
+ * large page, as the leaf could be executable.
+ */
+ if (fault->nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(fault, *it.sptep, it.level);
+
+ base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+ if (it.level == fault->goal_level)
+ break;
+
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+ if (sp == ERR_PTR(-EEXIST))
+ continue;
+
+ link_shadow_page(vcpu, it.sptep, sp);
+ if (fault->huge_page_disallowed)
+ account_nx_huge_page(vcpu->kvm, sp,
+ fault->req_level >= it.level);
+ }
+
+ if (WARN_ON_ONCE(it.level != fault->goal_level))
+ return -EFAULT;
+
+ ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
+ base_gfn, fault->pfn, fault);
+ if (ret == RET_PF_SPURIOUS)
+ return ret;
+
+ direct_pte_prefetch(vcpu, it.sptep);
+ return ret;
+}
+
+/*
+ * Returns the last level spte pointer of the shadow page walk for the given
+ * gpa, and sets *spte to the spte value. This spte may be non-present. If no
+ * walk could be performed, returns NULL and *spte does not contain valid data.
+ *
+ * Contract:
+ * - Must be called between walk_shadow_page_lockless_{begin,end}.
+ * - The returned sptep must not be used after walk_shadow_page_lockless_end.
+ */
+u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ u64 old_spte;
+ u64 *sptep = NULL;
+
+ for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
+ sptep = iterator.sptep;
+ *spte = old_spte;
+ }
+
+ return sptep;
+}
+
+void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ hpa_t root_hpa;
+ int i;
+
+ /*
+ * This should not be called while L2 is active, L2 can't invalidate
+ * _only_ its own roots, e.g. INVVPID unconditionally exits.
+ */
+ WARN_ON_ONCE(mmu->root_role.guest_mode);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ root_hpa = mmu->prev_roots[i].hpa;
+ if (!VALID_PAGE(root_hpa))
+ continue;
+
+ if (!to_shadow_page(root_hpa) ||
+ to_shadow_page(root_hpa)->role.guest_mode)
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ kvm_mmu_free_roots(kvm, mmu, roots_to_free);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
+
+
+static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
+{
+ int ret = 0;
+
+ if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ ret = 1;
+ }
+
+ return ret;
+}
+
+hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level)
+{
+ union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
+ struct kvm_mmu_page *sp;
+
+ role.level = level;
+ role.quadrant = quadrant;
+
+ WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
+ WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
+
+ sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ ++sp->root_count;
+
+ return __pa(sp->spt);
+}
+
+static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int r = 0, i, bkt;
+
+ /*
+ * Check if this is the first shadow root being allocated before
+ * taking the lock.
+ */
+ if (kvm_shadow_root_allocated(kvm))
+ return 0;
+
+ mutex_lock(&kvm->slots_arch_lock);
+
+ /* Recheck, under the lock, whether this is the first shadow root. */
+ if (kvm_shadow_root_allocated(kvm))
+ goto out_unlock;
+
+ /*
+ * Check if anything actually needs to be allocated, e.g. all metadata
+ * will be allocated upfront if TDP is disabled.
+ */
+ if (kvm_memslots_have_rmaps(kvm) &&
+ kvm_page_track_write_tracking_enabled(kvm))
+ goto out_success;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, bkt, slots) {
+ /*
+ * Both of these functions are no-ops if the target is
+ * already allocated, so unconditionally calling both
+ * is safe. Intentionally do NOT free allocations on
+ * failure to avoid having to track which allocations
+ * were made now versus when the memslot was created.
+ * The metadata is guaranteed to be freed when the slot
+ * is freed, and will be kept/used if userspace retries
+ * KVM_RUN instead of killing the VM.
+ */
+ r = memslot_rmap_alloc(slot, slot->npages);
+ if (r)
+ goto out_unlock;
+ r = kvm_page_track_write_tracking_alloc(slot);
+ if (r)
+ goto out_unlock;
+ }
+ }
+
+ /*
+ * Ensure that shadow_root_allocated becomes true strictly after
+ * all the related pointers are set.
+ */
+out_success:
+ smp_store_release(&kvm->arch.shadow_root_allocated, true);
+
+out_unlock:
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+}
+
+int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u64 pdptrs[4], pm_mask;
+ gfn_t root_gfn, root_pgd;
+ int quadrant, i, r;
+ hpa_t root;
+
+ root_pgd = mmu->get_guest_pgd(vcpu);
+ root_gfn = root_pgd >> PAGE_SHIFT;
+
+ if (mmu_check_root(vcpu, root_gfn))
+ return 1;
+
+ /*
+ * On SVM, reading PDPTRs might access guest memory, which might fault
+ * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock.
+ */
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
+ for (i = 0; i < 4; ++i) {
+ pdptrs[i] = mmu->get_pdptr(vcpu, i);
+ if (!(pdptrs[i] & PT_PRESENT_MASK))
+ continue;
+
+ if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
+ return 1;
+ }
+ }
+
+ r = mmu_first_shadow_root_alloc(vcpu->kvm);
+ if (r)
+ return r;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ r = make_mmu_pages_available(vcpu);
+ if (r < 0)
+ goto out_unlock;
+
+ /*
+ * Do we shadow a long mode page table? If so we need to
+ * write-protect the guest's page table root.
+ */
+ if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
+ root = mmu_alloc_root(vcpu, root_gfn, 0,
+ mmu->root_role.level);
+ mmu->root.hpa = root;
+ goto set_root_pgd;
+ }
+
+ if (WARN_ON_ONCE(!mmu->pae_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ /*
+ * We shadow a 32 bit page table. This may be a legacy 2-level
+ * or a PAE 3-level page table. In either case we need to be aware that
+ * the shadow page table may be a PAE or a long mode page table.
+ */
+ pm_mask = PT_PRESENT_MASK | shadow_me_value;
+ if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
+ pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
+
+ if (WARN_ON_ONCE(!mmu->pml4_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+ mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
+
+ if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
+ if (WARN_ON_ONCE(!mmu->pml5_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+ mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
+ }
+ }
+
+ for (i = 0; i < 4; ++i) {
+ WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
+
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
+ if (!(pdptrs[i] & PT_PRESENT_MASK)) {
+ mmu->pae_root[i] = INVALID_PAE_ROOT;
+ continue;
+ }
+ root_gfn = pdptrs[i] >> PAGE_SHIFT;
+ }
+
+ /*
+ * If shadowing 32-bit non-PAE page tables, each PAE page
+ * directory maps one quarter of the guest's non-PAE page
+ * directory. Otherwise each PAE page directory shadows one guest
+ * PAE page directory, so the quadrant should be 0.
+ */
+ quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
+
+ root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
+ mmu->pae_root[i] = root | pm_mask;
+ }
+
+ if (mmu->root_role.level == PT64_ROOT_5LEVEL)
+ mmu->root.hpa = __pa(mmu->pml5_root);
+ else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
+ mmu->root.hpa = __pa(mmu->pml4_root);
+ else
+ mmu->root.hpa = __pa(mmu->pae_root);
+
+set_root_pgd:
+ mmu->root.pgd = root_pgd;
+out_unlock:
+ write_unlock(&vcpu->kvm->mmu_lock);
+
+ return r;
+}
+
+int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
+ u64 *pml5_root = NULL;
+ u64 *pml4_root = NULL;
+ u64 *pae_root;
+
+ /*
+ * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
+ * tables are allocated and initialized at root creation as there is no
+ * equivalent level in the guest's NPT to shadow. Allocate the tables
+ * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
+ */
+ if (mmu->root_role.direct ||
+ mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
+ mmu->root_role.level < PT64_ROOT_4LEVEL)
+ return 0;
+
+ /*
+ * NPT, the only paging mode that uses this horror, uses a fixed number
+ * of levels for the shadow page tables, e.g. all MMUs are 4-level or
+ * all MMUs are 5-level. Thus, this can safely require that pml5_root
+ * is allocated if the other roots are valid and pml5 is needed, as any
+ * prior MMU would also have required pml5.
+ */
+ if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
+ return 0;
+
+ /*
+ * The special roots should always be allocated in concert. Yell and
+ * bail if KVM ends up in a state where only one of the roots is valid.
+ */
+ if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
+ (need_pml5 && mmu->pml5_root)))
+ return -EIO;
+
+ /*
+ * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
+ * doesn't need to be decrypted.
+ */
+ pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pae_root)
+ return -ENOMEM;
+
+#ifdef CONFIG_X86_64
+ pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pml4_root)
+ goto err_pml4;
+
+ if (need_pml5) {
+ pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pml5_root)
+ goto err_pml5;
+ }
+#endif
+
+ mmu->pae_root = pae_root;
+ mmu->pml4_root = pml4_root;
+ mmu->pml5_root = pml5_root;
+
+ return 0;
+
+#ifdef CONFIG_X86_64
+err_pml5:
+ free_page((unsigned long)pml4_root);
+err_pml4:
+ free_page((unsigned long)pae_root);
+ return -ENOMEM;
+#endif
+}
+
+static bool is_unsync_root(hpa_t root)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root))
+ return false;
+
+ /*
+ * The read barrier orders the CPU's read of SPTE.W during the page table
+ * walk before the reads of sp->unsync/sp->unsync_children here.
+ *
+ * Even if another CPU was marking the SP as unsync-ed simultaneously,
+ * any guest page table changes are not guaranteed to be visible anyway
+ * until this VCPU issues a TLB flush strictly after those changes are
+ * made. We only need to ensure that the other CPU sets these flags
+ * before any actual changes to the page tables are made. The comments
+ * in mmu_try_to_unsync_pages() describe what could go wrong if this
+ * requirement isn't satisfied.
+ */
+ smp_rmb();
+ sp = to_shadow_page(root);
+
+ /*
+ * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
+ * PDPTEs for a given PAE root need to be synchronized individually.
+ */
+ if (WARN_ON_ONCE(!sp))
+ return false;
+
+ if (sp->unsync || sp->unsync_children)
+ return true;
+
+ return false;
+}
+
+void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
+{
+ int i;
+ struct kvm_mmu_page *sp;
+
+ if (vcpu->arch.mmu->root_role.direct)
+ return;
+
+ if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
+ return;
+
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+
+ if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
+ hpa_t root = vcpu->arch.mmu->root.hpa;
+ sp = to_shadow_page(root);
+
+ if (!is_unsync_root(root))
+ return;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ mmu_sync_children(vcpu, sp, true);
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return;
+ }
+
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ for (i = 0; i < 4; ++i) {
+ hpa_t root = vcpu->arch.mmu->pae_root[i];
+
+ if (IS_VALID_PAE_ROOT(root)) {
+ sp = spte_to_child_sp(root);
+ mmu_sync_children(vcpu, sp, true);
+ }
+ }
+
+ write_unlock(&vcpu->kvm->mmu_lock);
+}
+
+void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+
+ /* sync prev_roots by simply freeing them */
+ kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free);
+}
+
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ *
+ * Must be called between walk_shadow_page_lockless_{begin,end}.
+ */
+int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ int leaf = -1;
+ u64 spte;
+
+ for (shadow_walk_init(&iterator, vcpu, addr),
+ *root_level = iterator.level;
+ shadow_walk_okay(&iterator);
+ __shadow_walk_next(&iterator, spte)) {
+ leaf = iterator.level;
+ spte = mmu_spte_get_lockless(iterator.sptep);
+
+ sptes[leaf] = spte;
+ }
+
+ return leaf;
+}
+
+void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ u64 spte;
+
+ walk_shadow_page_lockless_begin(vcpu);
+ for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
+ clear_sp_write_flooding_count(iterator.sptep);
+ walk_shadow_page_lockless_end(vcpu);
+}
+
+static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root_hpa))
+ return false;
+
+ /*
+ * When freeing obsolete roots, treat roots as obsolete if they don't
+ * have an associated shadow page. This does mean KVM will get false
+ * positives and free roots that don't strictly need to be freed, but
+ * such false positives are relatively rare:
+ *
+ * (a) only PAE paging and nested NPT have roots without shadow pages
+ * (b) remote reloads due to a memslot update obsoletes _all_ roots
+ * (c) KVM doesn't track previous roots for PAE paging, and the guest
+ * is unlikely to zap an in-use PGD.
+ */
+ sp = to_shadow_page(root_hpa);
+ return !sp || is_obsolete_sp(kvm, sp);
+}
+
+static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ if (is_obsolete_root(kvm, mmu->root.hpa))
+ roots_to_free |= KVM_MMU_ROOT_CURRENT;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ if (roots_to_free)
+ kvm_mmu_free_roots(kvm, mmu, roots_to_free);
+}
+
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
+{
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
+}
+
+static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
+ int *bytes)
+{
+ u64 gentry = 0;
+ int r;
+
+ /*
+ * Assume that the pte write is on a page table of the same type
+ * as the current vcpu paging mode, since we update the sptes only
+ * when they have the same mode.
+ */
+ if (is_pae(vcpu) && *bytes == 4) {
+ /* Handle a 32-bit guest writing two halves of a 64-bit gpte */
+ *gpa &= ~(gpa_t)7;
+ *bytes = 8;
+ }
+
+ if (*bytes == 4 || *bytes == 8) {
+ r = kvm_vcpu_read_guest_atomic(vcpu, *gpa, &gentry, *bytes);
+ if (r)
+ gentry = 0;
+ }
+
+ return gentry;
+}
+
+/*
+ * If we're seeing too many writes to a page, it may no longer be a page table,
+ * or we may be forking, in which case it is better to unmap the page.
+ */
+static bool detect_write_flooding(struct kvm_mmu_page *sp)
+{
+ /*
+ * Skip write-flooding detection for the sp whose level is 1, because
+ * it can become unsync, in which case the guest page is not
+ * write-protected.
+ */
+ if (sp->role.level == PG_LEVEL_4K)
+ return false;
+
+ atomic_inc(&sp->write_flooding_count);
+ return atomic_read(&sp->write_flooding_count) >= 3;
+}
+
+/*
+ * Misaligned accesses are too much trouble to fix up; also, they usually
+ * indicate a page is not used as a page table.
+ */
+static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
+ int bytes)
+{
+ unsigned offset, pte_size, misaligned;
+
+ pgprintk("misaligned: gpa %llx bytes %d role %x\n",
+ gpa, bytes, sp->role.word);
+
+ offset = offset_in_page(gpa);
+ pte_size = sp->role.has_4_byte_gpte ? 4 : 8;
+
+ /*
+ * Sometimes the OS writes only a single byte to update status bits;
+ * for example, Linux uses the andb instruction in clear_bit().
+ */
+ if (!(offset & (pte_size - 1)) && bytes == 1)
+ return false;
+
+ misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
+ misaligned |= bytes < 4;
+
+ return misaligned;
+}
+
+static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+{
+ unsigned page_offset, quadrant;
+ u64 *spte;
+ int level;
+
+ page_offset = offset_in_page(gpa);
+ level = sp->role.level;
+ *nspte = 1;
+ if (sp->role.has_4_byte_gpte) {
+ page_offset <<= 1; /* 32->64 */
+ /*
+ * A 32-bit pde maps 4MB while the shadow pdes map
+ * only 2MB. So we need to double the offset again
+ * and zap two pdes instead of one.
+ */
+ if (level == PT32_ROOT_LEVEL) {
+ page_offset &= ~7; /* kill rounding error */
+ page_offset <<= 1;
+ *nspte = 2;
+ }
+ quadrant = page_offset >> PAGE_SHIFT;
+ page_offset &= ~PAGE_MASK;
+ if (quadrant != sp->role.quadrant)
+ return NULL;
+ }
+
+ spte = &sp->spt[page_offset / sizeof(*spte)];
+ return spte;
+}
+
+void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes, struct kvm_page_track_notifier_node *node)
+{
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct kvm_mmu_page *sp;
+ LIST_HEAD(invalid_list);
+ u64 entry, gentry, *spte;
+ int npte;
+ bool flush = false;
+
+ /*
+ * If we don't have indirect shadow pages, it means no page is
+ * write-protected, so we can simply exit.
+ */
+ if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+ return;
+
+ pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
+
+ ++vcpu->kvm->stat.mmu_pte_write;
+
+ for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) {
+ if (detect_write_misaligned(sp, gpa, bytes) ||
+ detect_write_flooding(sp)) {
+ kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
+ ++vcpu->kvm->stat.mmu_flooded;
+ continue;
+ }
+
+ spte = get_written_sptes(sp, gpa, &npte);
+ if (!spte)
+ continue;
+
+ while (npte--) {
+ entry = *spte;
+ mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
+ if (gentry && sp->role.level != PG_LEVEL_4K)
+ ++vcpu->kvm->stat.mmu_pde_zapped;
+ if (is_shadow_present_pte(entry))
+ flush = true;
+ ++spte;
+ }
+ }
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ write_unlock(&vcpu->kvm->mmu_lock);
+}
+
+static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn,
+ bool flush_on_yield, bool flush)
+{
+ struct slot_rmap_walk_iterator iterator;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
+ end_gfn, &iterator) {
+ if (iterator.rmap)
+ flush |= fn(kvm, iterator.rmap, slot);
+
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+ if (flush && flush_on_yield) {
+ kvm_flush_remote_tlbs_with_address(kvm,
+ start_gfn,
+ iterator.gfn - start_gfn + 1);
+ flush = false;
+ }
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+ }
+ }
+
+ return flush;
+}
+
+__always_inline bool walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn, int start_level,
+ int end_level, bool flush_on_yield)
+{
+ return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
+ slot->base_gfn, slot->base_gfn + slot->npages - 1,
+ flush_on_yield, false);
+}
+
+__always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ bool flush_on_yield)
+{
+ return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K,
+ PG_LEVEL_4K, flush_on_yield);
+}
+
+#define BATCH_ZAP_PAGES 10
+void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ int nr_zapped, batch = 0;
+ bool unstable;
+
+restart:
+ list_for_each_entry_safe_reverse(sp, node,
+ &kvm->arch.active_mmu_pages, link) {
+ /*
+ * No obsolete valid page exists before a newly created page
+ * since active_mmu_pages is a FIFO list.
+ */
+ if (!is_obsolete_sp(kvm, sp))
+ break;
+
+ /*
+ * Invalid pages should never land back on the list of active
+ * pages. Skip the bogus page, otherwise we'll get stuck in an
+ * infinite loop if the page gets put back on the list (again).
+ */
+ if (WARN_ON(sp->role.invalid))
+ continue;
+
+ /*
+ * No need to flush the TLB since we're only zapping shadow
+ * pages with an obsolete generation number and all vCPUs have
+ * loaded a new root, i.e. the shadow pages being zapped cannot
+ * be in active use by the guest.
+ */
+ if (batch >= BATCH_ZAP_PAGES &&
+ cond_resched_rwlock_write(&kvm->mmu_lock)) {
+ batch = 0;
+ goto restart;
+ }
+
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
+ &kvm->arch.zapped_obsolete_pages, &nr_zapped);
+ batch += nr_zapped;
+
+ if (unstable)
+ goto restart;
+ }
+
+ /*
+ * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+ * to ensure KVM is not in the middle of a lockless shadow page table
+ * walk, which may reference the pages. The remote TLB flush itself is
+ * not required and is simply a convenient way to kick vCPUs as needed.
+ * KVM performs a local TLB flush when allocating a new root (see
+ * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
+ * running with an obsolete MMU.
+ */
+ kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+}
+
+static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
+{
+ return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
+}
+
+bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+ const struct kvm_memory_slot *memslot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ bool flush = false;
+ gfn_t start, end;
+ int i;
+
+ if (!kvm_memslots_have_rmaps(kvm))
+ return flush;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
+ memslot = iter.slot;
+ start = max(gfn_start, memslot->base_gfn);
+ end = min(gfn_end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
+
+ flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+ start, end - 1, true, flush);
+ }
+ }
+
+ return flush;
+}
+
+bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ return rmap_write_protect(rmap_head, false);
+}
+
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ struct shadow_page_caches caches = {};
+ union kvm_mmu_page_role role;
+ unsigned int access;
+ gfn_t gfn;
+
+ gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
+ access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
+
+ /*
+ * Note, huge page splitting always uses direct shadow pages, regardless
+ * of whether the huge page itself is mapped by a direct or indirect
+ * shadow page, since the huge page region itself is being directly
+ * mapped with smaller pages.
+ */
+ role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+ /* Direct SPs do not require a shadowed_info_cache. */
+ caches.page_header_cache = &kvm->arch.split_page_header_cache;
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+ /* Safe to pass NULL for vCPU since requesting a direct SP. */
+ return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void shadow_mmu_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+
+{
+ struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+ u64 huge_spte = READ_ONCE(*huge_sptep);
+ struct kvm_mmu_page *sp;
+ bool flush = false;
+ u64 *sptep, spte;
+ gfn_t gfn;
+ int index;
+
+ sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+
+ for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
+ sptep = &sp->spt[index];
+ gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ /*
+ * The SP may already have populated SPTEs, e.g. if this huge
+ * page is aliased by multiple sptes with the same access
+ * permissions. These entries are guaranteed to map the same
+ * gfn-to-pfn translation since the SP is direct, so no need to
+ * modify them.
+ *
+ * However, if a given SPTE points to a lower level page table,
+ * that lower level page table may only be partially populated.
+ * Installing such SPTEs would effectively unmap a portion of the
+ * huge page. Unmapping guest memory always requires a TLB flush
+ * since a subsequent operation on the unmapped regions would
+ * fail to detect the need to flush.
+ */
+ if (is_shadow_present_pte(*sptep)) {
+ flush |= !is_last_spte(*sptep, sp->role.level);
+ continue;
+ }
+
+ spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
+ mmu_spte_set(sptep, spte);
+ __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+ }
+
+ __link_shadow_page(kvm, cache, huge_sptep, sp, flush);
+}
+
+static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ int level, r = 0;
+ gfn_t gfn;
+ u64 spte;
+
+ /* Grab information for the tracepoint before dropping the MMU lock. */
+ gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
+ level = huge_sp->role.level;
+ spte = *huge_sptep;
+
+ if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+ r = -ENOSPC;
+ goto out;
+ }
+
+ if (need_topup_split_caches_or_resched(kvm)) {
+ write_unlock(&kvm->mmu_lock);
+ cond_resched();
+ /*
+ * If the topup succeeds, return -EAGAIN to indicate that the
+ * rmap iterator should be restarted because the MMU lock was
+ * dropped.
+ */
+ r = topup_split_caches(kvm) ?: -EAGAIN;
+ write_lock(&kvm->mmu_lock);
+ goto out;
+ }
+
+ shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+ trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+ return r;
+}
+
+static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ struct rmap_iterator iter;
+ struct kvm_mmu_page *sp;
+ u64 *huge_sptep;
+ int r;
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+ sp = sptep_to_sp(huge_sptep);
+
+ /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
+ if (WARN_ON_ONCE(!sp->role.guest_mode))
+ continue;
+
+ /* The rmaps should never contain non-leaf SPTEs. */
+ if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+ continue;
+
+ /* SPs with level >PG_LEVEL_4K should never be unsync. */
+ if (WARN_ON_ONCE(sp->unsync))
+ continue;
+
+ /* Don't bother splitting huge pages on invalid SPs. */
+ if (sp->role.invalid)
+ continue;
+
+ r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+ /*
+ * The split succeeded or needs to be retried because the MMU
+ * lock was dropped. Either way, restart the iterator to get it
+ * back into a consistent state.
+ */
+ if (!r || r == -EAGAIN)
+ goto restart;
+
+ /* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+ break;
+ }
+
+ return false;
+}
+
+void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level)
+{
+ int level;
+
+ /*
+ * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+ * down to the target level. This ensures pages are recursively split
+ * all the way to the target level. There's no need to split pages
+ * already at the target level.
+ */
+ for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+ __walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
+ level, level, start, end - 1, true, false);
+ }
+}
+
+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ int need_tlb_flush = 0;
+ struct kvm_mmu_page *sp;
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ sp = sptep_to_sp(sptep);
+
+ /*
+ * We cannot do huge page mapping for indirect shadow pages,
+ * which are found on the last rmap (level = 1) when not using
+ * tdp; such shadow pages are synced with the page table in
+ * the guest, and the guest page table is using 4K page size
+ * mapping if the indirect sp has level = 1.
+ */
+ if (sp->role.direct &&
+ sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
+ PG_LEVEL_NUM)) {
+ kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
+
+ if (kvm_available_flush_tlb_with_range())
+ kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+ KVM_PAGES_PER_HPAGE(sp->role.level));
+ else
+ need_tlb_flush = 1;
+
+ goto restart;
+ }
+ }
+
+ return need_tlb_flush;
+}
+
+void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ /*
+ * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+ * pages that are already mapped at the maximum hugepage level.
+ */
+ if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
+ kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
+unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct kvm *kvm;
+ int nr_to_scan = sc->nr_to_scan;
+ unsigned long freed = 0;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ int idx;
+ LIST_HEAD(invalid_list);
+
+ /*
+ * Never scan more than sc->nr_to_scan VM instances.
+ * Will not hit this condition practically since we do not try
+ * to shrink more than one VM and it is very unlikely to see
+ * !n_used_mmu_pages so many times.
+ */
+ if (!nr_to_scan--)
+ break;
+ /*
+ * n_used_mmu_pages is accessed without holding kvm->mmu_lock
+ * here. We may skip a VM instance erroneously, but we do not
+ * want to shrink a VM that only started to populate its MMU
+ * anyway.
+ */
+ if (!kvm->arch.n_used_mmu_pages &&
+ !kvm_has_zapped_obsolete_pages(kvm))
+ continue;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ write_lock(&kvm->mmu_lock);
+
+ if (kvm_has_zapped_obsolete_pages(kvm)) {
+ kvm_mmu_commit_zap_page(kvm,
+ &kvm->arch.zapped_obsolete_pages);
+ goto unlock;
+ }
+
+ freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
+
+unlock:
+ write_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ /*
+ * unfair on small ones
+ * per-vm shrinkers cry out
+ * sadness comes quickly
+ */
+ list_move_tail(&kvm->vm_list, &vm_list);
+ break;
+ }
+
+ mutex_unlock(&kvm_lock);
+ return freed;
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 2bfba6ad20688..4534eadc9a17c 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -18,4 +18,149 @@

#include <linux/kvm_host.h>

+/* make pte_list_desc fit well in cache lines */
+#define PTE_LIST_EXT 14
+
+/*
+ * Slight optimization of cacheline layout, by putting `more' and `spte_count'
+ * at the start; then accessing it will only use one single cacheline for
+ * either full (entries==PTE_LIST_EXT) case or entries<=6.
+ */
+struct pte_list_desc {
+ struct pte_list_desc *more;
+ /*
+ * Stores number of entries stored in the pte_list_desc. No need to be
+ * u64 but just for easier alignment. When PTE_LIST_EXT, means full.
+ */
+ u64 spte_count;
+ u64 *sptes[PTE_LIST_EXT];
+};
+
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
+
+struct kvm_shadow_walk_iterator {
+ u64 addr;
+ hpa_t shadow_addr;
+ u64 *sptep;
+ int level;
+ unsigned index;
+};
+
+#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
+ for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
+ (_root), (_addr)); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
+bool mmu_spte_update(u64 *sptep, u64 new_spte);
+void mmu_spte_clear_no_track(u64 *sptep);
+gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index);
+void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
+ unsigned int access);
+
+struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+ const struct kvm_memory_slot *slot);
+bool rmap_can_add(struct kvm_vcpu *vcpu);
+void drop_spte(struct kvm *kvm, u64 *sptep);
+bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect);
+bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot);
+bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused);
+bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t pte);
+
+typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int level, pte_t pte);
+bool kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+ rmap_handler_t handler);
+
+bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused);
+bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int level, pte_t unused);
+
+void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte);
+int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
+ bool can_yield);
+void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);
+void clear_sp_write_flooding_count(u64 *spte);
+
+struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep,
+ gfn_t gfn, bool direct,
+ unsigned int access);
+
+void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, hpa_t root, u64 addr);
+void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, u64 addr);
+bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator);
+void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator);
+
+void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp);
+
+void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+ unsigned direct_access);
+
+int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
+ struct list_head *invalid_list);
+bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list,
+ int *nr_zapped);
+bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list);
+void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list);
+
+int make_mmu_pages_available(struct kvm_vcpu *vcpu);
+
+int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
+
+int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+ u64 *sptep, unsigned int pte_access, gfn_t gfn,
+ kvm_pfn_t pfn, struct kvm_page_fault *fault);
+void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep);
+int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte);
+
+hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level);
+int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu);
+int mmu_alloc_special_roots(struct kvm_vcpu *vcpu);
+
+int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level);
+
+void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr);
+void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes, struct kvm_page_track_notifier_node *node);
+
+/* The return value indicates if tlb flush on all vcpus is needed. */
+typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot);
+bool walk_slot_rmaps(struct kvm *kvm, const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn, int start_level, int end_level,
+ bool flush_on_yield);
+bool walk_slot_rmaps_4k(struct kvm *kvm, const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn, bool flush_on_yield);
+
+void kvm_zap_obsolete_pages(struct kvm *kvm);
+bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
+bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot);
+
+void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level);
+void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
+
+unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc);
#endif /* __KVM_X86_MMU_SHADOW_MMU_H */
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:29:48

by Ben Gardon

Subject: [PATCH 09/21] KVM: x86/MMU: Move paging_tmpl.h includes to shadow_mmu.c

Move the integration point for paging_tmpl.h to shadow_mmu.c since
paging_tmpl.h is ostensibly part of the Shadow MMU. This requires
modifying some of the definitions to be non-static and then exporting
the pre-processed function names through shadow_mmu.h since they are
needed for mmu context callbacks in mmu.c. This will facilitate cleanups
in following commits because many of the functions being exposed by
shadow_mmu.h are only needed by paging_tmpl.h. Those functions will no
longer need to be exported.
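
As a rough illustration of why the definitions must become non-static, the
sketch below stamps out one function body per paging mode and exposes the
resulting names through ordinary prototypes. It is not the kernel code:
DEFINE_PTE_INDEX, the *_pte_index functions, select_pte_index() and
pte_index_selftest() are hypothetical names, and a token-pasting macro stands
in for the repeated #include of paging_tmpl.h with a different PTTYPE.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a template header: one body, stamped out once
 * per paging mode. The functions are deliberately non-static so that other
 * translation units can call them through ordinary prototypes.
 */
#define DEFINE_PTE_INDEX(prefix, bits_per_level)                          \
	unsigned int prefix##_pte_index(uint64_t addr, int level)         \
	{                                                                 \
		unsigned int shift = 12 + (level - 1) * (bits_per_level); \
		return (addr >> shift) & ((1u << (bits_per_level)) - 1);  \
	}

/* What a shared header would export: the post-preprocessing names. */
unsigned int paging32_pte_index(uint64_t addr, int level);
unsigned int paging64_pte_index(uint64_t addr, int level);

DEFINE_PTE_INDEX(paging32, 10)	/* 32-bit paging: 10 index bits per level */
DEFINE_PTE_INDEX(paging64, 9)	/* 64-bit paging: 9 index bits per level */

/* A caller elsewhere picks an instantiation through a function pointer. */
typedef unsigned int (*pte_index_fn)(uint64_t addr, int level);

pte_index_fn select_pte_index(int is_64bit)
{
	return is_64bit ? paging64_pte_index : paging32_pte_index;
}

/* Small sanity check of both instantiations (illustrative only). */
void pte_index_selftest(void)
{
	assert(paging32_pte_index(0x00401000ULL, 2) == 1);
	assert(paging64_pte_index(0x00201000ULL, 2) == 1);
}
```

With static definitions, select_pte_index() would have to live in the same
translation unit as the instantiations; dropping static and adding prototypes
is what lets the callback setup stay in a different file.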

sync_mmio_spte() is only used by paging_tmpl.h, so move it along with
the includes.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 29 -----------------------------
arch/x86/kvm/mmu/paging_tmpl.h | 11 +++++------
arch/x86/kvm/mmu/shadow_mmu.c | 31 +++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 25 ++++++++++++++++++++++++-
4 files changed, 60 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index da290bfca0137..cef481a17a519 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1697,35 +1697,6 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
return kvm_read_cr3(vcpu);
}

-static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
- unsigned int access)
-{
- if (unlikely(is_mmio_spte(*sptep))) {
- if (gfn != get_mmio_spte_gfn(*sptep)) {
- mmu_spte_clear_no_track(sptep);
- return true;
- }
-
- mark_mmio_spte(vcpu, sptep, gfn, access);
- return true;
- }
-
- return false;
-}
-
-#define PTTYPE_EPT 18 /* arbitrary */
-#define PTTYPE PTTYPE_EPT
-#include "paging_tmpl.h"
-#undef PTTYPE
-
-#define PTTYPE 64
-#include "paging_tmpl.h"
-#undef PTTYPE
-
-#define PTTYPE 32
-#include "paging_tmpl.h"
-#undef PTTYPE
-
static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
u64 pa_bits_rsvd, int level, bool nx,
bool gbpages, bool pse, bool amd)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 730b413eebfde..1251357794538 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -787,7 +787,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
* Returns: 1 if we need to emulate the instruction, 0 otherwise, or
* a negative value on error.
*/
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct guest_walker walker;
int r;
@@ -889,7 +889,7 @@ static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp)
return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
}

-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
+void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
{
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
@@ -949,9 +949,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
}

/* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
-static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
- gpa_t addr, u64 access,
- struct x86_exception *exception)
+gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t addr,
+ u64 access, struct x86_exception *exception)
{
struct guest_walker walker;
gpa_t gpa = INVALID_GPA;
@@ -984,7 +983,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
* 0: the sp is synced and no tlb flushing is required
* > 0: the sp is synced and tlb flushing is required
*/
-static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
int i;
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index f3e2ed5b675eb..c7cfdc6f51b53 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -12,6 +12,8 @@
* Yaniv Kamay <[email protected]>
* Avi Kivity <[email protected]>
*/
+
+#include "ioapic.h"
#include "mmu.h"
#include "mmu_internal.h"
#include "mmutrace.h"
@@ -2809,6 +2811,35 @@ void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
walk_shadow_page_lockless_end(vcpu);
}

+static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
+ unsigned int access)
+{
+ if (unlikely(is_mmio_spte(*sptep))) {
+ if (gfn != get_mmio_spte_gfn(*sptep)) {
+ mmu_spte_clear_no_track(sptep);
+ return true;
+ }
+
+ mark_mmio_spte(vcpu, sptep, gfn, access);
+ return true;
+ }
+
+ return false;
+}
+
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 64
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 32
+#include "paging_tmpl.h"
+#undef PTTYPE
+
static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
{
struct kvm_mmu_page *sp;
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 4534eadc9a17c..7faf8b06e68f1 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -86,7 +86,6 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
int level, pte_t unused);

void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte);
-int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
bool can_yield);
void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);
@@ -163,4 +162,28 @@ void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot);

unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc);
+
+/* Exports from paging_tmpl.h */
+gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ gpa_t vaddr, u64 access,
+ struct x86_exception *exception);
+gpa_t paging64_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ gpa_t vaddr, u64 access,
+ struct x86_exception *exception);
+gpa_t ept_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t vaddr,
+ u64 access, struct x86_exception *exception);
+
+int paging32_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int paging64_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int ept_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+int paging32_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+int paging64_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+int ept_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+/* Defined in shadow_mmu.c. */
+int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+
+void paging32_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root);
+void paging64_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root);
+void ept_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root);
#endif /* __KVM_X86_MMU_SHADOW_MMU_H */
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:29:56

by Ben Gardon

Subject: [PATCH 12/21] KVM: x86/MMU: Clean up naming of exported Shadow MMU functions

Rename several functions exported from the Shadow MMU to use a
kvm_shadow_mmu_ prefix, mirroring the kvm_tdp_mmu_ prefix used by the TDP
MMU. More cleanups will follow to convert the remaining functions to a
similar naming scheme, but for now, start with the trivial renames.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++---------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/shadow_mmu.c | 19 ++++++++++---------
arch/x86/kvm/mmu/shadow_mmu.h | 17 +++++++++--------
4 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3ea54b08239aa..9308ab8102f9b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1089,7 +1089,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
int r;

write_lock(&vcpu->kvm->mmu_lock);
- r = make_mmu_pages_available(vcpu);
+ r = kvm_shadow_mmu_make_pages_available(vcpu);
if (r < 0)
goto out_unlock;

@@ -1164,7 +1164,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
if (is_tdp_mmu_active(vcpu))
leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
else
- leaf = get_walk(vcpu, addr, sptes, &root);
+ leaf = kvm_shadow_mmu_get_walk(vcpu, addr, sptes, &root);

walk_shadow_page_lockless_end(vcpu);

@@ -1432,11 +1432,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (is_page_fault_stale(vcpu, fault))
goto out_unlock;

- r = make_mmu_pages_available(vcpu);
+ r = kvm_shadow_mmu_make_pages_available(vcpu);
if (r)
goto out_unlock;

- r = direct_map(vcpu, fault);
+ r = kvm_shadow_mmu_direct_map(vcpu, fault);

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
@@ -1471,7 +1471,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
trace_kvm_page_fault(vcpu, fault_address, error_code);

if (kvm_event_needs_reinjection(vcpu))
- kvm_mmu_unprotect_page_virt(vcpu, fault_address);
+ kvm_shadow_mmu_unprotect_page_virt(vcpu, fault_address);
r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
insn_len);
} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
@@ -2786,7 +2786,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* In order to ensure all vCPUs drop their soon-to-be invalid roots,
* invalidating TDP MMU roots must be done while holding mmu_lock for
* write and in the same critical section as making the reload request,
- * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
+ * e.g. before kvm_shadow_mmu_zap_obsolete_pages() could drop mmu_lock
+ * and yield.
*/
if (tdp_mmu_enabled)
kvm_tdp_mmu_invalidate_all_roots(kvm);
@@ -2801,7 +2802,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);

- kvm_zap_obsolete_pages(kvm);
+ kvm_shadow_mmu_zap_obsolete_pages(kvm);

write_unlock(&kvm->mmu_lock);

@@ -2890,7 +2891,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

kvm_mmu_invalidate_begin(kvm, 0, -1ul);

- flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
+ flush = kvm_shadow_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);

if (tdp_mmu_enabled) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
@@ -3034,7 +3035,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
{
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- kvm_rmap_zap_collapsible_sptes(kvm, slot);
+ kvm_shadow_mmu_zap_collapsible_sptes(kvm, slot);
write_unlock(&kvm->mmu_lock);
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1251357794538..14a8c8217c4cf 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -866,7 +866,7 @@ int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_page_fault_stale(vcpu, fault))
goto out_unlock;

- r = make_mmu_pages_available(vcpu);
+ r = kvm_shadow_mmu_make_pages_available(vcpu);
if (r)
goto out_unlock;
r = FNAME(fetch)(vcpu, fault, &walker);
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index 76c50aca3c487..36b335d75aee2 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -1977,7 +1977,7 @@ static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
return 0;
}

-int make_mmu_pages_available(struct kvm_vcpu *vcpu)
+int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu)
{
unsigned long avail = kvm_mmu_available_pages(vcpu->kvm);

@@ -2041,7 +2041,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
return r;
}

-int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
+int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
{
gpa_t gpa;
int r;
@@ -2331,7 +2331,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
__direct_pte_prefetch(vcpu, sp, sptep);
}

-int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_shadow_walk_iterator it;
struct kvm_mmu_page *sp;
@@ -2549,7 +2549,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
return r;

write_lock(&vcpu->kvm->mmu_lock);
- r = make_mmu_pages_available(vcpu);
+ r = kvm_shadow_mmu_make_pages_available(vcpu);
if (r < 0)
goto out_unlock;

@@ -2797,7 +2797,8 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
*
* Must be called between walk_shadow_page_lockless_{begin,end}.
*/
-int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
+int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
{
struct kvm_shadow_walk_iterator iterator;
int leaf = -1;
@@ -3104,7 +3105,7 @@ __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
}

#define BATCH_ZAP_PAGES 10
-void kvm_zap_obsolete_pages(struct kvm *kvm)
+void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
int nr_zapped, batch = 0;
@@ -3165,7 +3166,7 @@ bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm)
return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}

-bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
{
const struct kvm_memory_slot *memslot;
struct kvm_memslots *slots;
@@ -3417,8 +3418,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
return need_tlb_flush;
}

-void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot)
+void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
{
/*
* Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 9e27d03fbe368..cc28895d2a24f 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -73,18 +73,19 @@ bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
struct list_head *invalid_list);
void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list);

-int make_mmu_pages_available(struct kvm_vcpu *vcpu);
+int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu);

-int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
+int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);

-int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte);

hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level);
int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu);
int mmu_alloc_special_roots(struct kvm_vcpu *vcpu);

-int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level);
+int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level);

void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes, struct kvm_page_track_notifier_node *node);
@@ -99,8 +100,8 @@ bool walk_slot_rmaps(struct kvm *kvm, const struct kvm_memory_slot *slot,
bool walk_slot_rmaps_4k(struct kvm *kvm, const struct kvm_memory_slot *slot,
slot_rmaps_handler fn, bool flush_on_yield);

-void kvm_zap_obsolete_pages(struct kvm *kvm);
-bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm);
+bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
const struct kvm_memory_slot *slot);
@@ -109,8 +110,8 @@ void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *slot,
gfn_t start, gfn_t end,
int target_level);
-void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot);
+void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);

bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm);
unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free);
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:00

by Ben Gardon

Subject: [PATCH 11/21] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU

The MMU shrinker currently only operates on the Shadow MMU, but having
the entire implementation in shadow_mmu.c is awkward since much of the
function isn't Shadow MMU specific. There has also been talk of changing
the target of the shrinker to the MMU caches rather than already allocated
page tables. As a result, it makes sense to move some of the implementation
back to mmu.c.
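
As a rough illustration of the split described above, here is a minimal
userspace sketch. It is not the kernel code: struct vm, its fields and both
function names are hypothetical stand-ins for struct kvm and the real
helpers, locking is only mentioned in comments, and the rotation of the
scanned VM to the tail of vm_list is left out. The common half chooses at
most one VM; the Shadow MMU half does the per-VM freeing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm; only the fields the sketch needs. */
struct vm {
	unsigned long n_used_mmu_pages;
	int has_zapped_obsolete_pages;
	struct vm *next;
};

/* Shadow-MMU-specific half: free up to pages_to_free pages from one VM. */
static unsigned long shadow_mmu_shrink_scan(struct vm *vm,
					    unsigned long pages_to_free)
{
	unsigned long freed;

	/* The real code takes kvm->srcu and kvm->mmu_lock around this. */
	if (vm->has_zapped_obsolete_pages) {
		/* Finish an earlier zap; nothing new is counted as freed. */
		vm->has_zapped_obsolete_pages = 0;
		return 0;
	}

	freed = vm->n_used_mmu_pages < pages_to_free ?
		vm->n_used_mmu_pages : pages_to_free;
	vm->n_used_mmu_pages -= freed;
	return freed;
}

/* Common half: walk the VM list, pick at most one VM, then delegate. */
static unsigned long mmu_shrink_scan(struct vm *vm_list,
				     unsigned long nr_to_scan)
{
	unsigned long budget = nr_to_scan;
	unsigned long freed = 0;
	struct vm *vm;

	/* The real code holds kvm_lock across this walk. */
	for (vm = vm_list; vm; vm = vm->next) {
		/* Never look at more than nr_to_scan VMs. */
		if (!budget--)
			break;

		/* Skip VMs with nothing to reclaim. */
		if (!vm->n_used_mmu_pages && !vm->has_zapped_obsolete_pages)
			continue;

		freed = shadow_mmu_shrink_scan(vm, nr_to_scan);
		break;
	}

	return freed;
}
```

In this model the common function knows nothing about how pages are freed,
which is the property that would let the shrinker be retargeted later
without touching the Shadow MMU half.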

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.c | 62 ++++++++---------------------------
arch/x86/kvm/mmu/shadow_mmu.h | 3 +-
3 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cef481a17a519..3ea54b08239aa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3145,6 +3145,49 @@ static unsigned long mmu_shrink_count(struct shrinker *shrink,
return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
}

+unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct kvm *kvm;
+ int nr_to_scan = sc->nr_to_scan;
+ unsigned long freed = 0;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ /*
+ * Never scan more than sc->nr_to_scan VM instances.
+ * Will not hit this condition practically since we do not try
+ * to shrink more than one VM and it is very unlikely to see
+ * !n_used_mmu_pages so many times.
+ */
+ if (!nr_to_scan--)
+ break;
+
+ /*
+ * n_used_mmu_pages is accessed without holding kvm->mmu_lock
+ * here. We may skip a VM instance errorneosly, but we do not
+ * want to shrink a VM that only started to populate its MMU
+ * anyway.
+ */
+ if (!kvm->arch.n_used_mmu_pages &&
+ !kvm_shadow_mmu_has_zapped_obsolete_pages(kvm))
+ continue;
+
+ freed = kvm_shadow_mmu_shrink_scan(kvm, sc->nr_to_scan);
+
+ /*
+ * unfair on small ones
+ * per-vm shrinkers cry out
+ * sadness comes quickly
+ */
+ list_move_tail(&kvm->vm_list, &vm_list);
+ break;
+ }
+
+ mutex_unlock(&kvm_lock);
+ return freed;
+}
+
static struct shrinker mmu_shrinker = {
.count_objects = mmu_shrink_count,
.scan_objects = mmu_shrink_scan,
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index 1be680bce15a6..76c50aca3c487 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3160,7 +3160,7 @@ void kvm_zap_obsolete_pages(struct kvm *kvm)
kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
}

-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
+bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm)
{
return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
}
@@ -3429,60 +3429,24 @@ void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
}

-unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free)
{
- struct kvm *kvm;
- int nr_to_scan = sc->nr_to_scan;
unsigned long freed = 0;
+ int idx;

- mutex_lock(&kvm_lock);
-
- list_for_each_entry(kvm, &vm_list, vm_list) {
- int idx;
- LIST_HEAD(invalid_list);
-
- /*
- * Never scan more than sc->nr_to_scan VM instances.
- * Will not hit this condition practically since we do not try
- * to shrink more than one VM and it is very unlikely to see
- * !n_used_mmu_pages so many times.
- */
- if (!nr_to_scan--)
- break;
- /*
- * n_used_mmu_pages is accessed without holding kvm->mmu_lock
- * here. We may skip a VM instance errorneosly, but we do not
- * want to shrink a VM that only started to populate its MMU
- * anyway.
- */
- if (!kvm->arch.n_used_mmu_pages &&
- !kvm_has_zapped_obsolete_pages(kvm))
- continue;
-
- idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
-
- if (kvm_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm,
- &kvm->arch.zapped_obsolete_pages);
- goto unlock;
- }
+ idx = srcu_read_lock(&kvm->srcu);
+ write_lock(&kvm->mmu_lock);

- freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
+ if (kvm_shadow_mmu_has_zapped_obsolete_pages(kvm)) {
+ kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ goto out;
+ }

-unlock:
- write_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
+ freed = kvm_mmu_zap_oldest_mmu_pages(kvm, pages_to_free);

- /*
- * unfair on small ones
- * per-vm shrinkers cry out
- * sadness comes quickly
- */
- list_move_tail(&kvm->vm_list, &vm_list);
- break;
- }
+out:
+ write_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);

- mutex_unlock(&kvm_lock);
return freed;
}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 9f16c4782bfbf..9e27d03fbe368 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -112,7 +112,8 @@ void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot);

-unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc);
+bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm);
+unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free);

/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:03

by Ben Gardon

Subject: [PATCH 17/21] KVM: x86/MMU: Add kvm_shadow_mmu_ to the last few functions in shadow_mmu.h

Fix up the names of the last few Shadow MMU functions in shadow_mmu.h.
This gives a clean and obvious interface between the shared x86 MMU
code and the Shadow MMU. There are still a few functions exported from
paging_tmpl.h that are left as-is, but changing those will need to be
done separately, if at all.
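
For readers skimming the rename, the call pattern that the common prefix is
meant to make obvious can be sketched as below. This is only an
illustration: the enum and all three function names are hypothetical, and
the real callers in mmu.c pass vCPU and fault state that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tags so the sketch can report which side was called. */
enum mmu_impl { IMPL_TDP, IMPL_SHADOW };

/* Stand-in for a kvm_tdp_mmu_*() entry point. */
static enum mmu_impl tdp_mmu_alloc_root(void)
{
	return IMPL_TDP;
}

/* Stand-in for a kvm_shadow_mmu_*() entry point. */
static enum mmu_impl shadow_mmu_alloc_root(void)
{
	return IMPL_SHADOW;
}

/* Common code: choose an implementation, then call through its prefix. */
static enum mmu_impl common_alloc_root(bool tdp_mmu_enabled)
{
	if (tdp_mmu_enabled)
		return tdp_mmu_alloc_root();

	return shadow_mmu_alloc_root();
}
```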

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++-------
arch/x86/kvm/mmu/shadow_mmu.c | 44 +++++++++++++++++++----------------
arch/x86/kvm/mmu/shadow_mmu.h | 16 +++++++------
3 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 156ab2e4cd811..f5b9db00eff99 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -884,7 +884,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (tdp_mmu_enabled)
sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
else
- sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
+ sptep = kvm_shadow_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte);

if (!is_shadow_present_pte(spte))
break;
@@ -1073,7 +1073,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
mmu->root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
+ root = kvm_shadow_mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
mmu->root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
if (WARN_ON_ONCE(!mmu->pae_root)) {
@@ -1084,8 +1084,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

- root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
- PT32_ROOT_LEVEL);
+ root = kvm_shadow_mmu_alloc_root(vcpu,
+ i << (30 - PAGE_SHIFT), 0, PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_value;
}
@@ -1663,7 +1663,7 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
* count. Otherwise, clear the write flooding count.
*/
if (!new_role.direct)
- __clear_sp_write_flooding_count(
+ kvm_shadow_mmu_clear_sp_write_flooding_count(
to_shadow_page(vcpu->arch.mmu->root.hpa));
}
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
@@ -2439,13 +2439,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
if (r)
goto out;
- r = mmu_alloc_special_roots(vcpu);
+ r = kvm_shadow_mmu_alloc_special_roots(vcpu);
if (r)
goto out;
if (vcpu->arch.mmu->root_role.direct)
r = mmu_alloc_direct_roots(vcpu);
else
- r = mmu_alloc_shadow_roots(vcpu);
+ r = kvm_shadow_mmu_alloc_shadow_roots(vcpu);
if (r)
goto out;

@@ -2674,7 +2674,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
* generally doesn't use PAE paging and can skip allocating the PDP
* table. The main exception, handled here, is SVM's 32-bit NPT. The
* other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
- * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
+ * KVM; that horror is handled on-demand by
+ * kvm_shadow_mmu_alloc_special_roots().
*/
if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
return 0;
@@ -2817,7 +2818,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
return r;
}

- node->track_write = kvm_mmu_pte_write;
+ node->track_write = kvm_shadow_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);

diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index dfff65db97c3b..eb4424fedd73a 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -1404,14 +1404,14 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
return 0;
}

-void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+void kvm_shadow_mmu_clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
{
atomic_set(&sp->write_flooding_count, 0);
}

static void clear_sp_write_flooding_count(u64 *spte)
{
- __clear_sp_write_flooding_count(sptep_to_sp(spte));
+ kvm_shadow_mmu_clear_sp_write_flooding_count(sptep_to_sp(spte));
}

/*
@@ -1482,7 +1482,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
kvm_flush_remote_tlbs(kvm);
}

- __clear_sp_write_flooding_count(sp);
+ kvm_shadow_mmu_clear_sp_write_flooding_count(sp);

goto out;
}
@@ -1607,12 +1607,13 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
* Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
* consumes bits 29:21. To consume bits 31:30, KVM's uses 4 shadow
* PDPTEs; those 4 PAE page directories are pre-allocated and their
- * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes
- * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume
- * bit 21 in the PTE (the child here), KVM propagates that bit to the
- * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE
- * covers bit 21 (see above), thus the quadrant is calculated from the
- * _least_ significant bit of the PDE index.
+ * quadrant is assigned in kvm_shadow_mmu_alloc_root().
+ * A 4-byte PTE consumes bits 21:12, while an 8-byte PTE consumes
+ * bits 20:12. To consume bit 21 in the PTE (the child here), KVM
+ * propagates that bit to the quadrant, i.e. sets quadrant to
+ * '0' or '1'. The parent 8-byte PDE covers bit 21 (see above), thus
+ * the quadrant is calculated from the _least_ significant bit of the
+ * PDE index.
*/
if (role.has_4_byte_gpte) {
WARN_ON_ONCE(role.level != PG_LEVEL_4K);
@@ -2389,7 +2390,8 @@ int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *faul
* - Must be called between walk_shadow_page_lockless_{begin,end}.
* - The returned sptep must not be used after walk_shadow_page_lockless_end.
*/
-u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
+u64 *kvm_shadow_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u64 *spte)
{
struct kvm_shadow_walk_iterator iterator;
u64 old_spte;
@@ -2442,7 +2444,8 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
return ret;
}

-hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level)
+hpa_t kvm_shadow_mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
+ u8 level)
{
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
struct kvm_mmu_page *sp;
@@ -2459,7 +2462,7 @@ hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level)
return __pa(sp->spt);
}

-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+static int kvm_shadow_mmu_first_shadow_root_alloc(struct kvm *kvm)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -2520,7 +2523,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
return r;
}

-int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
+int kvm_shadow_mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 pdptrs[4], pm_mask;
@@ -2549,7 +2552,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
}

- r = mmu_first_shadow_root_alloc(vcpu->kvm);
+ r = kvm_shadow_mmu_first_shadow_root_alloc(vcpu->kvm);
if (r)
return r;

@@ -2563,8 +2566,8 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* write-protect the guests page table root.
*/
if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->root_role.level);
+ root = kvm_shadow_mmu_alloc_root(vcpu, root_gfn, 0,
+ mmu->root_role.level);
mmu->root.hpa = root;
goto set_root_pgd;
}
@@ -2617,7 +2620,8 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
*/
quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;

- root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
+ root = kvm_shadow_mmu_alloc_root(vcpu, root_gfn, quadrant,
+ PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}

@@ -2636,7 +2640,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
return r;
}

-int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
+int kvm_shadow_mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
@@ -3009,8 +3013,8 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
return spte;
}

-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
- int bytes, struct kvm_page_track_notifier_node *node)
+void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes, struct kvm_page_track_notifier_node *node)
{
gfn_t gfn = gpa >> PAGE_SHIFT;
struct kvm_mmu_page *sp;
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index e4fbc842f524e..4d39017873aa6 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -39,7 +39,7 @@ struct pte_list_desc {
/* Only exported for debugfs.c. */
unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);

-void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);
+void kvm_shadow_mmu_clear_sp_write_flooding_count(struct kvm_mmu_page *sp);

bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
struct list_head *invalid_list,
@@ -54,17 +54,19 @@ int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu);
int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);

int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
-u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte);
+u64 *kvm_shadow_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u64 *spte);

-hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level);
-int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu);
-int mmu_alloc_special_roots(struct kvm_vcpu *vcpu);
+hpa_t kvm_shadow_mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
+ u8 level);
+int kvm_shadow_mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu);
+int kvm_shadow_mmu_alloc_special_roots(struct kvm_vcpu *vcpu);

int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);

-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
- int bytes, struct kvm_page_track_notifier_node *node);
+void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes, struct kvm_page_track_notifier_node *node);

void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm);
bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:07

by Ben Gardon

Subject: [PATCH 10/21] KVM: x86/MMU: Clean up Shadow MMU exports

Now that paging_tmpl.h is included from shadow_mmu.c, there's no need to
export many of the functions currently in shadow_mmu.h, so remove those
exports and mark the functions static. This cleans up the interface
of the Shadow MMU, and will allow the implementation to keep the details
of rmap_heads internal.
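
The effect on the walk iterator can be pictured with the following
simplified sketch. It assumes a 4-level, 9-bits-per-level layout and invents
all names (walk_iterator, walk_init, walk_leaf_index and so on); unlike the
real shadow_walk_okay(), it does not read SPTEs or stop at large or
non-present entries. The point is only that the iterator type, its helpers
and the for-each macro can all live in the .c file, with a single function
left visible to other files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LEVEL_4K	1
#define BITS_PER_LEVEL	9
#define PAGE_SHIFT_4K	12

/* Private to this file: no other file sees the iterator layout. */
struct walk_iterator {
	uint64_t addr;
	int level;
	unsigned index;
};

static void walk_init(struct walk_iterator *it, int root_level, uint64_t addr)
{
	it->addr = addr;
	it->level = root_level;
	it->index = 0;
}

static bool walk_okay(struct walk_iterator *it)
{
	if (it->level < LEVEL_4K)
		return false;

	/* Table index consumed by the current level of the walk. */
	it->index = (it->addr >>
		     (PAGE_SHIFT_4K + (it->level - 1) * BITS_PER_LEVEL)) & 0x1ff;
	return true;
}

static void walk_next(struct walk_iterator *it)
{
	--it->level;
}

#define for_each_entry(_root_level, _addr, _walker)		\
	for (walk_init(&(_walker), (_root_level), (_addr));	\
	     walk_okay(&(_walker));				\
	     walk_next(&(_walker)))

/* The only non-static symbol: what a header would still declare. */
unsigned walk_leaf_index(int root_level, uint64_t addr)
{
	struct walk_iterator it;
	unsigned leaf_index = 0;

	for_each_entry(root_level, addr, it)
		leaf_index = it.index;

	return leaf_index;
}
```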

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/shadow_mmu.c | 78 +++++++++++++++++++++--------------
arch/x86/kvm/mmu/shadow_mmu.h | 51 +----------------------
2 files changed, 48 insertions(+), 81 deletions(-)

diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index c7cfdc6f51b53..1be680bce15a6 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -24,6 +24,20 @@
#include <asm/cmpxchg.h>
#include <trace/events/kvm.h>

+struct kvm_shadow_walk_iterator {
+ u64 addr;
+ hpa_t shadow_addr;
+ u64 *sptep;
+ int level;
+ unsigned index;
+};
+
+#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
+ for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
+ (_root), (_addr)); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
#define for_each_shadow_entry(_vcpu, _addr, _walker) \
for (shadow_walk_init(&(_walker), _vcpu, _addr); \
shadow_walk_okay(&(_walker)); \
@@ -230,7 +244,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
*
* Returns true if the TLB needs to be flushed
*/
-bool mmu_spte_update(u64 *sptep, u64 new_spte)
+static bool mmu_spte_update(u64 *sptep, u64 new_spte)
{
bool flush = false;
u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
@@ -314,7 +328,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
* Directly clear spte without caring the state bits of sptep,
* it is used to set the upper level spte.
*/
-void mmu_spte_clear_no_track(u64 *sptep)
+static void mmu_spte_clear_no_track(u64 *sptep)
{
__update_clear_spte_fast(sptep, 0ull);
}
@@ -357,7 +371,7 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)

static bool sp_has_gptes(struct kvm_mmu_page *sp);

-gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
+static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
{
if (sp->role.passthrough)
return sp->gfn;
@@ -413,8 +427,8 @@ static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
}

-void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
- unsigned int access)
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
+ unsigned int access)
{
gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);

@@ -629,7 +643,7 @@ struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
}

-bool rmap_can_add(struct kvm_vcpu *vcpu)
+static bool rmap_can_add(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_memory_cache *mc;

@@ -737,7 +751,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \
_spte_; _spte_ = rmap_get_next(_iter_))

-void drop_spte(struct kvm *kvm, u64 *sptep)
+static void drop_spte(struct kvm *kvm, u64 *sptep)
{
u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep);

@@ -1114,7 +1128,7 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
pte_list_remove(parent_pte, &sp->parent_ptes);
}

-void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte)
+static void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte)
{
mmu_page_remove_parent_pte(sp, parent_pte);
mmu_spte_clear_no_track(parent_pte);
@@ -1344,8 +1358,8 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
} while (!sp->unsync_children);
}

-int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
- bool can_yield)
+static int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
+ bool can_yield)
{
int i;
struct kvm_mmu_page *sp;
@@ -1391,7 +1405,7 @@ void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
atomic_set(&sp->write_flooding_count, 0);
}

-void clear_sp_write_flooding_count(u64 *spte)
+static void clear_sp_write_flooding_count(u64 *spte)
{
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}
@@ -1604,9 +1618,9 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
return role;
}

-struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep,
- gfn_t gfn, bool direct,
- unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+ u64 *sptep, gfn_t gfn,
+ bool direct, unsigned int access)
{
union kvm_mmu_page_role role;

@@ -1617,8 +1631,9 @@ struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep,
return kvm_mmu_get_shadow_page(vcpu, gfn, role);
}

-void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, hpa_t root, u64 addr)
+static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, hpa_t root,
+ u64 addr)
{
iterator->addr = addr;
iterator->shadow_addr = root;
@@ -1645,14 +1660,14 @@ void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
}
}

-void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, u64 addr)
+static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, u64 addr)
{
shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
addr);
}

-bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
+static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
{
if (iterator->level < PG_LEVEL_4K)
return false;
@@ -1674,7 +1689,7 @@ static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
--iterator->level;
}

-void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
{
__shadow_walk_next(iterator, *iterator->sptep);
}
@@ -1714,13 +1729,14 @@ static void __link_shadow_page(struct kvm *kvm,
mark_unsync(sptep);
}

-void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+ struct kvm_mmu_page *sp)
{
__link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true);
}

-void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned direct_access)
+static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+ unsigned direct_access)
{
if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
struct kvm_mmu_page *child;
@@ -1742,8 +1758,8 @@ void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}

/* Returns the number of zapped non-leaf child shadow pages. */
-int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
- struct list_head *invalid_list)
+static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
+ struct list_head *invalid_list)
{
u64 pte;
struct kvm_mmu_page *child;
@@ -2156,9 +2172,9 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
return 0;
}

-int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
- u64 *sptep, unsigned int pte_access, gfn_t gfn,
- kvm_pfn_t pfn, struct kvm_page_fault *fault)
+static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+ u64 *sptep, unsigned int pte_access, gfn_t gfn,
+ kvm_pfn_t pfn, struct kvm_page_fault *fault)
{
struct kvm_mmu_page *sp = sptep_to_sp(sptep);
int level = sp->role.level;
@@ -2263,8 +2279,8 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
return 0;
}

-void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- u64 *sptep)
+static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *sptep)
{
u64 *spte, *start = NULL;
int i;
@@ -2800,7 +2816,7 @@ int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
return leaf;
}

-void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
+static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
{
struct kvm_shadow_walk_iterator iterator;
u64 spte;
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 7faf8b06e68f1..9f16c4782bfbf 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -36,32 +36,11 @@ struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
};

+/* Only exported for debugfs.c. */
unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);

-struct kvm_shadow_walk_iterator {
- u64 addr;
- hpa_t shadow_addr;
- u64 *sptep;
- int level;
- unsigned index;
-};
-
-#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
- for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
- (_root), (_addr)); \
- shadow_walk_okay(&(_walker)); \
- shadow_walk_next(&(_walker)))
-
-bool mmu_spte_update(u64 *sptep, u64 new_spte);
-void mmu_spte_clear_no_track(u64 *sptep);
-gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index);
-void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
- unsigned int access);
-
struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
const struct kvm_memory_slot *slot);
-bool rmap_can_add(struct kvm_vcpu *vcpu);
-void drop_spte(struct kvm *kvm, u64 *sptep);
bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect);
bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
const struct kvm_memory_slot *slot);
@@ -85,30 +64,8 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn,
int level, pte_t unused);

-void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte);
-int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent,
- bool can_yield);
void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);
-void clear_sp_write_flooding_count(u64 *spte);
-
-struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep,
- gfn_t gfn, bool direct,
- unsigned int access);
-
-void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, hpa_t root, u64 addr);
-void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, u64 addr);
-bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator);
-void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator);
-
-void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp);
-
-void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned direct_access);

-int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
- struct list_head *invalid_list);
bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
struct list_head *invalid_list,
int *nr_zapped);
@@ -120,11 +77,6 @@ int make_mmu_pages_available(struct kvm_vcpu *vcpu);

int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);

-int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
- u64 *sptep, unsigned int pte_access, gfn_t gfn,
- kvm_pfn_t pfn, struct kvm_page_fault *fault);
-void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- u64 *sptep);
int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte);

@@ -134,7 +86,6 @@ int mmu_alloc_special_roots(struct kvm_vcpu *vcpu);

int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level);

-void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr);
void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes, struct kvm_page_track_notifier_node *node);

--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:18

by Ben Gardon

Subject: [PATCH 13/21] KVM: x86/MMU: Fix naming on prepare / commit zap page functions

Since the various prepare / commit zap page functions are part of the
Shadow MMU and used all over both shadow_mmu.c and mmu.c, add _shadow_
to the function names to match the rest of the Shadow MMU interface.
Since there are so many uses of these functions, this rename gets its
own commit.
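
For context on why these functions come as a pair, the two-phase pattern can
be modelled roughly as follows. This is a userspace sketch with hypothetical
types and names, not the kernel implementation: prepare only marks a page
invalid and queues it, while commit performs one batched flush (a counter
here, a remote TLB flush in KVM) and then releases everything queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shadow page. */
struct page {
	bool invalid;
	struct page *next_invalid;
};

/* Hypothetical stand-in for the invalid_list passed between the phases. */
struct invalid_list {
	struct page *head;
};

/* Counts batched flushes so the sketch can show there is one per commit. */
static int tlb_flushes;

/* Phase 1: mark the page invalid and queue it; nothing is released yet. */
static void prepare_zap_page(struct page *p, struct invalid_list *list)
{
	p->invalid = true;
	p->next_invalid = list->head;
	list->head = p;
}

/* Phase 2: flush once for the whole batch, then release queued pages. */
static int commit_zap_page(struct invalid_list *list)
{
	int released = 0;

	if (!list->head)
		return 0;

	tlb_flushes++;

	while (list->head) {
		struct page *p = list->head;

		list->head = p->next_invalid;
		p->next_invalid = NULL;
		released++;
	}

	return released;
}
```

Batching this way means several prepare calls share a single flush, which is
why callers in both mmu.c and shadow_mmu.c carry an invalid list around and
commit it at the end.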

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 +++++++--------
arch/x86/kvm/mmu/shadow_mmu.c | 48 ++++++++++++++++++-----------------
arch/x86/kvm/mmu/shadow_mmu.h | 13 +++++-----
3 files changed, 43 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9308ab8102f9b..9b217e04cab0e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -230,8 +230,9 @@ void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
kvm_tdp_mmu_walk_lockless_end();
} else {
/*
- * Make sure the write to vcpu->mode is not reordered in front of
- * reads to sptes. If it does, kvm_mmu_commit_zap_page() can see us
+ * Make sure the write to vcpu->mode is not reordered in front
+ * of reads to sptes. If it does,
+ * kvm_shadow_mmu_commit_zap_page() can see us
* OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
*/
smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
@@ -568,7 +569,7 @@ bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list
return false;

if (!list_empty(invalid_list))
- kvm_mmu_commit_zap_page(kvm, invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, invalid_list);
else
kvm_flush_remote_tlbs(kvm);
return true;
@@ -1022,7 +1023,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
if (is_tdp_mmu_page(sp))
kvm_tdp_mmu_put_root(kvm, sp, false);
else if (!--sp->root_count && sp->role.invalid)
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list);

*root_hpa = INVALID_PAGE;
}
@@ -1075,7 +1076,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
mmu->root.pgd = 0;
}

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);
write_unlock(&kvm->mmu_lock);
}
EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
@@ -1397,8 +1398,8 @@ bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* there is a pending request to free obsolete roots. The request is
* only a hint that the current root _may_ be obsolete and needs to be
* reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
- * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
- * to reload even if no vCPU is actively using the root.
+ * previous root, then __kvm_shadow_mmu_prepare_zap_page() signals all
+ * vCPUs to reload even if no vCPU is actively using the root.
*/
if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
return true;
@@ -3101,13 +3102,13 @@ void kvm_mmu_zap_all(struct kvm *kvm)
list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
if (WARN_ON(sp->role.invalid))
continue;
- if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
+ if (__kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
goto restart;
if (cond_resched_rwlock_write(&kvm->mmu_lock))
goto restart;
}

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);

if (tdp_mmu_enabled)
kvm_tdp_mmu_zap_all(kvm);
@@ -3457,7 +3458,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
else if (is_tdp_mmu_page(sp))
flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
else
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list);
WARN_ON_ONCE(sp->nx_huge_page_disallowed);

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index 36b335d75aee2..32a24530cf19a 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -1282,7 +1282,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
int ret = vcpu->arch.mmu->sync_page(vcpu, sp);

if (ret < 0)
- kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
return ret;
}

@@ -1444,8 +1444,8 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
* upper-level page will be write-protected.
*/
if (role.level > PG_LEVEL_4K && sp->unsync)
- kvm_mmu_prepare_zap_page(kvm, sp,
- &invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list);
continue;
}

@@ -1487,7 +1487,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
++kvm->stat.mmu_cache_miss;

out:
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);

if (collisions > kvm->stat.max_mmu_page_hash_collisions)
kvm->stat.max_mmu_page_hash_collisions = collisions;
@@ -1779,8 +1779,8 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte,
*/
if (tdp_enabled && invalid_list &&
child->role.guest_mode && !child->parent_ptes.val)
- return kvm_mmu_prepare_zap_page(kvm, child,
- invalid_list);
+ return kvm_shadow_mmu_prepare_zap_page(kvm,
+ child, invalid_list);
}
} else if (is_mmio_spte(pte)) {
mmu_spte_clear_no_track(spte);
@@ -1825,7 +1825,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
struct kvm_mmu_page *sp;

for_each_sp(pages, sp, parents, i) {
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list);
mmu_pages_clear_parents(&parents);
zapped++;
}
@@ -1834,9 +1834,9 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
}

-bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list,
- int *nr_zapped)
+bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list,
+ int *nr_zapped)
{
bool list_unstable, zapped_root = false;

@@ -1898,16 +1898,17 @@ bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
return list_unstable;
}

-bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
+bool kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
{
int nr_zapped;

- __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
+ __kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
return nr_zapped;
}

-void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list)
+void kvm_shadow_mmu_commit_zap_page(struct kvm *kvm,
+ struct list_head *invalid_list)
{
struct kvm_mmu_page *sp, *nsp;

@@ -1952,8 +1953,8 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
if (sp->root_count)
continue;

- unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
- &nr_zapped);
+ unstable = __kvm_shadow_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list, &nr_zapped);
total_zapped += nr_zapped;
if (total_zapped >= nr_to_zap)
break;
@@ -1962,7 +1963,7 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
goto restart;
}

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);

kvm->stat.mmu_recycled += total_zapped;
return total_zapped;
@@ -2033,9 +2034,9 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
sp->role.word);
r = 1;
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list);
}
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);
write_unlock(&kvm->mmu_lock);

return r;
@@ -3032,7 +3033,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) {
if (detect_write_misaligned(sp, gpa, bytes) ||
detect_write_flooding(sp)) {
- kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
+ kvm_shadow_mmu_prepare_zap_page(vcpu->kvm, sp,
+ &invalid_list);
++vcpu->kvm->stat.mmu_flooded;
continue;
}
@@ -3141,7 +3143,7 @@ void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm)
goto restart;
}

- unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
+ unstable = __kvm_shadow_mmu_prepare_zap_page(kvm, sp,
&kvm->arch.zapped_obsolete_pages, &nr_zapped);
batch += nr_zapped;

@@ -3158,7 +3160,7 @@ void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm)
* kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
* running with an obsolete MMU.
*/
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ kvm_shadow_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
}

bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -3439,7 +3441,7 @@ unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free)
write_lock(&kvm->mmu_lock);

if (kvm_shadow_mmu_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ kvm_shadow_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
goto out;
}

diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index cc28895d2a24f..82eed9bb9bc9a 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -66,12 +66,13 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);

-bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list,
- int *nr_zapped);
-bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list);
-void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list);
+bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list,
+ int *nr_zapped);
+bool kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list);
+void kvm_shadow_mmu_commit_zap_page(struct kvm *kvm,
+ struct list_head *invalid_list);

int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu);

--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:26

by Ben Gardon

Subject: [PATCH 14/21] KVM: x86/MMU: Factor Shadow MMU wrprot / clear dirty ops out of mmu.c

There are several functions in mmu.c which bifurcate to the Shadow
and/or TDP MMU implementations. In most of these, the Shadow MMU
implementation is open-coded. Wrap these instances in a nice function
which just needs kvm and slot arguments or similar. This matches the TDP
MMU interface and will allow for some nice cleanups in a following
commit.
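
As an illustration only (not part of the patch), the mask walk that the
two *_pt_masked helpers share can be sketched in userspace C. It assumes
GCC or Clang, with __builtin_ctzl() standing in for the kernel's __ffs();
struct visit_log, visit_gfn() and for_each_masked_gfn() are hypothetical
names that replace the real rmap work.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-gfn work (rmap_write_protect() or
 * __rmap_clear_dirty() in the patch); here it only records the gfn.
 */
struct visit_log {
	unsigned long gfns[64];
	size_t count;
};

static void visit_gfn(struct visit_log *log, unsigned long gfn)
{
	assert(log->count < 64);
	log->gfns[log->count++] = gfn;
}

/* Visit base_gfn + bit for every set bit in mask, lowest bit first. */
static void for_each_masked_gfn(struct visit_log *log, unsigned long base_gfn,
				unsigned long mask)
{
	while (mask) {
		/* __builtin_ctzl() plays the role of the kernel's __ffs(). */
		visit_gfn(log, base_gfn + (unsigned long)__builtin_ctzl(mask));

		/* clear the lowest set bit */
		mask &= mask - 1;
	}
}
```

Each iteration handles exactly one set bit, so the loop runs once per
dirty page in the mask rather than once per bit position.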

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 52 ++++++----------------------
arch/x86/kvm/mmu/shadow_mmu.c | 64 +++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 15 ++++++++
3 files changed, 90 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9b217e04cab0e..44a00396284d5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -377,23 +377,13 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
- struct kvm_rmap_head *rmap_head;
-
if (tdp_mmu_enabled)
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, true);

- if (!kvm_memslots_have_rmaps(kvm))
- return;
-
- while (mask) {
- rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PG_LEVEL_4K, slot);
- rmap_write_protect(rmap_head, false);
-
- /* clear the first set bit */
- mask &= mask - 1;
- }
+ if (kvm_memslots_have_rmaps(kvm))
+ kvm_shadow_mmu_write_protect_pt_masked(kvm, slot, gfn_offset,
+ mask);
}

/**
@@ -410,23 +400,13 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
- struct kvm_rmap_head *rmap_head;
-
if (tdp_mmu_enabled)
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, false);

- if (!kvm_memslots_have_rmaps(kvm))
- return;
-
- while (mask) {
- rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PG_LEVEL_4K, slot);
- __rmap_clear_dirty(kvm, rmap_head, slot);
-
- /* clear the first set bit */
- mask &= mask - 1;
- }
+ if (kvm_memslots_have_rmaps(kvm))
+ kvm_shadow_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset,
+ mask);
}

/**
@@ -484,16 +464,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn,
int min_level)
{
- struct kvm_rmap_head *rmap_head;
- int i;
bool write_protected = false;

- if (kvm_memslots_have_rmaps(kvm)) {
- for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
- rmap_head = gfn_to_rmap(gfn, i, slot);
- write_protected |= rmap_write_protect(rmap_head, true);
- }
- }
+ if (kvm_memslots_have_rmaps(kvm))
+ write_protected |=
+ kvm_shadow_mmu_write_protect_gfn(kvm, slot, gfn, min_level);

if (tdp_mmu_enabled)
write_protected |=
@@ -2915,8 +2890,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
{
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
- start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+ kvm_shadow_mmu_wrprot_slot(kvm, memslot, start_level);
write_unlock(&kvm->mmu_lock);
}

@@ -3067,11 +3041,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
{
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- /*
- * Clear dirty bits only on 4k SPTEs since the legacy MMU only
- * support dirty logging at a 4k granularity.
- */
- walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false);
+ kvm_shadow_mmu_clear_dirty_slot(kvm, memslot);
write_unlock(&kvm->mmu_lock);
}

diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index 32a24530cf19a..b93a6174717d3 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3453,3 +3453,67 @@ unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free)

return freed;
}
+
+void kvm_shadow_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ struct kvm_rmap_head *rmap_head;
+
+ while (mask) {
+ rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ rmap_write_protect(rmap_head, false);
+
+ /* clear the first set bit */
+ mask &= mask - 1;
+ }
+}
+
+void kvm_shadow_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ struct kvm_rmap_head *rmap_head;
+
+ while (mask) {
+ rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ __rmap_clear_dirty(kvm, rmap_head, slot);
+
+ /* clear the first set bit */
+ mask &= mask - 1;
+ }
+}
+
+bool kvm_shadow_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ u64 gfn, int min_level)
+{
+ struct kvm_rmap_head *rmap_head;
+ int i;
+ bool write_protected = false;
+
+ if (kvm_memslots_have_rmaps(kvm)) {
+ for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+ rmap_head = gfn_to_rmap(gfn, i, slot);
+ write_protected |= rmap_write_protect(rmap_head, true);
+ }
+ }
+
+ return write_protected;
+}
+
+void kvm_shadow_mmu_clear_dirty_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot)
+{
+ walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false);
+}
+
+void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot,
+ int start_level)
+{
+ walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
+ start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 82eed9bb9bc9a..58f48293b4773 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -117,6 +117,21 @@ void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm,
bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm);
unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free);

+void kvm_shadow_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
+void kvm_shadow_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
+bool kvm_shadow_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ u64 gfn, int min_level);
+void kvm_shadow_mmu_clear_dirty_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot);
+void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot,
+ int start_level);
+
/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gpa_t vaddr, u64 access,
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:37

by Ben Gardon

Subject: [PATCH 15/21] KVM: x86/MMU: Remove unneeded exports from shadow_mmu.c

Now that the various dirty logging / wrprot function implementations are
in shadow_mmu.c, do another round of cleanups to remove functions which
no longer need to be exposed and can be marked static.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/shadow_mmu.c | 32 +++++++++++++++++++-------------
arch/x86/kvm/mmu/shadow_mmu.h | 18 ------------------
2 files changed, 19 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index b93a6174717d3..dc5c4b9899cc6 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -634,8 +634,8 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
return count;
}

-struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
- const struct kvm_memory_slot *slot)
+static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+ const struct kvm_memory_slot *slot)
{
unsigned long idx;

@@ -803,7 +803,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
return mmu_spte_update(sptep, spte);
}

-bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect)
+static bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -842,8 +842,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
* - W bit on ad-disabled SPTEs.
* Returns true iff any D or W bits were cleared.
*/
-bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot)
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -3057,6 +3057,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
write_unlock(&vcpu->kvm->mmu_lock);
}

+/* The return value indicates if tlb flush on all vcpus is needed. */
+typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot);
+
static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
const struct kvm_memory_slot *slot,
slot_rmaps_handler fn,
@@ -3087,20 +3092,21 @@ static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
return flush;
}

-__always_inline bool walk_slot_rmaps(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn, int start_level,
- int end_level, bool flush_on_yield)
+static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ bool flush_on_yield)
{
return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
slot->base_gfn, slot->base_gfn + slot->npages - 1,
flush_on_yield, false);
}

-__always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn,
- bool flush_on_yield)
+static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ bool flush_on_yield)
{
return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K,
PG_LEVEL_4K, flush_on_yield);
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 58f48293b4773..36fe8013931d2 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -39,11 +39,6 @@ struct pte_list_desc {
/* Only exported for debugfs.c. */
unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);

-struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
- const struct kvm_memory_slot *slot);
-bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect);
-bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot);
bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
pte_t unused);
@@ -91,22 +86,9 @@ int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
int bytes, struct kvm_page_track_notifier_node *node);

-/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot);
-bool walk_slot_rmaps(struct kvm *kvm, const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn, int start_level, int end_level,
- bool flush_on_yield);
-bool walk_slot_rmaps_4k(struct kvm *kvm, const struct kvm_memory_slot *slot,
- slot_rmaps_handler fn, bool flush_on_yield);
-
void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm);
bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

-bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- const struct kvm_memory_slot *slot);
-
void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *slot,
gfn_t start, gfn_t end,
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:41

by Ben Gardon

Subject: [PATCH 16/21] KVM: x86/MMU: Wrap uses of kvm_handle_gfn_range in mmu.c

kvm_handle_gfn_range + callback is not a bad interface, but it requires
exporting the whole callback scheme to mmu.c. Simplify the interface
with some basic wrapper functions, making the callback scheme internal
to shadow_mmu.c.
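
A minimal userspace sketch of the same idea, for illustration only: the
callback typedef, the handlers and the generic walker stay static, and
only thin wrappers are visible to other files. All names here
(struct gfn_range, handle_range(), range_has_odd_gfn(),
range_has_big_gfn()) are hypothetical and are not KVM symbols.

```c
#include <stdbool.h>

/* Hypothetical range type standing in for struct kvm_gfn_range. */
struct gfn_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* The callback type is private to this file. */
typedef bool (*range_handler_t)(unsigned long gfn);

static bool handler_is_odd(unsigned long gfn)
{
	return gfn & 1;
}

static bool handler_is_big(unsigned long gfn)
{
	return gfn >= 100;
}

/* Generic walker: ORs together the handler's result for every gfn. */
static bool handle_range(const struct gfn_range *range,
			 range_handler_t handler)
{
	unsigned long gfn;
	bool ret = false;

	for (gfn = range->start; gfn < range->end; gfn++)
		ret |= handler(gfn);

	return ret;
}

/* Only these wrappers would be declared in the header. */
bool range_has_odd_gfn(const struct gfn_range *range)
{
	return handle_range(range, handler_is_odd);
}

bool range_has_big_gfn(const struct gfn_range *range)
{
	return handle_range(range, handler_is_big);
}
```

Callers see one function per operation and never handle a function
pointer, which is the shape the wrappers in this patch give mmu.c.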

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++---
arch/x86/kvm/mmu/shadow_mmu.c | 54 +++++++++++++++++++++++++----------
arch/x86/kvm/mmu/shadow_mmu.h | 25 ++++------------
3 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 44a00396284d5..156ab2e4cd811 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -490,7 +490,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool flush = false;

if (kvm_memslots_have_rmaps(kvm))
- flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
+ flush = kvm_shadow_mmu_unmap_gfn_range(kvm, range);

if (tdp_mmu_enabled)
flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
@@ -503,7 +503,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
bool flush = false;

if (kvm_memslots_have_rmaps(kvm))
- flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap);
+ flush = kvm_shadow_mmu_set_spte_gfn(kvm, range);

if (tdp_mmu_enabled)
flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
@@ -516,7 +516,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
bool young = false;

if (kvm_memslots_have_rmaps(kvm))
- young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ young = kvm_shadow_mmu_age_gfn_range(kvm, range);

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -529,7 +529,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
bool young = false;

if (kvm_memslots_have_rmaps(kvm))
- young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+ young = kvm_shadow_mmu_test_age_gfn(kvm, range);

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index dc5c4b9899cc6..dfff65db97c3b 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -864,16 +864,16 @@ static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
return kvm_zap_all_rmap_sptes(kvm, rmap_head);
}

-bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused)
+static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused)
{
return __kvm_zap_rmap(kvm, rmap_head, slot);
}

-bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t pte)
+static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t pte)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -980,9 +980,13 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
slot_rmap_walk_okay(_iter_); \
slot_rmap_walk_next(_iter_))

-__always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
- struct kvm_gfn_range *range,
- rmap_handler_t handler)
+typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int level, pte_t pte);
+
+static __always_inline bool
+kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+ rmap_handler_t handler)
{
struct slot_rmap_walk_iterator iterator;
bool ret = false;
@@ -995,9 +999,9 @@ __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
return ret;
}

-bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused)
+static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t unused)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1009,9 +1013,9 @@ bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
return young;
}

-bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, pte_t unused)
+static bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int level, pte_t unused)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -3523,3 +3527,23 @@ void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm,
walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
}
+
+bool kvm_shadow_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
+}
+
+bool kvm_shadow_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap);
+}
+
+bool kvm_shadow_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+}
+
+bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 36fe8013931d2..e4fbc842f524e 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -39,26 +39,6 @@ struct pte_list_desc {
/* Only exported for debugfs.c. */
unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);

-bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused);
-bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t pte);
-
-typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, pte_t pte);
-bool kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
- rmap_handler_t handler);
-
-bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn, int level,
- pte_t unused);
-bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, pte_t unused);
-
void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp);

bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -114,6 +94,11 @@ void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
int start_level);

+bool kvm_shadow_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_shadow_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_shadow_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+
/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gpa_t vaddr, u64 access,
--
2.39.1.519.gcb327c4b5f-goog


2023-02-02 18:30:43

by Ben Gardon

Subject: [PATCH 18/21] KVM: x86/mmu: Move split cache topup functions to shadow_mmu.c

The split cache topup functions are only used by the Shadow MMU and were
left behind in mmu.c when splitting the Shadow MMU out to a separate
file. Move them over as well.

No functional change intended.

Suggested-by: David Matlack <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 53 ---------------------------------
arch/x86/kvm/mmu/mmu_internal.h | 2 --
arch/x86/kvm/mmu/shadow_mmu.c | 53 +++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 55 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f5b9db00eff99..8514e998e2127 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2902,59 +2902,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
}
}

-static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
-{
- return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
-}
-
-bool need_topup_split_caches_or_resched(struct kvm *kvm)
-{
- if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
- return true;
-
- /*
- * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
- * to split a single huge page. Calculating how many are actually needed
- * is possible but not worth the complexity.
- */
- return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
- need_topup(&kvm->arch.split_page_header_cache, 1) ||
- need_topup(&kvm->arch.split_shadow_page_cache, 1);
-}
-
-int topup_split_caches(struct kvm *kvm)
-{
- /*
- * Allocating rmap list entries when splitting huge pages for nested
- * MMUs is uncommon as KVM needs to use a list if and only if there is
- * more than one rmap entry for a gfn, i.e. requires an L1 gfn to be
- * aliased by multiple L2 gfns and/or from multiple nested roots with
- * different roles. Aliasing gfns when using TDP is atypical for VMMs;
- * a few gfns are often aliased during boot, e.g. when remapping BIOS,
- * but aliasing rarely occurs post-boot or for many gfns. If there is
- * only one rmap entry, rmap->val points directly at that one entry and
- * doesn't need to allocate a list. Buffer the cache by the default
- * capacity so that KVM doesn't have to drop mmu_lock to topup if KVM
- * encounters an aliased gfn or two.
- */
- const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
- KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
- int r;
-
- lockdep_assert_held(&kvm->slots_lock);
-
- r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
- SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
- if (r)
- return r;
-
- r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
- if (r)
- return r;
-
- return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
-}
-
/* Must be called with the mmu_lock held in write-mode. */
void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 349d4a300ad34..2273c6263faf0 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -348,8 +348,6 @@ void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu);
void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu);

int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect);
-bool need_topup_split_caches_or_resched(struct kvm *kvm);
-int topup_split_caches(struct kvm *kvm);

bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index eb4424fedd73a..bb23692d34a73 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3219,6 +3219,59 @@ bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
return rmap_write_protect(rmap_head, false);
}

+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+ return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+ return true;
+
+ /*
+ * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
+ * to split a single huge page. Calculating how many are actually needed
+ * is possible but not worth the complexity.
+ */
+ return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
+ need_topup(&kvm->arch.split_page_header_cache, 1) ||
+ need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+ /*
+ * Allocating rmap list entries when splitting huge pages for nested
+ * MMUs is uncommon as KVM needs to use a list if and only if there is
+ * more than one rmap entry for a gfn, i.e. requires an L1 gfn to be
+ * aliased by multiple L2 gfns and/or from multiple nested roots with
+ * different roles. Aliasing gfns when using TDP is atypical for VMMs;
+ * a few gfns are often aliased during boot, e.g. when remapping BIOS,
+ * but aliasing rarely occurs post-boot or for many gfns. If there is
+ * only one rmap entry, rmap->val points directly at that one entry and
+ * doesn't need to allocate a list. Buffer the cache by the default
+ * capacity so that KVM doesn't have to drop mmu_lock to topup if KVM
+ * encounters an aliased gfn or two.
+ */
+ const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+ int r;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
+ SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
+ if (r)
+ return r;
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+ if (r)
+ return r;
+
+ return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
{
struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
--
2.39.1.519.gcb327c4b5f-goog
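
The comment in topup_split_caches() describes a refill policy: top up only when the cache has dropped below a minimum, but fill past the minimum so the lock rarely has to be dropped again. Below is a minimal userspace sketch of that policy. It is not the KVM memory cache code; struct obj_cache, the helper names, and both constants are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real values live in the KVM headers. */
#define MIN_NR_OBJECTS   4	/* role of SPLIT_DESC_CACHE_MIN_NR_OBJECTS */
#define DEFAULT_CAPACITY 40	/* role of KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */

struct obj_cache {
	int nobjs;	/* objects currently available */
	int capacity;	/* size of the objects array, 0 until first top-up */
	void **objects;
};

/* Same test as need_topup() in the patch: fewer than min objects free. */
static int cache_need_topup(const struct obj_cache *c, int min)
{
	return c->nobjs < min;
}

/* Refill to capacity, but only if the cache is below min. Returns 0 or -1. */
static int cache_topup(struct obj_cache *c, int capacity, int min)
{
	if (!cache_need_topup(c, min))
		return 0;

	if (!c->objects) {
		c->objects = calloc(capacity, sizeof(*c->objects));
		if (!c->objects)
			return -1;
		c->capacity = capacity;
	}

	while (c->nobjs < c->capacity) {
		void *obj = calloc(1, 64);	/* arbitrary object size */

		if (!obj)
			return c->nobjs >= min ? 0 : -1;
		c->objects[c->nobjs++] = obj;
	}
	return 0;
}

/* Take one object; the caller must have topped up beforehand. */
static void *cache_alloc(struct obj_cache *c)
{
	assert(c->nobjs > 0);
	return c->objects[--c->nobjs];
}
```

With capacity set to MIN_NR_OBJECTS + DEFAULT_CAPACITY, a single top-up leaves enough headroom to absorb a few unexpected allocations before cache_need_topup() fires again.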


2023-02-02 18:30:47

by Ben Gardon

Subject: [PATCH 19/21] KVM: x86/mmu: Move Shadow MMU part of kvm_mmu_zap_all() to shadow_mmu.h

Move the Shadow MMU part of kvm_mmu_zap_all() into a helper function,
kvm_shadow_mmu_zap_all(), defined in shadow_mmu.c and declared in
shadow_mmu.h. Also check kvm_memslots_have_rmaps() so the Shadow MMU
operation can be skipped entirely if it's not needed. This could present
an opportunity to move the TDP MMU portion of the function under the
MMU lock in read mode, but since zapping all paging structures should be
very rare, and thus not performance sensitive, it's not necessary.

Suggested-by: David Matlack <[email protected]>

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 ++---------------
arch/x86/kvm/mmu/shadow_mmu.c | 19 +++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 2 ++
3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8514e998e2127..63b928bded9d1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3011,22 +3011,9 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,

void kvm_mmu_zap_all(struct kvm *kvm)
{
- struct kvm_mmu_page *sp, *node;
- LIST_HEAD(invalid_list);
- int ign;
-
write_lock(&kvm->mmu_lock);
-restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
- if (WARN_ON(sp->role.invalid))
- continue;
- if (__kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
- goto restart;
- if (cond_resched_rwlock_write(&kvm->mmu_lock))
- goto restart;
- }
-
- kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);
+ if (kvm_memslots_have_rmaps(kvm))
+ kvm_shadow_mmu_zap_all(kvm);

if (tdp_mmu_enabled)
kvm_tdp_mmu_zap_all(kvm);
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index bb23692d34a73..c6d3da795992e 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3604,3 +3604,22 @@ bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
}
+
+void kvm_shadow_mmu_zap_all(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ LIST_HEAD(invalid_list);
+ int ign;
+
+restart:
+ list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+ if (WARN_ON(sp->role.invalid))
+ continue;
+ if (__kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
+ goto restart;
+ if (cond_resched_rwlock_write(&kvm->mmu_lock))
+ goto restart;
+ }
+
+ kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index 4d39017873aa6..ab01636373bda 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -101,6 +101,8 @@ bool kvm_shadow_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_shadow_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);

+void kvm_shadow_mmu_zap_all(struct kvm *kvm);
+
/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gpa_t vaddr, u64 access,
--
2.39.1.519.gcb327c4b5f-goog
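
The control flow that kvm_mmu_zap_all() ends up with after this patch can be sketched in userspace as below. This is only an illustration of the dispatch: struct vm, its fields, and the three functions are hypothetical stand-ins, and the locking the real function does around these calls is omitted.

```c
#include <stdbool.h>

struct vm {
	bool have_rmaps;	/* stand-in for kvm_memslots_have_rmaps(kvm) */
	bool tdp_mmu_enabled;	/* stand-in for the tdp_mmu_enabled global */
	int shadow_zaps;	/* counts calls into the Shadow MMU helper */
	int tdp_zaps;		/* counts calls into the TDP MMU helper */
};

static void shadow_mmu_zap_all(struct vm *vm)
{
	vm->shadow_zaps++;
}

static void tdp_mmu_zap_all(struct vm *vm)
{
	vm->tdp_zaps++;
}

static void mmu_zap_all(struct vm *vm)
{
	/* Skip the Shadow MMU walk entirely when no rmaps were allocated. */
	if (vm->have_rmaps)
		shadow_mmu_zap_all(vm);

	if (vm->tdp_mmu_enabled)
		tdp_mmu_zap_all(vm);
}
```

The two checks are independent, so a VM that uses both MMUs (for example, TDP MMU plus nested shadow pages) still zaps both.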


2023-02-02 18:30:49

by Ben Gardon

Subject: [PATCH 20/21] KVM: x86/mmu: Move Shadow MMU init/teardown to shadow_mmu.c

Move the meat of kvm_mmu_init_vm() and kvm_mmu_uninit_vm() pertaining to
the Shadow MMU to shadow_mmu.c.

Suggested-by: David Matlack <[email protected]>

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 41 +++---------------------------
arch/x86/kvm/mmu/mmu_internal.h | 2 ++
arch/x86/kvm/mmu/shadow_mmu.c | 44 +++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/shadow_mmu.h | 6 ++---
4 files changed, 51 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 63b928bded9d1..10aff23dea75d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2743,7 +2743,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
* not use any resource of the being-deleted slot or all slots
* after calling the function.
*/
-static void kvm_mmu_zap_all_fast(struct kvm *kvm)
+void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
lockdep_assert_held(&kvm->slots_lock);

@@ -2795,22 +2795,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}

-static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- struct kvm_page_track_notifier_node *node)
-{
- kvm_mmu_zap_all_fast(kvm);
-}
-
int kvm_mmu_init_vm(struct kvm *kvm)
{
- struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
int r;

- INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
- INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
- spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
+
+ kvm_mmu_init_shadow_mmu(kvm);

if (tdp_mmu_enabled) {
r = kvm_mmu_init_tdp_mmu(kvm);
@@ -2818,38 +2809,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
return r;
}

- node->track_write = kvm_shadow_mmu_pte_write;
- node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
- kvm_page_track_register_notifier(kvm, node);
-
- kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
- kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
-
- kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
-
- kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
- kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
-
return 0;
}

-static void mmu_free_vm_memory_caches(struct kvm *kvm)
-{
- kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
- kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
- kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
-}
-
void kvm_mmu_uninit_vm(struct kvm *kvm)
{
- struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
-
- kvm_page_track_unregister_notifier(kvm, node);
-
+ kvm_mmu_uninit_shadow_mmu(kvm);
if (tdp_mmu_enabled)
kvm_mmu_uninit_tdp_mmu(kvm);
-
- mmu_free_vm_memory_caches(kvm);
}

/*
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2273c6263faf0..c49d302b037ec 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -406,4 +406,6 @@ BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma);
+
+void kvm_mmu_zap_all_fast(struct kvm *kvm);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index c6d3da795992e..6449ac4de4883 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3013,8 +3013,9 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
return spte;
}

-void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
- int bytes, struct kvm_page_track_notifier_node *node)
+static void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+ const u8 *new, int bytes,
+ struct kvm_page_track_notifier_node *node)
{
gfn_t gfn = gpa >> PAGE_SHIFT;
struct kvm_mmu_page *sp;
@@ -3623,3 +3624,42 @@ void kvm_shadow_mmu_zap_all(struct kvm *kvm)

kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list);
}
+
+static void kvm_shadow_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ struct kvm_page_track_notifier_node *node)
+{
+ kvm_mmu_zap_all_fast(kvm);
+}
+
+void kvm_mmu_init_shadow_mmu(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+ INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
+ spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
+
+ node->track_write = kvm_shadow_mmu_pte_write;
+ node->track_flush_slot = kvm_shadow_mmu_invalidate_zap_pages_in_memslot;
+ kvm_page_track_register_notifier(kvm, node);
+
+ kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+ kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+ kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+}
+
+void kvm_mmu_uninit_shadow_mmu(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+ kvm_page_track_unregister_notifier(kvm, node);
+
+ kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index ab01636373bda..f2e54355ebb19 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -65,9 +65,6 @@ int kvm_shadow_mmu_alloc_special_roots(struct kvm_vcpu *vcpu);
int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);

-void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
- int bytes, struct kvm_page_track_notifier_node *node);
-
void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm);
bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

@@ -103,6 +100,9 @@ bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);

void kvm_shadow_mmu_zap_all(struct kvm *kvm);

+void kvm_mmu_init_shadow_mmu(struct kvm *kvm);
+void kvm_mmu_uninit_shadow_mmu(struct kvm *kvm);
+
/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gpa_t vaddr, u64 access,
--
2.39.1.519.gcb327c4b5f-goog
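
One effect of this patch is that the page-track callbacks become static: they are only referenced where the notifier node is filled in, so nothing outside shadow_mmu.c needs their names. A small userspace sketch of that registration pattern follows. It is not the KVM page-track API; struct notifier_node, struct vm, and every function here are hypothetical and simplified (the real callbacks take more arguments).

```c
#include <stddef.h>

struct vm;

struct notifier_node {
	struct notifier_node *next;
	void (*track_write)(struct vm *vm, unsigned long gpa);
	void (*track_flush_slot)(struct vm *vm);
};

struct vm {
	struct notifier_node *notifiers;	/* registered nodes */
	struct notifier_node sp_tracker;	/* role of kvm->arch.mmu_sp_tracker */
	int writes_seen;
	int fast_zaps;
};

static void register_notifier(struct vm *vm, struct notifier_node *n)
{
	n->next = vm->notifiers;
	vm->notifiers = n;
}

static void unregister_notifier(struct vm *vm, struct notifier_node *n)
{
	struct notifier_node **pp;

	for (pp = &vm->notifiers; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* File-local callbacks: only init_shadow_mmu() needs to name them. */
static void shadow_pte_write(struct vm *vm, unsigned long gpa)
{
	(void)gpa;
	vm->writes_seen++;
}

static void shadow_flush_slot(struct vm *vm)
{
	vm->fast_zaps++;	/* the real callback calls kvm_mmu_zap_all_fast() */
}

static void init_shadow_mmu(struct vm *vm)
{
	struct notifier_node *node = &vm->sp_tracker;

	node->track_write = shadow_pte_write;
	node->track_flush_slot = shadow_flush_slot;
	register_notifier(vm, node);
}

static void uninit_shadow_mmu(struct vm *vm)
{
	unregister_notifier(vm, &vm->sp_tracker);
}

/* Deliver a guest write to every registered node. */
static void notify_write(struct vm *vm, unsigned long gpa)
{
	struct notifier_node *n;

	for (n = vm->notifiers; n; n = n->next)
		if (n->track_write)
			n->track_write(vm, gpa);
}

/* Deliver a memslot flush to every registered node. */
static void notify_flush_slot(struct vm *vm)
{
	struct notifier_node *n;

	for (n = vm->notifiers; n; n = n->next)
		if (n->track_flush_slot)
			n->track_flush_slot(vm);
}
```

Only init_shadow_mmu() and uninit_shadow_mmu() would need prototypes in a header; the callbacks stay private to the file that registers them.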


2023-02-02 18:31:02

by Ben Gardon

Subject: [PATCH 21/21] KVM: x86/mmu: Split out Shadow MMU lockless walk begin/end

Split out the Shadow MMU half of walk_shadow_page_lockless_begin/end()
into kvm_shadow_mmu_walk_lockless_begin/end() in shadow_mmu.c since
there's no need for it in the common MMU code.

Suggested-by: David Matlack <[email protected]>

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 31 ++++++-------------------------
arch/x86/kvm/mmu/shadow_mmu.c | 27 +++++++++++++++++++++++++++
arch/x86/kvm/mmu/shadow_mmu.h | 3 +++
3 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 10aff23dea75d..cfccc4c7a1427 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -207,37 +207,18 @@ static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)

void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
- if (is_tdp_mmu_active(vcpu)) {
+ if (is_tdp_mmu_active(vcpu))
kvm_tdp_mmu_walk_lockless_begin();
- } else {
- /*
- * Prevent page table teardown by making any free-er wait during
- * kvm_flush_remote_tlbs() IPI to all active vcpus.
- */
- local_irq_disable();
-
- /*
- * Make sure a following spte read is not reordered ahead of the write
- * to vcpu->mode.
- */
- smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
- }
+ else
+ kvm_shadow_mmu_walk_lockless_begin(vcpu);
}

void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
- if (is_tdp_mmu_active(vcpu)) {
+ if (is_tdp_mmu_active(vcpu))
kvm_tdp_mmu_walk_lockless_end();
- } else {
- /*
- * Make sure the write to vcpu->mode is not reordered in front
- * of reads to sptes. If it does,
- * kvm_shadow_mmu_commit_zap_page() can see us
- * OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
- */
- smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
- local_irq_enable();
- }
+ else
+ kvm_shadow_mmu_walk_lockless_end(vcpu);
}

int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c
index 6449ac4de4883..c5d0accd6e057 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.c
+++ b/arch/x86/kvm/mmu/shadow_mmu.c
@@ -3663,3 +3663,30 @@ void kvm_mmu_uninit_shadow_mmu(struct kvm *kvm)
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
}
+
+void kvm_shadow_mmu_walk_lockless_begin(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Prevent page table teardown by making any free-er wait during
+ * kvm_flush_remote_tlbs() IPI to all active vcpus.
+ */
+ local_irq_disable();
+
+ /*
+ * Make sure a following spte read is not reordered ahead of the write
+ * to vcpu->mode.
+ */
+ smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
+}
+
+void kvm_shadow_mmu_walk_lockless_end(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Make sure the write to vcpu->mode is not reordered in front
+ * of reads to sptes. If it does,
+ * kvm_shadow_mmu_commit_zap_page() can see us
+ * OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
+ */
+ smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
+ local_irq_enable();
+}
diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h
index f2e54355ebb19..12835872bda34 100644
--- a/arch/x86/kvm/mmu/shadow_mmu.h
+++ b/arch/x86/kvm/mmu/shadow_mmu.h
@@ -103,6 +103,9 @@ void kvm_shadow_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_init_shadow_mmu(struct kvm *kvm);
void kvm_mmu_uninit_shadow_mmu(struct kvm *kvm);

+void kvm_shadow_mmu_walk_lockless_begin(struct kvm_vcpu *vcpu);
+void kvm_shadow_mmu_walk_lockless_end(struct kvm_vcpu *vcpu);
+
/* Exports from paging_tmpl.h */
gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gpa_t vaddr, u64 access,
--
2.39.1.519.gcb327c4b5f-goog
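
The ordering requirements in the two comments above can be approximated in userspace with C11 atomics. This is a rough analogy only: smp_store_mb() is modelled as a relaxed store followed by a sequentially consistent fence, smp_store_release() as a release store, and the IRQ disable/enable has no userspace equivalent and is left out. struct vcpu and all function names are hypothetical.

```c
#include <stdatomic.h>

enum vcpu_mode {
	OUTSIDE_GUEST_MODE,
	IN_GUEST_MODE,
	READING_SHADOW_PAGE_TABLES,
};

struct vcpu {
	_Atomic int mode;
};

static void walk_lockless_begin(struct vcpu *v)
{
	/*
	 * Store, then full fence: SPTE reads that follow must not be
	 * reordered ahead of the write to mode.
	 */
	atomic_store_explicit(&v->mode, READING_SHADOW_PAGE_TABLES,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

static void walk_lockless_end(struct vcpu *v)
{
	/*
	 * Release store: SPTE reads that came before must not be
	 * reordered after the write to mode.
	 */
	atomic_store_explicit(&v->mode, OUTSIDE_GUEST_MODE,
			      memory_order_release);
}

/* Read one entry inside a begin/end pair. */
static unsigned long walk_one(struct vcpu *v, _Atomic unsigned long *spte)
{
	unsigned long val;

	walk_lockless_begin(v);
	val = atomic_load_explicit(spte, memory_order_relaxed);
	walk_lockless_end(v);
	return val;
}
```

A plain sequentially consistent store would not be a faithful substitute for the begin side, because C11 only orders it against other sequentially consistent accesses; the explicit fence is what keeps the later relaxed load from moving up.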


2023-02-04 13:54:19

by kernel test robot

Subject: Re: [PATCH 11/21] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU

Hi Ben,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f]

url: https://github.com/intel-lab-lkp/linux/commits/Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
base: 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f
patch link: https://lore.kernel.org/r/20230202182809.1929122-12-bgardon%40google.com
patch subject: [PATCH 11/21] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU
config: x86_64-randconfig-a002 (https://download.01.org/0day-ci/archive/20230204/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/c1170de906dfe1ee64da0066e7c28d35e716ed18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
git checkout c1170de906dfe1ee64da0066e7c28d35e716ed18
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3148:15: warning: no previous prototype for 'mmu_shrink_scan' [-Wmissing-prototypes]
3148 | unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
| ^~~~~~~~~~~~~~~


vim +/mmu_shrink_scan +3148 arch/x86/kvm/mmu/mmu.c

3147
> 3148 unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
3149 {
3150 struct kvm *kvm;
3151 int nr_to_scan = sc->nr_to_scan;
3152 unsigned long freed = 0;
3153
3154 mutex_lock(&kvm_lock);
3155
3156 list_for_each_entry(kvm, &vm_list, vm_list) {
3157 /*
3158 * Never scan more than sc->nr_to_scan VM instances.
3159 * Will not hit this condition practically since we do not try
3160 * to shrink more than one VM and it is very unlikely to see
3161 * !n_used_mmu_pages so many times.
3162 */
3163 if (!nr_to_scan--)
3164 break;
3165
3166 /*
3167 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
3168 * here. We may skip a VM instance errorneosly, but we do not
3169 * want to shrink a VM that only started to populate its MMU
3170 * anyway.
3171 */
3172 if (!kvm->arch.n_used_mmu_pages &&
3173 !kvm_shadow_mmu_has_zapped_obsolete_pages(kvm))
3174 continue;
3175
3176 freed = kvm_shadow_mmu_shrink_scan(kvm, sc->nr_to_scan);
3177
3178 /*
3179 * unfair on small ones
3180 * per-vm shrinkers cry out
3181 * sadness comes quickly
3182 */
3183 list_move_tail(&kvm->vm_list, &vm_list);
3184 break;
3185 }
3186
3187 mutex_unlock(&kvm_lock);
3188 return freed;
3189 }
3190

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-02-04 14:44:55

by kernel test robot

Subject: Re: [PATCH 15/21] KVM: x86/MMU: Remove unneeded exports from shadow_mmu.c

Hi Ben,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f]

url: https://github.com/intel-lab-lkp/linux/commits/Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
base: 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f
patch link: https://lore.kernel.org/r/20230202182809.1929122-16-bgardon%40google.com
patch subject: [PATCH 15/21] KVM: x86/MMU: Remove unneeded exports from shadow_mmu.c
config: x86_64-randconfig-a002 (https://download.01.org/0day-ci/archive/20230204/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/133035de32b4ef8bb8e3868c78dbdb3b807f04f4
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
git checkout 133035de32b4ef8bb8e3868c78dbdb3b807f04f4
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/shadow_mmu.c:3208:6: warning: no previous prototype for 'slot_rmap_write_protect' [-Wmissing-prototypes]
3208 | bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
| ^~~~~~~~~~~~~~~~~~~~~~~


vim +/slot_rmap_write_protect +3208 arch/x86/kvm/mmu/shadow_mmu.c

306e405e1b7fe70 Ben Gardon 2023-02-02 3207
306e405e1b7fe70 Ben Gardon 2023-02-02 @3208 bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
306e405e1b7fe70 Ben Gardon 2023-02-02 3209 const struct kvm_memory_slot *slot)
306e405e1b7fe70 Ben Gardon 2023-02-02 3210 {
306e405e1b7fe70 Ben Gardon 2023-02-02 3211 return rmap_write_protect(rmap_head, false);
306e405e1b7fe70 Ben Gardon 2023-02-02 3212 }
306e405e1b7fe70 Ben Gardon 2023-02-02 3213

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-02-04 19:11:05

by kernel test robot

Subject: Re: [PATCH 11/21] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU

Hi Ben,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f]

url: https://github.com/intel-lab-lkp/linux/commits/Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
base: 7cb79f433e75b05d1635aefaa851cfcd1cb7dc4f
patch link: https://lore.kernel.org/r/20230202182809.1929122-12-bgardon%40google.com
patch subject: [PATCH 11/21] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20230205/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/c1170de906dfe1ee64da0066e7c28d35e716ed18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ben-Gardon/KVM-x86-mmu-Rename-slot-rmap-walkers-to-add-clarity-and-clean-up-code/20230203-023259
git checkout c1170de906dfe1ee64da0066e7c28d35e716ed18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3148:15: warning: no previous prototype for function 'mmu_shrink_scan' [-Wmissing-prototypes]
unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
^
arch/x86/kvm/mmu/mmu.c:3148:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
^
static
1 warning generated.


vim +/mmu_shrink_scan +3148 arch/x86/kvm/mmu/mmu.c

3147
> 3148 unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
3149 {
3150 struct kvm *kvm;
3151 int nr_to_scan = sc->nr_to_scan;
3152 unsigned long freed = 0;
3153
3154 mutex_lock(&kvm_lock);
3155
3156 list_for_each_entry(kvm, &vm_list, vm_list) {
3157 /*
3158 * Never scan more than sc->nr_to_scan VM instances.
3159 * Will not hit this condition practically since we do not try
3160 * to shrink more than one VM and it is very unlikely to see
3161 * !n_used_mmu_pages so many times.
3162 */
3163 if (!nr_to_scan--)
3164 break;
3165
3166 /*
3167 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
3168 * here. We may skip a VM instance errorneosly, but we do not
3169 * want to shrink a VM that only started to populate its MMU
3170 * anyway.
3171 */
3172 if (!kvm->arch.n_used_mmu_pages &&
3173 !kvm_shadow_mmu_has_zapped_obsolete_pages(kvm))
3174 continue;
3175
3176 freed = kvm_shadow_mmu_shrink_scan(kvm, sc->nr_to_scan);
3177
3178 /*
3179 * unfair on small ones
3180 * per-vm shrinkers cry out
3181 * sadness comes quickly
3182 */
3183 list_move_tail(&kvm->vm_list, &vm_list);
3184 break;
3185 }
3186
3187 mutex_unlock(&kvm_lock);
3188 return freed;
3189 }
3190

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
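
For reference, the -Wmissing-prototypes warnings in the reports above have two usual fixes, sketched below with hypothetical function names and placeholder bodies. Which one applies depends on whether the function is meant to be called from another file at that point in the series.

```c
/*
 * Option 1: the function is only used in this file, so make it static.
 * This is what clang's note suggests.
 */
static unsigned long shrink_scan_local(unsigned long nr_to_scan)
{
	return nr_to_scan / 2;	/* placeholder body */
}

/*
 * Option 2: the function is called from other files, so a prototype
 * must be in scope before the definition. In a real tree the prototype
 * would live in a shared header that this file includes.
 */
unsigned long shrink_scan_shared(unsigned long nr_to_scan);

unsigned long shrink_scan_shared(unsigned long nr_to_scan)
{
	return shrink_scan_local(nr_to_scan) + 1;	/* placeholder body */
}
```

Both forms should build cleanly with -Wmissing-prototypes; dropping either the static keyword or the prototype line brings the warning back.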

2023-03-20 18:49:30

by Sean Christopherson

Subject: Re: [PATCH 09/21] KVM: x86/MMU: Move paging_tmpl.h includes to shadow_mmu.c

First off, I apologize for not giving this feedback in the RFC. I didn't think
too hard about the implications of moving paging_tmpl.h until I actually looked
at the code.

On Thu, Feb 02, 2023, Ben Gardon wrote:
> Move the integration point for paging_tmpl.h to shadow_mmu.c since
> paging_tmpl.h is ostensibly part of the Shadow MMU.

Ostensibly indeed. While a simple majority of paging_tmpl.h is indeed unique to
the shadow MMU, all of the guest walker code needs to exist independent of the
shadow MMU. And that code is significant both in terms of lines of code, and
more importantly in terms of understanding its role in KVM at large.

This is essentially the same mess that eventually led to the cpu_role vs. root_role
cleanup, and I think we should figure out a way to give paging_tmpl.h similar
treatment. E.g. split paging_tmpl.h itself in some way.

Unfortunately, this is a sticking point for me. If the code movement were minor
and/or cleaner in nature (definitely not your fault, simply the reality of the
code base), I might feel differently. But as it stands, there is a lot of churn
to get to an endpoint that has significant flaws.

So while I love the idea of separating the MMU implementations from the common
MMU logic, because the guest walker stuff is a lynchpin of sorts, e.g. splitting
out the guest walker logic could go hand-in-hand with reworking guest_mmu, I don't
want to take this series as is.

Sadly, as much as I'm itching to dive in and do a bit of exploration, I am woefully
short on bandwidth right now, so all I can do is say no. Sorry :-(
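
For readers unfamiliar with why paging_tmpl.h is awkward to assign to one side: it is a template header that is included once per guest PTE format, which is where names such as paging32_gva_to_gpa() come from, and the guest walker is stamped out alongside the shadow-paging code. The single-file sketch below imitates that with a macro instead of repeated #include. The macro, GUEST_PAGE_SHIFT, and the generated helper names are hypothetical, and only two tiny walker helpers are shown.

```c
#include <stdint.h>

#define GUEST_PAGE_SHIFT 12

/*
 * Stamp out per-format helpers. paging_tmpl.h gets the same effect by
 * being included several times with a different PTTYPE each time.
 */
#define DEFINE_GUEST_WALKER(prefix, pte_t, bits_per_level)		\
static unsigned int prefix##_index(uint64_t gva, int level)		\
{									\
	unsigned int shift = GUEST_PAGE_SHIFT +				\
			     (level - 1) * (bits_per_level);		\
	return (gva >> shift) & ((1u << (bits_per_level)) - 1);		\
}									\
static int prefix##_pte_present(pte_t pte)				\
{									\
	return (pte & 1) != 0;	/* bit 0 is the present bit */		\
}

/* 32-bit non-PAE paging: 4-byte entries, 10 index bits per level. */
DEFINE_GUEST_WALKER(paging32, uint32_t, 10)

/* 64-bit paging: 8-byte entries, 9 index bits per level. */
DEFINE_GUEST_WALKER(paging64, uint64_t, 9)
```

Helpers like these only interpret guest page tables and do not touch shadow pages, which is the kind of code that would need to stay reachable from common MMU code if the header were split.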

2023-03-20 19:00:48

by Sean Christopherson

Subject: Re: [PATCH 06/21] KVM: x86/mmu: Get rid of is_cpuid_PSE36()

On Thu, Feb 02, 2023, Ben Gardon wrote:
> is_cpuid_PSE36() always returns 1 and is never overridden, so just get
> rid of the function. This saves having to export it in a future commit
> in order to move the include of paging_tmpl.h out of mmu.c.

Probably won't matter as I suspect this series is going to end up a burner way
in the back, but FWIW I'd prefer to preserve is_cpuid_PSE36() in some capacity.
I 100% agree the helper is silly, but the mere existence of the flag is so esoteric
these days that I like having obvious/obnoxious code to call it out.

2023-03-20 19:18:12

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86/MMU: Formalize the Shadow MMU

On Thu, Feb 02, 2023, Ben Gardon wrote:
> Patches 4-6 prepare for the refactor by adding files and exporting
> functions.

For future reference, please do not conflate "export" with "make globally visible"
(here and in many of the changelogs). The distinction matters, especially for
modules, as an exported symbol is quite different than a globally visible symbol.

We (sadly) lose sight of this in KVM far too often due to kvm.ko exporting an absurd
number of symbols for kvm-{amd,intel}.ko, and as a result we've ended up with
non-KVM code using helpers that really should be KVM-only. This is something I
hope to remedy in the near-ish future, and so I want us to start getting the
terminology right.
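
To make the distinction concrete, here is a userspace analogy with three levels of visibility. It is only a sketch: the export table below is a hypothetical stand-in for the symbol table that EXPORT_SYMBOL() populates, resolve_for_module() loosely imitates what a module loader does, and all helper names are made up.

```c
#include <stddef.h>
#include <string.h>

typedef int (*helper_fn)(void);

/* static: visible only inside this .c file. */
static int file_local_helper(void)
{
	return 1;
}

/* Non-static: any file linked into the same binary can call it. */
int globally_visible_helper(void);
int globally_visible_helper(void)
{
	return file_local_helper() + 1;
}

/* Non-static and also listed in the export table below. */
int exported_helper(void);
int exported_helper(void)
{
	return 3;
}

struct export_entry {
	const char *name;
	helper_fn fn;
};

/* Hypothetical stand-in for the table that EXPORT_SYMBOL() fills in. */
static const struct export_entry export_table[] = {
	{ "exported_helper", exported_helper },
};

/* Separately loaded code can only resolve names through the table. */
static helper_fn resolve_for_module(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(export_table) / sizeof(export_table[0]); i++)
		if (strcmp(export_table[i].name, name) == 0)
			return export_table[i].fn;
	return NULL;
}
```

In these terms, dropping static from a function and adding a prototype to a header makes it globally visible; it becomes exported only when it is additionally placed in the table.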

2023-03-21 18:43:28

by Ben Gardon

Subject: Re: [PATCH 09/21] KVM: x86/MMU: Move paging_tmpl.h includes to shadow_mmu.c

On Mon, Mar 20, 2023 at 11:41 AM Sean Christopherson <[email protected]> wrote:
>
> First off, I apologize for not giving this feedback in the RFC. I didn't think
> too hard about the implications of moving paging_tmpl.h until I actually looked
> at the code.
>
> On Thu, Feb 02, 2023, Ben Gardon wrote:
> > Move the integration point for paging_tmpl.h to shadow_mmu.c since
> > paging_tmpl.h is ostensibly part of the Shadow MMU.
>
> Ostensibly indeed. While a simple majority of paging_tmpl.h is indeed unique to
> the shadow MMU, all of the guest walker code needs to exist independent of the
> shadow MMU. And that code is significant both in terms of lines of code, and
> more importantly in terms of understanding its role in KVM at large.
>
> This is essentially the same mess that eventually led to the cpu_role vs. root_role
> cleanup, and I think we should figure out a way to give paging_tmpl.h similar
> treatment. E.g. split paging_tmpl.h itself in some way.
>
> Unfortunately, this is a sticking point for me. If the code movement were minor
> and/or cleaner in nature (definitely not your fault, simply the reality of the
> code base), I might feel differently. But as it stands, there is a lot of churn
> to get to an endpoint that has significant flaws.
>
> So while I love the idea of separating the MMU implementations from the common
> MMU logic, because the guest walker stuff is a lynchpin of sorts, e.g. splitting
> out the guest walker logic could go hand-in-hand with reworking guest_mmu, I don't
> want to take this series as is.
>
> Sadly, as much as I'm itching to dive in and do a bit of exploration, I am woefully
> short on bandwidth right now, so all I can do is say no. Sorry :-(

Fair enough, thanks for taking a look. I'm not going to have bandwidth
in the foreseeable future to work on this any more either,
unfortunately. I'd love it if someone picked up this series and did
the paging_tmpl.h split, but that's going to be a lot of work, so in
the meantime, I don't mind just letting this die.

2023-03-23 23:05:10

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86/MMU: Formalize the Shadow MMU

On Thu, 02 Feb 2023 18:27:48 +0000, Ben Gardon wrote:
> This series makes the Shadow MMU a distinct part of the KVM x86 MMU,
> implemented in separate files, with a defined interface to common code.
>
> When the TDP (Two Dimensional Paging) MMU was added to x86 KVM, it came in
> a separate file with a (reasonably) clear interface. This lead to many
> points in the KVM MMU like this:
>
> [...]

Applied the first three to kvm-x86 mmu, which just makes me look like a jerk
since they're all my patches. :-(

[01/21] KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up code
https://github.com/kvm-x86/linux/commit/727ae3770132
[02/21] KVM: x86/mmu: Replace comment with an actual lockdep assertion on mmu_lock
https://github.com/kvm-x86/linux/commit/eddd9e8302de
[03/21] KVM: x86/mmu: Clean up mmu.c functions that put return type on separate line
https://github.com/kvm-x86/linux/commit/f3d90f901d18

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes