2022-06-22 19:28:35

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 00/23] KVM: Extend Eager Page Splitting to the shadow MMU

For the description of the "why" of this patch, I'll just direct you to
David's excellent cover letter from v6, which can be found at
https://lore.kernel.org/r/[email protected].

This version mostly does the following:

- apply the feedback from Sean and other reviewers, which is mostly
aesthetic

- replace the refactoring of drop_large_spte()/__drop_large_spte()
with my own version. The insight there is that drop_large_spte()
is always followed by {,__}link_shadow_page(), so the call is
moved there

- split the TLB flush optimization into a separate patch, mostly
to perform the previous refactoring independent of the optional
TLB flush

- rename a few functions from *nested_mmu* to *shadow_mmu*

David Matlack (21):
KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
KVM: x86/mmu: Use a bool for direct
KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()
KVM: x86/mmu: Derive shadow MMU page role from parent
KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
KVM: x86/mmu: Consolidate shadow page allocation and initialization
KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
KVM: x86/mmu: Pass memory caches to allocate SPs separately
KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
KVM: x86/mmu: Pass kvm pointer separately from vcpu to
kvm_mmu_find_shadow_page()
KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
KVM: x86/mmu: Pass const memslot to rmap_add()
KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
KVM: x86/mmu: Update page stats in __rmap_add()
KVM: x86/mmu: Cache the access bits of shadowed translations
KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible
levels
KVM: Allow for different capacities in kvm_mmu_memory_cache structs
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

Paolo Bonzini (2):
KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
KVM: x86/mmu: Avoid unnecessary flush on eager page split

.../admin-guide/kernel-parameters.txt | 3 +-
arch/arm64/kvm/mmu.c | 2 +-
arch/riscv/kvm/mmu.c | 5 +-
arch/x86/include/asm/kvm_host.h | 24 +-
arch/x86/kvm/mmu/mmu.c | 719 ++++++++++++++----
arch/x86/kvm/mmu/mmu_internal.h | 17 +-
arch/x86/kvm/mmu/paging_tmpl.h | 43 +-
arch/x86/kvm/mmu/spte.c | 15 +-
arch/x86/kvm/mmu/spte.h | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
include/linux/kvm_host.h | 1 +
include/linux/kvm_types.h | 6 +-
virt/kvm/kvm_main.c | 33 +-
13 files changed, 666 insertions(+), 208 deletions(-)

--
2.31.1


2022-06-22 19:28:36

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs

From: David Matlack <[email protected]>

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsynced or have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Lai Jiangshan <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 27b2a5603496..c0afb4f1c8ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2000,7 +2000,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
int direct,
unsigned int access)
{
- bool direct_mmu = vcpu->arch.mmu->root_role.direct;
union kvm_mmu_page_role role;
struct hlist_head *sp_list;
unsigned quadrant;
@@ -2060,7 +2059,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
continue;
}

- if (direct_mmu)
+ /* unsync and write-flooding only apply to indirect SPs. */
+ if (sp->role.direct)
goto trace_get_page;

if (sp->unsync) {
--
2.31.1


2022-06-22 19:29:50

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 15/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu

From: David Matlack <[email protected]>

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.

Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 47 ++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 45a4e85c0b2c..a8cdbe2958d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -673,11 +673,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
-{
- return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
-}
-
static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
{
kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -832,7 +827,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
/*
* Returns the number of pointers in the rmap chain, not counting the new one.
*/
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
struct kvm_rmap_head *rmap_head)
{
struct pte_list_desc *desc;
@@ -843,7 +838,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
rmap_head->val = (unsigned long)spte;
} else if (!(rmap_head->val & 1)) {
rmap_printk("%p %llx 1->many\n", spte, *spte);
- desc = mmu_alloc_pte_list_desc(vcpu);
+ desc = kvm_mmu_memory_cache_alloc(cache);
desc->sptes[0] = (u64 *)rmap_head->val;
desc->sptes[1] = spte;
desc->spte_count = 2;
@@ -855,7 +850,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
while (desc->spte_count == PTE_LIST_EXT) {
count += PTE_LIST_EXT;
if (!desc->more) {
- desc->more = mmu_alloc_pte_list_desc(vcpu);
+ desc->more = kvm_mmu_memory_cache_alloc(cache);
desc = desc->more;
desc->spte_count = 0;
break;
@@ -1556,8 +1551,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

#define RMAP_RECYCLE_THRESHOLD 1000

-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
- u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache,
+ const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn)
{
struct kvm_mmu_page *sp;
struct kvm_rmap_head *rmap_head;
@@ -1566,15 +1563,23 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
sp = sptep_to_sp(spte);
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
- rmap_count = pte_list_add(vcpu, spte, rmap_head);
+ rmap_count = pte_list_add(cache, spte, rmap_head);

if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
- kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+ kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
kvm_flush_remote_tlbs_with_address(
- vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+ kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
}
}

+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn)
+{
+ struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;
+
+ __rmap_add(vcpu->kvm, cache, slot, spte, gfn);
+}
+
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -1645,13 +1650,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
return hash_64(gfn, KVM_MMU_HASH_SHIFT);
}

-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
struct kvm_mmu_page *sp, u64 *parent_pte)
{
if (!parent_pte)
return;

- pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+ pte_list_add(cache, parent_pte, &sp->parent_ptes);
}

static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2247,8 +2252,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
__shadow_walk_next(iterator, *iterator->sptep);
}

-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
- struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+ struct kvm_mmu_page *sp)
{
u64 spte;

@@ -2258,12 +2263,18 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,

mmu_spte_set(sptep, spte);

- mmu_page_add_parent_pte(vcpu, sp, sptep);
+ mmu_page_add_parent_pte(cache, sp, sptep);

if (sp->unsync_children || sp->unsync)
mark_unsync(sptep);
}

+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+ struct kvm_mmu_page *sp)
+{
+ __link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned direct_access)
{
--
2.31.1


2022-06-22 19:29:56

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 09/23] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()

From: David Matlack <[email protected]>

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids a extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd45364bf465..2602c3642f23 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -766,6 +766,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
KVM_PAGE_TRACK_WRITE);

kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+ if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+ kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
}

void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2072,11 +2075,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
- if (sp_has_gptes(sp)) {
+ if (sp_has_gptes(sp))
account_shadowed(vcpu->kvm, sp);
- if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
- kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
- }

return sp;
}
--
2.31.1


2022-06-22 19:30:07

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 11/23] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()

From: David Matlack <[email protected]>

The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fab417e7bf6c..c5a88e8d1b53 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2056,7 +2056,7 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *gfn_array_cache;
};

-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct shadow_page_caches *caches,
gfn_t gfn,
struct hlist_head *sp_list,
@@ -2076,15 +2076,15 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
* depends on valid pages being added to the head of the list. See
* comments in kvm_zap_obsolete_pages().
*/
- sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
- list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+ sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+ list_add(&sp->link, &kvm->arch.active_mmu_pages);
+ kvm_mod_used_mmu_pages(kvm, +1);

sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
if (sp_has_gptes(sp))
- account_shadowed(vcpu->kvm, sp);
+ account_shadowed(kvm, sp);

return sp;
}
@@ -2103,7 +2103,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
- sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
+ sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
}

trace_kvm_mmu_get_page(sp, created);
--
2.31.1


2022-06-22 19:30:31

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 10/23] KVM: x86/mmu: Pass memory caches to allocate SPs separately

From: David Matlack <[email protected]>

Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
will allocate the various pieces of memory for shadow pages as a
parameter, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 36 +++++++++++++++++++++++++++++-------
1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2602c3642f23..fab417e7bf6c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2049,17 +2049,25 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
return sp;
}

+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+ struct kvm_mmu_memory_cache *page_header_cache;
+ struct kvm_mmu_memory_cache *shadow_page_cache;
+ struct kvm_mmu_memory_cache *gfn_array_cache;
+};
+
static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+ struct shadow_page_caches *caches,
gfn_t gfn,
struct hlist_head *sp_list,
union kvm_mmu_page_role role)
{
struct kvm_mmu_page *sp;

- sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
if (!role.direct)
- sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+ sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);

set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

@@ -2081,9 +2089,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
return sp;
}

-static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- union kvm_mmu_page_role role)
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ struct shadow_page_caches *caches,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
{
struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
@@ -2094,13 +2103,26 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
- sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+ sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
}

trace_kvm_mmu_get_page(sp, created);
return sp;
}

+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct shadow_page_caches caches = {
+ .page_header_cache = &vcpu->arch.mmu_page_header_cache,
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+ .gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+ };
+
+ return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+}
+
static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, unsigned int access)
{
struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
--
2.31.1


2022-06-22 19:30:57

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 20/23] KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()

Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE. This is done by drop_large_spte().

However, the act that requires dropping the large SPTE is the
installation of the sp that is returned by kvm_mmu_get_child_sp(),
which happens in __link_shadow_page(). Move the call there
instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed. This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 43 +++++++++++++++++-----------------
arch/x86/kvm/mmu/paging_tmpl.h | 31 +++++++++++-------------
2 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..bf1ae5ebf41b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
}

-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
{
- if (is_large_pte(*sptep)) {
- WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
- drop_spte(kvm, sptep);
- return true;
- }
-
- return false;
-}
+ struct kvm_mmu_page *sp;

-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
- if (__drop_large_spte(vcpu->kvm, sptep)) {
- struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ sp = sptep_to_sp(sptep);
+ WARN_ON(sp->role.level == PG_LEVEL_4K);

- kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+ drop_spte(kvm, sptep);
+ kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
- }
}

/*
@@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
{
union kvm_mmu_page_role role;

+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+ return ERR_PTR(-EEXIST);
+
role = kvm_mmu_child_role(sptep, direct, access);
return kvm_mmu_get_shadow_page(vcpu, gfn, role);
}
@@ -2288,13 +2281,21 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
__shadow_walk_next(iterator, *iterator->sptep);
}

-static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+static void __link_shadow_page(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache, u64 *sptep,
struct kvm_mmu_page *sp)
{
u64 spte;

BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);

+ /*
+ * If an SPTE is present already, it must be a leaf and therefore
+ * a large one. Drop it and flush the TLB before installing sp.
+ */
+ if (is_shadow_present_pte(*sptep))
+ drop_large_spte(kvm, sptep);
+
spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));

mmu_spte_set(sptep, spte);
@@ -2308,7 +2309,7 @@ static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
struct kvm_mmu_page *sp)
{
- __link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+ __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
}

static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
@@ -3080,11 +3081,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (it.level == fault->goal_level)
break;

- drop_large_spte(vcpu, it.sptep);
- if (is_shadow_present_pte(*it.sptep))
- continue;
-
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+ if (sp == ERR_PTR(-EEXIST))
+ continue;

link_shadow_page(vcpu, it.sptep, sp);
if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
gfn_t table_gfn;

clear_sp_write_flooding_count(it.sptep);
- drop_large_spte(vcpu, it.sptep);

- sp = NULL;
- if (!is_shadow_present_pte(*it.sptep)) {
- table_gfn = gw->table_gfn[it.level - 2];
- access = gw->pt_access[it.level - 2];
- sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
- false, access);
+ table_gfn = gw->table_gfn[it.level - 2];
+ access = gw->pt_access[it.level - 2];
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+ false, access);

+ if (sp != ERR_PTR(-EEXIST)) {
/*
* We must synchronize the pagetable before linking it
* because the guest doesn't need to flush tlb when
@@ -685,7 +683,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
goto out_gpte_changed;

- if (sp)
+ if (sp != ERR_PTR(-EEXIST))
link_shadow_page(vcpu, it.sptep, sp);
}

@@ -709,16 +707,15 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,

validate_direct_spte(vcpu, it.sptep, direct_access);

- drop_large_spte(vcpu, it.sptep);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+ true, direct_access);
+ if (sp == ERR_PTR(-EEXIST))
+ continue;

- if (!is_shadow_present_pte(*it.sptep)) {
- sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
- true, direct_access);
- link_shadow_page(vcpu, it.sptep, sp);
- if (fault->huge_page_disallowed &&
- fault->req_level >= it.level)
- account_huge_nx_page(vcpu->kvm, sp);
- }
+ link_shadow_page(vcpu, it.sptep, sp);
+ if (fault->huge_page_disallowed &&
+ fault->req_level >= it.level)
+ account_huge_nx_page(vcpu->kvm, sp);
}

if (WARN_ON_ONCE(it.level != fault->goal_level))
--
2.31.1


2022-06-22 19:31:11

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 18/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU

From: David Matlack <[email protected]>

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.

To fix this, pass in the role of the child shadow page where the huge
page will be split and derive the execution permission from that. This
is correct because huge pages are always split with direct shadow page
and thus the shadow page role contains the correct access permissions.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 15 +++++++--------
arch/x86/kvm/mmu/spte.h | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index db294c1beea2..fb1f17504138 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -246,11 +246,10 @@ static u64 make_spte_executable(u64 spte)
* This is used during huge page splitting to build the SPTEs that make up the
* new page table.
*/
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, union kvm_mmu_page_role role,
int index)
{
u64 child_spte;
- int child_level;

if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
return 0;
@@ -259,23 +258,23 @@ u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
return 0;

child_spte = huge_spte;
- child_level = huge_level - 1;

/*
* The child_spte already has the base address of the huge page being
* split. So we just have to OR in the offset to the page at the next
* lower level for the given index.
*/
- child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+ child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT;

- if (child_level == PG_LEVEL_4K) {
+ if (role.level == PG_LEVEL_4K) {
child_spte &= ~PT_PAGE_SIZE_MASK;

/*
- * When splitting to a 4K page, mark the page executable as the
- * NX hugepage mitigation no longer applies.
+ * When splitting to a 4K page where execution is allowed, mark
+ * the page executable as the NX hugepage mitigation no longer
+ * applies.
*/
- if (is_nx_huge_page_enabled(kvm))
+ if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm))
child_spte = make_spte_executable(child_spte);
}

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 256f90587e8d..b5c855f5514f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -421,8 +421,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, int huge_level,
- int index);
+u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
+ union kvm_mmu_page_role role, int index);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 522e2532343b..f3a430d64975 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1478,7 +1478,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* not been linked in yet and thus is not reachable from any other CPU.
*/
for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
- sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, level, i);
+ sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i);

/*
* Replace the huge spte with a pointer to the populated lower level
--
2.31.1


2022-06-22 19:31:53

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 21/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs

From: David Matlack <[email protected]>

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare an cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the
objects pointer array to be dynamically allocated the first time the
cache is topped-up.

While here, opportunistically clean up the stack-allocated
kvm_mmu_memory_cache structs in riscv and arm64 to use designated
initializers.

No functional change intended.

Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
arch/riscv/kvm/mmu.c | 5 +----
include/linux/kvm_host.h | 1 +
include/linux/kvm_types.h | 6 +++++-
virt/kvm/kvm_main.c | 33 ++++++++++++++++++++++++++++++---
5 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5651a05b6a8..87f1cd0df36e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -786,7 +786,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
{
phys_addr_t addr;
int ret = 0;
- struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+ struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_R |
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1c00695ebee7..081f8d2b9cf3 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -350,10 +350,7 @@ static int gstage_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
int ret = 0;
unsigned long pfn;
phys_addr_t addr, end;
- struct kvm_mmu_memory_cache pcache;
-
- memset(&pcache, 0, sizeof(pcache));
- pcache.gfp_zero = __GFP_ZERO;
+ struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };

end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a2bbdf3ab086..3554e48406e4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1356,6 +1356,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);

#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index f328a01db4fe..4d933518060f 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -85,12 +85,16 @@ struct gfn_to_pfn_cache {
* MMU flows is problematic, as is triggering reclaim, I/O, etc... while
* holding MMU locks. Note, these caches act more like prefetch buffers than
* classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The @capacity field and @objects array are lazily initialized when the cache
+ * is topped up (__kvm_mmu_topup_memory_cache()).
*/
struct kvm_mmu_memory_cache {
int nobjs;
gfp_t gfp_zero;
struct kmem_cache *kmem_cache;
- void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+ int capacity;
+ void **objects;
};
#endif

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b8ae83e09d7..45188d11812c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -396,14 +396,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
return (void *)__get_free_page(gfp_flags);
}

-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT;
void *obj;

if (mc->nobjs >= min)
return 0;
- while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
- obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+
+ if (unlikely(!mc->objects)) {
+ if (WARN_ON_ONCE(!capacity))
+ return -EIO;
+
+ mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
+ if (!mc->objects)
+ return -ENOMEM;
+
+ mc->capacity = capacity;
+ }
+
+ /* It is illegal to request a different capacity across topups. */
+ if (WARN_ON_ONCE(mc->capacity != capacity))
+ return -EIO;
+
+ while (mc->nobjs < mc->capacity) {
+ obj = mmu_memory_cache_alloc_obj(mc, gfp);
if (!obj)
return mc->nobjs >= min ? 0 : -ENOMEM;
mc->objects[mc->nobjs++] = obj;
@@ -411,6 +428,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
return 0;
}

+int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+{
+ return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
+}
+
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
{
return mc->nobjs;
@@ -424,6 +446,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
else
free_page((unsigned long)mc->objects[--mc->nobjs]);
}
+
+ kvfree(mc->objects);
+
+ mc->objects = NULL;
+ mc->capacity = 0;
}

void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
--
2.31.1


2022-06-22 19:32:18

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 19/23] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels

From: David Matlack <[email protected]>

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
is fine for now since KVM never creates intermediate huge pages during
dirty logging. In other words, KVM always replaces 1GiB pages directly
with 4KiB pages, so there is no reason to look for collapsible 2MiB
pages.

However, this will stop being true once the shadow MMU participates in
eager page splitting. During eager page splitting, each 1GiB is first
split into 2MiB pages and then those are split into 4KiB pages. The
intermediate 2MiB pages may be left behind if an error condition causes
eager page splitting to bail early.

No functional change intended.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13a059ad5dc7..36bc49f08d60 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6154,18 +6154,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
return need_tlb_flush;
}

+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ /*
+ * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+ * pages that are already mapped at the maximum possible level.
+ */
+ if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+ true))
+ kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
{
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- /*
- * Zap only 4k SPTEs since the legacy MMU only supports dirty
- * logging at a 4k granularity and never creates collapsible
- * 2m SPTEs during dirty logging.
- */
- if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
- kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+ kvm_rmap_zap_collapsible_sptes(kvm, slot);
write_unlock(&kvm->mmu_lock);
}

--
2.31.1


2022-06-22 19:39:27

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

From: David Matlack <[email protected]>

Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
allowed to be allocated. So, as a policy, Eager Page Splitting
refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
pages available.

(2) Splitting a huge page may end up re-using an existing lower level
shadow page tables. This is unlike the TDP MMU which always allocates
new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush. As of
this commit, a flush is always done always after dropping the huge page
and before installing the lower level page table.

This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits. However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).

Suggested-by: Peter Feiner <[email protected]>
[ This commit is based off of the original implementation of Eager Page
Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 3 +-
arch/x86/include/asm/kvm_host.h | 22 ++
arch/x86/kvm/mmu/mmu.c | 261 +++++++++++++++++-
3 files changed, 277 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 97c16aa2f53f..329f0f274e2b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2418,8 +2418,7 @@
the KVM_CLEAR_DIRTY ioctl, and only for the pages being
cleared.

- Eager page splitting currently only supports splitting
- huge pages mapped by the TDP MMU.
+ Eager page splitting is only supported when kvm.tdp_mmu=Y.

Default is Y (on).

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64efe8c90c31..665667d61caf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1338,6 +1338,28 @@ struct kvm_arch {
u32 max_vcpu_ids;

bool disable_nx_huge_pages;
+
+ /*
+ * Memory caches used to allocate shadow pages when performing eager
+ * page splitting. No need for a shadowed_info_cache since eager page
+ * splitting only allocates direct shadow pages.
+ *
+ * Protected by kvm->slots_lock.
+ */
+ struct kvm_mmu_memory_cache split_shadow_page_cache;
+ struct kvm_mmu_memory_cache split_page_header_cache;
+
+ /*
+ * Memory cache used to allocate pte_list_desc structs while splitting
+ * huge pages. In the worst case, to split one huge page, 512
+ * pte_list_desc structs are needed to add each lower level leaf sptep
+ * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+ * page table.
+ *
+ * Protected by kvm->slots_lock.
+ */
+#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
+ struct kvm_mmu_memory_cache split_desc_cache;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bf1ae5ebf41b..22681931921f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5942,9 +5942,25 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
+
+ kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+ kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+ kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
return 0;
}

+static void mmu_free_vm_memory_caches(struct kvm *kvm)
+{
+ kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
void kvm_mmu_uninit_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
@@ -5952,6 +5968,8 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
kvm_page_track_unregister_notifier(kvm, node);

kvm_mmu_uninit_tdp_mmu(kvm);
+
+ mmu_free_vm_memory_caches(kvm);
}

static bool __kvm_zap_rmaps(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
@@ -6073,15 +6091,237 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
}

+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+ return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+ return true;
+
+ /*
+ * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
+ * to split a single huge page. Calculating how many are actually needed
+ * is possible but not worth the complexity.
+ */
+ return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
+ need_topup(&kvm->arch.split_page_header_cache, 1) ||
+ need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+ int r;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ /*
+ * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
+ * when splitting a page, but setting capacity == min would cause
+ * KVM to drop mmu_lock even if just one object was consumed from the
+ * cache. So make capacity larger than min and handle two huge pages
+ * without having to drop the lock.
+ */
+ r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
+ 2 * SPLIT_DESC_CACHE_MIN_NR_OBJECTS,
+ SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
+ if (r)
+ return r;
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+ if (r)
+ return r;
+
+ return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ struct shadow_page_caches caches = {};
+ union kvm_mmu_page_role role;
+ unsigned int access;
+ gfn_t gfn;
+
+ gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+ access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+ /*
+ * Note, huge page splitting always uses direct shadow pages, regardless
+ * of whether the huge page itself is mapped by a direct or indirect
+ * shadow page, since the huge page region itself is being directly
+ * mapped with smaller pages.
+ */
+ role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+ /* Direct SPs do not require a shadowed_info_cache. */
+ caches.page_header_cache = &kvm->arch.split_page_header_cache;
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+ /* Safe to pass NULL for vCPU since requesting a direct SP. */
+ return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void shadow_mmu_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+
+{
+ struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+ u64 huge_spte = READ_ONCE(*huge_sptep);
+ struct kvm_mmu_page *sp;
+ u64 *sptep, spte;
+ gfn_t gfn;
+ int index;
+
+ sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+
+ for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
+ sptep = &sp->spt[index];
+ gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ /*
+ * The SP may already have populated SPTEs, e.g. if this huge
+ * page is aliased by multiple sptes with the same access
+ * permissions. These entries are guaranteed to map the same
+ * gfn-to-pfn translation since the SP is direct, so no need to
+ * modify them.
+ *
+ * If a given SPTE points to a lower level page table, installing
+ * such SPTEs would effectively unmap a potion of the huge page.
+ * This is not an issue because __link_shadow_page() flushes the TLB
+ * when the passed sp replaces a large SPTE.
+ */
+ if (is_shadow_present_pte(*sptep))
+ continue;
+
+ spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
+ mmu_spte_set(sptep, spte);
+ __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+ }
+
+ __link_shadow_page(kvm, cache, huge_sptep, sp);
+}
+
+static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ int level, r = 0;
+ gfn_t gfn;
+ u64 spte;
+
+ /* Grab information for the tracepoint before dropping the MMU lock. */
+ gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+ level = huge_sp->role.level;
+ spte = *huge_sptep;
+
+ if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+ r = -ENOSPC;
+ goto out;
+ }
+
+ if (need_topup_split_caches_or_resched(kvm)) {
+ write_unlock(&kvm->mmu_lock);
+ cond_resched();
+ /*
+ * If the topup succeeds, return -EAGAIN to indicate that the
+ * rmap iterator should be restarted because the MMU lock was
+ * dropped.
+ */
+ r = topup_split_caches(kvm) ?: -EAGAIN;
+ write_lock(&kvm->mmu_lock);
+ goto out;
+ }
+
+ shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+ trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+ return r;
+}
+
+static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ struct rmap_iterator iter;
+ struct kvm_mmu_page *sp;
+ u64 *huge_sptep;
+ int r;
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+ sp = sptep_to_sp(huge_sptep);
+
+ /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
+ if (WARN_ON_ONCE(!sp->role.guest_mode))
+ continue;
+
+ /* The rmaps should never contain non-leaf SPTEs. */
+ if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+ continue;
+
+ /* SPs with level >PG_LEVEL_4K should never by unsync. */
+ if (WARN_ON_ONCE(sp->unsync))
+ continue;
+
+ /* Don't bother splitting huge pages on invalid SPs. */
+ if (sp->role.invalid)
+ continue;
+
+ r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+ /*
+ * The split succeeded or needs to be retried because the MMU
+ * lock was dropped. Either way, restart the iterator to get it
+ * back into a consistent state.
+ */
+ if (!r || r == -EAGAIN)
+ goto restart;
+
+ /* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+ break;
+ }
+
+ return false;
+}
+
+static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level)
+{
+ int level;
+
+ /*
+ * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+ * down to the target level. This ensures pages are recursively split
+ * all the way to the target level. There's no need to split pages
+ * already at the target level.
+ */
+ for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+ slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages,
+ level, level, start, end - 1, true, false);
+ }
+}
+
/* Must be called with the mmu_lock held in write-mode. */
void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
u64 start, u64 end,
int target_level)
{
- if (is_tdp_mmu_enabled(kvm))
- kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
- target_level, false);
+ if (!is_tdp_mmu_enabled(kvm))
+ return;
+
+ if (kvm_memslots_have_rmaps(kvm))
+ kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
+ kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);

/*
* A TLB flush is unnecessary at this point for the same resons as in
@@ -6096,12 +6336,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
u64 start = memslot->base_gfn;
u64 end = start + memslot->npages;

- if (is_tdp_mmu_enabled(kvm)) {
- read_lock(&kvm->mmu_lock);
- kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
- read_unlock(&kvm->mmu_lock);
+ if (!is_tdp_mmu_enabled(kvm))
+ return;
+
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
+ kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+ write_unlock(&kvm->mmu_lock);
}

+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
+ read_unlock(&kvm->mmu_lock);
+
/*
* No TLB flush is necessary here. KVM will flush TLBs after
* write-protecting and/or clearing dirty on the newly split SPTEs to
--
2.31.1


2022-06-22 19:39:45

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 05/23] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes

From: David Matlack <[email protected]>

The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fd1b479bf7fc..f4e7978a6c6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3389,9 +3389,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
struct kvm_mmu_page *sp;

role.level = level;
+ role.quadrant = quadrant;

- if (role.has_4_byte_gpte)
- role.quadrant = quadrant;
+ WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
+ WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);

sp = kvm_mmu_get_page(vcpu, gfn, role);
++sp->root_count;
@@ -3427,7 +3428,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

- root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+ root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_value;
@@ -3512,9 +3513,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 pdptrs[4], pm_mask;
gfn_t root_gfn, root_pgd;
+ int quadrant, i, r;
hpa_t root;
- unsigned i;
- int r;

root_pgd = mmu->get_guest_pgd(vcpu);
root_gfn = root_pgd >> PAGE_SHIFT;
@@ -3597,7 +3597,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root_gfn = pdptrs[i] >> PAGE_SHIFT;
}

- root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
+ /*
+ * If shadowing 32-bit non-PAE page tables, each PAE page
+ * directory maps one quarter of the guest's non-PAE page
+ * directory. Othwerise each PAE page direct shadows one guest
+ * PAE page directory so that quadrant should be 0.
+ */
+ quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
+
+ root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}

--
2.31.1


2022-06-22 19:39:52

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 07/23] KVM: x86/mmu: Consolidate shadow page allocation and initialization

From: David Matlack <[email protected]>

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++----------------------
1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a59fe860da29..8b84cdd8c6cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1664,27 +1664,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
}

-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
-{
- struct kvm_mmu_page *sp;
-
- sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
- if (!direct)
- sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
- set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
- /*
- * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
- * depends on valid pages being added to the head of the list. See
- * comments in kvm_zap_obsolete_pages().
- */
- sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
- list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- kvm_mod_used_mmu_pages(vcpu->kvm, +1);
- return sp;
-}
-
static void mark_unsync(u64 *spte);
static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
{
@@ -2072,7 +2051,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
struct hlist_head *sp_list,
union kvm_mmu_page_role role)
{
- struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ if (!role.direct)
+ sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
+ set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ /*
+ * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+ * depends on valid pages being added to the head of the list. See
+ * comments in kvm_zap_obsolete_pages().
+ */
+ sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+ list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+ kvm_mod_used_mmu_pages(vcpu->kvm, +1);

sp->gfn = gfn;
sp->role = role;
--
2.31.1


2022-06-22 19:40:43

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 02/23] KVM: x86/mmu: Use a bool for direct

From: David Matlack <[email protected]>

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Reviewed-by: Lai Jiangshan <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c0afb4f1c8ae..844b58ddb3bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1664,7 +1664,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
}

-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
{
struct kvm_mmu_page *sp;

@@ -1997,7 +1997,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
gva_t gaddr,
unsigned level,
- int direct,
+ bool direct,
unsigned int access)
{
union kvm_mmu_page_role role;
--
2.31.1


2022-06-22 19:47:45

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 12/23] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()

From: David Matlack <[email protected]>

Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c5a88e8d1b53..88b3f3c2c8b1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,7 +1975,8 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}

-static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
gfn_t gfn,
struct hlist_head *sp_list,
union kvm_mmu_page_role role)
@@ -1985,7 +1986,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
int collisions = 0;
LIST_HEAD(invalid_list);

- for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+ for_each_valid_sp(kvm, sp, sp_list) {
if (sp->gfn != gfn) {
collisions++;
continue;
@@ -2002,7 +2003,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
* upper-level page will be write-protected.
*/
if (role.level > PG_LEVEL_4K && sp->unsync)
- kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+ kvm_mmu_prepare_zap_page(kvm, sp,
&invalid_list);
continue;
}
@@ -2030,7 +2031,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,

WARN_ON(!list_empty(&invalid_list));
if (ret > 0)
- kvm_flush_remote_tlbs(vcpu->kvm);
+ kvm_flush_remote_tlbs(kvm);
}

__clear_sp_write_flooding_count(sp);
@@ -2039,13 +2040,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
}

sp = NULL;
- ++vcpu->kvm->stat.mmu_cache_miss;
+ ++kvm->stat.mmu_cache_miss;

out:
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);

- if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
- vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+ if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+ kvm->stat.max_mmu_page_hash_collisions = collisions;
return sp;
}

@@ -2089,7 +2090,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
return sp;
}

-static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
struct shadow_page_caches *caches,
gfn_t gfn,
union kvm_mmu_page_role role)
@@ -2098,12 +2100,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp;
bool created = false;

- sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+ sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];

- sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+ sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
if (!sp) {
created = true;
- sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
+ sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
}

trace_kvm_mmu_get_page(sp, created);
@@ -2120,7 +2122,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
};

- return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+ return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
}

static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, unsigned int access)
--
2.31.1


2022-06-22 19:47:45

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 14/23] KVM: x86/mmu: Pass const memslot to rmap_add()

From: David Matlack <[email protected]>

Constify rmap_add()'s @slot parameter; it is simply passed on to
gfn_to_rmap(), which takes a const memslot.

No functional change intended.

Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a7748c5a2385..45a4e85c0b2c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1556,7 +1556,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

#define RMAP_RECYCLE_THRESHOLD 1000

-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
u64 *spte, gfn_t gfn)
{
struct kvm_mmu_page *sp;
--
2.31.1


2022-06-22 19:47:45

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 03/23] KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()

From: David Matlack <[email protected]>

The "direct" argument is vcpu->arch.mmu->root_role.direct,
because unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU. So just use that.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 844b58ddb3bb..2e30398fe59f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3369,8 +3369,9 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
}

static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
- u8 level, bool direct)
+ u8 level)
{
+ bool direct = vcpu->arch.mmu->root_role.direct;
struct kvm_mmu_page *sp;

sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
@@ -3396,7 +3397,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
mmu->root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
+ root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
mmu->root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
if (WARN_ON_ONCE(!mmu->pae_root)) {
@@ -3408,7 +3409,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
- i << 30, PT32_ROOT_LEVEL, true);
+ i << 30, PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_value;
}
@@ -3532,7 +3533,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
*/
if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->root_role.level, false);
+ mmu->root_role.level);
mmu->root.hpa = root;
goto set_root_pgd;
}
@@ -3578,7 +3579,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}

root = mmu_alloc_root(vcpu, root_gfn, i << 30,
- PT32_ROOT_LEVEL, false);
+ PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}

--
2.31.1


2022-06-22 19:47:57

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 06/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions

From: David Matlack <[email protected]>

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
Walks the page hash checking for any existing mmu pages that match the
given gfn and role.

kvm_mmu_alloc_shadow_page()
Allocates and initializes an entirely new kvm_mmu_page. This currently
requries a vcpu pointer for allocation and looking up the memslot but
that will be removed in a future commit.

No functional change intended.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f4e7978a6c6a..a59fe860da29 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1993,16 +1993,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
- union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
{
- struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
int ret;
int collisions = 0;
LIST_HEAD(invalid_list);

- sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
if (sp->gfn != gfn) {
collisions++;
@@ -2027,7 +2027,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,

/* unsync and write-flooding only apply to indirect SPs. */
if (sp->role.direct)
- goto trace_get_page;
+ goto out;

if (sp->unsync) {
/*
@@ -2053,14 +2053,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,

__clear_sp_write_flooding_count(sp);

-trace_get_page:
- trace_kvm_mmu_get_page(sp, false);
goto out;
}

+ sp = NULL;
++vcpu->kvm->stat.mmu_cache_miss;

- sp = kvm_mmu_alloc_page(vcpu, role.direct);
+out:
+ kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
+ if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
+ vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+ return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);

sp->gfn = gfn;
sp->role = role;
@@ -2070,12 +2082,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
}
- trace_kvm_mmu_get_page(sp, true);
-out:
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);

- if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
- vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+ return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct hlist_head *sp_list;
+ struct kvm_mmu_page *sp;
+ bool created = false;
+
+ sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+ sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+ if (!sp) {
+ created = true;
+ sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+ }
+
+ trace_kvm_mmu_get_page(sp, created);
return sp;
}

--
2.31.1


2022-06-22 19:48:42

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 17/23] KVM: x86/mmu: Cache the access bits of shadowed translations

From: David Matlack <[email protected]>

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU. Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page. The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 85 +++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++-
arch/x86/kvm/mmu/paging_tmpl.h | 9 +++-
4 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e4c31b57a75..64efe8c90c31 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -713,7 +713,7 @@ struct kvm_vcpu_arch {

struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
struct kvm_mmu_memory_cache mmu_shadow_page_cache;
- struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+ struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7cca28d89a85..13a059ad5dc7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -656,7 +656,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;
if (maybe_indirect) {
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
return r;
@@ -669,7 +669,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

@@ -678,34 +678,68 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
kmem_cache_free(pte_list_desc_cache, pte_list_desc);
}

+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
{
if (sp->role.passthrough)
return sp->gfn;

if (!sp->role.direct)
- return sp->gfns[index];
+ return sp->shadowed_translation[index] >> PAGE_SHIFT;

return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
}

-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have a more constrained access permissions that
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
{
- if (sp->role.passthrough) {
- WARN_ON_ONCE(gfn != sp->gfn);
- return;
- }
+ if (sp_has_gptes(sp))
+ return sp->shadowed_translation[index] & ACC_ALL;

- if (!sp->role.direct) {
- sp->gfns[index] = gfn;
+ /*
+ * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+ * KVM is not shadowing any guest page tables, so the "guest access
+ * permissions" are just ACC_ALL.
+ *
+ * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
+ * is shadowing a guest huge page with small pages, the guest access
+ * permissions being shadowed are the access permissions of the huge
+ * page.
+ *
+ * In both cases, sp->role.access contains the correct access bits.
+ */
+ return sp->role.access;
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)
+{
+ if (sp_has_gptes(sp)) {
+ sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
return;
}

- if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
- pr_err_ratelimited("gfn mismatch under direct page %llx "
- "(expected %llx, got %llx)\n",
- sp->gfn,
- kvm_mmu_page_get_gfn(sp, index), gfn);
+ WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
+ "access mismatch under %s page %llx (expected %u, got %u)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+
+ WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
+ "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
+{
+ gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ kvm_mmu_page_set_translation(sp, index, gfn, access);
}

/*
@@ -1554,14 +1588,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
static void __rmap_add(struct kvm *kvm,
struct kvm_mmu_memory_cache *cache,
const struct kvm_memory_slot *slot,
- u64 *spte, gfn_t gfn)
+ u64 *spte, gfn_t gfn, u32 access)
{
struct kvm_mmu_page *sp;
struct kvm_rmap_head *rmap_head;
int rmap_count;

sp = sptep_to_sp(spte);
- kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+ kvm_mmu_page_set_translation(sp, spte - sp->spt, gfn, access);
kvm_update_page_stats(kvm, sp->role.level, 1);

rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1575,11 +1609,11 @@ static void __rmap_add(struct kvm *kvm,
}

static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
- u64 *spte, gfn_t gfn)
+ u64 *spte, gfn_t gfn, u32 access)
{
struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;

- __rmap_add(vcpu->kvm, cache, slot, spte, gfn);
+ __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1643,7 +1677,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
list_del(&sp->link);
free_page((unsigned long)sp->spt);
if (!sp->role.direct)
- free_page((unsigned long)sp->gfns);
+ free_page((unsigned long)sp->shadowed_translation);
kmem_cache_free(mmu_page_header_cache, sp);
}

@@ -2070,7 +2104,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
struct shadow_page_caches {
struct kvm_mmu_memory_cache *page_header_cache;
struct kvm_mmu_memory_cache *shadow_page_cache;
- struct kvm_mmu_memory_cache *gfn_array_cache;
+ struct kvm_mmu_memory_cache *shadowed_info_cache;
};

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2084,7 +2118,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
if (!role.direct)
- sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
+ sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

@@ -2136,7 +2170,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
struct shadow_page_caches caches = {
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
- .gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+ .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
};

return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -2785,7 +2819,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,

if (!was_rmapped) {
WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
- rmap_add(vcpu, slot, sptep, gfn);
+ rmap_add(vcpu, slot, sptep, gfn, pte_access);
+ } else {
+ /* Already rmapped but the pte_access bits may have changed. */
+ kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
}

return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bb9d12ac0db3..ae2d660e2dab 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -67,8 +67,21 @@ struct kvm_mmu_page {
gfn_t gfn;

u64 *spt;
- /* hold the gfn of each spte inside spt */
- gfn_t *gfns;
+
+ /*
+ * Stores the result of the guest translation being shadowed by each
+ * SPTE. KVM shadows two types of guest translations: nGPA -> GPA
+ * (shadow EPT/NPT) and GVA -> GPA (traditional shadow paging). In both
+ * cases the result of the translation is a GPA and a set of access
+ * constraints.
+ *
+ * The GFN is stored in the upper bits (PAGE_SHIFT) and the shadowed
+ * access permissions are stored in the lower bits. Note, for
+ * convenience and uniformity across guests, the access permissions are
+ * stored in KVM format (e.g. ACC_EXEC_MASK) not the raw guest format.
+ */
+ u64 *shadowed_translation;
+
/* Currently serving as active root */
union {
int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 6ecdd7a41a82..24f292f3f93f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -985,7 +985,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
}

/*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is
+ * safe because:
* - The spte has a reference to the struct page, so the pfn for a given gfn
* can't change unless all sptes pointing to it are nuked first.
*
@@ -1067,12 +1068,16 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
* "present" bit, as all other paging modes will create a
* read-only SPTE if pte_access is zero.
*/
- if ((!pte_access && !shadow_present_mask) || gfn != sp->gfns[i]) {
+ if ((!pte_access && !shadow_present_mask) ||
+ gfn != kvm_mmu_page_get_gfn(sp, i)) {
drop_spte(vcpu->kvm, &sp->spt[i]);
flush = true;
continue;
}

+ /* Update the shadowed access bits in case they changed. */
+ kvm_mmu_page_set_access(sp, i, pte_access);
+
sptep = &sp->spt[i];
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
--
2.31.1


2022-06-22 19:49:07

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 23/23] KVM: x86/mmu: Avoid unnecessary flush on eager page split

The TLB flush before installing the newly-populated lower level
page table is unnecessary if the lower-level page table maps
the huge page identically. KVM knows it is if it did not reuse
an existing shadow page table, tell drop_large_spte() to skip
the flush in that case.

Extracted from a patch by David Matlack.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22681931921f..79c6a821ea0d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,7 +1135,7 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
rmap_remove(kvm, sptep);
}

-static void drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
{
struct kvm_mmu_page *sp;

@@ -1143,7 +1143,9 @@ static void drop_large_spte(struct kvm *kvm, u64 *sptep)
WARN_ON(sp->role.level == PG_LEVEL_4K);

drop_spte(kvm, sptep);
- kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+
+ if (flush)
+ kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
KVM_PAGES_PER_HPAGE(sp->role.level));
}

@@ -2283,7 +2285,7 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)

static void __link_shadow_page(struct kvm *kvm,
struct kvm_mmu_memory_cache *cache, u64 *sptep,
- struct kvm_mmu_page *sp)
+ struct kvm_mmu_page *sp, bool flush)
{
u64 spte;

@@ -2291,10 +2293,11 @@ static void __link_shadow_page(struct kvm *kvm,

/*
* If an SPTE is present already, it must be a leaf and therefore
- * a large one. Drop it and flush the TLB before installing sp.
+ * a large one. Drop it, and flush the TLB if needed, before
+ * installing sp.
*/
if (is_shadow_present_pte(*sptep))
- drop_large_spte(kvm, sptep);
+ drop_large_spte(kvm, sptep, flush);

spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));

@@ -2309,7 +2312,7 @@ static void __link_shadow_page(struct kvm *kvm,
static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
struct kvm_mmu_page *sp)
{
- __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+ __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true);
}

static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
@@ -6172,6 +6175,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
u64 huge_spte = READ_ONCE(*huge_sptep);
struct kvm_mmu_page *sp;
+ bool flush = false;
u64 *sptep, spte;
gfn_t gfn;
int index;
@@ -6189,20 +6193,24 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
* gfn-to-pfn translation since the SP is direct, so no need to
* modify them.
*
- * If a given SPTE points to a lower level page table, installing
- * such SPTEs would effectively unmap a potion of the huge page.
- * This is not an issue because __link_shadow_page() flushes the TLB
- * when the passed sp replaces a large SPTE.
+ * However, if a given SPTE points to a lower level page table,
+ * that lower level page table may only be partially populated.
+ * Installing such SPTEs would effectively unmap a potion of the
+ * huge page. Unmapping guest memory always requires a TLB flush
+ * since a subsequent operation on the unmapped regions would
+ * fail to detect the need to flush.
*/
- if (is_shadow_present_pte(*sptep))
+ if (is_shadow_present_pte(*sptep)) {
+ flush |= !is_last_spte(*sptep, sp->role.level);
continue;
+ }

spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
mmu_spte_set(sptep, spte);
__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
}

- __link_shadow_page(kvm, cache, huge_sptep, sp);
+ __link_shadow_page(kvm, cache, huge_sptep, sp, flush);
}

static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
--
2.31.1

2022-06-22 19:50:37

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 16/23] KVM: x86/mmu: Update page stats in __rmap_add()

From: David Matlack <[email protected]>

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a8cdbe2958d9..7cca28d89a85 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1562,6 +1562,8 @@ static void __rmap_add(struct kvm *kvm,

sp = sptep_to_sp(spte);
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+ kvm_update_page_stats(kvm, sp->role.level, 1);
+
rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
rmap_count = pte_list_add(cache, spte, rmap_head);

@@ -2783,7 +2785,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,

if (!was_rmapped) {
WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
- kvm_update_page_stats(vcpu->kvm, level, 1);
rmap_add(vcpu, slot, sptep, gfn);
}

--
2.31.1


2022-06-22 19:58:38

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 13/23] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()

From: David Matlack <[email protected]>

Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct
shadow pages from the context of a VM ioctl without access to a vCPU
pointer.

Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88b3f3c2c8b1..a7748c5a2385 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,6 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}

+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables. Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
struct kvm_vcpu *vcpu,
gfn_t gfn,
@@ -2013,6 +2019,9 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
goto out;

if (sp->unsync) {
+ if (KVM_BUG_ON(!vcpu, kvm))
+ break;
+
/*
* The page is good, but is stale. kvm_sync_page does
* get the latest guest state, but (unlike mmu_unsync_children)
@@ -2090,6 +2099,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
return sp;
}

+/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
struct kvm_vcpu *vcpu,
struct shadow_page_caches *caches,
--
2.31.1


2022-06-22 20:01:30

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 04/23] KVM: x86/mmu: Derive shadow MMU page role from parent

From: David Matlack <[email protected]>

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page. This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 114 +++++++++++++++++++--------------
arch/x86/kvm/mmu/paging_tmpl.h | 9 +--
2 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2e30398fe59f..fd1b479bf7fc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1993,49 +1993,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sptep_to_sp(spte));
}

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- gva_t gaddr,
- unsigned level,
- bool direct,
- unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+ union kvm_mmu_page_role role)
{
- union kvm_mmu_page_role role;
struct hlist_head *sp_list;
- unsigned quadrant;
struct kvm_mmu_page *sp;
int ret;
int collisions = 0;
LIST_HEAD(invalid_list);

- role = vcpu->arch.mmu->root_role;
- role.level = level;
- role.direct = direct;
- role.access = access;
- if (role.has_4_byte_gpte) {
- /*
- * If the guest has 4-byte PTEs then that means it's using 32-bit,
- * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
- * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
- * shadow each guest page table with multiple shadow page tables, which
- * requires extra bookkeeping in the role.
- *
- * Specifically, to shadow the guest's page directory (which covers a
- * 4GiB address space), KVM uses 4 PAE page directories, each mapping
- * 1GiB of the address space. @role.quadrant encodes which quarter of
- * the address space each maps.
- *
- * To shadow the guest's page tables (which each map a 4MiB region), KVM
- * uses 2 PAE page tables, each mapping a 2MiB region. For these,
- * @role.quadrant encodes which half of the region they map.
- */
- quadrant = gaddr >> (PAGE_SHIFT + (SPTE_LEVEL_BITS * level));
- quadrant &= (1 << level) - 1;
- role.quadrant = quadrant;
- }
- if (level <= vcpu->arch.mmu->cpu_role.base.level)
- role.passthrough = 0;
-
sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
if (sp->gfn != gfn) {
@@ -2053,7 +2019,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
* Unsync pages must not be left as is, because the new
* upper-level page will be write-protected.
*/
- if (level > PG_LEVEL_4K && sp->unsync)
+ if (role.level > PG_LEVEL_4K && sp->unsync)
kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
&invalid_list);
continue;
@@ -2094,14 +2060,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,

++vcpu->kvm->stat.mmu_cache_miss;

- sp = kvm_mmu_alloc_page(vcpu, direct);
+ sp = kvm_mmu_alloc_page(vcpu, role.direct);

sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link, sp_list);
if (sp_has_gptes(sp)) {
account_shadowed(vcpu->kvm, sp);
- if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+ if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
}
trace_kvm_mmu_get_page(sp, true);
@@ -2113,6 +2079,55 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
return sp;
}

+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, unsigned int access)
+{
+ struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+ union kvm_mmu_page_role role;
+
+ role = parent_sp->role;
+ role.level--;
+ role.access = access;
+ role.direct = direct;
+ role.passthrough = 0;
+
+ /*
+ * If the guest has 4-byte PTEs then that means it's using 32-bit,
+ * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
+ * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
+ * shadow each guest page table with multiple shadow page tables, which
+ * requires extra bookkeeping in the role.
+ *
+ * Specifically, to shadow the guest's page directory (which covers a
+ * 4GiB address space), KVM uses 4 PAE page directories, each mapping
+ * 1GiB of the address space. @role.quadrant encodes which quarter of
+ * the address space each maps.
+ *
+ * To shadow the guest's page tables (which each map a 4MiB region), KVM
+ * uses 2 PAE page tables, each mapping a 2MiB region. For these,
+ * @role.quadrant encodes which half of the region they map.
+ *
+ * Note, the 4 PAE page directories are pre-allocated and the quadrant
+ * assigned in mmu_alloc_root(). So only page tables need to be handled
+ * here.
+ */
+ if (role.has_4_byte_gpte) {
+ WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+ role.quadrant = (sptep - parent_sp->spt) % 2;
+ }
+
+ return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+ u64 *sptep, gfn_t gfn,
+ bool direct, unsigned int access)
+{
+ union kvm_mmu_page_role role;
+
+ role = kvm_mmu_child_role(sptep, direct, access);
+ return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, hpa_t root,
u64 addr)
@@ -2964,8 +2979,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_shadow_present_pte(*it.sptep))
continue;

- sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
- it.level - 1, true, ACC_ALL);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);

link_shadow_page(vcpu, it.sptep, sp);
if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3368,13 +3382,18 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
return ret;
}

-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
u8 level)
{
- bool direct = vcpu->arch.mmu->root_role.direct;
+ union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
struct kvm_mmu_page *sp;

- sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+ role.level = level;
+
+ if (role.has_4_byte_gpte)
+ role.quadrant = quadrant;
+
+ sp = kvm_mmu_get_page(vcpu, gfn, role);
++sp->root_count;

return __pa(sp->spt);
@@ -3408,8 +3427,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

- root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
- i << 30, PT32_ROOT_LEVEL);
+ root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+ PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_value;
}
@@ -3578,8 +3597,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root_gfn = pdptrs[i] >> PAGE_SHIFT;
}

- root = mmu_alloc_root(vcpu, root_gfn, i << 30,
- PT32_ROOT_LEVEL);
+ root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
mmu->pae_root[i] = root | pm_mask;
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e4655056e651..6ecdd7a41a82 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -654,8 +654,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (!is_shadow_present_pte(*it.sptep)) {
table_gfn = gw->table_gfn[it.level - 2];
access = gw->pt_access[it.level - 2];
- sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
- it.level-1, false, access);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+ false, access);
+
/*
* We must synchronize the pagetable before linking it
* because the guest doesn't need to flush tlb when
@@ -711,8 +712,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
drop_large_spte(vcpu, it.sptep);

if (!is_shadow_present_pte(*it.sptep)) {
- sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
- it.level - 1, true, direct_access);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+ true, direct_access);
link_shadow_page(vcpu, it.sptep, sp);
if (fault->huge_page_disallowed &&
fault->req_level >= it.level)
--
2.31.1


2022-06-22 20:05:46

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH v7 08/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages

From: David Matlack <[email protected]>

Rename 2 functions:

kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.

Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8b84cdd8c6cd..bd45364bf465 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1626,7 +1626,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
}

-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
{
MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
@@ -2081,8 +2081,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
return sp;
}

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
- union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
{
struct hlist_head *sp_list;
struct kvm_mmu_page *sp;
@@ -2146,7 +2147,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
union kvm_mmu_page_role role;

role = kvm_mmu_child_role(sptep, direct, access);
- return kvm_mmu_get_page(vcpu, gfn, role);
+ return kvm_mmu_get_shadow_page(vcpu, gfn, role);
}

static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2422,7 +2423,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
- kvm_mmu_free_page(sp);
+ kvm_mmu_free_shadow_page(sp);
}
}

@@ -3415,7 +3416,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);

- sp = kvm_mmu_get_page(vcpu, gfn, role);
+ sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
++sp->root_count;

return __pa(sp->spt);
--
2.31.1


2022-06-23 16:46:18

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

On Wed, Jun 22, 2022 at 12:27 PM Paolo Bonzini <[email protected]> wrote:
>
> From: David Matlack <[email protected]>
>
> Add support for Eager Page Splitting pages that are mapped by nested
> MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
> pages, and then splitting all 2MiB pages to 4KiB pages.
>
> Note, Eager Page Splitting is limited to nested MMUs as a policy rather
> than due to any technical reason (the sp->role.guest_mode check could
> just be deleted and Eager Page Splitting would work correctly for all
> shadow MMU pages). There is really no reason to support Eager Page
> Splitting for tdp_mmu=N, since such support will eventually be phased
> out, and there is no current use case supporting Eager Page Splitting on
> hosts where TDP is either disabled or unavailable in hardware.
> Furthermore, future improvements to nested MMU scalability may diverge
> the code from the legacy shadow paging implementation. These
> improvements will be simpler to make if Eager Page Splitting does not
> have to worry about legacy shadow paging.
>
> Splitting huge pages mapped by nested MMUs requires dealing with some
> extra complexity beyond that of the TDP MMU:
>
> (1) The shadow MMU has a limit on the number of shadow pages that are
> allowed to be allocated. So, as a policy, Eager Page Splitting
> refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> pages available.
>
> (2) Splitting a huge page may end up re-using an existing lower level
> shadow page tables. This is unlike the TDP MMU which always allocates
> new shadow page tables when splitting.
>
> (3) When installing the lower level SPTEs, they must be added to the
> rmap which may require allocating additional pte_list_desc structs.
>
> Case (2) is especially interesting since it may require a TLB flush,
> unlike the TDP MMU which can fully split huge pages without any TLB
> flushes. Specifically, an existing lower level page table may point to
> even lower level page tables that are not fully populated, effectively
> unmapping a portion of the huge page, which requires a flush. As of
> this commit, a flush is always done always after dropping the huge page
> and before installing the lower level page table.
>
> This TLB flush could instead be delayed until the MMU lock is about to be
> dropped, which would batch flushes for multiple splits. However these
> flushes should be rare in practice (a huge page must be aliased in
> multiple SPTEs and have been split for NX Huge Pages in only some of
> them). Flushing immediately is simpler to plumb and also reduces the
> chances of tripping over a CPU bug (e.g. see iTLB multihit).
>
> Suggested-by: Peter Feiner <[email protected]>
> [ This commit is based off of the original implementation of Eager Page
> Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 3 +-
> arch/x86/include/asm/kvm_host.h | 22 ++
> arch/x86/kvm/mmu/mmu.c | 261 +++++++++++++++++-
> 3 files changed, 277 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 97c16aa2f53f..329f0f274e2b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2418,8 +2418,7 @@
> the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> cleared.
>
> - Eager page splitting currently only supports splitting
> - huge pages mapped by the TDP MMU.
> + Eager page splitting is only supported when kvm.tdp_mmu=Y.
>
> Default is Y (on).
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 64efe8c90c31..665667d61caf 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1338,6 +1338,28 @@ struct kvm_arch {
> u32 max_vcpu_ids;
>
> bool disable_nx_huge_pages;
> +
> + /*
> + * Memory caches used to allocate shadow pages when performing eager
> + * page splitting. No need for a shadowed_info_cache since eager page
> + * splitting only allocates direct shadow pages.
> + *
> + * Protected by kvm->slots_lock.
> + */
> + struct kvm_mmu_memory_cache split_shadow_page_cache;
> + struct kvm_mmu_memory_cache split_page_header_cache;
> +
> + /*
> + * Memory cache used to allocate pte_list_desc structs while splitting
> + * huge pages. In the worst case, to split one huge page, 512
> + * pte_list_desc structs are needed to add each lower level leaf sptep
> + * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
> + * page table.
> + *
> + * Protected by kvm->slots_lock.
> + */
> +#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> + struct kvm_mmu_memory_cache split_desc_cache;
> };
>
> struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bf1ae5ebf41b..22681931921f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5942,9 +5942,25 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_write = kvm_mmu_pte_write;
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
> +
> + kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> + kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> +
> + kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +
> + kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> + kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> +
> return 0;
> }
>
> +static void mmu_free_vm_memory_caches(struct kvm *kvm)
> +{
> + kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> + kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> + kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> +}
> +
> void kvm_mmu_uninit_vm(struct kvm *kvm)
> {
> struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> @@ -5952,6 +5968,8 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
> kvm_page_track_unregister_notifier(kvm, node);
>
> kvm_mmu_uninit_tdp_mmu(kvm);
> +
> + mmu_free_vm_memory_caches(kvm);
> }
>
> static bool __kvm_zap_rmaps(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> @@ -6073,15 +6091,237 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> }
>
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> + return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
> +
> +static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> +{
> + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> + return true;
> +
> + /*
> + * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
> + * to split a single huge page. Calculating how many are actually needed
> + * is possible but not worth the complexity.
> + */
> + return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
> + need_topup(&kvm->arch.split_page_header_cache, 1) ||
> + need_topup(&kvm->arch.split_shadow_page_cache, 1);
> +}
> +
> +static int topup_split_caches(struct kvm *kvm)
> +{
> + int r;
> +
> + lockdep_assert_held(&kvm->slots_lock);
> +
> + /*
> + * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
> + * when splitting a page, but setting capacity == min would cause
> + * KVM to drop mmu_lock even if just one object was consumed from the
> + * cache. So make capacity larger than min and handle two huge pages
> + * without having to drop the lock.

I was going to do some testing this week to confirm, but IIUC KVM will
only allocate from split_desc_cache if the L1 hypervisor has aliased a
huge page in multiple {E,N}PT12 page table entries. i.e. L1 is mapping
a huge page into an L2 multiple times, or mapped into multiple L2s.
This should be common in traditional, process-level, shadow paging,
but I think will be quite rare for nested shadow paging.

I don't have any objection to using 2x for capacity but I would
recommend dropping the "It's common part ...," part from the comment.


> + */
> + r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
> + 2 * SPLIT_DESC_CACHE_MIN_NR_OBJECTS,
> + SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
> + if (r)
> + return r;
> +
> + r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
> + if (r)
> + return r;
> +
> + return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> +}
> +
> +static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> +{
> + struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> + struct shadow_page_caches caches = {};
> + union kvm_mmu_page_role role;
> + unsigned int access;
> + gfn_t gfn;
> +
> + gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> + access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
> +
> + /*
> + * Note, huge page splitting always uses direct shadow pages, regardless
> + * of whether the huge page itself is mapped by a direct or indirect
> + * shadow page, since the huge page region itself is being directly
> + * mapped with smaller pages.
> + */
> + role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
> +
> + /* Direct SPs do not require a shadowed_info_cache. */
> + caches.page_header_cache = &kvm->arch.split_page_header_cache;
> + caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> +
> + /* Safe to pass NULL for vCPU since requesting a direct SP. */
> + return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> +}
> +
> +static void shadow_mmu_split_huge_page(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + u64 *huge_sptep)
> +
> +{
> + struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
> + u64 huge_spte = READ_ONCE(*huge_sptep);
> + struct kvm_mmu_page *sp;
> + u64 *sptep, spte;
> + gfn_t gfn;
> + int index;
> +
> + sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
> +
> + for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
> + sptep = &sp->spt[index];
> + gfn = kvm_mmu_page_get_gfn(sp, index);
> +
> + /*
> + * The SP may already have populated SPTEs, e.g. if this huge
> + * page is aliased by multiple sptes with the same access
> + * permissions. These entries are guaranteed to map the same
> + * gfn-to-pfn translation since the SP is direct, so no need to
> + * modify them.
> + *
> + * If a given SPTE points to a lower level page table, installing
> + * such SPTEs would effectively unmap a potion of the huge page.
> + * This is not an issue because __link_shadow_page() flushes the TLB
> + * when the passed sp replaces a large SPTE.
> + */
> + if (is_shadow_present_pte(*sptep))
> + continue;
> +
> + spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index);
> + mmu_spte_set(sptep, spte);
> + __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
> + }
> +
> + __link_shadow_page(kvm, cache, huge_sptep, sp);
> +}
> +
> +static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + u64 *huge_sptep)
> +{
> + struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> + int level, r = 0;
> + gfn_t gfn;
> + u64 spte;
> +
> + /* Grab information for the tracepoint before dropping the MMU lock. */
> + gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> + level = huge_sp->role.level;
> + spte = *huge_sptep;
> +
> + if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
> + r = -ENOSPC;
> + goto out;
> + }
> +
> + if (need_topup_split_caches_or_resched(kvm)) {
> + write_unlock(&kvm->mmu_lock);
> + cond_resched();
> + /*
> + * If the topup succeeds, return -EAGAIN to indicate that the
> + * rmap iterator should be restarted because the MMU lock was
> + * dropped.
> + */
> + r = topup_split_caches(kvm) ?: -EAGAIN;
> + write_lock(&kvm->mmu_lock);
> + goto out;
> + }
> +
> + shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
> +
> +out:
> + trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> + return r;
> +}
> +
> +static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
> + struct kvm_rmap_head *rmap_head,
> + const struct kvm_memory_slot *slot)
> +{
> + struct rmap_iterator iter;
> + struct kvm_mmu_page *sp;
> + u64 *huge_sptep;
> + int r;
> +
> +restart:
> + for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
> + sp = sptep_to_sp(huge_sptep);
> +
> + /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
> + if (WARN_ON_ONCE(!sp->role.guest_mode))
> + continue;
> +
> + /* The rmaps should never contain non-leaf SPTEs. */
> + if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
> + continue;
> +
> + /* SPs with level >PG_LEVEL_4K should never by unsync. */
> + if (WARN_ON_ONCE(sp->unsync))
> + continue;
> +
> + /* Don't bother splitting huge pages on invalid SPs. */
> + if (sp->role.invalid)
> + continue;
> +
> + r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);
> +
> + /*
> + * The split succeeded or needs to be retried because the MMU
> + * lock was dropped. Either way, restart the iterator to get it
> + * back into a consistent state.
> + */
> + if (!r || r == -EAGAIN)
> + goto restart;
> +
> + /* The split failed and shouldn't be retried (e.g. -ENOMEM). */
> + break;
> + }
> +
> + return false;
> +}
> +
> +static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + gfn_t start, gfn_t end,
> + int target_level)
> +{
> + int level;
> +
> + /*
> + * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> + * down to the target level. This ensures pages are recursively split
> + * all the way to the target level. There's no need to split pages
> + * already at the target level.
> + */
> + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> + slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages,
> + level, level, start, end - 1, true, false);
> + }
> +}
> +
> /* Must be called with the mmu_lock held in write-mode. */
> void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
> const struct kvm_memory_slot *memslot,
> u64 start, u64 end,
> int target_level)
> {
> - if (is_tdp_mmu_enabled(kvm))
> - kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
> - target_level, false);
> + if (!is_tdp_mmu_enabled(kvm))
> + return;
> +
> + if (kvm_memslots_have_rmaps(kvm))
> + kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> +
> + kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
>
> /*
> * A TLB flush is unnecessary at this point for the same resons as in
> @@ -6096,12 +6336,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> u64 start = memslot->base_gfn;
> u64 end = start + memslot->npages;
>
> - if (is_tdp_mmu_enabled(kvm)) {
> - read_lock(&kvm->mmu_lock);
> - kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> - read_unlock(&kvm->mmu_lock);
> + if (!is_tdp_mmu_enabled(kvm))
> + return;
> +
> + if (kvm_memslots_have_rmaps(kvm)) {
> + write_lock(&kvm->mmu_lock);
> + kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> + write_unlock(&kvm->mmu_lock);
> }
>
> + read_lock(&kvm->mmu_lock);
> + kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> + read_unlock(&kvm->mmu_lock);
> +
> /*
> * No TLB flush is necessary here. KVM will flush TLBs after
> * write-protecting and/or clearing dirty on the newly split SPTEs to
> --
> 2.31.1
>
>

2022-06-23 20:11:44

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

On Thu, Jun 23, 2022, David Matlack wrote:
> On Wed, Jun 22, 2022 at 12:27 PM Paolo Bonzini <[email protected]> wrote:

Please trim replies.

> > +static int topup_split_caches(struct kvm *kvm)
> > +{
> > + int r;
> > +
> > + lockdep_assert_held(&kvm->slots_lock);
> > +
> > + /*
> > + * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
> > + * when splitting a page, but setting capacity == min would cause
> > + * KVM to drop mmu_lock even if just one object was consumed from the
> > + * cache. So make capacity larger than min and handle two huge pages
> > + * without having to drop the lock.
>
> I was going to do some testing this week to confirm, but IIUC KVM will
> only allocate from split_desc_cache if the L1 hypervisor has aliased a
> huge page in multiple {E,N}PT12 page table entries. i.e. L1 is mapping
> a huge page into an L2 multiple times, or mapped into multiple L2s.
> This should be common in traditional, process-level, shadow paging,
> but I think will be quite rare for nested shadow paging.

Ooooh, right, I forgot that that pte_list_add() needs to allocate if and only if
there are multiple rmap entries, otherwise rmap->val points that the one and only
rmap directly.

Doubling the capacity is all but guaranteed to be pointless overhead. What about
buffering with the default capacity? That way KVM doesn't have to topup if it
happens to encounter an aliased gfn. It's arbitrary, but so is the default capacity
size.

E.g. as fixup

---
arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22b87007efff..90d6195edcf3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6125,19 +6125,23 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)

static int topup_split_caches(struct kvm *kvm)
{
- int r;
-
- lockdep_assert_held(&kvm->slots_lock);
-
/*
- * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
- * when splitting a page, but setting capacity == min would cause
- * KVM to drop mmu_lock even if just one object was consumed from the
- * cache. So make capacity larger than min and handle two huge pages
- * without having to drop the lock.
+ * Allocating rmap list entries when splitting huge pages for nested
+ * MMUs is rare as KVM needs to allocate if and only if there is more
+ * than one rmap entry for the gfn, i.e. requires an L1 gfn to be
+ * aliased by multiple L2 gfns, which is very atypical for VMMs. If
+ * there is only one rmap entry, rmap->val points directly at that one
+ * entry and doesn't need to allocate a list. Buffer the cache by the
+ * default capacity so that KVM doesn't have to topup the cache if it
+ * encounters an aliased gfn or two.
*/
- r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
- 2 * SPLIT_DESC_CACHE_MIN_NR_OBJECTS,
+ const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+ int r;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
if (r)
return r;

base-commit: 436b1c29f36ed3d4385058ba6f0d6266dbd2a882
--

2022-06-23 22:41:27

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v7 22/23] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

On Thu, Jun 23, 2022 at 07:48:02PM +0000, Sean Christopherson wrote:
> On Thu, Jun 23, 2022, David Matlack wrote:
> > On Wed, Jun 22, 2022 at 12:27 PM Paolo Bonzini <[email protected]> wrote:
>
> Please trim replies.
>
> > > +static int topup_split_caches(struct kvm *kvm)
> > > +{
> > > + int r;
> > > +
> > > + lockdep_assert_held(&kvm->slots_lock);
> > > +
> > > + /*
> > > + * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
> > > + * when splitting a page, but setting capacity == min would cause
> > > + * KVM to drop mmu_lock even if just one object was consumed from the
> > > + * cache. So make capacity larger than min and handle two huge pages
> > > + * without having to drop the lock.
> >
> > I was going to do some testing this week to confirm, but IIUC KVM will
> > only allocate from split_desc_cache if the L1 hypervisor has aliased a
> > huge page in multiple {E,N}PT12 page table entries. i.e. L1 is mapping
> > a huge page into an L2 multiple times, or mapped into multiple L2s.
> > This should be common in traditional, process-level, shadow paging,
> > but I think will be quite rare for nested shadow paging.
>
> Ooooh, right, I forgot that that pte_list_add() needs to allocate if and only if
> there are multiple rmap entries, otherwise rmap->val points that the one and only
> rmap directly.
>
> Doubling the capacity is all but guaranteed to be pointless overhead. What about
> buffering with the default capacity? That way KVM doesn't have to topup if it
> happens to encounter an aliased gfn. It's arbitrary, but so is the default capacity
> size.
>
> E.g. as fixup

LGTM

>
> ---
> arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 22b87007efff..90d6195edcf3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6125,19 +6125,23 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
>
> static int topup_split_caches(struct kvm *kvm)
> {
> - int r;
> -
> - lockdep_assert_held(&kvm->slots_lock);
> -
> /*
> - * It's common to need all SPLIT_DESC_CACHE_MIN_NR_OBJECTS (513) objects
> - * when splitting a page, but setting capacity == min would cause
> - * KVM to drop mmu_lock even if just one object was consumed from the
> - * cache. So make capacity larger than min and handle two huge pages
> - * without having to drop the lock.
> + * Allocating rmap list entries when splitting huge pages for nested
> + * MMUs is rare as KVM needs to allocate if and only if there is more
> + * than one rmap entry for the gfn, i.e. requires an L1 gfn to be
> + * aliased by multiple L2 gfns, which is very atypical for VMMs. If
> + * there is only one rmap entry, rmap->val points directly at that one
> + * entry and doesn't need to allocate a list. Buffer the cache by the
> + * default capacity so that KVM doesn't have to topup the cache if it
> + * encounters an aliased gfn or two.
> */
> - r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
> - 2 * SPLIT_DESC_CACHE_MIN_NR_OBJECTS,
> + const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
> + KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> + int r;
> +
> + lockdep_assert_held(&kvm->slots_lock);
> +
> + r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
> SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
> if (r)
> return r;
>
> base-commit: 436b1c29f36ed3d4385058ba6f0d6266dbd2a882
> --
>

2022-06-24 00:12:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 19/23] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels

On Wed, Jun 22, 2022, Paolo Bonzini wrote:
> From: David Matlack <[email protected]>
>
> Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
> is fine for now since KVM never creates intermediate huge pages during
> dirty logging. In other words, KVM always replaces 1GiB pages directly
> with 4KiB pages, so there is no reason to look for collapsible 2MiB
> pages.
>
> However, this will stop being true once the shadow MMU participates in
> eager page splitting. During eager page splitting, each 1GiB is first
> split into 2MiB pages and then those are split into 4KiB pages. The
> intermediate 2MiB pages may be left behind if an error condition causes
> eager page splitting to bail early.
>
> No functional change intended.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Matlack <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 13a059ad5dc7..36bc49f08d60 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6154,18 +6154,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> return need_tlb_flush;
> }
>
> +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> + const struct kvm_memory_slot *slot)
> +{
> + /*
> + * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> + * pages that are already mapped at the maximum possible level.
> + */
> + if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> + true))

Can you fix this up to put "true" on the previous line?

And if you do that, maybe also tweak the comment to reference "hugepage level"
instead of "possible level"?

---
arch/x86/kvm/mmu/mmu.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8825716060e4..34b0e85b26a4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6450,12 +6450,11 @@ static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
{
/*
- * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
- * pages that are already mapped at the maximum possible level.
+ * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1, there's no need to zap pages
+ * that are already mapped at the maximum hugepage level.
*/
if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
- PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
- true))
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
}


base-commit: fd43332c2900db8ca852676f37f0ab423d0c369a
--

2022-06-24 00:34:09

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 20/23] KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()

On Wed, Jun 22, 2022, Paolo Bonzini wrote:
> Before allocating a child shadow page table, all callers check
> whether the parent already points to a huge page and, if so, they
> drop that SPTE. This is done by drop_large_spte().

Thanks for the (), much appreciated!

> However, the act that requires dropping the large SPTE is the
> installation of the sp that is returned by kvm_mmu_get_child_sp(),
> which happens in __link_shadow_page(). Move the call there
> instead of having it in each and every caller.
>
> To ensure that the shadow page is not linked twice if it was
> present, do _not_ opportunistically make kvm_mmu_get_child_sp()
> idempotent: instead, return an error value if the shadow page
> already existed. This is a bit more verbose, but clearer than
> NULL.

Agreed, and I think we can take advantage of that verbosity to do a tiny bit more
cleanup by moving the unsync logic into a wrapper that returns -EAGAIN. Working
on a mini-series...

> Now that the drop_large_spte() name is not taken anymore,
> remove the two underscores in front of __drop_large_spte().
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Sean Christopherson <[email protected]>

2022-06-24 00:43:39

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v7 00/23] KVM: Extend Eager Page Splitting to the shadow MMU

On Wed, Jun 22, 2022 at 03:26:47PM -0400, Paolo Bonzini wrote:
> For the description of the "why" of this patch, I'll just direct you to
> David's excellent cover letter from v6, which can be found at
> https://lore.kernel.org/r/[email protected].
>
> This version mostly does the following:
>
> - apply the feedback from Sean and other reviewers, which is mostly
> aesthetic
>
> - replace the refactoring of drop_large_spte()/__drop_large_spte()
> with my own version. The insight there is that drop_large_spte()
> is always followed by {,__}link_shadow_page(), so the call is
> moved there
>
> - split the TLB flush optimization into a separate patch, mostly
> to perform the previous refactoring independent of the optional
> TLB flush
>
> - rename a few functions from *nested_mmu* to *shadow_mmu*
>

Thanks for the v7 Paolo!

2022-06-29 12:54:51

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v7 21/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs

On Thu, Jun 23, 2022 at 12:57 AM Paolo Bonzini <[email protected]> wrote:
>
> From: David Matlack <[email protected]>
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare an cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> In order to support different capacities, this commit changes the
> objects pointer array to be dynamically allocated the first time the
> cache is topped-up.
>
> While here, opportunistically clean up the stack-allocated
> kvm_mmu_memory_cache structs in riscv and arm64 to use designated
> initializers.
>
> No functional change intended.
>
> Reviewed-by: Marc Zyngier <[email protected]>
> Signed-off-by: David Matlack <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>

For KVM RISC-V
Reviewed-by: Anup Patel <[email protected]>
Tested-by: Anup Patel <[email protected]>

Regards,
Anup

> ---
> arch/arm64/kvm/mmu.c | 2 +-
> arch/riscv/kvm/mmu.c | 5 +----
> include/linux/kvm_host.h | 1 +
> include/linux/kvm_types.h | 6 +++++-
> virt/kvm/kvm_main.c | 33 ++++++++++++++++++++++++++++++---
> 5 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index f5651a05b6a8..87f1cd0df36e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -786,7 +786,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> {
> phys_addr_t addr;
> int ret = 0;
> - struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> + struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> KVM_PGTABLE_PROT_R |
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 1c00695ebee7..081f8d2b9cf3 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -350,10 +350,7 @@ static int gstage_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> int ret = 0;
> unsigned long pfn;
> phys_addr_t addr, end;
> - struct kvm_mmu_memory_cache pcache;
> -
> - memset(&pcache, 0, sizeof(pcache));
> - pcache.gfp_zero = __GFP_ZERO;
> + struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };
>
> end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> pfn = __phys_to_pfn(hpa);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a2bbdf3ab086..3554e48406e4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1356,6 +1356,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
>
> #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index f328a01db4fe..4d933518060f 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -85,12 +85,16 @@ struct gfn_to_pfn_cache {
> * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> * holding MMU locks. Note, these caches act more like prefetch buffers than
> * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The @capacity field and @objects array are lazily initialized when the cache
> + * is topped up (__kvm_mmu_topup_memory_cache()).
> */
> struct kvm_mmu_memory_cache {
> int nobjs;
> gfp_t gfp_zero;
> struct kmem_cache *kmem_cache;
> - void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> + int capacity;
> + void **objects;
> };
> #endif
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5b8ae83e09d7..45188d11812c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -396,14 +396,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> return (void *)__get_free_page(gfp_flags);
> }
>
> -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> {
> + gfp_t gfp = GFP_KERNEL_ACCOUNT;
> void *obj;
>
> if (mc->nobjs >= min)
> return 0;
> - while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> - obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> +
> + if (unlikely(!mc->objects)) {
> + if (WARN_ON_ONCE(!capacity))
> + return -EIO;
> +
> + mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> + if (!mc->objects)
> + return -ENOMEM;
> +
> + mc->capacity = capacity;
> + }
> +
> + /* It is illegal to request a different capacity across topups. */
> + if (WARN_ON_ONCE(mc->capacity != capacity))
> + return -EIO;
> +
> + while (mc->nobjs < mc->capacity) {
> + obj = mmu_memory_cache_alloc_obj(mc, gfp);
> if (!obj)
> return mc->nobjs >= min ? 0 : -ENOMEM;
> mc->objects[mc->nobjs++] = obj;
> @@ -411,6 +428,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> return 0;
> }
>
> +int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +{
> + return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
> +}
> +
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> {
> return mc->nobjs;
> @@ -424,6 +446,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> else
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
> +
> + kvfree(mc->objects);
> +
> + mc->objects = NULL;
> + mc->capacity = 0;
> }
>
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> --
> 2.31.1
>
>