2020-10-14 18:28:27

by Ben Gardon

Subject: [PATCH v2 00/20] Introduce the TDP MMU

Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended on shadow paging to run all guests, we now have
two dimensional paging (TDP). This patch set introduces a new
implementation of much of the KVM MMU, optimized for running guests with
TDP. We have re-implemented many of the MMU functions to take advantage of
the relative simplicity of TDP and eliminate the need for an rmap.
Building on this simplified implementation, a future patch set will change
the synchronization model for this "TDP MMU" to enable more parallelism
than the monolithic MMU lock. A TDP MMU is currently in use at Google
and has given us the performance necessary to live migrate our 416 vCPU,
12TiB m2-ultramem-416 VMs.

This work was motivated by the need to handle page faults in parallel for
very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
long latency on guest page faults. This contention can easily be seen by
running the KVM selftests demand_paging_test with a couple hundred vCPUs.
Over a 1 second profile of the demand_paging_test, with 416 vCPUs and 4G
per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
the TDP MMU reduced the test duration by 89% and the execution was
dominated by get_user_pages and the user fault FD ioctl instead of the
MMU lock.

This series is the first of two. In this series we add a basic
implementation of the TDP MMU. In the next series we will improve the
performance of the TDP MMU and allow it to execute MMU operations
in parallel.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to host
virtual addresses (HVA), and the kernel MM/x86 host page tables map
HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and programs
the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
no way to change this mapping and only one version of the paging structure
is needed per L1 paging mode. In this case the paging mode is some
combination of the number of levels in the paging structure, the address
space (normal execution or system management mode, on x86), and other
attributes. Most VMs only ever use one paging mode and so need only one
TDP structure.
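
As a rough sketch of the two-part mapping described above (the memslot
fields are the real ones; the helper itself is purely illustrative and not
part of this series):

static unsigned long gpa_to_hva(struct kvm_memory_slot *slot, gpa_t gpa)
{
        gfn_t gfn = gpa >> PAGE_SHIFT;

        /* Part 1: KVM's memslots map a GPA (via its GFN) to an HVA. */
        return slot->userspace_addr +
               ((gfn - slot->base_gfn) << PAGE_SHIFT);
}

/*
 * Part 2: the kernel MM / host x86 page tables then map that HVA to an
 * HPA, for example when KVM resolves a fault via get_user_pages().
 */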

This series implements a "TDP MMU" through alternative implementations of
MMU functions for running L1 guests with TDP. The TDP MMU falls back to
the existing shadow paging implementation when TDP is not available, and
interoperates with the existing shadow paging implementation for nesting.
The use of the TDP MMU can be controlled by a module parameter which is
snapshotted at VM creation and remains fixed for the life of the VM. This
snapshot
is used in many functions to decide whether or not to use TDP MMU handlers
for a given operation.
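
For example, the MMU notifier hooks added later in this series (patch 13)
follow this pattern: the existing shadow MMU path always runs, and the TDP
MMU handler additionally runs when the snapshot is set.

int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
{
        int young;

        /* Existing rmap-based shadow MMU path. */
        young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);

        /* TDP MMU path, gated on the per-VM snapshot. */
        if (kvm->arch.tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);

        return young;
}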

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
(Thanks to Dmitry Vyukov <[email protected]> for setting up the
Gerrit instance)

Changes v1 -> v2:
Big thanks to Paolo and Sean for your thorough reviews!
- Moved some function definitions to mmu_internal.h instead of just
declaring them there.
- Dropped the commit to add an as_id field to memslots in favor of
Peter Xu's which is part of the dirty ring patch set. I've included a
copy of that patch from v13 of the patch set in this series.
- Fixed comment style on SPDX license headers
- Added a min_level to the tdp_iter and removed the tdp_iter_no_step_down
function
- Removed the tdp_mmu module parameter and defaulted the global to false
- Unified more NX reclaim code
- Added helper functions for setting SPTEs in the TDP MMU
- Renamed tdp_iter macros for clarity
- Renamed kvm_tdp_mmu_page_fault to kvm_tdp_mmu_map and gave it the same
signature as __direct_map
- Converted some WARN_ONs to BUG_ONs or removed them
- Changed dlog to dirty_log to match convention
- Switched make_spte to return a return code and use a return parameter
for the new SPTE
- Refactored TDP MMU root allocation
- Other misc cleanups and bug fixes

Ben Gardon (19):
kvm: x86/mmu: Separate making SPTEs from set_spte
kvm: x86/mmu: Introduce tdp_iter
kvm: x86/mmu: Init / Uninit the TDP MMU
kvm: x86/mmu: Allocate and free TDP MMU roots
kvm: x86/mmu: Add functions to handle changed TDP SPTEs
kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
kvm: x86/mmu: Separate making non-leaf sptes from link_shadow_page
kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator
arg
kvm: x86/mmu: Add TDP MMU PF handler
kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
kvm: x86/mmu: Add access tracking for tdp_mmu
kvm: x86/mmu: Support changed pte notifier in tdp MMU
kvm: x86/mmu: Support dirty logging for the TDP MMU
kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
kvm: x86/mmu: Support write protection for nesting in tdp MMU
kvm: x86/mmu: Support MMIO in the TDP MMU
kvm: x86/mmu: Don't clear write flooding count for direct roots
kvm: x86/mmu: NX largepage recovery for TDP MMU

Peter Xu (1):
KVM: Cache as_id in kvm_memory_slot

arch/x86/include/asm/kvm_host.h | 14 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu/mmu.c | 487 +++++++------
arch/x86/kvm/mmu/mmu_internal.h | 242 +++++++
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/mmu/tdp_iter.c | 181 +++++
arch/x86/kvm/mmu/tdp_iter.h | 60 ++
arch/x86/kvm/mmu/tdp_mmu.c | 1154 +++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 48 ++
include/linux/kvm_host.h | 2 +
virt/kvm/kvm_main.c | 12 +-
11 files changed, 1944 insertions(+), 262 deletions(-)
create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

--
2.28.0.1011.ga647a8990f-goog


2020-10-14 18:28:42

by Ben Gardon

Subject: [PATCH v2 03/20] kvm: x86/mmu: Init / Uninit the TDP MMU

The TDP MMU offers an alternative mode of operation to the x86 shadow
paging based MMU, optimized for running an L1 guest with TDP. The TDP MMU
will require new fields that need to be initialized and torn down. Add
hooks into the existing KVM MMU initialization process to do that
initialization / cleanup. Currently the initialization and cleanup
functions do not do very much; however, more operations will be added in
future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 9 ++++++++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/mmu/mmu.c | 5 +++++
arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 10 +++++++++
5 files changed, 63 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d0f77235da923..6b6dbc20ce23a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -980,6 +980,15 @@ struct kvm_arch {

struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
+
+ /*
+ * Whether the TDP MMU is enabled for this VM. This contains a
+ * snapshot of the TDP MMU module parameter from when the VM was
+ * created and remains unchanged for the life of the VM. If this is
+ * true, TDP MMU handler functions will run for various MMU
+ * operations.
+ */
+ bool tdp_mmu_enabled;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 4525c1151bf99..fd6b1b0cc27c0 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -16,7 +16,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
- mmu/tdp_iter.o
+ mmu/tdp_iter.o mmu/tdp_mmu.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d82784ed5679..f53d29e09367c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -19,6 +19,7 @@
#include "ioapic.h"
#include "mmu.h"
#include "mmu_internal.h"
+#include "tdp_mmu.h"
#include "x86.h"
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"
@@ -5833,6 +5834,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;

+ kvm_mmu_init_tdp_mmu(kvm);
+
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
@@ -5843,6 +5846,8 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;

kvm_page_track_unregister_notifier(kvm, node);
+
+ kvm_mmu_uninit_tdp_mmu(kvm);
}

void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
new file mode 100644
index 0000000000000..b3809835e90b1
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "tdp_mmu.h"
+
+static bool __read_mostly tdp_mmu_enabled = false;
+
+static bool is_tdp_mmu_enabled(void)
+{
+#ifdef CONFIG_X86_64
+ if (!READ_ONCE(tdp_mmu_enabled))
+ return false;
+
+ if (WARN_ONCE(!tdp_enabled,
+ "Creating a VM with TDP MMU enabled requires TDP."))
+ return false;
+
+ return true;
+
+#else
+ return false;
+#endif /* CONFIG_X86_64 */
+}
+
+/* Initializes the TDP MMU for the VM, if enabled. */
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+{
+ if (!is_tdp_mmu_enabled())
+ return;
+
+ /* This should not be changed for the lifetime of the VM. */
+ kvm->arch.tdp_mmu_enabled = true;
+}
+
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
+{
+ if (!kvm->arch.tdp_mmu_enabled)
+ return;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
new file mode 100644
index 0000000000000..cd4a562a70e9a
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef __KVM_X86_MMU_TDP_MMU_H
+#define __KVM_X86_MMU_TDP_MMU_H
+
+#include <linux/kvm_host.h>
+
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
+#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:28:43

by Ben Gardon

Subject: [PATCH v2 02/20] kvm: x86/mmu: Introduce tdp_iter

The TDP iterator implements a pre-order traversal of a TDP paging
structure. This iterator will be used in future patches to create
an efficient implementation of the KVM MMU for the TDP case.
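
As an illustration only (not part of this patch), a hypothetical helper
built on the iterator and the for_each_tdp_pte() macro added below might
look like this:

static int count_present_leaf_sptes(u64 *root_pt, int root_level,
                                    gfn_t start, gfn_t end)
{
        struct tdp_iter iter;
        int count = 0;

        /* Pre-order walk over every SPTE mapping [start, end). */
        for_each_tdp_pte(iter, root_pt, root_level, start, end) {
                /* Count only present, last-level (leaf) SPTEs. */
                if (is_shadow_present_pte(iter.old_spte) &&
                    is_last_spte(iter.old_spte, iter.level))
                        count++;
        }

        return count;
}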

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu/mmu.c | 66 ------------
arch/x86/kvm/mmu/mmu_internal.h | 66 ++++++++++++
arch/x86/kvm/mmu/tdp_iter.c | 176 ++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_iter.h | 56 ++++++++++
5 files changed, 300 insertions(+), 67 deletions(-)
create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
create mode 100644 arch/x86/kvm/mmu/tdp_iter.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 7f86a14aed0e9..4525c1151bf99 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
+ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
+ mmu/tdp_iter.o

kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6c9db349600c8..6d82784ed5679 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -121,28 +121,6 @@ module_param(dbg, bool, 0644);

#define PTE_PREFETCH_NUM 8

-#define PT_FIRST_AVAIL_BITS_SHIFT 10
-#define PT64_SECOND_AVAIL_BITS_SHIFT 54
-
-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs.
- */
-#define SPTE_SPECIAL_MASK (3ULL << 52)
-#define SPTE_AD_ENABLED_MASK (0ULL << 52)
-#define SPTE_AD_DISABLED_MASK (1ULL << 52)
-#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
-#define SPTE_MMIO_MASK (3ULL << 52)
-
-#define PT64_LEVEL_BITS 9
-
-#define PT64_LEVEL_SHIFT(level) \
- (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
-
-#define PT64_INDEX(address, level)\
- (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
-
-
#define PT32_LEVEL_BITS 10

#define PT32_LEVEL_SHIFT(level) \
@@ -155,19 +133,6 @@ module_param(dbg, bool, 0644);
#define PT32_INDEX(address, level)\
(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))

-
-#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
-#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
-#else
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
-#endif
-#define PT64_LVL_ADDR_MASK(level) \
- (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
-#define PT64_LVL_OFFSET_MASK(level) \
- (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
-
#define PT32_BASE_ADDR_MASK PAGE_MASK
#define PT32_DIR_BASE_ADDR_MASK \
(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
@@ -192,8 +157,6 @@ module_param(dbg, bool, 0644);
#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))

-#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
-
/* make pte_list_desc fit well in cache line */
#define PTE_LIST_EXT 3

@@ -349,11 +312,6 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

-static bool is_mmio_spte(u64 spte)
-{
- return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
-}
-
static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
{
return sp->role.ad_disabled;
@@ -626,35 +584,11 @@ static int is_nx(struct kvm_vcpu *vcpu)
return vcpu->arch.efer & EFER_NX;
}

-static int is_shadow_present_pte(u64 pte)
-{
- return (pte != 0) && !is_mmio_spte(pte);
-}
-
-static int is_large_pte(u64 pte)
-{
- return pte & PT_PAGE_SIZE_MASK;
-}
-
-static int is_last_spte(u64 pte, int level)
-{
- if (level == PG_LEVEL_4K)
- return 1;
- if (is_large_pte(pte))
- return 1;
- return 0;
-}
-
static bool is_executable_pte(u64 spte)
{
return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
}

-static kvm_pfn_t spte_to_pfn(u64 pte)
-{
- return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
-}
-
static gfn_t pse36_gfn_delta(u32 gpte)
{
int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 3acf3b8eb469d..74ccbf001a42e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -2,6 +2,8 @@
#ifndef __KVM_X86_MMU_INTERNAL_H
#define __KVM_X86_MMU_INTERNAL_H

+#include "mmu.h"
+
#include <linux/types.h>

#include <asm/kvm_host.h>
@@ -60,4 +62,68 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);

+#define PT64_LEVEL_BITS 9
+
+#define PT64_LEVEL_SHIFT(level) \
+ (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
+
+#define PT64_INDEX(address, level)\
+ (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
+#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+
+#define PT_FIRST_AVAIL_BITS_SHIFT 10
+#define PT64_SECOND_AVAIL_BITS_SHIFT 54
+
+/*
+ * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
+ * Access Tracking SPTEs.
+ */
+#define SPTE_SPECIAL_MASK (3ULL << 52)
+#define SPTE_AD_ENABLED_MASK (0ULL << 52)
+#define SPTE_AD_DISABLED_MASK (1ULL << 52)
+#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
+#define SPTE_MMIO_MASK (3ULL << 52)
+
+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
+#define PT64_LVL_ADDR_MASK(level) \
+ (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
+ * PT64_LEVEL_BITS))) - 1))
+#define PT64_LVL_OFFSET_MASK(level) \
+ (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
+ * PT64_LEVEL_BITS))) - 1))
+
+/* Functions for interpreting SPTEs */
+static inline bool is_mmio_spte(u64 spte)
+{
+ return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
+}
+
+static inline int is_shadow_present_pte(u64 pte)
+{
+ return (pte != 0) && !is_mmio_spte(pte);
+}
+
+static inline int is_large_pte(u64 pte)
+{
+ return pte & PT_PAGE_SIZE_MASK;
+}
+
+static inline int is_last_spte(u64 pte, int level)
+{
+ if (level == PG_LEVEL_4K)
+ return 1;
+ if (is_large_pte(pte))
+ return 1;
+ return 0;
+}
+
+static inline kvm_pfn_t spte_to_pfn(u64 pte)
+{
+ return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
+}
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
new file mode 100644
index 0000000000000..b07e9f0c5d4aa
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "mmu_internal.h"
+#include "tdp_iter.h"
+
+/*
+ * Recalculates the pointer to the SPTE for the current GFN and level and
+ * rereads the SPTE.
+ */
+static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
+{
+ iter->sptep = iter->pt_path[iter->level - 1] +
+ SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ iter->old_spte = READ_ONCE(*iter->sptep);
+}
+
+static gfn_t round_gfn_for_level(gfn_t gfn, int level)
+{
+ return gfn - (gfn % KVM_PAGES_PER_HPAGE(level));
+}
+
+/*
+ * Sets a TDP iterator to walk a pre-order traversal of the paging structure
+ * rooted at root_pt, starting with the walk to translate goal_gfn.
+ */
+void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+ int min_level, gfn_t goal_gfn)
+{
+ WARN_ON(root_level < 1);
+ WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
+
+ iter->goal_gfn = goal_gfn;
+ iter->root_level = root_level;
+ iter->min_level = min_level;
+ iter->level = root_level;
+ iter->pt_path[iter->level - 1] = root_pt;
+
+ iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ iter->valid = true;
+}
+
+/*
+ * Given an SPTE and its level, returns a pointer containing the host virtual
+ * address of the child page table referenced by the SPTE. Returns null if
+ * there is no such entry.
+ */
+u64 *spte_to_child_pt(u64 spte, int level)
+{
+ /*
+ * There's no child entry if this entry isn't present or is a
+ * last-level entry.
+ */
+ if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
+ return NULL;
+
+ return __va(spte_to_pfn(spte) << PAGE_SHIFT);
+}
+
+/*
+ * Steps down one level in the paging structure towards the goal GFN. Returns
+ * true if the iterator was able to step down a level, false otherwise.
+ */
+static bool try_step_down(struct tdp_iter *iter)
+{
+ u64 *child_pt;
+
+ if (iter->level == iter->min_level)
+ return false;
+
+ /*
+ * Reread the SPTE before stepping down to avoid traversing into page
+ * tables that are no longer linked from this entry.
+ */
+ iter->old_spte = READ_ONCE(*iter->sptep);
+
+ child_pt = spte_to_child_pt(iter->old_spte, iter->level);
+ if (!child_pt)
+ return false;
+
+ iter->level--;
+ iter->pt_path[iter->level - 1] = child_pt;
+ iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ return true;
+}
+
+/*
+ * Steps to the next entry in the current page table, at the current page table
+ * level. The next entry could point to a page backing guest memory or another
+ * page table, or it could be non-present. Returns true if the iterator was
+ * able to step to the next entry in the page table, false if the iterator was
+ * already at the end of the current page table.
+ */
+static bool try_step_side(struct tdp_iter *iter)
+{
+ /*
+ * Check if the iterator is already at the end of the current page
+ * table.
+ */
+ if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
+ KVM_PAGES_PER_HPAGE(iter->level + 1)))
+ return false;
+
+ iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
+ iter->goal_gfn = iter->gfn;
+ iter->sptep++;
+ iter->old_spte = READ_ONCE(*iter->sptep);
+
+ return true;
+}
+
+/*
+ * Tries to traverse back up a level in the paging structure so that the walk
+ * can continue from the next entry in the parent page table. Returns true on a
+ * successful step up, false if already in the root page.
+ */
+static bool try_step_up(struct tdp_iter *iter)
+{
+ if (iter->level == iter->root_level)
+ return false;
+
+ iter->level++;
+ iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ return true;
+}
+
+/*
+ * Step to the next SPTE in a pre-order traversal of the paging structure.
+ * To get to the next SPTE, the iterator either steps down towards the goal
+ * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
+ * higher GFN.
+ *
+ * The basic algorithm is as follows:
+ * 1. If the current SPTE is a non-last-level SPTE, step down into the page
+ * table it points to.
+ * 2. If the iterator cannot step down, it will try to step to the next SPTE
+ * in the current page of the paging structure.
+ * 3. If the iterator cannot step to the next entry in the current page, it will
+ * try to step up to the parent paging structure page. In this case, that
+ * SPTE will have already been visited, and so the iterator must also step
+ * to the side again.
+ */
+void tdp_iter_next(struct tdp_iter *iter)
+{
+ if (try_step_down(iter))
+ return;
+
+ do {
+ if (try_step_side(iter))
+ return;
+ } while (try_step_up(iter));
+ iter->valid = false;
+}
+
+/*
+ * Restart the walk over the paging structure from the root, starting from the
+ * highest gfn the iterator had previously reached. Assumes that the entire
+ * paging structure, except the root page, may have been completely torn down
+ * and rebuilt.
+ */
+void tdp_iter_refresh_walk(struct tdp_iter *iter)
+{
+ gfn_t goal_gfn = iter->goal_gfn;
+
+ if (iter->gfn > goal_gfn)
+ goal_gfn = iter->gfn;
+
+ tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
+ iter->root_level, iter->min_level, goal_gfn);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
new file mode 100644
index 0000000000000..d629a53e1b73f
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef __KVM_X86_MMU_TDP_ITER_H
+#define __KVM_X86_MMU_TDP_ITER_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+/*
+ * A TDP iterator performs a pre-order walk over a TDP paging structure.
+ */
+struct tdp_iter {
+ /*
+ * The iterator will traverse the paging structure towards the mapping
+ * for this GFN.
+ */
+ gfn_t goal_gfn;
+ /* Pointers to the page tables traversed to reach the current SPTE */
+ u64 *pt_path[PT64_ROOT_MAX_LEVEL];
+ /* A pointer to the current SPTE */
+ u64 *sptep;
+ /* The lowest GFN mapped by the current SPTE */
+ gfn_t gfn;
+ /* The level of the root page given to the iterator */
+ int root_level;
+ /* The lowest level the iterator should traverse to */
+ int min_level;
+ /* The iterator's current level within the paging structure */
+ int level;
+ /* A snapshot of the value at sptep */
+ u64 old_spte;
+ /*
+ * Whether the iterator has a valid state. This will be false if the
+ * iterator walks off the end of the paging structure.
+ */
+ bool valid;
+};
+
+/*
+ * Iterates over every SPTE mapping the GFN range [start, end) in a
+ * preorder traversal.
+ */
+#define for_each_tdp_pte(iter, root, root_level, start, end) \
+ for (tdp_iter_start(&iter, root, root_level, PG_LEVEL_4K, start); \
+ iter.valid && iter.gfn < end; \
+ tdp_iter_next(&iter))
+
+u64 *spte_to_child_pt(u64 pte, int level);
+
+void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+ int min_level, gfn_t goal_gfn);
+void tdp_iter_next(struct tdp_iter *iter);
+void tdp_iter_refresh_walk(struct tdp_iter *iter);
+
+#endif /* __KVM_X86_MMU_TDP_ITER_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:28:52

by Ben Gardon

Subject: [PATCH v2 01/20] kvm: x86/mmu: Separate making SPTEs from set_spte

Separate the functions for generating leaf page table entries from the
function that inserts them into the paging structure. This refactoring
will facilitate changes to the MMU synchronization model to use atomic
compare / exchanges (which are not guaranteed to succeed) instead of a
monolithic MMU lock.

No functional change expected.
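
To sketch why the split matters (a hypothetical future caller, not part of
this patch): once the SPTE value is computed separately, it can be
installed with an atomic compare-and-exchange that simply fails if another
thread changed the SPTE first.

/* Illustrative only; make_spte() is the function introduced below. */
static bool try_make_and_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
                                  unsigned int pte_access, int level,
                                  gfn_t gfn, kvm_pfn_t pfn, bool ad_disabled)
{
        u64 old_spte = READ_ONCE(*sptep);
        u64 new_spte;

        /* Compute the new SPTE without touching the paging structure. */
        make_spte(vcpu, pte_access, level, gfn, pfn, old_spte,
                  false /* speculative */, true /* can_unsync */,
                  true /* host_writable */, ad_disabled, &new_spte);

        /* Install it only if the SPTE did not change underneath us. */
        return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}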

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This commit introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 49 ++++++++++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 32e0e5c0524e5..6c9db349600c8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2987,20 +2987,15 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
#define SET_SPTE_NEED_REMOTE_TLB_FLUSH BIT(1)
#define SET_SPTE_SPURIOUS BIT(2)

-static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned int pte_access, int level,
- gfn_t gfn, kvm_pfn_t pfn, bool speculative,
- bool can_unsync, bool host_writable)
+static int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+ bool can_unsync, bool host_writable, bool ad_disabled,
+ u64 *new_spte)
{
u64 spte = 0;
int ret = 0;
- struct kvm_mmu_page *sp;

- if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
- return 0;
-
- sp = sptep_to_sp(sptep);
- if (sp_ad_disabled(sp))
+ if (ad_disabled)
spte |= SPTE_AD_DISABLED_MASK;
else if (kvm_vcpu_ad_need_write_protect(vcpu))
spte |= SPTE_AD_WRPROT_ONLY_MASK;
@@ -3053,8 +3048,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
* is responsibility of mmu_get_page / kvm_sync_page.
* Same reasoning can be applied to dirty page accounting.
*/
- if (!can_unsync && is_writable_pte(*sptep))
- goto set_pte;
+ if (!can_unsync && is_writable_pte(old_spte))
+ goto out;

if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
@@ -3065,15 +3060,37 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}

- if (pte_access & ACC_WRITE_MASK) {
- kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ if (pte_access & ACC_WRITE_MASK)
spte |= spte_shadow_dirty_mask(spte);
- }

if (speculative)
spte = mark_spte_for_access_track(spte);

-set_pte:
+out:
+ *new_spte = spte;
+ return ret;
+}
+
+static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+ unsigned int pte_access, int level,
+ gfn_t gfn, kvm_pfn_t pfn, bool speculative,
+ bool can_unsync, bool host_writable)
+{
+ u64 spte;
+ struct kvm_mmu_page *sp;
+ int ret;
+
+ if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
+ return 0;
+
+ sp = sptep_to_sp(sptep);
+
+ ret = make_spte(vcpu, pte_access, level, gfn, pfn, *sptep, speculative,
+ can_unsync, host_writable, sp_ad_disabled(sp), &spte);
+
+ if (spte & PT_WRITABLE_MASK)
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+
if (*sptep == spte)
ret |= SET_SPTE_SPURIOUS;
else if (mmu_spte_update(sptep, spte))
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:28:59

by Ben Gardon

Subject: [PATCH v2 08/20] kvm: x86/mmu: Separate making non-leaf sptes from link_shadow_page

The TDP MMU page fault handler will need to be able to create non-leaf
SPTEs to build up the paging structures. Rather than re-implementing the
function, factor the SPTE creation out of link_shadow_page.
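
For reference, the TDP MMU page fault handler added later in this series
(patch 11) uses the factored-out helper roughly as follows (condensed
sketch, not the exact hunk):

        if (!is_shadow_present_pte(iter.old_spte)) {
                /* Allocate a child page table for the missing entry... */
                sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
                child_pt = sp->spt;
                clear_page(child_pt);

                /* ...and link it in with a non-leaf SPTE. */
                new_spte = make_nonleaf_spte(child_pt,
                                             !shadow_accessed_mask);
                tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
        }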

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 337ab6823e312..05024b8ae5a4d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2468,21 +2468,30 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
__shadow_walk_next(iterator, *iterator->sptep);
}

-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
- struct kvm_mmu_page *sp)
+static u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
{
u64 spte;

- BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
-
- spte = __pa(sp->spt) | shadow_present_mask | PT_WRITABLE_MASK |
+ spte = __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
shadow_user_mask | shadow_x_mask | shadow_me_mask;

- if (sp_ad_disabled(sp))
+ if (ad_disabled)
spte |= SPTE_AD_DISABLED_MASK;
else
spte |= shadow_accessed_mask;

+ return spte;
+}
+
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+ struct kvm_mmu_page *sp)
+{
+ u64 spte;
+
+ BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
+ spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+
mmu_spte_set(sptep, spte);

mmu_page_add_parent_pte(vcpu, sp, sptep);
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:29:10

by Ben Gardon

Subject: [PATCH v2 11/20] kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU

Attach struct kvm_mmu_pages to every page in the TDP MMU to track
metadata, facilitate NX reclaim, and enable improved parallelism of MMU
operations in future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0ec1dd271a32..2568dcd134156 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -989,7 +989,11 @@ struct kvm_arch {
* operations.
*/
bool tdp_mmu_enabled;
+
+ /* List of struct tdp_mmu_pages being used as roots */
struct list_head tdp_mmu_roots;
+ /* List of struct tdp_mmu_pages not being used as roots */
+ struct list_head tdp_mmu_pages;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f92c12c4ce31a..78d41a1949651 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -34,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
kvm->arch.tdp_mmu_enabled = true;

INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+ INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
@@ -188,6 +189,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
u64 *pt;
+ struct kvm_mmu_page *sp;
u64 old_child_spte;
int i;

@@ -253,6 +255,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
*/
if (was_present && !was_leaf && (pfn_changed || !is_present)) {
pt = spte_to_child_pt(old_spte, level);
+ sp = sptep_to_sp(pt);
+
+ list_del(&sp->link);

for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
old_child_spte = *(pt + i);
@@ -266,6 +271,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
KVM_PAGES_PER_HPAGE(level));

free_page((unsigned long)pt);
+ kmem_cache_free(mmu_page_header_cache, sp);
}
}

@@ -434,8 +440,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
bool huge_page_disallowed = exec && nx_huge_page_workaround_enabled;
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct tdp_iter iter;
- struct kvm_mmu_memory_cache *pf_pt_cache =
- &vcpu->arch.mmu_shadow_page_cache;
+ struct kvm_mmu_page *sp;
u64 *child_pt;
u64 new_spte;
int ret;
@@ -479,7 +484,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
}

if (!is_shadow_present_pte(iter.old_spte)) {
- child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
+ sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
+ list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
+ child_pt = sp->spt;
clear_page(child_pt);
new_spte = make_nonleaf_spte(child_pt,
!shadow_accessed_mask);
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:29:21

by Ben Gardon

Subject: [PATCH v2 13/20] kvm: x86/mmu: Add access tracking for tdp_mmu

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. The
main Linux MM uses the access tracking MMU notifiers for swap and other
features. Add hooks to handle the test/flush HVA (range) family of
MMU notifiers.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 34 +++++-----
arch/x86/kvm/mmu/mmu_internal.h | 17 +++++
arch/x86/kvm/mmu/tdp_mmu.c | 113 ++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 4 ++
4 files changed, 145 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 00534133f99fc..e6ab79d8f215f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -175,8 +175,6 @@ static struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;

-static u64 __read_mostly shadow_nx_mask;
-static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
static u64 __read_mostly shadow_user_mask;
static u64 __read_mostly shadow_mmio_value;
static u64 __read_mostly shadow_mmio_access_mask;
@@ -221,7 +219,6 @@ static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
static u8 __read_mostly shadow_phys_bits;

static void mmu_spte_set(u64 *sptep, u64 spte);
-static bool is_executable_pte(u64 spte);
static union kvm_mmu_page_role
kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);

@@ -516,11 +513,6 @@ static int is_nx(struct kvm_vcpu *vcpu)
return vcpu->arch.efer & EFER_NX;
}

-static bool is_executable_pte(u64 spte)
-{
- return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
-}
-
static gfn_t pse36_gfn_delta(u32 gpte)
{
int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
@@ -695,14 +687,6 @@ static bool spte_has_volatile_bits(u64 spte)
return false;
}

-static bool is_accessed_spte(u64 spte)
-{
- u64 accessed_mask = spte_shadow_accessed_mask(spte);
-
- return accessed_mask ? spte & accessed_mask
- : !is_access_track_spte(spte);
-}
-
/* Rules for using mmu_spte_set:
* Set the sptep from nonpresent to present.
* Note: the sptep being assigned *must* be either not present
@@ -838,7 +822,7 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
return __get_spte_lockless(sptep);
}

-static u64 mark_spte_for_access_track(u64 spte)
+u64 mark_spte_for_access_track(u64 spte)
{
if (spte_ad_enabled(spte))
return spte & ~shadow_accessed_mask;
@@ -1842,12 +1826,24 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)

int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
{
- return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
+ int young = false;
+
+ young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
+ if (kvm->arch.tdp_mmu_enabled)
+ young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
+
+ return young;
}

int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+ int young = false;
+
+ young = kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+ if (kvm->arch.tdp_mmu_enabled)
+ young |= kvm_tdp_mmu_test_age_hva(kvm, hva);
+
+ return young;
}

#ifdef MMU_DEBUG
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index f7fe5616eff98..d886fe750be38 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -122,6 +122,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,

static u64 __read_mostly shadow_dirty_mask;
static u64 __read_mostly shadow_accessed_mask;
+static u64 __read_mostly shadow_nx_mask;
+static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */

/*
* SPTEs used by MMUs without A/D bits are marked with SPTE_AD_DISABLED_MASK;
@@ -205,6 +207,19 @@ static inline bool is_access_track_spte(u64 spte)
return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
}

+static inline bool is_accessed_spte(u64 spte)
+{
+ u64 accessed_mask = spte_shadow_accessed_mask(spte);
+
+ return accessed_mask ? spte & accessed_mask
+ : !is_access_track_spte(spte);
+}
+
+static inline bool is_executable_pte(u64 spte)
+{
+ return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
+}
+
void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
u64 pages);

@@ -247,4 +262,6 @@ bool is_nx_huge_page_enabled(void);

void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

+u64 mark_spte_for_access_track(u64 spte);
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9ec6c26ed6619..575970d8805a4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -168,6 +168,18 @@ static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return sp->role.smm ? 1 : 0;
}

+static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
+{
+ bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+
+ if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level))
+ return;
+
+ if (is_accessed_spte(old_spte) &&
+ (!is_accessed_spte(new_spte) || pfn_changed))
+ kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -279,10 +291,11 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level)
{
__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
+ handle_changed_spte_acc_track(old_spte, new_spte, level);
}

-static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
- u64 new_spte)
+static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte, bool record_acc_track)
{
u64 *root_pt = tdp_iter_root_pt(iter);
struct kvm_mmu_page *root = sptep_to_sp(root_pt);
@@ -290,13 +303,36 @@ static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,

*iter->sptep = new_spte;

- handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
- iter->level);
+ __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+ iter->level);
+ if (record_acc_track)
+ handle_changed_spte_acc_track(iter->old_spte, new_spte,
+ iter->level);
+}
+
+static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte)
+{
+ __tdp_mmu_set_spte(kvm, iter, new_spte, true);
+}
+
+static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ __tdp_mmu_set_spte(kvm, iter, new_spte, false);
}

#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)

+#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _root, _start, _end) \
+ if (!is_shadow_present_pte(_iter.old_spte) || \
+ !is_last_spte(_iter.old_spte, _iter.level)) \
+ continue; \
+ else
+
#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
for_each_tdp_pte(_iter, __va(_mmu->root_hpa), \
_mmu->shadow_root_level, _start, _end)
@@ -572,3 +608,72 @@ int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
zap_gfn_range_hva_wrapper);
}
+
+/*
+ * Mark SPTEs in the range of GFNs [start, end) unaccessed and return non-zero
+ * if any of the GFNs in the range have been accessed.
+ */
+static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_mmu_page *root, gfn_t start, gfn_t end,
+ unsigned long unused)
+{
+ struct tdp_iter iter;
+ int young = 0;
+ u64 new_spte = 0;
+
+ tdp_root_for_each_leaf_pte(iter, root, start, end) {
+ /*
+ * If we have a non-accessed entry we don't need to change the
+ * pte.
+ */
+ if (!is_accessed_spte(iter.old_spte))
+ continue;
+
+ new_spte = iter.old_spte;
+
+ if (spte_ad_enabled(new_spte)) {
+ clear_bit((ffs(shadow_accessed_mask) - 1),
+ (unsigned long *)&new_spte);
+ } else {
+ /*
+ * Capture the dirty status of the page, so that it doesn't get
+ * lost when the SPTE is marked for access tracking.
+ */
+ if (is_writable_pte(new_spte))
+ kvm_set_pfn_dirty(spte_to_pfn(new_spte));
+
+ new_spte = mark_spte_for_access_track(new_spte);
+ }
+
+ tdp_mmu_set_spte_no_acc_track(kvm, &iter, new_spte);
+ young = 1;
+ }
+
+ return young;
+}
+
+int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
+ unsigned long end)
+{
+ return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
+ age_gfn_range);
+}
+
+static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
+ unsigned long unused2)
+{
+ struct tdp_iter iter;
+
+ tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1)
+ if (is_accessed_spte(iter.old_spte))
+ return 1;
+
+ return 0;
+}
+
+int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
+{
+ return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
+ test_age_gfn);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 026ceb6284102..bdb86f61e75eb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -21,4 +21,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
unsigned long end);
+
+int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
+ unsigned long end);
+int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:29:41

by Ben Gardon

Subject: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
TDP MMU roots properly and implement other MMU functions which require
tearing down mappings. Future patches will add functions to populate the
page tables, but as of this patch there is not yet any work for these
functions to do.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 15 +++++
arch/x86/kvm/mmu/tdp_iter.c | 5 ++
arch/x86/kvm/mmu/tdp_iter.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 109 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 2 +
5 files changed, 132 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8bf20723c6177..337ab6823e312 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5787,6 +5787,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_reload_remote_mmus(kvm);

kvm_zap_obsolete_pages(kvm);
+
+ if (kvm->arch.tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_all(kvm);
+
spin_unlock(&kvm->mmu_lock);
}

@@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
int i;
+ bool flush;

spin_lock(&kvm->mmu_lock);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
}
}

+ if (kvm->arch.tdp_mmu_enabled) {
+ flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+ }
+
spin_unlock(&kvm->mmu_lock);
}

@@ -6012,6 +6023,10 @@ void kvm_mmu_zap_all(struct kvm *kvm)
}

kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ if (kvm->arch.tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_all(kvm);
+
spin_unlock(&kvm->mmu_lock);
}

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b07e9f0c5d4aa..701eb753b701e 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -174,3 +174,8 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
iter->root_level, iter->min_level, goal_gfn);
}

+u64 *tdp_iter_root_pt(struct tdp_iter *iter)
+{
+ return iter->pt_path[iter->root_level - 1];
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index d629a53e1b73f..884ed2c70bfed 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -52,5 +52,6 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
int min_level, gfn_t goal_gfn);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_refresh_walk(struct tdp_iter *iter);
+u64 *tdp_iter_root_pt(struct tdp_iter *iter);

#endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f2bd3a6928ce9..9b5cd4a832f1a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -56,8 +56,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
return sp->tdp_mmu_page && sp->root_count;
}

+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end);
+
void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
+ gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+
lockdep_assert_held(&kvm->mmu_lock);

WARN_ON(root->root_count);
@@ -65,6 +70,8 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)

list_del(&root->link);

+ zap_gfn_range(kvm, root, 0, max_gfn);
+
free_page((unsigned long)root->spt);
kmem_cache_free(mmu_page_header_cache, root);
}
@@ -155,6 +162,11 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level);

+static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
+{
+ return sp->role.smm ? 1 : 0;
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -262,3 +274,100 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
{
__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
}
+
+static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte)
+{
+ u64 *root_pt = tdp_iter_root_pt(iter);
+ struct kvm_mmu_page *root = sptep_to_sp(root_pt);
+ int as_id = kvm_mmu_page_as_id(root);
+
+ *iter->sptep = new_spte;
+
+ handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+ iter->level);
+}
+
+#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+
+static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ kvm_flush_remote_tlbs(kvm);
+ cond_resched_lock(&kvm->mmu_lock);
+ tdp_iter_refresh_walk(iter);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end)
+{
+ struct tdp_iter iter;
+ bool flush_needed = false;
+
+ tdp_root_for_each_pte(iter, root, start, end) {
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ /*
+ * If this is a non-last-level SPTE that covers a larger range
+ * than should be zapped, continue, and zap the mappings at a
+ * lower level.
+ */
+ if ((iter.gfn < start ||
+ iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+ !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ tdp_mmu_set_spte(kvm, &iter, 0);
+
+ flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ }
+ return flush_needed;
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_mmu_page *root;
+ bool flush = false;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ flush |= zap_gfn_range(kvm, root, start, end);
+
+ put_tdp_mmu_root(kvm, root);
+ }
+
+ return flush;
+}
+
+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+ gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+ bool flush;
+
+ flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ac0ef91294420..6de2d007fc03c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -12,4 +12,6 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);

+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_tdp_mmu_zap_all(struct kvm *kvm);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:30:11

by Ben Gardon

Subject: [PATCH v2 16/20] kvm: x86/mmu: Support disabling dirty logging for the tdp MMU

Dirty logging ultimately breaks down MMU mappings to 4k granularity.
When dirty logging is no longer needed, these granular mappings
represent a useless performance penalty. When dirty logging is disabled,
search the paging structure for mappings that could be re-constituted
into a large page mapping. Zap those mappings so that they can be
faulted in again at a higher mapping level.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 ++
arch/x86/kvm/mmu/tdp_mmu.c | 59 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 2 ++
3 files changed, 64 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2ce57761d2f1..8fcf5e955c475 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5918,6 +5918,9 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
kvm_mmu_zap_collapsible_spte, true);
+
+ if (kvm->arch.tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
spin_unlock(&kvm->mmu_lock);
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 099c7d68aeb1d..94624cc1df84c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1019,3 +1019,62 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
return spte_set;
}

+/*
+ * Clear non-leaf entries (and free associated page tables) which could
+ * be replaced by large mappings, for GFNs within the slot.
+ */
+static void zap_collapsible_spte_range(struct kvm *kvm,
+ struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end)
+{
+ struct tdp_iter iter;
+ kvm_pfn_t pfn;
+ bool spte_set = false;
+
+ tdp_root_for_each_pte(iter, root, start, end) {
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ pfn = spte_to_pfn(iter.old_spte);
+ if (kvm_is_reserved_pfn(pfn) ||
+ !PageTransCompoundMap(pfn_to_page(pfn)))
+ continue;
+
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ spte_set = true;
+
+ spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ }
+
+ if (spte_set)
+ kvm_flush_remote_tlbs(kvm);
+}
+
+/*
+ * Clear non-leaf entries (and free associated page tables) which could
+ * be replaced by large mappings, for GFNs within the slot.
+ */
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ zap_collapsible_spte_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages);
+
+ put_tdp_mmu_root(kvm, root);
+ }
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index add8bb97c56dd..dc4cdc5cc29f5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -38,4 +38,6 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
gfn_t gfn, unsigned long mask,
bool wrprot);
bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:30:34

by Ben Gardon

Subject: [PATCH v2 15/20] kvm: x86/mmu: Support dirty logging for the TDP MMU

Dirty logging is a key feature of the KVM MMU and must be supported by
the TDP MMU. Add support for both the write protection and PML dirty
logging modes.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 20 ++-
arch/x86/kvm/mmu/mmu_internal.h | 6 +
arch/x86/kvm/mmu/tdp_iter.h | 7 +-
arch/x86/kvm/mmu/tdp_mmu.c | 292 +++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu/tdp_mmu.h | 10 ++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 6 +-
7 files changed, 327 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef9ea3f45241b..b2ce57761d2f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -277,12 +277,6 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
}

-static inline bool spte_ad_need_write_protect(u64 spte)
-{
- MMU_WARN_ON(is_mmio_spte(spte));
- return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
-}
-
bool is_nx_huge_page_enabled(void)
{
return READ_ONCE(nx_huge_pages);
@@ -1483,6 +1477,9 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
{
struct kvm_rmap_head *rmap_head;

+ if (kvm->arch.tdp_mmu_enabled)
+ kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+ slot->base_gfn + gfn_offset, mask, true);
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PG_LEVEL_4K, slot);
@@ -1509,6 +1506,9 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
{
struct kvm_rmap_head *rmap_head;

+ if (kvm->arch.tdp_mmu_enabled)
+ kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+ slot->base_gfn + gfn_offset, mask, false);
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PG_LEVEL_4K, slot);
@@ -5853,6 +5853,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+ if (kvm->arch.tdp_mmu_enabled)
+ flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
spin_unlock(&kvm->mmu_lock);

/*
@@ -5941,6 +5943,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,

spin_lock(&kvm->mmu_lock);
flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
+ if (kvm->arch.tdp_mmu_enabled)
+ flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
spin_unlock(&kvm->mmu_lock);

/*
@@ -5962,6 +5966,8 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
false);
+ if (kvm->arch.tdp_mmu_enabled)
+ flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M);
spin_unlock(&kvm->mmu_lock);

if (flush)
@@ -5976,6 +5982,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,

spin_lock(&kvm->mmu_lock);
flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
+ if (kvm->arch.tdp_mmu_enabled)
+ flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
spin_unlock(&kvm->mmu_lock);

if (flush)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 49c3a04d2b894..a7230532bb845 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -232,6 +232,12 @@ static inline bool is_executable_pte(u64 spte)
return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
}

+static inline bool spte_ad_need_write_protect(u64 spte)
+{
+ MMU_WARN_ON(is_mmio_spte(spte));
+ return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
+}
+
void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
u64 pages);

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 884ed2c70bfed..47170d0dc98e5 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -41,11 +41,14 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte(iter, root, root_level, start, end) \
- for (tdp_iter_start(&iter, root, root_level, PG_LEVEL_4K, start); \
+#define for_each_tdp_pte_min_level(iter, root, root_level, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, root_level, min_level, start); \
iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))

+#define for_each_tdp_pte(iter, root, root_level, start, end) \
+ for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
+
u64 *spte_to_child_pt(u64 pte, int level);

void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 90abd55c89375..099c7d68aeb1d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -180,6 +180,24 @@ static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}

+static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level)
+{
+ bool pfn_changed;
+ struct kvm_memory_slot *slot;
+
+ if (level > PG_LEVEL_4K)
+ return;
+
+ pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+
+ if ((!is_writable_pte(old_spte) || pfn_changed) &&
+ is_writable_pte(new_spte)) {
+ slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
+ mark_page_dirty_in_slot(slot, gfn);
+ }
+}
+
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
@@ -292,10 +310,13 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
{
__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
handle_changed_spte_acc_track(old_spte, new_spte, level);
+ handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+ new_spte, level);
}

static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
- u64 new_spte, bool record_acc_track)
+ u64 new_spte, bool record_acc_track,
+ bool record_dirty_log)
{
u64 *root_pt = tdp_iter_root_pt(iter);
struct kvm_mmu_page *root = sptep_to_sp(root_pt);
@@ -308,19 +329,30 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
if (record_acc_track)
handle_changed_spte_acc_track(iter->old_spte, new_spte,
iter->level);
+ if (record_dirty_log)
+ handle_changed_spte_dirty_log(kvm, as_id, iter->gfn,
+ iter->old_spte, new_spte,
+ iter->level);
}

static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true);
+ __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
}

static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, false);
+ __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+}
+
+static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
}

#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
@@ -644,6 +676,7 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,

new_spte = mark_spte_for_access_track(new_spte);
}
+ new_spte &= ~shadow_dirty_mask;

tdp_mmu_set_spte_no_acc_track(kvm, &iter, new_spte);
young = 1;
@@ -733,3 +766,256 @@ int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
set_tdp_spte);
}

+/*
+ * Remove write access from all the SPTEs mapping GFNs [start, end). Only
+ * SPTEs at or above min_level will be write-protected.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, int min_level)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
+
+ for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+ min_level, start, end) {
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+
+ tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+ spte_set = true;
+
+ tdp_mmu_iter_cond_resched(kvm, &iter);
+ }
+ return spte_set;
+}
+
+/*
+ * Remove write access from all the SPTEs mapping GFNs in the memslot. Will
+ * only affect leaf SPTEs down to min_level.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+ int min_level)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+ bool spte_set = false;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages, min_level) ||
+ spte_set;
+
+ put_tdp_mmu_root(kvm, root);
+ }
+
+ return spte_set;
+}
+
+/*
+ * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
+ * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
+ * If AD bits are not enabled, this will require clearing the writable bit on
+ * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
+ * be flushed.
+ */
+static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ tdp_root_for_each_leaf_pte(iter, root, start, end) {
+ if (spte_ad_need_write_protect(iter.old_spte)) {
+ if (is_writable_pte(iter.old_spte))
+ new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+ else
+ continue;
+ } else {
+ if (iter.old_spte & shadow_dirty_mask)
+ new_spte = iter.old_spte & ~shadow_dirty_mask;
+ else
+ continue;
+ }
+
+ tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+ spte_set = true;
+
+ tdp_mmu_iter_cond_resched(kvm, &iter);
+ }
+ return spte_set;
+}
+
+/*
+ * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
+ * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
+ * If AD bits are not enabled, this will require clearing the writable bit on
+ * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
+ * be flushed.
+ */
+bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+ bool spte_set = false;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages) || spte_set;
+
+ put_tdp_mmu_root(kvm, root);
+ }
+
+ return spte_set;
+}
+
+/*
+ * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
+ * set in mask, starting at gfn. The given memslot is expected to contain all
+ * the GFNs represented by set bits in the mask. If AD bits are enabled,
+ * clearing the dirty status will involve clearing the dirty bit on each SPTE
+ * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
+ */
+static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn, unsigned long mask, bool wrprot)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+
+ tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+ gfn + BITS_PER_LONG) {
+ if (!mask)
+ break;
+
+ if (iter.level > PG_LEVEL_4K ||
+ !(mask & (1UL << (iter.gfn - gfn))))
+ continue;
+
+ if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
+ if (is_writable_pte(iter.old_spte))
+ new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+ else
+ continue;
+ } else {
+ if (iter.old_spte & shadow_dirty_mask)
+ new_spte = iter.old_spte & ~shadow_dirty_mask;
+ else
+ continue;
+ }
+
+ tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+
+ mask &= ~(1UL << (iter.gfn - gfn));
+ }
+}
+
+/*
+ * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
+ * set in mask, starting at gfn. The given memslot is expected to contain all
+ * the GFNs represented by set bits in the mask. If AD bits are enabled,
+ * clearing the dirty status will involve clearing the dirty bit on each SPTE
+ * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
+ */
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, unsigned long mask,
+ bool wrprot)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
+ }
+}
+
+/*
+ * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
+ * only used for PML, and so will involve setting the dirty bit on each SPTE.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ tdp_root_for_each_pte(iter, root, start, end) {
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ new_spte = iter.old_spte | shadow_dirty_mask;
+
+ tdp_mmu_set_spte(kvm, &iter, new_spte);
+ spte_set = true;
+
+ tdp_mmu_iter_cond_resched(kvm, &iter);
+ }
+
+ return spte_set;
+}
+
+/*
+ * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
+ * only used for PML, and so will involve setting the dirty bit on each SPTE.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+ bool spte_set = false;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages) || spte_set;
+
+ put_tdp_mmu_root(kvm, root);
+ }
+ return spte_set;
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 6569792f40d4f..add8bb97c56dd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -28,4 +28,14 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);

int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
pte_t *host_ptep);
+
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+ int min_level);
+bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, unsigned long mask,
+ bool wrprot);
+bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6f45687ba89c..7f2e2a09ebbd9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -798,6 +798,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);

struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e85392131252..2541a17ff1c45 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -143,8 +143,6 @@ static void hardware_disable_all(void);

static void kvm_io_bus_destroy(struct kvm_io_bus *bus);

-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
-
__visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);

@@ -2645,8 +2643,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
}
EXPORT_SYMBOL_GPL(kvm_clear_guest);

-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
- gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn)
{
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
@@ -2654,6 +2651,7 @@ static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
+EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);

void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:30:43

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 06/20] KVM: Cache as_id in kvm_memory_slot

From: Peter Xu <[email protected]>

Cache the address space ID just like the slot ID. It will be used in
order to fill in the dirty ring entries.
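
A rough sketch of the intended consumer, assuming the dirty ring entry
layout from the dirty ring proposal (the struct and helper below are
illustrative, not code from this series):

#include <stdint.h>

struct dirty_ring_entry {
	uint32_t flags;
	uint32_t slot;     /* as_id in bits 16..31, slot id in bits 0..15 */
	uint64_t offset;   /* gfn - slot base gfn */
};

/* Having slot->as_id cached means the publisher can build the combined
 * (as_id, slot id) field without walking back to the address space. */
static struct dirty_ring_entry make_entry(uint16_t as_id, uint16_t slot_id,
					  uint64_t gfn, uint64_t base_gfn)
{
	struct dirty_ring_entry e = {
		.flags = 0,
		.slot = ((uint32_t)as_id << 16) | slot_id,
		.offset = gfn - base_gfn,
	};

	return e;
}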

Suggested-by: Paolo Bonzini <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 05e3c2fb3ef78..c6f45687ba89c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -346,6 +346,7 @@ struct kvm_memory_slot {
unsigned long userspace_addr;
u32 flags;
short id;
+ u16 as_id;
};

static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68edd25dcb11f..2e85392131252 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1247,6 +1247,11 @@ static int kvm_delete_memslot(struct kvm *kvm,

memset(&new, 0, sizeof(new));
new.id = old->id;
+ /*
+ * This is only for debugging purposes; it should never be referenced
+ * for a removed memslot.
+ */
+ new.as_id = as_id;

r = kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
if (r)
@@ -1313,6 +1318,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!mem->memory_size)
return kvm_delete_memslot(kvm, mem, &old, as_id);

+ new.as_id = as_id;
new.id = id;
new.base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
new.npages = mem->memory_size >> PAGE_SHIFT;
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:30:49

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 12/20] kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
hooks to handle the invalidate range family of MMU notifiers.
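
The per-memslot HVA-to-GFN clamping done by the new
kvm_tdp_mmu_handle_hva_range() helper can be modeled in isolation as
below; the struct and function names are simplified stand-ins for the
kernel's, used only to illustrate the arithmetic:

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct memslot {
	uint64_t userspace_addr;   /* HVA of the slot's first page */
	uint64_t base_gfn;
	uint64_t npages;
};

/* Returns 0 and fills [gfn_start, gfn_end) if the HVA range overlaps the
 * slot, -1 otherwise. */
static int clamp_hva_range(const struct memslot *slot,
			   uint64_t hva_start, uint64_t hva_end,
			   uint64_t *gfn_start, uint64_t *gfn_end)
{
	uint64_t slot_end = slot->userspace_addr +
			    (slot->npages << PAGE_SHIFT);

	if (hva_start < slot->userspace_addr)
		hva_start = slot->userspace_addr;
	if (hva_end > slot_end)
		hva_end = slot_end;
	if (hva_start >= hva_end)
		return -1;

	*gfn_start = slot->base_gfn +
		     ((hva_start - slot->userspace_addr) >> PAGE_SHIFT);
	/* Round up so a partially covered last page is still included. */
	*gfn_end = slot->base_gfn +
		   ((hva_end + PAGE_SIZE - 1 - slot->userspace_addr) >> PAGE_SHIFT);
	return 0;
}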

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 9 ++++-
arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++++++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.h | 2 +
3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 421a12a247b67..00534133f99fc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1781,7 +1781,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
unsigned flags)
{
- return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+ int r;
+
+ r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+
+ if (kvm->arch.tdp_mmu_enabled)
+ r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
+
+ return r;
}

int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 78d41a1949651..9ec6c26ed6619 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -58,7 +58,7 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
}

static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end);
+ gfn_t start, gfn_t end, bool can_yield);

void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
@@ -71,7 +71,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)

list_del(&root->link);

- zap_gfn_range(kvm, root, 0, max_gfn);
+ zap_gfn_range(kvm, root, 0, max_gfn, false);

free_page((unsigned long)root->spt);
kmem_cache_free(mmu_page_header_cache, root);
@@ -318,9 +318,14 @@ static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
* non-root pages mapping GFNs strictly within that range. Returns true if
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
+ * If can_yield is true, will release the MMU lock and reschedule if the
+ * scheduler needs the CPU or there is contention on the MMU lock. If this
+ * function cannot yield, it will not release the MMU lock or reschedule and
+ * the caller must ensure it does not supply too large a GFN range, or the
+ * operation can cause a soft lockup.
*/
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end)
+ gfn_t start, gfn_t end, bool can_yield)
{
struct tdp_iter iter;
bool flush_needed = false;
@@ -341,7 +346,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

tdp_mmu_set_spte(kvm, &iter, 0);

- flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ if (can_yield)
+ flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ else
+ flush_needed = true;
}
return flush_needed;
}
@@ -364,7 +372,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
*/
get_tdp_mmu_root(kvm, root);

- flush |= zap_gfn_range(kvm, root, start, end);
+ flush |= zap_gfn_range(kvm, root, start, end, true);

put_tdp_mmu_root(kvm, root);
}
@@ -502,3 +510,65 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

return ret;
}
+
+static int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm, unsigned long start,
+ unsigned long end, unsigned long data,
+ int (*handler)(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_mmu_page *root, gfn_t start,
+ gfn_t end, unsigned long data))
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;
+ struct kvm_mmu_page *root;
+ int ret = 0;
+ int as_id;
+
+ for_each_tdp_mmu_root(kvm, root) {
+ /*
+ * Take a reference on the root so that it cannot be freed if
+ * this thread releases the MMU lock and yields in this loop.
+ */
+ get_tdp_mmu_root(kvm, root);
+
+ as_id = kvm_mmu_page_as_id(root);
+ slots = __kvm_memslots(kvm, as_id);
+ kvm_for_each_memslot(memslot, slots) {
+ unsigned long hva_start, hva_end;
+ gfn_t gfn_start, gfn_end;
+
+ hva_start = max(start, memslot->userspace_addr);
+ hva_end = min(end, memslot->userspace_addr +
+ (memslot->npages << PAGE_SHIFT));
+ if (hva_start >= hva_end)
+ continue;
+ /*
+ * {gfn(page) | page intersects with [hva_start, hva_end)} =
+ * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+ */
+ gfn_start = hva_to_gfn_memslot(hva_start, memslot);
+ gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+
+ ret |= handler(kvm, memslot, root, gfn_start,
+ gfn_end, data);
+ }
+
+ put_tdp_mmu_root(kvm, root);
+ }
+
+ return ret;
+}
+
+static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ struct kvm_mmu_page *root, gfn_t start,
+ gfn_t end, unsigned long unused)
+{
+ return zap_gfn_range(kvm, root, start, end, false);
+}
+
+int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
+ unsigned long end)
+{
+ return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
+ zap_gfn_range_hva_wrapper);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 4d111a4dd332f..026ceb6284102 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -19,4 +19,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int map_writable, int max_level, kvm_pfn_t pfn,
bool prefault, bool is_tdp);

+int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
+ unsigned long end);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:30:53

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 09/20] kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg

In order to avoid creating executable hugepages in the TDP MMU PF
handler, remove the dependency between disallowed_hugepage_adjust and
the shadow_walk_iterator. This will open the function up to being used
by the TDP MMU PF handler in a future patch.
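
The adjustment itself is just bit arithmetic on the GFN; a standalone
sketch (constants mirror x86 paging, names are local to this example):

#include <stdint.h>
#include <stdio.h>

#define PT64_LEVEL_BITS 9
/* Pages covered by one entry at 'level' (level 1 == 4K). */
#define PAGES_PER_HPAGE(level) (1ULL << (((level) - 1) * PT64_LEVEL_BITS))

int main(void)
{
	uint64_t gfn = 0x12345678;
	uint64_t pfn = 0xabc00000;	/* PFN of a 1G-aligned backing page */
	int level = 3;			/* 1G mapping dropped to 2M */

	/* Mask selecting the next nine GFN bits below 'level'. */
	uint64_t page_mask = PAGES_PER_HPAGE(level) -
			     PAGES_PER_HPAGE(level - 1);

	pfn |= gfn & page_mask;		/* fold the 2M-index bits into the PFN */
	level--;

	printf("new level %d, pfn 0x%llx\n", level, (unsigned long long)pfn);
	return 0;
}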

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 13 +++++++------
arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 05024b8ae5a4d..288b97e96202e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3243,13 +3243,12 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
return level;
}

-static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
- gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
+static void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+ kvm_pfn_t *pfnp, int *levelp)
{
int level = *levelp;
- u64 spte = *it.sptep;

- if (it.level == level && level > PG_LEVEL_4K &&
+ if (cur_level == level && level > PG_LEVEL_4K &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte)) {
/*
@@ -3259,7 +3258,8 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
* patching back for them into pfn the next 9 bits of
* the address.
*/
- u64 page_mask = KVM_PAGES_PER_HPAGE(level) - KVM_PAGES_PER_HPAGE(level - 1);
+ u64 page_mask = KVM_PAGES_PER_HPAGE(level) -
+ KVM_PAGES_PER_HPAGE(level - 1);
*pfnp |= gfn & page_mask;
(*levelp)--;
}
@@ -3292,7 +3292,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
* large page, as the leaf could be executable.
*/
if (nx_huge_page_workaround_enabled)
- disallowed_hugepage_adjust(it, gfn, &pfn, &level);
+ disallowed_hugepage_adjust(*it.sptep, gfn, it.level,
+ &pfn, &level);

base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
if (it.level == level)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 9a1a15f19beb6..50e268eb8e1a9 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -695,7 +695,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
* large page, as the leaf could be executable.
*/
if (nx_huge_page_workaround_enabled)
- disallowed_hugepage_adjust(it, gw->gfn, &pfn, &level);
+ disallowed_hugepage_adjust(*it.sptep, gw->gfn, it.level,
+ &pfn, &level);

base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
if (it.level == level)
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:31:43

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 14/20] kvm: x86/mmu: Support changed pte notifier in tdp MMU

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
a hook and handle the change_pte MMU notifier.
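
A hedged model of the SPTE rewrite performed for change_pte: keep the
attribute bits, swap in the new PFN, and drop write access so the next
write faults and re-validates. Bit positions follow the masks used in this
series, but the names below are local to the sketch, and the real helper
additionally marks the SPTE for access tracking, which is omitted here:

#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_ADDR_MASK	0x000ffffffffff000ULL	/* bits 12..51: PFN */
#define SKETCH_WRITABLE		(1ULL << 1)
#define SKETCH_HOST_WRITABLE	(1ULL << 10)	/* PT_FIRST_AVAIL_BITS_SHIFT */

static uint64_t make_changed_pte_spte(uint64_t old_spte, uint64_t new_pfn)
{
	/* Preserve everything except the old physical address. */
	uint64_t new_spte = old_spte & ~SKETCH_ADDR_MASK;

	new_spte |= new_pfn << SKETCH_PAGE_SHIFT;
	new_spte &= ~(SKETCH_WRITABLE | SKETCH_HOST_WRITABLE);
	return new_spte;
}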

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 ++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 29 +++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 56 +++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
4 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e6ab79d8f215f..ef9ea3f45241b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -135,9 +135,6 @@ enum {

#include <trace/events/kvm.h>

-#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
-#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
-
/* make pte_list_desc fit well in cache line */
#define PTE_LIST_EXT 3

@@ -1615,13 +1612,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
pte_list_remove(rmap_head, sptep);
goto restart;
} else {
- new_spte = *sptep & ~PT64_BASE_ADDR_MASK;
- new_spte |= (u64)new_pfn << PAGE_SHIFT;
-
- new_spte &= ~PT_WRITABLE_MASK;
- new_spte &= ~SPTE_HOST_WRITEABLE;
-
- new_spte = mark_spte_for_access_track(new_spte);
+ new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+ *sptep, new_pfn);

mmu_spte_clear_track_bits(sptep);
mmu_spte_set(sptep, new_spte);
@@ -1777,7 +1769,14 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,

int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
{
- return kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+ int r;
+
+ r = kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+
+ if (kvm->arch.tdp_mmu_enabled)
+ r |= kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte);
+
+ return r;
}

static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d886fe750be38..49c3a04d2b894 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -115,6 +115,12 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
* PT64_LEVEL_BITS))) - 1))

+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
+
#define ACC_EXEC_MASK 1
#define ACC_WRITE_MASK PT_WRITABLE_MASK
#define ACC_USER_MASK PT_USER_MASK
@@ -132,6 +138,12 @@ static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
*/
static u64 __read_mostly shadow_acc_track_mask;

+#define PT_FIRST_AVAIL_BITS_SHIFT 10
+#define PT64_SECOND_AVAIL_BITS_SHIFT 54
+
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
+
/* Functions for interpreting SPTEs */
static inline bool is_mmio_spte(u64 spte)
{
@@ -264,4 +276,21 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

u64 mark_spte_for_access_track(u64 spte);

+static inline u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte,
+ kvm_pfn_t new_pfn)
+{
+ u64 new_spte;
+
+ new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
+ new_spte |= (u64)new_pfn << PAGE_SHIFT;
+
+ new_spte &= ~PT_WRITABLE_MASK;
+ new_spte &= ~SPTE_HOST_WRITEABLE;
+
+ new_spte = mark_spte_for_access_track(new_spte);
+
+ return new_spte;
+}
+
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 575970d8805a4..90abd55c89375 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -677,3 +677,59 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
test_age_gfn);
}
+
+/*
+ * Handle the changed_pte MMU notifier for the TDP MMU.
+ * data is a pointer to the new pte_t mapping the HVA specified by the MMU
+ * notifier.
+ * Returns non-zero if a flush is needed before releasing the MMU lock.
+ */
+static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+ struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
+ unsigned long data)
+{
+ struct tdp_iter iter;
+ pte_t *ptep = (pte_t *)data;
+ kvm_pfn_t new_pfn;
+ u64 new_spte;
+ int need_flush = 0;
+
+ WARN_ON(pte_huge(*ptep));
+
+ new_pfn = pte_pfn(*ptep);
+
+ tdp_root_for_each_pte(iter, root, gfn, gfn + 1) {
+ if (iter.level != PG_LEVEL_4K)
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte))
+ break;
+
+ tdp_mmu_set_spte(kvm, &iter, 0);
+
+ kvm_flush_remote_tlbs_with_address(kvm, iter.gfn, 1);
+
+ if (!pte_write(*ptep)) {
+ new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+ iter.old_spte, new_pfn);
+
+ tdp_mmu_set_spte(kvm, &iter, new_spte);
+ }
+
+ need_flush = 1;
+ }
+
+ if (need_flush)
+ kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+
+ return 0;
+}
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+ pte_t *host_ptep)
+{
+ return kvm_tdp_mmu_handle_hva_range(kvm, address, address + 1,
+ (unsigned long)host_ptep,
+ set_tdp_spte);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bdb86f61e75eb..6569792f40d4f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -25,4 +25,7 @@ int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
unsigned long end);
int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+ pte_t *host_ptep);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:31:50

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 20/20] kvm: x86/mmu: NX largepage recovery for TDP MMU

When KVM maps a largepage backed region at a lower level in order to
make it executable (i.e. NX large page shattering), it reduces the TLB
performance of that region. In order to avoid making this degradation
permanent, KVM must periodically reclaim shattered NX largepages by
zapping them and allowing them to be rebuilt in the page fault handler.

With this patch, the TDP MMU does not respect KVM's rate limiting on
reclaim. It traverses the entire TDP structure every time. This will be
addressed in a future patch.
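
The reclaim decision this patch adds can be summarized with the simplified
model below; the types and callbacks are stand-ins, not kernel code:

#include <stdbool.h>
#include <stdint.h>

#define PT64_LEVEL_BITS 9
#define PAGES_PER_HPAGE(level) (1ULL << (((level) - 1) * PT64_LEVEL_BITS))

struct sp_sketch {
	uint64_t gfn;		/* base GFN of the shadow page's range */
	int level;		/* level the page's SPTEs sit at */
	bool tdp_mmu_page;
};

/* A shattered NX huge page tracked by the TDP MMU is reclaimed by zapping
 * the GFN range it covers so the fault handler can rebuild a large
 * mapping; legacy MMU pages keep the prepare/commit zap path. */
static void recover_one(const struct sp_sketch *sp,
			void (*tdp_zap_range)(uint64_t start, uint64_t end),
			void (*legacy_zap)(const struct sp_sketch *sp))
{
	if (sp->tdp_mmu_page)
		tdp_zap_range(sp->gfn, sp->gfn + PAGES_PER_HPAGE(sp->level));
	else
		legacy_zap(sp);
}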

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
arch/x86/kvm/mmu/mmu_internal.h | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3935c10278736..5c8a35e4c872b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1030,7 +1030,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_disallow_lpage(slot, gfn);
}

-static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (sp->lpage_disallowed)
return;
@@ -1058,7 +1058,7 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_allow_lpage(slot, gfn);
}

-static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
--kvm->stat.nx_lpage_splits;
sp->lpage_disallowed = false;
@@ -6362,8 +6362,13 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
struct kvm_mmu_page,
lpage_disallowed_link);
WARN_ON_ONCE(!sp->lpage_disallowed);
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- WARN_ON_ONCE(sp->lpage_disallowed);
+ if (sp->tdp_mmu_page)
+ kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
+ sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
+ else {
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ WARN_ON_ONCE(sp->lpage_disallowed);
+ }

if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
kvm_mmu_commit_zap_page(kvm, &invalid_list);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a7230532bb845..88899a2666d86 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -299,4 +299,7 @@ static inline u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte,
}


+void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b1515b89606e1..2949759c6aa84 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -289,6 +289,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,

list_del(&sp->link);

+ if (sp->lpage_disallowed)
+ unaccount_huge_nx_page(kvm, sp);
+
for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
old_child_spte = *(pt + i);
*(pt + i) = 0;
@@ -567,6 +570,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
new_spte = make_nonleaf_spte(child_pt,
!shadow_accessed_mask);

+ if (huge_page_disallowed && req_level >= iter.level)
+ account_huge_nx_page(vcpu->kvm, sp);
+
tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
}
}
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:32:12

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 19/20] kvm: x86/mmu: Don't clear write flooding count for direct roots

Direct roots don't have a write flooding count because the guest can't
affect that paging structure. Thus there's no need to clear the write
flooding count on a fast CR3 switch for direct roots.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2e8bf8d19c35a..3935c10278736 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4266,7 +4266,13 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
*/
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);

- __clear_sp_write_flooding_count(to_shadow_page(vcpu->arch.mmu->root_hpa));
+ /*
+ * If this is a direct root page, it doesn't have a write flooding
+ * count. Otherwise, clear the write flooding count.
+ */
+ if (!new_role.direct)
+ __clear_sp_write_flooding_count(
+ to_shadow_page(vcpu->arch.mmu->root_hpa));
}

void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush,
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:32:24

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 05/20] kvm: x86/mmu: Add functions to handle changed TDP SPTEs

The existing bookkeeping done by KVM when a PTE is changed is spread
around several functions. This makes it difficult to remember all the
stats, bitmaps, and other subsystems that need to be updated whenever a
PTE is modified. When a non-leaf PTE is marked non-present or becomes a
leaf PTE, page table memory must also be freed. To simplify the MMU and
facilitate the use of atomic operations on SPTEs in future patches, create
functions to handle some of the bookkeeping required as a result of
a change.
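
The core of the bookkeeping is classifying the transition between the old
and new SPTE values. A simplified model of the two decisions that matter
most, using stand-ins for the real predicates (is_shadow_present_pte(),
is_last_spte(), spte_to_pfn()):

#include <stdbool.h>
#include <stdint.h>

struct spte_state {
	bool present;
	bool leaf;		/* present and maps a page, not a child PT */
	bool dirty;
	uint64_t pfn;
};

/* A present non-leaf SPTE that was zapped or repointed means a whole
 * subtree was unlinked: its children must be handled recursively and the
 * page table page freed. */
static bool subtree_removed(struct spte_state old_s, struct spte_state new_s)
{
	return old_s.present && !old_s.leaf &&
	       (old_s.pfn != new_s.pfn || !new_s.present);
}

/* The backing host page must be marked dirty if a dirty leaf mapping to
 * it went away or moved to a different PFN. */
static bool pfn_needs_dirty(struct spte_state old_s, struct spte_state new_s)
{
	return old_s.present && old_s.leaf && old_s.dirty &&
	       (!new_s.dirty || old_s.pfn != new_s.pfn);
}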

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 39 +----------
arch/x86/kvm/mmu/mmu_internal.h | 38 +++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 112 ++++++++++++++++++++++++++++++++
3 files changed, 152 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a3340ed59ad1d..8bf20723c6177 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -105,21 +105,6 @@ enum {
AUDIT_POST_SYNC
};

-#undef MMU_DEBUG
-
-#ifdef MMU_DEBUG
-static bool dbg = 0;
-module_param(dbg, bool, 0644);
-
-#define pgprintk(x...) do { if (dbg) printk(x); } while (0)
-#define rmap_printk(x...) do { if (dbg) printk(x); } while (0)
-#define MMU_WARN_ON(x) WARN_ON(x)
-#else
-#define pgprintk(x...) do { } while (0)
-#define rmap_printk(x...) do { } while (0)
-#define MMU_WARN_ON(x) do { } while (0)
-#endif
-
#define PTE_PREFETCH_NUM 8

#define PT32_LEVEL_BITS 10
@@ -211,7 +196,6 @@ static u64 __read_mostly shadow_nx_mask;
static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
static u64 __read_mostly shadow_user_mask;
static u64 __read_mostly shadow_accessed_mask;
-static u64 __read_mostly shadow_dirty_mask;
static u64 __read_mostly shadow_mmio_value;
static u64 __read_mostly shadow_mmio_access_mask;
static u64 __read_mostly shadow_present_mask;
@@ -287,8 +271,8 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
kvm_flush_remote_tlbs(kvm);
}

-static void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
- u64 start_gfn, u64 pages)
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
+ u64 pages)
{
struct kvm_tlb_range range;

@@ -324,12 +308,6 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
}

-static inline bool spte_ad_enabled(u64 spte)
-{
- MMU_WARN_ON(is_mmio_spte(spte));
- return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
-}
-
static inline bool spte_ad_need_write_protect(u64 spte)
{
MMU_WARN_ON(is_mmio_spte(spte));
@@ -347,12 +325,6 @@ static inline u64 spte_shadow_accessed_mask(u64 spte)
return spte_ad_enabled(spte) ? shadow_accessed_mask : 0;
}

-static inline u64 spte_shadow_dirty_mask(u64 spte)
-{
- MMU_WARN_ON(is_mmio_spte(spte));
- return spte_ad_enabled(spte) ? shadow_dirty_mask : 0;
-}
-
static inline bool is_access_track_spte(u64 spte)
{
return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
@@ -767,13 +739,6 @@ static bool is_accessed_spte(u64 spte)
: !is_access_track_spte(spte);
}

-static bool is_dirty_spte(u64 spte)
-{
- u64 dirty_mask = spte_shadow_dirty_mask(spte);
-
- return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
-}
-
/* Rules for using mmu_spte_set:
* Set the sptep from nonpresent to present.
* Note: the sptep being assigned *must* be either not present
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6cedf578c9a8d..c053a157e4d55 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -8,6 +8,21 @@

#include <asm/kvm_host.h>

+#undef MMU_DEBUG
+
+#ifdef MMU_DEBUG
+static bool dbg = 0;
+module_param(dbg, bool, 0644);
+
+#define pgprintk(x...) do { if (dbg) printk(x); } while (0)
+#define rmap_printk(x...) do { if (dbg) printk(x); } while (0)
+#define MMU_WARN_ON(x) WARN_ON(x)
+#else
+#define pgprintk(x...) do { } while (0)
+#define rmap_printk(x...) do { } while (0)
+#define MMU_WARN_ON(x) do { } while (0)
+#endif
+
struct kvm_mmu_page {
struct list_head link;
struct hlist_node hash_link;
@@ -105,6 +120,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
#define ACC_USER_MASK PT_USER_MASK
#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

+static u64 __read_mostly shadow_dirty_mask;
+
/* Functions for interpreting SPTEs */
static inline bool is_mmio_spte(u64 spte)
{
@@ -150,4 +167,25 @@ static inline bool kvm_mmu_put_root(struct kvm_mmu_page *sp)
}


+static inline bool spte_ad_enabled(u64 spte)
+{
+ MMU_WARN_ON(is_mmio_spte(spte));
+ return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
+}
+
+static inline u64 spte_shadow_dirty_mask(u64 spte)
+{
+ MMU_WARN_ON(is_mmio_spte(spte));
+ return spte_ad_enabled(spte) ? shadow_dirty_mask : 0;
+}
+
+static inline bool is_dirty_spte(u64 spte)
+{
+ u64 dirty_mask = spte_shadow_dirty_mask(spte);
+
+ return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
+}
+
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
+ u64 pages);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 09a84a6e157b6..f2bd3a6928ce9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -2,6 +2,7 @@

#include "mmu.h"
#include "mmu_internal.h"
+#include "tdp_iter.h"
#include "tdp_mmu.h"

static bool __read_mostly tdp_mmu_enabled = false;
@@ -150,3 +151,114 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)

return __pa(root->spt);
}
+
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level);
+
+/**
+ * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the SPTE was a part of
+ * @gfn: the base GFN that was mapped by the SPTE
+ * @old_spte: The value of the SPTE before the change
+ * @new_spte: The value of the SPTE after the change
+ * @level: the level of the PT the SPTE is part of in the paging structure
+ *
+ * Handle bookkeeping that might result from the modification of a SPTE.
+ * This function must be called for all TDP SPTE modifications.
+ */
+static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ u64 *pt;
+ u64 old_child_spte;
+ int i;
+
+ WARN_ON(level > PT64_ROOT_MAX_LEVEL);
+ WARN_ON(level < PG_LEVEL_4K);
+ WARN_ON(gfn % KVM_PAGES_PER_HPAGE(level));
+
+ /*
+ * If this warning were to trigger it would indicate that there was a
+ * missing MMU notifier or a race with some notifier handler.
+ * A present, leaf SPTE should never be directly replaced with another
+ * present leaf SPTE pointing to a different PFN. A notifier handler
+ * should be zapping the SPTE before the main MM's page table is
+ * changed, or the SPTE should be zeroed, and the TLBs flushed by the
+ * thread before replacement.
+ */
+ if (was_leaf && is_leaf && pfn_changed) {
+ pr_err("Invalid SPTE change: cannot replace a present leaf\n"
+ "SPTE with another present leaf SPTE mapping a\n"
+ "different PFN!\n"
+ "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+ as_id, gfn, old_spte, new_spte, level);
+
+ /*
+ * Crash the host to prevent error propagation and guest data
+ * corruption.
+ */
+ BUG();
+ }
+
+ if (old_spte == new_spte)
+ return;
+
+ /*
+ * The only time a SPTE should be changed from a non-present to a
+ * non-present state is when an MMIO entry is installed/modified/
+ * removed. In that case, there is nothing to do here.
+ */
+ if (!was_present && !is_present) {
+ /*
+ * If this change does not involve a MMIO SPTE, it is
+ * unexpected. Log the change, though it should not impact the
+ * guest since both the former and current SPTEs are nonpresent.
+ */
+ if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
+ pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
+ "should not be replaced with another,\n"
+ "different nonpresent SPTE, unless one or both\n"
+ "are MMIO SPTEs.\n"
+ "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+ as_id, gfn, old_spte, new_spte, level);
+ return;
+ }
+
+
+ if (was_leaf && is_dirty_spte(old_spte) &&
+ (!is_dirty_spte(new_spte) || pfn_changed))
+ kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+
+ /*
+ * Recursively handle child PTs if the change removed a subtree from
+ * the paging structure.
+ */
+ if (was_present && !was_leaf && (pfn_changed || !is_present)) {
+ pt = spte_to_child_pt(old_spte, level);
+
+ for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+ old_child_spte = *(pt + i);
+ *(pt + i) = 0;
+ handle_changed_spte(kvm, as_id,
+ gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
+ old_child_spte, 0, level - 1);
+ }
+
+ kvm_flush_remote_tlbs_with_address(kvm, gfn,
+ KVM_PAGES_PER_HPAGE(level));
+
+ free_page((unsigned long)pt);
+ }
+}
+
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level)
+{
+ __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
+}
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 18:33:05

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

The TDP MMU must be able to allocate paging structure root pages and track
the usage of those pages. Implement a root page allocation and tracking
system similar to, but separate from, the one used by the x86 shadow
paging implementation. When
future patches add synchronization model changes to allow for parallel
page faults, these pages will need to be handled differently from the
x86 shadow paging based MMU's root pages.
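
The lifetime scheme is a small reference-counting pattern over a per-VM
list of roots; a simplified model of it, with illustrative names only:

#include <stdbool.h>
#include <stddef.h>

struct root_sketch {
	int root_count;
	unsigned long role_word;	/* paging mode this root encodes */
	struct root_sketch *next;	/* per-VM list of TDP MMU roots */
};

static void get_root(struct root_sketch *root)
{
	root->root_count++;
}

/* Returns true when the caller dropped the last reference and the root
 * (and the paging structure hanging off it) should be torn down. */
static bool put_root(struct root_sketch *root)
{
	return --root->root_count == 0;
}

/* Reuse an existing root with a matching role instead of building a
 * second paging structure for the same mode. */
static struct root_sketch *find_root(struct root_sketch *head,
				     unsigned long role_word)
{
	for (; head; head = head->next) {
		if (head->role_word == role_word) {
			get_root(head);
			return head;
		}
	}

	return NULL;
}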

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 29 +++++---
arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++
arch/x86/kvm/mmu/tdp_mmu.c | 114 ++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 5 ++
5 files changed, 162 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6b6dbc20ce23a..e0ec1dd271a32 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -989,6 +989,7 @@ struct kvm_arch {
* operations.
*/
bool tdp_mmu_enabled;
+ struct list_head tdp_mmu_roots;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f53d29e09367c..a3340ed59ad1d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -144,11 +144,6 @@ module_param(dbg, bool, 0644);
#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
| shadow_x_mask | shadow_nx_mask | shadow_me_mask)

-#define ACC_EXEC_MASK 1
-#define ACC_WRITE_MASK PT_WRITABLE_MASK
-#define ACC_USER_MASK PT_USER_MASK
-#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
-
/* The mask for the R/X bits in EPT PTEs */
#define PT64_EPT_READABLE_MASK 0x1ull
#define PT64_EPT_EXECUTABLE_MASK 0x4ull
@@ -209,7 +204,7 @@ struct kvm_shadow_walk_iterator {
__shadow_walk_next(&(_walker), spte))

static struct kmem_cache *pte_list_desc_cache;
-static struct kmem_cache *mmu_page_header_cache;
+struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;

static u64 __read_mostly shadow_nx_mask;
@@ -3588,9 +3583,13 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
return;

sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
- --sp->root_count;
- if (!sp->root_count && sp->role.invalid)
- kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
+ if (kvm_mmu_put_root(sp)) {
+ if (sp->tdp_mmu_page)
+ kvm_tdp_mmu_free_root(kvm, sp);
+ else if (sp->role.invalid)
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ }

*root_hpa = INVALID_PAGE;
}
@@ -3680,8 +3679,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
hpa_t root;
unsigned i;

- if (shadow_root_level >= PT64_ROOT_4LEVEL) {
- root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
+ if (vcpu->kvm->arch.tdp_mmu_enabled) {
+ root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+
+ if (!VALID_PAGE(root))
+ return -ENOSPC;
+ vcpu->arch.mmu->root_hpa = root;
+ } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+ root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
+ true);
+
if (!VALID_PAGE(root))
return -ENOSPC;
vcpu->arch.mmu->root_hpa = root;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 74ccbf001a42e..6cedf578c9a8d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -43,8 +43,12 @@ struct kvm_mmu_page {

/* Number of writes since the last time traversal visited this page. */
atomic_t write_flooding_count;
+
+ bool tdp_mmu_page;
};

+extern struct kmem_cache *mmu_page_header_cache;
+
static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
{
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
@@ -96,6 +100,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
* PT64_LEVEL_BITS))) - 1))

+#define ACC_EXEC_MASK 1
+#define ACC_WRITE_MASK PT_WRITABLE_MASK
+#define ACC_USER_MASK PT_USER_MASK
+#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
+
/* Functions for interpreting SPTEs */
static inline bool is_mmio_spte(u64 spte)
{
@@ -126,4 +135,19 @@ static inline kvm_pfn_t spte_to_pfn(u64 pte)
return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
}

+static inline void kvm_mmu_get_root(struct kvm_mmu_page *sp)
+{
+ BUG_ON(!sp->root_count);
+
+ ++sp->root_count;
+}
+
+static inline bool kvm_mmu_put_root(struct kvm_mmu_page *sp)
+{
+ --sp->root_count;
+
+ return !sp->root_count;
+}
+
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b3809835e90b1..09a84a6e157b6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1,5 +1,7 @@
// SPDX-License-Identifier: GPL-2.0

+#include "mmu.h"
+#include "mmu_internal.h"
#include "tdp_mmu.h"

static bool __read_mostly tdp_mmu_enabled = false;
@@ -29,10 +31,122 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)

/* This should not be changed for the lifetime of the VM. */
kvm->arch.tdp_mmu_enabled = true;
+
+ INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
if (!kvm->arch.tdp_mmu_enabled)
return;
+
+ WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
+}
+
+#define for_each_tdp_mmu_root(_kvm, _root) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
+
+bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = to_shadow_page(hpa);
+
+ return sp->tdp_mmu_page && sp->root_count;
+}
+
+void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ WARN_ON(root->root_count);
+ WARN_ON(!root->tdp_mmu_page);
+
+ list_del(&root->link);
+
+ free_page((unsigned long)root->spt);
+ kmem_cache_free(mmu_page_header_cache, root);
+}
+
+static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ if (kvm_mmu_put_root(root))
+ kvm_tdp_mmu_free_root(kvm, root);
+}
+
+static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ kvm_mmu_get_root(root);
+}
+
+static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
+ int level)
+{
+ union kvm_mmu_page_role role;
+
+ role = vcpu->arch.mmu->mmu_role.base;
+ role.level = vcpu->arch.mmu->shadow_root_level;
+ role.direct = true;
+ role.gpte_is_8_bytes = true;
+ role.access = ACC_ALL;
+
+ return role;
+}
+
+static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+ int level)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ sp->role.word = page_role_for_level(vcpu, level).word;
+ sp->gfn = gfn;
+ sp->tdp_mmu_page = true;
+
+ return sp;
+}
+
+static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
+{
+ union kvm_mmu_page_role role;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page *root;
+
+ role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
+
+ spin_lock(&kvm->mmu_lock);
+
+ /* Check for an existing root before allocating a new one. */
+ for_each_tdp_mmu_root(kvm, root) {
+ if (root->role.word == role.word) {
+ get_tdp_mmu_root(kvm, root);
+ spin_unlock(&kvm->mmu_lock);
+ return root;
+ }
+ }
+
+ root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+ root->root_count = 1;
+
+ list_add(&root->link, &kvm->arch.tdp_mmu_roots);
+
+ spin_unlock(&kvm->mmu_lock);
+
+ return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *root;
+
+ root = get_tdp_mmu_vcpu_root(vcpu);
+ if (!root)
+ return INVALID_PAGE;
+
+ return __pa(root->spt);
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index cd4a562a70e9a..ac0ef91294420 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,4 +7,9 @@

void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
+
+bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
+
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 23:05:41

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 18/20] kvm: x86/mmu: Support MMIO in the TDP MMU

In order to support MMIO, KVM must be able to walk the TDP paging
structures to find mappings for a given GFN. Support this walk for
the TDP MMU.
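
The walk boils down to recording the SPTE seen at each level and stopping
at the first non-present entry. A minimal sketch of that shape, with
made-up names and a callback standing in for reading the paging structure:

#include <stdbool.h>
#include <stdint.h>

#define ROOT_MAX_LEVEL 5

struct walk_sketch {
	/* sptes[level - 1] holds the entry observed at 'level'. */
	uint64_t sptes[ROOT_MAX_LEVEL];
	int leaf_level;
};

static bool sketch_present(uint64_t spte)
{
	return spte != 0;
}

static struct walk_sketch walk(int root_level,
			       uint64_t (*next_spte)(int level))
{
	struct walk_sketch w = { .leaf_level = root_level };

	for (int level = root_level; level >= 1; level--) {
		uint64_t spte = next_spte(level);

		w.leaf_level = level;
		w.sptes[level - 1] = spte;
		if (!sketch_present(spte))
			break;
	}

	return w;
}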

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

v2: Thanks to Dan Carpenter and kernel test robot for finding that root
was used uninitialized in get_mmio_spte.

Signed-off-by: Ben Gardon <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 70 ++++++++++++++++++++++++++------------
arch/x86/kvm/mmu/tdp_mmu.c | 18 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 2 ++
3 files changed, 69 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 58d2412817c87..2e8bf8d19c35a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3853,54 +3853,82 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return vcpu_match_mmio_gva(vcpu, addr);
}

-/* return true if reserved bit is detected on spte. */
-static bool
-walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ */
+static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
{
struct kvm_shadow_walk_iterator iterator;
- u64 sptes[PT64_ROOT_MAX_LEVEL], spte = 0ull;
- struct rsvd_bits_validate *rsvd_check;
- int root, leaf;
- bool reserved = false;
+ int leaf = vcpu->arch.mmu->root_level;
+ u64 spte;

- rsvd_check = &vcpu->arch.mmu->shadow_zero_check;

walk_shadow_page_lockless_begin(vcpu);

- for (shadow_walk_init(&iterator, vcpu, addr),
- leaf = root = iterator.level;
+ for (shadow_walk_init(&iterator, vcpu, addr);
shadow_walk_okay(&iterator);
__shadow_walk_next(&iterator, spte)) {
+ leaf = iterator.level;
spte = mmu_spte_get_lockless(iterator.sptep);

sptes[leaf - 1] = spte;
- leaf--;

if (!is_shadow_present_pte(spte))
break;

+ }
+
+ walk_shadow_page_lockless_end(vcpu);
+
+ return leaf;
+}
+
+/* return true if reserved bit is detected on spte. */
+static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL];
+ struct rsvd_bits_validate *rsvd_check;
+ int root = vcpu->arch.mmu->root_level;
+ int leaf;
+ int level;
+ bool reserved = false;
+
+ if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+ *sptep = 0ull;
+ return reserved;
+ }
+
+ if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+ leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes);
+ else
+ leaf = get_walk(vcpu, addr, sptes);
+
+ rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+
+ for (level = root; level >= leaf; level--) {
+ if (!is_shadow_present_pte(sptes[level - 1]))
+ break;
/*
* Use a bitwise-OR instead of a logical-OR to aggregate the
* reserved bit and EPT's invalid memtype/XWR checks to avoid
* adding a Jcc in the loop.
*/
- reserved |= __is_bad_mt_xwr(rsvd_check, spte) |
- __is_rsvd_bits_set(rsvd_check, spte, iterator.level);
+ reserved |= __is_bad_mt_xwr(rsvd_check, sptes[level - 1]) |
+ __is_rsvd_bits_set(rsvd_check, sptes[level - 1],
+ level);
}

- walk_shadow_page_lockless_end(vcpu);
-
if (reserved) {
pr_err("%s: detect reserved bits on spte, addr 0x%llx, dump hierarchy:\n",
__func__, addr);
- while (root > leaf) {
+ for (level = root; level >= leaf; level--)
pr_err("------ spte 0x%llx level %d.\n",
- sptes[root - 1], root);
- root--;
- }
+ sptes[level - 1], level);
}

- *sptep = spte;
+ *sptep = sptes[leaf - 1];
+
return reserved;
}

@@ -3912,7 +3940,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (mmio_info_in_cache(vcpu, addr, direct))
return RET_PF_EMULATE;

- reserved = walk_shadow_page_get_mmio_spte(vcpu, addr, &spte);
+ reserved = get_mmio_spte(vcpu, addr, &spte);
if (WARN_ON(reserved))
return -EINVAL;

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c471f2e977d11..b1515b89606e1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1128,3 +1128,21 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
return spte_set;
}

+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ */
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
+{
+ struct tdp_iter iter;
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ int leaf = vcpu->arch.mmu->shadow_root_level;
+ gfn_t gfn = addr >> PAGE_SHIFT;
+
+ tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ leaf = iter.level;
+ sptes[leaf - 1] = iter.old_spte;
+ }
+
+ return leaf;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index b66283db43221..f890048dfcba5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,

bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn);
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-14 23:05:42

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 17/20] kvm: x86/mmu: Support write protection for nesting in tdp MMU

To support nested virtualization, KVM will sometimes need to write
protect pages which are part of a shadowed paging structure or are not
writable in the shadowed paging structure. Add a function to write
protect GFN mappings for this purpose.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 +++
arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 3 +++
3 files changed, 57 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8fcf5e955c475..58d2412817c87 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1553,6 +1553,10 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}

+ if (kvm->arch.tdp_mmu_enabled)
+ write_protected |=
+ kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn);
+
return write_protected;
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 94624cc1df84c..c471f2e977d11 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1078,3 +1078,53 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
put_tdp_mmu_root(kvm, root);
}
}
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * SPTE_MMU_WRITEABLE bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
+ if (!is_writable_pte(iter.old_spte))
+ break;
+
+ new_spte = iter.old_spte &
+ ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
+
+ tdp_mmu_set_spte(kvm, &iter, new_spte);
+ spte_set = true;
+ }
+
+ return spte_set;
+}
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * SPTE_MMU_WRITEABLE bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct kvm_mmu_page *root;
+ int root_as_id;
+ bool spte_set = false;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ for_each_tdp_mmu_root(kvm, root) {
+ root_as_id = kvm_mmu_page_as_id(root);
+ if (root_as_id != slot->as_id)
+ continue;
+
+ spte_set = write_protect_gfn(kvm, root, gfn) || spte_set;
+ }
+ return spte_set;
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index dc4cdc5cc29f5..b66283db43221 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -40,4 +40,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot);
+
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn);
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-15 01:35:34

by Ben Gardon

[permalink] [raw]
Subject: [PATCH v2 10/20] kvm: x86/mmu: Add TDP MMU PF handler

Add functions to handle page faults in the TDP MMU. These page faults
are currently handled in much the same way as in the x86 shadow paging
based MMU; however, the ordering of some operations is slightly
different. Future patches will add eager NX splitting, a fast page fault
handler, and parallel page faults.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 82 +++++++--------------
arch/x86/kvm/mmu/mmu_internal.h | 59 +++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 124 ++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 5 ++
4 files changed, 212 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 288b97e96202e..421a12a247b67 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -141,23 +141,6 @@ enum {
/* make pte_list_desc fit well in cache line */
#define PTE_LIST_EXT 3

-/*
- * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
- *
- * RET_PF_RETRY: let CPU fault again on the address.
- * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
- * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- * RET_PF_FIXED: The faulting entry has been fixed.
- * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
- */
-enum {
- RET_PF_RETRY = 0,
- RET_PF_EMULATE,
- RET_PF_INVALID,
- RET_PF_FIXED,
- RET_PF_SPURIOUS,
-};
-
struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
struct pte_list_desc *more;
@@ -195,19 +178,11 @@ static struct percpu_counter kvm_total_used_mmu_pages;
static u64 __read_mostly shadow_nx_mask;
static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
static u64 __read_mostly shadow_user_mask;
-static u64 __read_mostly shadow_accessed_mask;
static u64 __read_mostly shadow_mmio_value;
static u64 __read_mostly shadow_mmio_access_mask;
static u64 __read_mostly shadow_present_mask;
static u64 __read_mostly shadow_me_mask;

-/*
- * SPTEs used by MMUs without A/D bits are marked with SPTE_AD_DISABLED_MASK;
- * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
- * pages.
- */
-static u64 __read_mostly shadow_acc_track_mask;
-
/*
* The mask/shift to use for saving the original R/X bits when marking the PTE
* as not-present for access tracking purposes. We do not save the W bit as the
@@ -314,22 +289,11 @@ static inline bool spte_ad_need_write_protect(u64 spte)
return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
}

-static bool is_nx_huge_page_enabled(void)
+bool is_nx_huge_page_enabled(void)
{
return READ_ONCE(nx_huge_pages);
}

-static inline u64 spte_shadow_accessed_mask(u64 spte)
-{
- MMU_WARN_ON(is_mmio_spte(spte));
- return spte_ad_enabled(spte) ? shadow_accessed_mask : 0;
-}
-
-static inline bool is_access_track_spte(u64 spte)
-{
- return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
-}
-
/*
* Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
* the memslots generation and is derived as follows:
@@ -377,7 +341,7 @@ static u64 get_mmio_spte_generation(u64 spte)
return gen;
}

-static u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
{

u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
@@ -2468,7 +2432,7 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
__shadow_walk_next(iterator, *iterator->sptep);
}

-static u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
{
u64 spte;

@@ -2886,15 +2850,10 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
E820_TYPE_RAM);
}

-/* Bits which may be returned by set_spte() */
-#define SET_SPTE_WRITE_PROTECTED_PT BIT(0)
-#define SET_SPTE_NEED_REMOTE_TLB_FLUSH BIT(1)
-#define SET_SPTE_SPURIOUS BIT(2)
-
-static int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
- gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
- bool can_unsync, bool host_writable, bool ad_disabled,
- u64 *new_spte)
+int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+ bool can_unsync, bool host_writable, bool ad_disabled,
+ u64 *new_spte)
{
u64 spte = 0;
int ret = 0;
@@ -3187,9 +3146,9 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
return level;
}

-static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
- int max_level, kvm_pfn_t *pfnp,
- bool huge_page_disallowed, int *req_level)
+int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn, int max_level,
+ kvm_pfn_t *pfnp, bool huge_page_disallowed,
+ int *req_level)
{
struct kvm_memory_slot *slot;
struct kvm_lpage_info *linfo;
@@ -3243,8 +3202,8 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
return level;
}

-static void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
- kvm_pfn_t *pfnp, int *levelp)
+void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+ kvm_pfn_t *pfnp, int *levelp)
{
int level = *levelp;

@@ -4068,9 +4027,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
if (page_fault_handle_page_track(vcpu, error_code, gfn))
return RET_PF_EMULATE;

- r = fast_page_fault(vcpu, gpa, error_code);
- if (r != RET_PF_INVALID)
- return r;
+ if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
+ r = fast_page_fault(vcpu, gpa, error_code);
+ if (r != RET_PF_INVALID)
+ return r;
+ }

r = mmu_topup_memory_caches(vcpu, false);
if (r)
@@ -4092,8 +4053,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
r = make_mmu_pages_available(vcpu);
if (r)
goto out_unlock;
- r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
- prefault, is_tdp);
+
+ if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+ r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable,
+ max_level, pfn, prefault, is_tdp);
+ else
+ r = __direct_map(vcpu, gpa, error_code, map_writable, max_level,
+ pfn, prefault, is_tdp);

out_unlock:
spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c053a157e4d55..f7fe5616eff98 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -121,6 +121,14 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

static u64 __read_mostly shadow_dirty_mask;
+static u64 __read_mostly shadow_accessed_mask;
+
+/*
+ * SPTEs used by MMUs without A/D bits are marked with SPTE_AD_DISABLED_MASK;
+ * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
+ * pages.
+ */
+static u64 __read_mostly shadow_acc_track_mask;

/* Functions for interpreting SPTEs */
static inline bool is_mmio_spte(u64 spte)
@@ -186,6 +194,57 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
}

+static inline u64 spte_shadow_accessed_mask(u64 spte)
+{
+ MMU_WARN_ON(is_mmio_spte(spte));
+ return spte_ad_enabled(spte) ? shadow_accessed_mask : 0;
+}
+
+static inline bool is_access_track_spte(u64 spte)
+{
+ return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
+}
+
void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
u64 pages);
+
+/*
+ * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
+ *
+ * RET_PF_RETRY: let CPU fault again on the address.
+ * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
+ * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_FIXED: The faulting entry has been fixed.
+ * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ */
+enum {
+ RET_PF_RETRY = 0,
+ RET_PF_EMULATE,
+ RET_PF_INVALID,
+ RET_PF_FIXED,
+ RET_PF_SPURIOUS,
+};
+
+/* Bits which may be returned by set_spte() */
+#define SET_SPTE_WRITE_PROTECTED_PT BIT(0)
+#define SET_SPTE_NEED_REMOTE_TLB_FLUSH BIT(1)
+#define SET_SPTE_SPURIOUS BIT(2)
+
+int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+ bool can_unsync, bool host_writable, bool ad_disabled,
+ u64 *new_spte);
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
+
+int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn, int max_level,
+ kvm_pfn_t *pfnp, bool huge_page_disallowed,
+ int *req_level);
+void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+ kvm_pfn_t *pfnp, int *levelp);
+
+bool is_nx_huge_page_enabled(void);
+
+void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9b5cd4a832f1a..f92c12c4ce31a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -291,6 +291,10 @@ static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)

+#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
+ for_each_tdp_pte(_iter, __va(_mmu->root_hpa), \
+ _mmu->shadow_root_level, _start, _end)
+
static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
{
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
@@ -371,3 +375,123 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
if (flush)
kvm_flush_remote_tlbs(kvm);
}
+
+/*
+ * Installs a last-level SPTE to handle a TDP page fault.
+ * (NPT/EPT violation/misconfiguration)
+ */
+static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
+ int map_writable,
+ struct tdp_iter *iter,
+ kvm_pfn_t pfn, bool prefault)
+{
+ u64 new_spte;
+ int ret = 0;
+ int make_spte_ret = 0;
+
+ if (unlikely(is_noslot_pfn(pfn)))
+ new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+ else
+ make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
+ pfn, iter->old_spte, prefault, true,
+ map_writable, !shadow_accessed_mask,
+ &new_spte);
+
+ tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
+
+ /*
+ * If the page fault was caused by a write but the page is write
+ * protected, emulation is needed. If the emulation was skipped,
+ * the vCPU would have the same fault again.
+ */
+ if (make_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) {
+ if (write)
+ ret = RET_PF_EMULATE;
+ kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+ }
+
+ /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
+ if (unlikely(is_mmio_spte(new_spte)))
+ ret = RET_PF_EMULATE;
+
+ if (!prefault)
+ vcpu->stat.pf_fixed++;
+
+ return ret;
+}
+
+/*
+ * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
+ * page tables and SPTEs to translate the faulting guest physical address.
+ */
+int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ int map_writable, int max_level, kvm_pfn_t pfn,
+ bool prefault, bool is_tdp)
+{
+ bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
+ bool write = error_code & PFERR_WRITE_MASK;
+ bool exec = error_code & PFERR_FETCH_MASK;
+ bool huge_page_disallowed = exec && nx_huge_page_workaround_enabled;
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct tdp_iter iter;
+ struct kvm_mmu_memory_cache *pf_pt_cache =
+ &vcpu->arch.mmu_shadow_page_cache;
+ u64 *child_pt;
+ u64 new_spte;
+ int ret;
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ int level;
+ int req_level;
+
+ BUG_ON(!is_tdp);
+ BUG_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
+ BUG_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
+
+ level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+ huge_page_disallowed, &req_level);
+
+ tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+ if (nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(iter.old_spte, gfn,
+ iter.level, &pfn, &level);
+
+ if (iter.level == level)
+ break;
+
+ /*
+ * If there is an SPTE mapping a large page at a higher level
+ * than the target, that SPTE must be cleared and replaced
+ * with a non-leaf SPTE.
+ */
+ if (is_shadow_present_pte(iter.old_spte) &&
+ is_large_pte(iter.old_spte)) {
+ tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
+
+ kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
+ KVM_PAGES_PER_HPAGE(iter.level));
+
+ /*
+ * The iter must explicitly re-read the spte here
+ * because the new value informs the !present
+ * path below.
+ */
+ iter.old_spte = READ_ONCE(*iter.sptep);
+ }
+
+ if (!is_shadow_present_pte(iter.old_spte)) {
+ child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
+ clear_page(child_pt);
+ new_spte = make_nonleaf_spte(child_pt,
+ !shadow_accessed_mask);
+
+ tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
+ }
+ }
+
+ BUG_ON(iter.level != level);
+
+ ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
+ pfn, prefault);
+
+ return ret;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 6de2d007fc03c..4d111a4dd332f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -14,4 +14,9 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);

bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+
+int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ int map_writable, int max_level, kvm_pfn_t pfn,
+ bool prefault, bool is_tdp);
+
#endif /* __KVM_X86_MMU_TDP_MMU_H */
--
2.28.0.1011.ga647a8990f-goog

2020-10-16 16:12:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 10/20] kvm: x86/mmu: Add TDP MMU PF handler

On 14/10/20 20:26, Ben Gardon wrote:
> + if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> + r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable,
> + max_level, pfn, prefault, is_tdp);
> + else
> + r = __direct_map(vcpu, gpa, error_code, map_writable, max_level,
> + pfn, prefault, is_tdp);

I like the rename, but I guess is_tdp is clearly superfluous now.

Paolo

2020-10-16 16:50:49

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] kvm: x86/mmu: Support write protection for nesting in tdp MMU

On 14/10/20 20:26, Ben Gardon wrote:
> + spte_set = write_protect_gfn(kvm, root, gfn) || spte_set;

Remaining instance of ||.

2020-10-16 17:08:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/20] Introduce the TDP MMU

On 14/10/20 20:26, Ben Gardon wrote:
> arch/x86/include/asm/kvm_host.h | 14 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/mmu/mmu.c | 487 +++++++------
> arch/x86/kvm/mmu/mmu_internal.h | 242 +++++++
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/tdp_iter.c | 181 +++++
> arch/x86/kvm/mmu/tdp_iter.h | 60 ++
> arch/x86/kvm/mmu/tdp_mmu.c | 1154 +++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.h | 48 ++
> include/linux/kvm_host.h | 2 +
> virt/kvm/kvm_main.c | 12 +-
> 11 files changed, 1944 insertions(+), 262 deletions(-)
> create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
> create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
>

My implementation of tdp_iter_set_spte was completely different, but
of course that's not an issue; I would still like to understand and
comment on why the bool arguments to __tdp_mmu_set_spte are needed.

Apart from splitting tdp_mmu_iter_flush_cond_resched from
tdp_mmu_iter_cond_resched, my remaining changes on top are pretty
small and mostly cosmetic. I'll give it another go next week
and send it Linus's way if everything's all right.

Paolo

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f8525c89fc95..baf260421a56 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,20 +7,15 @@
#include "tdp_mmu.h"
#include "spte.h"

+#ifdef CONFIG_X86_64
static bool __read_mostly tdp_mmu_enabled = false;
+module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
+#endif

static bool is_tdp_mmu_enabled(void)
{
#ifdef CONFIG_X86_64
- if (!READ_ONCE(tdp_mmu_enabled))
- return false;
-
- if (WARN_ONCE(!tdp_enabled,
- "Creating a VM with TDP MMU enabled requires TDP."))
- return false;
-
- return true;
-
+ return tdp_enabled && READ_ONCE(tdp_mmu_enabled);
#else
return false;
#endif /* CONFIG_X86_64 */
@@ -277,8 +277,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
unaccount_huge_nx_page(kvm, sp);

for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
- old_child_spte = *(pt + i);
- *(pt + i) = 0;
+ old_child_spte = READ_ONCE(*(pt + i));
+ WRITE_ONCE(*(pt + i), 0);
handle_changed_spte(kvm, as_id,
gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
old_child_spte, 0, level - 1);
@@ -309,7 +309,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *root = sptep_to_sp(root_pt);
int as_id = kvm_mmu_page_as_id(root);

- *iter->sptep = new_spte;
+ WRITE_ONCE(*iter->sptep, new_spte);

__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
iter->level);
@@ -361,16 +361,28 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
for_each_tdp_pte(_iter, __va(_mmu->root_hpa), \
_mmu->shadow_root_level, _start, _end)

-static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+/*
+ * Flush the TLB if the process should drop kvm->mmu_lock.
+ * Return whether the caller still needs to flush the tlb.
+ */
+static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
{
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
kvm_flush_remote_tlbs(kvm);
cond_resched_lock(&kvm->mmu_lock);
tdp_iter_refresh_walk(iter);
+ return false;
+ } else {
return true;
}
+}

- return false;
+static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ cond_resched_lock(&kvm->mmu_lock);
+ tdp_iter_refresh_walk(iter);
+ }
}

/*
@@ -407,7 +419,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
tdp_mmu_set_spte(kvm, &iter, 0);

if (can_yield)
- flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
else
flush_needed = true;
}
@@ -479,7 +479,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
map_writable, !shadow_accessed_mask,
&new_spte);

- tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
+ if (new_spte == iter->old_spte)
+ ret = RET_PF_SPURIOUS;
+ else
+ tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);

/*
* If the page fault was caused by a write but the page is write
@@ -496,7 +496,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte)))
+ else if (unlikely(is_mmio_spte(new_spte)))
ret = RET_PF_EMULATE;

trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
@@ -528,8 +528,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int level;
int req_level;

- BUG_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
- BUG_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
+ if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+ return RET_PF_RETRY;
+ if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
+ return RET_PF_RETRY;

level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
huge_page_disallowed, &req_level);
@@ -579,7 +581,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
}
}

- BUG_ON(iter.level != level);
+ if (WARN_ON(iter.level != level))
+ return RET_PF_RETRY;

ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
pfn, prefault);
@@ -817,9 +829,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
*/
kvm_mmu_get_root(kvm, root);

- spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
- slot->base_gfn + slot->npages, min_level) ||
- spte_set;
+ spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages, min_level);

kvm_mmu_put_root(kvm, root);
}
@@ -886,8 +897,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
*/
kvm_mmu_get_root(kvm, root);

- spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
- slot->base_gfn + slot->npages) || spte_set;
+ spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages);

kvm_mmu_put_root(kvm, root);
}
@@ -1009,8 +1020,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
*/
kvm_mmu_get_root(kvm, root);

- spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
- slot->base_gfn + slot->npages) || spte_set;
+ spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages);

kvm_mmu_put_root(kvm, root);
}
@@ -1042,9 +1053,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
continue;

tdp_mmu_set_spte(kvm, &iter, 0);
- spte_set = true;

- spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
+ spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
}

if (spte_set)

2020-10-16 18:45:35

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

On 14/10/20 20:26, Ben Gardon wrote:
> +
> +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + if (kvm_mmu_put_root(root))
> + kvm_tdp_mmu_free_root(kvm, root);
> +}

Unused...

> +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + lockdep_assert_held(&kvm->mmu_lock);
> +
> + kvm_mmu_get_root(root);
> +}
> +

... and duplicate with kvm_mmu_get_root itself since we can move the
assertion there.

Paolo

2020-10-16 19:57:45

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 15/20] kvm: x86/mmu: Support dirty logging for the TDP MMU

On 14/10/20 20:26, Ben Gardon wrote:
>
> + if (kvm->arch.tdp_mmu_enabled)
> + kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
> + slot->base_gfn + gfn_offset, mask, true);

This was "false" in v1, I need --verbose for this change. :)

> while (mask) {
> rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),

> + spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> + slot->base_gfn + slot->npages, min_level) ||
> + spte_set;

A few remaining instances of ||.

Paolo

2020-10-19 17:04:05

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

On Fri, Oct 16, 2020 at 7:56 AM Paolo Bonzini <[email protected]> wrote:
>
> On 14/10/20 20:26, Ben Gardon wrote:
> > +
> > +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + if (kvm_mmu_put_root(root))
> > + kvm_tdp_mmu_free_root(kvm, root);
> > +}
>
> Unused...

Woops, I should have added an unused tag or added this in commit 7.
It's used by many other functions, but nothing in this patch anymore.

>
> > +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + lockdep_assert_held(&kvm->mmu_lock);
> > +
> > + kvm_mmu_get_root(root);
> > +}
> > +
>
> ... and duplicate with kvm_mmu_get_root itself since we can move the
> assertion there.
>
> Paolo
>

2020-10-19 18:20:03

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 00/20] Introduce the TDP MMU

On Fri, Oct 16, 2020 at 9:50 AM Paolo Bonzini <[email protected]> wrote:
>
> On 14/10/20 20:26, Ben Gardon wrote:
> > arch/x86/include/asm/kvm_host.h | 14 +
> > arch/x86/kvm/Makefile | 3 +-
> > arch/x86/kvm/mmu/mmu.c | 487 +++++++------
> > arch/x86/kvm/mmu/mmu_internal.h | 242 +++++++
> > arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> > arch/x86/kvm/mmu/tdp_iter.c | 181 +++++
> > arch/x86/kvm/mmu/tdp_iter.h | 60 ++
> > arch/x86/kvm/mmu/tdp_mmu.c | 1154 +++++++++++++++++++++++++++++++
> > arch/x86/kvm/mmu/tdp_mmu.h | 48 ++
> > include/linux/kvm_host.h | 2 +
> > virt/kvm/kvm_main.c | 12 +-
> > 11 files changed, 1944 insertions(+), 262 deletions(-)
> > create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> > create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> > create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
> > create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
> >
>
> My implementation of tdp_iter_set_spte was completely different, but
> of course that's not an issue; I would still like to understand and
> comment on why the bool arguments to __tdp_mmu_set_spte are needed.

The simplest explanation for those options to not mark the page as
dirty in the dirty bitmap or not mark the page accessed is that the
legacy MMU doesn't do it either, but I will outline more specifically
why it doesn't.

Let's consider dirty logging first. When getting the dirty log, we
take the following steps:
1. Atomically get and clear an unsigned long of the dirty bitmap
2. For each GFN in the range of pages covered by the unsigned long mask:
3. Clear the dirty or writable bit on the SPTE
4. Copy the mask of dirty pages to be returned to userspace

If we mark the page as dirty in the dirty bitmap in step 3, we'll
report the page as dirty twice - once in this dirty log call, and
again in the next one. This can lead to unexpected behavior:
1. Pause all vCPUs
2. Get the dirty log <--- Returns all pages dirtied before the vCPUs were paused
3. Get the dirty log again <--- Unexpectedly returns a non-zero number
of dirty pages even though no pages were actually dirtied

I believe a similar process happens for access tracking through MMU
notifiers, which would lead to incorrect behavior if we called
kvm_set_pfn_accessed during the handler for notifier_clear_young or
notifier_clear_flush_young.
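
(For illustration only: below is a minimal, self-contained userspace sketch
of the dirty-log flow above -- not kernel code, and all names in it are made
up for the example. With the "buggy" re-marking in step 3 enabled, a second
dirty log call reports the same pages again even though nothing was
re-dirtied.)

#include <stdio.h>

#define NPAGES 64

static unsigned long dirty_bitmap;	/* bit n set == page n dirtied by the guest */

/* Step 1: atomically get and clear an unsigned long of the dirty bitmap. */
static unsigned long get_and_clear_dirty_word(void)
{
	unsigned long mask = dirty_bitmap;

	dirty_bitmap = 0;
	return mask;
}

/*
 * Steps 2-4: for each GFN covered by the mask, clear the dirty/writable
 * bit on the (stand-in) SPTE and copy the mask back to userspace. The
 * "buggy" flag models re-marking the page dirty in step 3.
 */
static unsigned long get_dirty_log(int buggy)
{
	unsigned long mask = get_and_clear_dirty_word();
	int i;

	for (i = 0; i < NPAGES; i++) {
		if (!(mask & (1UL << i)))
			continue;
		/* ...clear the dirty or writable bit on the SPTE here... */
		if (buggy)
			dirty_bitmap |= 1UL << i;	/* double-reports this page */
	}
	return mask;
}

int main(void)
{
	dirty_bitmap = 0x5;	/* pages 0 and 2 were written before pausing the vCPUs */

	printf("first dirty log:  %#lx\n", get_dirty_log(1));	/* 0x5 */
	printf("second dirty log: %#lx\n", get_dirty_log(1));	/* 0x5 again: unexpected */
	return 0;
}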

>
> Apart from splitting tdp_mmu_iter_flush_cond_resched from
> tdp_mmu_iter_cond_resched, my remaining changes on top are pretty
> small and mostly cosmetic. I'll give it another go next week
> and send it Linus's way if everything's all right.

Fantastic, thank you!

>
> Paolo
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f8525c89fc95..baf260421a56 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -7,20 +7,15 @@
> #include "tdp_mmu.h"
> #include "spte.h"
>
> +#ifdef CONFIG_X86_64
> static bool __read_mostly tdp_mmu_enabled = false;
> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> +#endif
>
> static bool is_tdp_mmu_enabled(void)
> {
> #ifdef CONFIG_X86_64
> - if (!READ_ONCE(tdp_mmu_enabled))
> - return false;
> -
> - if (WARN_ONCE(!tdp_enabled,
> - "Creating a VM with TDP MMU enabled requires TDP."))
> - return false;
> -
> - return true;
> -
> + return tdp_enabled && READ_ONCE(tdp_mmu_enabled);
> #else
> return false;
> #endif /* CONFIG_X86_64 */
> @@ -277,8 +277,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> unaccount_huge_nx_page(kvm, sp);
>
> for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> - old_child_spte = *(pt + i);
> - *(pt + i) = 0;
> + old_child_spte = READ_ONCE(*(pt + i));
> + WRITE_ONCE(*(pt + i), 0);
> handle_changed_spte(kvm, as_id,
> gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> old_child_spte, 0, level - 1);
> @@ -309,7 +309,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> int as_id = kvm_mmu_page_as_id(root);
>
> - *iter->sptep = new_spte;
> + WRITE_ONCE(*iter->sptep, new_spte);
>
> __handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> iter->level);
> @@ -361,16 +361,28 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> for_each_tdp_pte(_iter, __va(_mmu->root_hpa), \
> _mmu->shadow_root_level, _start, _end)
>
> -static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +/*
> + * Flush the TLB if the process should drop kvm->mmu_lock.
> + * Return whether the caller still needs to flush the tlb.
> + */
> +static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> {
> if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> kvm_flush_remote_tlbs(kvm);
> cond_resched_lock(&kvm->mmu_lock);
> tdp_iter_refresh_walk(iter);
> + return false;
> + } else {
> return true;
> }
> +}
>
> - return false;
> +static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> + cond_resched_lock(&kvm->mmu_lock);
> + tdp_iter_refresh_walk(iter);
> + }
> }
>
> /*
> @@ -407,7 +419,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> tdp_mmu_set_spte(kvm, &iter, 0);
>
> if (can_yield)
> - flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> + flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> else
> flush_needed = true;
> }
> @@ -479,7 +479,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
> map_writable, !shadow_accessed_mask,
> &new_spte);
>
> - tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
> + if (new_spte == iter->old_spte)
> + ret = RET_PF_SPURIOUS;
> + else
> + tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
>
> /*
> * If the page fault was caused by a write but the page is write
> @@ -496,7 +496,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
> }
>
> /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> - if (unlikely(is_mmio_spte(new_spte)))
> + else if (unlikely(is_mmio_spte(new_spte)))
> ret = RET_PF_EMULATE;
>
> trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
> @@ -528,8 +528,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> int level;
> int req_level;
>
> - BUG_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
> - BUG_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
> + if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> + return RET_PF_RETRY;
> + if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> + return RET_PF_RETRY;
>
> level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
> huge_page_disallowed, &req_level);
> @@ -579,7 +581,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> }
> }
>
> - BUG_ON(iter.level != level);
> + if (WARN_ON(iter.level != level))
> + return RET_PF_RETRY;
>
> ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
> pfn, prefault);
> @@ -817,9 +829,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
> */
> kvm_mmu_get_root(kvm, root);
>
> - spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> - slot->base_gfn + slot->npages, min_level) ||
> - spte_set;
> + spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
> + slot->base_gfn + slot->npages, min_level);
>
> kvm_mmu_put_root(kvm, root);
> }
> @@ -886,8 +897,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
> */
> kvm_mmu_get_root(kvm, root);
>
> - spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
> - slot->base_gfn + slot->npages) || spte_set;
> + spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
> + slot->base_gfn + slot->npages);
>
> kvm_mmu_put_root(kvm, root);
> }
> @@ -1009,8 +1020,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
> */
> kvm_mmu_get_root(kvm, root);
>
> - spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
> - slot->base_gfn + slot->npages) || spte_set;
> + spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
> + slot->base_gfn + slot->npages);
>
> kvm_mmu_put_root(kvm, root);
> }
> @@ -1042,9 +1053,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> continue;
>
> tdp_mmu_set_spte(kvm, &iter, 0);
> - spte_set = true;
>
> - spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
> + spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> }
>
> if (spte_set)
>

2020-10-20 06:21:13

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 15/20] kvm: x86/mmu: Support dirty logging for the TDP MMU

On Fri, Oct 16, 2020 at 9:18 AM Paolo Bonzini <[email protected]> wrote:
>
> On 14/10/20 20:26, Ben Gardon wrote:
> >
> > + if (kvm->arch.tdp_mmu_enabled)
> > + kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
> > + slot->base_gfn + gfn_offset, mask, true);
>
> This was "false" in v1, I need --verbose for this change. :)

I don't think this changed from v1. Note that there are two callers in
mmu.c - kvm_mmu_write_protect_pt_masked and
kvm_mmu_clear_dirty_pt_masked. One calls with wrprot = true and the
other with wrprot = false.
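
(For reference, here is a condensed sketch of those two call sites -- the
bodies are abbreviated and reconstructed from the description above rather
than copied from the patch, so treat it as illustrative only:)

static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
					    struct kvm_memory_slot *slot,
					    gfn_t gfn_offset, unsigned long mask)
{
	if (kvm->arch.tdp_mmu_enabled)
		kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
				slot->base_gfn + gfn_offset, mask, true);
	/* ... legacy rmap-based write protection ... */
}

static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
					  struct kvm_memory_slot *slot,
					  gfn_t gfn_offset, unsigned long mask)
{
	if (kvm->arch.tdp_mmu_enabled)
		kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
				slot->base_gfn + gfn_offset, mask, false);
	/* ... legacy rmap-based D-bit clearing ... */
}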

>
> > while (mask) {
> > rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
>
> > + spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> > + slot->base_gfn + slot->npages, min_level) ||
> > + spte_set;
>
> A few remaining instances of ||.

Gah, I thought I had gotten all of them. Thanks for catching these.

>
> Paolo
>

2020-10-20 06:53:01

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 15/20] kvm: x86/mmu: Support dirty logging for the TDP MMU

On 19/10/20 19:07, Ben Gardon wrote:
> On Fri, Oct 16, 2020 at 9:18 AM Paolo Bonzini <[email protected]> wrote:
>>
>> On 14/10/20 20:26, Ben Gardon wrote:
>>>
>>> + if (kvm->arch.tdp_mmu_enabled)
>>> + kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
>>> + slot->base_gfn + gfn_offset, mask, true);
>>
>> This was "false" in v1, I need --verbose for this change. :)
>
> I don't think this changed from v1. Note that there are two callers in
> mmu.c - kvm_mmu_write_protect_pt_masked and
> kvm_mmu_clear_dirty_pt_masked. One calls with wrprot = true and the
> other with wrprot = false.

Ah, I messed up fixing the conflicts.

Paolo

2020-10-20 07:56:24

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On Wed, 2020-10-14 at 11:26 -0700, Ben Gardon wrote:
> @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> gfn_start, gfn_t gfn_end)
> struct kvm_memslots *slots;
> struct kvm_memory_slot *memslot;
> int i;
> + bool flush;
>
> spin_lock(&kvm->mmu_lock);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> gfn_start, gfn_t gfn_end)
> }
> }
>
> + if (kvm->arch.tdp_mmu_enabled) {
> + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start,
> gfn_end);
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> + }
> +
> spin_unlock(&kvm->mmu_lock);
> }

Hi,

I'm just going through this looking at how I might integrate some other
MMU changes I had been working on. But as long as I am, I'll toss out
an extremely small comment that the "flush" bool seems unnecessary.

I'm also wondering a bit about this function in general. It seems that
this change adds an extra flush in the nested case, but this operation
already flushes for each memslot in order to facilitate the spin break.
If slot_handle_level_range() took some extra parameters it could maybe
be avoided. Not sure if it's worth it.

Rick

2020-10-20 08:03:29

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On Mon, Oct 19, 2020 at 1:50 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Wed, 2020-10-14 at 11:26 -0700, Ben Gardon wrote:
> > @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> > struct kvm_memslots *slots;
> > struct kvm_memory_slot *memslot;
> > int i;
> > + bool flush;
> >
> > spin_lock(&kvm->mmu_lock);
> > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> > }
> > }
> >
> > + if (kvm->arch.tdp_mmu_enabled) {
> > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start,
> > gfn_end);
> > + if (flush)
> > + kvm_flush_remote_tlbs(kvm);
> > + }
> > +
> > spin_unlock(&kvm->mmu_lock);
> > }
>
> Hi,
>
> I'm just going through this looking at how I might integrate some other
> MMU changes I had been working on. But as long as I am, I'll toss out
> an extremely small comment that the "flush" bool seems unnecessary.

I agree this could easily be replaced with:
if (kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end))
	kvm_flush_remote_tlbs(kvm);

I like the flush variable just because I think it gives a little more
explanation to the code, but I agree both are perfectly good.

>
> I'm also wondering a bit about this function in general. It seems that
> this change adds an extra flush in the nested case, but this operation
> already flushed for each memslot in order to facilitate the spin break.
> If slot_handle_level_range() took some extra parameters it could maybe
> be avoided. Not sure if it's worth it.

I agree, there's a lot of room for optimization here to reduce the
number of TLB flushes. In this series I have not been too concerned
about optimizing performance. I wanted it to be easy to review and to
minimize the number of bugs in the code.

Future patch series will optimize the TDP MMU and make it actually
performant. Two specific changes I have planned to reduce the number
of TLB flushes are 1.) a deferred TLB flush scheme using the existing
vm-global tlbs_dirty count and 2.) a system for skipping the "legacy
MMU" handlers for various operations if the TDP MMU is enabled and the
"legacy MMU" has not been used on that VM. I believe both of these are
present in the original RFC I sent out a year ago if you're
interested. I'll CC you on those future optimizations.

>
> Rick

2020-10-20 10:14:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/20] Introduce the TDP MMU

On 19/10/20 20:15, Ben Gardon wrote:
> When getting the dirty log, we
> follow the following steps:
> 1. Atomically get and clear an unsigned long of the dirty bitmap
> 2. For each GFN in the range of pages covered by the unsigned long mask:
> 3. Clear the dirty or writable bit on the SPTE
> 4. Copy the mask of dirty pages to be returned to userspace
>
> If we mark the page as dirty in the dirty bitmap in step 3, we'll
> report the page as dirty twice - once in this dirty log call, and
> again in the next one. This can lead to unexpected behavior:
> 1. Pause all vCPUs
> 2. Get the dirty log <--- Returns all pages dirtied before the vCPUs were paused
> 3. Get the dirty log again <--- Unexpectedly returns a non-zero number
> of dirty pages even though no pages were actually dirtied

Got it, that might also fail the dirty_log_test. Thanks!

Paolo

2020-10-21 15:11:20

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

On Wed, Oct 14, 2020 at 11:26:44AM -0700, Ben Gardon wrote:
> The TDP MMU must be able to allocate paging structure root pages and track
> the usage of those pages. Implement a similar, but separate system for root
> page allocation to that of the x86 shadow paging implementation. When
> future patches add synchronization model changes to allow for parallel
> page faults, these pages will need to be handled differently from the
> x86 shadow paging based MMU's root pages.
>
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
>
> This series can be viewed in Gerrit at:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 29 +++++---
> arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 114 ++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++
> 5 files changed, 162 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6b6dbc20ce23a..e0ec1dd271a32 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -989,6 +989,7 @@ struct kvm_arch {
> * operations.
> */
> bool tdp_mmu_enabled;
> + struct list_head tdp_mmu_roots;
> };
>
> struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f53d29e09367c..a3340ed59ad1d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -144,11 +144,6 @@ module_param(dbg, bool, 0644);
> #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
> | shadow_x_mask | shadow_nx_mask | shadow_me_mask)
>
> -#define ACC_EXEC_MASK 1
> -#define ACC_WRITE_MASK PT_WRITABLE_MASK
> -#define ACC_USER_MASK PT_USER_MASK
> -#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> -
> /* The mask for the R/X bits in EPT PTEs */
> #define PT64_EPT_READABLE_MASK 0x1ull
> #define PT64_EPT_EXECUTABLE_MASK 0x4ull
> @@ -209,7 +204,7 @@ struct kvm_shadow_walk_iterator {
> __shadow_walk_next(&(_walker), spte))
>
> static struct kmem_cache *pte_list_desc_cache;
> -static struct kmem_cache *mmu_page_header_cache;
> +struct kmem_cache *mmu_page_header_cache;
> static struct percpu_counter kvm_total_used_mmu_pages;
>
> static u64 __read_mostly shadow_nx_mask;
> @@ -3588,9 +3583,13 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> return;
>
> sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> - --sp->root_count;
> - if (!sp->root_count && sp->role.invalid)
> - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> +
> + if (kvm_mmu_put_root(sp)) {
> + if (sp->tdp_mmu_page)
> + kvm_tdp_mmu_free_root(kvm, sp);
> + else if (sp->role.invalid)
> + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> + }
>
> *root_hpa = INVALID_PAGE;
> }
> @@ -3680,8 +3679,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> hpa_t root;
> unsigned i;
>
> - if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> + if (vcpu->kvm->arch.tdp_mmu_enabled) {
> + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> +
> + if (!VALID_PAGE(root))
> + return -ENOSPC;
> + vcpu->arch.mmu->root_hpa = root;
> + } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> + root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
> + true);
> +
> if (!VALID_PAGE(root))
> return -ENOSPC;
> vcpu->arch.mmu->root_hpa = root;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 74ccbf001a42e..6cedf578c9a8d 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -43,8 +43,12 @@ struct kvm_mmu_page {
>
> /* Number of writes since the last time traversal visited this page. */
> atomic_t write_flooding_count;
> +
> + bool tdp_mmu_page;
> };
>
> +extern struct kmem_cache *mmu_page_header_cache;
> +
> static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> {
> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> @@ -96,6 +100,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> * PT64_LEVEL_BITS))) - 1))
>
> +#define ACC_EXEC_MASK 1
> +#define ACC_WRITE_MASK PT_WRITABLE_MASK
> +#define ACC_USER_MASK PT_USER_MASK
> +#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> +
> /* Functions for interpreting SPTEs */
> static inline bool is_mmio_spte(u64 spte)
> {
> @@ -126,4 +135,19 @@ static inline kvm_pfn_t spte_to_pfn(u64 pte)
> return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> }
>
> +static inline void kvm_mmu_get_root(struct kvm_mmu_page *sp)
> +{
> + BUG_ON(!sp->root_count);
> +
> + ++sp->root_count;
> +}
> +
> +static inline bool kvm_mmu_put_root(struct kvm_mmu_page *sp)
> +{
> + --sp->root_count;
> +
> + return !sp->root_count;
> +}
> +
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b3809835e90b1..09a84a6e157b6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1,5 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
>
> +#include "mmu.h"
> +#include "mmu_internal.h"
> #include "tdp_mmu.h"
>
> static bool __read_mostly tdp_mmu_enabled = false;
> @@ -29,10 +31,122 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>
> /* This should not be changed for the lifetime of the VM. */
> kvm->arch.tdp_mmu_enabled = true;
> +
> + INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> }
>
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> {
> if (!kvm->arch.tdp_mmu_enabled)
> return;
> +
> + WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> +}
> +
> +#define for_each_tdp_mmu_root(_kvm, _root) \
> + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
> +
> +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> +{
> + struct kvm_mmu_page *sp;
> +
> + sp = to_shadow_page(hpa);
> +
> + return sp->tdp_mmu_page && sp->root_count;
> +}
> +
> +void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + lockdep_assert_held(&kvm->mmu_lock);
> +
> + WARN_ON(root->root_count);
> + WARN_ON(!root->tdp_mmu_page);
> +
> + list_del(&root->link);
> +
> + free_page((unsigned long)root->spt);
> + kmem_cache_free(mmu_page_header_cache, root);
> +}
> +
> +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + if (kvm_mmu_put_root(root))
> + kvm_tdp_mmu_free_root(kvm, root);
> +}
> +
> +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> + lockdep_assert_held(&kvm->mmu_lock);
> +
> + kvm_mmu_get_root(root);
> +}
> +
> +static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> + int level)
> +{
> + union kvm_mmu_page_role role;
> +
> + role = vcpu->arch.mmu->mmu_role.base;
> + role.level = vcpu->arch.mmu->shadow_root_level;

role.level = level;

The role will also be calculated for non-root pages later, so it should
use the level parameter here rather than shadow_root_level.

> + role.direct = true;
> + role.gpte_is_8_bytes = true;
> + role.access = ACC_ALL;
> +
> + return role;
> +}
> +
> +static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> + int level)
> +{
> + struct kvm_mmu_page *sp;
> +
> + sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> + sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> +
> + sp->role.word = page_role_for_level(vcpu, level).word;
> + sp->gfn = gfn;
> + sp->tdp_mmu_page = true;
> +
> + return sp;
> +}
> +
> +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
> +{
> + union kvm_mmu_page_role role;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_mmu_page *root;
> +
> + role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> +
> + spin_lock(&kvm->mmu_lock);
> +
> + /* Check for an existing root before allocating a new one. */
> + for_each_tdp_mmu_root(kvm, root) {
> + if (root->role.word == role.word) {
> + get_tdp_mmu_root(kvm, root);
> + spin_unlock(&kvm->mmu_lock);
> + return root;
> + }
> + }
> +
> + root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> + root->root_count = 1;
> +
> + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> +
> + spin_unlock(&kvm->mmu_lock);
> +
> + return root;
> +}
> +
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_page *root;
> +
> + root = get_tdp_mmu_vcpu_root(vcpu);
> + if (!root)
> + return INVALID_PAGE;
> +
> + return __pa(root->spt);
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index cd4a562a70e9a..ac0ef91294420 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -7,4 +7,9 @@
>
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> +
> +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
> +
> #endif /* __KVM_X86_MMU_TDP_MMU_H */
> --
> 2.28.0.1011.ga647a8990f-goog
>

Thanks
Yu

2020-10-22 03:27:06

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] kvm: x86/mmu: Introduce tdp_iter

On Wed, Oct 14, 2020 at 11:26:42AM -0700, Ben Gardon wrote:
> The TDP iterator implements a pre-order traversal of a TDP paging
> structure. This iterator will be used in future patches to create
> an efficient implementation of the KVM MMU for the TDP case.
>
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
>
> This series can be viewed in Gerrit at:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/mmu/mmu.c | 66 ------------
> arch/x86/kvm/mmu/mmu_internal.h | 66 ++++++++++++
> arch/x86/kvm/mmu/tdp_iter.c | 176 ++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_iter.h | 56 ++++++++++
> 5 files changed, 300 insertions(+), 67 deletions(-)
> create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 7f86a14aed0e9..4525c1151bf99 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> - hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
> + hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
> + mmu/tdp_iter.o
>
> kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6c9db349600c8..6d82784ed5679 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -121,28 +121,6 @@ module_param(dbg, bool, 0644);
>
> #define PTE_PREFETCH_NUM 8
>
> -#define PT_FIRST_AVAIL_BITS_SHIFT 10
> -#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> -
> -/*
> - * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> - * Access Tracking SPTEs.
> - */
> -#define SPTE_SPECIAL_MASK (3ULL << 52)
> -#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> -#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> -#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> -#define SPTE_MMIO_MASK (3ULL << 52)
> -
> -#define PT64_LEVEL_BITS 9
> -
> -#define PT64_LEVEL_SHIFT(level) \
> - (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> -
> -#define PT64_INDEX(address, level)\
> - (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> -
> -
> #define PT32_LEVEL_BITS 10
>
> #define PT32_LEVEL_SHIFT(level) \
> @@ -155,19 +133,6 @@ module_param(dbg, bool, 0644);
> #define PT32_INDEX(address, level)\
> (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
>
> -
> -#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> -#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> -#else
> -#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> -#endif
> -#define PT64_LVL_ADDR_MASK(level) \
> - (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> - * PT64_LEVEL_BITS))) - 1))
> -#define PT64_LVL_OFFSET_MASK(level) \
> - (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> - * PT64_LEVEL_BITS))) - 1))
> -
> #define PT32_BASE_ADDR_MASK PAGE_MASK
> #define PT32_DIR_BASE_ADDR_MASK \
> (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
> @@ -192,8 +157,6 @@ module_param(dbg, bool, 0644);
> #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
> #define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
>
> -#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> -
> /* make pte_list_desc fit well in cache line */
> #define PTE_LIST_EXT 3
>
> @@ -349,11 +312,6 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
> -static bool is_mmio_spte(u64 spte)
> -{
> - return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> -}
> -
> static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> {
> return sp->role.ad_disabled;
> @@ -626,35 +584,11 @@ static int is_nx(struct kvm_vcpu *vcpu)
> return vcpu->arch.efer & EFER_NX;
> }
>
> -static int is_shadow_present_pte(u64 pte)
> -{
> - return (pte != 0) && !is_mmio_spte(pte);
> -}
> -
> -static int is_large_pte(u64 pte)
> -{
> - return pte & PT_PAGE_SIZE_MASK;
> -}
> -
> -static int is_last_spte(u64 pte, int level)
> -{
> - if (level == PG_LEVEL_4K)
> - return 1;
> - if (is_large_pte(pte))
> - return 1;
> - return 0;
> -}
> -
> static bool is_executable_pte(u64 spte)
> {
> return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
> }
>
> -static kvm_pfn_t spte_to_pfn(u64 pte)
> -{
> - return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> -}
> -
> static gfn_t pse36_gfn_delta(u32 gpte)
> {
> int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 3acf3b8eb469d..74ccbf001a42e 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -2,6 +2,8 @@
> #ifndef __KVM_X86_MMU_INTERNAL_H
> #define __KVM_X86_MMU_INTERNAL_H
>
> +#include "mmu.h"
> +
> #include <linux/types.h>
>
> #include <asm/kvm_host.h>
> @@ -60,4 +62,68 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
> bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> struct kvm_memory_slot *slot, u64 gfn);
>
> +#define PT64_LEVEL_BITS 9
> +
> +#define PT64_LEVEL_SHIFT(level) \
> + (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> +
> +#define PT64_INDEX(address, level)\
> + (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> +#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +
> +#define PT_FIRST_AVAIL_BITS_SHIFT 10
> +#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> +
> +/*
> + * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> + * Access Tracking SPTEs.
> + */
> +#define SPTE_SPECIAL_MASK (3ULL << 52)
> +#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> +#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> +#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> +#define SPTE_MMIO_MASK (3ULL << 52)
> +
> +#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> +#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> +#else
> +#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> +#endif
> +#define PT64_LVL_ADDR_MASK(level) \
> + (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> + * PT64_LEVEL_BITS))) - 1))
> +#define PT64_LVL_OFFSET_MASK(level) \
> + (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> + * PT64_LEVEL_BITS))) - 1))
> +
> +/* Functions for interpreting SPTEs */
> +static inline bool is_mmio_spte(u64 spte)
> +{
> + return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> +}
> +
> +static inline int is_shadow_present_pte(u64 pte)
> +{
> + return (pte != 0) && !is_mmio_spte(pte);
> +}
> +
> +static inline int is_large_pte(u64 pte)
> +{
> + return pte & PT_PAGE_SIZE_MASK;
> +}
> +
> +static inline int is_last_spte(u64 pte, int level)
> +{
> + if (level == PG_LEVEL_4K)
> + return 1;
> + if (is_large_pte(pte))
> + return 1;
> + return 0;
> +}
> +
> +static inline kvm_pfn_t spte_to_pfn(u64 pte)
> +{
> + return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> +}
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> new file mode 100644
> index 0000000000000..b07e9f0c5d4aa
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -0,0 +1,176 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "mmu_internal.h"
> +#include "tdp_iter.h"
> +
> +/*
> + * Recalculates the pointer to the SPTE for the current GFN and level and
> + * reread the SPTE.
> + */
> +static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> +{
> + iter->sptep = iter->pt_path[iter->level - 1] +
> + SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> + iter->old_spte = READ_ONCE(*iter->sptep);
> +}
> +
> +static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> +{
> + return gfn - (gfn % KVM_PAGES_PER_HPAGE(level));

Instead of the modulo operator, how about we use:
return gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
here? :)
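
Just to spell it out (the helper names below are only for illustration, not a
real patch): since KVM_PAGES_PER_HPAGE(level) is always a power of two, the
two forms round gfn down to the same boundary, and the mask avoids the modulo:

static gfn_t round_gfn_mod(gfn_t gfn, int level)
{
        /* Round down by subtracting the offset within the level-sized region. */
        return gfn - (gfn % KVM_PAGES_PER_HPAGE(level));
}

static gfn_t round_gfn_mask(gfn_t gfn, int level)
{
        /* Equivalent because KVM_PAGES_PER_HPAGE(level) is a power of two. */
        return gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
}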

> +}
> +
> +/*
> + * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> + * rooted at root_pt, starting with the walk to translate goal_gfn.
> + */
> +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> + int min_level, gfn_t goal_gfn)
> +{
> + WARN_ON(root_level < 1);
> + WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
> +
> + iter->goal_gfn = goal_gfn;
> + iter->root_level = root_level;
> + iter->min_level = min_level;
> + iter->level = root_level;
> + iter->pt_path[iter->level - 1] = root_pt;
> +
> + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> + tdp_iter_refresh_sptep(iter);
> +
> + iter->valid = true;
> +}
> +
> +/*
> + * Given an SPTE and its level, returns a pointer containing the host virtual
> + * address of the child page table referenced by the SPTE. Returns null if
> + * there is no such entry.
> + */
> +u64 *spte_to_child_pt(u64 spte, int level)
> +{
> + /*
> + * There's no child entry if this entry isn't present or is a
> + * last-level entry.
> + */
> + if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
> + return NULL;
> +
> + return __va(spte_to_pfn(spte) << PAGE_SHIFT);
> +}
> +
> +/*
> + * Steps down one level in the paging structure towards the goal GFN. Returns
> + * true if the iterator was able to step down a level, false otherwise.
> + */
> +static bool try_step_down(struct tdp_iter *iter)
> +{
> + u64 *child_pt;
> +
> + if (iter->level == iter->min_level)
> + return false;
> +
> + /*
> + * Reread the SPTE before stepping down to avoid traversing into page
> + * tables that are no longer linked from this entry.
> + */
> + iter->old_spte = READ_ONCE(*iter->sptep);
> +
> + child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> + if (!child_pt)
> + return false;
> +
> + iter->level--;
> + iter->pt_path[iter->level - 1] = child_pt;
> + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> + tdp_iter_refresh_sptep(iter);
> +
> + return true;
> +}
> +
> +/*
> + * Steps to the next entry in the current page table, at the current page table
> + * level. The next entry could point to a page backing guest memory or another
> + * page table, or it could be non-present. Returns true if the iterator was
> + * able to step to the next entry in the page table, false if the iterator was
> + * already at the end of the current page table.
> + */
> +static bool try_step_side(struct tdp_iter *iter)
> +{
> + /*
> + * Check if the iterator is already at the end of the current page
> + * table.
> + */
> + if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
> + KVM_PAGES_PER_HPAGE(iter->level + 1)))
> + return false;
> +

And maybe:
if (SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
(PT64_ENT_PER_PAGE - 1))
here?
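
In other words, both forms detect that the iterator is at the last entry of
the current page table. A rough illustration (these helper names are made up,
just to show the equivalence):

static bool at_last_entry_mod(struct tdp_iter *iter)
{
        /* Stepping one more entry would cross into the next parent entry. */
        return !((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
                 KVM_PAGES_PER_HPAGE(iter->level + 1));
}

static bool at_last_entry_idx(struct tdp_iter *iter)
{
        /* The current index within the page table is already the last one. */
        return SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
               (PT64_ENT_PER_PAGE - 1);
}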

> + iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> + iter->goal_gfn = iter->gfn;
> + iter->sptep++;
> + iter->old_spte = READ_ONCE(*iter->sptep);
> +
> + return true;
> +}
> +
> +/*
> + * Tries to traverse back up a level in the paging structure so that the walk
> + * can continue from the next entry in the parent page table. Returns true on a
> + * successful step up, false if already in the root page.
> + */
> +static bool try_step_up(struct tdp_iter *iter)
> +{
> + if (iter->level == iter->root_level)
> + return false;
> +
> + iter->level++;
> + iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
> + tdp_iter_refresh_sptep(iter);
> +
> + return true;
> +}
> +
> +/*
> + * Step to the next SPTE in a pre-order traversal of the paging structure.
> + * To get to the next SPTE, the iterator either steps down towards the goal
> + * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
> + * higher GFN.
> + *
> + * The basic algorithm is as follows:
> + * 1. If the current SPTE is a non-last-level SPTE, step down into the page
> + * table it points to.
> + * 2. If the iterator cannot step down, it will try to step to the next SPTE
> + * in the current page of the paging structure.
> + * 3. If the iterator cannot step to the next entry in the current page, it will
> + * try to step up to the parent paging structure page. In this case, that
> + * SPTE will have already been visited, and so the iterator must also step
> + * to the side again.
> + */
> +void tdp_iter_next(struct tdp_iter *iter)
> +{
> + if (try_step_down(iter))
> + return;
> +
> + do {
> + if (try_step_side(iter))
> + return;
> + } while (try_step_up(iter));
> + iter->valid = false;
> +}
> +
> +/*
> + * Restart the walk over the paging structure from the root, starting from the
> + * highest gfn the iterator had previously reached. Assumes that the entire
> + * paging structure, except the root page, may have been completely torn down
> + * and rebuilt.
> + */
> +void tdp_iter_refresh_walk(struct tdp_iter *iter)
> +{
> + gfn_t goal_gfn = iter->goal_gfn;
> +
> + if (iter->gfn > goal_gfn)
> + goal_gfn = iter->gfn;
> +
> + tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
> + iter->root_level, iter->min_level, goal_gfn);
> +}
> +
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> new file mode 100644
> index 0000000000000..d629a53e1b73f
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#ifndef __KVM_X86_MMU_TDP_ITER_H
> +#define __KVM_X86_MMU_TDP_ITER_H
> +
> +#include <linux/kvm_host.h>
> +
> +#include "mmu.h"
> +
> +/*
> + * A TDP iterator performs a pre-order walk over a TDP paging structure.
> + */
> +struct tdp_iter {
> + /*
> + * The iterator will traverse the paging structure towards the mapping
> + * for this GFN.
> + */
> + gfn_t goal_gfn;
> + /* Pointers to the page tables traversed to reach the current SPTE */
> + u64 *pt_path[PT64_ROOT_MAX_LEVEL];
> + /* A pointer to the current SPTE */
> + u64 *sptep;
> + /* The lowest GFN mapped by the current SPTE */
> + gfn_t gfn;
> + /* The level of the root page given to the iterator */
> + int root_level;
> + /* The lowest level the iterator should traverse to */
> + int min_level;
> + /* The iterator's current level within the paging structure */
> + int level;
> + /* A snapshot of the value at sptep */
> + u64 old_spte;
> + /*
> + * Whether the iterator has a valid state. This will be false if the
> + * iterator walks off the end of the paging structure.
> + */
> + bool valid;
> +};
> +
> +/*
> + * Iterates over every SPTE mapping the GFN range [start, end) in a
> + * preorder traversal.
> + */
> +#define for_each_tdp_pte(iter, root, root_level, start, end) \
> + for (tdp_iter_start(&iter, root, root_level, PG_LEVEL_4K, start); \
> + iter.valid && iter.gfn < end; \
> + tdp_iter_next(&iter))
> +
> +u64 *spte_to_child_pt(u64 pte, int level);
> +
> +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> + int min_level, gfn_t goal_gfn);
> +void tdp_iter_next(struct tdp_iter *iter);
> +void tdp_iter_refresh_walk(struct tdp_iter *iter);
> +
> +#endif /* __KVM_X86_MMU_TDP_ITER_H */
> --
> 2.28.0.1011.ga647a8990f-goog
>

I am just suggesting to replace the modulo arithmetic with bitwise operations...
Also, it's very exciting to see such a patch set. Thanks!

B.R.
Yu

2020-10-22 03:28:21

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On Wed, Oct 14, 2020 at 11:26:47AM -0700, Ben Gardon wrote:
> Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
> TDP MMU roots properly and implement other MMU functions which require
> tearing down mappings. Future patches will add functions to populate the
> page tables, but as for this patch there will not be any work for these
> functions to do.
>
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
>
> This series can be viewed in Gerrit at:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 15 +++++
> arch/x86/kvm/mmu/tdp_iter.c | 5 ++
> arch/x86/kvm/mmu/tdp_iter.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 109 ++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +
> 5 files changed, 132 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8bf20723c6177..337ab6823e312 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,6 +5787,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_reload_remote_mmus(kvm);
>
> kvm_zap_obsolete_pages(kvm);
> +
> + if (kvm->arch.tdp_mmu_enabled)
> + kvm_tdp_mmu_zap_all(kvm);
> +
> spin_unlock(&kvm->mmu_lock);
> }
>
> @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> struct kvm_memslots *slots;
> struct kvm_memory_slot *memslot;
> int i;
> + bool flush;
>
> spin_lock(&kvm->mmu_lock);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> }
> }
>
> + if (kvm->arch.tdp_mmu_enabled) {
> + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> + }
> +
> spin_unlock(&kvm->mmu_lock);
> }
>
> @@ -6012,6 +6023,10 @@ void kvm_mmu_zap_all(struct kvm *kvm)
> }
>
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +
> + if (kvm->arch.tdp_mmu_enabled)
> + kvm_tdp_mmu_zap_all(kvm);
> +
> spin_unlock(&kvm->mmu_lock);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b07e9f0c5d4aa..701eb753b701e 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -174,3 +174,8 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
> iter->root_level, iter->min_level, goal_gfn);
> }
>
> +u64 *tdp_iter_root_pt(struct tdp_iter *iter)
> +{
> + return iter->pt_path[iter->root_level - 1];
> +}
> +
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index d629a53e1b73f..884ed2c70bfed 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -52,5 +52,6 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> int min_level, gfn_t goal_gfn);
> void tdp_iter_next(struct tdp_iter *iter);
> void tdp_iter_refresh_walk(struct tdp_iter *iter);
> +u64 *tdp_iter_root_pt(struct tdp_iter *iter);
>
> #endif /* __KVM_X86_MMU_TDP_ITER_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f2bd3a6928ce9..9b5cd4a832f1a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -56,8 +56,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> return sp->tdp_mmu_page && sp->root_count;
> }
>
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end);
> +
> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> {
> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> +

boot_cpu_data.x86_phys_bits is the host physical address width; the guest's
value may differ. So maybe we should just traverse the memslots and zap the gfn
ranges in each of them?
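
Something along these lines is what I have in mind (just a sketch, the helper
name is made up; zap_gfn_range() and kvm_mmu_page_as_id() are the static
helpers added by this patch):

static bool zap_root_memslot_ranges(struct kvm *kvm, struct kvm_mmu_page *root)
{
        struct kvm_memslots *slots = __kvm_memslots(kvm, kvm_mmu_page_as_id(root));
        struct kvm_memory_slot *memslot;
        bool flush = false;

        /* Only zap the GFN ranges that are actually covered by a memslot. */
        kvm_for_each_memslot(memslot, slots)
                flush |= zap_gfn_range(kvm, root, memslot->base_gfn,
                                       memslot->base_gfn + memslot->npages);

        return flush;
}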

> lockdep_assert_held(&kvm->mmu_lock);
>
> WARN_ON(root->root_count);
> @@ -65,6 +70,8 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>
> list_del(&root->link);
>
> + zap_gfn_range(kvm, root, 0, max_gfn);
> +
> free_page((unsigned long)root->spt);
> kmem_cache_free(mmu_page_header_cache, root);
> }
> @@ -155,6 +162,11 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> u64 old_spte, u64 new_spte, int level);
>
> +static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> +{
> + return sp->role.smm ? 1 : 0;
> +}
> +
> /**
> * handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> @@ -262,3 +274,100 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> {
> __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
> }
> +
> +static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> + u64 new_spte)
> +{
> + u64 *root_pt = tdp_iter_root_pt(iter);
> + struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> + int as_id = kvm_mmu_page_as_id(root);
> +
> + *iter->sptep = new_spte;
> +
> + handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> + iter->level);
> +}
> +
> +#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> + for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
> +
> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> + kvm_flush_remote_tlbs(kvm);
> + cond_resched_lock(&kvm->mmu_lock);
> + tdp_iter_refresh_walk(iter);
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end)
> +{
> + struct tdp_iter iter;
> + bool flush_needed = false;
> +
> + tdp_root_for_each_pte(iter, root, start, end) {
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> + /*
> + * If this is a non-last-level SPTE that covers a larger range
> + * than should be zapped, continue, and zap the mappings at a
> + * lower level.
> + */
> + if ((iter.gfn < start ||
> + iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> + !is_last_spte(iter.old_spte, iter.level))
> + continue;
> +
> + tdp_mmu_set_spte(kvm, &iter, 0);
> +
> + flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> + }
> + return flush_needed;
> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_mmu_page *root;
> + bool flush = false;
> +
> + for_each_tdp_mmu_root(kvm, root) {
> + /*
> + * Take a reference on the root so that it cannot be freed if
> + * this thread releases the MMU lock and yields in this loop.
> + */
> + get_tdp_mmu_root(kvm, root);
> +
> + flush |= zap_gfn_range(kvm, root, start, end);
> +
> + put_tdp_mmu_root(kvm, root);
> + }
> +
> + return flush;
> +}
> +
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> +{
> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> + bool flush;
> +
> + flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index ac0ef91294420..6de2d007fc03c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -12,4 +12,6 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
>
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> #endif /* __KVM_X86_MMU_TDP_MMU_H */
> --
> 2.28.0.1011.ga647a8990f-goog
>

B.R.
Yu

2020-10-22 06:12:18

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On 21/10/20 17:02, Yu Zhang wrote:
>> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>> {
>> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
>> +
> boot_cpu_data.x86_phys_bits is the host address width. Value of the guest's
> may vary. So maybe we should just traverse the memslots and zap the gfn ranges
> in each of them?
>

It must be smaller than the host value for two-dimensional paging, though.

Paolo

2020-10-22 06:24:19

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
> On 21/10/20 17:02, Yu Zhang wrote:
> >> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >> {
> >> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> >> +
> > boot_cpu_data.x86_phys_bits is the host address width. Value of the guest's
> > may vary. So maybe we should just traverse the memslots and zap the gfn ranges
> > in each of them?
> >
>
> It must be smaller than the host value for two-dimensional paging, though.

Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
overkill? E.g. for a host with a 46-bit physical address width and a guest
with only 39 bits?

Any concern about doing the zap by going through the memslots? Thanks. :)

B.R.
Yu
>
> Paolo
>

2020-10-22 06:31:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On 21/10/20 19:24, Yu Zhang wrote:
> On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
>> On 21/10/20 17:02, Yu Zhang wrote:
>>>> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>>> {
>>>> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
>>>> +
>>> boot_cpu_data.x86_phys_bits is the host address width. Value of the guest's
>>> may vary. So maybe we should just traverse the memslots and zap the gfn ranges
>>> in each of them?
>>>
>>
>> It must be smaller than the host value for two-dimensional paging, though.
>
> Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
> overkilling? E.g. for a host with 46 bits and a guest with 39 bits width?

It would get through the extra address space quickly because the PML4E entries
above the first would be empty, so it's at most 511 extra comparisons.
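
(Each PML4 entry covers 2^39 bytes of guest-physical address space, so a
39-bit guest fits entirely under the first entry; every entry above it is a
non-present SPTE that zap_gfn_range() skips with a single
is_shadow_present_pte() check.)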

Paolo

2020-10-22 06:31:27

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

On 21/10/20 19:54, Ben Gardon wrote:
> On Wed, Oct 21, 2020 at 8:09 AM Yu Zhang <[email protected]> wrote:
>>
>> On Wed, Oct 14, 2020 at 11:26:44AM -0700, Ben Gardon wrote:
>>> The TDP MMU must be able to allocate paging structure root pages and track
>>> the usage of those pages. Implement a similar, but separate system for root
>>> page allocation to that of the x86 shadow paging implementation. When
>>> future patches add synchronization model changes to allow for parallel
>>> page faults, these pages will need to be handled differently from the
>>> x86 shadow paging based MMU's root pages.
>>>
>>> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
>>> machine. This series introduced no new failures.
>>>
>>> This series can be viewed in Gerrit at:
>>> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
>>>
>>> Signed-off-by: Ben Gardon <[email protected]>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 1 +
>>> arch/x86/kvm/mmu/mmu.c | 29 +++++---
>>> arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++
>>> arch/x86/kvm/mmu/tdp_mmu.c | 114 ++++++++++++++++++++++++++++++++
>>> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++
>>> 5 files changed, 162 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 6b6dbc20ce23a..e0ec1dd271a32 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -989,6 +989,7 @@ struct kvm_arch {
>>> * operations.
>>> */
>>> bool tdp_mmu_enabled;
>>> + struct list_head tdp_mmu_roots;
>>> };
>>>
>>> struct kvm_vm_stat {
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index f53d29e09367c..a3340ed59ad1d 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -144,11 +144,6 @@ module_param(dbg, bool, 0644);
>>> #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
>>> | shadow_x_mask | shadow_nx_mask | shadow_me_mask)
>>>
>>> -#define ACC_EXEC_MASK 1
>>> -#define ACC_WRITE_MASK PT_WRITABLE_MASK
>>> -#define ACC_USER_MASK PT_USER_MASK
>>> -#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
>>> -
>>> /* The mask for the R/X bits in EPT PTEs */
>>> #define PT64_EPT_READABLE_MASK 0x1ull
>>> #define PT64_EPT_EXECUTABLE_MASK 0x4ull
>>> @@ -209,7 +204,7 @@ struct kvm_shadow_walk_iterator {
>>> __shadow_walk_next(&(_walker), spte))
>>>
>>> static struct kmem_cache *pte_list_desc_cache;
>>> -static struct kmem_cache *mmu_page_header_cache;
>>> +struct kmem_cache *mmu_page_header_cache;
>>> static struct percpu_counter kvm_total_used_mmu_pages;
>>>
>>> static u64 __read_mostly shadow_nx_mask;
>>> @@ -3588,9 +3583,13 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>>> return;
>>>
>>> sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
>>> - --sp->root_count;
>>> - if (!sp->root_count && sp->role.invalid)
>>> - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
>>> +
>>> + if (kvm_mmu_put_root(sp)) {
>>> + if (sp->tdp_mmu_page)
>>> + kvm_tdp_mmu_free_root(kvm, sp);
>>> + else if (sp->role.invalid)
>>> + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
>>> + }
>>>
>>> *root_hpa = INVALID_PAGE;
>>> }
>>> @@ -3680,8 +3679,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>>> hpa_t root;
>>> unsigned i;
>>>
>>> - if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>>> - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
>>> + if (vcpu->kvm->arch.tdp_mmu_enabled) {
>>> + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
>>> +
>>> + if (!VALID_PAGE(root))
>>> + return -ENOSPC;
>>> + vcpu->arch.mmu->root_hpa = root;
>>> + } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>>> + root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
>>> + true);
>>> +
>>> if (!VALID_PAGE(root))
>>> return -ENOSPC;
>>> vcpu->arch.mmu->root_hpa = root;
>>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>>> index 74ccbf001a42e..6cedf578c9a8d 100644
>>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>>> @@ -43,8 +43,12 @@ struct kvm_mmu_page {
>>>
>>> /* Number of writes since the last time traversal visited this page. */
>>> atomic_t write_flooding_count;
>>> +
>>> + bool tdp_mmu_page;
>>> };
>>>
>>> +extern struct kmem_cache *mmu_page_header_cache;
>>> +
>>> static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
>>> {
>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>> @@ -96,6 +100,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>>> (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
>>> * PT64_LEVEL_BITS))) - 1))
>>>
>>> +#define ACC_EXEC_MASK 1
>>> +#define ACC_WRITE_MASK PT_WRITABLE_MASK
>>> +#define ACC_USER_MASK PT_USER_MASK
>>> +#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
>>> +
>>> /* Functions for interpreting SPTEs */
>>> static inline bool is_mmio_spte(u64 spte)
>>> {
>>> @@ -126,4 +135,19 @@ static inline kvm_pfn_t spte_to_pfn(u64 pte)
>>> return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>> }
>>>
>>> +static inline void kvm_mmu_get_root(struct kvm_mmu_page *sp)
>>> +{
>>> + BUG_ON(!sp->root_count);
>>> +
>>> + ++sp->root_count;
>>> +}
>>> +
>>> +static inline bool kvm_mmu_put_root(struct kvm_mmu_page *sp)
>>> +{
>>> + --sp->root_count;
>>> +
>>> + return !sp->root_count;
>>> +}
>>> +
>>> +
>>> #endif /* __KVM_X86_MMU_INTERNAL_H */
>>> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
>>> index b3809835e90b1..09a84a6e157b6 100644
>>> --- a/arch/x86/kvm/mmu/tdp_mmu.c
>>> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
>>> @@ -1,5 +1,7 @@
>>> // SPDX-License-Identifier: GPL-2.0
>>>
>>> +#include "mmu.h"
>>> +#include "mmu_internal.h"
>>> #include "tdp_mmu.h"
>>>
>>> static bool __read_mostly tdp_mmu_enabled = false;
>>> @@ -29,10 +31,122 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>>>
>>> /* This should not be changed for the lifetime of the VM. */
>>> kvm->arch.tdp_mmu_enabled = true;
>>> +
>>> + INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>>> }
>>>
>>> void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>>> {
>>> if (!kvm->arch.tdp_mmu_enabled)
>>> return;
>>> +
>>> + WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>>> +}
>>> +
>>> +#define for_each_tdp_mmu_root(_kvm, _root) \
>>> + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
>>> +
>>> +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>>> +{
>>> + struct kvm_mmu_page *sp;
>>> +
>>> + sp = to_shadow_page(hpa);
>>> +
>>> + return sp->tdp_mmu_page && sp->root_count;
>>> +}
>>> +
>>> +void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>> +{
>>> + lockdep_assert_held(&kvm->mmu_lock);
>>> +
>>> + WARN_ON(root->root_count);
>>> + WARN_ON(!root->tdp_mmu_page);
>>> +
>>> + list_del(&root->link);
>>> +
>>> + free_page((unsigned long)root->spt);
>>> + kmem_cache_free(mmu_page_header_cache, root);
>>> +}
>>> +
>>> +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>> +{
>>> + if (kvm_mmu_put_root(root))
>>> + kvm_tdp_mmu_free_root(kvm, root);
>>> +}
>>> +
>>> +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>> +{
>>> + lockdep_assert_held(&kvm->mmu_lock);
>>> +
>>> + kvm_mmu_get_root(root);
>>> +}
>>> +
>>> +static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
>>> + int level)
>>> +{
>>> + union kvm_mmu_page_role role;
>>> +
>>> + role = vcpu->arch.mmu->mmu_role.base;
>>> + role.level = vcpu->arch.mmu->shadow_root_level;
>>
>> role.level = level;
>> The role will be calculated for non root pages later.
>
> Thank you for catching that Yu, that was definitely an error!
> I'm guessing this never showed up in my testing because I don't think
> the TDP MMU actually uses role.level for anything other than root
> pages.

I'll fix it up, thanks to both!
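
For reference, with that fix folded in the helper would look something like
this (only the role.level assignment changes):

static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
                                                   int level)
{
        union kvm_mmu_page_role role;

        role = vcpu->arch.mmu->mmu_role.base;
        role.level = level;     /* was: vcpu->arch.mmu->shadow_root_level */
        role.direct = true;
        role.gpte_is_8_bytes = true;
        role.access = ACC_ALL;

        return role;
}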

Paolo

2020-10-22 08:01:31

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

On Wed, Oct 21, 2020 at 08:00:47PM +0200, Paolo Bonzini wrote:
> On 21/10/20 19:24, Yu Zhang wrote:
> > On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
> >> On 21/10/20 17:02, Yu Zhang wrote:
> >>>> void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >>>> {
> >>>> + gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> >>>> +
> >>> boot_cpu_data.x86_phys_bits is the host address width. Value of the guest's
> >>> may vary. So maybe we should just traverse the memslots and zap the gfn ranges
> >>> in each of them?
> >>>
> >>
> >> It must be smaller than the host value for two-dimensional paging, though.
> >
> > Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
> > overkilling? E.g. for a host with 46 bits and a guest with 39 bits width?
>
> It would go quickly through extra memory space because the PML4E entries
> above the first would be empty. So it's just 511 comparisons.
>

Oh, yes. The overhead doesn't seem as big as I assumed. :)

Yu
> Paolo
>

2020-10-22 12:04:15

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] kvm: x86/mmu: Allocate and free TDP MMU roots

On Wed, Oct 21, 2020 at 8:09 AM Yu Zhang <[email protected]> wrote:
>
> On Wed, Oct 14, 2020 at 11:26:44AM -0700, Ben Gardon wrote:
> > The TDP MMU must be able to allocate paging structure root pages and track
> > the usage of those pages. Implement a similar, but separate system for root
> > page allocation to that of the x86 shadow paging implementation. When
> > future patches add synchronization model changes to allow for parallel
> > page faults, these pages will need to be handled differently from the
> > x86 shadow paging based MMU's root pages.
> >
> > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > machine. This series introduced no new failures.
> >
> > This series can be viewed in Gerrit at:
> > https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> >
> > Signed-off-by: Ben Gardon <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 29 +++++---
> > arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++
> > arch/x86/kvm/mmu/tdp_mmu.c | 114 ++++++++++++++++++++++++++++++++
> > arch/x86/kvm/mmu/tdp_mmu.h | 5 ++
> > 5 files changed, 162 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6b6dbc20ce23a..e0ec1dd271a32 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -989,6 +989,7 @@ struct kvm_arch {
> > * operations.
> > */
> > bool tdp_mmu_enabled;
> > + struct list_head tdp_mmu_roots;
> > };
> >
> > struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f53d29e09367c..a3340ed59ad1d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -144,11 +144,6 @@ module_param(dbg, bool, 0644);
> > #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
> > | shadow_x_mask | shadow_nx_mask | shadow_me_mask)
> >
> > -#define ACC_EXEC_MASK 1
> > -#define ACC_WRITE_MASK PT_WRITABLE_MASK
> > -#define ACC_USER_MASK PT_USER_MASK
> > -#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> > -
> > /* The mask for the R/X bits in EPT PTEs */
> > #define PT64_EPT_READABLE_MASK 0x1ull
> > #define PT64_EPT_EXECUTABLE_MASK 0x4ull
> > @@ -209,7 +204,7 @@ struct kvm_shadow_walk_iterator {
> > __shadow_walk_next(&(_walker), spte))
> >
> > static struct kmem_cache *pte_list_desc_cache;
> > -static struct kmem_cache *mmu_page_header_cache;
> > +struct kmem_cache *mmu_page_header_cache;
> > static struct percpu_counter kvm_total_used_mmu_pages;
> >
> > static u64 __read_mostly shadow_nx_mask;
> > @@ -3588,9 +3583,13 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> > return;
> >
> > sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> > - --sp->root_count;
> > - if (!sp->root_count && sp->role.invalid)
> > - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> > +
> > + if (kvm_mmu_put_root(sp)) {
> > + if (sp->tdp_mmu_page)
> > + kvm_tdp_mmu_free_root(kvm, sp);
> > + else if (sp->role.invalid)
> > + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> > + }
> >
> > *root_hpa = INVALID_PAGE;
> > }
> > @@ -3680,8 +3679,16 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > hpa_t root;
> > unsigned i;
> >
> > - if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > + if (vcpu->kvm->arch.tdp_mmu_enabled) {
> > + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +
> > + if (!VALID_PAGE(root))
> > + return -ENOSPC;
> > + vcpu->arch.mmu->root_hpa = root;
> > + } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > + root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
> > + true);
> > +
> > if (!VALID_PAGE(root))
> > return -ENOSPC;
> > vcpu->arch.mmu->root_hpa = root;
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 74ccbf001a42e..6cedf578c9a8d 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -43,8 +43,12 @@ struct kvm_mmu_page {
> >
> > /* Number of writes since the last time traversal visited this page. */
> > atomic_t write_flooding_count;
> > +
> > + bool tdp_mmu_page;
> > };
> >
> > +extern struct kmem_cache *mmu_page_header_cache;
> > +
> > static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> > {
> > struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> > @@ -96,6 +100,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> > (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > * PT64_LEVEL_BITS))) - 1))
> >
> > +#define ACC_EXEC_MASK 1
> > +#define ACC_WRITE_MASK PT_WRITABLE_MASK
> > +#define ACC_USER_MASK PT_USER_MASK
> > +#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> > +
> > /* Functions for interpreting SPTEs */
> > static inline bool is_mmio_spte(u64 spte)
> > {
> > @@ -126,4 +135,19 @@ static inline kvm_pfn_t spte_to_pfn(u64 pte)
> > return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > }
> >
> > +static inline void kvm_mmu_get_root(struct kvm_mmu_page *sp)
> > +{
> > + BUG_ON(!sp->root_count);
> > +
> > + ++sp->root_count;
> > +}
> > +
> > +static inline bool kvm_mmu_put_root(struct kvm_mmu_page *sp)
> > +{
> > + --sp->root_count;
> > +
> > + return !sp->root_count;
> > +}
> > +
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index b3809835e90b1..09a84a6e157b6 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1,5 +1,7 @@
> > // SPDX-License-Identifier: GPL-2.0
> >
> > +#include "mmu.h"
> > +#include "mmu_internal.h"
> > #include "tdp_mmu.h"
> >
> > static bool __read_mostly tdp_mmu_enabled = false;
> > @@ -29,10 +31,122 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> >
> > /* This should not be changed for the lifetime of the VM. */
> > kvm->arch.tdp_mmu_enabled = true;
> > +
> > + INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> > }
> >
> > void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > {
> > if (!kvm->arch.tdp_mmu_enabled)
> > return;
> > +
> > + WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> > +}
> > +
> > +#define for_each_tdp_mmu_root(_kvm, _root) \
> > + list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
> > +
> > +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> > +{
> > + struct kvm_mmu_page *sp;
> > +
> > + sp = to_shadow_page(hpa);
> > +
> > + return sp->tdp_mmu_page && sp->root_count;
> > +}
> > +
> > +void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + lockdep_assert_held(&kvm->mmu_lock);
> > +
> > + WARN_ON(root->root_count);
> > + WARN_ON(!root->tdp_mmu_page);
> > +
> > + list_del(&root->link);
> > +
> > + free_page((unsigned long)root->spt);
> > + kmem_cache_free(mmu_page_header_cache, root);
> > +}
> > +
> > +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + if (kvm_mmu_put_root(root))
> > + kvm_tdp_mmu_free_root(kvm, root);
> > +}
> > +
> > +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > + lockdep_assert_held(&kvm->mmu_lock);
> > +
> > + kvm_mmu_get_root(root);
> > +}
> > +
> > +static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> > + int level)
> > +{
> > + union kvm_mmu_page_role role;
> > +
> > + role = vcpu->arch.mmu->mmu_role.base;
> > + role.level = vcpu->arch.mmu->shadow_root_level;
>
> role.level = level;
> The role will be calculated for non root pages later.

Thank you for catching that Yu, that was definitely an error!
I'm guessing this never showed up in my testing because I don't think
the TDP MMU actually uses role.level for anything other than root
pages.

>
> > + role.direct = true;
> > + role.gpte_is_8_bytes = true;
> > + role.access = ACC_ALL;
> > +
> > + return role;
> > +}
> > +
> > +static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > + int level)
> > +{
> > + struct kvm_mmu_page *sp;
> > +
> > + sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > + sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > + set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > +
> > + sp->role.word = page_role_for_level(vcpu, level).word;
> > + sp->gfn = gfn;
> > + sp->tdp_mmu_page = true;
> > +
> > + return sp;
> > +}
> > +
> > +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
> > +{
> > + union kvm_mmu_page_role role;
> > + struct kvm *kvm = vcpu->kvm;
> > + struct kvm_mmu_page *root;
> > +
> > + role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> > +
> > + spin_lock(&kvm->mmu_lock);
> > +
> > + /* Check for an existing root before allocating a new one. */
> > + for_each_tdp_mmu_root(kvm, root) {
> > + if (root->role.word == role.word) {
> > + get_tdp_mmu_root(kvm, root);
> > + spin_unlock(&kvm->mmu_lock);
> > + return root;
> > + }
> > + }
> > +
> > + root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > + root->root_count = 1;
> > +
> > + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> > +
> > + spin_unlock(&kvm->mmu_lock);
> > +
> > + return root;
> > +}
> > +
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_mmu_page *root;
> > +
> > + root = get_tdp_mmu_vcpu_root(vcpu);
> > + if (!root)
> > + return INVALID_PAGE;
> > +
> > + return __pa(root->spt);
> > }
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index cd4a562a70e9a..ac0ef91294420 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -7,4 +7,9 @@
> >
> > void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> > +
> > +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > +void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
> > +
> > #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.1011.ga647a8990f-goog
> >
>
> Thanks
> Yu

2020-10-22 12:10:09

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] kvm: x86/mmu: Introduce tdp_iter

On Wed, Oct 21, 2020 at 7:59 AM Yu Zhang <[email protected]> wrote:
>
> On Wed, Oct 14, 2020 at 11:26:42AM -0700, Ben Gardon wrote:
> > The TDP iterator implements a pre-order traversal of a TDP paging
> > structure. This iterator will be used in future patches to create
> > an efficient implementation of the KVM MMU for the TDP case.
> >
> > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > machine. This series introduced no new failures.
> >
> > This series can be viewed in Gerrit at:
> > https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> >
> > Signed-off-by: Ben Gardon <[email protected]>
> > ---
> > arch/x86/kvm/Makefile | 3 +-
> > arch/x86/kvm/mmu/mmu.c | 66 ------------
> > arch/x86/kvm/mmu/mmu_internal.h | 66 ++++++++++++
> > arch/x86/kvm/mmu/tdp_iter.c | 176 ++++++++++++++++++++++++++++++++
> > arch/x86/kvm/mmu/tdp_iter.h | 56 ++++++++++
> > 5 files changed, 300 insertions(+), 67 deletions(-)
> > create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> > create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> >
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index 7f86a14aed0e9..4525c1151bf99 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >
> > kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> > i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> > - hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
> > + hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
> > + mmu/tdp_iter.o
> >
> > kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> > vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6c9db349600c8..6d82784ed5679 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -121,28 +121,6 @@ module_param(dbg, bool, 0644);
> >
> > #define PTE_PREFETCH_NUM 8
> >
> > -#define PT_FIRST_AVAIL_BITS_SHIFT 10
> > -#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> > -
> > -/*
> > - * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > - * Access Tracking SPTEs.
> > - */
> > -#define SPTE_SPECIAL_MASK (3ULL << 52)
> > -#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> > -#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> > -#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> > -#define SPTE_MMIO_MASK (3ULL << 52)
> > -
> > -#define PT64_LEVEL_BITS 9
> > -
> > -#define PT64_LEVEL_SHIFT(level) \
> > - (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> > -
> > -#define PT64_INDEX(address, level)\
> > - (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> > -
> > -
> > #define PT32_LEVEL_BITS 10
> >
> > #define PT32_LEVEL_SHIFT(level) \
> > @@ -155,19 +133,6 @@ module_param(dbg, bool, 0644);
> > #define PT32_INDEX(address, level)\
> > (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
> >
> > -
> > -#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> > -#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> > -#else
> > -#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> > -#endif
> > -#define PT64_LVL_ADDR_MASK(level) \
> > - (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > - * PT64_LEVEL_BITS))) - 1))
> > -#define PT64_LVL_OFFSET_MASK(level) \
> > - (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > - * PT64_LEVEL_BITS))) - 1))
> > -
> > #define PT32_BASE_ADDR_MASK PAGE_MASK
> > #define PT32_DIR_BASE_ADDR_MASK \
> > (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
> > @@ -192,8 +157,6 @@ module_param(dbg, bool, 0644);
> > #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
> > #define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
> >
> > -#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > -
> > /* make pte_list_desc fit well in cache line */
> > #define PTE_LIST_EXT 3
> >
> > @@ -349,11 +312,6 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
> > }
> > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> >
> > -static bool is_mmio_spte(u64 spte)
> > -{
> > - return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> > -}
> > -
> > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> > {
> > return sp->role.ad_disabled;
> > @@ -626,35 +584,11 @@ static int is_nx(struct kvm_vcpu *vcpu)
> > return vcpu->arch.efer & EFER_NX;
> > }
> >
> > -static int is_shadow_present_pte(u64 pte)
> > -{
> > - return (pte != 0) && !is_mmio_spte(pte);
> > -}
> > -
> > -static int is_large_pte(u64 pte)
> > -{
> > - return pte & PT_PAGE_SIZE_MASK;
> > -}
> > -
> > -static int is_last_spte(u64 pte, int level)
> > -{
> > - if (level == PG_LEVEL_4K)
> > - return 1;
> > - if (is_large_pte(pte))
> > - return 1;
> > - return 0;
> > -}
> > -
> > static bool is_executable_pte(u64 spte)
> > {
> > return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
> > }
> >
> > -static kvm_pfn_t spte_to_pfn(u64 pte)
> > -{
> > - return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > -}
> > -
> > static gfn_t pse36_gfn_delta(u32 gpte)
> > {
> > int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 3acf3b8eb469d..74ccbf001a42e 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -2,6 +2,8 @@
> > #ifndef __KVM_X86_MMU_INTERNAL_H
> > #define __KVM_X86_MMU_INTERNAL_H
> >
> > +#include "mmu.h"
> > +
> > #include <linux/types.h>
> >
> > #include <asm/kvm_host.h>
> > @@ -60,4 +62,68 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
> > bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> > struct kvm_memory_slot *slot, u64 gfn);
> >
> > +#define PT64_LEVEL_BITS 9
> > +
> > +#define PT64_LEVEL_SHIFT(level) \
> > + (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> > +
> > +#define PT64_INDEX(address, level)\
> > + (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> > +#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > +
> > +#define PT_FIRST_AVAIL_BITS_SHIFT 10
> > +#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> > +
> > +/*
> > + * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > + * Access Tracking SPTEs.
> > + */
> > +#define SPTE_SPECIAL_MASK (3ULL << 52)
> > +#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> > +#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> > +#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> > +#define SPTE_MMIO_MASK (3ULL << 52)
> > +
> > +#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> > +#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> > +#else
> > +#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> > +#endif
> > +#define PT64_LVL_ADDR_MASK(level) \
> > + (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > + * PT64_LEVEL_BITS))) - 1))
> > +#define PT64_LVL_OFFSET_MASK(level) \
> > + (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > + * PT64_LEVEL_BITS))) - 1))
> > +
> > +/* Functions for interpreting SPTEs */
> > +static inline bool is_mmio_spte(u64 spte)
> > +{
> > + return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> > +}
> > +
> > +static inline int is_shadow_present_pte(u64 pte)
> > +{
> > + return (pte != 0) && !is_mmio_spte(pte);
> > +}
> > +
> > +static inline int is_large_pte(u64 pte)
> > +{
> > + return pte & PT_PAGE_SIZE_MASK;
> > +}
> > +
> > +static inline int is_last_spte(u64 pte, int level)
> > +{
> > + if (level == PG_LEVEL_4K)
> > + return 1;
> > + if (is_large_pte(pte))
> > + return 1;
> > + return 0;
> > +}
> > +
> > +static inline kvm_pfn_t spte_to_pfn(u64 pte)
> > +{
> > + return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > +}
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> > new file mode 100644
> > index 0000000000000..b07e9f0c5d4aa
> > --- /dev/null
> > +++ b/arch/x86/kvm/mmu/tdp_iter.c
> > @@ -0,0 +1,176 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include "mmu_internal.h"
> > +#include "tdp_iter.h"
> > +
> > +/*
> > + * Recalculates the pointer to the SPTE for the current GFN and level and
> > + * reread the SPTE.
> > + */
> > +static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> > +{
> > + iter->sptep = iter->pt_path[iter->level - 1] +
> > + SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> > + iter->old_spte = READ_ONCE(*iter->sptep);
> > +}
> > +
> > +static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> > +{
> > + return gfn - (gfn % KVM_PAGES_PER_HPAGE(level));
>
> Instead of the modulo operator, how about we use:
> return gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
> here? :)
>
> > +}
> > +
> > +/*
> > + * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> > + * rooted at root_pt, starting with the walk to translate goal_gfn.
> > + */
> > +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> > + int min_level, gfn_t goal_gfn)
> > +{
> > + WARN_ON(root_level < 1);
> > + WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
> > +
> > + iter->goal_gfn = goal_gfn;
> > + iter->root_level = root_level;
> > + iter->min_level = min_level;
> > + iter->level = root_level;
> > + iter->pt_path[iter->level - 1] = root_pt;
> > +
> > + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> > + tdp_iter_refresh_sptep(iter);
> > +
> > + iter->valid = true;
> > +}
> > +
> > +/*
> > + * Given an SPTE and its level, returns a pointer containing the host virtual
> > + * address of the child page table referenced by the SPTE. Returns null if
> > + * there is no such entry.
> > + */
> > +u64 *spte_to_child_pt(u64 spte, int level)
> > +{
> > + /*
> > + * There's no child entry if this entry isn't present or is a
> > + * last-level entry.
> > + */
> > + if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
> > + return NULL;
> > +
> > + return __va(spte_to_pfn(spte) << PAGE_SHIFT);
> > +}
> > +
> > +/*
> > + * Steps down one level in the paging structure towards the goal GFN. Returns
> > + * true if the iterator was able to step down a level, false otherwise.
> > + */
> > +static bool try_step_down(struct tdp_iter *iter)
> > +{
> > + u64 *child_pt;
> > +
> > + if (iter->level == iter->min_level)
> > + return false;
> > +
> > + /*
> > + * Reread the SPTE before stepping down to avoid traversing into page
> > + * tables that are no longer linked from this entry.
> > + */
> > + iter->old_spte = READ_ONCE(*iter->sptep);
> > +
> > + child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> > + if (!child_pt)
> > + return false;
> > +
> > + iter->level--;
> > + iter->pt_path[iter->level - 1] = child_pt;
> > + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> > + tdp_iter_refresh_sptep(iter);
> > +
> > + return true;
> > +}
> > +
> > +/*
> > + * Steps to the next entry in the current page table, at the current page table
> > + * level. The next entry could point to a page backing guest memory or another
> > + * page table, or it could be non-present. Returns true if the iterator was
> > + * able to step to the next entry in the page table, false if the iterator was
> > + * already at the end of the current page table.
> > + */
> > +static bool try_step_side(struct tdp_iter *iter)
> > +{
> > + /*
> > + * Check if the iterator is already at the end of the current page
> > + * table.
> > + */
> > + if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
> > + KVM_PAGES_PER_HPAGE(iter->level + 1)))
> > + return false;
> > +
>
> And maybe:
> if (SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
> (PT64_ENT_PER_PAGE - 1))
> here?
>
> > + iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> > + iter->goal_gfn = iter->gfn;
> > + iter->sptep++;
> > + iter->old_spte = READ_ONCE(*iter->sptep);
> > +
> > + return true;
> > +}
> > +
> > +/*
> > + * Tries to traverse back up a level in the paging structure so that the walk
> > + * can continue from the next entry in the parent page table. Returns true on a
> > + * successful step up, false if already in the root page.
> > + */
> > +static bool try_step_up(struct tdp_iter *iter)
> > +{
> > + if (iter->level == iter->root_level)
> > + return false;
> > +
> > + iter->level++;
> > + iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
> > + tdp_iter_refresh_sptep(iter);
> > +
> > + return true;
> > +}
> > +
> > +/*
> > + * Step to the next SPTE in a pre-order traversal of the paging structure.
> > + * To get to the next SPTE, the iterator either steps down towards the goal
> > + * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
> > + * higher GFN.
> > + *
> > + * The basic algorithm is as follows:
> > + * 1. If the current SPTE is a non-last-level SPTE, step down into the page
> > + * table it points to.
> > + * 2. If the iterator cannot step down, it will try to step to the next SPTE
> > + * in the current page of the paging structure.
> > + * 3. If the iterator cannot step to the next entry in the current page, it will
> > + * try to step up to the parent paging structure page. In this case, that
> > + * SPTE will have already been visited, and so the iterator must also step
> > + * to the side again.
> > + */
> > +void tdp_iter_next(struct tdp_iter *iter)
> > +{
> > + if (try_step_down(iter))
> > + return;
> > +
> > + do {
> > + if (try_step_side(iter))
> > + return;
> > + } while (try_step_up(iter));
> > + iter->valid = false;
> > +}
> > +
> > +/*
> > + * Restart the walk over the paging structure from the root, starting from the
> > + * highest gfn the iterator had previously reached. Assumes that the entire
> > + * paging structure, except the root page, may have been completely torn down
> > + * and rebuilt.
> > + */
> > +void tdp_iter_refresh_walk(struct tdp_iter *iter)
> > +{
> > + gfn_t goal_gfn = iter->goal_gfn;
> > +
> > + if (iter->gfn > goal_gfn)
> > + goal_gfn = iter->gfn;
> > +
> > + tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
> > + iter->root_level, iter->min_level, goal_gfn);
> > +}
> > +
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > new file mode 100644
> > index 0000000000000..d629a53e1b73f
> > --- /dev/null
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -0,0 +1,56 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#ifndef __KVM_X86_MMU_TDP_ITER_H
> > +#define __KVM_X86_MMU_TDP_ITER_H
> > +
> > +#include <linux/kvm_host.h>
> > +
> > +#include "mmu.h"
> > +
> > +/*
> > + * A TDP iterator performs a pre-order walk over a TDP paging structure.
> > + */
> > +struct tdp_iter {
> > + /*
> > + * The iterator will traverse the paging structure towards the mapping
> > + * for this GFN.
> > + */
> > + gfn_t goal_gfn;
> > + /* Pointers to the page tables traversed to reach the current SPTE */
> > + u64 *pt_path[PT64_ROOT_MAX_LEVEL];
> > + /* A pointer to the current SPTE */
> > + u64 *sptep;
> > + /* The lowest GFN mapped by the current SPTE */
> > + gfn_t gfn;
> > + /* The level of the root page given to the iterator */
> > + int root_level;
> > + /* The lowest level the iterator should traverse to */
> > + int min_level;
> > + /* The iterator's current level within the paging structure */
> > + int level;
> > + /* A snapshot of the value at sptep */
> > + u64 old_spte;
> > + /*
> > + * Whether the iterator has a valid state. This will be false if the
> > + * iterator walks off the end of the paging structure.
> > + */
> > + bool valid;
> > +};
> > +
> > +/*
> > + * Iterates over every SPTE mapping the GFN range [start, end) in a
> > + * preorder traversal.
> > + */
> > +#define for_each_tdp_pte(iter, root, root_level, start, end) \
> > + for (tdp_iter_start(&iter, root, root_level, PG_LEVEL_4K, start); \
> > + iter.valid && iter.gfn < end; \
> > + tdp_iter_next(&iter))
> > +
> > +u64 *spte_to_child_pt(u64 pte, int level);
> > +
> > +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> > + int min_level, gfn_t goal_gfn);
> > +void tdp_iter_next(struct tdp_iter *iter);
> > +void tdp_iter_refresh_walk(struct tdp_iter *iter);
> > +
> > +#endif /* __KVM_X86_MMU_TDP_ITER_H */
> > --
> > 2.28.0.1011.ga647a8990f-goog
> >
>
> I am just suggesting to replace the modulo operations with bit-shifts...
> Also, it's very exciting to see such patch set. Thanks!

Those are great suggestions, thank you. I wonder if the compiler would
have been smart enough to convert those to bit shifts. (I kind of
doubt it)
I'm glad you're excited about this patch set, thank you for taking a
look and helping to review it!
It doesn't sound like I'll be sending out a v3 of this set, but I'll
be sure to amend the code with your suggestions if they don't make it
into the version Paolo queues.
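
For anyone following along, here is a quick standalone sketch (not part of
the series; pages_per_hpage(), pt_index() and ENT_PER_PAGE below are local
stand-ins for the kernel's KVM_PAGES_PER_HPAGE(), SHADOW_PT_INDEX() and
PT64_ENT_PER_PAGE) that checks the suggested mask and last-index forms
against the modulo forms used in the patch:

/*
 * Standalone sketch: verifies that the mask and last-index forms suggested
 * in the review are equivalent to the modulo forms in the patch. gfn_t,
 * pages_per_hpage(), pt_index() and ENT_PER_PAGE are local stand-ins for
 * the kernel's gfn_t, KVM_PAGES_PER_HPAGE(), SHADOW_PT_INDEX() and
 * PT64_ENT_PER_PAGE.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gfn_t;

#define ENT_PER_PAGE 512

/* 4K/2M/1G/512G levels map to 1, 512, 512^2, 512^3 pages per huge page. */
static gfn_t pages_per_hpage(int level)
{
        return 1ULL << ((level - 1) * 9);
}

/* Rounding as written in the patch. */
static gfn_t round_gfn_modulo(gfn_t gfn, int level)
{
        return gfn - (gfn % pages_per_hpage(level));
}

/* Rounding with the suggested mask. */
static gfn_t round_gfn_mask(gfn_t gfn, int level)
{
        return gfn & ~(pages_per_hpage(level) - 1);
}

/* Index of the entry mapping gfn within its level's page table. */
static unsigned int pt_index(gfn_t gfn, int level)
{
        return (gfn >> ((level - 1) * 9)) & (ENT_PER_PAGE - 1);
}

int main(void)
{
        for (int level = 1; level <= 4; level++) {
                for (gfn_t gfn = 0; gfn < (1ULL << 30); gfn += 12345) {
                        gfn_t aligned = round_gfn_modulo(gfn, level);

                        /* Rounding: modulo vs. mask. */
                        assert(aligned == round_gfn_mask(gfn, level));

                        /* End-of-table check in try_step_side() vs. the
                         * suggested comparison against the last index. */
                        int end_modulo = !((aligned + pages_per_hpage(level)) %
                                           pages_per_hpage(level + 1));
                        int end_index = pt_index(aligned, level) ==
                                        ENT_PER_PAGE - 1;
                        assert(end_modulo == end_index);
                }
        }
        printf("modulo, mask and index forms agree\n");
        return 0;
}

Whether a particular compiler already turns the modulo by
KVM_PAGES_PER_HPAGE(level) into a mask can be checked by comparing the
generated code for the two round_gfn_* variants, for example with
objdump -d.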

>
> B.R.
> Yu
>

2020-10-22 13:26:57

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] kvm: x86/mmu: Introduce tdp_iter

On Wed, Oct 21, 2020 at 11:08:52AM -0700, Ben Gardon wrote:
> On Wed, Oct 21, 2020 at 7:59 AM Yu Zhang <[email protected]> wrote:
> >
> > On Wed, Oct 14, 2020 at 11:26:42AM -0700, Ben Gardon wrote:
> > > The TDP iterator implements a pre-order traversal of a TDP paging
> > > structure. This iterator will be used in future patches to create
> > > an efficient implementation of the KVM MMU for the TDP case.
> > >
> > > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > > machine. This series introduced no new failures.
> > >
> > > This series can be viewed in Gerrit at:
> > > https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> > >
> > > Signed-off-by: Ben Gardon <[email protected]>
> > > ---
> > > arch/x86/kvm/Makefile | 3 +-
> > > arch/x86/kvm/mmu/mmu.c | 66 ------------
> > > arch/x86/kvm/mmu/mmu_internal.h | 66 ++++++++++++
> > > arch/x86/kvm/mmu/tdp_iter.c | 176 ++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/mmu/tdp_iter.h | 56 ++++++++++
> > > 5 files changed, 300 insertions(+), 67 deletions(-)
> > > create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
> > > create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> > >
> > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > index 7f86a14aed0e9..4525c1151bf99 100644
> > > --- a/arch/x86/kvm/Makefile
> > > +++ b/arch/x86/kvm/Makefile
> > > @@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> > >
> > > kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> > > i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> > > - hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
> > > + hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
> > > + mmu/tdp_iter.o
> > >
> > > kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> > > vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6c9db349600c8..6d82784ed5679 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -121,28 +121,6 @@ module_param(dbg, bool, 0644);
> > >
> > > #define PTE_PREFETCH_NUM 8
> > >
> > > -#define PT_FIRST_AVAIL_BITS_SHIFT 10
> > > -#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> > > -
> > > -/*
> > > - * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > > - * Access Tracking SPTEs.
> > > - */
> > > -#define SPTE_SPECIAL_MASK (3ULL << 52)
> > > -#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> > > -#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> > > -#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> > > -#define SPTE_MMIO_MASK (3ULL << 52)
> > > -
> > > -#define PT64_LEVEL_BITS 9
> > > -
> > > -#define PT64_LEVEL_SHIFT(level) \
> > > - (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> > > -
> > > -#define PT64_INDEX(address, level)\
> > > - (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> > > -
> > > -
> > > #define PT32_LEVEL_BITS 10
> > >
> > > #define PT32_LEVEL_SHIFT(level) \
> > > @@ -155,19 +133,6 @@ module_param(dbg, bool, 0644);
> > > #define PT32_INDEX(address, level)\
> > > (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
> > >
> > > -
> > > -#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> > > -#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> > > -#else
> > > -#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> > > -#endif
> > > -#define PT64_LVL_ADDR_MASK(level) \
> > > - (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > > - * PT64_LEVEL_BITS))) - 1))
> > > -#define PT64_LVL_OFFSET_MASK(level) \
> > > - (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > > - * PT64_LEVEL_BITS))) - 1))
> > > -
> > > #define PT32_BASE_ADDR_MASK PAGE_MASK
> > > #define PT32_DIR_BASE_ADDR_MASK \
> > > (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
> > > @@ -192,8 +157,6 @@ module_param(dbg, bool, 0644);
> > > #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
> > > #define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
> > >
> > > -#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > > -
> > > /* make pte_list_desc fit well in cache line */
> > > #define PTE_LIST_EXT 3
> > >
> > > @@ -349,11 +312,6 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> > >
> > > -static bool is_mmio_spte(u64 spte)
> > > -{
> > > - return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> > > -}
> > > -
> > > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> > > {
> > > return sp->role.ad_disabled;
> > > @@ -626,35 +584,11 @@ static int is_nx(struct kvm_vcpu *vcpu)
> > > return vcpu->arch.efer & EFER_NX;
> > > }
> > >
> > > -static int is_shadow_present_pte(u64 pte)
> > > -{
> > > - return (pte != 0) && !is_mmio_spte(pte);
> > > -}
> > > -
> > > -static int is_large_pte(u64 pte)
> > > -{
> > > - return pte & PT_PAGE_SIZE_MASK;
> > > -}
> > > -
> > > -static int is_last_spte(u64 pte, int level)
> > > -{
> > > - if (level == PG_LEVEL_4K)
> > > - return 1;
> > > - if (is_large_pte(pte))
> > > - return 1;
> > > - return 0;
> > > -}
> > > -
> > > static bool is_executable_pte(u64 spte)
> > > {
> > > return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
> > > }
> > >
> > > -static kvm_pfn_t spte_to_pfn(u64 pte)
> > > -{
> > > - return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > > -}
> > > -
> > > static gfn_t pse36_gfn_delta(u32 gpte)
> > > {
> > > int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 3acf3b8eb469d..74ccbf001a42e 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -2,6 +2,8 @@
> > > #ifndef __KVM_X86_MMU_INTERNAL_H
> > > #define __KVM_X86_MMU_INTERNAL_H
> > >
> > > +#include "mmu.h"
> > > +
> > > #include <linux/types.h>
> > >
> > > #include <asm/kvm_host.h>
> > > @@ -60,4 +62,68 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
> > > bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> > > struct kvm_memory_slot *slot, u64 gfn);
> > >
> > > +#define PT64_LEVEL_BITS 9
> > > +
> > > +#define PT64_LEVEL_SHIFT(level) \
> > > + (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> > > +
> > > +#define PT64_INDEX(address, level)\
> > > + (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> > > +#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > > +
> > > +#define PT_FIRST_AVAIL_BITS_SHIFT 10
> > > +#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> > > +
> > > +/*
> > > + * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > > + * Access Tracking SPTEs.
> > > + */
> > > +#define SPTE_SPECIAL_MASK (3ULL << 52)
> > > +#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> > > +#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> > > +#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
> > > +#define SPTE_MMIO_MASK (3ULL << 52)
> > > +
> > > +#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> > > +#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
> > > +#else
> > > +#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> > > +#endif
> > > +#define PT64_LVL_ADDR_MASK(level) \
> > > + (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > > + * PT64_LEVEL_BITS))) - 1))
> > > +#define PT64_LVL_OFFSET_MASK(level) \
> > > + (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
> > > + * PT64_LEVEL_BITS))) - 1))
> > > +
> > > +/* Functions for interpreting SPTEs */
> > > +static inline bool is_mmio_spte(u64 spte)
> > > +{
> > > + return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> > > +}
> > > +
> > > +static inline int is_shadow_present_pte(u64 pte)
> > > +{
> > > + return (pte != 0) && !is_mmio_spte(pte);
> > > +}
> > > +
> > > +static inline int is_large_pte(u64 pte)
> > > +{
> > > + return pte & PT_PAGE_SIZE_MASK;
> > > +}
> > > +
> > > +static inline int is_last_spte(u64 pte, int level)
> > > +{
> > > + if (level == PG_LEVEL_4K)
> > > + return 1;
> > > + if (is_large_pte(pte))
> > > + return 1;
> > > + return 0;
> > > +}
> > > +
> > > +static inline kvm_pfn_t spte_to_pfn(u64 pte)
> > > +{
> > > + return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > > +}
> > > +
> > > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> > > new file mode 100644
> > > index 0000000000000..b07e9f0c5d4aa
> > > --- /dev/null
> > > +++ b/arch/x86/kvm/mmu/tdp_iter.c
> > > @@ -0,0 +1,176 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include "mmu_internal.h"
> > > +#include "tdp_iter.h"
> > > +
> > > +/*
> > > + * Recalculates the pointer to the SPTE for the current GFN and level and
> > > + * rereads the SPTE.
> > > + */
> > > +static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> > > +{
> > > + iter->sptep = iter->pt_path[iter->level - 1] +
> > > + SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> > > + iter->old_spte = READ_ONCE(*iter->sptep);
> > > +}
> > > +
> > > +static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> > > +{
> > > + return gfn - (gfn % KVM_PAGES_PER_HPAGE(level));
> >
> > Instead of the modulo operator, how about we use:
> > return gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
> > here? :)
> >
> > > +}
> > > +
> > > +/*
> > > + * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> > > + * rooted at root_pt, starting with the walk to translate goal_gfn.
> > > + */
> > > +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> > > + int min_level, gfn_t goal_gfn)
> > > +{
> > > + WARN_ON(root_level < 1);
> > > + WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
> > > +
> > > + iter->goal_gfn = goal_gfn;
> > > + iter->root_level = root_level;
> > > + iter->min_level = min_level;
> > > + iter->level = root_level;
> > > + iter->pt_path[iter->level - 1] = root_pt;
> > > +
> > > + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> > > + tdp_iter_refresh_sptep(iter);
> > > +
> > > + iter->valid = true;
> > > +}
> > > +
> > > +/*
> > > + * Given an SPTE and its level, returns a pointer containing the host virtual
> > > + * address of the child page table referenced by the SPTE. Returns null if
> > > + * there is no such entry.
> > > + */
> > > +u64 *spte_to_child_pt(u64 spte, int level)
> > > +{
> > > + /*
> > > + * There's no child entry if this entry isn't present or is a
> > > + * last-level entry.
> > > + */
> > > + if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
> > > + return NULL;
> > > +
> > > + return __va(spte_to_pfn(spte) << PAGE_SHIFT);
> > > +}
> > > +
> > > +/*
> > > + * Steps down one level in the paging structure towards the goal GFN. Returns
> > > + * true if the iterator was able to step down a level, false otherwise.
> > > + */
> > > +static bool try_step_down(struct tdp_iter *iter)
> > > +{
> > > + u64 *child_pt;
> > > +
> > > + if (iter->level == iter->min_level)
> > > + return false;
> > > +
> > > + /*
> > > + * Reread the SPTE before stepping down to avoid traversing into page
> > > + * tables that are no longer linked from this entry.
> > > + */
> > > + iter->old_spte = READ_ONCE(*iter->sptep);
> > > +
> > > + child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> > > + if (!child_pt)
> > > + return false;
> > > +
> > > + iter->level--;
> > > + iter->pt_path[iter->level - 1] = child_pt;
> > > + iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
> > > + tdp_iter_refresh_sptep(iter);
> > > +
> > > + return true;
> > > +}
> > > +
> > > +/*
> > > + * Steps to the next entry in the current page table, at the current page table
> > > + * level. The next entry could point to a page backing guest memory or another
> > > + * page table, or it could be non-present. Returns true if the iterator was
> > > + * able to step to the next entry in the page table, false if the iterator was
> > > + * already at the end of the current page table.
> > > + */
> > > +static bool try_step_side(struct tdp_iter *iter)
> > > +{
> > > + /*
> > > + * Check if the iterator is already at the end of the current page
> > > + * table.
> > > + */
> > > + if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
> > > + KVM_PAGES_PER_HPAGE(iter->level + 1)))
> > > + return false;
> > > +
> >
> > And maybe:
> > if (SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
> > (PT64_ENT_PER_PAGE - 1))
> > here?
> >
> > > + iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> > > + iter->goal_gfn = iter->gfn;
> > > + iter->sptep++;
> > > + iter->old_spte = READ_ONCE(*iter->sptep);
> > > +
> > > + return true;
> > > +}
> > > +
> > > +/*
> > > + * Tries to traverse back up a level in the paging structure so that the walk
> > > + * can continue from the next entry in the parent page table. Returns true on a
> > > + * successful step up, false if already in the root page.
> > > + */
> > > +static bool try_step_up(struct tdp_iter *iter)
> > > +{
> > > + if (iter->level == iter->root_level)
> > > + return false;
> > > +
> > > + iter->level++;
> > > + iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
> > > + tdp_iter_refresh_sptep(iter);
> > > +
> > > + return true;
> > > +}
> > > +
> > > +/*
> > > + * Step to the next SPTE in a pre-order traversal of the paging structure.
> > > + * To get to the next SPTE, the iterator either steps down towards the goal
> > > + * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
> > > + * higher GFN.
> > > + *
> > > + * The basic algorithm is as follows:
> > > + * 1. If the current SPTE is a non-last-level SPTE, step down into the page
> > > + * table it points to.
> > > + * 2. If the iterator cannot step down, it will try to step to the next SPTE
> > > + * in the current page of the paging structure.
> > > + * 3. If the iterator cannot step to the next entry in the current page, it will
> > > + * try to step up to the parent paging structure page. In this case, that
> > > + * SPTE will have already been visited, and so the iterator must also step
> > > + * to the side again.
> > > + */
> > > +void tdp_iter_next(struct tdp_iter *iter)
> > > +{
> > > + if (try_step_down(iter))
> > > + return;
> > > +
> > > + do {
> > > + if (try_step_side(iter))
> > > + return;
> > > + } while (try_step_up(iter));
> > > + iter->valid = false;
> > > +}
> > > +
> > > +/*
> > > + * Restart the walk over the paging structure from the root, starting from the
> > > + * highest gfn the iterator had previously reached. Assumes that the entire
> > > + * paging structure, except the root page, may have been completely torn down
> > > + * and rebuilt.
> > > + */
> > > +void tdp_iter_refresh_walk(struct tdp_iter *iter)
> > > +{
> > > + gfn_t goal_gfn = iter->goal_gfn;
> > > +
> > > + if (iter->gfn > goal_gfn)
> > > + goal_gfn = iter->gfn;
> > > +
> > > + tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
> > > + iter->root_level, iter->min_level, goal_gfn);
> > > +}
> > > +
> > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > > new file mode 100644
> > > index 0000000000000..d629a53e1b73f
> > > --- /dev/null
> > > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > > @@ -0,0 +1,56 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#ifndef __KVM_X86_MMU_TDP_ITER_H
> > > +#define __KVM_X86_MMU_TDP_ITER_H
> > > +
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include "mmu.h"
> > > +
> > > +/*
> > > + * A TDP iterator performs a pre-order walk over a TDP paging structure.
> > > + */
> > > +struct tdp_iter {
> > > + /*
> > > + * The iterator will traverse the paging structure towards the mapping
> > > + * for this GFN.
> > > + */
> > > + gfn_t goal_gfn;
> > > + /* Pointers to the page tables traversed to reach the current SPTE */
> > > + u64 *pt_path[PT64_ROOT_MAX_LEVEL];
> > > + /* A pointer to the current SPTE */
> > > + u64 *sptep;
> > > + /* The lowest GFN mapped by the current SPTE */
> > > + gfn_t gfn;
> > > + /* The level of the root page given to the iterator */
> > > + int root_level;
> > > + /* The lowest level the iterator should traverse to */
> > > + int min_level;
> > > + /* The iterator's current level within the paging structure */
> > > + int level;
> > > + /* A snapshot of the value at sptep */
> > > + u64 old_spte;
> > > + /*
> > > + * Whether the iterator has a valid state. This will be false if the
> > > + * iterator walks off the end of the paging structure.
> > > + */
> > > + bool valid;
> > > +};
> > > +
> > > +/*
> > > + * Iterates over every SPTE mapping the GFN range [start, end) in a
> > > + * preorder traversal.
> > > + */
> > > +#define for_each_tdp_pte(iter, root, root_level, start, end) \
> > > + for (tdp_iter_start(&iter, root, root_level, PG_LEVEL_4K, start); \
> > > + iter.valid && iter.gfn < end; \
> > > + tdp_iter_next(&iter))
> > > +
> > > +u64 *spte_to_child_pt(u64 pte, int level);
> > > +
> > > +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> > > + int min_level, gfn_t goal_gfn);
> > > +void tdp_iter_next(struct tdp_iter *iter);
> > > +void tdp_iter_refresh_walk(struct tdp_iter *iter);
> > > +
> > > +#endif /* __KVM_X86_MMU_TDP_ITER_H */
> > > --
> > > 2.28.0.1011.ga647a8990f-goog
> > >
> >
> > I am just suggesting to replace the modulo operations with bit-shifts...
> > Also, it's very exciting to see such patch set. Thanks!
>
> Those are great suggestions, thank you. I wonder if the compiler would
> have been smart enough to convert those to bit shifts. (I kind of
> doubt it)
> I'm glad you're excited about this patch set, thank you for taking a
> look and helping to review it!
> It doesn't sound like I'll be sending out a v3 of this set, but I'll
> be sure to amend the code with your suggestions if they don't make it
> into the version Paolo queues.

Great. And looking forward to the performance improvement series! :)

Yu

>
> >
> > B.R.
> > Yu
> >
>