This series enables KVM to save memory when using the TDP MMU by waiting
to allocate memslot rmaps until they are needed. To do this, KVM tracks
whether or not a shadow root has been allocated. In order to get away
with not allocating the rmaps, KVM must also be sure to skip operations
which iterate over the rmaps. If the TDP MMU is in use and we have not
allocated a shadow root, these operations would essentially be op-ops
anyway. Skipping the rmap operations has a secondary benefit of avoiding
acquiring the MMU lock in write mode in many cases, substantially
reducing MMU lock contention.
This series was tested on an Intel Skylake machine. With the TDP MMU off
and on, this introduced no new failures on kvm-unit-tests or KVM selftests.
Changelog:
v2:
Incorporated feedback from Paolo and Sean
Replaced the memslot_assignment_lock with slots_arch_lock, which
has a larger critical section.
v3:
Removed shadow_mmu_active as suggested by Sean
Removed everything except adding a return value to
kvm_mmu_init_tdp_mmu from patch 1 of v2
Added RCU protection and better memory ordering for installing the
memslot rmaps as suggested by Paolo
Reordered most of the patches
v4:
Renamed functions to allocate and free memslots based on feedback
from David.
Eliminated the goto in memslot_rmap_alloc, as David suggested.
Eliminated kvm_memslots_have_rmaps and updated comments on uses of
memslots_have_rmaps. Suggested by Paolo.
Changed the description on patch 7 to one Paolo suggested.
Collected Reviewed-by tags from David.
Dropped the patch to add RCU notations to rmap accesses.
Ben Gardon (7):
KVM: x86/mmu: Deduplicate rmap freeing
KVM: x86/mmu: Factor out allocating memslot rmap
KVM: mmu: Refactor memslot copy
KVM: mmu: Add slots_arch_lock for memslot arch fields
KVM: x86/mmu: Add a field to control memslot rmap allocation
KVM: x86/mmu: Skip rmap operations if rmaps not allocated
KVM: x86/mmu: Lazily allocate memslot rmaps
arch/x86/include/asm/kvm_host.h | 8 ++
arch/x86/kvm/mmu/mmu.c | 153 +++++++++++++++++++++-----------
arch/x86/kvm/mmu/tdp_mmu.c | 6 +-
arch/x86/kvm/mmu/tdp_mmu.h | 4 +-
arch/x86/kvm/x86.c | 108 ++++++++++++++++++----
include/linux/kvm_host.h | 9 ++
virt/kvm/kvm_main.c | 54 ++++++++---
7 files changed, 255 insertions(+), 87 deletions(-)
--
2.31.1.607.g51e8a6a459-goog
Small code deduplication. No functional change expected.
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/x86.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5bd550eaf683..1e1f4f31e586 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10887,17 +10887,23 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_hv_destroy_vm(kvm);
}
-void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+static void memslot_rmap_free(struct kvm_memory_slot *slot)
{
int i;
for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
kvfree(slot->arch.rmap[i]);
slot->arch.rmap[i] = NULL;
+ }
+}
- if (i == 0)
- continue;
+void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ int i;
+
+ memslot_rmap_free(slot);
+ for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
kvfree(slot->arch.lpage_info[i - 1]);
slot->arch.lpage_info[i - 1] = NULL;
}
@@ -10963,12 +10969,9 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
return 0;
out_free:
- for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
- kvfree(slot->arch.rmap[i]);
- slot->arch.rmap[i] = NULL;
- if (i == 0)
- continue;
+ memslot_rmap_free(slot);
+ for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
kvfree(slot->arch.lpage_info[i - 1]);
slot->arch.lpage_info[i - 1] = NULL;
}
--
2.31.1.607.g51e8a6a459-goog
If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves some time which would be spent on the rmap operation.
It also avoids acquiring the MMU lock in write mode for many operations.
This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 128 +++++++++++++++++++++++++----------------
1 file changed, 77 insertions(+), 51 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f059f2e8c6fe..b0bdb924d519 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1189,6 +1189,10 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, true);
+
+ if (!kvm->arch.memslots_have_rmaps)
+ return;
+
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PG_LEVEL_4K, slot);
@@ -1218,6 +1222,10 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, false);
+
+ if (!kvm->arch.memslots_have_rmaps)
+ return;
+
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PG_LEVEL_4K, slot);
@@ -1260,9 +1268,12 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
int i;
bool write_protected = false;
- for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
- rmap_head = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+ if (kvm->arch.memslots_have_rmaps) {
+ for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+ rmap_head = __gfn_to_rmap(gfn, i, slot);
+ write_protected |= __rmap_write_protect(kvm, rmap_head,
+ true);
+ }
}
if (is_tdp_mmu_enabled(kvm))
@@ -1433,9 +1444,10 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool flush;
+ bool flush = false;
- flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
+ if (kvm->arch.memslots_have_rmaps)
+ flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
if (is_tdp_mmu_enabled(kvm))
flush |= kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
@@ -1445,9 +1457,10 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool flush;
+ bool flush = false;
- flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmapp);
+ if (kvm->arch.memslots_have_rmaps)
+ flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmapp);
if (is_tdp_mmu_enabled(kvm))
flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
@@ -1500,9 +1513,10 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool young;
+ bool young = false;
- young = kvm_handle_gfn_range(kvm, range, kvm_age_rmapp);
+ if (kvm->arch.memslots_have_rmaps)
+ young = kvm_handle_gfn_range(kvm, range, kvm_age_rmapp);
if (is_tdp_mmu_enabled(kvm))
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1512,9 +1526,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- bool young;
+ bool young = false;
- young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmapp);
+ if (kvm->arch.memslots_have_rmaps)
+ young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmapp);
if (is_tdp_mmu_enabled(kvm))
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
@@ -5440,7 +5455,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm_reload_remote_mmus(kvm);
- kvm_zap_obsolete_pages(kvm);
+ if (kvm->arch.memslots_have_rmaps)
+ kvm_zap_obsolete_pages(kvm);
write_unlock(&kvm->mmu_lock);
@@ -5492,29 +5508,29 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
int i;
bool flush = false;
- write_lock(&kvm->mmu_lock);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(memslot, slots) {
- gfn_t start, end;
-
- start = max(gfn_start, memslot->base_gfn);
- end = min(gfn_end, memslot->base_gfn + memslot->npages);
- if (start >= end)
- continue;
+ if (kvm->arch.memslots_have_rmaps) {
+ write_lock(&kvm->mmu_lock);
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(memslot, slots) {
+ gfn_t start, end;
+
+ start = max(gfn_start, memslot->base_gfn);
+ end = min(gfn_end, memslot->base_gfn + memslot->npages);
+ if (start >= end)
+ continue;
- flush = slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
- PG_LEVEL_4K,
- KVM_MAX_HUGEPAGE_LEVEL,
- start, end - 1, true, flush);
+ flush = slot_handle_level_range(kvm, memslot,
+ kvm_zap_rmapp, PG_LEVEL_4K,
+ KVM_MAX_HUGEPAGE_LEVEL, start,
+ end - 1, true, flush);
+ }
}
+ if (flush)
+ kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
+ write_unlock(&kvm->mmu_lock);
}
- if (flush)
- kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
-
- write_unlock(&kvm->mmu_lock);
-
if (is_tdp_mmu_enabled(kvm)) {
flush = false;
@@ -5541,12 +5557,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
struct kvm_memory_slot *memslot,
int start_level)
{
- bool flush;
+ bool flush = false;
- write_lock(&kvm->mmu_lock);
- flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
- start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
- write_unlock(&kvm->mmu_lock);
+ if (kvm->arch.memslots_have_rmaps) {
+ write_lock(&kvm->mmu_lock);
+ flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
+ start_level, KVM_MAX_HUGEPAGE_LEVEL,
+ false);
+ write_unlock(&kvm->mmu_lock);
+ }
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
@@ -5616,16 +5635,15 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
struct kvm_memory_slot *slot = (struct kvm_memory_slot *)memslot;
bool flush;
- write_lock(&kvm->mmu_lock);
- flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
-
- if (flush)
- kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
- write_unlock(&kvm->mmu_lock);
+ if (kvm->arch.memslots_have_rmaps) {
+ write_lock(&kvm->mmu_lock);
+ flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
+ if (flush)
+ kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+ write_unlock(&kvm->mmu_lock);
+ }
if (is_tdp_mmu_enabled(kvm)) {
- flush = false;
-
read_lock(&kvm->mmu_lock);
flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
if (flush)
@@ -5652,11 +5670,14 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot)
{
- bool flush;
+ bool flush = false;
- write_lock(&kvm->mmu_lock);
- flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
- write_unlock(&kvm->mmu_lock);
+ if (kvm->arch.memslots_have_rmaps) {
+ write_lock(&kvm->mmu_lock);
+ flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty,
+ false);
+ write_unlock(&kvm->mmu_lock);
+ }
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
@@ -5681,6 +5702,14 @@ void kvm_mmu_zap_all(struct kvm *kvm)
int ign;
write_lock(&kvm->mmu_lock);
+ if (is_tdp_mmu_enabled(kvm))
+ kvm_tdp_mmu_zap_all(kvm);
+
+ if (!kvm->arch.memslots_have_rmaps) {
+ write_unlock(&kvm->mmu_lock);
+ return;
+ }
+
restart:
list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
if (WARN_ON(sp->role.invalid))
@@ -5693,9 +5722,6 @@ void kvm_mmu_zap_all(struct kvm *kvm)
kvm_mmu_commit_zap_page(kvm, &invalid_list);
- if (is_tdp_mmu_enabled(kvm))
- kvm_tdp_mmu_zap_all(kvm);
-
write_unlock(&kvm->mmu_lock);
}
--
2.31.1.607.g51e8a6a459-goog
Factor out copying kvm_memslots from allocating the memory for new ones
in preparation for adding a new lock to protect the arch-specific fields
of the memslots.
No functional change intended.
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
virt/kvm/kvm_main.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..9e106742b388 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1306,6 +1306,18 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
return old_memslots;
}
+static size_t kvm_memslots_size(int slots)
+{
+ return sizeof(struct kvm_memslots) +
+ (sizeof(struct kvm_memory_slot) * slots);
+}
+
+static void kvm_copy_memslots(struct kvm_memslots *from,
+ struct kvm_memslots *to)
+{
+ memcpy(to, from, kvm_memslots_size(from->used_slots));
+}
+
/*
* Note, at a minimum, the current number of used slots must be allocated, even
* when deleting a memslot, as we need a complete duplicate of the memslots for
@@ -1315,19 +1327,16 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
enum kvm_mr_change change)
{
struct kvm_memslots *slots;
- size_t old_size, new_size;
-
- old_size = sizeof(struct kvm_memslots) +
- (sizeof(struct kvm_memory_slot) * old->used_slots);
+ size_t new_size;
if (change == KVM_MR_CREATE)
- new_size = old_size + sizeof(struct kvm_memory_slot);
+ new_size = kvm_memslots_size(old->used_slots + 1);
else
- new_size = old_size;
+ new_size = kvm_memslots_size(old->used_slots);
slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
if (likely(slots))
- memcpy(slots, old, old_size);
+ kvm_copy_memslots(old, slots);
return slots;
}
--
2.31.1.607.g51e8a6a459-goog
Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.
Signed-off-by: Ben Gardon <[email protected]>
---
include/linux/kvm_host.h | 9 +++++++++
virt/kvm/kvm_main.c | 31 ++++++++++++++++++++++++++-----
2 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8895b95b6a22..2d5e797fbb08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,6 +472,15 @@ struct kvm {
#endif /* KVM_HAVE_MMU_RWLOCK */
struct mutex slots_lock;
+
+ /*
+ * Protects the arch-specific fields of struct kvm_memory_slots in
+ * use by the VM. To be used under the slots_lock (above) or in a
+ * kvm->srcu read cirtical section where acquiring the slots_lock
+ * would lead to deadlock with the synchronize_srcu in
+ * install_new_memslots.
+ */
+ struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9e106742b388..5c40d83754b1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -908,6 +908,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
mutex_init(&kvm->lock);
mutex_init(&kvm->irq_lock);
mutex_init(&kvm->slots_lock);
+ mutex_init(&kvm->slots_arch_lock);
INIT_LIST_HEAD(&kvm->devices);
BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -1280,6 +1281,10 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
rcu_assign_pointer(kvm->memslots[as_id], slots);
+
+ /* Acquired in kvm_set_memslot. */
+ mutex_unlock(&kvm->slots_arch_lock);
+
synchronize_srcu_expedited(&kvm->srcu);
/*
@@ -1351,6 +1356,9 @@ static int kvm_set_memslot(struct kvm *kvm,
struct kvm_memslots *slots;
int r;
+ /* Released in install_new_memslots. */
+ mutex_lock(&kvm->slots_arch_lock);
+
slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
if (!slots)
return -ENOMEM;
@@ -1364,10 +1372,9 @@ static int kvm_set_memslot(struct kvm *kvm,
slot->flags |= KVM_MEMSLOT_INVALID;
/*
- * We can re-use the old memslots, the only difference from the
- * newly installed memslots is the invalid flag, which will get
- * dropped by update_memslots anyway. We'll also revert to the
- * old memslots if preparing the new memory region fails.
+ * We can re-use the memory from the old memslots.
+ * It will be overwritten with a copy of the new memslots
+ * after reacquiring the slots_arch_lock below.
*/
slots = install_new_memslots(kvm, as_id, slots);
@@ -1379,6 +1386,17 @@ static int kvm_set_memslot(struct kvm *kvm,
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, slot);
+
+ /* Released in install_new_memslots. */
+ mutex_lock(&kvm->slots_arch_lock);
+
+ /*
+ * The arch-specific fields of the memslots could have changed
+ * between releasing the slots_arch_lock in
+ * install_new_memslots and here, so get a fresh copy of the
+ * slots.
+ */
+ kvm_copy_memslots(__kvm_memslots(kvm, as_id), slots);
}
r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
@@ -1394,8 +1412,11 @@ static int kvm_set_memslot(struct kvm *kvm,
return 0;
out_slots:
- if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+ if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+ slot = id_to_memslot(slots, old->id);
+ slot->flags &= ~KVM_MEMSLOT_INVALID;
slots = install_new_memslots(kvm, as_id, slots);
+ }
kvfree(slots);
return r;
}
--
2.31.1.607.g51e8a6a459-goog
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 53 +++++++++++++++++++++++----------
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++--
arch/x86/kvm/mmu/tdp_mmu.h | 4 +--
arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++-
5 files changed, 89 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fc75ed49bfee..7b65f82ade1c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1868,4 +1868,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
int kvm_cpu_dirty_log_size(void);
+int alloc_all_memslots_rmaps(struct kvm *kvm);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0bdb924d519..183afccd2944 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1190,7 +1190,8 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, true);
- if (!kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (!smp_load_acquire(&kvm->arch.memslots_have_rmaps))
return;
while (mask) {
@@ -1223,7 +1224,8 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
slot->base_gfn + gfn_offset, mask, false);
- if (!kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (!smp_load_acquire(&kvm->arch.memslots_have_rmaps))
return;
while (mask) {
@@ -1268,7 +1270,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
int i;
bool write_protected = false;
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
write_protected |= __rmap_write_protect(kvm, rmap_head,
@@ -1446,7 +1449,8 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
- if (kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
if (is_tdp_mmu_enabled(kvm))
@@ -1459,7 +1463,8 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
- if (kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmapp);
if (is_tdp_mmu_enabled(kvm))
@@ -1515,7 +1520,8 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
- if (kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmapp);
if (is_tdp_mmu_enabled(kvm))
@@ -1528,7 +1534,8 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
- if (kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmapp);
if (is_tdp_mmu_enabled(kvm))
@@ -3295,6 +3302,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
}
+ r = alloc_all_memslots_rmaps(vcpu->kvm);
+ if (r)
+ return r;
+
write_lock(&vcpu->kvm->mmu_lock);
r = make_mmu_pages_available(vcpu);
if (r < 0)
@@ -5455,7 +5466,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm_reload_remote_mmus(kvm);
- if (kvm->arch.memslots_have_rmaps)
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
kvm_zap_obsolete_pages(kvm);
write_unlock(&kvm->mmu_lock);
@@ -5483,9 +5495,13 @@ void kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
- kvm_mmu_init_tdp_mmu(kvm);
-
- kvm->arch.memslots_have_rmaps = true;
+ if (!kvm_mmu_init_tdp_mmu(kvm))
+ /*
+ * No smp_load/store wrappers needed here as we are in
+ * VM init and there cannot be any memslots / other threads
+ * accessing this struct kvm yet.
+ */
+ kvm->arch.memslots_have_rmaps = true;
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
@@ -5508,7 +5524,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
int i;
bool flush = false;
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
write_lock(&kvm->mmu_lock);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
@@ -5559,7 +5576,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
{
bool flush = false;
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
write_lock(&kvm->mmu_lock);
flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
start_level, KVM_MAX_HUGEPAGE_LEVEL,
@@ -5635,7 +5653,8 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
struct kvm_memory_slot *slot = (struct kvm_memory_slot *)memslot;
bool flush;
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
write_lock(&kvm->mmu_lock);
flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
if (flush)
@@ -5672,7 +5691,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
{
bool flush = false;
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
write_lock(&kvm->mmu_lock);
flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty,
false);
@@ -5705,7 +5725,8 @@ void kvm_mmu_zap_all(struct kvm *kvm)
if (is_tdp_mmu_enabled(kvm))
kvm_tdp_mmu_zap_all(kvm);
- if (!kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before the rmaps themselves */
+ if (!smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
write_unlock(&kvm->mmu_lock);
return;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 95eeb5ac6a8a..ea00c9502ba1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -14,10 +14,10 @@ static bool __read_mostly tdp_mmu_enabled = false;
module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
/* Initializes the TDP MMU for the VM, if enabled. */
-void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
- return;
+ return false;
/* This should not be changed for the lifetime of the VM. */
kvm->arch.tdp_mmu_enabled = true;
@@ -25,6 +25,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
+
+ return true;
}
static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5fdf63090451..b046ab5137a1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -80,12 +80,12 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);
#ifdef CONFIG_X86_64
-void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return kvm->arch.tdp_mmu_enabled; }
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
#else
-static inline void kvm_mmu_init_tdp_mmu(struct kvm *kvm) {}
+static inline bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return false; }
static inline void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) {}
static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03b6bcff2a53..fdc1b2759771 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10920,6 +10920,8 @@ static int memslot_rmap_alloc(struct kvm_memory_slot *slot,
int lpages;
int level = i + 1;
+ WARN_ON(slot->arch.rmap[i]);
+
lpages = gfn_to_index(slot->base_gfn + npages - 1,
slot->base_gfn, level) + 1;
@@ -10935,6 +10937,46 @@ static int memslot_rmap_alloc(struct kvm_memory_slot *slot,
return 0;
}
+int alloc_all_memslots_rmaps(struct kvm *kvm)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int r = 0;
+ int i;
+
+ /*
+ * Check memslots_have_rmaps early before acquiring the
+ * slots_arch_lock below.
+ */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
+ return 0;
+
+ mutex_lock(&kvm->slots_arch_lock);
+
+ /*
+ * Read memslots_have_rmaps again, under the slots arch lock,
+ * before allocating the rmaps
+ */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
+ return 0;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, slots) {
+ r = memslot_rmap_alloc(slot, slot->npages);
+ if (r) {
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+ }
+ }
+ }
+
+ /* Write rmap pointers before memslots_have_rmaps */
+ smp_store_release(&kvm->arch.memslots_have_rmaps, true);
+ mutex_unlock(&kvm->slots_arch_lock);
+ return 0;
+}
+
static int kvm_alloc_memslot_metadata(struct kvm *kvm,
struct kvm_memory_slot *slot,
unsigned long npages)
@@ -10949,7 +10991,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
*/
memset(&slot->arch, 0, sizeof(slot->arch));
- if (kvm->arch.memslots_have_rmaps) {
+ /* Read memslots_have_rmaps before allocating the rmaps */
+ if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
r = memslot_rmap_alloc(slot, npages);
if (r)
return r;
--
2.31.1.607.g51e8a6a459-goog
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.
No functional change expected.
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/mmu/mmu.c | 2 ++
arch/x86/kvm/x86.c | 13 ++++++++-----
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..fc75ed49bfee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1124,6 +1124,12 @@ struct kvm_arch {
*/
spinlock_t tdp_mmu_pages_lock;
#endif /* CONFIG_X86_64 */
+
+ /*
+ * If set, rmaps have been allocated for all memslots and should be
+ * allocated for any newly created or modified memslots.
+ */
+ bool memslots_have_rmaps;
};
struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0144c40d09c7..f059f2e8c6fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5469,6 +5469,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
kvm_mmu_init_tdp_mmu(kvm);
+ kvm->arch.memslots_have_rmaps = true;
+
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc0440b5b35d..03b6bcff2a53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10935,7 +10935,8 @@ static int memslot_rmap_alloc(struct kvm_memory_slot *slot,
return 0;
}
-static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
+static int kvm_alloc_memslot_metadata(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
unsigned long npages)
{
int i;
@@ -10948,9 +10949,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
*/
memset(&slot->arch, 0, sizeof(slot->arch));
- r = memslot_rmap_alloc(slot, npages);
- if (r)
- return r;
+ if (kvm->arch.memslots_have_rmaps) {
+ r = memslot_rmap_alloc(slot, npages);
+ if (r)
+ return r;
+ }
for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
struct kvm_lpage_info *linfo;
@@ -11021,7 +11024,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
enum kvm_mr_change change)
{
if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
- return kvm_alloc_memslot_metadata(memslot,
+ return kvm_alloc_memslot_metadata(kvm, memslot,
mem->memory_size >> PAGE_SHIFT);
return 0;
}
--
2.31.1.607.g51e8a6a459-goog
On Tue, May 11, 2021, Ben Gardon wrote:
> Factor out copying kvm_memslots from allocating the memory for new ones
> in preparation for adding a new lock to protect the arch-specific fields
> of the memslots.
>
> No functional change intended.
>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> virt/kvm/kvm_main.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..9e106742b388 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1306,6 +1306,18 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> return old_memslots;
> }
>
> +static size_t kvm_memslots_size(int slots)
Can we call this kvm_calc_memslots_size()? This doesn't actually return the
true size of a given memslots instance since the allocated size may be greater
than the size computed by looking at used_slots.
> +{
> + return sizeof(struct kvm_memslots) +
> + (sizeof(struct kvm_memory_slot) * slots);
> +}
> +
> +static void kvm_copy_memslots(struct kvm_memslots *from,
> + struct kvm_memslots *to)
> +{
> + memcpy(to, from, kvm_memslots_size(from->used_slots));
> +}
> +
> /*
> * Note, at a minimum, the current number of used slots must be allocated, even
> * when deleting a memslot, as we need a complete duplicate of the memslots for
> @@ -1315,19 +1327,16 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
> enum kvm_mr_change change)
> {
> struct kvm_memslots *slots;
> - size_t old_size, new_size;
> -
> - old_size = sizeof(struct kvm_memslots) +
> - (sizeof(struct kvm_memory_slot) * old->used_slots);
> + size_t new_size;
>
> if (change == KVM_MR_CREATE)
> - new_size = old_size + sizeof(struct kvm_memory_slot);
> + new_size = kvm_memslots_size(old->used_slots + 1);
> else
> - new_size = old_size;
> + new_size = kvm_memslots_size(old->used_slots);
>
> slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
> if (likely(slots))
> - memcpy(slots, old, old_size);
> + kvm_copy_memslots(old, slots);
>
> return slots;
> }
> --
> 2.31.1.607.g51e8a6a459-goog
>
On Tue, May 11, 2021, Ben Gardon wrote:
> Add a new lock to protect the arch-specific fields of memslots if they
> need to be modified in a kvm->srcu read critical section. A future
> commit will use this lock to lazily allocate memslot rmaps for x86.
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> include/linux/kvm_host.h | 9 +++++++++
> virt/kvm/kvm_main.c | 31 ++++++++++++++++++++++++++-----
> 2 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8895b95b6a22..2d5e797fbb08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -472,6 +472,15 @@ struct kvm {
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> struct mutex slots_lock;
> +
> + /*
> + * Protects the arch-specific fields of struct kvm_memory_slots in
> + * use by the VM. To be used under the slots_lock (above) or in a
> + * kvm->srcu read cirtical section where acquiring the slots_lock
^^^^^^^^
critical
> + * would lead to deadlock with the synchronize_srcu in
> + * install_new_memslots.
> + */
> + struct mutex slots_arch_lock;
> struct mm_struct *mm; /* userspace tied to this vm */
> struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
On Tue, May 11, 2021, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -10935,6 +10937,46 @@ static int memslot_rmap_alloc(struct kvm_memory_slot *slot,
> return 0;
> }
>
> +int alloc_all_memslots_rmaps(struct kvm *kvm)
> +{
> + struct kvm_memslots *slots;
> + struct kvm_memory_slot *slot;
> + int r = 0;
No need to initialize r. And then it makes sense to put i and r on the same
line.
> + int i;
> +
> + /*
> + * Check memslots_have_rmaps early before acquiring the
> + * slots_arch_lock below.
> + */
> + if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
> + return 0;
> +
> + mutex_lock(&kvm->slots_arch_lock);
> +
> + /*
> + * Read memslots_have_rmaps again, under the slots arch lock,
> + * before allocating the rmaps
> + */
> + if (smp_load_acquire(&kvm->arch.memslots_have_rmaps))
> + return 0;
This fails to drop slots_arch_lock.
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> + kvm_for_each_memslot(slot, slots) {
> + r = memslot_rmap_alloc(slot, slot->npages);
> + if (r) {
> + mutex_unlock(&kvm->slots_arch_lock);
> + return r;
> + }
> + }
> + }
> +
> + /* Write rmap pointers before memslots_have_rmaps */
> + smp_store_release(&kvm->arch.memslots_have_rmaps, true);
> + mutex_unlock(&kvm->slots_arch_lock);
> + return 0;
> +}
> +
> static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> unsigned long npages)
> @@ -10949,7 +10991,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> */
> memset(&slot->arch, 0, sizeof(slot->arch));
>
> - if (kvm->arch.memslots_have_rmaps) {
> + /* Read memslots_have_rmaps before allocating the rmaps */
> + if (smp_load_acquire(&kvm->arch.memslots_have_rmaps)) {
> r = memslot_rmap_alloc(slot, npages);
> if (r)
> return r;
> --
> 2.31.1.607.g51e8a6a459-goog
>
On Tue, May 11, 2021, Ben Gardon wrote:
> Add a new lock to protect the arch-specific fields of memslots if they
> need to be modified in a kvm->srcu read critical section. A future
> commit will use this lock to lazily allocate memslot rmaps for x86.
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> include/linux/kvm_host.h | 9 +++++++++
> virt/kvm/kvm_main.c | 31 ++++++++++++++++++++++++++-----
> 2 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8895b95b6a22..2d5e797fbb08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -472,6 +472,15 @@ struct kvm {
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> struct mutex slots_lock;
> +
> + /*
> + * Protects the arch-specific fields of struct kvm_memory_slots in
> + * use by the VM. To be used under the slots_lock (above) or in a
> + * kvm->srcu read cirtical section where acquiring the slots_lock
> + * would lead to deadlock with the synchronize_srcu in
> + * install_new_memslots.
> + */
> + struct mutex slots_arch_lock;
> struct mm_struct *mm; /* userspace tied to this vm */
> struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9e106742b388..5c40d83754b1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -908,6 +908,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> mutex_init(&kvm->lock);
> mutex_init(&kvm->irq_lock);
> mutex_init(&kvm->slots_lock);
> + mutex_init(&kvm->slots_arch_lock);
> INIT_LIST_HEAD(&kvm->devices);
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> @@ -1280,6 +1281,10 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
>
> rcu_assign_pointer(kvm->memslots[as_id], slots);
> +
> + /* Acquired in kvm_set_memslot. */
> + mutex_unlock(&kvm->slots_arch_lock);
> +
> synchronize_srcu_expedited(&kvm->srcu);
>
> /*
> @@ -1351,6 +1356,9 @@ static int kvm_set_memslot(struct kvm *kvm,
> struct kvm_memslots *slots;
> int r;
>
> + /* Released in install_new_memslots. */
This needs a much more comprehensive comment, either here or above the declaration
of slots_arch_lock. I can't find anything that explicitly states the the lock
must be held from the time the previous memslots are duplicated/copied until the
new memslots are set. Without that information, the rules/expecations are not
clear.
> + mutex_lock(&kvm->slots_arch_lock);
> +
> slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
> if (!slots)
> return -ENOMEM;
Fails to drop slots_arch_lock.
> @@ -1364,10 +1372,9 @@ static int kvm_set_memslot(struct kvm *kvm,
> slot->flags |= KVM_MEMSLOT_INVALID;
>
> /*
> - * We can re-use the old memslots, the only difference from the
> - * newly installed memslots is the invalid flag, which will get
> - * dropped by update_memslots anyway. We'll also revert to the
> - * old memslots if preparing the new memory region fails.
> + * We can re-use the memory from the old memslots.
> + * It will be overwritten with a copy of the new memslots
> + * after reacquiring the slots_arch_lock below.
> */
> slots = install_new_memslots(kvm, as_id, slots);
>
> @@ -1379,6 +1386,17 @@ static int kvm_set_memslot(struct kvm *kvm,
> * - kvm_is_visible_gfn (mmu_check_root)
> */
> kvm_arch_flush_shadow_memslot(kvm, slot);
> +
> + /* Released in install_new_memslots. */
> + mutex_lock(&kvm->slots_arch_lock);
> +
> + /*
> + * The arch-specific fields of the memslots could have changed
> + * between releasing the slots_arch_lock in
> + * install_new_memslots and here, so get a fresh copy of the
> + * slots.
> + */
> + kvm_copy_memslots(__kvm_memslots(kvm, as_id), slots);
Ow. This is feedback for a previous patch, but kvm_copy_memslots() absolutely
needs to swap the order of params to match memcpy(), i.e. @to is first, @from is
second.
> }
>
> r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
> @@ -1394,8 +1412,11 @@ static int kvm_set_memslot(struct kvm *kvm,
> return 0;
>
> out_slots:
> - if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> + if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> + slot = id_to_memslot(slots, old->id);
> + slot->flags &= ~KVM_MEMSLOT_INVALID;
> slots = install_new_memslots(kvm, as_id, slots);
> + }
Fails to drop slots_arch_lock if kvm_arch_prepare_memory_region() fails for
anything other than DELETE and MOVE.
I really, really don't like dropping the lock in install_new_memslots(). Unlocking
bugs aside, IMO it makes it quite difficult to understand exactly what
slots_arch_lock protects. Unfortunately I'm just whining at this point since I
don't have a better idea :-(
> kvfree(slots);
> return r;
> }
> --
> 2.31.1.607.g51e8a6a459-goog
>
On 11/05/21 21:21, Sean Christopherson wrote:
>> + /* Released in install_new_memslots. */
>
> This needs a much more comprehensive comment, either here or above the declaration
> of slots_arch_lock. I can't find anything that explicitly states the the lock
> must be held from the time the previous memslots are duplicated/copied until the
> new memslots are set. Without that information, the rules/expecations are not
> clear.
Indeed, this needs basically a description of the races that can happen,
as you explained them in the v1 review.
> Unfortunately I'm just whining at this point since I
> don't have a better idea
Yeah, the synchronize_srcu call in install_new_memslots throws a wrench
in most alternatives.
Paolo
On Tue, May 11, 2021, Ben Gardon wrote:
> @@ -1260,9 +1268,12 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> int i;
> bool write_protected = false;
>
> - for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> - rmap_head = __gfn_to_rmap(gfn, i, slot);
> - write_protected |= __rmap_write_protect(kvm, rmap_head, true);
> + if (kvm->arch.memslots_have_rmaps) {
> + for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> + rmap_head = __gfn_to_rmap(gfn, i, slot);
> + write_protected |= __rmap_write_protect(kvm, rmap_head,
> + true);
I vote to let "true" poke out.
> + }
> }
>
> if (is_tdp_mmu_enabled(kvm))
...
> @@ -5440,7 +5455,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> */
> kvm_reload_remote_mmus(kvm);
>
> - kvm_zap_obsolete_pages(kvm);
> + if (kvm->arch.memslots_have_rmaps)
> + kvm_zap_obsolete_pages(kvm);
Hmm, for cases where we're iterating over the list of active_mmu_pages, I would
prefer to either leave the code as-is or short-circuit the helpers with a more
explicit:
if (list_empty(&kvm->arch.active_mmu_pages))
return ...;
I'd probably vote for leaving the code as it is; the loop iteration and list_empty
check in kvm_mmu_commit_zap_page() add a single compare-and-jump in the worst
case scenario.
In other words, restrict use of memslots_have_rmaps to flows that directly
walk the rmaps, as opposed to partially overloading memslots_have_rmaps to mean
"is using legacy MMU".
> write_unlock(&kvm->mmu_lock);
>
...
> @@ -5681,6 +5702,14 @@ void kvm_mmu_zap_all(struct kvm *kvm)
> int ign;
>
> write_lock(&kvm->mmu_lock);
> + if (is_tdp_mmu_enabled(kvm))
> + kvm_tdp_mmu_zap_all(kvm);
> +
> + if (!kvm->arch.memslots_have_rmaps) {
> + write_unlock(&kvm->mmu_lock);
> + return;
Another case where falling through to walking active_mmu_pages is perfectly ok.
> + }
> +
> restart:
> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> if (WARN_ON(sp->role.invalid))
> @@ -5693,9 +5722,6 @@ void kvm_mmu_zap_all(struct kvm *kvm)
>
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>
> - if (is_tdp_mmu_enabled(kvm))
> - kvm_tdp_mmu_zap_all(kvm);
> -
> write_unlock(&kvm->mmu_lock);
> }
>
> --
> 2.31.1.607.g51e8a6a459-goog
>
On Tue, May 11, 2021, Ben Gardon wrote:
> If the TDP MMU is in use, wait to allocate the rmaps until the shadow
> MMU is actually used. (i.e. a nested VM is launched.) This saves memory
> equal to 0.2% of guest memory in cases where the TDP MMU is used and
> there are no nested guests involved.
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 53 +++++++++++++++++++++++----------
> arch/x86/kvm/mmu/tdp_mmu.c | 6 ++--
> arch/x86/kvm/mmu/tdp_mmu.h | 4 +--
> arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++-
> 5 files changed, 89 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fc75ed49bfee..7b65f82ade1c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1868,4 +1868,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>
> int kvm_cpu_dirty_log_size(void);
>
> +int alloc_all_memslots_rmaps(struct kvm *kvm);
> +
> #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b0bdb924d519..183afccd2944 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1190,7 +1190,8 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
> slot->base_gfn + gfn_offset, mask, true);
>
> - if (!kvm->arch.memslots_have_rmaps)
> + /* Read memslots_have_rmaps before the rmaps themselves */
IIRC, you open coded reading memslots_have_rmaps because of a circular
dependency, but you can solve that simply by defining the helper in mmu.h
instead of kvm_host.h.
And I think you could even make it static in mmu.c and omit the smp_load_acuquire
from the three users in x86.c, though that's probably not worth it.
Either way, reading the same comment over and over and over, just to make
checkpatch happy, gets more than a bit tedious.
That would also allow you to elaborate on why the smp_load_acquire() is
necessary, and preferably what it pairs with.
> + if (!smp_load_acquire(&kvm->arch.memslots_have_rmaps))
> return;
>
> while (mask) {
On Tue, May 11, 2021, Sean Christopherson wrote:
> On Tue, May 11, 2021, Ben Gardon wrote:
> > If the TDP MMU is in use, wait to allocate the rmaps until the shadow
> > MMU is actually used. (i.e. a nested VM is launched.) This saves memory
> > equal to 0.2% of guest memory in cases where the TDP MMU is used and
> > there are no nested guests involved.
> >
> > Signed-off-by: Ben Gardon <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 53 +++++++++++++++++++++++----------
> > arch/x86/kvm/mmu/tdp_mmu.c | 6 ++--
> > arch/x86/kvm/mmu/tdp_mmu.h | 4 +--
> > arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++-
> > 5 files changed, 89 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index fc75ed49bfee..7b65f82ade1c 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1868,4 +1868,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
> >
> > int kvm_cpu_dirty_log_size(void);
> >
> > +int alloc_all_memslots_rmaps(struct kvm *kvm);
> > +
> > #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b0bdb924d519..183afccd2944 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1190,7 +1190,8 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> > kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
> > slot->base_gfn + gfn_offset, mask, true);
> >
> > - if (!kvm->arch.memslots_have_rmaps)
> > + /* Read memslots_have_rmaps before the rmaps themselves */
>
> IIRC, you open coded reading memslots_have_rmaps because of a circular
> dependency, but you can solve that simply by defining the helper in mmu.h
> instead of kvm_host.h.
>
> And I think you could even make it static in mmu.c and omit the smp_load_acuquire
> from the three users in x86.c, though that's probably not worth it.
>
> Either way, reading the same comment over and over and over, just to make
> checkpatch happy, gets more than a bit tedious.
>
> That would also allow you to elaborate on why the smp_load_acquire() is
> necessary, and preferably what it pairs with.
Belated thought: you could also introduce the helper in patch 06 in order to
miminize thrash in this patch.