2021-09-21 03:00:50

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 00/13] KVM: Scalable memslots implementation

From: "Maciej S. Szmigiero" <[email protected]>

The current memslot code uses a (reverse) gfn-ordered memslot array
for keeping track of them.
This only allows quick binary search by gfn, quick lookup by hva is
not possible (the implementation has to do a linear scan of the whole
memslot array).

Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change
flags) has to make a copy of the whole array so it has a scratch copy
to work on.

Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently
present is just a limitation of the array-based memslot implementation.

Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.

In order to have two memslot sets, but only one copy of the actual
memslots it is necessary to split out the memslot data from the
memslot sets.

The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.

These two memslot sets should normally point to the same set of memslots.
They can, however, be desynchronized when performing a memslot management
operation by replacing the memslot to be modified by its copy.
After the operation is complete, both memslot sets once again point to
the same, common set of memslot data.

This series implements the aforementioned idea.

The new implementation uses two trees to perform quick lookups:
For tracking of gfn an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.

For tracking of hva, however, an interval tree is needed since they
can overlap between memslots.

ID to memslot mappings are kept in a hash table instead of using a
statically allocated "id_to_index" array.

The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.

There was also a desire to make the new structure operate on "pay as
you go" basis, that is, that the user only pays the price of the
memslot count that is actually used, not of the maximum count allowed.

The operation semantics were carefully matched to the original
implementation, the outside-visible behavior should not change.
Only the timing will be different.

Making lookup and memslot management operations O(log(n)) brings some
performance benefits (tested on a Xeon 8167M machine):
509 slots in use:
Test Before After Improvement
Map 0.0232s 0.0223s 4%
Unmap 0.0724s 0.0315s 56%
Unmap 2M 0.00155s 0.000869s 44%
Move active 0.000101s 0.0000792s 22%
Move inactive 0.000108s 0.0000847s 21%
Slot setup 0.0135s 0.00803s 41%

100 slots in use:
Test Before After Improvement
Map 0.0195s 0.0191s None
Unmap 0.0374s 0.0312s 17%
Unmap 2M 0.000470s 0.000447s 5%
Move active 0.0000956s 0.0000800s 16%
Move inactive 0.000101s 0.0000840s 17%
Slot setup 0.00260s 0.00174s 33%

50 slots in use:
Test Before After Improvement
Map 0.0192s 0.0190s None
Unmap 0.0339s 0.0311s 8%
Unmap 2M 0.000399s 0.000395s None
Move active 0.0000999s 0.0000854s 15%
Move inactive 0.0000992s 0.0000826s 17%
Slot setup 0.00141s 0.000990s 30%

30 slots in use:
Test Before After Improvement
Map 0.0192s 0.0190s None
Unmap 0.0325s 0.0310s 5%
Unmap 2M 0.000373s 0.000373s None
Move active 0.000100s 0.0000865s 14%
Move inactive 0.000106s 0.0000918s 13%
Slot setup 0.000989s 0.000775s 22%

10 slots in use:
Test Before After Improvement
Map 0.0192s 0.0186s 3%
Unmap 0.0313s 0.0310s None
Unmap 2M 0.000348s 0.000351s None
Move active 0.000110s 0.0000948s 14%
Move inactive 0.000111s 0.0000933s 16%
Slot setup 0.000342s 0.000283s 17%

32k slots in use:
Test Before After Improvement
Map (8194) 0.200s 0.0541s 73%
Unmap 3.88s 0.0351s 99%
Unmap 2M 3.88s 0.0348s 99%
Move active 0.00142s 0.0000786s 94%
Move inactive 0.00148s 0.0000880s 94%
Slot setup 16.1s 0.59s 96%

Since the map test can be done with up to 8194 slots, the result above
for this test was obtained running it with that maximum number of
slots.

In both the old and new memslot code case the measurements were done
against the new KVM selftest framework, with TDP MMU disabled.

On x86-64 the code was well tested, passed KVM unit tests and KVM
selftests with KASAN on.
And, of course, booted various guests successfully (including nested
ones with TDP MMU enabled).
On other KVM platforms the code was compile-tested only.

Changes since v1:
* Drop the already merged HVA handler retpoline-friendliness patch,

* Split the scalable memslots patch into 8 smaller ones,

* Rebase onto the current kvm/queue,

* Make sure that ranges at both memslot's hva_nodes are always
initialized,

* Remove kvm_mmu_calculate_default_mmu_pages() prototype, too,
when removing this function,

* Redo benchmarks, measure 32k memslots on the old implementation, too.

Changes since v2:
* Rebase onto the current kvm/queue, taking into account the now-merged
MMU notifiers rewrite.
This reduces the diffstat by ~50 lines.

Changes since v3:
* Rebase onto the current (Aug 12) kvm/queue, taking into account the
introduction of slots_arch_lock (and lazy rmaps allocation) and per-vCPU
"last used slot" mini-cache,

* Change n_memslots_pages to u64 to avoid overflowing it on 32-bit KVM,

* Skip the kvm_mmu_change_mmu_pages() call for memslot operations that
don't change the total page count anyway,

* Move n_memslots_pages recalc to kvm_arch_prepare_memory_region() so
a proper error code can be returned in case we spot an underflow,

* Integrate the open-coded s390 gfn_to_memslot_approx() into the main KVM
code by adding "approx" parameter to the existing __gfn_to_memslot() while
renaming it to __gfn_to_memslot_approx() and introducing
__gfn_to_memslot() wrapper so existing call sites won't be affected,
since __gfn_to_memslot() now offers an extra functionality over
search_memslots().
This last fact wasn't the case at the time the previous version of this
series was posted,

* Split out the move of WARN that's triggered on invalid memslot index to
a separate patch,

* Rename the old slot variable in kvm_memslot_delete() to "oldslot",
add an additional comment why we delete a memslot from the hash table
in kvm_memslot_move_backward() in "KVM: Resolve memslot ID via a hash
table instead of via a static array" patch,

* Rename a patch from "Introduce memslots hva tree" to "Use interval tree
to do fast hva lookup in memslots",

* Document that the kvm_for_each_hva_range_memslot() range is inclusive,

* Rename kvm_for_each_hva_range_memslot to
kvm_for_each_memslot_in_hva_range, move this macro to kvm_main.c,

* Update the WARN in __kvm_handle_hva_range() and kvm_zap_gfn_range() to
also trigger on empty ranges,

* Use "bkt" instead of "ctr" for hash bucket index variables,

* Add a comment describing the idea behind the memslots dual set system
above struct kvm_memory_slot,

* Rename "memslots_all" to "__memslots" in struct kvm and add comments
there describing its two memslot members,

* Remove kvm_memslots_idx() helper and store the set index explicitly in
the particular memslots set instead,

* Open code kvm_init_memslots(),

* Rename swap_memslots() into kvm_swap_active_memslots() and make it
also swap pointers themselves to the provided two sets,

* Initialize "parent" outside of the for loop in kvm_memslot_gfn_insert(),

* Add kvm_memslot_gfn_erase() and kvm_memslot_gfn_replace() helpers,

* Fold kvm_init_memslot_hva_ranges() into kvm_copy_memslot(),

* Remove kvm_copy_replace_memslot() and introduce kvm_activate_memslot()
instead,

* Rename "slotsact" / "slotsina" into "active" / "inactive" in
kvm_set_memslot(), add a big comment what are the semantics of
"slotact" / "slotina" variables to this function,

* Move WARN about a missing old slot to kvm_set_memslot() and return -EIO
in this case,

* Set KVM_MEMSLOT_INVALID flag on a temporary slot before (rather than
after) replacing it in the inactive set in kvm_set_memslot() as it is a bit
more intuitive way to do it,

* Move handling of an error from kvm_arch_prepare_memory_region() directly
below a call to this function in kvm_set_memslot() from the end of this
function,

* Rework some of the comments in kvm_set_memslot(),

* Rename "oldslot" / "nslot" into "old" / "new" here and there in
"Keep memslots in tree-based structures instead of array-based ones"
patch,

* Add Sean's "Co-developed-by:" to the aforementioned patch since many of
its above changes are coming from his proposed patch (and he did a lot
of good work reviewing both this patch and the whole series),

* Introduce kvm_for_each_memslot_in_gfn_range() and use it in the last two
patches,

* Unfold long lines here and there,

* Retest the patch series.

Changes since v4:
* Rebase onto v5.15-rc2 (torvalds/master),

* Fix 64-bit division of n_memslots_pages for 32-bit KVM,

* Collect Claudio's Reviewed-by tags for some of the patches.

arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 15 +-
arch/mips/kvm/Kconfig | 1 +
arch/mips/kvm/mips.c | 3 +-
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +-
arch/powerpc/kvm/book3s_hv.c | 3 +-
arch/powerpc/kvm/book3s_hv_nested.c | 4 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 14 +-
arch/powerpc/kvm/powerpc.c | 5 +-
arch/s390/kvm/Kconfig | 1 +
arch/s390/kvm/kvm-s390.c | 66 +--
arch/s390/kvm/kvm-s390.h | 14 +
arch/s390/kvm/pv.c | 4 +-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/debugfs.c | 6 +-
arch/x86/kvm/mmu/mmu.c | 35 +-
arch/x86/kvm/x86.c | 42 +-
include/linux/kvm_host.h | 224 +++++++---
virt/kvm/kvm_main.c | 651 +++++++++++++++-------------
21 files changed, 624 insertions(+), 473 deletions(-)


2021-09-21 03:01:04

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 02/13] KVM: x86: Don't call kvm_mmu_change_mmu_pages() if the count hasn't changed

From: "Maciej S. Szmigiero" <[email protected]>

There is no point in calling kvm_mmu_change_mmu_pages() for memslot
operations that don't change the total page count, so do it just for
KVM_MR_CREATE and KVM_MR_DELETE.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/x86.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65fdf27b9423..2e4fe2511c5d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11609,22 +11609,24 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- if (change == KVM_MR_CREATE)
- kvm->arch.n_memslots_pages += new->npages;
- else if (change == KVM_MR_DELETE) {
- WARN_ON(kvm->arch.n_memslots_pages < old->npages);
- kvm->arch.n_memslots_pages -= old->npages;
- }
-
- if (!kvm->arch.n_requested_mmu_pages) {
- u64 memslots_pages;
- unsigned long nr_mmu_pages;
-
- memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
- do_div(memslots_pages, 1000);
- nr_mmu_pages = max_t(typeof(nr_mmu_pages),
- memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
- kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+ if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
+ if (change == KVM_MR_CREATE)
+ kvm->arch.n_memslots_pages += new->npages;
+ else {
+ WARN_ON(kvm->arch.n_memslots_pages < old->npages);
+ kvm->arch.n_memslots_pages -= old->npages;
+ }
+
+ if (!kvm->arch.n_requested_mmu_pages) {
+ u64 memslots_pages;
+ unsigned long nr_mmu_pages;
+
+ memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
+ do_div(memslots_pages, 1000);
+ nr_mmu_pages = max_t(typeof(nr_mmu_pages),
+ memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
+ kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+ }
}

kvm_mmu_slot_apply_flags(kvm, old, new, change);

2021-09-21 03:01:09

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 03/13] KVM: Add "old" memslot parameter to kvm_arch_prepare_memory_region()

From: "Maciej S. Szmigiero" <[email protected]>

This is needed for the next commit, which moves n_memslots_pages
recalculation from kvm_arch_commit_memory_region() to the aforementioned
function to allow for returning an error code.

While we are at it let's also rename the "memslot" parameter to "new" for
consistency with kvm_arch_commit_memory_region(), which uses the same
argument set now.

No functional change intended.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/arm64/kvm/mmu.c | 7 ++++---
arch/mips/kvm/mips.c | 3 ++-
arch/powerpc/kvm/powerpc.c | 5 +++--
arch/s390/kvm/kvm-s390.c | 3 ++-
arch/x86/kvm/x86.c | 5 +++--
include/linux/kvm_host.h | 3 ++-
virt/kvm/kvm_main.c | 2 +-
7 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1a94a7ca48f2..6d93ae1edb6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1486,7 +1486,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
@@ -1502,7 +1503,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
* Prevent userspace from creating a memory region outside of the IPA
* space addressable by the KVM guest IPA space.
*/
- if ((memslot->base_gfn + memslot->npages) > (kvm_phys_size(kvm) >> PAGE_SHIFT))
+ if ((new->base_gfn + new->npages) > (kvm_phys_size(kvm) >> PAGE_SHIFT))
return -EFAULT;

mmap_read_lock(current->mm);
@@ -1534,7 +1535,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,

if (vma->vm_flags & VM_PFNMAP) {
/* IO region dirty page logging not allowed */
- if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
break;
}
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 75c6f264c626..8587c260c6fb 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -233,7 +233,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b4e6f70b97b9..6ecb6a51d82f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -706,11 +706,12 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- return kvmppc_core_prepare_memory_region(kvm, memslot, mem, change);
+ return kvmppc_core_prepare_memory_region(kvm, new, mem, change);
}

void kvm_arch_commit_memory_region(struct kvm *kvm,
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 752a0ffab9bf..3fd0f4afe742 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5016,7 +5016,8 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)

/* Section: memory related */
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e4fe2511c5d..97d86223427d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11506,12 +11506,13 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
- return kvm_alloc_memslot_metadata(kvm, memslot,
+ return kvm_alloc_memslot_metadata(kvm, new,
mem->memory_size >> PAGE_SHIFT);
return 0;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 041ca7f15ea4..c6963f6f1eb8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -832,7 +832,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change);
void kvm_arch_commit_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 439d3b4cd1a9..c3c71a9e85c9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,7 +1571,7 @@ static int kvm_set_memslot(struct kvm *kvm,
kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
}

- r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
+ r = kvm_arch_prepare_memory_region(kvm, old, new, mem, change);
if (r)
goto out_slots;

2021-09-21 03:01:58

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 05/13] KVM: Integrate gfn_to_memslot_approx() into search_memslots()

From: "Maciej S. Szmigiero" <[email protected]>

s390 arch has gfn_to_memslot_approx() which is almost identical to
search_memslots(), differing only in that in case the gfn falls in a hole
one of the memslots bordering the hole is returned.

Add this lookup mode as an option to search_memslots() so we don't have two
almost identical functions for looking up a memslot by its gfn.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 39 ++-------------------------------------
include/linux/kvm_host.h | 25 ++++++++++++++++++++++---
virt/kvm/kvm_main.c | 2 +-
3 files changed, 25 insertions(+), 41 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3fd0f4afe742..540fa948baa5 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1941,41 +1941,6 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
/* for consistency */
#define KVM_S390_CMMA_SIZE_MAX ((u32)KVM_S390_SKEYS_MAX)

-/*
- * Similar to gfn_to_memslot, but returns the index of a memslot also when the
- * address falls in a hole. In that case the index of one of the memslots
- * bordering the hole is returned.
- */
-static int gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn)
-{
- int start = 0, end = slots->used_slots;
- int slot = atomic_read(&slots->last_used_slot);
- struct kvm_memory_slot *memslots = slots->memslots;
-
- if (gfn >= memslots[slot].base_gfn &&
- gfn < memslots[slot].base_gfn + memslots[slot].npages)
- return slot;
-
- while (start < end) {
- slot = start + (end - start) / 2;
-
- if (gfn >= memslots[slot].base_gfn)
- end = slot;
- else
- start = slot + 1;
- }
-
- if (start >= slots->used_slots)
- return slots->used_slots - 1;
-
- if (gfn >= memslots[start].base_gfn &&
- gfn < memslots[start].base_gfn + memslots[start].npages) {
- atomic_set(&slots->last_used_slot, start);
- }
-
- return start;
-}
-
static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
u8 *res, unsigned long bufsize)
{
@@ -2002,8 +1967,8 @@ static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
unsigned long cur_gfn)
{
- int slotidx = gfn_to_memslot_approx(slots, cur_gfn);
- struct kvm_memory_slot *ms = slots->memslots + slotidx;
+ struct kvm_memory_slot *ms = __gfn_to_memslot_approx(slots, cur_gfn, true);
+ int slotidx = ms - slots->memslots;
unsigned long ofs = cur_gfn - ms->base_gfn;

if (ms->base_gfn + ms->npages <= cur_gfn) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6963f6f1eb8..8fd9644f40b2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1231,10 +1231,14 @@ try_get_memslot(struct kvm_memslots *slots, int slot_index, gfn_t gfn)
* Returns a pointer to the memslot that contains gfn and records the index of
* the slot in index. Otherwise returns NULL.
*
+ * With "approx" set returns the memslot also when the address falls
+ * in a hole. In that case one of the memslots bordering the hole is
+ * returned.
+ *
* IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
*/
static inline struct kvm_memory_slot *
-search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index)
+search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index, bool approx)
{
int start = 0, end = slots->used_slots;
struct kvm_memory_slot *memslots = slots->memslots;
@@ -1252,11 +1256,20 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index)
start = slot + 1;
}

+ if (approx && start >= slots->used_slots) {
+ *index = slots->used_slots - 1;
+ return &memslots[slots->used_slots - 1];
+ }
+
slot = try_get_memslot(slots, start, gfn);
if (slot) {
*index = start;
return slot;
}
+ if (approx) {
+ *index = start;
+ return &memslots[start];
+ }

return NULL;
}
@@ -1267,7 +1280,7 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index)
* itself isn't here as an inline because that would bloat other code too much.
*/
static inline struct kvm_memory_slot *
-__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
+__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
{
struct kvm_memory_slot *slot;
int slot_index = atomic_read(&slots->last_used_slot);
@@ -1276,7 +1289,7 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
if (slot)
return slot;

- slot = search_memslots(slots, gfn, &slot_index);
+ slot = search_memslots(slots, gfn, &slot_index, approx);
if (slot) {
atomic_set(&slots->last_used_slot, slot_index);
return slot;
@@ -1285,6 +1298,12 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
return NULL;
}

+static inline struct kvm_memory_slot *
+__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
+{
+ return __gfn_to_memslot_approx(slots, gfn, false);
+}
+
static inline unsigned long
__gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c3c71a9e85c9..14043a6edb88 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2062,7 +2062,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
* search_memslots() instead of __gfn_to_memslot() to avoid
* thrashing the VM-wide last_used_index in kvm_memslots.
*/
- slot = search_memslots(slots, gfn, &slot_index);
+ slot = search_memslots(slots, gfn, &slot_index, false);
if (slot) {
vcpu->last_used_slot = slot_index;
return slot;

2021-09-21 03:06:28

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 11/13] KVM: Keep memslots in tree-based structures instead of array-based ones

From: "Maciej S. Szmigiero" <[email protected]>

The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.

Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.

Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.

Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.

In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.

The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.

These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.

This commit implements the aforementioned idea.

For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.

The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/arm64/kvm/mmu.c | 8 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +-
arch/powerpc/kvm/book3s_hv.c | 3 +-
arch/powerpc/kvm/book3s_hv_nested.c | 4 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 14 +-
arch/s390/kvm/kvm-s390.c | 24 +-
arch/s390/kvm/kvm-s390.h | 6 +-
arch/x86/kvm/debugfs.c | 6 +-
arch/x86/kvm/mmu/mmu.c | 4 +-
arch/x86/kvm/x86.c | 4 +-
include/linux/kvm_host.h | 140 +++---
virt/kvm/kvm_main.c | 659 +++++++++++++---------------
12 files changed, 425 insertions(+), 451 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6d93ae1edb6d..d59969200164 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -210,13 +210,13 @@ static void stage2_flush_vm(struct kvm *kvm)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
- int idx;
+ int idx, bkt;

idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);

slots = kvm_memslots(kvm);
- kvm_for_each_memslot(memslot, slots)
+ kvm_for_each_memslot(memslot, bkt, slots)
stage2_flush_memslot(kvm, memslot);

spin_unlock(&kvm->mmu_lock);
@@ -595,14 +595,14 @@ void stage2_unmap_vm(struct kvm *kvm)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
- int idx;
+ int idx, bkt;

idx = srcu_read_lock(&kvm->srcu);
mmap_read_lock(current->mm);
spin_lock(&kvm->mmu_lock);

slots = kvm_memslots(kvm);
- kvm_for_each_memslot(memslot, slots)
+ kvm_for_each_memslot(memslot, bkt, slots)
stage2_unmap_memslot(kvm, memslot);

spin_unlock(&kvm->mmu_lock);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index c63e263312a4..213232914367 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -734,11 +734,11 @@ void kvmppc_rmap_reset(struct kvm *kvm)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
- int srcu_idx;
+ int srcu_idx, bkt;

srcu_idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);
- kvm_for_each_memslot(memslot, slots) {
+ kvm_for_each_memslot(memslot, bkt, slots) {
/* Mutual exclusion with kvm_unmap_hva_range etc. */
spin_lock(&kvm->mmu_lock);
/*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..2f0609d9d264 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5857,11 +5857,12 @@ static int kvmhv_svm_off(struct kvm *kvm)
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
struct kvm_memory_slot *memslot;
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
+ int bkt;

if (!slots)
continue;

- kvm_for_each_memslot(memslot, slots) {
+ kvm_for_each_memslot(memslot, bkt, slots) {
kvmppc_uvmem_drop_pages(memslot, kvm, true);
uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index ed8a2c9f5629..9435e482d514 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -749,7 +749,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
struct kvm_nested_guest *gp;
struct kvm_nested_guest *freelist = NULL;
struct kvm_memory_slot *memslot;
- int srcu_idx;
+ int srcu_idx, bkt;

spin_lock(&kvm->mmu_lock);
for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
@@ -770,7 +770,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
}

srcu_idx = srcu_read_lock(&kvm->srcu);
- kvm_for_each_memslot(memslot, kvm_memslots(kvm))
+ kvm_for_each_memslot(memslot, bkt, kvm_memslots(kvm))
kvmhv_free_memslot_nest_rmap(memslot);
srcu_read_unlock(&kvm->srcu, srcu_idx);
}
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a7061ee3b157..adc1c495d47c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -459,7 +459,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot, *m;
int ret = H_SUCCESS;
- int srcu_idx;
+ int srcu_idx, bkt;

kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;

@@ -478,7 +478,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)

/* register the memslot */
slots = kvm_memslots(kvm);
- kvm_for_each_memslot(memslot, slots) {
+ kvm_for_each_memslot(memslot, bkt, slots) {
ret = __kvmppc_uvmem_memslot_create(kvm, memslot);
if (ret)
break;
@@ -486,7 +486,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)

if (ret) {
slots = kvm_memslots(kvm);
- kvm_for_each_memslot(m, slots) {
+ kvm_for_each_memslot(m, bkt, slots) {
if (m == memslot)
break;
__kvmppc_uvmem_memslot_delete(kvm, memslot);
@@ -647,7 +647,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,

unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
{
- int srcu_idx;
+ int srcu_idx, bkt;
struct kvm_memory_slot *memslot;

/*
@@ -662,7 +662,7 @@ unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)

srcu_idx = srcu_read_lock(&kvm->srcu);

- kvm_for_each_memslot(memslot, kvm_memslots(kvm))
+ kvm_for_each_memslot(memslot, bkt, kvm_memslots(kvm))
kvmppc_uvmem_drop_pages(memslot, kvm, false);

srcu_read_unlock(&kvm->srcu, srcu_idx);
@@ -821,7 +821,7 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
- int srcu_idx;
+ int srcu_idx, bkt;
long ret = H_SUCCESS;

if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
@@ -830,7 +830,7 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
/* migrate any unmoved normal pfn to device pfns*/
srcu_idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);
- kvm_for_each_memslot(memslot, slots) {
+ kvm_for_each_memslot(memslot, bkt, slots) {
ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
if (ret) {
/*
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index bbc32cd7a9a3..c9992a237c2b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1035,13 +1035,13 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
struct kvm_memory_slot *ms;
struct kvm_memslots *slots;
unsigned long ram_pages = 0;
- int slotnr;
+ int bkt;

/* migration mode already enabled */
if (kvm->arch.migration_mode)
return 0;
slots = kvm_memslots(kvm);
- if (!slots || !slots->used_slots)
+ if (!slots || kvm_memslots_empty(slots))
return -EINVAL;

if (!kvm->arch.use_cmma) {
@@ -1049,8 +1049,7 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
return 0;
}
/* mark all the pages in active slots as dirty */
- for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
- ms = slots->memslots + slotnr;
+ kvm_for_each_memslot(ms, bkt, slots) {
if (!ms->dirty_bitmap)
return -EINVAL;
/*
@@ -1968,22 +1967,21 @@ static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
unsigned long cur_gfn)
{
struct kvm_memory_slot *ms = __gfn_to_memslot_approx(slots, cur_gfn, true);
- int slotidx = ms - slots->memslots;
unsigned long ofs = cur_gfn - ms->base_gfn;
+ struct rb_node *mnode = &ms->gfn_node[slots->node_idx];

if (ms->base_gfn + ms->npages <= cur_gfn) {
- slotidx--;
+ mnode = rb_next(mnode);
/* If we are above the highest slot, wrap around */
- if (slotidx < 0)
- slotidx = slots->used_slots - 1;
+ if (!mnode)
+ mnode = rb_first(&slots->gfn_tree);

- ms = slots->memslots + slotidx;
+ ms = container_of(mnode, struct kvm_memory_slot, gfn_node[slots->node_idx]);
ofs = 0;
}
ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, ofs);
- while ((slotidx > 0) && (ofs >= ms->npages)) {
- slotidx--;
- ms = slots->memslots + slotidx;
+ while (ofs >= ms->npages && (mnode = rb_next(mnode))) {
+ ms = container_of(mnode, struct kvm_memory_slot, gfn_node[slots->node_idx]);
ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, 0);
}
return ms->base_gfn + ofs;
@@ -1996,7 +1994,7 @@ static int kvm_s390_get_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
struct kvm_memslots *slots = kvm_memslots(kvm);
struct kvm_memory_slot *ms;

- if (unlikely(!slots->used_slots))
+ if (unlikely(kvm_memslots_empty(slots)))
return 0;

cur_gfn = kvm_s390_next_dirty_cmma(slots, args->start_gfn);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 73d4659ddd59..c6812c17a4b6 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,12 +211,14 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
/* get the end gfn of the last (highest gfn) memslot */
static inline unsigned long kvm_s390_get_gfn_end(struct kvm_memslots *slots)
{
+ struct rb_node *node;
struct kvm_memory_slot *ms;

- if (WARN_ON(!slots->used_slots))
+ if (WARN_ON(kvm_memslots_empty(slots)))
return 0;

- ms = slots->memslots;
+ node = rb_last(&slots->gfn_tree);
+ ms = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
return ms->base_gfn + ms->npages;
}

diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index 54a83a744538..543a8c04025c 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -107,9 +107,10 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
write_lock(&kvm->mmu_lock);

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ int bkt;
+
slots = __kvm_memslots(kvm, i);
- for (j = 0; j < slots->used_slots; j++) {
- slot = &slots->memslots[j];
+ kvm_for_each_memslot(slot, bkt, slots)
for (k = 0; k < KVM_NR_PAGE_SIZES; k++) {
rmap = slot->arch.rmap[k];
lpage_size = kvm_mmu_slot_lpages(slot, k + 1);
@@ -121,7 +122,6 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
cur[index]++;
}
}
- }
}

write_unlock(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61b9b7b5c10c..a05e581ef210 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5730,8 +5730,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

if (kvm_memslots_have_rmaps(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ int bkt;
+
slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(memslot, slots) {
+ kvm_for_each_memslot(memslot, bkt, slots) {
gfn_t start, end;

start = max(gfn_start, memslot->base_gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0fffb8414009..5a111b159f35 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11405,8 +11405,10 @@ int alloc_all_memslots_rmaps(struct kvm *kvm)
}

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ int bkt;
+
slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(slot, slots) {
+ kvm_for_each_memslot(slot, bkt, slots) {
r = memslot_rmap_alloc(slot, slot->npages);
if (r) {
mutex_unlock(&kvm->slots_arch_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 068877af9756..6433efff447a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -31,6 +31,7 @@
#include <linux/notifier.h>
#include <linux/hashtable.h>
#include <linux/interval_tree.h>
+#include <linux/rbtree.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -358,11 +359,13 @@ struct kvm_vcpu {
struct kvm_dirty_ring dirty_ring;

/*
- * The index of the most recently used memslot by this vCPU. It's ok
- * if this becomes stale due to memslot changes since we always check
- * it is a valid slot.
+ * The most recently used memslot by this vCPU and the slots generation
+ * for which it is valid.
+ * No wraparound protection is needed since generations won't overflow in
+ * thousands of years, even assuming 1M memslot operations per second.
*/
- int last_used_slot;
+ struct kvm_memory_slot *last_used_slot;
+ u64 last_used_slot_gen;
};

/* must be called with irqs disabled */
@@ -427,9 +430,26 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
*/
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

+/*
+ * Since at idle each memslot belongs to two memslot sets it has to contain
+ * two embedded nodes for each data structure that it forms a part of.
+ *
+ * Two memslot sets (one active and one inactive) are necessary so the VM
+ * continues to run on one memslot set while the other is being modified.
+ *
+ * These two memslot sets normally point to the same set of memslots.
+ * They can, however, be desynchronized when performing a memslot management
+ * operation by replacing the memslot to be modified by its copy.
+ * After the operation is complete, both memslot sets once again point to
+ * the same, common set of memslot data.
+ *
+ * The memslots themselves are independent of each other so they can be
+ * individually added or deleted.
+ */
struct kvm_memory_slot {
- struct hlist_node id_node;
- struct interval_tree_node hva_node;
+ struct hlist_node id_node[2];
+ struct interval_tree_node hva_node[2];
+ struct rb_node gfn_node[2];
gfn_t base_gfn;
unsigned long npages;
unsigned long *dirty_bitmap;
@@ -524,19 +544,14 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
}
#endif

-/*
- * Note:
- * memslots are not sorted by id anymore, please use id_to_memslot()
- * to get the memslot by its id.
- */
struct kvm_memslots {
u64 generation;
+ atomic_long_t last_used_slot;
struct rb_root_cached hva_tree;
- /* The mapping table from slot id to the index in memslots[]. */
+ struct rb_root gfn_tree;
+ /* The mapping table from slot id to memslot. */
DECLARE_HASHTABLE(id_hash, 7);
- atomic_t last_used_slot;
- int used_slots;
- struct kvm_memory_slot memslots[];
+ int node_idx;
};

struct kvm {
@@ -557,6 +572,9 @@ struct kvm {
*/
struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ /* The two memslot sets - active and inactive (per address space) */
+ struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
+ /* The current active memslot set for each address space */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];

@@ -731,12 +749,6 @@ static inline int kvm_vcpu_get_idx(struct kvm_vcpu *vcpu)
return vcpu->vcpu_idx;
}

-#define kvm_for_each_memslot(memslot, slots) \
- for (memslot = &slots->memslots[0]; \
- memslot < slots->memslots + slots->used_slots; memslot++) \
- if (WARN_ON_ONCE(!memslot->npages)) { \
- } else
-
void kvm_vcpu_destroy(struct kvm_vcpu *vcpu);

void vcpu_load(struct kvm_vcpu *vcpu);
@@ -797,12 +809,23 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
return __kvm_memslots(vcpu->kvm, as_id);
}

+static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
+{
+ return RB_EMPTY_ROOT(&slots->gfn_tree);
+}
+
+#define kvm_for_each_memslot(memslot, bkt, slots) \
+ hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
+ if (WARN_ON_ONCE(!memslot->npages)) { \
+ } else
+
static inline
struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
{
struct kvm_memory_slot *slot;
+ int idx = slots->node_idx;

- hash_for_each_possible(slots->id_hash, slot, id_node, id) {
+ hash_for_each_possible(slots->id_hash, slot, id_node[idx], id) {
if (slot->id == id)
return slot;
}
@@ -1205,25 +1228,15 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);

/*
- * Returns a pointer to the memslot at slot_index if it contains gfn.
+ * Returns a pointer to the memslot if it contains gfn.
* Otherwise returns NULL.
*/
static inline struct kvm_memory_slot *
-try_get_memslot(struct kvm_memslots *slots, int slot_index, gfn_t gfn)
+try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
- struct kvm_memory_slot *slot;
-
- if (slot_index < 0 || slot_index >= slots->used_slots)
+ if (!slot)
return NULL;

- /*
- * slot_index can come from vcpu->last_used_slot which is not kept
- * in sync with userspace-controllable memslot deletion. So use nospec
- * to prevent the CPU from speculating past the end of memslots[].
- */
- slot_index = array_index_nospec(slot_index, slots->used_slots);
- slot = &slots->memslots[slot_index];
-
if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
return slot;
else
@@ -1231,50 +1244,31 @@ try_get_memslot(struct kvm_memslots *slots, int slot_index, gfn_t gfn)
}

/*
- * Returns a pointer to the memslot that contains gfn and records the index of
- * the slot in index. Otherwise returns NULL.
+ * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
*
* With "approx" set returns the memslot also when the address falls
* in a hole. In that case one of the memslots bordering the hole is
* returned.
- *
- * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
*/
static inline struct kvm_memory_slot *
-search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index, bool approx)
+search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
{
- int start = 0, end = slots->used_slots;
- struct kvm_memory_slot *memslots = slots->memslots;
struct kvm_memory_slot *slot;
-
- if (unlikely(!slots->used_slots))
- return NULL;
-
- while (start < end) {
- int slot = start + (end - start) / 2;
-
- if (gfn >= memslots[slot].base_gfn)
- end = slot;
- else
- start = slot + 1;
- }
-
- if (approx && start >= slots->used_slots) {
- *index = slots->used_slots - 1;
- return &memslots[slots->used_slots - 1];
- }
-
- slot = try_get_memslot(slots, start, gfn);
- if (slot) {
- *index = start;
- return slot;
- }
- if (approx) {
- *index = start;
- return &memslots[start];
+ struct rb_node *node;
+ int idx = slots->node_idx;
+
+ slot = NULL;
+ for (node = slots->gfn_tree.rb_node; node; ) {
+ slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
+ if (gfn >= slot->base_gfn) {
+ if (gfn < slot->base_gfn + slot->npages)
+ return slot;
+ node = node->rb_right;
+ } else
+ node = node->rb_left;
}

- return NULL;
+ return approx ? slot : NULL;
}

/*
@@ -1286,15 +1280,15 @@ static inline struct kvm_memory_slot *
__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
{
struct kvm_memory_slot *slot;
- int slot_index = atomic_read(&slots->last_used_slot);

- slot = try_get_memslot(slots, slot_index, gfn);
+ slot = (struct kvm_memory_slot *)atomic_long_read(&slots->last_used_slot);
+ slot = try_get_memslot(slot, gfn);
if (slot)
return slot;

- slot = search_memslots(slots, gfn, &slot_index, approx);
+ slot = search_memslots(slots, gfn, approx);
if (slot) {
- atomic_set(&slots->last_used_slot, slot_index);
+ atomic_long_set(&slots->last_used_slot, (unsigned long)slot);
return slot;
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ed780996910..5fea467d6fec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -406,7 +406,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
vcpu->preempted = false;
vcpu->ready = false;
preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
- vcpu->last_used_slot = 0;
+ vcpu->last_used_slot = NULL;
}

void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -505,7 +505,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
range->start, range->end - 1) {
unsigned long hva_start, hva_end;

- slot = container_of(node, struct kvm_memory_slot, hva_node);
+ slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
hva_start = max(range->start, slot->userspace_addr);
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));
@@ -836,20 +836,6 @@ static void kvm_destroy_pm_notifier(struct kvm *kvm)
}
#endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */

-static struct kvm_memslots *kvm_alloc_memslots(void)
-{
- struct kvm_memslots *slots;
-
- slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
- if (!slots)
- return NULL;
-
- slots->hva_tree = RB_ROOT_CACHED;
- hash_init(slots->id_hash);
-
- return slots;
-}
-
static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
{
if (!memslot->dirty_bitmap)
@@ -859,27 +845,33 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
memslot->dirty_bitmap = NULL;
}

+/* This does not remove the slot from struct kvm_memslots data structures */
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
kvm_destroy_dirty_bitmap(slot);

kvm_arch_free_memslot(kvm, slot);

- slot->flags = 0;
- slot->npages = 0;
+ kfree(slot);
}

static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
{
+ struct hlist_node *idnode;
struct kvm_memory_slot *memslot;
+ int bkt;

- if (!slots)
+ /*
+ * The same memslot objects live in both active and inactive sets,
+ * arbitrarily free using index '1' so the second invocation of this
+ * function isn't operating over a structure with dangling pointers
+ * (even though this function isn't actually touching them).
+ */
+ if (!slots->node_idx)
return;

- kvm_for_each_memslot(memslot, slots)
+ hash_for_each_safe(slots->id_hash, bkt, idnode, memslot, id_node[1])
kvm_free_memslot(kvm, memslot);
-
- kvfree(slots);
}

static umode_t kvm_stats_debugfs_mode(const struct _kvm_stats_desc *pdesc)
@@ -1018,8 +1010,9 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
static struct kvm *kvm_create_vm(unsigned long type)
{
struct kvm *kvm = kvm_arch_alloc_vm();
+ struct kvm_memslots *slots;
int r = -ENOMEM;
- int i;
+ int i, j;

if (!kvm)
return ERR_PTR(-ENOMEM);
@@ -1046,13 +1039,20 @@ static struct kvm *kvm_create_vm(unsigned long type)

refcount_set(&kvm->users_count, 1);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_memslots *slots = kvm_alloc_memslots();
+ for (j = 0; j < 2; j++) {
+ slots = &kvm->__memslots[i][j];

- if (!slots)
- goto out_err_no_arch_destroy_vm;
- /* Generations must be different for each address space. */
- slots->generation = i;
- rcu_assign_pointer(kvm->memslots[i], slots);
+ atomic_long_set(&slots->last_used_slot, (unsigned long)NULL);
+ slots->hva_tree = RB_ROOT_CACHED;
+ slots->gfn_tree = RB_ROOT;
+ hash_init(slots->id_hash);
+ slots->node_idx = j;
+
+ /* Generations must be different for each address space. */
+ slots->generation = i;
+ }
+
+ rcu_assign_pointer(kvm->memslots[i], &kvm->__memslots[i][0]);
}

for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -1106,8 +1106,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
for (i = 0; i < KVM_NR_BUSES; i++)
kfree(kvm_get_bus(kvm, i));
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
cleanup_srcu_struct(&kvm->irq_srcu);
out_err_no_irq_srcu:
cleanup_srcu_struct(&kvm->srcu);
@@ -1172,8 +1170,10 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_destroy_vm(kvm);
kvm_destroy_devices(kvm);
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
+ kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
+ }
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
kvm_arch_free_vm(kvm);
@@ -1243,217 +1243,6 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
return 0;
}

-/*
- * Delete a memslot by decrementing the number of used slots and shifting all
- * other entries in the array forward one spot.
- * @memslot is a detached dummy struct with just .id and .as_id filled.
- */
-static inline void kvm_memslot_delete(struct kvm_memslots *slots,
- struct kvm_memory_slot *memslot)
-{
- struct kvm_memory_slot *mslots = slots->memslots;
- struct kvm_memory_slot *oldslot = id_to_memslot(slots, memslot->id);
- int i;
-
- if (WARN_ON(!oldslot))
- return;
-
- slots->used_slots--;
-
- if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
- atomic_set(&slots->last_used_slot, 0);
-
- for (i = oldslot - mslots; i < slots->used_slots; i++) {
- interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
- hash_del(&mslots[i].id_node);
-
- mslots[i] = mslots[i + 1];
- interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
- }
- interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
- hash_del(&mslots[i].id_node);
- mslots[i] = *memslot;
-}
-
-/*
- * "Insert" a new memslot by incrementing the number of used slots. Returns
- * the new slot's initial index into the memslots array.
- */
-static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
-{
- return slots->used_slots++;
-}
-
-/*
- * Move a changed memslot backwards in the array by shifting existing slots
- * with a higher GFN toward the front of the array. Note, the changed memslot
- * itself is not preserved in the array, i.e. not swapped at this time, only
- * its new index into the array is tracked. Returns the changed memslot's
- * current index into the memslots array.
- * The memslot at the returned index will not be in @slots->hva_tree or
- * @slots->id_hash by then.
- * @memslot is a detached struct with desired final data of the changed slot.
- */
-static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
- struct kvm_memory_slot *memslot)
-{
- struct kvm_memory_slot *mslots = slots->memslots;
- struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
- int i;
-
- if (!mmemslot || !slots->used_slots)
- return -1;
-
- /*
- * The loop below will (possibly) overwrite the target memslot with
- * data of the next memslot, or a similar loop in
- * kvm_memslot_move_forward() will overwrite it with data of the
- * previous memslot.
- * Then update_memslots() will unconditionally overwrite and re-add
- * it to the hash table.
- * That's why the memslot has to be first removed from the hash table
- * here.
- */
- interval_tree_remove(&mmemslot->hva_node, &slots->hva_tree);
- hash_del(&mmemslot->id_node);
-
- /*
- * Move the target memslot backward in the array by shifting existing
- * memslots with a higher GFN (than the target memslot) towards the
- * front of the array.
- */
- for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
- if (memslot->base_gfn > mslots[i + 1].base_gfn)
- break;
-
- WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
-
- /* Shift the next memslot forward one and update its index. */
- interval_tree_remove(&mslots[i + 1].hva_node, &slots->hva_tree);
- hash_del(&mslots[i + 1].id_node);
-
- mslots[i] = mslots[i + 1];
- interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
- }
- return i;
-}
-
-/*
- * Move a changed memslot forwards in the array by shifting existing slots with
- * a lower GFN toward the back of the array. Note, the changed memslot itself
- * is not preserved in the array, i.e. not swapped at this time, only its new
- * index into the array is tracked. Returns the changed memslot's final index
- * into the memslots array.
- * The memslot at the returned index will not be in @slots->hva_tree or
- * @slots->id_hash by then.
- * @memslot is a detached struct with desired final data of the new or
- * changed slot.
- * Assumes that the memslot at @start index is not in @slots->hva_tree or
- * @slots->id_hash.
- */
-static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
- struct kvm_memory_slot *memslot,
- int start)
-{
- struct kvm_memory_slot *mslots = slots->memslots;
- int i;
-
- for (i = start; i > 0; i--) {
- if (memslot->base_gfn < mslots[i - 1].base_gfn)
- break;
-
- WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
-
- /* Shift the next memslot back one and update its index. */
- interval_tree_remove(&mslots[i - 1].hva_node, &slots->hva_tree);
- hash_del(&mslots[i - 1].id_node);
-
- mslots[i] = mslots[i - 1];
- interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
- }
- return i;
-}
-
-/*
- * Re-sort memslots based on their GFN to account for an added, deleted, or
- * moved memslot. Sorting memslots by GFN allows using a binary search during
- * memslot lookup.
- *
- * IMPORTANT: Slots are sorted from highest GFN to lowest GFN! I.e. the entry
- * at memslots[0] has the highest GFN.
- *
- * The sorting algorithm takes advantage of having initially sorted memslots
- * and knowing the position of the changed memslot. Sorting is also optimized
- * by not swapping the updated memslot and instead only shifting other memslots
- * and tracking the new index for the update memslot. Only once its final
- * index is known is the updated memslot copied into its position in the array.
- *
- * - When deleting a memslot, the deleted memslot simply needs to be moved to
- * the end of the array.
- *
- * - When creating a memslot, the algorithm "inserts" the new memslot at the
- * end of the array and then it forward to its correct location.
- *
- * - When moving a memslot, the algorithm first moves the updated memslot
- * backward to handle the scenario where the memslot's GFN was changed to a
- * lower value. update_memslots() then falls through and runs the same flow
- * as creating a memslot to move the memslot forward to handle the scenario
- * where its GFN was changed to a higher value.
- *
- * Note, slots are sorted from highest->lowest instead of lowest->highest for
- * historical reasons. Originally, invalid memslots where denoted by having
- * GFN=0, thus sorting from highest->lowest naturally sorted invalid memslots
- * to the end of the array. The current algorithm uses dedicated logic to
- * delete a memslot and thus does not rely on invalid memslots having GFN=0.
- *
- * The other historical motiviation for highest->lowest was to improve the
- * performance of memslot lookup. KVM originally used a linear search starting
- * at memslots[0]. On x86, the largest memslot usually has one of the highest,
- * if not *the* highest, GFN, as the bulk of the guest's RAM is located in a
- * single memslot above the 4gb boundary. As the largest memslot is also the
- * most likely to be referenced, sorting it to the front of the array was
- * advantageous. The current binary search starts from the middle of the array
- * and uses an LRU pointer to improve performance for all memslots and GFNs.
- *
- * @memslot is a detached struct, not a part of the current or new memslot
- * array.
- */
-static void update_memslots(struct kvm_memslots *slots,
- struct kvm_memory_slot *memslot,
- enum kvm_mr_change change)
-{
- int i;
-
- if (change == KVM_MR_DELETE) {
- kvm_memslot_delete(slots, memslot);
- } else {
- if (change == KVM_MR_CREATE)
- i = kvm_memslot_insert_back(slots);
- else
- i = kvm_memslot_move_backward(slots, memslot);
- i = kvm_memslot_move_forward(slots, memslot, i);
-
- if (WARN_ON_ONCE(i < 0))
- return;
-
- /*
- * Copy the memslot to its new position in memslots and update
- * its index accordingly.
- */
- slots->memslots[i] = *memslot;
- slots->memslots[i].hva_node.start = memslot->userspace_addr;
- slots->memslots[i].hva_node.last = memslot->userspace_addr +
- (memslot->npages << PAGE_SHIFT) - 1;
- interval_tree_insert(&slots->memslots[i].hva_node,
- &slots->hva_tree);
- hash_add(slots->id_hash, &slots->memslots[i].id_node,
- memslot->id);
- }
-}
-
static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -1468,11 +1257,12 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
return 0;
}

-static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
- int as_id, struct kvm_memslots *slots)
+static void kvm_swap_active_memslots(struct kvm *kvm, int as_id,
+ struct kvm_memslots **active,
+ struct kvm_memslots **inactive)
{
- struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
- u64 gen = old_memslots->generation;
+ struct kvm_memslots *slots = *inactive;
+ u64 gen = (*active)->generation;

WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
@@ -1524,61 +1314,139 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,

slots->generation = gen;

- return old_memslots;
+ swap(*active, *inactive);
}

-static size_t kvm_memslots_size(int slots)
+static void kvm_memslot_gfn_insert(struct kvm_memslots *slots,
+ struct kvm_memory_slot *slot)
{
- return sizeof(struct kvm_memslots) +
- (sizeof(struct kvm_memory_slot) * slots);
+ struct rb_root *gfn_tree = &slots->gfn_tree;
+ struct rb_node **node, *parent;
+ int idx = slots->node_idx;
+
+ parent = NULL;
+ for (node = &gfn_tree->rb_node; *node; ) {
+ struct kvm_memory_slot *tmp;
+
+ tmp = container_of(*node, struct kvm_memory_slot, gfn_node[idx]);
+ parent = *node;
+ if (slot->base_gfn < tmp->base_gfn)
+ node = &(*node)->rb_left;
+ else if (slot->base_gfn > tmp->base_gfn)
+ node = &(*node)->rb_right;
+ else
+ BUG();
+ }
+
+ rb_link_node(&slot->gfn_node[idx], parent, node);
+ rb_insert_color(&slot->gfn_node[idx], gfn_tree);
}

-static void kvm_copy_memslots(struct kvm_memslots *to,
- struct kvm_memslots *from)
+static void kvm_memslot_gfn_erase(struct kvm_memslots *slots,
+ struct kvm_memory_slot *slot)
{
- memcpy(to, from, kvm_memslots_size(from->used_slots));
+ rb_erase(&slot->gfn_node[slots->node_idx], &slots->gfn_tree);
}

-static void kvm_copy_memslots_arch(struct kvm_memslots *to,
- struct kvm_memslots *from)
+static void kvm_memslot_gfn_replace(struct kvm_memslots *slots,
+ struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new)
{
- int i;
+ int idx = slots->node_idx;
+
+ WARN_ON_ONCE(old->base_gfn != new->base_gfn);

- for (i = 0; i < from->used_slots; i++)
- to->memslots[i].arch = from->memslots[i].arch;
+ rb_replace_node(&old->gfn_node[idx], &new->gfn_node[idx],
+ &slots->gfn_tree);
}

-/*
- * Note, at a minimum, the current number of used slots must be allocated, even
- * when deleting a memslot, as we need a complete duplicate of the memslots for
- * use when invalidating a memslot prior to deleting/moving the memslot.
- */
-static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
- enum kvm_mr_change change)
+static void kvm_copy_memslot(struct kvm_memory_slot *dest,
+ const struct kvm_memory_slot *src)
{
- struct kvm_memslots *slots;
- size_t new_size;
- struct kvm_memory_slot *memslot;
+ dest->base_gfn = src->base_gfn;
+ dest->npages = src->npages;
+ dest->dirty_bitmap = src->dirty_bitmap;
+ dest->arch = src->arch;
+ dest->userspace_addr = src->userspace_addr;
+ dest->flags = src->flags;
+ dest->id = src->id;
+ dest->as_id = src->as_id;

- if (change == KVM_MR_CREATE)
- new_size = kvm_memslots_size(old->used_slots + 1);
- else
- new_size = kvm_memslots_size(old->used_slots);
+ dest->hva_node[0].start = dest->hva_node[1].start =
+ dest->userspace_addr;
+ dest->hva_node[0].last = dest->hva_node[1].last =
+ dest->userspace_addr + (dest->npages << PAGE_SHIFT) - 1;
+}

- slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
- if (unlikely(!slots))
- return NULL;
+/*
+ * Replace @old with @new in @slots.
+ *
+ * With NULL @old this simply adds @new to @slots.
+ * With NULL @new this simply removes @old from @slots.
+ *
+ * If @new is non-NULL its hva_node[slots_idx] range has to be set
+ * appropriately.
+ */
+static void kvm_replace_memslot(struct kvm_memslots *slots,
+ struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new)
+{
+ int idx = slots->node_idx;
+
+ if (old) {
+ hash_del(&old->id_node[idx]);
+ interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
+ atomic_long_cmpxchg(&slots->last_used_slot,
+ (unsigned long)old, (unsigned long)new);
+ if (!new) {
+ kvm_memslot_gfn_erase(slots, old);
+ return;
+ }
+ }

- kvm_copy_memslots(slots, old);
+ WARN_ON(PAGE_SHIFT > 0 &&
+ new->hva_node[idx].start >= new->hva_node[idx].last);
+ hash_add(slots->id_hash, &new->id_node[idx], new->id);
+ interval_tree_insert(&new->hva_node[idx], &slots->hva_tree);

- slots->hva_tree = RB_ROOT_CACHED;
- hash_init(slots->id_hash);
- kvm_for_each_memslot(memslot, slots) {
- interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
- hash_add(slots->id_hash, &memslot->id_node, memslot->id);
+ /* Shame there is no O(1) interval_tree_replace()... */
+ if (old && old->base_gfn == new->base_gfn) {
+ kvm_memslot_gfn_replace(slots, old, new);
+ } else {
+ if (old)
+ kvm_memslot_gfn_erase(slots, old);
+ kvm_memslot_gfn_insert(slots, new);
}
+}

- return slots;
+/*
+ * Replace @old with @new in @active set, first activating the @inactive
+ * set so @active will no longer be active and can be modified.
+ * Then free @old and return with pointers in @active and @inactive swapped
+ * (since the actual active <-> inactive sets have been swapped).
+ *
+ * With NULL @old this simply adds @new to @active (while swapping the sets).
+ * With NULL @new this simply removes @old from @active and frees it
+ * (while also swapping the sets).
+ */
+static void kvm_activate_memslot(struct kvm *kvm, int as_id,
+ struct kvm_memslots **active,
+ struct kvm_memslots **inactive,
+ struct kvm_memory_slot *old,
+ struct kvm_memory_slot *new)
+{
+ /*
+ * Swap the active <-> inactive memslots.
+ * Note, this also swaps the active and inactive pointers themselves
+ * and releases slots_arch_lock.
+ */
+ kvm_swap_active_memslots(kvm, as_id, active, inactive);
+
+ /* Propagate the new memslot to the now inactive memslots. */
+ kvm_replace_memslot(*inactive, old, new);
+
+ /* And free the old slot (if there was one). */
+ kfree(old);
}

static int kvm_set_memslot(struct kvm *kvm,
@@ -1587,16 +1455,47 @@ static int kvm_set_memslot(struct kvm *kvm,
struct kvm_memory_slot *new, int as_id,
enum kvm_mr_change change)
{
- struct kvm_memory_slot *slot;
- struct kvm_memslots *slots;
+ struct kvm_memslots *active = __kvm_memslots(kvm, as_id);
+ int node_idx_inactive = active->node_idx == 0 ? 1 : 0;
+ struct kvm_memslots *inactive = &kvm->__memslots[as_id][node_idx_inactive];
+ /*
+ * "slotina" (from "slot inactive") is a slot that is never in the
+ * active memslot set.
+ * This slot may be a part of the inactive memslot set or it might be detached.
+ *
+ * Conversely, an "slotact" (from "slot active") is a slot that is
+ * in the active memslot set.
+ * This slot might be a part of the inactive memslot set, too.
+ *
+ * The above terms only apply to a particular variable if it is going
+ * to see further accesses later during this function execution
+ * (that is, an invariant may no longer be true if the particular variable
+ * won't be accessed anymore).
+ */
+ struct kvm_memory_slot *slotina, *slotact;
int r;

+ if (change != KVM_MR_CREATE) {
+ slotact = id_to_memslot(active, old->id);
+ if (WARN_ON_ONCE(!slotact))
+ return -EIO;
+ }
+
+ /*
+ * Modifications are done on a temporary, unreachable slot.
+ * The changes are then (eventually) propagated to both the
+ * active and inactive slots.
+ */
+ slotina = kzalloc(sizeof(*slotina), GFP_KERNEL_ACCOUNT);
+ if (!slotina)
+ return -ENOMEM;
+
/*
- * Released in install_new_memslots.
+ * Released in kvm_swap_active_memslots.
*
* Must be held from before the current memslots are copied until
* after the new memslots are installed with rcu_assign_pointer,
- * then released before the synchronize srcu in install_new_memslots.
+ * then released before the synchronize srcu in kvm_swap_active_memslots.
*
* When modifying memslots outside of the slots_lock, must be held
* before reading the pointer to the current memslots until after all
@@ -1607,68 +1506,145 @@ static int kvm_set_memslot(struct kvm *kvm,
*/
mutex_lock(&kvm->slots_arch_lock);

- slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
- if (!slots) {
- mutex_unlock(&kvm->slots_arch_lock);
- return -ENOMEM;
- }
-
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
/*
- * Note, the INVALID flag needs to be in the appropriate entry
- * in the freshly allocated memslots, not in @old or @new.
+ * Mark the current slot INVALID.
+ * This must be done on the temporary slot to avoid
+ * modifying the current slot in the active tree.
*/
- slot = id_to_memslot(slots, old->id);
- slot->flags |= KVM_MEMSLOT_INVALID;
+ kvm_copy_memslot(slotina, slotact);
+ slotina->flags |= KVM_MEMSLOT_INVALID;
+ kvm_replace_memslot(inactive, slotact, slotina);
+
+ /*
+ * Activate the slot that is now marked INVALID, but don't
+ * propagate the slot to the now inactive slots. The slot is
+ * either going to be deleted or recreated as a new slot.
+ */
+ kvm_swap_active_memslots(kvm, as_id, &active, &inactive);

/*
- * We can re-use the memory from the old memslots.
- * It will be overwritten with a copy of the new memslots
- * after reacquiring the slots_arch_lock below.
+ * The temporary and current slot have swapped roles,
+ * slotina is now in the active set and slotact is not,
+ * so swap the variables appropriately, too.
*/
- slots = install_new_memslots(kvm, as_id, slots);
+ swap(slotina, slotact);

- /* From this point no new shadow pages pointing to a deleted,
+ /*
+ * From this point no new shadow pages pointing to a deleted,
* or moved, memslot will be created.
*
* validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
- kvm_arch_flush_shadow_memslot(kvm, slot);
+ kvm_arch_flush_shadow_memslot(kvm, slotact);

- /* Released in install_new_memslots. */
+ /* Was released by kvm_swap_active_memslots, reacquire. */
mutex_lock(&kvm->slots_arch_lock);
+ }

+ if (change != KVM_MR_CREATE) {
/*
- * The arch-specific fields of the memslots could have changed
- * between releasing the slots_arch_lock in
- * install_new_memslots and here, so get a fresh copy of these
- * fields.
+ * The arch-specific fields of the memslot could have changed
+ * between reading them and taking slots_arch_lock in one of two
+ * places above.
+ * That includes old and new which were read in __kvm_set_memory_region.
*/
- kvm_copy_memslots_arch(slots, __kvm_memslots(kvm, as_id));
+ old->arch = new->arch = slotina->arch = slotact->arch;
}

r = kvm_arch_prepare_memory_region(kvm, old, new, mem, change);
- if (r)
- goto out_slots;
+ if (r) {
+ if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+ /*
+ * Revert the above INVALID change.
+ * No modifications required since the original slot
+ * was preserved in the inactive slots.
+ * This also frees the temporary slot and releases slots_arch_lock.
+ */
+ kvm_activate_memslot(kvm, as_id, &active, &inactive, slotact, slotina);
+ } else {
+ mutex_unlock(&kvm->slots_arch_lock);
+ kfree(slotina);
+ }
+ return r;
+ }

- update_memslots(slots, new, change);
- slots = install_new_memslots(kvm, as_id, slots);
+ if (change == KVM_MR_MOVE) {
+ /*
+ * The memslot's gfn is changing, remove it from the inactive
+ * tree, it will be re-added with its updated gfn. Because its
+ * range is changing, an in-place replace is not possible.
+ */
+ kvm_memslot_gfn_erase(inactive, slotina);

- kvm_arch_commit_memory_region(kvm, mem, old, new, change);
+ slotina->base_gfn = new->base_gfn;
+ slotina->flags = new->flags;
+ slotina->dirty_bitmap = new->dirty_bitmap;
+ /* kvm_arch_prepare_memory_region() might have modified arch */
+ slotina->arch = new->arch;

- kvfree(slots);
- return 0;
+ /* Re-add to the gfn tree with the updated gfn */
+ kvm_memslot_gfn_insert(inactive, slotina);

-out_slots:
- if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
- slots = install_new_memslots(kvm, as_id, slots);
+ /* Replace the current INVALID slot with the updated memslot. */
+ kvm_activate_memslot(kvm, as_id, &active, &inactive, slotact, slotina);
+ } else if (change == KVM_MR_FLAGS_ONLY) {
+ /*
+ * Similar to the MOVE case, but the slot doesn't need to be
+ * zapped as an intermediate step. Instead, the old memslot is
+ * simply replaced with a new, updated copy in both memslot sets.
+ *
+ * Since the memslot gfn is unchanged, kvm_copy_replace_memslot()
+ * and kvm_memslot_gfn_replace() can be used to switch the node
+ * in the gfn tree instead of removing the old and inserting the
+ * new as two separate operations. Replacement is a single O(1)
+ * operation versus two O(log(n)) operations for remove+insert.
+ */
+ kvm_copy_memslot(slotina, slotact);
+ slotina->flags = new->flags;
+ slotina->dirty_bitmap = new->dirty_bitmap;
+ /* kvm_arch_prepare_memory_region() might have modified arch */
+ slotina->arch = new->arch;
+ kvm_replace_memslot(inactive, slotact, slotina);
+
+ kvm_activate_memslot(kvm, as_id, &active, &inactive, slotact, slotina);
+ } else if (change == KVM_MR_CREATE) {
+ /*
+ * Add the new memslot to the inactive set as a copy of the
+ * new memslot data provided by userspace.
+ */
+ kvm_copy_memslot(slotina, new);
+ kvm_replace_memslot(inactive, NULL, slotina);
+
+ kvm_activate_memslot(kvm, as_id, &active, &inactive, NULL, slotina);
+ } else if (change == KVM_MR_DELETE) {
+ /*
+ * Remove the old memslot (in the inactive memslots)
+ * by passing NULL as the new slot.
+ */
+ kvm_replace_memslot(inactive, slotina, NULL);
+ kvm_activate_memslot(kvm, as_id, &active, &inactive, slotact, NULL);
} else {
- mutex_unlock(&kvm->slots_arch_lock);
+ BUG();
}
- kvfree(slots);
- return r;
+
+ /*
+ * No need to refresh new->arch since this runs without slots_arch_lock anyway
+ * (was released by kvm_activate_memslot call in one of the branches above).
+ */
+ kvm_arch_commit_memory_region(kvm, mem, old, new, change);
+
+ /*
+ * Free the memslot and its metadata.
+ * Note, slotact and slotina hold the same metadata, but slotact
+ * was freed by kvm_activate_memslot(). It's slotina's turn now.
+ */
+ if (change == KVM_MR_DELETE)
+ kvm_free_memslot(kvm, slotina);
+
+ return 0;
}

static int kvm_delete_memslot(struct kvm *kvm,
@@ -1676,7 +1652,6 @@ static int kvm_delete_memslot(struct kvm *kvm,
struct kvm_memory_slot *old, int as_id)
{
struct kvm_memory_slot new;
- int r;

if (!old->npages)
return -EINVAL;
@@ -1689,12 +1664,7 @@ static int kvm_delete_memslot(struct kvm *kvm,
*/
new.as_id = as_id;

- r = kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
- if (r)
- return r;
-
- kvm_free_memslot(kvm, old);
- return 0;
+ return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
}

/*
@@ -1737,12 +1707,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
return -EINVAL;

- /*
- * Make a full copy of the old memslot, the pointer will become stale
- * when the memslots are re-sorted by update_memslots(), and the old
- * memslot needs to be referenced after calling update_memslots(), e.g.
- * to free its resources and for arch specific behavior.
- */
tmp = id_to_memslot(__kvm_memslots(kvm, as_id), id);
if (tmp) {
old = *tmp;
@@ -1788,8 +1752,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
}

if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+ int bkt;
+
/* Check for overlaps */
- kvm_for_each_memslot(tmp, __kvm_memslots(kvm, as_id)) {
+ kvm_for_each_memslot(tmp, bkt, __kvm_memslots(kvm, as_id)) {
if (tmp->id == id)
continue;
if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
@@ -2126,21 +2092,30 @@ EXPORT_SYMBOL_GPL(gfn_to_memslot);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
{
struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
+ u64 gen = slots->generation;
struct kvm_memory_slot *slot;
- int slot_index;

- slot = try_get_memslot(slots, vcpu->last_used_slot, gfn);
+ /*
+ * This also protects against using a memslot from a different address space,
+ * since different address spaces have different generation numbers.
+ */
+ if (unlikely(gen != vcpu->last_used_slot_gen)) {
+ vcpu->last_used_slot = NULL;
+ vcpu->last_used_slot_gen = gen;
+ }
+
+ slot = try_get_memslot(vcpu->last_used_slot, gfn);
if (slot)
return slot;

/*
* Fall back to searching all memslots. We purposely use
* search_memslots() instead of __gfn_to_memslot() to avoid
- * thrashing the VM-wide last_used_index in kvm_memslots.
+ * thrashing the VM-wide last_used_slot in kvm_memslots.
*/
- slot = search_memslots(slots, gfn, &slot_index, false);
+ slot = search_memslots(slots, gfn, false);
if (slot) {
- vcpu->last_used_slot = slot_index;
+ vcpu->last_used_slot = slot;
return slot;
}

2021-09-21 03:07:02

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 09/13] KVM: Use interval tree to do fast hva lookup in memslots

From: "Maciej S. Szmigiero" <[email protected]>

The current memslots implementation only allows quick binary search by gfn,
quick lookup by hva is not possible - the implementation has to do a linear
scan of the whole memslots array, even though the operation being performed
might apply just to a single memslot.

This significantly hurts performance of per-hva operations with higher
memslot counts.

Since hva ranges can overlap between memslots an interval tree is needed
for tracking them.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/arm64/kvm/Kconfig | 1 +
arch/mips/kvm/Kconfig | 1 +
arch/powerpc/kvm/Kconfig | 1 +
arch/s390/kvm/Kconfig | 1 +
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 48 ++++++++++++++++++++++++++++++++++------
7 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index d7eec0b43744..42185dcc9596 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
+ select INTERVAL_TREE
help
Support hosting virtualized guest machines.

diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a77297480f56..91d197bee9c0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -27,6 +27,7 @@ config KVM
select KVM_MMIO
select MMU_NOTIFIER
select SRCU
+ select INTERVAL_TREE
help
Support for hosting Guest kernels.

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index ff581d70f20c..e4c24f524ba8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -26,6 +26,7 @@ config KVM
select KVM_VFIO
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
+ select INTERVAL_TREE

config KVM_BOOK3S_HANDLER
bool
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 67a8e770e369..2e84d3922f7c 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -33,6 +33,7 @@ config KVM
select HAVE_KVM_NO_POLL
select SRCU
select KVM_VFIO
+ select INTERVAL_TREE
help
Support hosting paravirtualized guest machines using the SIE
virtualization capability on the mainframe. This should work
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ac69894eab88..572325ec8085 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -43,6 +43,7 @@ config KVM
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_VFIO
select SRCU
+ select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
help
Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d2acc00a6472..068877af9756 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -30,6 +30,7 @@
#include <linux/nospec.h>
#include <linux/notifier.h>
#include <linux/hashtable.h>
+#include <linux/interval_tree.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -428,6 +429,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)

struct kvm_memory_slot {
struct hlist_node id_node;
+ struct interval_tree_node hva_node;
gfn_t base_gfn;
unsigned long npages;
unsigned long *dirty_bitmap;
@@ -529,6 +531,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
*/
struct kvm_memslots {
u64 generation;
+ struct rb_root_cached hva_tree;
/* The mapping table from slot id to the index in memslots[]. */
DECLARE_HASHTABLE(id_hash, 7);
atomic_t last_used_slot;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 50597608d085..7ed780996910 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -472,6 +472,12 @@ static void kvm_null_fn(void)
}
#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)

+/* Iterate over each memslot intersecting [start, last] (inclusive) range */
+#define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
+ for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
+ node; \
+ node = interval_tree_iter_next(node, start, last)) \
+
static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
const struct kvm_hva_range *range)
{
@@ -481,6 +487,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_memslots *slots;
int i, idx;

+ if (WARN_ON_ONCE(range->end <= range->start))
+ return 0;
+
/* A null handler is allowed if and only if on_lock() is provided. */
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
IS_KVM_NULL_FN(range->handler)))
@@ -489,15 +498,17 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
idx = srcu_read_lock(&kvm->srcu);

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ struct interval_tree_node *node;
+
slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(slot, slots) {
+ kvm_for_each_memslot_in_hva_range(node, slots,
+ range->start, range->end - 1) {
unsigned long hva_start, hva_end;

+ slot = container_of(node, struct kvm_memory_slot, hva_node);
hva_start = max(range->start, slot->userspace_addr);
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));
- if (hva_start >= hva_end)
- continue;

/*
* To optimize for the likely case where the address
@@ -833,6 +844,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
if (!slots)
return NULL;

+ slots->hva_tree = RB_ROOT_CACHED;
hash_init(slots->id_hash);

return slots;
@@ -1252,10 +1264,14 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
atomic_set(&slots->last_used_slot, 0);

for (i = oldslot - mslots; i < slots->used_slots; i++) {
+ interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
hash_del(&mslots[i].id_node);
+
mslots[i] = mslots[i + 1];
+ interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
+ interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
hash_del(&mslots[i].id_node);
mslots[i] = *memslot;
}
@@ -1275,7 +1291,8 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
* itself is not preserved in the array, i.e. not swapped at this time, only
* its new index into the array is tracked. Returns the changed memslot's
* current index into the memslots array.
- * The memslot at the returned index will not be in @slots->id_hash by then.
+ * The memslot at the returned index will not be in @slots->hva_tree or
+ * @slots->id_hash by then.
* @memslot is a detached struct with desired final data of the changed slot.
*/
static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
@@ -1298,6 +1315,7 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
* That's why the memslot has to be first removed from the hash table
* here.
*/
+ interval_tree_remove(&mmemslot->hva_node, &slots->hva_tree);
hash_del(&mmemslot->id_node);

/*
@@ -1312,8 +1330,11 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);

/* Shift the next memslot forward one and update its index. */
+ interval_tree_remove(&mslots[i + 1].hva_node, &slots->hva_tree);
hash_del(&mslots[i + 1].id_node);
+
mslots[i] = mslots[i + 1];
+ interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
return i;
@@ -1325,10 +1346,12 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
* is not preserved in the array, i.e. not swapped at this time, only its new
* index into the array is tracked. Returns the changed memslot's final index
* into the memslots array.
- * The memslot at the returned index will not be in @slots->id_hash by then.
+ * The memslot at the returned index will not be in @slots->hva_tree or
+ * @slots->id_hash by then.
* @memslot is a detached struct with desired final data of the new or
* changed slot.
- * Assumes that the memslot at @start index is not in @slots->id_hash.
+ * Assumes that the memslot at @start index is not in @slots->hva_tree or
+ * @slots->id_hash.
*/
static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot,
@@ -1344,8 +1367,11 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);

/* Shift the next memslot back one and update its index. */
+ interval_tree_remove(&mslots[i - 1].hva_node, &slots->hva_tree);
hash_del(&mslots[i - 1].id_node);
+
mslots[i] = mslots[i - 1];
+ interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
return i;
@@ -1418,6 +1444,11 @@ static void update_memslots(struct kvm_memslots *slots,
* its index accordingly.
*/
slots->memslots[i] = *memslot;
+ slots->memslots[i].hva_node.start = memslot->userspace_addr;
+ slots->memslots[i].hva_node.last = memslot->userspace_addr +
+ (memslot->npages << PAGE_SHIFT) - 1;
+ interval_tree_insert(&slots->memslots[i].hva_node,
+ &slots->hva_tree);
hash_add(slots->id_hash, &slots->memslots[i].id_node,
memslot->id);
}
@@ -1540,9 +1571,12 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,

kvm_copy_memslots(slots, old);

+ slots->hva_tree = RB_ROOT_CACHED;
hash_init(slots->id_hash);
- kvm_for_each_memslot(memslot, slots)
+ kvm_for_each_memslot(memslot, slots) {
+ interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
hash_add(slots->id_hash, &memslot->id_node, memslot->id);
+ }

return slots;
}

2021-09-21 03:36:16

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

From: "Maciej S. Szmigiero" <[email protected]>

There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted.

Just cache the value and update it accordingly on each such operation so
the code doesn't need to traverse the whole memslot array each time.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 24 ------------------------
arch/x86/kvm/x86.c | 20 +++++++++++++++++---
3 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8f48a7ec577..315d5368ba84 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1038,6 +1038,7 @@ struct kvm_x86_msr_filter {
#define APICV_INHIBIT_REASON_X2APIC 5

struct kvm_arch {
+ u64 n_memslots_pages;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -1572,7 +1573,6 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
const struct kvm_memory_slot *memslot);
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
-unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);

int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d7e61122af8..61b9b7b5c10c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6133,30 +6133,6 @@ int kvm_mmu_module_init(void)
return ret;
}

-/*
- * Calculate mmu pages needed for kvm.
- */
-unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
-{
- unsigned long nr_mmu_pages;
- unsigned long nr_pages = 0;
- struct kvm_memslots *slots;
- struct kvm_memory_slot *memslot;
- int i;
-
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- slots = __kvm_memslots(kvm, i);
-
- kvm_for_each_memslot(memslot, slots)
- nr_pages += memslot->npages;
- }
-
- nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
- nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
-
- return nr_mmu_pages;
-}
-
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28ef14155726..65fdf27b9423 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11609,9 +11609,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- if (!kvm->arch.n_requested_mmu_pages)
- kvm_mmu_change_mmu_pages(kvm,
- kvm_mmu_calculate_default_mmu_pages(kvm));
+ if (change == KVM_MR_CREATE)
+ kvm->arch.n_memslots_pages += new->npages;
+ else if (change == KVM_MR_DELETE) {
+ WARN_ON(kvm->arch.n_memslots_pages < old->npages);
+ kvm->arch.n_memslots_pages -= old->npages;
+ }
+
+ if (!kvm->arch.n_requested_mmu_pages) {
+ u64 memslots_pages;
+ unsigned long nr_mmu_pages;
+
+ memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
+ do_div(memslots_pages, 1000);
+ nr_mmu_pages = max_t(typeof(nr_mmu_pages),
+ memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
+ kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+ }

kvm_mmu_slot_apply_flags(kvm, old, new, change);

2021-09-21 03:36:41

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 04/13] KVM: x86: Move n_memslots_pages recalc to kvm_arch_prepare_memory_region()

From: "Maciej S. Szmigiero" <[email protected]>

This allows us to return a proper error code in case we spot an underflow.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 97d86223427d..0fffb8414009 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11511,9 +11511,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
- return kvm_alloc_memslot_metadata(kvm, new,
- mem->memory_size >> PAGE_SHIFT);
+ if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
+ int ret;
+
+ ret = kvm_alloc_memslot_metadata(kvm, new,
+ mem->memory_size >> PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ if (change == KVM_MR_CREATE)
+ kvm->arch.n_memslots_pages += new->npages;
+ } else if (change == KVM_MR_DELETE) {
+ if (WARN_ON(kvm->arch.n_memslots_pages < old->npages))
+ return -EIO;
+
+ kvm->arch.n_memslots_pages -= old->npages;
+ }
+
return 0;
}

@@ -11610,24 +11624,17 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
- if (change == KVM_MR_CREATE)
- kvm->arch.n_memslots_pages += new->npages;
- else {
- WARN_ON(kvm->arch.n_memslots_pages < old->npages);
- kvm->arch.n_memslots_pages -= old->npages;
- }
-
- if (!kvm->arch.n_requested_mmu_pages) {
- u64 memslots_pages;
- unsigned long nr_mmu_pages;
-
- memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
- do_div(memslots_pages, 1000);
- nr_mmu_pages = max_t(typeof(nr_mmu_pages),
- memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
- kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
- }
+ /* Only CREATE or DELETE affects n_memslots_pages */
+ if ((change == KVM_MR_CREATE || change == KVM_MR_DELETE) &&
+ !kvm->arch.n_requested_mmu_pages) {
+ u64 memslots_pages;
+ unsigned long nr_mmu_pages;
+
+ memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
+ do_div(memslots_pages, 1000);
+ nr_mmu_pages = max_t(typeof(nr_mmu_pages),
+ memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
+ kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
}

kvm_mmu_slot_apply_flags(kvm, old, new, change);

2021-09-21 03:37:10

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 08/13] KVM: Resolve memslot ID via a hash table instead of via a static array

From: "Maciej S. Szmigiero" <[email protected]>

Memslot ID to the corresponding memslot mappings are currently kept as
indices in static id_to_index array.
The size of this array depends on the maximum allowed memslot count
(regardless of the number of memslots actually in use).

This has become especially problematic recently, when memslot count cap was
removed, so the maximum count is now full 32k memslots - the maximum
allowed by the current KVM API.

Keeping these IDs in a hash table (instead of an array) avoids this
problem.

Resolving a memslot ID to the actual memslot (instead of its index) will
also enable transitioning away from an array-based implementation of the
whole memslots structure in a later commit.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
include/linux/kvm_host.h | 16 +++++------
virt/kvm/kvm_main.c | 61 +++++++++++++++++++++++++++++++---------
2 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8fd9644f40b2..d2acc00a6472 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,7 @@
#include <linux/refcount.h>
#include <linux/nospec.h>
#include <linux/notifier.h>
+#include <linux/hashtable.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -426,6 +427,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

struct kvm_memory_slot {
+ struct hlist_node id_node;
gfn_t base_gfn;
unsigned long npages;
unsigned long *dirty_bitmap;
@@ -528,7 +530,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
struct kvm_memslots {
u64 generation;
/* The mapping table from slot id to the index in memslots[]. */
- short id_to_index[KVM_MEM_SLOTS_NUM];
+ DECLARE_HASHTABLE(id_hash, 7);
atomic_t last_used_slot;
int used_slots;
struct kvm_memory_slot memslots[];
@@ -795,16 +797,14 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
static inline
struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
{
- int index = slots->id_to_index[id];
struct kvm_memory_slot *slot;

- if (index < 0)
- return NULL;
-
- slot = &slots->memslots[index];
+ hash_for_each_possible(slots->id_hash, slot, id_node, id) {
+ if (slot->id == id)
+ return slot;
+ }

- WARN_ON(slot->id != id);
- return slot;
+ return NULL;
}

/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 48d182840060..50597608d085 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -827,15 +827,13 @@ static void kvm_destroy_pm_notifier(struct kvm *kvm)

static struct kvm_memslots *kvm_alloc_memslots(void)
{
- int i;
struct kvm_memslots *slots;

slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
if (!slots)
return NULL;

- for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
- slots->id_to_index[i] = -1;
+ hash_init(slots->id_hash);

return slots;
}
@@ -1236,14 +1234,16 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
/*
* Delete a memslot by decrementing the number of used slots and shifting all
* other entries in the array forward one spot.
+ * @memslot is a detached dummy struct with just .id and .as_id filled.
*/
static inline void kvm_memslot_delete(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot)
{
struct kvm_memory_slot *mslots = slots->memslots;
+ struct kvm_memory_slot *oldslot = id_to_memslot(slots, memslot->id);
int i;

- if (WARN_ON(slots->id_to_index[memslot->id] == -1))
+ if (WARN_ON(!oldslot))
return;

slots->used_slots--;
@@ -1251,12 +1251,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
atomic_set(&slots->last_used_slot, 0);

- for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
+ for (i = oldslot - mslots; i < slots->used_slots; i++) {
+ hash_del(&mslots[i].id_node);
mslots[i] = mslots[i + 1];
- slots->id_to_index[mslots[i].id] = i;
+ hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
+ hash_del(&mslots[i].id_node);
mslots[i] = *memslot;
- slots->id_to_index[memslot->id] = -1;
}

/*
@@ -1274,30 +1275,46 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
* itself is not preserved in the array, i.e. not swapped at this time, only
* its new index into the array is tracked. Returns the changed memslot's
* current index into the memslots array.
+ * The memslot at the returned index will not be in @slots->id_hash by then.
+ * @memslot is a detached struct with desired final data of the changed slot.
*/
static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot)
{
struct kvm_memory_slot *mslots = slots->memslots;
+ struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
int i;

- if (slots->id_to_index[memslot->id] == -1 || !slots->used_slots)
+ if (!mmemslot || !slots->used_slots)
return -1;

+ /*
+ * The loop below will (possibly) overwrite the target memslot with
+ * data of the next memslot, or a similar loop in
+ * kvm_memslot_move_forward() will overwrite it with data of the
+ * previous memslot.
+ * Then update_memslots() will unconditionally overwrite and re-add
+ * it to the hash table.
+ * That's why the memslot has to be first removed from the hash table
+ * here.
+ */
+ hash_del(&mmemslot->id_node);
+
/*
* Move the target memslot backward in the array by shifting existing
* memslots with a higher GFN (than the target memslot) towards the
* front of the array.
*/
- for (i = slots->id_to_index[memslot->id]; i < slots->used_slots - 1; i++) {
+ for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
if (memslot->base_gfn > mslots[i + 1].base_gfn)
break;

WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);

/* Shift the next memslot forward one and update its index. */
+ hash_del(&mslots[i + 1].id_node);
mslots[i] = mslots[i + 1];
- slots->id_to_index[mslots[i].id] = i;
+ hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
return i;
}
@@ -1308,6 +1325,10 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
* is not preserved in the array, i.e. not swapped at this time, only its new
* index into the array is tracked. Returns the changed memslot's final index
* into the memslots array.
+ * The memslot at the returned index will not be in @slots->id_hash by then.
+ * @memslot is a detached struct with desired final data of the new or
+ * changed slot.
+ * Assumes that the memslot at @start index is not in @slots->id_hash.
*/
static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot,
@@ -1323,8 +1344,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);

/* Shift the next memslot back one and update its index. */
+ hash_del(&mslots[i - 1].id_node);
mslots[i] = mslots[i - 1];
- slots->id_to_index[mslots[i].id] = i;
+ hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}
return i;
}
@@ -1369,6 +1391,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
* most likely to be referenced, sorting it to the front of the array was
* advantageous. The current binary search starts from the middle of the array
* and uses an LRU pointer to improve performance for all memslots and GFNs.
+ *
+ * @memslot is a detached struct, not a part of the current or new memslot
+ * array.
*/
static void update_memslots(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot,
@@ -1393,7 +1418,8 @@ static void update_memslots(struct kvm_memslots *slots,
* its index accordingly.
*/
slots->memslots[i] = *memslot;
- slots->id_to_index[memslot->id] = i;
+ hash_add(slots->id_hash, &slots->memslots[i].id_node,
+ memslot->id);
}
}

@@ -1501,6 +1527,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
{
struct kvm_memslots *slots;
size_t new_size;
+ struct kvm_memory_slot *memslot;

if (change == KVM_MR_CREATE)
new_size = kvm_memslots_size(old->used_slots + 1);
@@ -1508,8 +1535,14 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
new_size = kvm_memslots_size(old->used_slots);

slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
- if (likely(slots))
- kvm_copy_memslots(slots, old);
+ if (unlikely(!slots))
+ return NULL;
+
+ kvm_copy_memslots(slots, old);
+
+ hash_init(slots->id_hash);
+ kvm_for_each_memslot(memslot, slots)
+ hash_add(slots->id_hash, &memslot->id_node, memslot->id);

return slots;
}

2021-09-21 03:38:16

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 07/13] KVM: Just resync arch fields when slots_arch_lock gets reacquired

From: "Maciej S. Szmigiero" <[email protected]>

There is no need to copy the whole memslot data after releasing
slots_arch_lock for a moment to install temporary memslots copy in
kvm_set_memslot() since this lock only protects the arch field of each
memslot.

Just resync this particular field after reacquiring slots_arch_lock.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
virt/kvm/kvm_main.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 348fae880189..48d182840060 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1482,6 +1482,15 @@ static void kvm_copy_memslots(struct kvm_memslots *to,
memcpy(to, from, kvm_memslots_size(from->used_slots));
}

+static void kvm_copy_memslots_arch(struct kvm_memslots *to,
+ struct kvm_memslots *from)
+{
+ int i;
+
+ for (i = 0; i < from->used_slots; i++)
+ to->memslots[i].arch = from->memslots[i].arch;
+}
+
/*
* Note, at a minimum, the current number of used slots must be allocated, even
* when deleting a memslot, as we need a complete duplicate of the memslots for
@@ -1567,10 +1576,10 @@ static int kvm_set_memslot(struct kvm *kvm,
/*
* The arch-specific fields of the memslots could have changed
* between releasing the slots_arch_lock in
- * install_new_memslots and here, so get a fresh copy of the
- * slots.
+ * install_new_memslots and here, so get a fresh copy of these
+ * fields.
*/
- kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
+ kvm_copy_memslots_arch(slots, __kvm_memslots(kvm, as_id));
}

r = kvm_arch_prepare_memory_region(kvm, old, new, mem, change);
@@ -1587,8 +1596,6 @@ static int kvm_set_memslot(struct kvm *kvm,

out_slots:
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
- slot = id_to_memslot(slots, old->id);
- slot->flags &= ~KVM_MEMSLOT_INVALID;
slots = install_new_memslots(kvm, as_id, slots);
} else {
mutex_unlock(&kvm->slots_arch_lock);

2021-09-21 03:38:22

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 06/13] KVM: Move WARN on invalid memslot index to update_memslots()

From: "Maciej S. Szmigiero" <[email protected]>

Since kvm_memslot_move_forward() can theoretically return a negative
memslot index even when kvm_memslot_move_backward() returned a positive one
(and so did not WARN) let's just move the warning to the common code.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
---
virt/kvm/kvm_main.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14043a6edb88..348fae880189 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1281,8 +1281,7 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
struct kvm_memory_slot *mslots = slots->memslots;
int i;

- if (WARN_ON_ONCE(slots->id_to_index[memslot->id] == -1) ||
- WARN_ON_ONCE(!slots->used_slots))
+ if (slots->id_to_index[memslot->id] == -1 || !slots->used_slots)
return -1;

/*
@@ -1386,6 +1385,9 @@ static void update_memslots(struct kvm_memslots *slots,
i = kvm_memslot_move_backward(slots, memslot);
i = kvm_memslot_move_forward(slots, memslot, i);

+ if (WARN_ON_ONCE(i < 0))
+ return;
+
/*
* Copy the memslot to its new position in memslots and update
* its index accordingly.

2021-09-21 03:39:42

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 12/13] KVM: Optimize gfn lookup in kvm_zap_gfn_range()

From: "Maciej S. Szmigiero" <[email protected]>

Introduce a memslots gfn upper bound operation and use it to optimize
kvm_zap_gfn_range().
This way this handler can do a quick lookup for intersecting gfns and won't
have to do a linear scan of the whole memslot set.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 +++++--
include/linux/kvm_host.h | 69 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a05e581ef210..f2859988d630 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5724,18 +5724,25 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
int i;
bool flush = false;

+ if (WARN_ON_ONCE(gfn_end <= gfn_start))
+ return;
+
write_lock(&kvm->mmu_lock);

kvm_inc_notifier_count(kvm, gfn_start, gfn_end);

if (kvm_memslots_have_rmaps(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- int bkt;
+ int idx;
+ struct rb_node *node;

slots = __kvm_memslots(kvm, i);
- kvm_for_each_memslot(memslot, bkt, slots) {
+ idx = slots->node_idx;
+
+ kvm_for_each_memslot_in_gfn_range(node, slots, gfn_start, gfn_end) {
gfn_t start, end;

+ memslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
start = max(gfn_start, memslot->base_gfn);
end = min(gfn_end, memslot->base_gfn + memslot->npages);
if (start >= end)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6433efff447a..9ae5f7341cf5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -833,6 +833,75 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
return NULL;
}

+static inline
+struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)
+{
+ int idx = slots->node_idx;
+ struct rb_node *node, *result = NULL;
+
+ for (node = slots->gfn_tree.rb_node; node; ) {
+ struct kvm_memory_slot *slot;
+
+ slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
+ if (gfn < slot->base_gfn) {
+ result = node;
+ node = node->rb_left;
+ } else
+ node = node->rb_right;
+ }
+
+ return result;
+}
+
+static inline
+struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)
+{
+ struct rb_node *node;
+
+ /*
+ * Find the slot with the lowest gfn that can possibly intersect with
+ * the range, so we'll ideally have slot start <= range start
+ */
+ node = kvm_memslots_gfn_upper_bound(slots, start);
+ if (node) {
+ struct rb_node *pnode;
+
+ /*
+ * A NULL previous node means that the very first slot
+ * already has a higher start gfn.
+ * In this case slot start > range start.
+ */
+ pnode = rb_prev(node);
+ if (pnode)
+ node = pnode;
+ } else {
+ /* a NULL node below means no slots */
+ node = rb_last(&slots->gfn_tree);
+ }
+
+ return node;
+}
+
+static inline
+bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
+{
+ struct kvm_memory_slot *memslot;
+
+ memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
+
+ /*
+ * If this slot starts beyond or at the end of the range so does
+ * every next one
+ */
+ return memslot->base_gfn >= end;
+}
+
+/* Iterate over each memslot *possibly* intersecting [start, end) range */
+#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
+ for (node = kvm_for_each_in_gfn_first(slots, start); \
+ node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
+ node = rb_next(node)) \
+
/*
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
* - create a new memory slot

2021-09-21 03:40:26

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 10/13] KVM: s390: Introduce kvm_s390_get_gfn_end()

From: "Maciej S. Szmigiero" <[email protected]>

And use it where s390 code would just access the memslot with the highest
gfn directly.

No functional change intended.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/kvm/kvm-s390.h | 12 ++++++++++++
arch/s390/kvm/pv.c | 4 +---
3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 540fa948baa5..bbc32cd7a9a3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2006,7 +2006,7 @@ static int kvm_s390_get_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
if (!ms)
return 0;
next_gfn = kvm_s390_next_dirty_cmma(slots, cur_gfn + 1);
- mem_end = slots->memslots[0].base_gfn + slots->memslots[0].npages;
+ mem_end = kvm_s390_get_gfn_end(slots);

while (args->count < bufsize) {
hva = gfn_to_hva(kvm, cur_gfn);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index ecd741ee3276..73d4659ddd59 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -208,6 +208,18 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
return kvm->arch.user_cpu_state_ctrl != 0;
}

+/* get the end gfn of the last (highest gfn) memslot */
+static inline unsigned long kvm_s390_get_gfn_end(struct kvm_memslots *slots)
+{
+ struct kvm_memory_slot *ms;
+
+ if (WARN_ON(!slots->used_slots))
+ return 0;
+
+ ms = slots->memslots;
+ return ms->base_gfn + ms->npages;
+}
+
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index c8841f476e91..e51cccfded25 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -117,7 +117,6 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
unsigned long base = uv_info.guest_base_stor_len;
unsigned long virt = uv_info.guest_virt_var_stor_len;
unsigned long npages = 0, vlen = 0;
- struct kvm_memory_slot *memslot;

kvm->arch.pv.stor_var = NULL;
kvm->arch.pv.stor_base = __get_free_pages(GFP_KERNEL_ACCOUNT, get_order(base));
@@ -131,8 +130,7 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
* Slots are sorted by GFN
*/
mutex_lock(&kvm->slots_lock);
- memslot = kvm_memslots(kvm)->memslots;
- npages = memslot->base_gfn + memslot->npages;
+ npages = kvm_s390_get_gfn_end(kvm_memslots(kvm));
mutex_unlock(&kvm->slots_lock);

kvm->arch.pv.guest_len = npages * PAGE_SIZE;

2021-09-21 03:40:32

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

From: "Maciej S. Szmigiero" <[email protected]>

Do a quick lookup for possibly overlapping gfns when creating or moving
a memslot instead of performing a linear scan of the whole memslot set.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
virt/kvm/kvm_main.c | 36 +++++++++++++++++++++++++++---------
1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5fea467d6fec..78dad8c6376f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1667,6 +1667,30 @@ static int kvm_delete_memslot(struct kvm *kvm,
return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
}

+static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
+ struct kvm_memory_slot *nslot)
+{
+ int idx = slots->node_idx;
+ gfn_t nend = nslot->base_gfn + nslot->npages;
+ struct rb_node *node;
+
+ kvm_for_each_memslot_in_gfn_range(node, slots, nslot->base_gfn, nend) {
+ struct kvm_memory_slot *cslot;
+ gfn_t cend;
+
+ cslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
+ cend = cslot->base_gfn + cslot->npages;
+ if (cslot->id == nslot->id)
+ continue;
+
+ /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
+ if (cend > nslot->base_gfn)
+ return true;
+ }
+
+ return false;
+}
+
/*
* Allocate some memory and give it an address in the guest physical address
* space.
@@ -1752,16 +1776,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
}

if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
- int bkt;
-
/* Check for overlaps */
- kvm_for_each_memslot(tmp, bkt, __kvm_memslots(kvm, as_id)) {
- if (tmp->id == id)
- continue;
- if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
- (new.base_gfn >= tmp->base_gfn + tmp->npages)))
- return -EEXIST;
- }
+ if (kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id),
+ &new))
+ return -EEXIST;
}

/* Allocate/free page dirty bitmap as needed */

2021-10-19 22:10:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] KVM: Scalable memslots implementation

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:

For future revisions, feel free to omit the lengthy intro and just provide links
to previous versions.

> On x86-64 the code was well tested, passed KVM unit tests and KVM
> selftests with KASAN on.
> And, of course, booted various guests successfully (including nested
> ones with TDP MMU enabled).
> On other KVM platforms the code was compile-tested only.
>
> Changes since v1:

...

> Changes since v2:

...

> Changes since v3:

...

> Changes since v4:
> * Rebase onto v5.15-rc2 (torvalds/master),
>
> * Fix 64-bit division of n_memslots_pages for 32-bit KVM,
>
> * Collect Claudio's Reviewed-by tags for some of the patches.

Heh, this threw me for a loop. The standard pattern is to start with the most
recent version and work backwards, that way reviewers can quickly see the delta
for _this_ version. I.e.

Changes since v4:
...

Changes since v3:
...

2021-10-19 22:26:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> There is no point in recalculating from scratch the total number of pages
> in all memslots each time a memslot is created or deleted.
>
> Just cache the value and update it accordingly on each such operation so
> the code doesn't need to traverse the whole memslot array each time.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 28ef14155726..65fdf27b9423 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11609,9 +11609,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> const struct kvm_memory_slot *new,
> enum kvm_mr_change change)
> {
> - if (!kvm->arch.n_requested_mmu_pages)
> - kvm_mmu_change_mmu_pages(kvm,
> - kvm_mmu_calculate_default_mmu_pages(kvm));
> + if (change == KVM_MR_CREATE)
> + kvm->arch.n_memslots_pages += new->npages;
> + else if (change == KVM_MR_DELETE) {
> + WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> + kvm->arch.n_memslots_pages -= old->npages;
> + }
> +
> + if (!kvm->arch.n_requested_mmu_pages) {

Hmm, once n_requested_mmu_pages is set it can't be unset. That means this can be
further optimized to skip avoid taking mmu_lock on flags-only changes (and
memslot movement). E.g.

if (!kvm->arch.n_requested_mmu_pages &&
(change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {

}

It's a little risky, but kvm_vm_ioctl_set_nr_mmu_pages() would need to be modified
to allow clearing n_requested_mmu_pages and it already takes slots_lock, so IMO
it's ok to force kvm_vm_ioctl_set_nr_mmu_pages() to recalculate pages if it wants
to allow reverting back to the default.

> + u64 memslots_pages;
> + unsigned long nr_mmu_pages;
> +
> + memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
> + do_div(memslots_pages, 1000);
> + nr_mmu_pages = max_t(typeof(nr_mmu_pages),
> + memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);

"memslots_pages" is a bit of a misnomer. Any objection to avoiding naming problems
by explicitly casting to an "unsigned long" and simply operating on nr_mmu_pages?

nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);

E.g. the whole thing can be

if (!kvm->arch.n_requested_mmu_pages &&
(change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
unsigned long nr_mmu_pages;

if (change == KVM_MR_CREATE) {
kvm->arch.n_memslots_pages += new->npages;
} else {
WARN_ON(kvm->arch.n_memslots_pages < old->npages);
kvm->arch.n_memslots_pages -= old->npages;
}

nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
}

> + kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> + }
>
> kvm_mmu_slot_apply_flags(kvm, old, new, change);
>

2021-10-19 22:33:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On Tue, Oct 19, 2021, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> > From: "Maciej S. Szmigiero" <[email protected]>
> >
> > There is no point in recalculating from scratch the total number of pages
> > in all memslots each time a memslot is created or deleted.
> >
> > Just cache the value and update it accordingly on each such operation so
> > the code doesn't need to traverse the whole memslot array each time.
> >
> > Signed-off-by: Maciej S. Szmigiero <[email protected]>
> > ---
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 28ef14155726..65fdf27b9423 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11609,9 +11609,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> > const struct kvm_memory_slot *new,
> > enum kvm_mr_change change)
> > {
> > - if (!kvm->arch.n_requested_mmu_pages)
> > - kvm_mmu_change_mmu_pages(kvm,
> > - kvm_mmu_calculate_default_mmu_pages(kvm));
> > + if (change == KVM_MR_CREATE)
> > + kvm->arch.n_memslots_pages += new->npages;
> > + else if (change == KVM_MR_DELETE) {
> > + WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> > + kvm->arch.n_memslots_pages -= old->npages;
> > + }
> > +
> > + if (!kvm->arch.n_requested_mmu_pages) {
>
> Hmm, once n_requested_mmu_pages is set it can't be unset. That means this can be
> further optimized to skip avoid taking mmu_lock on flags-only changes (and
> memslot movement). E.g.
>
> if (!kvm->arch.n_requested_mmu_pages &&
> (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
>
> }
>
> It's a little risky, but kvm_vm_ioctl_set_nr_mmu_pages() would need to be modified
> to allow clearing n_requested_mmu_pages and it already takes slots_lock, so IMO
> it's ok to force kvm_vm_ioctl_set_nr_mmu_pages() to recalculate pages if it wants
> to allow reverting back to the default.

Doh, and then I read patch 2...

I would swap the order of patch 2 and patch 1, that way the optimization patch is
super simple, and you don't end up reworking a bunch of code that was added in the
immediately preceding patch. E.g. as a first patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28ef14155726..f3b1aed08566 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11609,7 +11609,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
- if (!kvm->arch.n_requested_mmu_pages)
+ if (!kvm->arch.n_requested_mmu_pages &&
+ (change == KVM_MR_CREATE || change == KVM_MR_DELETE))
kvm_mmu_change_mmu_pages(kvm,
kvm_mmu_calculate_default_mmu_pages(kvm));



2021-10-19 22:40:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 04/13] KVM: x86: Move n_memslots_pages recalc to kvm_arch_prepare_memory_region()

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> This allows us to return a proper error code in case we spot an underflow.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++--------------------
> 1 file changed, 28 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 97d86223427d..0fffb8414009 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11511,9 +11511,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> const struct kvm_userspace_memory_region *mem,
> enum kvm_mr_change change)
> {
> - if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
> - return kvm_alloc_memslot_metadata(kvm, new,
> - mem->memory_size >> PAGE_SHIFT);
> + if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
> + int ret;
> +
> + ret = kvm_alloc_memslot_metadata(kvm, new,
> + mem->memory_size >> PAGE_SHIFT);
> + if (ret)
> + return ret;
> +
> + if (change == KVM_MR_CREATE)
> + kvm->arch.n_memslots_pages += new->npages;
> + } else if (change == KVM_MR_DELETE) {
> + if (WARN_ON(kvm->arch.n_memslots_pages < old->npages))
> + return -EIO;

This is not worth the churn. In a way, it's worse because userspace can spam
the living snot out of the kernel log by retrying the ioctl().

Since underflow can happen if and only if there's a KVM bug, and a pretty bad one
at that, just make the original WARN_ON a KVM_BUG_ON. That will kill the VM and
also provide the WARN_ON_ONCE behavior that we probably want.

> +
> + kvm->arch.n_memslots_pages -= old->npages;
> + }
> +
> return 0;
> }
>

2021-10-19 23:42:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 05/13] KVM: Integrate gfn_to_memslot_approx() into search_memslots()

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> @@ -1267,7 +1280,7 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index)
> * itself isn't here as an inline because that would bloat other code too much.
> */
> static inline struct kvm_memory_slot *
> -__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> +__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)

This function name is a misnomer. The helper is not an "approx" version, it's an
inner helper that takes an @approx param. Unless someone has a more clever name,
the dreaded four underscores seems like the way to go. Warning away users is a
good thing in this case...

> {
> struct kvm_memory_slot *slot;
> int slot_index = atomic_read(&slots->last_used_slot);
> @@ -1276,7 +1289,7 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> if (slot)
> return slot;
>
> - slot = search_memslots(slots, gfn, &slot_index);
> + slot = search_memslots(slots, gfn, &slot_index, approx);
> if (slot) {
> atomic_set(&slots->last_used_slot, slot_index);
> return slot;
> @@ -1285,6 +1298,12 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> return NULL;
> }
>

There's a comment that doesn't show up in this diff that should also be moved,
and opportunistically updated.

> +static inline struct kvm_memory_slot *
> +__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> +{
> + return __gfn_to_memslot_approx(slots, gfn, false);
> +}
> +
> static inline unsigned long
> __gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {

E.g. this as fixup?

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 540fa948baa5..2964c773b36c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1964,10 +1964,15 @@ static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
return 0;
}

+static int gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn)
+{
+ return ____gfn_to_memslot(slots, cur_gfn, true);
+}
+
static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
unsigned long cur_gfn)
{
- struct kvm_memory_slot *ms = __gfn_to_memslot_approx(slots, cur_gfn, true);
+ struct kvm_memory_slot *ms = gfn_to_memslot_approx(slots, cur_gfn);
int slotidx = ms - slots->memslots;
unsigned long ofs = cur_gfn - ms->base_gfn;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8fd9644f40b2..ec1a074c2f6e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1274,13 +1274,8 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index, bool approx)
return NULL;
}

-/*
- * __gfn_to_memslot() and its descendants are here because it is called from
- * non-modular code in arch/powerpc/kvm/book3s_64_vio{,_hv}.c. gfn_to_memslot()
- * itself isn't here as an inline because that would bloat other code too much.
- */
static inline struct kvm_memory_slot *
-__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
+____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
{
struct kvm_memory_slot *slot;
int slot_index = atomic_read(&slots->last_used_slot);
@@ -1298,10 +1293,15 @@ __gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
return NULL;
}

+/*
+ * __gfn_to_memslot() and its descendants are here to allow arch code to inline
+ * the lookups in hot paths. gfn_to_memslot() itself isn't here as an inline
+ * because that would bloat other code too much.
+ */
static inline struct kvm_memory_slot *
__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
{
- return __gfn_to_memslot_approx(slots, gfn, false);
+ return ____gfn_to_memslot(slots, gfn, false);
}

static inline unsigned long

2021-10-19 23:44:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 06/13] KVM: Move WARN on invalid memslot index to update_memslots()

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> Since kvm_memslot_move_forward() can theoretically return a negative
> memslot index even when kvm_memslot_move_backward() returned a positive one
> (and so did not WARN) let's just move the warning to the common code.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> Reviewed-by: Claudio Imbrenda <[email protected]>
> ---

Reviewed-by: Sean Christopherson <[email protected]>

2021-10-19 23:56:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 07/13] KVM: Just resync arch fields when slots_arch_lock gets reacquired

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> There is no need to copy the whole memslot data after releasing
> slots_arch_lock for a moment to install temporary memslots copy in
> kvm_set_memslot() since this lock only protects the arch field of each
> memslot.
>
> Just resync this particular field after reacquiring slots_arch_lock.

I assume this needed to avoid having a mess when introducing the r-b tree? If so,
please call that out. Iterating over the slots might actually be slower than the
full memcpy, i.e. as a standalone patch this may or may not be make sense.

> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> virt/kvm/kvm_main.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 348fae880189..48d182840060 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1482,6 +1482,15 @@ static void kvm_copy_memslots(struct kvm_memslots *to,
> memcpy(to, from, kvm_memslots_size(from->used_slots));
> }
>
> +static void kvm_copy_memslots_arch(struct kvm_memslots *to,
> + struct kvm_memslots *from)
> +{
> + int i;
> +
> + for (i = 0; i < from->used_slots; i++)
> + to->memslots[i].arch = from->memslots[i].arch;

This should probably be a memcpy(), I don't know what all shenanigans the compiler
can throw at us if it gets to copy a struct by value.

> +}
> +
> /*
> * Note, at a minimum, the current number of used slots must be allocated, even
> * when deleting a memslot, as we need a complete duplicate of the memslots for

There's an out-of-sight comment that's now stale, can you revert to the
pre-slots_arch_lock comment?

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 48d182840060..ef3345428047 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1555,9 +1555,10 @@ static int kvm_set_memslot(struct kvm *kvm,
slot->flags |= KVM_MEMSLOT_INVALID;

/*
- * We can re-use the memory from the old memslots.
- * It will be overwritten with a copy of the new memslots
- * after reacquiring the slots_arch_lock below.
+ * We can re-use the old memslots, the only difference from the
+ * newly installed memslots is the invalid flag, which will get
+ * dropped by update_memslots anyway. We'll also revert to the
+ * old memslots if preparing the new memory region fails.
*/
slots = install_new_memslots(kvm, as_id, slots);

> @@ -1567,10 +1576,10 @@ static int kvm_set_memslot(struct kvm *kvm,
> /*
> * The arch-specific fields of the memslots could have changed
> * between releasing the slots_arch_lock in
> - * install_new_memslots and here, so get a fresh copy of the
> - * slots.
> + * install_new_memslots and here, so get a fresh copy of these
> + * fields.
> */
> - kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
> + kvm_copy_memslots_arch(slots, __kvm_memslots(kvm, as_id));
> }
>
> r = kvm_arch_prepare_memory_region(kvm, old, new, mem, change);
> @@ -1587,8 +1596,6 @@ static int kvm_set_memslot(struct kvm *kvm,
>
> out_slots:
> if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> - slot = id_to_memslot(slots, old->id);
> - slot->flags &= ~KVM_MEMSLOT_INVALID;
> slots = install_new_memslots(kvm, as_id, slots);
> } else {

The braces can be dropped since both branches are now single lines.

> mutex_unlock(&kvm->slots_arch_lock);

2021-10-20 00:48:05

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 08/13] KVM: Resolve memslot ID via a hash table instead of via a static array

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> ---
> include/linux/kvm_host.h | 16 +++++------
> virt/kvm/kvm_main.c | 61 +++++++++++++++++++++++++++++++---------
> 2 files changed, 55 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8fd9644f40b2..d2acc00a6472 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -29,6 +29,7 @@
> #include <linux/refcount.h>
> #include <linux/nospec.h>
> #include <linux/notifier.h>
> +#include <linux/hashtable.h>
> #include <asm/signal.h>
>
> #include <linux/kvm.h>
> @@ -426,6 +427,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>
> struct kvm_memory_slot {
> + struct hlist_node id_node;
> gfn_t base_gfn;
> unsigned long npages;
> unsigned long *dirty_bitmap;
> @@ -528,7 +530,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> struct kvm_memslots {
> u64 generation;
> /* The mapping table from slot id to the index in memslots[]. */
> - short id_to_index[KVM_MEM_SLOTS_NUM];
> + DECLARE_HASHTABLE(id_hash, 7);

Can you add a comment explaining the rationale for size "7"? Not necessarily the
justification in choosing "7", more so the tradeoffs between performance, memory,
etc... so that all your work/investigation isn't lost and doesn't have to be repeated
if someone wants to tweak this in the future.

> atomic_t last_used_slot;
> int used_slots;
> struct kvm_memory_slot memslots[];
> @@ -795,16 +797,14 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
> static inline
> struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
> {
> - int index = slots->id_to_index[id];
> struct kvm_memory_slot *slot;
>
> - if (index < 0)
> - return NULL;
> -
> - slot = &slots->memslots[index];
> + hash_for_each_possible(slots->id_hash, slot, id_node, id) {
> + if (slot->id == id)
> + return slot;

Hmm, related to the hash, it might be worth adding a stat here to count collisions.
Might be more pain than it's worth though since we don't have @kvm.

> + }
>
> - WARN_ON(slot->id != id);
> - return slot;
> + return NULL;
> }
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 48d182840060..50597608d085 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -827,15 +827,13 @@ static void kvm_destroy_pm_notifier(struct kvm *kvm)
>
> static struct kvm_memslots *kvm_alloc_memslots(void)
> {
> - int i;
> struct kvm_memslots *slots;
>
> slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
> if (!slots)
> return NULL;
>
> - for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
> - slots->id_to_index[i] = -1;
> + hash_init(slots->id_hash);
>
> return slots;
> }
> @@ -1236,14 +1234,16 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
> /*
> * Delete a memslot by decrementing the number of used slots and shifting all
> * other entries in the array forward one spot.
> + * @memslot is a detached dummy struct with just .id and .as_id filled.
> */
> static inline void kvm_memslot_delete(struct kvm_memslots *slots,
> struct kvm_memory_slot *memslot)
> {
> struct kvm_memory_slot *mslots = slots->memslots;
> + struct kvm_memory_slot *oldslot = id_to_memslot(slots, memslot->id);
> int i;
>
> - if (WARN_ON(slots->id_to_index[memslot->id] == -1))
> + if (WARN_ON(!oldslot))
> return;
>
> slots->used_slots--;
> @@ -1251,12 +1251,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
> if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
> atomic_set(&slots->last_used_slot, 0);
>
> - for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
> + for (i = oldslot - mslots; i < slots->used_slots; i++) {
> + hash_del(&mslots[i].id_node);
> mslots[i] = mslots[i + 1];
> - slots->id_to_index[mslots[i].id] = i;
> + hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> }
> + hash_del(&mslots[i].id_node);
> mslots[i] = *memslot;
> - slots->id_to_index[memslot->id] = -1;
> }
>
> /*
> @@ -1274,30 +1275,46 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
> * itself is not preserved in the array, i.e. not swapped at this time, only
> * its new index into the array is tracked. Returns the changed memslot's
> * current index into the memslots array.
> + * The memslot at the returned index will not be in @slots->id_hash by then.
> + * @memslot is a detached struct with desired final data of the changed slot.
> */
> static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
> struct kvm_memory_slot *memslot)
> {
> struct kvm_memory_slot *mslots = slots->memslots;
> + struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);

My comment from v3 about the danger of "mmemslot" still stands. FWIW, I dislike
"mslots" as well, but that predates me, and all of this will go away in the end :-)

On Wed, May 19, 2021 at 3:31 PM Sean Christopherson <[email protected]> wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> > struct kvm_memory_slot *mslots = slots->memslots;
> > + struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);
>
> I vote to call these local vars "old", or something along those lines. dmemslot
> isn't too bad, but mmemslot in the helpers below is far too similar to memslot,
> and using the wrong will cause nasty explosions.


> int i;
>
> - if (slots->id_to_index[memslot->id] == -1 || !slots->used_slots)
> + if (!mmemslot || !slots->used_slots)
> return -1;
>
> + /*
> + * The loop below will (possibly) overwrite the target memslot with
> + * data of the next memslot, or a similar loop in
> + * kvm_memslot_move_forward() will overwrite it with data of the
> + * previous memslot.
> + * Then update_memslots() will unconditionally overwrite and re-add
> + * it to the hash table.
> + * That's why the memslot has to be first removed from the hash table
> + * here.
> + */

Is this reword accurate?

/*
* Delete the slot from the hash table before sorting the remaining
* slots, the slot's data may be overwritten when copying slots as part
* of the sorting proccess. update_memslots() will unconditionally
* rewrite the entire slot and re-add it to the hash table.
*/

> + hash_del(&mmemslot->id_node);
> +
> /*
> * Move the target memslot backward in the array by shifting existing
> * memslots with a higher GFN (than the target memslot) towards the
> * front of the array.
> */
> - for (i = slots->id_to_index[memslot->id]; i < slots->used_slots - 1; i++) {
> + for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
> if (memslot->base_gfn > mslots[i + 1].base_gfn)
> break;
>
> WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
>
> /* Shift the next memslot forward one and update its index. */
> + hash_del(&mslots[i + 1].id_node);
> mslots[i] = mslots[i + 1];
> - slots->id_to_index[mslots[i].id] = i;
> + hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> }
> return i;
> }
> @@ -1308,6 +1325,10 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
> * is not preserved in the array, i.e. not swapped at this time, only its new
> * index into the array is tracked. Returns the changed memslot's final index
> * into the memslots array.
> + * The memslot at the returned index will not be in @slots->id_hash by then.
> + * @memslot is a detached struct with desired final data of the new or
> + * changed slot.
> + * Assumes that the memslot at @start index is not in @slots->id_hash.
> */
> static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
> struct kvm_memory_slot *memslot,
> @@ -1323,8 +1344,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
> WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
>
> /* Shift the next memslot back one and update its index. */
> + hash_del(&mslots[i - 1].id_node);
> mslots[i] = mslots[i - 1];
> - slots->id_to_index[mslots[i].id] = i;
> + hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> }
> return i;
> }
> @@ -1369,6 +1391,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
> * most likely to be referenced, sorting it to the front of the array was
> * advantageous. The current binary search starts from the middle of the array
> * and uses an LRU pointer to improve performance for all memslots and GFNs.
> + *
> + * @memslot is a detached struct, not a part of the current or new memslot
> + * array.
> */
> static void update_memslots(struct kvm_memslots *slots,
> struct kvm_memory_slot *memslot,
> @@ -1393,7 +1418,8 @@ static void update_memslots(struct kvm_memslots *slots,
> * its index accordingly.
> */
> slots->memslots[i] = *memslot;
> - slots->id_to_index[memslot->id] = i;
> + hash_add(slots->id_hash, &slots->memslots[i].id_node,
> + memslot->id);

Let this poke out past 80 chars, i.e. drop the newline.

> }
> }
>
> @@ -1501,6 +1527,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
> {
> struct kvm_memslots *slots;
> size_t new_size;
> + struct kvm_memory_slot *memslot;
>
> if (change == KVM_MR_CREATE)
> new_size = kvm_memslots_size(old->used_slots + 1);
> @@ -1508,8 +1535,14 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
> new_size = kvm_memslots_size(old->used_slots);
>
> slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
> - if (likely(slots))
> - kvm_copy_memslots(slots, old);
> + if (unlikely(!slots))
> + return NULL;
> +
> + kvm_copy_memslots(slots, old);
> +
> + hash_init(slots->id_hash);
> + kvm_for_each_memslot(memslot, slots)
> + hash_add(slots->id_hash, &memslot->id_node, memslot->id);
>
> return slots;
> }

2021-10-20 18:43:27

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] KVM: Scalable memslots implementation

On 20.10.2021 00:07, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>
> For future revisions, feel free to omit the lengthy intro and just provide links
> to previous versions.

Will do.

>> On x86-64 the code was well tested, passed KVM unit tests and KVM
>> selftests with KASAN on.
>> And, of course, booted various guests successfully (including nested
>> ones with TDP MMU enabled).
>> On other KVM platforms the code was compile-tested only.
>>
>> Changes since v1:
>
> ...
>
>> Changes since v2:
>
> ...
>
>> Changes since v3:
>
> ...
>
>> Changes since v4:
>> * Rebase onto v5.15-rc2 (torvalds/master),
>>
>> * Fix 64-bit division of n_memslots_pages for 32-bit KVM,
>>
>> * Collect Claudio's Reviewed-by tags for some of the patches.
>
> Heh, this threw me for a loop. The standard pattern is to start with the most
> recent version and work backwards, that way reviewers can quickly see the delta
> for _this_ version. I.e.
>
> Changes since v4:
> ...
>
> Changes since v3:
> ...
>

I have always used the chronological order but your argument about
reviewers being able to quickly see the delta makes sense - will switch
to having the latest changes on the top in the next version.

By the way, looking at the current https://lore.kernel.org/lkml/ at the
time I am writing this, while most of v3+ submissions are indeed
using the "latest on the top" order, some aren't:
https://lore.kernel.org/lkml/[email protected]/T/
https://lore.kernel.org/lkml/[email protected]/T/
https://lore.kernel.org/lkml/YW%2Fq70dLyF+YudyF@T590/T/ (this one uses a
hybrid approach - current version changes on the top, remaining changeset
in chronological order).

Thanks,
Maciej

2021-10-20 18:43:47

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On 20.10.2021 00:31, Sean Christopherson wrote:
> On Tue, Oct 19, 2021, Sean Christopherson wrote:
>> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <[email protected]>
>>>
>>> There is no point in recalculating from scratch the total number of pages
>>> in all memslots each time a memslot is created or deleted.
>>>
>>> Just cache the value and update it accordingly on each such operation so
>>> the code doesn't need to traverse the whole memslot array each time.
>>>
>>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>>> ---
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 28ef14155726..65fdf27b9423 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11609,9 +11609,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>>> const struct kvm_memory_slot *new,
>>> enum kvm_mr_change change)
>>> {
>>> - if (!kvm->arch.n_requested_mmu_pages)
>>> - kvm_mmu_change_mmu_pages(kvm,
>>> - kvm_mmu_calculate_default_mmu_pages(kvm));
>>> + if (change == KVM_MR_CREATE)
>>> + kvm->arch.n_memslots_pages += new->npages;
>>> + else if (change == KVM_MR_DELETE) {
>>> + WARN_ON(kvm->arch.n_memslots_pages < old->npages);
>>> + kvm->arch.n_memslots_pages -= old->npages;
>>> + }
>>> +
>>> + if (!kvm->arch.n_requested_mmu_pages) {
>>
>> Hmm, once n_requested_mmu_pages is set it can't be unset. That means this can be
>> further optimized to skip avoid taking mmu_lock on flags-only changes (and
>> memslot movement). E.g.
>>
>> if (!kvm->arch.n_requested_mmu_pages &&
>> (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
>>
>> }
>>
>> It's a little risky, but kvm_vm_ioctl_set_nr_mmu_pages() would need to be modified
>> to allow clearing n_requested_mmu_pages and it already takes slots_lock, so IMO
>> it's ok to force kvm_vm_ioctl_set_nr_mmu_pages() to recalculate pages if it wants
>> to allow reverting back to the default.
>
> Doh, and then I read patch 2...
>
> I would swap the order of patch 2 and patch 1, that way the optimization patch is
> super simple, and you don't end up reworking a bunch of code that was added in the
> immediately preceding patch. E.g. as a first patch
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 28ef14155726..f3b1aed08566 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11609,7 +11609,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> const struct kvm_memory_slot *new,
> enum kvm_mr_change change)
> {
> - if (!kvm->arch.n_requested_mmu_pages)
> + if (!kvm->arch.n_requested_mmu_pages &&
> + (change == KVM_MR_CREATE || change == KVM_MR_DELETE))
> kvm_mmu_change_mmu_pages(kvm,
> kvm_mmu_calculate_default_mmu_pages(kvm));
>
>
>

Will do.

Thanks,
Maciej

2021-10-20 18:44:13

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 04/13] KVM: x86: Move n_memslots_pages recalc to kvm_arch_prepare_memory_region()

On 20.10.2021 00:38, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> This allows us to return a proper error code in case we spot an underflow.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++--------------------
>> 1 file changed, 28 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 97d86223427d..0fffb8414009 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11511,9 +11511,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>> const struct kvm_userspace_memory_region *mem,
>> enum kvm_mr_change change)
>> {
>> - if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
>> - return kvm_alloc_memslot_metadata(kvm, new,
>> - mem->memory_size >> PAGE_SHIFT);
>> + if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
>> + int ret;
>> +
>> + ret = kvm_alloc_memslot_metadata(kvm, new,
>> + mem->memory_size >> PAGE_SHIFT);
>> + if (ret)
>> + return ret;
>> +
>> + if (change == KVM_MR_CREATE)
>> + kvm->arch.n_memslots_pages += new->npages;
>> + } else if (change == KVM_MR_DELETE) {
>> + if (WARN_ON(kvm->arch.n_memslots_pages < old->npages))
>> + return -EIO;
>
> This is not worth the churn. In a way, it's worse because userspace can spam
> the living snot out of the kernel log by retrying the ioctl().
>
> Since underflow can happen if and only if there's a KVM bug, and a pretty bad one
> at that, just make the original WARN_ON a KVM_BUG_ON. That will kill the VM and
> also provide the WARN_ON_ONCE behavior that we probably want.

Will do.

Thanks,
Maciej

2021-10-20 18:44:53

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 07/13] KVM: Just resync arch fields when slots_arch_lock gets reacquired

On 20.10.2021 01:55, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> There is no need to copy the whole memslot data after releasing
>> slots_arch_lock for a moment to install temporary memslots copy in
>> kvm_set_memslot() since this lock only protects the arch field of each
>> memslot.
>>
>> Just resync this particular field after reacquiring slots_arch_lock.
>
> I assume this needed to avoid having a mess when introducing the r-b tree? If so,
> please call that out. Iterating over the slots might actually be slower than the
> full memcpy, i.e. as a standalone patch this may or may not be make sense.

Yes, it's an intermediate state of the code to not break bisecting.
The code changed by this patch is then completely replaced later by the
patch 11 of this patchset.

Will add a note about this to the commit message.

>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> virt/kvm/kvm_main.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 348fae880189..48d182840060 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1482,6 +1482,15 @@ static void kvm_copy_memslots(struct kvm_memslots *to,
>> memcpy(to, from, kvm_memslots_size(from->used_slots));
>> }
>>
>> +static void kvm_copy_memslots_arch(struct kvm_memslots *to,
>> + struct kvm_memslots *from)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < from->used_slots; i++)
>> + to->memslots[i].arch = from->memslots[i].arch;
>
> This should probably be a memcpy(), I don't know what all shenanigans the compiler
> can throw at us if it gets to copy a struct by value.

Normally, copy-assignment of a struct is a safe operation (this is purely
an internal kernel struct, so there are no worries about padding leakage
to the userspace), but can replace this with a memcpy().

>> +}
>> +
>> /*
>> * Note, at a minimum, the current number of used slots must be allocated, even
>> * when deleting a memslot, as we need a complete duplicate of the memslots for
>
> There's an out-of-sight comment that's now stale, can you revert to the
> pre-slots_arch_lock comment?
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 48d182840060..ef3345428047 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1555,9 +1555,10 @@ static int kvm_set_memslot(struct kvm *kvm,
> slot->flags |= KVM_MEMSLOT_INVALID;
>
> /*
> - * We can re-use the memory from the old memslots.
> - * It will be overwritten with a copy of the new memslots
> - * after reacquiring the slots_arch_lock below.
> + * We can re-use the old memslots, the only difference from the
> + * newly installed memslots is the invalid flag, which will get
> + * dropped by update_memslots anyway. We'll also revert to the
> + * old memslots if preparing the new memory region fails.
> */
> slots = install_new_memslots(kvm, as_id, slots);
>

Will do.

>> @@ -1567,10 +1576,10 @@ static int kvm_set_memslot(struct kvm *kvm,
>> /*
>> * The arch-specific fields of the memslots could have changed
>> * between releasing the slots_arch_lock in
>> - * install_new_memslots and here, so get a fresh copy of the
>> - * slots.
>> + * install_new_memslots and here, so get a fresh copy of these
>> + * fields.
>> */
>> - kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
>> + kvm_copy_memslots_arch(slots, __kvm_memslots(kvm, as_id));
>> }
>>
>> r = kvm_arch_prepare_memory_region(kvm, old, new, mem, change);
>> @@ -1587,8 +1596,6 @@ static int kvm_set_memslot(struct kvm *kvm,
>>
>> out_slots:
>> if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>> - slot = id_to_memslot(slots, old->id);
>> - slot->flags &= ~KVM_MEMSLOT_INVALID;
>> slots = install_new_memslots(kvm, as_id, slots);
>> } else {
>
> The braces can be dropped since both branches are now single lines.
>
>> mutex_unlock(&kvm->slots_arch_lock);

Will drop them.

Thanks,
Maciej

2021-10-20 18:44:53

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 08/13] KVM: Resolve memslot ID via a hash table instead of via a static array

On 20.10.2021 02:43, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> ---
>> include/linux/kvm_host.h | 16 +++++------
>> virt/kvm/kvm_main.c | 61 +++++++++++++++++++++++++++++++---------
>> 2 files changed, 55 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 8fd9644f40b2..d2acc00a6472 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -29,6 +29,7 @@
>> #include <linux/refcount.h>
>> #include <linux/nospec.h>
>> #include <linux/notifier.h>
>> +#include <linux/hashtable.h>
>> #include <asm/signal.h>
>>
>> #include <linux/kvm.h>
>> @@ -426,6 +427,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>>
>> struct kvm_memory_slot {
>> + struct hlist_node id_node;
>> gfn_t base_gfn;
>> unsigned long npages;
>> unsigned long *dirty_bitmap;
>> @@ -528,7 +530,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>> struct kvm_memslots {
>> u64 generation;
>> /* The mapping table from slot id to the index in memslots[]. */
>> - short id_to_index[KVM_MEM_SLOTS_NUM];
>> + DECLARE_HASHTABLE(id_hash, 7);
>
> Can you add a comment explaining the rationale for size "7"? Not necessarily the
> justification in choosing "7", more so the tradeoffs between performance, memory,
> etc... so that all your work/investigation isn't lost and doesn't have to be repeated
> if someone wants to tweak this in the future.

Will add such comment.

>> atomic_t last_used_slot;
>> int used_slots;
>> struct kvm_memory_slot memslots[];
>> @@ -795,16 +797,14 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>> static inline
>> struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>> {
>> - int index = slots->id_to_index[id];
>> struct kvm_memory_slot *slot;
>>
>> - if (index < 0)
>> - return NULL;
>> -
>> - slot = &slots->memslots[index];
>> + hash_for_each_possible(slots->id_hash, slot, id_node, id) {
>> + if (slot->id == id)
>> + return slot;
>
> Hmm, related to the hash, it might be worth adding a stat here to count collisions.
> Might be more pain than it's worth though since we don't have @kvm.

It's a good idea if it turns out that it's worth optimizing the code
further (by, for example, introducing a self-resizing hash table, which
would bring a significant increase in complexity for rather uncertain
gains).

>> @@ -1274,30 +1275,46 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
>> * itself is not preserved in the array, i.e. not swapped at this time, only
>> * its new index into the array is tracked. Returns the changed memslot's
>> * current index into the memslots array.
>> + * The memslot at the returned index will not be in @slots->id_hash by then.
>> + * @memslot is a detached struct with desired final data of the changed slot.
>> */
>> static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
>> struct kvm_memory_slot *memslot)
>> {
>> struct kvm_memory_slot *mslots = slots->memslots;
>> + struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
>
> My comment from v3 about the danger of "mmemslot" still stands. FWIW, I dislike
> "mslots" as well, but that predates me, and all of this will go away in the end :-)
>
> On Wed, May 19, 2021 at 3:31 PM Sean Christopherson <[email protected]> wrote:
>> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>>> struct kvm_memory_slot *mslots = slots->memslots;
>>> + struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);
>>
>> I vote to call these local vars "old", or something along those lines. dmemslot
>> isn't too bad, but mmemslot in the helpers below is far too similar to memslot,
>> and using the wrong will cause nasty explosions.
>

Will rename "mmemslot" to "oldslot" in kvm_memslot_move_backward(), too.

>> int i;
>>
>> - if (slots->id_to_index[memslot->id] == -1 || !slots->used_slots)
>> + if (!mmemslot || !slots->used_slots)
>> return -1;
>>
>> + /*
>> + * The loop below will (possibly) overwrite the target memslot with
>> + * data of the next memslot, or a similar loop in
>> + * kvm_memslot_move_forward() will overwrite it with data of the
>> + * previous memslot.
>> + * Then update_memslots() will unconditionally overwrite and re-add
>> + * it to the hash table.
>> + * That's why the memslot has to be first removed from the hash table
>> + * here.
>> + */
>
> Is this reword accurate?
>
> /*
> * Delete the slot from the hash table before sorting the remaining
> * slots, the slot's data may be overwritten when copying slots as part
> * of the sorting proccess. update_memslots() will unconditionally
> * rewrite the entire slot and re-add it to the hash table.
> */

It's accurate, will replace the comment with the proposed one.

>> @@ -1369,6 +1391,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
>> * most likely to be referenced, sorting it to the front of the array was
>> * advantageous. The current binary search starts from the middle of the array
>> * and uses an LRU pointer to improve performance for all memslots and GFNs.
>> + *
>> + * @memslot is a detached struct, not a part of the current or new memslot
>> + * array.
>> */
>> static void update_memslots(struct kvm_memslots *slots,
>> struct kvm_memory_slot *memslot,
>> @@ -1393,7 +1418,8 @@ static void update_memslots(struct kvm_memslots *slots,
>> * its index accordingly.
>> */
>> slots->memslots[i] = *memslot;
>> - slots->id_to_index[memslot->id] = i;
>> + hash_add(slots->id_hash, &slots->memslots[i].id_node,
>> + memslot->id);
>
> Let this poke out past 80 chars, i.e. drop the newline.

Will do.

Thanks,
Maciej

2021-10-20 18:46:11

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On 20.10.2021 00:24, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> There is no point in recalculating from scratch the total number of pages
>> in all memslots each time a memslot is created or deleted.
>>
>> Just cache the value and update it accordingly on each such operation so
>> the code doesn't need to traverse the whole memslot array each time.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 28ef14155726..65fdf27b9423 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11609,9 +11609,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>> const struct kvm_memory_slot *new,
>> enum kvm_mr_change change)
>> {
>> - if (!kvm->arch.n_requested_mmu_pages)
>> - kvm_mmu_change_mmu_pages(kvm,
>> - kvm_mmu_calculate_default_mmu_pages(kvm));
>> + if (change == KVM_MR_CREATE)
>> + kvm->arch.n_memslots_pages += new->npages;
>> + else if (change == KVM_MR_DELETE) {
>> + WARN_ON(kvm->arch.n_memslots_pages < old->npages);
>> + kvm->arch.n_memslots_pages -= old->npages;
>> + }
>> +
>> + if (!kvm->arch.n_requested_mmu_pages) {
(..)
>> + u64 memslots_pages;
>> + unsigned long nr_mmu_pages;
>> +
>> + memslots_pages = kvm->arch.n_memslots_pages * KVM_PERMILLE_MMU_PAGES;
>> + do_div(memslots_pages, 1000);
>> + nr_mmu_pages = max_t(typeof(nr_mmu_pages),
>> + memslots_pages, KVM_MIN_ALLOC_MMU_PAGES);
>
> "memslots_pages" is a bit of a misnomer. Any objection to avoiding naming problems
> by explicitly casting to an "unsigned long" and simply operating on nr_mmu_pages?
>
> nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
> nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
> nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
> kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
>
> E.g. the whole thing can be
>
> if (!kvm->arch.n_requested_mmu_pages &&
> (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
> unsigned long nr_mmu_pages;
>
> if (change == KVM_MR_CREATE) {
> kvm->arch.n_memslots_pages += new->npages;
> } else {
> WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> kvm->arch.n_memslots_pages -= old->npages;
> }
>
> nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
> nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);

The above line will set nr_mmu_pages to zero since KVM_PERMILLE_MMU_PAGES
is 20, so when integer-divided by 1000 will result in a multiplication
coefficient of zero.

> nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
> kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> }
>
>> + kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
>> + }
>>
>> kvm_mmu_slot_apply_flags(kvm, old, new, change);
>>

Thanks,
Maciej

2021-10-20 18:46:59

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 05/13] KVM: Integrate gfn_to_memslot_approx() into search_memslots()

On 20.10.2021 01:38, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> @@ -1267,7 +1280,7 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index)
>> * itself isn't here as an inline because that would bloat other code too much.
>> */
>> static inline struct kvm_memory_slot *
>> -__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>> +__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
>
> This function name is a misnomer. The helper is not an "approx" version, it's an
> inner helper that takes an @approx param. Unless someone has a more clever name,
> the dreaded four underscores seems like the way to go. Warning away users is a
> good thing in this case...
>
>> {
>> struct kvm_memory_slot *slot;
>> int slot_index = atomic_read(&slots->last_used_slot);
>> @@ -1276,7 +1289,7 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>> if (slot)
>> return slot;
>>
>> - slot = search_memslots(slots, gfn, &slot_index);
>> + slot = search_memslots(slots, gfn, &slot_index, approx);
>> if (slot) {
>> atomic_set(&slots->last_used_slot, slot_index);
>> return slot;
>> @@ -1285,6 +1298,12 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>> return NULL;
>> }
>>
>
> There's a comment that doesn't show up in this diff that should also be moved,
> and opportunistically updated.
>
>> +static inline struct kvm_memory_slot *
>> +__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>> +{
>> + return __gfn_to_memslot_approx(slots, gfn, false);
>> +}
>> +
>> static inline unsigned long
>> __gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>> {
>
> E.g. this as fixup?
>
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 540fa948baa5..2964c773b36c 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1964,10 +1964,15 @@ static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
> return 0;
> }
>
> +static int gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn)
> +{
> + return ____gfn_to_memslot(slots, cur_gfn, true);
> +}
> +
> static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
> unsigned long cur_gfn)
> {
> - struct kvm_memory_slot *ms = __gfn_to_memslot_approx(slots, cur_gfn, true);
> + struct kvm_memory_slot *ms = gfn_to_memslot_approx(slots, cur_gfn);
> int slotidx = ms - slots->memslots;
> unsigned long ofs = cur_gfn - ms->base_gfn;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8fd9644f40b2..ec1a074c2f6e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1274,13 +1274,8 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index, bool approx)
> return NULL;
> }
>
> -/*
> - * __gfn_to_memslot() and its descendants are here because it is called from
> - * non-modular code in arch/powerpc/kvm/book3s_64_vio{,_hv}.c. gfn_to_memslot()
> - * itself isn't here as an inline because that would bloat other code too much.
> - */
> static inline struct kvm_memory_slot *
> -__gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
> +____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
> {
> struct kvm_memory_slot *slot;
> int slot_index = atomic_read(&slots->last_used_slot);
> @@ -1298,10 +1293,15 @@ __gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn, bool approx)
> return NULL;
> }
>
> +/*
> + * __gfn_to_memslot() and its descendants are here to allow arch code to inline
> + * the lookups in hot paths. gfn_to_memslot() itself isn't here as an inline
> + * because that would bloat other code too much.
> + */
> static inline struct kvm_memory_slot *
> __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> {
> - return __gfn_to_memslot_approx(slots, gfn, false);
> + return ____gfn_to_memslot(slots, gfn, false);
> }
>
> static inline unsigned long
>

Looks sensible, will apply your proposed changes.

Thanks,
Maciej

2021-10-20 19:00:07

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 07/13] KVM: Just resync arch fields when slots_arch_lock gets reacquired

On 20.10.2021 20:57, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
>> On 20.10.2021 01:55, Sean Christopherson wrote:
>>> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>>> This should probably be a memcpy(), I don't know what all shenanigans the compiler
>>> can throw at us if it gets to copy a struct by value.
>>
>> Normally, copy-assignment of a struct is a safe operation (this is purely
>> an internal kernel struct, so there are no worries about padding leakage
>> to the userspace), but can replace this with a memcpy().
>
> I was more worried about the compiler using SIMD instructions. I assume the kernel
> build process has lots of guards in place to prevent such shenanigans, but on the
> other hand I _know_ mempcy() is safe :-)
>

So we will play safe and use mempcy() then :)

Thanks,
Maciej

2021-10-20 19:02:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 07/13] KVM: Just resync arch fields when slots_arch_lock gets reacquired

On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
> On 20.10.2021 01:55, Sean Christopherson wrote:
> > On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> > This should probably be a memcpy(), I don't know what all shenanigans the compiler
> > can throw at us if it gets to copy a struct by value.
>
> Normally, copy-assignment of a struct is a safe operation (this is purely
> an internal kernel struct, so there are no worries about padding leakage
> to the userspace), but can replace this with a memcpy().

I was more worried about the compiler using SIMD instructions. I assume the kernel
build process has lots of guards in place to prevent such shenanigans, but on the
other hand I _know_ mempcy() is safe :-)

2021-10-20 19:05:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
> On 20.10.2021 00:24, Sean Christopherson wrote:
> > E.g. the whole thing can be
> >
> > if (!kvm->arch.n_requested_mmu_pages &&
> > (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
> > unsigned long nr_mmu_pages;
> >
> > if (change == KVM_MR_CREATE) {
> > kvm->arch.n_memslots_pages += new->npages;
> > } else {
> > WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> > kvm->arch.n_memslots_pages -= old->npages;
> > }
> >
> > nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
> > nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
>
> The above line will set nr_mmu_pages to zero since KVM_PERMILLE_MMU_PAGES
> is 20, so when integer-divided by 1000 will result in a multiplication
> coefficient of zero.

Ugh, math. And thus do_div() to avoid the whole 64-bit divide issue on 32-bit KVM.
Bummer.

2021-10-20 20:01:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] KVM: Scalable memslots implementation

On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
> On 20.10.2021 00:07, Sean Christopherson wrote:
> I have always used the chronological order but your argument about
> reviewers being able to quickly see the delta makes sense - will switch
> to having the latest changes on the top in the next version.
>
> By the way, looking at the current https://lore.kernel.org/lkml/ at the
> time I am writing this, while most of v3+ submissions are indeed
> using the "latest on the top" order, some aren't:
> https://lore.kernel.org/lkml/[email protected]/T/
> https://lore.kernel.org/lkml/[email protected]/T/
> https://lore.kernel.org/lkml/YW%2Fq70dLyF+YudyF@T590/T/ (this one uses a
> hybrid approach - current version changes on the top, remaining changeset
> in chronological order).

Some people are heathens that have yet to be enlightened. Rest assured I'll do
plenty of prosthelytizing should they ever post to arch/x86/kvm ;-)

2021-10-20 22:43:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 08/13] KVM: Resolve memslot ID via a hash table instead of via a static array

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> @@ -1251,12 +1251,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
> if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
> atomic_set(&slots->last_used_slot, 0);
>
> - for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
> + for (i = oldslot - mslots; i < slots->used_slots; i++) {
> + hash_del(&mslots[i].id_node);
> mslots[i] = mslots[i + 1];
> - slots->id_to_index[mslots[i].id] = i;
> + hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> }
> + hash_del(&mslots[i].id_node);
> mslots[i] = *memslot;
> - slots->id_to_index[memslot->id] = -1;

Aha! This code has been bugging and I finally figured out why. Fundamentally,
this is identical to kvm_memslot_move_backward(), the only difference being that
the _backward() variant has a second "stop" condition.

But yet this is subtly different in the hash manipulations because performs the
final node deletion (which is a random node, that may not be the target node!)
outside of the loop, whereas _backward() deletes the target node before the loop.

I like the _backward() variant a lot more. And if this code is converted to that
approach, i.e.

for (i = oldslot - mslots; i < slots->used_slots; i++) {
hash_del(&mslots[i + 1].id_node);
mslots[i] = mslots[i + 1];
hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
}

then all three loops fit a similar pattern and we can extract the node craziness
into a helper. I know this mostly goes away in the end, but I think/hope it will
make the future patches easier to follow this there's only ever one "replace"
helper.

For convenience, with the s/mmemslot/oldslot and comment changes.

---
virt/kvm/kvm_main.c | 63 ++++++++++++++++++++++++++-------------------
1 file changed, 37 insertions(+), 26 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 50597608d085..6f5298bc7710 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1231,6 +1231,23 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
return 0;
}

+static void kvm_memslot_replace(struct kvm_memslots *slots, int dst, int src)
+{
+ struct kvm_memory_slot *mslots = slots->memslots;
+
+ /*
+ * Remove the source from the hash list, copying the hlist_node data
+ * would corrupt the list.
+ */
+ hash_del(&mslots[src].id_node);
+
+ /* Copy the source *data*, not the pointer, to the destination. */
+ mslots[dst] = mslots[src];
+
+ /* Re-add the memslot to the list using the destination's node. */
+ hash_add(slots->id_hash, &mslots[dst].id_node, mslots[dst].id);
+}
+
/*
* Delete a memslot by decrementing the number of used slots and shifting all
* other entries in the array forward one spot.
@@ -1251,12 +1268,16 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
atomic_set(&slots->last_used_slot, 0);

- for (i = oldslot - mslots; i < slots->used_slots; i++) {
- hash_del(&mslots[i].id_node);
- mslots[i] = mslots[i + 1];
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
- }
- hash_del(&mslots[i].id_node);
+ /*
+ * Remove the to-be-deleted memslot from the list _before_ shifting
+ * the trailing memslots forward, its data will be overwritten.
+ * Defer the (somewhat pointless) copying of the memslot until after
+ * the last slot has been shifted to avoid overwriting said last slot.
+ */
+ hash_del(&oldslot->id_node);
+
+ for (i = oldslot - mslots; i < slots->used_slots; i++)
+ kvm_memslot_replace(slots, i, i + 1);
mslots[i] = *memslot;
}

@@ -1282,39 +1303,32 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
struct kvm_memory_slot *memslot)
{
struct kvm_memory_slot *mslots = slots->memslots;
- struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
+ struct kvm_memory_slot *oldslot = id_to_memslot(slots, memslot->id);
int i;

- if (!mmemslot || !slots->used_slots)
+ if (!oldslot || !slots->used_slots)
return -1;

/*
- * The loop below will (possibly) overwrite the target memslot with
- * data of the next memslot, or a similar loop in
- * kvm_memslot_move_forward() will overwrite it with data of the
- * previous memslot.
- * Then update_memslots() will unconditionally overwrite and re-add
- * it to the hash table.
- * That's why the memslot has to be first removed from the hash table
- * here.
+ * Delete the slot from the hash table before sorting the remaining
+ * slots, the slot's data may be overwritten when copying slots as part
+ * of the sorting proccess. update_memslots() will unconditionally
+ * rewrite the entire slot and re-add it to the hash table.
*/
- hash_del(&mmemslot->id_node);
+ hash_del(&oldslot->id_node);

/*
* Move the target memslot backward in the array by shifting existing
* memslots with a higher GFN (than the target memslot) towards the
* front of the array.
*/
- for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
+ for (i = oldslot - mslots; i < slots->used_slots - 1; i++) {
if (memslot->base_gfn > mslots[i + 1].base_gfn)
break;

WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);

- /* Shift the next memslot forward one and update its index. */
- hash_del(&mslots[i + 1].id_node);
- mslots[i] = mslots[i + 1];
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
+ kvm_memslot_replace(slots, i, i + 1);
}
return i;
}
@@ -1343,10 +1357,7 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,

WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);

- /* Shift the next memslot back one and update its index. */
- hash_del(&mslots[i - 1].id_node);
- mslots[i] = mslots[i - 1];
- hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
+ kvm_memslot_replace(slots, i, i - 1);
}
return i;
}
--

2021-10-21 00:09:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 12/13] KVM: Optimize gfn lookup in kvm_zap_gfn_range()

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:

Some mechanical comments while they're on my mind, I'll get back to a full review
tomorrow.

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6433efff447a..9ae5f7341cf5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -833,6 +833,75 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
> return NULL;
> }
>
> +static inline
> +struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)

Function attributes should go on the same line as the function unless there's a
really good reason to do otherwise.

In this case, I would honestly just drop the helper. It's really hard to express
what this function does in a name that isn't absurdly long, and there's exactly
one user at the end of the series.

https://lkml.kernel.org/r/[email protected]

> +{
> + int idx = slots->node_idx;
> + struct rb_node *node, *result = NULL;
> +
> + for (node = slots->gfn_tree.rb_node; node; ) {
> + struct kvm_memory_slot *slot;

My personal preference is to put declarations outside of the for loop. I find it
easier to read, it's harder to have shadowing issues if all variables are declared
at the top, especially when using relatively generic names.

> +
> + slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
> + if (gfn < slot->base_gfn) {
> + result = node;
> + node = node->rb_left;
> + } else

Needs braces since the "if" has braces.

> + node = node->rb_right;
> + }
> +
> + return result;
> +}
> +
> +static inline
> +struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)

The kvm_for_each_in_gfn prefix is _really_ confusing. I get that these are all
helpers for "kvm_for_each_memslot...", but it's hard not to think these are all
iterators on their own. I would gladly sacrifice namespacing for readability in
this case.

I also wouldn't worry about capturing the details. For most folks reading this
code, the important part is understanding the control flow of
kvm_for_each_memslot_in_gfn_range(). Capturing the under-the-hood details in the
name isn't a priority since anyone modifying this code is going to have to do a
lot of staring no matter what :-)

> +static inline
> +bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
> +{
> + struct kvm_memory_slot *memslot;
> +
> + memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
> +
> + /*
> + * If this slot starts beyond or at the end of the range so does
> + * every next one
> + */
> + return memslot->base_gfn >= end;
> +}
> +
> +/* Iterate over each memslot *possibly* intersecting [start, end) range */
> +#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
> + for (node = kvm_for_each_in_gfn_first(slots, start); \
> + node && !kvm_for_each_in_gfn_no_more(slots, node, end); \

I think it makes sense to move the NULL check into the validation helper? We had
a similar case in KVM's legacy MMU where a "null" check was left to the caller,
and it ended up with a bunch of redundant and confusing code. I don't see that
happening here, but at the same time it's odd for the validator to not sanity
check @node.

> + node = rb_next(node)) \

It's silly, but I'd add a wrapper for this one, just to make it easy to follow
the control flow.

Maybe this as delta? I'm definitely not set on the names, was just trying to
find something that's short and to the point.

---
include/linux/kvm_host.h | 60 +++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9ae5f7341cf5..a88bd5d9e4aa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -833,36 +833,29 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
return NULL;
}

-static inline
-struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)
+static inline struct rb_node *kvm_get_first_node(struct kvm_memslots *slots,
+ gfn_t start)
{
+ struct kvm_memory_slot *slot;
+ struct rb_node *node, *tmp;
int idx = slots->node_idx;
- struct rb_node *node, *result = NULL;
-
- for (node = slots->gfn_tree.rb_node; node; ) {
- struct kvm_memory_slot *slot;
-
- slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
- if (gfn < slot->base_gfn) {
- result = node;
- node = node->rb_left;
- } else
- node = node->rb_right;
- }
-
- return result;
-}
-
-static inline
-struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)
-{
- struct rb_node *node;

/*
* Find the slot with the lowest gfn that can possibly intersect with
* the range, so we'll ideally have slot start <= range start
*/
- node = kvm_memslots_gfn_upper_bound(slots, start);
+ node = NULL;
+ for (tmp = slots->gfn_tree.rb_node; tmp; ) {
+
+ slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
+ if (gfn < slot->base_gfn) {
+ node = tmp;
+ tmp = tmp->rb_left;
+ } else {
+ tmp = tmp->rb_right;
+ }
+ }
+
if (node) {
struct rb_node *pnode;

@@ -882,12 +875,16 @@ struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t star
return node;
}

-static inline
-bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
+static inline bool kvm_is_last_node(struct kvm_memslots *slots,
+ struct rb_node *node, gfn_t end)
{
struct kvm_memory_slot *memslot;

- memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
+ if (!node)
+ return true;
+
+ memslot = container_of(node, struct kvm_memory_slot,
+ gfn_node[slots->node_idx]);

/*
* If this slot starts beyond or at the end of the range so does
@@ -896,11 +893,16 @@ bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *nod
return memslot->base_gfn >= end;
}

+static inline bool kvm_get_next_node(struct rb_node *node)
+{
+ return rb_next(node)
+}
+
/* Iterate over each memslot *possibly* intersecting [start, end) range */
#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
- for (node = kvm_for_each_in_gfn_first(slots, start); \
- node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
- node = rb_next(node)) \
+ for (node = kvm_get_first_node(slots, start); \
+ !kvm_is_last_node(slots, node, end); \
+ node = kvm_get_next_node(node)) \

/*
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
--

2021-10-21 14:17:47

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 08/13] KVM: Resolve memslot ID via a hash table instead of via a static array

On 21.10.2021 00:39, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> @@ -1251,12 +1251,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>> if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
>> atomic_set(&slots->last_used_slot, 0);
>>
>> - for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
>> + for (i = oldslot - mslots; i < slots->used_slots; i++) {
>> + hash_del(&mslots[i].id_node);
>> mslots[i] = mslots[i + 1];
>> - slots->id_to_index[mslots[i].id] = i;
>> + hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>> }
>> + hash_del(&mslots[i].id_node);
>> mslots[i] = *memslot;
>> - slots->id_to_index[memslot->id] = -1;
>
> Aha! This code has been bugging and I finally figured out why. Fundamentally,
> this is identical to kvm_memslot_move_backward(), the only difference being that
> the _backward() variant has a second "stop" condition.
>
> But yet this is subtly different in the hash manipulations because performs the
> final node deletion (which is a random node, that may not be the target node!)

Technically, it's not truly "random" node that's deleted since it's
simply the one belonging to the last slot in the old array (which slot
will now be moved to the second to last position in the array).
But I get your overall point here.

> outside of the loop, whereas _backward() deletes the target node before the loop.
>
> I like the _backward() variant a lot more. And if this code is converted to that
> approach, i.e.
>
> for (i = oldslot - mslots; i < slots->used_slots; i++) {
> hash_del(&mslots[i + 1].id_node);
> mslots[i] = mslots[i + 1];
> hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> }
>
> then all three loops fit a similar pattern and we can extract the node craziness
> into a helper. I know this mostly goes away in the end, but I think/hope it will
> make the future patches easier to follow this there's only ever one "replace"
> helper.
>
> For convenience, with the s/mmemslot/oldslot and comment changes.

The change in this patch was supposed to be minimally-invasive (since
it's getting replaced by patch 11 anyway), but since you have already
refactored it then I will update the patch with your proposed change.

Thanks,
Maciej

> ---
> virt/kvm/kvm_main.c | 63 ++++++++++++++++++++++++++-------------------
> 1 file changed, 37 insertions(+), 26 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 50597608d085..6f5298bc7710 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1231,6 +1231,23 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
> return 0;
> }
>
> +static void kvm_memslot_replace(struct kvm_memslots *slots, int dst, int src)
> +{
> + struct kvm_memory_slot *mslots = slots->memslots;
> +
> + /*
> + * Remove the source from the hash list, copying the hlist_node data
> + * would corrupt the list.
> + */
> + hash_del(&mslots[src].id_node);
> +
> + /* Copy the source *data*, not the pointer, to the destination. */
> + mslots[dst] = mslots[src];
> +
> + /* Re-add the memslot to the list using the destination's node. */
> + hash_add(slots->id_hash, &mslots[dst].id_node, mslots[dst].id);
> +}
> +
> /*
> * Delete a memslot by decrementing the number of used slots and shifting all
> * other entries in the array forward one spot.
> @@ -1251,12 +1268,16 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
> if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
> atomic_set(&slots->last_used_slot, 0);
>
> - for (i = oldslot - mslots; i < slots->used_slots; i++) {
> - hash_del(&mslots[i].id_node);
> - mslots[i] = mslots[i + 1];
> - hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> - }
> - hash_del(&mslots[i].id_node);
> + /*
> + * Remove the to-be-deleted memslot from the list _before_ shifting
> + * the trailing memslots forward, its data will be overwritten.
> + * Defer the (somewhat pointless) copying of the memslot until after
> + * the last slot has been shifted to avoid overwriting said last slot.
> + */
> + hash_del(&oldslot->id_node);
> +
> + for (i = oldslot - mslots; i < slots->used_slots; i++)
> + kvm_memslot_replace(slots, i, i + 1);
> mslots[i] = *memslot;
> }
>
> @@ -1282,39 +1303,32 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
> struct kvm_memory_slot *memslot)
> {
> struct kvm_memory_slot *mslots = slots->memslots;
> - struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
> + struct kvm_memory_slot *oldslot = id_to_memslot(slots, memslot->id);
> int i;
>
> - if (!mmemslot || !slots->used_slots)
> + if (!oldslot || !slots->used_slots)
> return -1;
>
> /*
> - * The loop below will (possibly) overwrite the target memslot with
> - * data of the next memslot, or a similar loop in
> - * kvm_memslot_move_forward() will overwrite it with data of the
> - * previous memslot.
> - * Then update_memslots() will unconditionally overwrite and re-add
> - * it to the hash table.
> - * That's why the memslot has to be first removed from the hash table
> - * here.
> + * Delete the slot from the hash table before sorting the remaining
> + * slots, the slot's data may be overwritten when copying slots as part
> + * of the sorting proccess. update_memslots() will unconditionally
> + * rewrite the entire slot and re-add it to the hash table.
> */
> - hash_del(&mmemslot->id_node);
> + hash_del(&oldslot->id_node);
>
> /*
> * Move the target memslot backward in the array by shifting existing
> * memslots with a higher GFN (than the target memslot) towards the
> * front of the array.
> */
> - for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
> + for (i = oldslot - mslots; i < slots->used_slots - 1; i++) {
> if (memslot->base_gfn > mslots[i + 1].base_gfn)
> break;
>
> WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
>
> - /* Shift the next memslot forward one and update its index. */
> - hash_del(&mslots[i + 1].id_node);
> - mslots[i] = mslots[i + 1];
> - hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> + kvm_memslot_replace(slots, i, i + 1);
> }
> return i;
> }
> @@ -1343,10 +1357,7 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
>
> WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
>
> - /* Shift the next memslot back one and update its index. */
> - hash_del(&mslots[i - 1].id_node);
> - mslots[i] = mslots[i - 1];
> - hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
> + kvm_memslot_replace(slots, i, i - 1);
> }
> return i;
> }
> --
>

2021-10-21 14:18:16

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 12/13] KVM: Optimize gfn lookup in kvm_zap_gfn_range()

On 21.10.2021 01:47, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>
> Some mechanical comments while they're on my mind, I'll get back to a full review
> tomorrow.
>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 6433efff447a..9ae5f7341cf5 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -833,6 +833,75 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>> return NULL;
>> }
>>
>> +static inline
>> +struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)
>
> Function attributes should go on the same line as the function unless there's a
> really good reason to do otherwise.

Here the reason was a long line length, which was 84 characters even with
function attributes moved to a separate line.

include/linux/kvm_host.h contains a lot of helpers written in a similar
style:
> static inline gfn_t
> hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)

> In this case, I would honestly just drop the helper. It's really hard to express
> what this function does in a name that isn't absurdly long, and there's exactly
> one user at the end of the series.

The "upper bound" is a common name for a binary search operation that
finds the first node that has its key strictly greater than the searched key.

It can be integrated into its caller but I would leave a comment there
describing what kind of operation that block of code does to aid in
understanding the code.

Although, to be honest, I don't quite get the reason for doing this
considering that you want to put a single "rb_next()" call into its own
helper for clarity below.

> https://lkml.kernel.org/r/[email protected]
>
>> +{
>> + int idx = slots->node_idx;
>> + struct rb_node *node, *result = NULL;
>> +
>> + for (node = slots->gfn_tree.rb_node; node; ) {
>> + struct kvm_memory_slot *slot;
>
> My personal preference is to put declarations outside of the for loop. I find it
> easier to read, it's harder to have shadowing issues if all variables are declared
> at the top, especially when using relatively generic names.

Will do.

>> +
>> + slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
>> + if (gfn < slot->base_gfn) {
>> + result = node;
>> + node = node->rb_left;
>> + } else
>
> Needs braces since the "if" has braces.

Will add them.

>> + node = node->rb_right;
>> + }
>> +
>> + return result;
>> +}
>> +
>> +static inline
>> +struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)
>
> The kvm_for_each_in_gfn prefix is _really_ confusing. I get that these are all
> helpers for "kvm_for_each_memslot...", but it's hard not to think these are all
> iterators on their own. I would gladly sacrifice namespacing for readability in
> this case.

"kvm_for_each_memslot_in_gfn_range" was your proposed name here:
https://lore.kernel.org/kvm/[email protected]/

But no problem renaming it.

> I also wouldn't worry about capturing the details. For most folks reading this
> code, the important part is understanding the control flow of
> kvm_for_each_memslot_in_gfn_range(). Capturing the under-the-hood details in the
> name isn't a priority since anyone modifying this code is going to have to do a
> lot of staring no matter what :-)

It's even better if somebody modifying complex code has to read it carefully
(within reason, obviously).

>> +static inline
>> +bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
>> +{
>> + struct kvm_memory_slot *memslot;
>> +
>> + memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
>> +
>> + /*
>> + * If this slot starts beyond or at the end of the range so does
>> + * every next one
>> + */
>> + return memslot->base_gfn >= end;
>> +}
>> +
>> +/* Iterate over each memslot *possibly* intersecting [start, end) range */
>> +#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
>> + for (node = kvm_for_each_in_gfn_first(slots, start); \
>> + node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
>
> I think it makes sense to move the NULL check into the validation helper? We had
> a similar case in KVM's legacy MMU where a "null" check was left to the caller,
> and it ended up with a bunch of redundant and confusing code. I don't see that
> happening here, but at the same time it's odd for the validator to not sanity
> check @node.
>

Will do.

>> + node = rb_next(node)) \
>
> It's silly, but I'd add a wrapper for this one, just to make it easy to follow
> the control flow.
>
> Maybe this as delta? I'm definitely not set on the names, was just trying to
> find something that's short and to the point.

Thanks for the proposed patch, I added some comments inline below.

> ---
> include/linux/kvm_host.h | 60 +++++++++++++++++++++-------------------
> 1 file changed, 31 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9ae5f7341cf5..a88bd5d9e4aa 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -833,36 +833,29 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
> return NULL;
> }
>
> -static inline
> -struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)
> +static inline struct rb_node *kvm_get_first_node(struct kvm_memslots *slots,
> + gfn_t start)
> {
> + struct kvm_memory_slot *slot;
> + struct rb_node *node, *tmp;
> int idx = slots->node_idx;
> - struct rb_node *node, *result = NULL;
> -
> - for (node = slots->gfn_tree.rb_node; node; ) {
> - struct kvm_memory_slot *slot;
> -
> - slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
> - if (gfn < slot->base_gfn) {
> - result = node;
> - node = node->rb_left;
> - } else
> - node = node->rb_right;
> - }
> -
> - return result;
> -}
> -
> -static inline
> -struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)
> -{
> - struct rb_node *node;
>
> /*
> * Find the slot with the lowest gfn that can possibly intersect with
> * the range, so we'll ideally have slot start <= range start
> */
> - node = kvm_memslots_gfn_upper_bound(slots, start);
> + node = NULL;
> + for (tmp = slots->gfn_tree.rb_node; tmp; ) {
> +
> + slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);

Here -------------------------------^ should be "tmp", not "node" since
you renamed "node" into "tmp" and "result" into "node" while integrating
this function into its caller.

> + if (gfn < slot->base_gfn) {
> + node = tmp;
> + tmp = tmp->rb_left;
> + } else {
> + tmp = tmp->rb_right;
> + }
> + }
> +
> if (node) {
> struct rb_node *pnode;
>
> @@ -882,12 +875,16 @@ struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t star
> return node;
> }
>
> -static inline
> -bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
> +static inline bool kvm_is_last_node(struct kvm_memslots *slots,
> + struct rb_node *node, gfn_t end)

kvm_is_last_node() is a bit misleading since this function is supposed
to return true even on the last node, only returning false one node past
the last one (or when the tree runs out of nodes).

> {
> struct kvm_memory_slot *memslot;
>
> - memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
> + if (!node)
> + return true;
> +
> + memslot = container_of(node, struct kvm_memory_slot,
> + gfn_node[slots->node_idx]);

You previously un-wrapped such lines, like for example in
https://lore.kernel.org/kvm/[email protected]/ :
>> + slot = container_of(node, struct kvm_memory_slot,
>> + gfn_node[idxactive]);
>
> With 'idx', this can go on a single line. It runs over by two chars, but the 80
> char limit is a soft limit, and IMO avoiding line breaks for things like this
> improves readability.


>
> /*
> * If this slot starts beyond or at the end of the range so does
> @@ -896,11 +893,16 @@ bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *nod
> return memslot->base_gfn >= end;
> }
>
> +static inline bool kvm_get_next_node(struct rb_node *node)
> +{
> + return rb_next(node)
> +}
> +
> /* Iterate over each memslot *possibly* intersecting [start, end) range */
> #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
> - for (node = kvm_for_each_in_gfn_first(slots, start); \
> - node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
> - node = rb_next(node)) \
> + for (node = kvm_get_first_node(slots, start); \
> + !kvm_is_last_node(slots, node, end); \
> + node = kvm_get_next_node(node)) \
>
> /*
> * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
> --
>

Thanks,
Maciej

2021-10-21 16:35:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 12/13] KVM: Optimize gfn lookup in kvm_zap_gfn_range()

On Thu, Oct 21, 2021, Maciej S. Szmigiero wrote:
> On 21.10.2021 01:47, Sean Christopherson wrote:
> > In this case, I would honestly just drop the helper. It's really hard to express
> > what this function does in a name that isn't absurdly long, and there's exactly
> > one user at the end of the series.
>
> The "upper bound" is a common name for a binary search operation that
> finds the first node that has its key strictly greater than the searched key.

Ah, that I did not know (obviously). But I suspect that detail will be lost on
other readers as well, even if they are familiar with the terminology.

> It can be integrated into its caller but I would leave a comment there
> describing what kind of operation that block of code does to aid in
> understanding the code.

Yeah, completely agree a comment would be wonderful.

> Although, to be honest, I don't quite get the reason for doing this
> considering that you want to put a single "rb_next()" call into its own
> helper for clarity below.

The goal is to make the macro itself easy to understand, even if the reader may
not understand the underlying details. The bare rb_next() forces the reader to
pause to think about exactly what "node" is, and perhaps even dive into the code
for the other helpers.

With something like this, a reader that doesn't know the memslots details can
get a good idea of the basic gist of the macro without having to even know the
type of "node". Obviously someone writing code will need to know the type, but
for readers bouncing around code it's a detail they don't need to know.

#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
for (node = kvm_get_first_node(slots, start); \
!kvm_is_valid_node(slots, node, end); \
node = kvm_get_next_node(node))

Hmm, on that point, having the caller do

memslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);

is more than a bit odd, and as is the case with the bare rb_next(), bleeds
implementation details into code that really doesn't care about implementation
details. Eww, and looking closer, the caller also needs to grab slots->node_idx.

So while I would love to avoid an opaque iterator, adding one would be a net
positive in this case. E.g.

/* Iterator used for walking memslots that overlap a gfn range. */
struct kvm_memslot_iterator iter {
struct rb_node *node;
struct kvm_memory_slot *memslot;
struct kvm_memory_slots *slots;
gfn_t start;
gfn_t end;
}

static inline void kvm_memslot_iter_start(struct kvm_memslot_iter *iter,
struct kvm_memslots *slots,
gfn_t start, gfn_t end)
{
...
}

static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter)
{
/*
* If this slot starts beyond or at the end of the range so does
* every next one
*/
return iter->node && iter->memslot->base_gfn < end;
}

static inline void kvm_memslot_iter_next(struct kvm_memslot_iter *iter)
{
iter->node = rb_next(iter->node);

if (!iter->node)
return;

iter->memslot = container_of(iter->node, struct kvm_memory_slot,
gfn_node[iter->slots->node_idx]);
}

/* Iterate over each memslot *possibly* intersecting [start, end) range */
#define kvm_for_each_memslot_in_gfn_range(iter, node, slots, start, end) \
for (kvm_memslot_iter_start(iter, node, slots, start, end); \
kvm_memslot_iter_is_valid(iter); \
kvm_memslot_iter_next(node)) \


Ugh, this got me looking at kvm_zap_gfn_range(), and that thing is trainwreck.
There are three calls kvm_flush_remote_tlbs_with_address(), two of which should
be unnecessary, but become necessary because the last one is broken. *sigh*

That'd also be a good excuse to extract the rmap loop to a separate helper. Then
you don't need to constantly juggle the 80 char limit and variable collisions
while you're modifying this mess. I'll post the attached patches separately
since the first one (two?) should go into 5.15. They're compile tested only
at this point, but hopefully I've had enough coffee and they're safe to base
this series on top (note, they're based on kvm/queue, commit 73f122c4f06f ("KVM:
cleanup allocation of rmaps and page tracking data").

> > The kvm_for_each_in_gfn prefix is _really_ confusing. I get that these are all
> > helpers for "kvm_for_each_memslot...", but it's hard not to think these are all
> > iterators on their own. I would gladly sacrifice namespacing for readability in
> > this case.
>
> "kvm_for_each_memslot_in_gfn_range" was your proposed name here:
> https://lore.kernel.org/kvm/[email protected]/
>
> But no problem renaming it.

Oh, I was commenting on the inner helpers. The macro name itself is great. ;-)

> > @@ -882,12 +875,16 @@ struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t star
> > return node;
> > }
> >
> > -static inline
> > -bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
> > +static inline bool kvm_is_last_node(struct kvm_memslots *slots,
> > + struct rb_node *node, gfn_t end)
>
> kvm_is_last_node() is a bit misleading since this function is supposed
> to return true even on the last node, only returning false one node past
> the last one (or when the tree runs out of nodes).

Good point. I didn't love the name when I suggested either. What about
kvm_is_valid_node()?

> > {
> > struct kvm_memory_slot *memslot;
> >
> > - memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
> > + if (!node)
> > + return true;
> > +
> > + memslot = container_of(node, struct kvm_memory_slot,
> > + gfn_node[slots->node_idx]);
>
> You previously un-wrapped such lines, like for example in
> https://lore.kernel.org/kvm/[email protected]/ :

Heh, yes, the balance between "too long" and "hard to read" is subjective. The
ones I usually let run over are cases where it's a short word on the end, the
overrun is only a couple chars, the statement is the sole line of an if/else
statement, there's a null/validity check immediately following etc...

All that said, I don't have a strong opinion on this one, I'm a-ok if you want to
let it run over.

> > > + slot = container_of(node, struct kvm_memory_slot,
> > > + gfn_node[idxactive]);
> >
> > With 'idx', this can go on a single line. It runs over by two chars, but the 80
> > char limit is a soft limit, and IMO avoiding line breaks for things like this
> > improves readability.
>
>
> >
> > /*
> > * If this slot starts beyond or at the end of the range so does
> > @@ -896,11 +893,16 @@ bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *nod
> > return memslot->base_gfn >= end;
> > }
> >
> > +static inline bool kvm_get_next_node(struct rb_node *node)
> > +{
> > + return rb_next(node)
> > +}
> > +
> > /* Iterate over each memslot *possibly* intersecting [start, end) range */
> > #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
> > - for (node = kvm_for_each_in_gfn_first(slots, start); \
> > - node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
> > - node = rb_next(node)) \
> > + for (node = kvm_get_first_node(slots, start); \
> > + !kvm_is_last_node(slots, node, end); \
> > + node = kvm_get_next_node(node)) \
> >
> > /*
> > * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
> > --
> >
>
> Thanks,
> Maciej


Attachments:
(No filename) (7.46 kB)
0001-KVM-x86-mmu-Drop-a-redundant-broken-remote-TLB-flush.patch (1.56 kB)
0002-KVM-x86-mmu-Drop-a-redundant-remote-TLB-flush-in-kvm.patch (1.43 kB)
0003-KVM-x86-mmu-Extract-zapping-of-rmaps-for-gfn-range-t.patch (2.97 kB)
Download all attachments

2021-10-21 21:48:50

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 12/13] KVM: Optimize gfn lookup in kvm_zap_gfn_range()

On 21.10.2021 18:30, Sean Christopherson wrote:
> On Thu, Oct 21, 2021, Maciej S. Szmigiero wrote:
>> On 21.10.2021 01:47, Sean Christopherson wrote:
>>> In this case, I would honestly just drop the helper. It's really hard to express
>>> what this function does in a name that isn't absurdly long, and there's exactly
>>> one user at the end of the series.
>>
>> The "upper bound" is a common name for a binary search operation that
>> finds the first node that has its key strictly greater than the searched key.
>
> Ah, that I did not know (obviously). But I suspect that detail will be lost on
> other readers as well, even if they are familiar with the terminology.
>
>> It can be integrated into its caller but I would leave a comment there
>> describing what kind of operation that block of code does to aid in
>> understanding the code.
>
> Yeah, completely agree a comment would be wonderful.

????

>> Although, to be honest, I don't quite get the reason for doing this
>> considering that you want to put a single "rb_next()" call into its own
>> helper for clarity below.
>
> The goal is to make the macro itself easy to understand, even if the reader may
> not understand the underlying details. The bare rb_next() forces the reader to
> pause to think about exactly what "node" is, and perhaps even dive into the code
> for the other helpers.
>
> With something like this, a reader that doesn't know the memslots details can
> get a good idea of the basic gist of the macro without having to even know the
> type of "node". Obviously someone writing code will need to know the type, but
> for readers bouncing around code it's a detail they don't need to know.
>
> #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
> for (node = kvm_get_first_node(slots, start); \
> !kvm_is_valid_node(slots, node, end); \
> node = kvm_get_next_node(node))
>
> Hmm, on that point, having the caller do
>
> memslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
>
> is more than a bit odd, and as is the case with the bare rb_next(), bleeds
> implementation details into code that really doesn't care about implementation
> details. Eww, and looking closer, the caller also needs to grab slots->node_idx.
>
> So while I would love to avoid an opaque iterator, adding one would be a net
> positive in this case. E.g.
>
> /* Iterator used for walking memslots that overlap a gfn range. */
> struct kvm_memslot_iterator iter {
> struct rb_node *node;
> struct kvm_memory_slot *memslot;
> struct kvm_memory_slots *slots;
> gfn_t start;
> gfn_t end;
> }
>
> static inline void kvm_memslot_iter_start(struct kvm_memslot_iter *iter,
> struct kvm_memslots *slots,
> gfn_t start, gfn_t end)
> {
> ...
> }
>
> static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter)
> {
> /*
> * If this slot starts beyond or at the end of the range so does
> * every next one
> */
> return iter->node && iter->memslot->base_gfn < end;
> }
>
> static inline void kvm_memslot_iter_next(struct kvm_memslot_iter *iter)
> {
> iter->node = rb_next(iter->node);
>
> if (!iter->node)
> return;
>
> iter->memslot = container_of(iter->node, struct kvm_memory_slot,
> gfn_node[iter->slots->node_idx]);
> }
>
> /* Iterate over each memslot *possibly* intersecting [start, end) range */
> #define kvm_for_each_memslot_in_gfn_range(iter, node, slots, start, end) \
> for (kvm_memslot_iter_start(iter, node, slots, start, end); \
> kvm_memslot_iter_is_valid(iter); \
> kvm_memslot_iter_next(node)) \
>

The iterator-based for_each implementation looks pretty nice (love the
order and consistency that higher-level abstractions bring to code) -
will change the code to use iterators instead.

It also solves the kvm_is_valid_node() naming issue below.

> Ugh, this got me looking at kvm_zap_gfn_range(), and that thing is trainwreck.
> There are three calls kvm_flush_remote_tlbs_with_address(), two of which should
> be unnecessary, but become necessary because the last one is broken. *sigh*
>
> That'd also be a good excuse to extract the rmap loop to a separate helper. Then
> you don't need to constantly juggle the 80 char limit and variable collisions
> while you're modifying this mess. I'll post the attached patches separately
> since the first one (two?) should go into 5.15. They're compile tested only
> at this point, but hopefully I've had enough coffee and they're safe to base
> this series on top (note, they're based on kvm/queue, commit 73f122c4f06f ("KVM:
> cleanup allocation of rmaps and page tracking data").

All right, will make sure that a respin is based on a kvm tree with these
commits in.

>>> The kvm_for_each_in_gfn prefix is _really_ confusing. I get that these are all
>>> helpers for "kvm_for_each_memslot...", but it's hard not to think these are all
>>> iterators on their own. I would gladly sacrifice namespacing for readability in
>>> this case.
>>
>> "kvm_for_each_memslot_in_gfn_range" was your proposed name here:
>> https://lore.kernel.org/kvm/[email protected]/
>>
>> But no problem renaming it.
>
> Oh, I was commenting on the inner helpers. The macro name itself is great. ;-)
>
>>> @@ -882,12 +875,16 @@ struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t star
>>> return node;
>>> }
>>>
>>> -static inline
>>> -bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
>>> +static inline bool kvm_is_last_node(struct kvm_memslots *slots,
>>> + struct rb_node *node, gfn_t end)
>>
>> kvm_is_last_node() is a bit misleading since this function is supposed
>> to return true even on the last node, only returning false one node past
>> the last one (or when the tree runs out of nodes).
>
> Good point. I didn't love the name when I suggested either. What about
> kvm_is_valid_node()?

kvm_is_valid_node() sounds a bit too generic for me, but since we rewrite
the code to be iterator-based this issue goes away.

Thanks,
Maciej

2021-10-27 05:38:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] KVM: Use interval tree to do fast hva lookup in memslots

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> The current memslots implementation only allows quick binary search by gfn,
> quick lookup by hva is not possible - the implementation has to do a linear
> scan of the whole memslots array, even though the operation being performed
> might apply just to a single memslot.
>
> This significantly hurts performance of per-hva operations with higher
> memslot counts.
>
> Since hva ranges can overlap between memslots an interval tree is needed
> for tracking them.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 50597608d085..7ed780996910 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -472,6 +472,12 @@ static void kvm_null_fn(void)
> }
> #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
>
> +/* Iterate over each memslot intersecting [start, last] (inclusive) range */
> +#define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
> + for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
> + node; \
> + node = interval_tree_iter_next(node, start, last)) \

Similar to kvm_for_each_memslot_in_gfn_range(), this should use an opaque iterator
to hide the implementation details from the caller, e.g. to avoid having to define
a "struct interval_tree_node" and do container_of.

2021-10-27 07:17:04

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] KVM: Use interval tree to do fast hva lookup in memslots

On 26.10.2021 20:19, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> The current memslots implementation only allows quick binary search by gfn,
>> quick lookup by hva is not possible - the implementation has to do a linear
>> scan of the whole memslots array, even though the operation being performed
>> might apply just to a single memslot.
>>
>> This significantly hurts performance of per-hva operations with higher
>> memslot counts.
>>
>> Since hva ranges can overlap between memslots an interval tree is needed
>> for tracking them.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 50597608d085..7ed780996910 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -472,6 +472,12 @@ static void kvm_null_fn(void)
>> }
>> #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
>>
>> +/* Iterate over each memslot intersecting [start, last] (inclusive) range */
>> +#define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
>> + for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
>> + node; \
>> + node = interval_tree_iter_next(node, start, last)) \
>
> Similar to kvm_for_each_memslot_in_gfn_range(), this should use an opaque iterator
> to hide the implementation details from the caller, e.g. to avoid having to define
> a "struct interval_tree_node" and do container_of.
>

Will convert to an iterator-based for_each implementation in the next version
of this patchset.

Thanks,
Maciej

2021-10-27 07:20:36

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> Do a quick lookup for possibly overlapping gfns when creating or moving
> a memslot instead of performing a linear scan of the whole memslot set.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> virt/kvm/kvm_main.c | 36 +++++++++++++++++++++++++++---------
> 1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5fea467d6fec..78dad8c6376f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1667,6 +1667,30 @@ static int kvm_delete_memslot(struct kvm *kvm,
> return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
> }
>
> +static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
> + struct kvm_memory_slot *nslot)
> +{
> + int idx = slots->node_idx;
> + gfn_t nend = nslot->base_gfn + nslot->npages;
> + struct rb_node *node;
> +
> + kvm_for_each_memslot_in_gfn_range(node, slots, nslot->base_gfn, nend) {
> + struct kvm_memory_slot *cslot;
> + gfn_t cend;
> +
> + cslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
> + cend = cslot->base_gfn + cslot->npages;
> + if (cslot->id == nslot->id)
> + continue;
> +
> + /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
> + if (cend > nslot->base_gfn)

Hmm, IMO the need for this check means that kvm_for_each_memslot_in_gfn_range()
is flawed. The user of kvm_for_each...() should not be responsible for skipping
memslots that do not actually overlap the requested range. I.e. this function
should be no more than:

static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
struct kvm_memory_slot *slot)
{
gfn_t start = slot->base_gfn;
gfn_t end = start + slot->npages;

kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
if (iter.slot->id != slot->id)
return true;
}

return false;
}


and I suspect kvm_zap_gfn_range() could be further simplified as well.

Looking back at the introduction of the helper, its comment's highlighting of
"possibily" now makes sense.

/* Iterate over each memslot *possibly* intersecting [start, end) range */
#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \

That's an unnecessarily bad API. It's a very solvable problem for the iterator
helpers to advance until there's actually overlap, not doing so violates the
principle of least surprise, and unless I'm missing something, there's no use
case for an "approximate" iteration.

> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * Allocate some memory and give it an address in the guest physical address
> * space.
> @@ -1752,16 +1776,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> }
>
> if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
> - int bkt;
> -
> /* Check for overlaps */

This comment can be dropped, the new function is fairly self-documenting.

> - kvm_for_each_memslot(tmp, bkt, __kvm_memslots(kvm, as_id)) {
> - if (tmp->id == id)
> - continue;
> - if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
> - (new.base_gfn >= tmp->base_gfn + tmp->npages)))
> - return -EEXIST;
> - }
> + if (kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id),
> + &new))

And then with the comment dropped, the wrap can be avoided by folding the check
into the outer if statement, e.g.

if (((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) &&
kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id), &new))
return -EEXIST;

> + return -EEXIST;
> }
>
> /* Allocate/free page dirty bitmap as needed */

2021-10-27 13:54:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] KVM: Keep memslots in tree-based structures instead of array-based ones

On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
> @@ -1607,68 +1506,145 @@ static int kvm_set_memslot(struct kvm *kvm,
> + if (change != KVM_MR_CREATE) {
> /*
> - * The arch-specific fields of the memslots could have changed
> - * between releasing the slots_arch_lock in
> - * install_new_memslots and here, so get a fresh copy of these
> - * fields.
> + * The arch-specific fields of the memslot could have changed
> + * between reading them and taking slots_arch_lock in one of two
> + * places above.
> + * That includes old and new which were read in __kvm_set_memory_region.
> */
> - kvm_copy_memslots_arch(slots, __kvm_memslots(kvm, as_id));
> + old->arch = new->arch = slotina->arch = slotact->arch;

Fudge. This subtly and silently fixes an existing bug where @old and @new can
have stale arch specific data due to x86's godawful slots_arch_lock behavior.
If a flags-only update collides with alloc_all_memslots_rmaps(), @old and @new
may have stale (NULL) data if the rmaps activation happens after the old slot is
snapshotted.

It can be fixed by doing exactly this, but that is so, so gross (not your fault
at all, I'm complaining about the existing mess).

I think we can opportunistically prep for this series to make the end result
(mostly this patch), a bit cleaner while fixing that snafu. Specifically, I
think I see a path to avoiding bikeshedding slotina, slotact, etc...

I'll get a series for the fix posted tomorrow, and hopefully reply with my thoughts
for this patch too.

2021-10-27 21:26:47

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

On 26.10.2021 20:59, Sean Christopherson wrote:
> On Mon, Sep 20, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> Do a quick lookup for possibly overlapping gfns when creating or moving
>> a memslot instead of performing a linear scan of the whole memslot set.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> virt/kvm/kvm_main.c | 36 +++++++++++++++++++++++++++---------
>> 1 file changed, 27 insertions(+), 9 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 5fea467d6fec..78dad8c6376f 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1667,6 +1667,30 @@ static int kvm_delete_memslot(struct kvm *kvm,
>> return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
>> }
>>
>> +static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
>> + struct kvm_memory_slot *nslot)
>> +{
>> + int idx = slots->node_idx;
>> + gfn_t nend = nslot->base_gfn + nslot->npages;
>> + struct rb_node *node;
>> +
>> + kvm_for_each_memslot_in_gfn_range(node, slots, nslot->base_gfn, nend) {
>> + struct kvm_memory_slot *cslot;
>> + gfn_t cend;
>> +
>> + cslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
>> + cend = cslot->base_gfn + cslot->npages;
>> + if (cslot->id == nslot->id)
>> + continue;
>> +
>> + /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
>> + if (cend > nslot->base_gfn)
>
> Hmm, IMO the need for this check means that kvm_for_each_memslot_in_gfn_range()
> is flawed. The user of kvm_for_each...() should not be responsible for skipping
> memslots that do not actually overlap the requested range. I.e. this function
> should be no more than:
>
> static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
> struct kvm_memory_slot *slot)
> {
> gfn_t start = slot->base_gfn;
> gfn_t end = start + slot->npages;
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> if (iter.slot->id != slot->id)
> return true;
> }
>
> return false;
> }
>
>
> and I suspect kvm_zap_gfn_range() could be further simplified as well.
>
> Looking back at the introduction of the helper, its comment's highlighting of
> "possibily" now makes sense.
>
> /* Iterate over each memslot *possibly* intersecting [start, end) range */
> #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
>
> That's an unnecessarily bad API. It's a very solvable problem for the iterator
> helpers to advance until there's actually overlap, not doing so violates the
> principle of least surprise, and unless I'm missing something, there's no use
> case for an "approximate" iteration.

In principle this can be done, however this will complicate the gfn
iterator logic - especially the kvm_memslot_iter_start() part, which
will already get messier from open-coding kvm_memslots_gfn_upper_bound()
there.

At the same kvm_zap_gfn_range() will still need to do the memslot range
<-> request range merging by itself as it does not want to process the
whole returned memslot, but rather just the part that's actually
overlapping its requested range.

In the worst case, the current code can return one memslot too much, so
I don't think it's worth bringing additional complexity just to detect
and skip it - it's not that uncommon to design an API that needs extra
checking from its caller to cover some corner cases.

For example, see pthread_cond_wait() or kernel waitqueues with their
spurious wakeups or atomic_compare_exchange_weak() from C11.
And these are higher level APIs than a very limited internal KVM one
with just two callers.
In case of kvm_zap_gfn_range() the necessary checking is already
there and has to be kept due to the above range merging.

Also, a code that is simpler is easier to understand, maintain and
so less prone to subtle bugs.

>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> /*
>> * Allocate some memory and give it an address in the guest physical address
>> * space.
>> @@ -1752,16 +1776,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>> }
>>
>> if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
>> - int bkt;
>> -
>> /* Check for overlaps */
>
> This comment can be dropped, the new function is fairly self-documenting.

Will drop it.

>> - kvm_for_each_memslot(tmp, bkt, __kvm_memslots(kvm, as_id)) {
>> - if (tmp->id == id)
>> - continue;
>> - if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
>> - (new.base_gfn >= tmp->base_gfn + tmp->npages)))
>> - return -EEXIST;
>> - }
>> + if (kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id),
>> + &new))
>
> And then with the comment dropped, the wrap can be avoided by folding the check
> into the outer if statement, e.g.
>
> if (((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) &&
> kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id), &new))
> return -EEXIST;
>

Will fold it.

Thanks,
Maciej

2021-10-27 23:57:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] KVM: Keep memslots in tree-based structures instead of array-based ones

On Wed, Oct 27, 2021, Sean Christopherson wrote:
> I'll get a series for the fix posted tomorrow, and hopefully reply with my thoughts
> for this patch too.

Knew I should have hedged. I got a series compiling, but it's completed untested
otheriwse. I do have input for this patch though.

The cleanups I have prepped allow for a few things:

1. Dropping the copy of the old memslot, thus freeing up "old" to be a pointer
to the actual old memslot.

2. Folding the existing kvm_delete_memslot() into __kvm_set_memory_region() to
free up the name for a helper to kvm_set_memslot().

3. Handling the dirty bitmap and memslot freeing in generic versions of prepare
and commit. This is related to #1.

And then for this patch, there is a hard invariant that we can rely on to make
the code easier to follow, and that is: KVM can never directly modify a slot that
is in the active memslots tree(s). That invariant means there is no need to track
the active vs. inactive memslots, because all helpers simply retrieve the inactive
memslot on-demand.

And doling out the work to helpers also mostly eliminates the need to track the
inactive vs. active slot. There's still some slightly twisty logic, especially for
the MOVE case (which nobody uses, *sigh*). If we really want, even that mess can
be cleaned up by pre-allocating an extra memslot, but I don't think it's worth the
effort so long as things are well documented.

Anyway, the twisty logic is mostly a matter of documentation; IMO it's actually
advantageous to not swap pointers as the different naming and behavior highlights
the differences between each type of change, especially with respect to side
effects.

I'll post the full series tomorrow (this part is unlikely to be tested, but the
prep work will be tested). Below is the meat of kvm_set_memslot(). Again,
compile tested only, there's a 99.999999% chance it's buggy.

As for naming, "working" was the least awful name I could come up with. I also
considered "scratch" and "tmp". I'm definitely open to other names.

And finally, ignore my comments on the other patch about using memcpy(). KVM
copies memslot structs all over the place without memcpy(), i.e. I was being
paranoid and/or delusional :-)

static void kvm_invalidate_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
struct kvm_memory_slot *working_slot)
{
/*
* Mark the current slot INVALID. As with all memslot modifications,
* this must be done on an unreachable slot to avoid modifying the
* current slot in the active tree.
*/
kvm_copy_memslot(working_slot, old);
working_slot->flags |= KVM_MEMSLOT_INVALID;
kvm_replace_memslot(kvm, old, working_slot);

/*
* Activate the slot that is now marked INVALID, but don't propagate
* the slot to the now inactive slots. The slot is either going to be
* deleted or recreated as a new slot.
*/
kvm_swap_active_memslots(kvm, old->as_id);

/*
* From this point no new shadow pages pointing to a deleted, or moved,
* memslot will be created. Validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, old);

/* Was released by kvm_swap_active_memslots, reacquire. */
mutex_lock(&kvm->slots_arch_lock);

/*
* Copy the arch-specific field of the newly-installed slot back to the
* old slot as the arch data could have changed between releasing
* slots_arch_lock in install_new_memslots() and re-acquiring the lock
* above. Writers are required to retrieve memslots *after* acquiring
* slots_arch_lock, thus the active slot's data is guaranteed to be fresh.
*/
old->arch = working_slot->arch;
}

static void kvm_create_memslot(struct kvm *kvm,
const struct kvm_memory_slot *new,
struct kvm_memory_slot *working)
{
/*
* Add the new memslot to the inactive set as a copy of the
* new memslot data provided by userspace.
*/
kvm_copy_memslot(working, new);
kvm_replace_memslot(kvm, NULL, working);
kvm_activate_memslot(kvm, NULL, working);
}

static void kvm_delete_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
struct kvm_memory_slot *invalid_slot)
{
/*
* Remove the old memslot (in the inactive memslots) by passing NULL as
* the "new" slot.
*/
kvm_replace_memslot(kvm, old, NULL);

/* And do the same for the invalid version in the active slot. */
kvm_activate_memslot(kvm, invalid_slot, NULL);

/* Free the invalid slot, the caller will clean up the old slot. */
kfree(invalid_slot);
}

static struct kvm_memory_slot *kvm_move_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
struct kvm_memory_slot *invalid_slot)
{
struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, old->as_id);

/*
* The memslot's gfn is changing, remove it from the inactive tree, it
* will be re-added with its updated gfn. Because its range is
* changing, an in-place replace is not possible.
*/
kvm_memslot_gfn_erase(slots, old);

/*
* The old slot is now fully disconnected, reuse its memory for the
* persistent copy of "new".
*/
kvm_copy_memslot(old, new);

/* Re-add to the gfn tree with the updated gfn */
kvm_memslot_gfn_insert(slots, old);

/* Replace the current INVALID slot with the updated memslot. */
kvm_activate_memslot(kvm, invalid_slot, old);

/*
* Clear the INVALID flag so that the invalid_slot is now a perfect
* copy of the old slot. Return it for cleanup in the caller.
*/
WARN_ON_ONCE(!(invalid_slot->flags & KVM_MEMSLOT_INVALID));
invalid_slot->flags &= ~KVM_MEMSLOT_INVALID;
return invalid_slot;
}

static void kvm_update_flags_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
struct kvm_memory_slot *working_slot)
{
/*
* Similar to the MOVE case, but the slot doesn't need to be
* zapped as an intermediate step. Instead, the old memslot is
* simply replaced with a new, updated copy in both memslot sets.
*
* Since the memslot gfn is unchanged, kvm_copy_replace_memslot()
* and kvm_memslot_gfn_replace() can be used to switch the node
* in the gfn tree instead of removing the old and inserting the
* new as two separate operations. Replacement is a single O(1)
* operation versus two O(log(n)) operations for remove+insert.
*/
kvm_copy_memslot(working_slot, new);
kvm_replace_memslot(kvm, old, working_slot);
kvm_activate_memslot(kvm, old, working_slot);
}


static int kvm_set_memslot(struct kvm *kvm,
struct kvm_memory_slot *old,
struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
struct kvm_memory_slot *working;
int r;

/*
* Modifications are done on an unreachable slot. Any changes are then
* (eventually) propagated to both the active and inactive slots. This
* allocation would ideally be on-demand (in helpers), but is done here
* to avoid having to handle failure after kvm_prepare_memory_region().
*/
working = kzalloc(sizeof(*working), GFP_KERNEL_ACCOUNT);
if (!working)
return -ENOMEM;

/*
* Released in kvm_swap_active_memslots.
*
* Must be held from before the current memslots are copied until
* after the new memslots are installed with rcu_assign_pointer,
* then released before the synchronize srcu in kvm_swap_active_memslots.
*
* When modifying memslots outside of the slots_lock, must be held
* before reading the pointer to the current memslots until after all
* changes to those memslots are complete.
*
* These rules ensure that installing new memslots does not lose
* changes made to the previous memslots.
*/
mutex_lock(&kvm->slots_arch_lock);

/*
* Invalidate the old slot if it's being deleted or moved. This is
* done prior to actually deleting/moving the memslot to allow vCPUs to
* continue running by ensuring there are no mappings or shadow pages
* for the memslot when it is deleted/moved. Without pre-invalidation
* (and without a lock), a window would exist between effecting the
* delete/move and committing the changes in arch code where KVM or a
* guest could access a non-existent memslot.
*/
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
kvm_invalidate_memslot(kvm, old, working);

r = kvm_prepare_memory_region(kvm, old, new, change);
if (r) {
/*
* For DELETE/MOVE, revert the above INVALID change. No
* modifications required since the original slot was preserved
* in the inactive slots. Changing the active memslots also
* releases slots_arch_lock.
*/
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
kvm_activate_memslot(kvm, working, old);
else
mutex_unlock(&kvm->slots_arch_lock);
kfree(working);
return r;
}

/*
* For DELETE and MOVE, the working slot is now active as the INVALID
* version of the old slot. MOVE is particularly special as it reuses
* the old slot and returns a copy of the old slot (in working_slot).
* For CREATE, there is no old slot. For DELETE and FLAGS_ONLY, the
* old slot is detached but otherwise preserved.
*/
if (change == KVM_MR_CREATE)
kvm_create_memslot(kvm, new, working);
else if (change == KVM_MR_DELETE)
kvm_delete_memslot(kvm, old, working);
if (change == KVM_MR_MOVE)
old = kvm_move_memslot(kvm, old, new, working);
else if (change == KVM_MR_FLAGS_ONLY)
kvm_update_flags_memslot(kvm, old, new, working);
else
BUG();

/*
* No need to refresh new->arch, changes after dropping slots_arch_lock
* will directly hit the final, active memsot. Architectures are
* responsible for knowing that new->arch may be stale.
*/
kvm_commit_memory_region(kvm, old, new, change);

return 0;
}

2021-10-28 17:56:18

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

On Wed, Oct 27, 2021, Maciej S. Szmigiero wrote:
> On 26.10.2021 20:59, Sean Christopherson wrote:
> > > + /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
> > > + if (cend > nslot->base_gfn)
> >
> > Hmm, IMO the need for this check means that kvm_for_each_memslot_in_gfn_range()
> > is flawed. The user of kvm_for_each...() should not be responsible for skipping
> > memslots that do not actually overlap the requested range. I.e. this function
> > should be no more than:
> >
> > static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
> > struct kvm_memory_slot *slot)
> > {
> > gfn_t start = slot->base_gfn;
> > gfn_t end = start + slot->npages;
> >
> > kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > if (iter.slot->id != slot->id)
> > return true;
> > }
> >
> > return false;
> > }
> >
> >
> > and I suspect kvm_zap_gfn_range() could be further simplified as well.
> >
> > Looking back at the introduction of the helper, its comment's highlighting of
> > "possibily" now makes sense.
> >
> > /* Iterate over each memslot *possibly* intersecting [start, end) range */
> > #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
> >
> > That's an unnecessarily bad API. It's a very solvable problem for the iterator
> > helpers to advance until there's actually overlap, not doing so violates the
> > principle of least surprise, and unless I'm missing something, there's no use
> > case for an "approximate" iteration.
>
> In principle this can be done, however this will complicate the gfn
> iterator logic - especially the kvm_memslot_iter_start() part, which
> will already get messier from open-coding kvm_memslots_gfn_upper_bound()
> there.

Hmm, no, this is trivial to handle, though admittedly a bit unpleasant.

/*
* Note, kvm_memslot_iter_start() finds the first memslot that _may_ overlap
* the range, it does not verify that there is actual overlap. The check in
* the loop body filters out the case where the highest memslot with a base_gfn
* below start doesn't actually overlap.
*/
#define kvm_for_each_memslot_in_gfn_range(iter, node, slots, start, end) \
for (kvm_memslot_iter_start(iter, node, slots, start, end); \
kvm_memslot_iter_is_valid(iter); \
kvm_memslot_iter_next(node)) \
if (iter->slot->base_gfn + iter->slot->npages < start) { \
} else



> At the same kvm_zap_gfn_range() will still need to do the memslot range
> <-> request range merging by itself as it does not want to process the
> whole returned memslot, but rather just the part that's actually
> overlapping its requested range.

That's purely coincidental though. IMO, kvm_zap_gfn_range() would be well within
its rights to sanity the memslot, e.g.

if (WARN_ON(memslot->base_gfn + memslot->npages < gfn_start))
continue;

> In the worst case, the current code can return one memslot too much, so
> I don't think it's worth bringing additional complexity just to detect
> and skip it

I strongly disagree. This is very much a case of one chunk of code that knows
the internal details of what it's doing taking on all the pain and complexity
so that users of the helper

> it's not that uncommon to design an API that needs extra checking from its
> caller to cover some corner cases.

That doesn't mean it's desirable.

> For example, see pthread_cond_wait() or kernel waitqueues with their
> spurious wakeups or atomic_compare_exchange_weak() from C11.
> And these are higher level APIs than a very limited internal KVM one
> with just two callers.

Two _existing_ callers. Odds are very, very high that future usage of
kvm_for_each_memslot_in_gfn_range() will overlook the detail about the helper
not actually doing what it says it does. That could be addressed to some extent
by renaming it kvm_for_each_memslot_in_gfn_range_approx() or whatever, but as
above this isn't difficult to handle, just gross.

> In case of kvm_zap_gfn_range() the necessary checking is already
> there and has to be kept due to the above range merging.
>
> Also, a code that is simpler is easier to understand, maintain and
> so less prone to subtle bugs.

Heh, and IMO that's an argument for putting all the complexity into a single
location. :-)

2021-10-28 22:24:18

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 11/13] KVM: Keep memslots in tree-based structures instead of array-based ones

On Wed, Oct 27, 2021, Sean Christopherson wrote:
> And doling out the work to helpers also mostly eliminates the need to track the
> inactive vs. active slot. There's still some slightly twisty logic, especially for
> the MOVE case (which nobody uses, *sigh*). If we really want, even that mess can
> be cleaned up by pre-allocating an extra memslot, but I don't think it's worth the
> effort so long as things are well documented.

Aha! I figured out how to do this in a way that reduces complexity and fixes the
wart of "new" being an on-stack object, which has always bugged me but wasn't
really fixable given the old/current memslots approach of duplicating the entire
array. I belive it "needs" to be done as cleanup on top, i.e. what I've proposed
here would still exist as an intermediate step, but the final result would be clean.

Rather than reuse a single working slot, dynamically allocate "new" except in the
DELETE case. That way a "working" slot isn't needed for CREATE or FLAGS_ONLY since
they simply replace "old" (NULL for CREATE) with "new".

It requires an extra allocation for MOVE, but (a) it's only a single memslot and
(b) AFAIK no real VMM actually uses MOVE, and the benefit is that it significantly
reduces the weirdness for the code as a whole. And even MOVE is cleaned up because
it more clearly replace old and invalid_slot with new, as opposed to doing a weird
dance where it uses the old slot as the "working" slot.

Assuming my approach actually works, I really like the end result. I got waylaid
yet again on other stuff, so I may not get a series posted today. *sigh*

2021-10-29 16:26:26

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

On 28.10.2021 19:53, Sean Christopherson wrote:
> On Wed, Oct 27, 2021, Maciej S. Szmigiero wrote:
>> On 26.10.2021 20:59, Sean Christopherson wrote:
>>>> + /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
>>>> + if (cend > nslot->base_gfn)
>>>
>>> Hmm, IMO the need for this check means that kvm_for_each_memslot_in_gfn_range()
>>> is flawed. The user of kvm_for_each...() should not be responsible for skipping
>>> memslots that do not actually overlap the requested range. I.e. this function
>>> should be no more than:
>>>
>>> static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
>>> struct kvm_memory_slot *slot)
>>> {
>>> gfn_t start = slot->base_gfn;
>>> gfn_t end = start + slot->npages;
>>>
>>> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
>>> if (iter.slot->id != slot->id)
>>> return true;
>>> }
>>>
>>> return false;
>>> }
>>>
>>>
>>> and I suspect kvm_zap_gfn_range() could be further simplified as well.
>>>
>>> Looking back at the introduction of the helper, its comment's highlighting of
>>> "possibily" now makes sense.
>>>
>>> /* Iterate over each memslot *possibly* intersecting [start, end) range */
>>> #define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
>>>
>>> That's an unnecessarily bad API. It's a very solvable problem for the iterator
>>> helpers to advance until there's actually overlap, not doing so violates the
>>> principle of least surprise, and unless I'm missing something, there's no use
>>> case for an "approximate" iteration.
>>
>> In principle this can be done, however this will complicate the gfn
>> iterator logic - especially the kvm_memslot_iter_start() part, which
>> will already get messier from open-coding kvm_memslots_gfn_upper_bound()
>> there.
>
> Hmm, no, this is trivial to handle, though admittedly a bit unpleasant.
>
> /*
> * Note, kvm_memslot_iter_start() finds the first memslot that _may_ overlap
> * the range, it does not verify that there is actual overlap. The check in
> * the loop body filters out the case where the highest memslot with a base_gfn
> * below start doesn't actually overlap.
> */
> #define kvm_for_each_memslot_in_gfn_range(iter, node, slots, start, end) \
> for (kvm_memslot_iter_start(iter, node, slots, start, end); \
> kvm_memslot_iter_is_valid(iter); \
> kvm_memslot_iter_next(node)) \
> if (iter->slot->base_gfn + iter->slot->npages < start) { \
> } else

As you say, that's rather unpleasant, since we know that the first
returned memslot is the only one that's possibly *not* overlapping
(and then it only happens sometimes).
Yet with the above change we'll pay the price of this check for every
loop iteration (for every returned memslot).
That's definitely not optimizing for the most common case.

Also, the above code has a bug - using a [start, end) notation compatible
with what kvm_for_each_memslot_in_gfn_range() expects, where [1, 4)
means a range consisting of { 1, 2, 3 }, consider a tree consisting of the
following two memslots: [1, 2), [3, 5)

When kvm_for_each_memslot_in_gfn_range() is then called to "return"
memslots overlapping range [2, 4) it will "return" the [1, 2) memslot, too -
even though it does *not* actually overlap the requested range.

While this bug is easy to fix (just use "<=" instead of "<") it serves to
underline that one has to be very careful with working with this type of
code as it is very easy to introduce subtle mistakes here.

Here, how many lines of code a function or macro has is not a good proxy
metric for how hard is to build a strictly accurate mental model of it.

>> At the same kvm_zap_gfn_range() will still need to do the memslot range
>> <-> request range merging by itself as it does not want to process the
>> whole returned memslot, but rather just the part that's actually
>> overlapping its requested range.
>
> That's purely coincidental though. IMO, kvm_zap_gfn_range() would be well within
> its rights to sanity the memslot, e.g.
>
> if (WARN_ON(memslot->base_gfn + memslot->npages < gfn_start))
> continue;
>
>> In the worst case, the current code can return one memslot too much, so
>> I don't think it's worth bringing additional complexity just to detect
>> and skip it
>
> I strongly disagree. This is very much a case of one chunk of code that knows
> the internal details of what it's doing taking on all the pain and complexity
> so that users of the helper
>
>> it's not that uncommon to design an API that needs extra checking from its
>> caller to cover some corner cases.
>
> That doesn't mean it's desirable.
>
>> For example, see pthread_cond_wait() or kernel waitqueues with their
>> spurious wakeups or atomic_compare_exchange_weak() from C11.
>> And these are higher level APIs than a very limited internal KVM one
>> with just two callers.
>
> Two _existing_ callers. Odds are very, very high that future usage of
> kvm_for_each_memslot_in_gfn_range() will overlook the detail about the helper
> not actually doing what it says it does. That could be addressed to some extent
> by renaming it kvm_for_each_memslot_in_gfn_range_approx() or whatever, but as
> above this isn't difficult to handle, just gross.

What kind of future users of this API do you envision?

I've pointed out above that adding this extra check means essentially
optimizing for an uncommon case.

One of the callers of this function already has the necessary code to
reject non-overlapping memslots and have to keep it to calculate the
effective overlapping range for each memslot.
For the second caller (which, by the way, is called much less often than
kvm_zap_gfn_range()) it's a matter of one extra condition.

>> In case of kvm_zap_gfn_range() the necessary checking is already
>> there and has to be kept due to the above range merging.
>>
>> Also, a code that is simpler is easier to understand, maintain and
>> so less prone to subtle bugs.
>
> Heh, and IMO that's an argument for putting all the complexity into a single
> location. :-)
>

If you absolutely insist then obviously I can change the code to return
only memslots strictly overlapping the requested range in the next
patchset version.

However, I do want to know that this will be the final logic here,
since I am concerned that yet another bug might slip in, this time
under my radar, too.

Thanks,
Maciej

2021-10-30 00:34:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 13/13] KVM: Optimize overlapping memslots check

On Fri, Oct 29, 2021, Maciej S. Szmigiero wrote:
> On 28.10.2021 19:53, Sean Christopherson wrote:
> > Hmm, no, this is trivial to handle, though admittedly a bit unpleasant.
> >
> > /*
> > * Note, kvm_memslot_iter_start() finds the first memslot that _may_ overlap
> > * the range, it does not verify that there is actual overlap. The check in
> > * the loop body filters out the case where the highest memslot with a base_gfn
> > * below start doesn't actually overlap.
> > */
> > #define kvm_for_each_memslot_in_gfn_range(iter, node, slots, start, end) \
> > for (kvm_memslot_iter_start(iter, node, slots, start, end); \
> > kvm_memslot_iter_is_valid(iter); \
> > kvm_memslot_iter_next(node)) \
> > if (iter->slot->base_gfn + iter->slot->npages < start) { \
> > } else
>
> As you say, that's rather unpleasant, since we know that the first
> returned memslot is the only one that's possibly *not* overlapping
> (and then it only happens sometimes).
> Yet with the above change we'll pay the price of this check for every
> loop iteration (for every returned memslot).

I'm definitely not saying that it's the best/right/only way to handle this,
merely pointing out that it's not _that_ complex, modulo off-by-one bugs :-)

> That's definitely not optimizing for the most common case.

Meh, it's a nop for kvm_check_memslot_overlap() and completely in the noise for
kvm_zap_gfn_range(). Not saying I disagree that's a flawed way to handle this
just saying that even the quick-and-dirty solution is extremely unlikely to be
relevant to performance.

> Also, the above code has a bug - using a [start, end) notation compatible
> with what kvm_for_each_memslot_in_gfn_range() expects, where [1, 4)
> means a range consisting of { 1, 2, 3 }, consider a tree consisting of the
> following two memslots: [1, 2), [3, 5)
>
> When kvm_for_each_memslot_in_gfn_range() is then called to "return"
> memslots overlapping range [2, 4) it will "return" the [1, 2) memslot, too -
> even though it does *not* actually overlap the requested range.
>
> While this bug is easy to fix (just use "<=" instead of "<") it serves to
> underline that one has to be very careful with working with this type of
> code as it is very easy to introduce subtle mistakes here.

Yes, and that's exactly why I want to write this _once_.

> > Two _existing_ callers. Odds are very, very high that future usage of
> > kvm_for_each_memslot_in_gfn_range() will overlook the detail about the helper
> > not actually doing what it says it does. That could be addressed to some extent
> > by renaming it kvm_for_each_memslot_in_gfn_range_approx() or whatever, but as
> > above this isn't difficult to handle, just gross.
>
> What kind of future users of this API do you envision?
>
> I've pointed out above that adding this extra check means essentially
> optimizing for an uncommon case.

Usage similar to kvm_zap_gfn_range() where KVM wants to take action on a specific
gfn range. I'm actually somewhat surprised that none of the other architectures
have a use case in their MMUs, though I don't know the story for things like
shadow paging that "necessitate" x86's behavior.

> One of the callers of this function already has the necessary code to
> reject non-overlapping memslots and have to keep it to calculate the
> effective overlapping range for each memslot.
> For the second caller (which, by the way, is called much less often than
> kvm_zap_gfn_range()) it's a matter of one extra condition.
>
> > > In case of kvm_zap_gfn_range() the necessary checking is already
> > > there and has to be kept due to the above range merging.
> > >
> > > Also, a code that is simpler is easier to understand, maintain and
> > > so less prone to subtle bugs.
> >
> > Heh, and IMO that's an argument for putting all the complexity into a single
> > location. :-)
> >
>
> If you absolutely insist then obviously I can change the code to return
> only memslots strictly overlapping the requested range in the next
> patchset version.

I feel pretty strongly that the risk vs. reward is heavily in favor of returning
only strictly overlapping memslots. The risk being that a few years down the road
someone runs afoul of this and we end up with a bug in production. The reward is
avoiding writing moderately complex code and at best shave a few uops in an x86
slooow path. It's entirely possible there's never a third user, but IMO there
isn't enough reward to justify even the smallest amount of risk.

Paolo, any opinion?

2021-11-01 22:33:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On Wed, Oct 20, 2021, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
> > On 20.10.2021 00:24, Sean Christopherson wrote:
> > > E.g. the whole thing can be
> > >
> > > if (!kvm->arch.n_requested_mmu_pages &&
> > > (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
> > > unsigned long nr_mmu_pages;
> > >
> > > if (change == KVM_MR_CREATE) {
> > > kvm->arch.n_memslots_pages += new->npages;
> > > } else {
> > > WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> > > kvm->arch.n_memslots_pages -= old->npages;
> > > }
> > >
> > > nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
> > > nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
> >
> > The above line will set nr_mmu_pages to zero since KVM_PERMILLE_MMU_PAGES
> > is 20, so when integer-divided by 1000 will result in a multiplication
> > coefficient of zero.
>
> Ugh, math. And thus do_div() to avoid the whole 64-bit divide issue on 32-bit KVM.
> Bummer.

I was revisiting this today because (a) simply making n_memslots_pages a u64 doesn't
cleanly handle the case where the resulting nr_mmu_pages would wrap, (b) any fix
in that are should really go in a separate patch to fix
kvm_mmu_calculate_default_mmu_pages() and then carry that behavior forward

But as I dove deeper (and deeper), I came to the conclusion that supporting a
total number of memslot pages that doesn't fit in an unsigned long is a waste of
our time. With a 32-bit kernel, userspace can at most address 3gb of virtual
memory, whereas wrapping the total number of pages would require 4tb+ of guest
physical memory. Even with x86's second address space for SMM, that means userspace
would need to alias all of guest memory more than one _thousand_ times. And on
older hardware with MAXPHYADDR < 43, the guest couldn't actually access any of those
aliases even if userspace lied about guest.MAXPHYADDR.

So unless I'm missing something, or PPC or MIPS has some crazy way for a 32-bit
host to support 4TB of guest memory, my vote would be to explicitly disallow
creating more memslot pages than can fit in an unsigned long. Then x86 KVM could
reuse the cache nr_memslot_pages and x86's MMU wouldn't have to update a big pile
of code to support a scenario that practically speaking is useless.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 72b329e82089..acabdbdef5cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -552,6 +552,7 @@ struct kvm {
*/
struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ unsigned long nr_memslot_pages;
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8bf4b89cfa03..c63fc5c05322 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1567,6 +1567,15 @@ static void kvm_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
+ /*
+ * Update the total number of memslot pages before calling the arch
+ * hook so that architectures can consume the result directly.
+ */
+ if (change == KVM_MR_DELETE)
+ kvm->nr_memslot_pages -= old->npages;
+ else if (change == KVM_MR_CREATE)
+ kvm->nr_memslot_pages += new->npages;
+
kvm_arch_commit_memory_region(kvm, old, new, change);

/*
@@ -1738,6 +1747,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!old || !old->npages)
return -EINVAL;

+ if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
+ return -EIO;
+
memset(&new, 0, sizeof(new));
new.id = id;
new.as_id = as_id;
@@ -1756,6 +1768,13 @@ int __kvm_set_memory_region(struct kvm *kvm,

if (!old || !old->npages) {
change = KVM_MR_CREATE;
+
+ /*
+ * To simplify KVM internals, the total number of pages across
+ * all memslots must fit in an unsigned long.
+ */
+ if ((kvm->nr_memslot_pages + new.npages) < kvm->nr_memslot_pages)
+ return -EINVAL;
} else { /* Modify an existing slot. */
if ((new.userspace_addr != old->userspace_addr) ||
(new.npages != old->npages) ||

2021-11-03 12:05:49

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On 01.11.2021 23:29, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Sean Christopherson wrote:
>> On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
>>> On 20.10.2021 00:24, Sean Christopherson wrote:
>>>> E.g. the whole thing can be
>>>>
>>>> if (!kvm->arch.n_requested_mmu_pages &&
>>>> (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
>>>> unsigned long nr_mmu_pages;
>>>>
>>>> if (change == KVM_MR_CREATE) {
>>>> kvm->arch.n_memslots_pages += new->npages;
>>>> } else {
>>>> WARN_ON(kvm->arch.n_memslots_pages < old->npages);
>>>> kvm->arch.n_memslots_pages -= old->npages;
>>>> }
>>>>
>>>> nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
>>>> nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
>>>
>>> The above line will set nr_mmu_pages to zero since KVM_PERMILLE_MMU_PAGES
>>> is 20, so when integer-divided by 1000 will result in a multiplication
>>> coefficient of zero.
>>
>> Ugh, math. And thus do_div() to avoid the whole 64-bit divide issue on 32-bit KVM.
>> Bummer.
>
> I was revisiting this today because (a) simply making n_memslots_pages a u64 doesn't
> cleanly handle the case where the resulting nr_mmu_pages would wrap,

Handling this case without capping total n_memslots_pages would require
either capping memslots_pages on 32-bit KVM to make it fit in 32-bits or
changing kvm_mmu_change_mmu_pages() and all the logic further down to
accept u64's.

> (b) any fix
> in that are should really go in a separate patch to fix
> kvm_mmu_calculate_default_mmu_pages() and then carry that behavior forward
>
> But as I dove deeper (and deeper), I came to the conclusion that supporting a
> total number of memslot pages that doesn't fit in an unsigned long is a waste of
> our time. With a 32-bit kernel, userspace can at most address 3gb of virtual
> memory, whereas wrapping the total number of pages would require 4tb+ of guest
> physical memory. Even with x86's second address space for SMM, that means userspace
> would need to alias all of guest memory more than one _thousand_ times. And on
> older hardware with MAXPHYADDR < 43, the guest couldn't actually access any of those
> aliases even if userspace lied about guest.MAXPHYADDR.
>
> So unless I'm missing something, or PPC or MIPS has some crazy way for a 32-bit
> host to support 4TB of guest memory, my vote would be to explicitly disallow
> creating more memslot pages than can fit in an unsigned long. Then x86 KVM could
> reuse the cache nr_memslot_pages and x86's MMU wouldn't have to update a big pile
> of code to support a scenario that practically speaking is useless.
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 72b329e82089..acabdbdef5cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -552,6 +552,7 @@ struct kvm {
> */
> struct mutex slots_arch_lock;
> struct mm_struct *mm; /* userspace tied to this vm */
> + unsigned long nr_memslot_pages;
> struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8bf4b89cfa03..c63fc5c05322 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1567,6 +1567,15 @@ static void kvm_commit_memory_region(struct kvm *kvm,
> const struct kvm_memory_slot *new,
> enum kvm_mr_change change)
> {
> + /*
> + * Update the total number of memslot pages before calling the arch
> + * hook so that architectures can consume the result directly.
> + */
> + if (change == KVM_MR_DELETE)
> + kvm->nr_memslot_pages -= old->npages;
> + else if (change == KVM_MR_CREATE)
> + kvm->nr_memslot_pages += new->npages;
> +
> kvm_arch_commit_memory_region(kvm, old, new, change);
>
> /*
> @@ -1738,6 +1747,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> if (!old || !old->npages)
> return -EINVAL;
>
> + if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
> + return -EIO;
> +
> memset(&new, 0, sizeof(new));
> new.id = id;
> new.as_id = as_id;
> @@ -1756,6 +1768,13 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
> if (!old || !old->npages) {
> change = KVM_MR_CREATE;
> +
> + /*
> + * To simplify KVM internals, the total number of pages across
> + * all memslots must fit in an unsigned long.
> + */
> + if ((kvm->nr_memslot_pages + new.npages) < kvm->nr_memslot_pages)
> + return -EINVAL;
> } else { /* Modify an existing slot. */
> if ((new.userspace_addr != old->userspace_addr) ||
> (new.npages != old->npages) ||
>

Capping total n_memslots_pages makes sense to me to avoid the (existing)
nr_mmu_pages wraparound issue, will update the next patchset version
accordingly.

Thanks,
Maciej

2021-11-03 14:50:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On Wed, Nov 03, 2021, Maciej S. Szmigiero wrote:
> Capping total n_memslots_pages makes sense to me to avoid the (existing)
> nr_mmu_pages wraparound issue, will update the next patchset version
> accordingly.

No need to do it yourself. I have a reworked version of the series with a bunch
of cleanups before and after the meat of your series, as well non-functional changes
(hopefully) to the "Resolve memslot ID via a hash table" and "Keep memslots in
tree-based structures" to avoid all the swap() behavior and to provide better
continuity between the aforementioned patches. Unless something goes sideways in
the last few touchups, I'll get it posted today.

2021-11-03 15:43:44

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array

On 03.11.2021 15:47, Sean Christopherson wrote:
> On Wed, Nov 03, 2021, Maciej S. Szmigiero wrote:
>> Capping total n_memslots_pages makes sense to me to avoid the (existing)
>> nr_mmu_pages wraparound issue, will update the next patchset version
>> accordingly.
>
> No need to do it yourself. I have a reworked version of the series with a bunch
> of cleanups before and after the meat of your series, as well non-functional changes
> (hopefully) to the "Resolve memslot ID via a hash table" and "Keep memslots in
> tree-based structures" to avoid all the swap() behavior and to provide better
> continuity between the aforementioned patches. Unless something goes sideways in
> the last few touchups, I'll get it posted today.
>

Thanks.