From: "Maciej S. Szmigiero" <[email protected]>
While the last "5.5" version of KVM scalable memslots implementation was
merged to kvm/queue some changes from its review round are still pending.
This series contains these changes which still apply to the final form of
the code.
The changes include switching kvm_for_each_memslot_in_gfn_range() to use
iterators.
However, I've dropped the kvm_for_each_memslot_in_hva_range() rework to
use dedicated iterators, since the existing implementation already returns
only strictly overlapping memslots and already uses interval tree iterators.
The code there is also self-contained and very simple.
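For reference, that existing walk is just a thin wrapper around the generic
interval tree iterators, roughly of this shape (a simplified sketch, not
necessarily the verbatim macro):

#define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	      \
	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
	     node;							      \
	     node = interval_tree_iter_next(node, start, last))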
The code was tested on x86 with KASAN enabled and successfully booted various
guests (including nested ones, with the TDP MMU both enabled and disabled).
There were some VMX APICv warnings during these tests, but these look
related to the latest VMX PI changes rather than to the memslot handling code.
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 11 ++--
arch/x86/kvm/x86.c | 3 +-
include/linux/kvm_host.h | 104 ++++++++++++++++++++------------
virt/kvm/kvm_main.c | 25 +++-----
5 files changed, 83 insertions(+), 62 deletions(-)
From: "Maciej S. Szmigiero" <[email protected]>
With kvm->nr_memslot_pages capped at ULONG_MAX, we can't safely multiply it
by KVM_PERMILLE_MMU_PAGES (20) since this operation can overflow
an unsigned long variable.
Rewrite this "* 20 / 1000" operation as "/ 50" instead to avoid such
an overflow.
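For illustration (the values below are just an example, not part of the patch),
on a 64-bit build any nr_memslot_pages above ULONG_MAX / 20 makes the old
multiplication wrap, while dividing by 50 computes the same 2% ratio without
overflowing:

	/* Illustrative sketch only: */
	unsigned long pages = ULONG_MAX / 10;       /* above ULONG_MAX / 20 */
	unsigned long old_val = pages * 20 / 1000;  /* the multiplication wraps */
	unsigned long new_val = pages / 50;         /* same ratio, no overflow */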
Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1fcb345bc107..8cd1d254c948 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -135,7 +135,7 @@
#define KVM_HPAGE_MASK(x) (~(KVM_HPAGE_SIZE(x) - 1))
#define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
-#define KVM_PERMILLE_MMU_PAGES 20
+#define KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO 50
#define KVM_MIN_ALLOC_MMU_PAGES 64UL
#define KVM_MMU_HASH_SHIFT 12
#define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04e8dabc187d..69330b395f12 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11753,8 +11753,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
(change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
unsigned long nr_mmu_pages;
- nr_mmu_pages = kvm->nr_memslot_pages * KVM_PERMILLE_MMU_PAGES;
- nr_mmu_pages /= 1000;
+ nr_mmu_pages = kvm->nr_memslot_pages / KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO;
nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
}
From: "Maciej S. Szmigiero" <[email protected]>
Open-coding a cmpxchg()-like operation is significantly less readable than
a direct call.
Also, the open-coded version compiles to multiple instructions with
a branch on x86, instead of just a single instruction.
Since, technically, the open-coded variant didn't guarantee actual atomicity,
add a comment there, too, noting that this property isn't strictly required
in this case.
Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
virt/kvm/kvm_main.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d4399db06d49..367c1cba26d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1378,8 +1378,12 @@ static void kvm_replace_memslot(struct kvm *kvm,
hash_del(&old->id_node[idx]);
interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
- if ((long)old == atomic_long_read(&slots->last_used_slot))
- atomic_long_set(&slots->last_used_slot, (long)new);
+ /*
+ * The atomicity isn't strictly required here since we are
+ * operating on an inactive memslots set anyway.
+ */
+ atomic_long_cmpxchg(&slots->last_used_slot,
+ (unsigned long)old, (unsigned long)new);
if (!new) {
kvm_erase_gfn_node(slots, old);
From: "Maciej S. Szmigiero" <[email protected]>
Make kvm_for_each_memslot_in_gfn_range() return only memslots at least
partially intersecting the requested range.
At the same time, modify its implementation to use iterators.
This simplifies kvm_check_memslot_overlap() a bit.
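With the new helpers a caller walks only the overlapping memslots roughly like
this (an illustrative sketch, the variable names are arbitrary):

	struct kvm_memslot_iter iter;
	struct kvm_memory_slot *slot;

	kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
		slot = kvm_memslot_iter_slot(&iter);
		/* each slot returned here overlaps [start, end) at least partially */
	}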
Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 ++---
include/linux/kvm_host.h | 104 +++++++++++++++++++++++++--------------
virt/kvm/kvm_main.c | 17 ++-----
3 files changed, 75 insertions(+), 57 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8f0035517450..009eb27d34d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5715,23 +5715,22 @@ static bool __kvm_zap_rmaps(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
{
const struct kvm_memory_slot *memslot;
struct kvm_memslots *slots;
- struct rb_node *node;
+ struct kvm_memslot_iter iter;
bool flush = false;
gfn_t start, end;
- int i, idx;
+ int i;
if (!kvm_memslots_have_rmaps(kvm))
return flush;
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
- idx = slots->node_idx;
- kvm_for_each_memslot_in_gfn_range(node, slots, gfn_start, gfn_end) {
- memslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
+ memslot = kvm_memslot_iter_slot(&iter);
start = max(gfn_start, memslot->base_gfn);
end = min(gfn_end, memslot->base_gfn + memslot->npages);
- if (start >= end)
+ if (WARN_ON_ONCE(start >= end))
continue;
flush = slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3932bb091099..580d9abd97c1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -841,74 +841,104 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
return NULL;
}
-static inline
-struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots, gfn_t gfn)
+/* Iterator used for walking memslots that overlap a gfn range. */
+struct kvm_memslot_iter {
+ struct kvm_memslots *slots;
+ gfn_t end;
+ struct rb_node *node;
+};
+
+static inline void kvm_memslot_iter_start(struct kvm_memslot_iter *iter,
+ struct kvm_memslots *slots,
+ gfn_t start, gfn_t end)
{
int idx = slots->node_idx;
- struct rb_node *node, *result = NULL;
+ struct rb_node *tmp;
+ struct kvm_memory_slot *slot;
- for (node = slots->gfn_tree.rb_node; node; ) {
- struct kvm_memory_slot *slot;
+ iter->slots = slots;
+ iter->end = end;
- slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
- if (gfn < slot->base_gfn) {
- result = node;
- node = node->rb_left;
- } else
- node = node->rb_right;
+ /*
+ * Find the so called "upper bound" of a key - the first node that has
+ * its key strictly greater than the searched one (the start gfn in our case).
+ */
+ iter->node = NULL;
+ for (tmp = slots->gfn_tree.rb_node; tmp; ) {
+ slot = container_of(tmp, struct kvm_memory_slot, gfn_node[idx]);
+ if (start < slot->base_gfn) {
+ iter->node = tmp;
+ tmp = tmp->rb_left;
+ } else {
+ tmp = tmp->rb_right;
+ }
}
- return result;
-}
-
-static inline
-struct rb_node *kvm_for_each_in_gfn_first(struct kvm_memslots *slots, gfn_t start)
-{
- struct rb_node *node;
-
/*
* Find the slot with the lowest gfn that can possibly intersect with
* the range, so we'll ideally have slot start <= range start
*/
- node = kvm_memslots_gfn_upper_bound(slots, start);
- if (node) {
- struct rb_node *pnode;
-
+ if (iter->node) {
/*
* A NULL previous node means that the very first slot
* already has a higher start gfn.
* In this case slot start > range start.
*/
- pnode = rb_prev(node);
- if (pnode)
- node = pnode;
+ tmp = rb_prev(iter->node);
+ if (tmp)
+ iter->node = tmp;
} else {
/* a NULL node below means no slots */
- node = rb_last(&slots->gfn_tree);
+ iter->node = rb_last(&slots->gfn_tree);
}
- return node;
+ if (iter->node) {
+ /*
+ * It is possible in the slot start < range start case that the
+ * found slot ends before or at range start (slot end <= range start)
+ * and so it does not overlap the requested range.
+ *
+ * In such non-overlapping case the next slot (if it exists) will
+ * already have slot start > range start, otherwise the logic above
+ * would have found it instead of the current slot.
+ */
+ slot = container_of(iter->node, struct kvm_memory_slot, gfn_node[idx]);
+ if (slot->base_gfn + slot->npages <= start)
+ iter->node = rb_next(iter->node);
+ }
}
-static inline
-bool kvm_for_each_in_gfn_no_more(struct kvm_memslots *slots, struct rb_node *node, gfn_t end)
+static inline struct kvm_memory_slot *kvm_memslot_iter_slot(struct kvm_memslot_iter *iter)
+{
+ return container_of(iter->node, struct kvm_memory_slot, gfn_node[iter->slots->node_idx]);
+}
+
+static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter)
{
struct kvm_memory_slot *memslot;
- memslot = container_of(node, struct kvm_memory_slot, gfn_node[slots->node_idx]);
+ if (!iter->node)
+ return false;
+
+ memslot = kvm_memslot_iter_slot(iter);
/*
* If this slot starts beyond or at the end of the range so does
* every next one
*/
- return memslot->base_gfn >= end;
+ return memslot->base_gfn < iter->end;
+}
+
+static inline void kvm_memslot_iter_next(struct kvm_memslot_iter *iter)
+{
+ iter->node = rb_next(iter->node);
}
-/* Iterate over each memslot *possibly* intersecting [start, end) range */
-#define kvm_for_each_memslot_in_gfn_range(node, slots, start, end) \
- for (node = kvm_for_each_in_gfn_first(slots, start); \
- node && !kvm_for_each_in_gfn_no_more(slots, node, end); \
- node = rb_next(node)) \
+/* Iterate over each memslot at least partially intersecting [start, end) range */
+#define kvm_for_each_memslot_in_gfn_range(iter, slots, start, end) \
+ for (kvm_memslot_iter_start(iter, slots, start, end); \
+ kvm_memslot_iter_is_valid(iter); \
+ kvm_memslot_iter_next(iter))
/*
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 367c1cba26d2..d6503b92b3cf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1797,22 +1797,11 @@ static int kvm_set_memslot(struct kvm *kvm,
static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
gfn_t start, gfn_t end)
{
- int idx = slots->node_idx;
- struct rb_node *node;
-
- kvm_for_each_memslot_in_gfn_range(node, slots, start, end) {
- struct kvm_memory_slot *cslot;
- gfn_t cend;
+ struct kvm_memslot_iter iter;
- cslot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
- cend = cslot->base_gfn + cslot->npages;
- if (cslot->id == id)
- continue;
-
- /* kvm_for_each_in_gfn_no_more() guarantees that cslot->base_gfn < nend */
- if (cend > start)
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end)
+ if (kvm_memslot_iter_slot(&iter)->id != id)
return true;
- }
return false;
}
On 11/26/21 01:31, Maciej S. Szmigiero wrote:
> - if ((long)old == atomic_long_read(&slots->last_used_slot))
> - atomic_long_set(&slots->last_used_slot, (long)new);
> + /*
> + * The atomicity isn't strictly required here since we are
> + * operating on an inactive memslots set anyway.
> + */
> + atomic_long_cmpxchg(&slots->last_used_slot,
> + (unsigned long)old, (unsigned long)new);
I think using read/set is more readable than a comment saying that
atomicity is not required.
It's a fairly common pattern, and while I agree that it's a PITA to
write atomic_long_read and atomic_long_set, the person who reads the
code is also helped by read/set, because they know they have to think
about ownership invariants rather than concurrency invariants.
Paolo
On 11/26/21 01:31, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero"<[email protected]>
>
> While the last "5.5" version of KVM scalable memslots implementation was
> merged to kvm/queue some changes from its review round are still pending.
You can go ahead and post v6, I'll replace. However, note that I would
prefer the current form instead of patch 2's atomic_long_cmpxchg.
Paolo
On 26.11.2021 11:35, Paolo Bonzini wrote:
> On 11/26/21 01:31, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero"<[email protected]>
>>
>> While the last "5.5" version of KVM scalable memslots implementation was
>> merged to kvm/queue some changes from its review round are still pending.
>
> You can go ahead and post v6, I'll replace.
Which tree should I target then?
kvm/queue already has these commits so git refuses to rebase on top of it.
By the way, v6 will have more changes than this series since there will be
patches for intermediate forms of code, too.
> However, note that I would prefer the current form instead of patch 2's atomic_long_cmpxchg.
Understood.
> Paolo
Thanks,
Maciej
On 11/26/21 14:18, Maciej S. Szmigiero wrote:
>>
>> You can go ahead and post v6, I'll replace.
>
> Which tree should I target then?
> kvm/queue already has these commits so git refuses to rebase on top of it.
>
> By the way, v6 will have more changes than this series since there will be
> patches for intermediate forms of code, too.
I'll undo them from kvm/queue shortly, since I've now updated all my
development machines to 5.16-rc2.
Paolo
On Fri, Nov 26, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> With kvm->nr_memslot_pages capped at ULONG_MAX we can't safely multiply it
> by KVM_PERMILLE_MMU_PAGES (20) since this operation can possibly overflow
> an unsigned long variable.
>
> Rewrite this "* 20 / 1000" operation as "/ 50" instead to avoid such
> overflow.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
Reviewed-by: Sean Christopherson <[email protected]>