2011-02-22 08:07:35

by Xiao Guangrong

Subject: [PATCH 0/7] KVM: optimize memslots searching and cache GPN to GFN

I noticed that paging64/32_walk_addr_generic and gfn_to_memslot are very hot
in the kvm modules:

# perf report
# Events: 38K cycles
#
# Overhead Command Shared Object Symbol
# ........ ............... .................. ......
#
6.73% qemu-system-x86 [kernel.kallsyms] [k] native_read_tsc
4.51% qemu-system-x86 [kernel.kallsyms] [k] delay_tsc
3.93% qemu-system-x86 [kvm] [k] paging64_walk_addr_generic
3.91% qemu-system-x86 [kernel.kallsyms] [k] arch_local_irq_restore
3.28% qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
3.12% qemu-system-x86 [kernel.kallsyms] [k] lock_acquire
3.10% qemu-system-x86 [kernel.kallsyms] [k] paravirt_read_tsc
2.70% qemu-system-x86 [kernel.kallsyms] [k] copy_user_generic_string
2.30% qemu-system-x86 [kernel.kallsyms] [k] check_flags
2.15% qemu-system-x86 [kernel.kallsyms] [k] do_raw_spin_lock
2.03% qemu-system-x86 [kvm] [k] gfn_to_memslot

The aim of this patchset is to reduce the overhead of these two functions. It
does the following:
- sort the memslots by ->base_gfn and use a binary search to find the right
slot, which is implemented by patch 1 ~ patch 6 (a simplified sketch of the
search is shown below)
- cache guest page number to guest frame number translations to reduce the
number of *_walk_addr_generic calls, which is done by patch 7
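
The following is only an illustrative sketch of that binary search, written
against a made-up 'struct slot' and 'find_slot' rather than the real KVM
types; the actual kernel implementation is in patch 4:

struct slot {
        unsigned long base_gfn;         /* first guest frame in the slot */
        unsigned long npages;           /* number of pages in the slot */
};

/* the slots in sorted[] are non-overlapping and ordered by base_gfn */
static struct slot *find_slot(struct slot **sorted, int nslots,
                              unsigned long gfn)
{
        int start = 0, end = nslots;

        while (start < end) {
                int mid = (start + end) / 2;
                struct slot *s = sorted[mid];

                if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
                        return s;               /* hit */
                if (s->base_gfn < gfn)
                        start = mid + 1;        /* search the upper half */
                else
                        end = mid;              /* search the lower half */
        }

        return NULL;    /* gfn is not backed by any slot (e.g. mmio) */
}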

After the patchset:
# Events: 36K cycles
#
# Overhead Command Shared Object Symbol
# ........ ............... .................. ......
#
5.64% qemu-system-x86 [kernel.kallsyms] [k] native_read_tsc
4.27% qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
3.85% qemu-system-x86 [kernel.kallsyms] [k] delay_tsc
3.32% qemu-system-x86 [kernel.kallsyms] [k] arch_local_irq_restore
2.68% qemu-system-x86 [kernel.kallsyms] [k] paravirt_read_tsc
2.41% qemu-system-x86 [kernel.kallsyms] [k] lock_acquire
2.25% qemu-system-x86 [kernel.kallsyms] [k] copy_user_generic_string
2.14% qemu-system-x86 [kvm] [k] kvm_mmu_pte_write
1.82% qemu-system-x86 [kernel.kallsyms] [k] check_flags
1.80% qemu-system-x86 [kvm] [k] paging64_walk_addr_generic
[......]
[......]
0.81% qemu-system-x86 [kvm] [k] gfn_to_memslot

Here is the UnixBench (run with '-c 10 fsdisk') test report:
Before patchset:
2090.5
1882.5
1845.2

After vtlb patch:
2237.4
2151.8
2118.0

After memslots sorted patches:
2392.5
2485.9
2416.1


2011-02-22 08:08:39

by Xiao Guangrong

Subject: [PATCH 1/7] KVM: cleanup memslot_id function

We can get the memslot id from memslot->id directly.

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/linux/kvm_host.h | 7 ++++++-
virt/kvm/kvm_main.c | 17 -----------------
2 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab42855..b1a7430 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -365,7 +365,6 @@ pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn);
-int memslot_id(struct kvm *kvm, gfn_t gfn);
void kvm_release_pfn_dirty(pfn_t);
void kvm_release_pfn_clean(pfn_t pfn);
void kvm_set_pfn_dirty(pfn_t pfn);
@@ -388,6 +387,12 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
+
+static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
+{
+ return gfn_to_memslot(kvm, gfn)->id;
+}
+
int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1fa0d29..cf90f28 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -997,23 +997,6 @@ out:
return size;
}

-int memslot_id(struct kvm *kvm, gfn_t gfn)
-{
- int i;
- struct kvm_memslots *slots = kvm_memslots(kvm);
- struct kvm_memory_slot *memslot = NULL;
-
- for (i = 0; i < slots->nmemslots; ++i) {
- memslot = &slots->memslots[i];
-
- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- break;
- }
-
- return memslot - slots->memslots;
-}
-
static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages)
{
--
1.7.4

2011-02-22 08:09:36

by Xiao Guangrong

Subject: [PATCH 2/7] KVM: introduce KVM_MEM_SLOTS_NUM macro

Introduce the KVM_MEM_SLOTS_NUM macro to replace the open-coded
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu.c | 2 +-
include/linux/kvm_host.h | 7 +++++--
virt/kvm/kvm_main.c | 2 +-
4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37bd730..178d658 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
#define KVM_MEMORY_SLOTS 32
/* memory slots that does not exposed to userspace */
#define KVM_PRIVATE_MEM_SLOTS 4
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)

#define KVM_PIO_PAGE_OFFSET 1
#define KVM_COALESCED_MMIO_PAGE_OFFSET 2
@@ -207,7 +208,7 @@ struct kvm_mmu_page {
* One bit set per slot which has memory
* in this shadow page.
*/
- DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+ DECLARE_BITMAP(slot_bitmap, KVM_MEM_SLOTS_NUM);
bool multimapped; /* More than one parent_pte? */
bool unsync;
int root_count; /* Currently serving as active root */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b6a9963..268c891 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1056,7 +1056,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
PAGE_SIZE);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+ bitmap_zero(sp->slot_bitmap, KVM_MEM_SLOTS_NUM);
sp->multimapped = 0;
sp->parent_pte = parent_pte;
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b1a7430..865486f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -217,11 +217,14 @@ struct kvm_irq_routing_table {};

#endif

+#ifndef KVM_MEM_SLOTS_NUM
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#endif
+
struct kvm_memslots {
int nmemslots;
u64 generation;
- struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
- KVM_PRIVATE_MEM_SLOTS];
+ struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
};

struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf90f28..e3e744c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -651,7 +651,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out;
if (user_alloc && (mem->userspace_addr & (PAGE_SIZE - 1)))
goto out;
- if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+ if (mem->slot >= KVM_MEM_SLOTS_NUM)
goto out;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
goto out;
--
1.7.4

2011-02-22 08:10:23

by Xiao Guangrong

Subject: [PATCH 3/7] KVM: introduce memslots_updated function

Introduce the memslots_updated() function, which is called after the memslots
have been updated.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 15 +++++++++------
3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 17af71d..888f0b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3252,7 +3252,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
goto out;
memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
- slots->generation++;
+ memslots_updated(slots, log->slot);

old_slots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 865486f..15ff818 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -331,6 +331,7 @@ int is_error_pfn(pfn_t pfn);
int is_hwpoison_pfn(pfn_t pfn);
int is_fault_pfn(pfn_t pfn);
int kvm_is_error_hva(unsigned long addr);
+void memslots_updated(struct kvm_memslots *slots, int slot_id);
int kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
int user_alloc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e3e744c..e1c6e24 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -623,6 +623,13 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
}
#endif /* !CONFIG_S390 */

+void memslots_updated(struct kvm_memslots *slots, int slot_id)
+{
+ if (slot_id >= slots->nmemslots)
+ slots->nmemslots = slot_id + 1;
+ slots->generation++;
+}
+
/*
* Allocate some memory and give it an address in the guest physical address
* space.
@@ -768,10 +775,8 @@ skip_lpage:
if (!slots)
goto out_free;
memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
- if (mem->slot >= slots->nmemslots)
- slots->nmemslots = mem->slot + 1;
- slots->generation++;
slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
+ memslots_updated(slots, mem->slot);

old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
@@ -803,9 +808,6 @@ skip_lpage:
if (!slots)
goto out_free;
memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
- if (mem->slot >= slots->nmemslots)
- slots->nmemslots = mem->slot + 1;
- slots->generation++;

/* actual memory is freed via old in kvm_free_physmem_slot below */
if (!npages) {
@@ -816,6 +818,7 @@ skip_lpage:
}

slots->memslots[mem->slot] = new;
+ memslots_updated(slots, mem->slot);
old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
--
1.7.4

2011-02-22 08:11:31

by Xiao Guangrong

Subject: [PATCH 4/7] KVM: sort memslots and use binary search to search the right slot

Sort the memslots, then use a binary search to find the right slot, to speed
up slot searching.

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/linux/kvm_host.h | 2 +
virt/kvm/kvm_main.c | 85 ++++++++++++++++++++++++++++++++++------------
2 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 15ff818..f1963a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -223,8 +223,10 @@ struct kvm_irq_routing_table {};

struct kvm_memslots {
int nmemslots;
+ int used_slots;
u64 generation;
struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
+ struct kvm_memory_slot *slots_sort[KVM_MEM_SLOTS_NUM];
};

struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e1c6e24..0795426 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -47,6 +47,7 @@
#include <linux/srcu.h>
#include <linux/hugetlb.h>
#include <linux/slab.h>
+#include <linux/sort.h>

#include <asm/processor.h>
#include <asm/io.h>
@@ -623,11 +624,67 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
}
#endif /* !CONFIG_S390 */

+static int cmp_memslot(const void *slot1, const void *slot2)
+{
+ struct kvm_memory_slot *s1, *s2;
+
+ s1 = *(struct kvm_memory_slot **)slot1;
+ s2 = *(struct kvm_memory_slot **)slot2;
+
+ if (s1->base_gfn > s2->base_gfn)
+ return 1;
+ if (s1->base_gfn < s2->base_gfn)
+ return -1;
+
+ return 0;
+}
+
+static struct kvm_memory_slot *search_memslots(struct kvm_memslots *slots,
+ gfn_t gfn)
+{
+ int start = 0, end = slots->used_slots, idx;
+ struct kvm_memory_slot *memslot;
+
+ while (start < end) {
+ idx = (start + end) / 2;
+ memslot = slots->slots_sort[idx];
+
+ if (gfn >= memslot->base_gfn &&
+ gfn < memslot->base_gfn + memslot->npages)
+ return memslot;
+
+ if (memslot->base_gfn < gfn)
+ start = idx + 1;
+ else
+ end = idx;
+ }
+
+ return NULL;
+}
+
+static void sort_memslots(struct kvm_memslots *slots)
+{
+ int i, num = 0;
+ struct kvm_memory_slot *memslot;
+
+ for (i = 0; i < slots->nmemslots; i++) {
+ memslot = &slots->memslots[i];
+ if (!memslot->npages)
+ continue;
+
+ slots->slots_sort[num++] = memslot;
+ }
+
+ slots->used_slots = num;
+ sort(slots->slots_sort, num, sizeof(memslot), cmp_memslot, NULL);
+}
+
void memslots_updated(struct kvm_memslots *slots, int slot_id)
{
if (slot_id >= slots->nmemslots)
slots->nmemslots = slot_id + 1;
slots->generation++;
+ sort_memslots(slots);
}

/*
@@ -939,16 +996,7 @@ EXPORT_SYMBOL_GPL(kvm_is_error_hva);
static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
gfn_t gfn)
{
- int i;
-
- for (i = 0; i < slots->nmemslots; ++i) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
-
- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- return memslot;
- }
- return NULL;
+ return search_memslots(slots, gfn);
}

struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
@@ -959,20 +1007,13 @@ EXPORT_SYMBOL_GPL(gfn_to_memslot);

int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
- int i;
- struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);

- for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
-
- if (memslot->flags & KVM_MEMSLOT_INVALID)
- continue;
+ if (!memslot || memslot->id > KVM_MEMORY_SLOTS ||
+ memslot->flags & KVM_MEMSLOT_INVALID)
+ return 0;

- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- return 1;
- }
- return 0;
+ return 1;
}
EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);

--
1.7.4

2011-02-22 08:12:18

by Xiao Guangrong

Subject: [PATCH 5/7] KVM: cache the last used slot

Cache the last used slot; the hit ratio is more than 35% in my tests.

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f1963a4..7fbae16 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -225,6 +225,7 @@ struct kvm_memslots {
int nmemslots;
int used_slots;
u64 generation;
+ struct kvm_memory_slot *slot_cache;
struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
struct kvm_memory_slot *slots_sort[KVM_MEM_SLOTS_NUM];
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0795426..4917d6a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -643,15 +643,21 @@ static struct kvm_memory_slot *search_memslots(struct kvm_memslots *slots,
gfn_t gfn)
{
int start = 0, end = slots->used_slots, idx;
- struct kvm_memory_slot *memslot;
+ struct kvm_memory_slot *memslot = slots->slot_cache;
+
+ if (memslot && gfn >= memslot->base_gfn &&
+ gfn < memslot->base_gfn + memslot->npages)
+ return memslot;

while (start < end) {
idx = (start + end) / 2;
memslot = slots->slots_sort[idx];

if (gfn >= memslot->base_gfn &&
- gfn < memslot->base_gfn + memslot->npages)
+ gfn < memslot->base_gfn + memslot->npages) {
+ slots->slot_cache = memslot;
return memslot;
+ }

if (memslot->base_gfn < gfn)
start = idx + 1;
@@ -684,6 +690,7 @@ void memslots_updated(struct kvm_memslots *slots, int slot_id)
if (slot_id >= slots->nmemslots)
slots->nmemslots = slot_id + 1;
slots->generation++;
+ slots->slot_cache = NULL;
sort_memslots(slots);
}

--
1.7.4

2011-02-22 08:14:12

by Xiao Guangrong

Subject: [PATCH 6/7] KVM: cleanup traversal used slots

We can get all of the used slots from memslots->slots_sort[], so
memslots->nmemslots is only needed when the memslots are sorted. Sorting is
not a frequent operation, I think, so nmemslots can be removed.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/ia64/kvm/kvm-ia64.c | 3 +--
arch/x86/kvm/mmu.c | 9 +++++----
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 7 +++++--
virt/kvm/iommu.c | 13 +++++++------
virt/kvm/kvm_main.c | 15 +++++++--------
6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 8213efe..75b218c 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1369,8 +1369,7 @@ static void kvm_release_vm_pages(struct kvm *kvm)
unsigned long base_gfn;

slots = kvm_memslots(kvm);
- for (i = 0; i < slots->nmemslots; i++) {
- memslot = &slots->memslots[i];
+ kvm_for_each_memslot(slots, memslot, i) {
base_gfn = memslot->base_gfn;

for (j = 0; j < memslot->npages; j++) {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 268c891..0d6e7b1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -874,11 +874,11 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
int ret;
int retval = 0;
struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;

slots = kvm_memslots(kvm);

- for (i = 0; i < slots->nmemslots; i++) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
+ kvm_for_each_memslot(slots, memslot, i) {
unsigned long start = memslot->userspace_addr;
unsigned long end;

@@ -3671,11 +3671,12 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
unsigned int nr_mmu_pages;
unsigned int nr_pages = 0;
struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;

slots = kvm_memslots(kvm);

- for (i = 0; i < slots->nmemslots; i++)
- nr_pages += slots->memslots[i].npages;
+ kvm_for_each_memslot(slots, memslot, i)
+ nr_pages += memslot->npages;

nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
nr_mmu_pages = max(nr_mmu_pages,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 888f0b0..632da9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3252,7 +3252,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
goto out;
memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
- memslots_updated(slots, log->slot);
+ memslots_updated(slots);

old_slots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7fbae16..c9fd336 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -222,7 +222,6 @@ struct kvm_irq_routing_table {};
#endif

struct kvm_memslots {
- int nmemslots;
int used_slots;
u64 generation;
struct kvm_memory_slot *slot_cache;
@@ -302,6 +301,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
idx < atomic_read(&kvm->online_vcpus) && vcpup; \
vcpup = kvm_get_vcpu(kvm, ++idx))

+#define kvm_for_each_memslot(memslots, memslot, i) \
+ for (i = 0; i < (memslots)->used_slots && \
+ ({memslot = (memslots)->slots_sort[i]; 1; }); i++)
+
int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);

@@ -334,7 +337,7 @@ int is_error_pfn(pfn_t pfn);
int is_hwpoison_pfn(pfn_t pfn);
int is_fault_pfn(pfn_t pfn);
int kvm_is_error_hva(unsigned long addr);
-void memslots_updated(struct kvm_memslots *slots, int slot_id);
+void memslots_updated(struct kvm_memslots *slots);
int kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
int user_alloc);
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 62a9caf..79571a9 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -128,12 +128,13 @@ static int kvm_iommu_map_memslots(struct kvm *kvm)
{
int i, idx, r = 0;
struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;

idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);

- for (i = 0; i < slots->nmemslots; i++) {
- r = kvm_iommu_map_pages(kvm, &slots->memslots[i]);
+ kvm_for_each_memslot(slots, memslot, i) {
+ r = kvm_iommu_map_pages(kvm, memslot);
if (r)
break;
}
@@ -289,14 +290,14 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm)
{
int i, idx;
struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;

idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);

- for (i = 0; i < slots->nmemslots; i++) {
- kvm_iommu_put_pages(kvm, slots->memslots[i].base_gfn,
- slots->memslots[i].npages);
- }
+ kvm_for_each_memslot(slots, memslot, i)
+ kvm_iommu_put_pages(kvm, memslot->base_gfn, memslot->npages);
+
srcu_read_unlock(&kvm->srcu, idx);

return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4917d6a..16b9cf3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,9 +544,10 @@ void kvm_free_physmem(struct kvm *kvm)
{
int i;
struct kvm_memslots *slots = kvm->memslots;
+ struct kvm_memory_slot *memslot;

- for (i = 0; i < slots->nmemslots; ++i)
- kvm_free_physmem_slot(&slots->memslots[i], NULL);
+ kvm_for_each_memslot(slots, memslot, i)
+ kvm_free_physmem_slot(memslot, NULL);

kfree(kvm->memslots);
}
@@ -673,7 +674,7 @@ static void sort_memslots(struct kvm_memslots *slots)
int i, num = 0;
struct kvm_memory_slot *memslot;

- for (i = 0; i < slots->nmemslots; i++) {
+ for (i = 0; i < KVM_MEM_SLOTS_NUM; i++) {
memslot = &slots->memslots[i];
if (!memslot->npages)
continue;
@@ -685,10 +686,8 @@ static void sort_memslots(struct kvm_memslots *slots)
sort(slots->slots_sort, num, sizeof(memslot), cmp_memslot, NULL);
}

-void memslots_updated(struct kvm_memslots *slots, int slot_id)
+void memslots_updated(struct kvm_memslots *slots)
{
- if (slot_id >= slots->nmemslots)
- slots->nmemslots = slot_id + 1;
slots->generation++;
slots->slot_cache = NULL;
sort_memslots(slots);
@@ -840,7 +839,7 @@ skip_lpage:
goto out_free;
memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
- memslots_updated(slots, mem->slot);
+ memslots_updated(slots);

old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
@@ -882,7 +881,7 @@ skip_lpage:
}

slots->memslots[mem->slot] = new;
- memslots_updated(slots, mem->slot);
+ memslots_updated(slots);
old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
--
1.7.4

2011-02-22 08:15:45

by Xiao Guangrong

Subject: [PATCH 7/7] KVM: MMU: cache guest page number to guest frame number

Cache guest page number to guest frame number translations to avoid walking
the guest page tables frequently; the 'vtlb' idea comes from Xen.

Note:
we can't use the vtlb for EPT guests since the guest TLB invalidation
operations (CR3 reload, invlpg) are not intercepted. It also can't be used
for L2 nested NPT guests for the same reason, but it can be used to cache
L1's NPT page tables.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 ++++-
arch/x86/kvm/mmu.c | 94 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/mmutrace.h | 79 ++++++++++++++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 19 +++++++-
4 files changed, 196 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 178d658..b05ad8f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -234,6 +234,13 @@ struct kvm_pio_request {
int size;
};

+#define VTLB_ENTRIES (1 << 4)
+struct vtlb_entry {
+ gfn_t page_number;
+ gfn_t frame_number;
+ u32 access;
+};
+
/*
* x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
* 32-bit). The kvm_mmu structure abstracts the details of the current mmu
@@ -267,8 +274,9 @@ struct kvm_mmu {
u64 rsvd_bits_mask[2][4];

bool nx;
-
+ bool vtlb_enabled;
u64 pdptrs[4]; /* pae */
+ struct vtlb_entry vtlb[VTLB_ENTRIES];
};

struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0d6e7b1..e45c0d6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2644,8 +2644,87 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
trace_kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
}

+static int vtlb_hash(gva_t page)
+{
+ return page & (VTLB_ENTRIES - 1);
+}
+
+static struct vtlb_entry *get_vtlb_entry(struct kvm_mmu *mmu, gfn_t page_number)
+{
+ return &mmu->vtlb[vtlb_hash(page_number)];
+}
+
+static void vtlb_flush(struct kvm_mmu *mmu)
+{
+ int i;
+
+ if (!mmu->vtlb_enabled)
+ return;
+
+ for (i = 0; i < VTLB_ENTRIES; i++)
+ mmu->vtlb[i].frame_number = INVALID_PAGE;
+
+ trace_vtlb_flush(mmu);
+}
+
+static void vtlb_insert(struct kvm_mmu *mmu, gva_t va, gfn_t frame, u32 access)
+{
+ gfn_t page_number;
+ struct vtlb_entry *entry;
+
+ if (!mmu->vtlb_enabled)
+ return;
+
+ page_number = gpa_to_gfn(va);
+ entry = get_vtlb_entry(mmu, page_number);
+ entry->page_number = page_number;
+ entry->frame_number = frame;
+ entry->access = access;
+
+ trace_vtlb_insert(mmu, page_number, frame, access);
+}
+
+static gfn_t vtlb_lookup(struct kvm_mmu *mmu, gva_t va, u32 access)
+{
+ gfn_t page_number;
+ gfn_t frame_number = INVALID_PAGE;
+ struct vtlb_entry *entry;
+
+ if (!mmu->vtlb_enabled)
+ return INVALID_PAGE;
+
+ page_number = gpa_to_gfn(va);
+ entry = get_vtlb_entry(mmu, page_number);
+
+ if (entry->frame_number != INVALID_PAGE &&
+ entry->page_number == page_number &&
+ (entry->access & access) == access)
+ frame_number = entry->frame_number;
+
+ trace_vtlb_lookup(mmu, page_number, frame_number != INVALID_PAGE);
+ return frame_number;
+}
+
+static void vtlb_invalid_gfn(struct kvm_mmu *mmu, gva_t va)
+{
+ gfn_t page_number;
+ struct vtlb_entry *entry;
+
+ if (!mmu->vtlb_enabled)
+ return;
+
+ page_number = gpa_to_gfn(va);
+ entry = get_vtlb_entry(mmu, page_number);
+
+ if (entry->page_number == page_number)
+ entry->frame_number = INVALID_PAGE;
+
+ trace_vtlb_invalid_gfn(mmu, page_number);
+}
+
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
{
+ vtlb_flush(&vcpu->arch.mmu);
spin_lock(&vcpu->kvm->mmu_lock);
mmu_sync_roots(vcpu);
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -2809,6 +2888,7 @@ static int nonpaging_init_context(struct kvm_vcpu *vcpu,
context->root_hpa = INVALID_PAGE;
context->direct_map = true;
context->nx = false;
+ context->vtlb_enabled = false;
return 0;
}

@@ -2938,6 +3018,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu,
context->shadow_root_level = level;
context->root_hpa = INVALID_PAGE;
context->direct_map = false;
+ context->vtlb_enabled = true;
return 0;
}

@@ -2965,6 +3046,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu,
context->shadow_root_level = PT32E_ROOT_LEVEL;
context->root_hpa = INVALID_PAGE;
context->direct_map = false;
+ context->vtlb_enabled = true;
return 0;
}

@@ -2992,6 +3074,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->get_cr3 = get_cr3;
context->inject_page_fault = kvm_inject_page_fault;
context->nx = is_nx(vcpu);
+ context->vtlb_enabled = false;

if (!is_paging(vcpu)) {
context->nx = false;
@@ -3056,6 +3139,7 @@ static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)

g_context->get_cr3 = get_cr3;
g_context->inject_page_fault = kvm_inject_page_fault;
+ g_context->vtlb_enabled = false;

/*
* Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. The
@@ -3092,11 +3176,14 @@ static int init_kvm_mmu(struct kvm_vcpu *vcpu)
vcpu->arch.update_pte.pfn = bad_pfn;

if (mmu_is_nested(vcpu))
- return init_kvm_nested_mmu(vcpu);
+ init_kvm_nested_mmu(vcpu);
else if (tdp_enabled)
- return init_kvm_tdp_mmu(vcpu);
+ init_kvm_tdp_mmu(vcpu);
else
- return init_kvm_softmmu(vcpu);
+ init_kvm_softmmu(vcpu);
+
+ vtlb_flush(&vcpu->arch.mmu);
+ return 0;
}

static void destroy_kvm_mmu(struct kvm_vcpu *vcpu)
@@ -3463,6 +3550,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_page_fault);

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
+ vtlb_invalid_gfn(&vcpu->arch.mmu, gva);
vcpu->arch.mmu.invlpg(vcpu, gva);
kvm_mmu_flush_tlb(vcpu);
++vcpu->stat.invlpg;
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b60b4fd..2bacc3f 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -214,6 +214,85 @@ TRACE_EVENT(
TP_printk("vcpu:%d %s", __entry->vcpu->cpu,
audit_point_name[__entry->audit_point])
);
+
+TRACE_EVENT(
+ vtlb_flush,
+ TP_PROTO(struct kvm_mmu *mmu),
+ TP_ARGS(mmu),
+
+ TP_STRUCT__entry(
+ __field(struct kvm_mmu *, mmu)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu = mmu;
+ ),
+
+ TP_printk("mmu:%p", __entry->mmu)
+);
+
+TRACE_EVENT(
+ vtlb_insert,
+ TP_PROTO(struct kvm_mmu *mmu, gfn_t page, gfn_t frame, u32 access),
+ TP_ARGS(mmu, page, frame, access),
+
+ TP_STRUCT__entry(
+ __field(struct kvm_mmu *, mmu)
+ __field(gfn_t, page)
+ __field(gfn_t, frame)
+ __field(u32, access)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu = mmu;
+ __entry->page = page;
+ __entry->frame = frame;
+ __entry->access = access;
+ ),
+
+ TP_printk("mmu:%p page_number:%llx frame_number:%llx access:%x",
+ __entry->mmu, __entry->page, __entry->frame, __entry->access)
+);
+
+TRACE_EVENT(
+ vtlb_lookup,
+ TP_PROTO(struct kvm_mmu *mmu, gfn_t page, bool hit),
+ TP_ARGS(mmu, page, hit),
+
+ TP_STRUCT__entry(
+ __field(struct kvm_mmu *, mmu)
+ __field(gfn_t, page)
+ __field(bool, hit)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu = mmu;
+ __entry->page = page;
+ __entry->hit = hit;
+ ),
+
+ TP_printk("mmu:%p page_number:%llx %s", __entry->mmu, __entry->page,
+ __entry->hit ? "hit" : "miss")
+);
+
+TRACE_EVENT(
+ vtlb_invalid_gfn,
+ TP_PROTO(struct kvm_mmu *mmu, gfn_t page),
+ TP_ARGS(mmu, page),
+
+ TP_STRUCT__entry(
+ __field(struct kvm_mmu *, mmu)
+ __field(gfn_t, page)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu = mmu;
+ __entry->page = page;
+ ),
+
+ TP_printk("mmu:%p page_number:%llx", __entry->mmu, __entry->page)
+);
+
#endif /* _TRACE_KVMMMU_H */

#undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6bccc24..a7da29e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -261,6 +261,7 @@ walk:

walker->pt_access = pt_access;
walker->pte_access = pte_access;
+ vtlb_insert(mmu, addr, walker->gfn, pte_access);
pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
__func__, (u64)pte, pte_access, pt_access);
return 1;
@@ -691,12 +692,19 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
{
struct guest_walker walker;
gpa_t gpa = UNMAPPED_GVA;
+ gfn_t gfn;
int r;

+ gfn = vtlb_lookup(&vcpu->arch.mmu, vaddr, access);
+ if (gfn != INVALID_PAGE)
+ goto success;
+
r = FNAME(walk_addr)(&walker, vcpu, vaddr, access);

if (r) {
- gpa = gfn_to_gpa(walker.gfn);
+ gfn = walker.gfn;
+success:
+ gpa = gfn_to_gpa(gfn);
gpa |= vaddr & ~PAGE_MASK;
} else if (exception)
*exception = walker.fault;
@@ -710,12 +718,19 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
{
struct guest_walker walker;
gpa_t gpa = UNMAPPED_GVA;
+ gfn_t gfn;
int r;

+ gfn = vtlb_lookup(&vcpu->arch.nested_mmu, vaddr, access);
+ if (gfn != INVALID_PAGE)
+ goto success;
+
r = FNAME(walk_addr_nested)(&walker, vcpu, vaddr, access);

if (r) {
- gpa = gfn_to_gpa(walker.gfn);
+ gfn = walker.gfn;
+success:
+ gpa = gfn_to_gpa(gfn);
gpa |= vaddr & ~PAGE_MASK;
} else if (exception)
*exception = walker.fault;
--
1.7.4

2011-02-22 14:25:32

by Avi Kivity

Subject: Re: [PATCH 4/7] KVM: sort memslots and use binary search to search the right slot

On 02/22/2011 10:12 AM, Xiao Guangrong wrote:
> Sort the memslots, then use a binary search to find the right slot, to speed
> up slot searching.
>

I'm not sure if a binary search is the right algorithm here. It
introduces a lot of branches which may be mispredicted.

Options we've discussed are:

- Sort slots by size, use linear search (so the largest slots are found
quickly)
- Weighted balanced tree
http://en.wikipedia.org/wiki/Weight-balanced_tree, use weight == slot size

Both options still make the miss case (mmio) slow. We could cache the
result of a miss in an spte by using a reserved bit, and checking the
page fault error code (or seeing if we get an ept violation or ept
misconfiguration), so if we get repeated mmio on a page, we don't need
to search the slot list/tree.
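
A rough sketch of the first option (hypothetical code, not taken from any
posted patch; it assumes struct kvm_memory_slot from <linux/kvm_host.h> and an
array already sorted by npages, largest first):

/* Large RAM slots sit at the front, so they are found in the first one or
 * two iterations; only the miss case (mmio) walks the whole array. */
static struct kvm_memory_slot *find_slot_by_size(struct kvm_memory_slot *slots,
                                                 int used_slots, gfn_t gfn)
{
        int i;

        for (i = 0; i < used_slots; i++) {
                struct kvm_memory_slot *s = &slots[i];

                if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
                        return s;
        }

        return NULL;
}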

--
error compiling committee.c: too many arguments to function

2011-02-22 14:27:01

by Avi Kivity

Subject: Re: [PATCH 5/7] KVM: cache the last used slot

On 02/22/2011 10:13 AM, Xiao Guangrong wrote:
> Cache the last used slot; the hit ratio is more than 35% in my tests.
>

If we sort the slots by size, or use a weighted balanced tree, this
should come for free.

--
error compiling committee.c: too many arguments to function

2011-02-22 14:32:33

by Avi Kivity

Subject: Re: [PATCH 7/7] KVM: MMU: cache guest page number to guest frame number

On 02/22/2011 10:16 AM, Xiao Guangrong wrote:
> Cache guest page number to guest frame number translations to avoid walking
> the guest page tables frequently; the 'vtlb' idea comes from Xen.
>
> Note:
> we can't use the vtlb for EPT guests since the guest TLB invalidation
> operations (CR3 reload, invlpg) are not intercepted. It also can't be used
> for L2 nested NPT guests for the same reason, but it can be used to cache
> L1's NPT page tables.
>

I'm not so hot about introducing a new mechanism strictly for older
hosts... EPT exists in three generations of Intel processors now (Sandy
Bridge, Westmere, and Nehalem), and NPT is significantly older.

I'd rather make the guest page table walk faster. I believe that things
like get_user() can be significantly faster than copy_from_user() on
older hosts, and a four-level page table walk shouldn't be that slow.
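
For instance, here is a hedged sketch (the helper below is made up, not
existing walker code) of reading one guest PTE through its host virtual
address with get_user() rather than an 8-byte copy_from_user():

/* needs uaccess.h; 'hva' is the host virtual address of a single guest PTE */
static int read_guest_pte(u64 __user *hva, u64 *pte)
{
        /*
         * get_user() expands to a single sized access plus fault handling,
         * avoiding the generic copy loop; it returns 0 or -EFAULT just like
         * a copy_from_user() based path would.
         */
        return get_user(*pte, hva);
}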

--
error compiling committee.c: too many arguments to function

2011-02-22 14:54:47

by Alex Williamson

Subject: Re: [PATCH 4/7] KVM: sort memslots and use binary search to search the right slot

On Tue, 2011-02-22 at 16:25 +0200, Avi Kivity wrote:
> On 02/22/2011 10:12 AM, Xiao Guangrong wrote:
> > Sort the memslots, then use a binary search to find the right slot, to
> > speed up slot searching.
> >
>
> I'm not sure if a binary search is the right algorithm here. It
> introduces a lot of branches which may be mispredicted.
>
> Options we've discussed are:
>
> - Sort slots by size, use linear search (so the largest slots are found
> quickly)
> - Weighted balanced tree
> http://en.wikipedia.org/wiki/Weight-balanced_tree, use weight == slot size

I've got an implementation using a weight balanced tree working now. I
need to do some testing to see if I can detect any performance
difference from the current unsorted, linear array.

> Both options still make the miss case (mmio) slow. We could cache the
> result of a miss in an spte by using a reserved bit, and checking the
> page fault error code (or seeing if we get an ept violation or ept
> misconfiguration), so if we get repeated mmio on a page, we don't need
> to search the slot list/tree.

I haven't started on this idea yet. Thanks,

Alex


2011-02-22 18:54:55

by Alex Williamson

Subject: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

This series introduces a new weight-balanced binary tree (wbtree) for
general use. It's largely leveraged from the rbtree, copying its
rotate functions, while introducing different rebalance and erase
functions. This tree is particularly useful for managing memory
ranges, where it's desirable to have the most likely targets (the
largest ranges) at the top of each subtree.
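
As a usage sketch of the wbtree API from patch 1/3 (wb_set_weight(),
wb_link_node() and wb_rebalance() are from that patch; the example_wb_insert()
helper and the cmp() callback are made up for illustration, mirroring how the
later KVM memory slot patch inserts slots), an insertion follows the usual
'walk to a leaf, link, then rebalance by weight' pattern:

/* Insert 'new' with the given weight, ordered by a caller-supplied cmp(). */
static void example_wb_insert(struct wb_root *root, struct wb_node *new,
                              unsigned long weight,
                              int (*cmp)(struct wb_node *, struct wb_node *))
{
        struct wb_node **p = &root->wb_node, *parent = NULL;

        wb_set_weight(new, weight);

        while (*p) {
                parent = *p;
                if (cmp(new, parent) < 0)
                        p = &parent->wb_left;
                else
                        p = &parent->wb_right;
        }

        wb_link_node(new, parent, p);
        wb_rebalance(new, root);        /* rotate heavier nodes toward the root */
}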

Patches 2 & 3 go on to convert the KVM memory slots to a growable
array and make use of wbtree for efficient management. Trying to
exercise the worst case for this data structure, I ran netperf
TCP_RR on an emulated rtl8139 NIC connected directly to the host
via a tap. Both qemu-kvm and the netserver on the host were
pinned to optimal CPUs with taskset. This series resulted in
a 3% improvement for this test.

Note that part of why this series is RFC is that the print_tree
function in the last patch is debug code that generates output
for dot. You can copy the output to a file and run:

dot -Tpdf foo.dot > foo.pdf

to generate a nice diagram of the tree currently in use. I'll
follow-up with a few examples. Thanks,

Alex

---

Alex Williamson (3):
kvm: Use weight-balanced tree for memory slot management
kvm: Allow memory slot array to grow on demand
Weight-balanced tree


arch/ia64/include/asm/kvm_host.h | 4 -
arch/ia64/kvm/kvm-ia64.c | 2
arch/powerpc/include/asm/kvm_host.h | 3
arch/s390/include/asm/kvm_host.h | 3
arch/x86/include/asm/kvm_host.h | 5 -
arch/x86/include/asm/vmx.h | 6 -
arch/x86/kvm/mmu.c | 32 ++++-
arch/x86/kvm/x86.c | 6 -
include/linux/kvm_host.h | 25 ++++
include/linux/wbtree.h | 55 +++++++++
lib/Makefile | 3
lib/wbtree.c | 170 ++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 226 +++++++++++++++++++++++++++--------
13 files changed, 464 insertions(+), 76 deletions(-)
create mode 100644 include/linux/wbtree.h
create mode 100644 lib/wbtree.c

2011-02-22 18:55:14

by Alex Williamson

Subject: [RFC PATCH 1/3] Weight-balanced tree

Signed-off-by: Alex Williamson <[email protected]>
---

include/linux/wbtree.h | 55 ++++++++++++++++
lib/Makefile | 3 +
lib/wbtree.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 227 insertions(+), 1 deletions(-)
create mode 100644 include/linux/wbtree.h
create mode 100644 lib/wbtree.c

diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h
new file mode 100644
index 0000000..ef70f6f
--- /dev/null
+++ b/include/linux/wbtree.h
@@ -0,0 +1,55 @@
+/*
+ * Weight-balanced binary tree
+ *
+ * The balance of this tree is based on search probability. The
+ * heaviest weighted nodes (the ones we're most likely to hit), are
+ * at the top of each subtree.
+ *
+ * Copyright (C) 2011 Red Hat, Inc.
+ *
+ * Authors:
+ * Alex Williamson <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+#ifndef _LINUX_WBTREE_H
+#define _LINUX_WBTREE_H
+
+#include <linux/kernel.h>
+#include <linux/stddef.h>
+
+struct wb_node {
+ struct wb_node *wb_parent;
+ struct wb_node *wb_left;
+ struct wb_node *wb_right;
+ unsigned long wb_weight;
+};
+
+struct wb_root {
+ struct wb_node *wb_node;
+};
+
+#define WB_ROOT (struct wb_root) { NULL, }
+#define wb_entry(ptr, type, member) container_of(ptr, type, member)
+
+extern void wb_rebalance(struct wb_node *node, struct wb_root *root);
+extern void wb_erase(struct wb_node *node, struct wb_root *root);
+extern struct wb_node *wb_first(struct wb_root *root);
+extern struct wb_node *wb_last(struct wb_root *root);
+extern struct wb_node *wb_next(struct wb_node *node);
+extern struct wb_node *wb_prev(struct wb_node *node);
+
+static inline void wb_set_weight(struct wb_node *node, unsigned long weight)
+{
+ node->wb_weight = weight;
+}
+
+static inline void wb_link_node(struct wb_node *node, struct wb_node *parent,
+ struct wb_node **wb_link)
+{
+ node->wb_left = node->wb_right = NULL;
+ node->wb_parent = parent;
+ *wb_link = node;
+}
+#endif /* _LINUX_WBTREE_H */
diff --git a/lib/Makefile b/lib/Makefile
index cbb774f..5c42e63 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -21,7 +21,8 @@ lib-y += kobject.o kref.o klist.o

obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
- string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o
+ string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o \
+ wbtree.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/wbtree.c b/lib/wbtree.c
new file mode 100644
index 0000000..54c3817
--- /dev/null
+++ b/lib/wbtree.c
@@ -0,0 +1,170 @@
+/*
+ * Weight-balanced binary tree
+ *
+ * The balance of this tree is based on search probability. The
+ * heaviest weighted nodes (the ones we're most likely to hit), are
+ * at the top of each subtree.
+ *
+ * Copyright (C) 2011 Red Hat, Inc.
+ *
+ * Authors:
+ * Alex Williamson <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+#include <linux/wbtree.h>
+#include <linux/module.h>
+
+static void __wb_rotate_left(struct wb_node *node, struct wb_root *root)
+{
+ struct wb_node *right = node->wb_right;
+ struct wb_node *parent = node->wb_parent;
+
+ if ((node->wb_right = right->wb_left))
+ right->wb_left->wb_parent = node;
+ right->wb_left = node;
+
+ right->wb_parent = parent;
+
+ if (parent) {
+ if (node == parent->wb_left)
+ parent->wb_left = right;
+ else
+ parent->wb_right = right;
+ } else
+ root->wb_node = right;
+
+ node->wb_parent = right;
+}
+
+static void __wb_rotate_right(struct wb_node *node, struct wb_root *root)
+{
+ struct wb_node *left = node->wb_left;
+ struct wb_node *parent = node->wb_parent;
+
+ if ((node->wb_left = left->wb_right))
+ left->wb_right->wb_parent = node;
+ left->wb_right = node;
+
+ left->wb_parent = parent;
+
+ if (parent) {
+ if (node == parent->wb_right)
+ parent->wb_right = left;
+ else
+ parent->wb_left = left;
+ } else
+ root->wb_node = left;
+
+ node->wb_parent = left;
+}
+
+void wb_rebalance(struct wb_node *node, struct wb_root *root)
+{
+ while (node->wb_parent && node->wb_parent->wb_weight < node->wb_weight) {
+ if (node == node->wb_parent->wb_left)
+ __wb_rotate_right(node->wb_parent, root);
+ else
+ __wb_rotate_left(node->wb_parent, root);
+ }
+}
+EXPORT_SYMBOL(wb_rebalance);
+
+void wb_erase(struct wb_node *node, struct wb_root *root)
+{
+ while (node->wb_left || node->wb_right) {
+ if (!node->wb_left)
+ __wb_rotate_left(node, root);
+ else if (!node->wb_right)
+ __wb_rotate_right(node, root);
+ else {
+ if (node->wb_left->wb_weight >
+ node->wb_right->wb_weight)
+ __wb_rotate_right(node, root);
+ else
+ __wb_rotate_left(node, root);
+ }
+ }
+
+ if (node->wb_parent) {
+ if (node->wb_parent->wb_left == node)
+ node->wb_parent->wb_left = NULL;
+ else
+ node->wb_parent->wb_right = NULL;
+ } else
+ root->wb_node = NULL;
+}
+EXPORT_SYMBOL(wb_erase);
+
+struct wb_node *wb_first(struct wb_root *root)
+{
+ struct wb_node *node;
+
+ node = root->wb_node;
+ if (!node)
+ return NULL;
+
+ while (node->wb_left)
+ node = node->wb_left;
+
+ return node;
+}
+EXPORT_SYMBOL(wb_first);
+
+struct wb_node *wb_last(struct wb_root *root)
+{
+ struct wb_node *node;
+
+ node = root->wb_node;
+ if (!node)
+ return NULL;
+
+ while (node->wb_right)
+ node = node->wb_right;
+
+ return node;
+}
+EXPORT_SYMBOL(wb_last);
+
+struct wb_node *wb_next(struct wb_node *node)
+{
+ struct wb_node *parent;
+
+ if (node->wb_parent == node)
+ return NULL;
+
+ if (node->wb_right) {
+ node = node->wb_right;
+ while (node->wb_left)
+ node = node->wb_left;
+ return node;
+ }
+
+ while ((parent = node->wb_parent) && node == parent->wb_right)
+ node = parent;
+
+ return parent;
+}
+EXPORT_SYMBOL(wb_next);
+
+struct wb_node *wb_prev(struct wb_node *node)
+{
+ struct wb_node *parent;
+
+ if (node->wb_parent == node)
+ return NULL;
+
+ if (node->wb_left) {
+ node = node->wb_left;
+ while (node->wb_right)
+ node = node->wb_right;
+ return node;
+ }
+
+ while ((parent = node->wb_parent) && node == parent->wb_left)
+ node = parent;
+
+ return parent;
+}
+EXPORT_SYMBOL(wb_prev);

2011-02-22 18:55:27

by Alex Williamson

Subject: [RFC PATCH 2/3] kvm: Allow memory slot array to grow on demand

Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
to grow on demand. Private slots are now allocated at the
front instead of the end. Only x86 seems to use private slots,
so this is now zero for all other archs. The memslots pointer
is already updated using rcu, so changing the size of the
array when it's replaced is straightforward. x86 also keeps
a bitmap of slots used by a kvm_mmu_page, which requires a
shadow tlb flush whenever we increase the number of slots.
This forces the pages to be rebuilt with the new bitmap size.

Signed-off-by: Alex Williamson <[email protected]>
---

arch/ia64/include/asm/kvm_host.h | 4 --
arch/ia64/kvm/kvm-ia64.c | 2 +
arch/powerpc/include/asm/kvm_host.h | 3 --
arch/s390/include/asm/kvm_host.h | 3 --
arch/x86/include/asm/kvm_host.h | 5 +--
arch/x86/include/asm/vmx.h | 6 ++-
arch/x86/kvm/mmu.c | 32 +++++++++++++++--
arch/x86/kvm/x86.c | 6 ++-
include/linux/kvm_host.h | 22 +++++++++++-
virt/kvm/kvm_main.c | 65 +++++++++++++++++++++++++----------
10 files changed, 103 insertions(+), 45 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 2689ee5..11d0ab2 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -23,10 +23,6 @@
#ifndef __ASM_KVM_HOST_H
#define __ASM_KVM_HOST_H

-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
-
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1

/* define exit reasons from vmm to kvm*/
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 70d224d..f1adda2 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1814,7 +1814,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
mutex_lock(&kvm->slots_lock);

r = -EINVAL;
- if (log->slot >= KVM_MEMORY_SLOTS)
+ if (log->slot >= kvm->memslots->nmemslots)
goto out;

memslot = &kvm->memslots->memslots[log->slot];
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bba3b9b..dc80057 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -29,9 +29,6 @@
#include <asm/kvm_asm.h>

#define KVM_MAX_VCPUS 1
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4

#define KVM_COALESCED_MMIO_PAGE_OFFSET 1

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index cef7dbf..92a964c 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -20,9 +20,6 @@
#include <asm/cpu.h>

#define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4

struct sca_entry {
atomic_t scn;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..5c94392 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -27,9 +27,8 @@
#include <asm/msr-index.h>

#define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
+#define KVM_PRIVATE_MEM_SLOTS 3

#define KVM_PIO_PAGE_OFFSET 1
#define KVM_COALESCED_MMIO_PAGE_OFFSET 2
@@ -207,7 +206,7 @@ struct kvm_mmu_page {
* One bit set per slot which has memory
* in this shadow page.
*/
- DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+ unsigned long *slot_bitmap;
bool multimapped; /* More than one parent_pte? */
bool unsync;
int root_count; /* Currently serving as active root */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 84471b8..7fd8c89 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -370,9 +370,9 @@ enum vmcs_field {

#define AR_RESERVD_MASK 0xfffe0f00

-#define TSS_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define TSS_PRIVATE_MEMSLOT 0
+#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 1
+#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT 2

#define VMX_NR_VPIDS (1 << 16)
#define VMX_VPID_EXTENT_SINGLE_CONTEXT 1
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ccacf0b..91e14f6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1029,9 +1029,13 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)

static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+
ASSERT(is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
list_del(&sp->link);
+ if (unlikely(slots->nmemslots > sizeof(sp->slot_bitmap) * 8))
+ kfree(sp->slot_bitmap);
__free_page(virt_to_page(sp->spt));
if (!sp->role.direct)
__free_page(virt_to_page(sp->gfns));
@@ -1048,6 +1052,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
u64 *parent_pte, int direct)
{
struct kvm_mmu_page *sp;
+ struct kvm_memslots *slots = kvm_memslots(vcpu->kvm);

sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache, sizeof *sp);
sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
@@ -1056,7 +1061,16 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
PAGE_SIZE);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+
+ if (unlikely(slots->nmemslots > sizeof(sp->slot_bitmap) * 8)) {
+ sp->slot_bitmap = kzalloc(sizeof(long) *
+ BITS_TO_LONGS(slots->nmemslots),
+ GFP_KERNEL);
+ if (!sp->slot_bitmap)
+ return NULL;
+ } else
+ bitmap_zero((void *)&sp->slot_bitmap, slots->nmemslots);
+
sp->multimapped = 0;
sp->parent_pte = parent_pte;
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
@@ -1817,8 +1831,12 @@ static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn)
{
int slot = memslot_id(kvm, gfn);
struct kvm_mmu_page *sp = page_header(__pa(pte));
+ struct kvm_memslots *slots = kvm_memslots(kvm);

- __set_bit(slot, sp->slot_bitmap);
+ if (likely(slots->nmemslots <= sizeof(sp->slot_bitmap) * 8))
+ __set_bit(slot, (void *)&sp->slot_bitmap);
+ else
+ __set_bit(slot, sp->slot_bitmap);
}

static void mmu_convert_notrap(struct kvm_mmu_page *sp)
@@ -3530,13 +3548,19 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
{
struct kvm_mmu_page *sp;
+ struct kvm_memslots *slots = kvm_memslots(kvm);

list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) {
int i;
u64 *pt;

- if (!test_bit(slot, sp->slot_bitmap))
- continue;
+ if (likely(slots->nmemslots <= sizeof(sp->slot_bitmap) * 8)) {
+ if (!test_bit(slot, (void *)&sp->slot_bitmap))
+ continue;
+ } else {
+ if (!test_bit(slot, sp->slot_bitmap))
+ continue;
+ }

pt = sp->spt;
for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5eccdba..88688d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1978,7 +1978,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = KVM_MAX_VCPUS;
break;
case KVM_CAP_NR_MEMSLOTS:
- r = KVM_MEMORY_SLOTS;
+ r = KVM_MAX_MEM_SLOTS - KVM_PRIVATE_MEM_SLOTS;
break;
case KVM_CAP_PV_MMU: /* obsolete */
r = 0;
@@ -3201,7 +3201,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
mutex_lock(&kvm->slots_lock);

r = -EINVAL;
- if (log->slot >= KVM_MEMORY_SLOTS)
+ if (log->slot >= kvm->memslots->nmemslots)
goto out;

memslot = &kvm->memslots->memslots[log->slot];
@@ -6068,7 +6068,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
int map_flags = MAP_PRIVATE | MAP_ANONYMOUS;

/* Prevent internal slot pages from being moved by fork()/COW. */
- if (memslot->id >= KVM_MEMORY_SLOTS)
+ if (memslot->id < KVM_PRIVATE_MEM_SLOTS)
map_flags = MAP_SHARED | MAP_ANONYMOUS;

/*To keep backward compatibility with older userspace,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b5021db..7bbb36f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -28,6 +28,25 @@
#include <asm/kvm_host.h>

/*
+ * Private slots are not exposed to userspace. These are filled at the
+ * front of the slot array with the userspace visible 0 index starting
+ * immediately following.
+ */
+#ifndef KVM_PRIVATE_MEM_SLOTS
+ #define KVM_PRIVATE_MEM_SLOTS 0
+#endif
+
+/*
+ * Protect from malicious userspace by putting an upper bound on the number
+ * of memory slots. This is an arbitrarily large number that still allows
+ * us to make pseudo-guarantees about supporting 64 assigned devices with
+ * plenty of slots left over.
+ */
+#ifndef KVM_MAX_MEM_SLOTS
+ #define KVM_MAX_MEM_SLOTS 512
+#endif
+
+/*
* vcpu->requests bit members
*/
#define KVM_REQ_TLB_FLUSH 0
@@ -206,8 +225,7 @@ struct kvm_irq_routing_table {};
struct kvm_memslots {
int nmemslots;
u64 generation;
- struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
- KVM_PRIVATE_MEM_SLOTS];
+ struct kvm_memory_slot memslots[];
};

struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fd67bcd..a3a5bda 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -623,13 +623,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
int user_alloc)
{
- int r;
+ int r, nmemslots;
gfn_t base_gfn;
unsigned long npages;
unsigned long i;
- struct kvm_memory_slot *memslot;
- struct kvm_memory_slot old, new;
+ struct kvm_memory_slot *memslot = NULL;
+ struct kvm_memory_slot old = {}, new = {};
struct kvm_memslots *slots, *old_memslots;
+ bool flush = false;

r = -EINVAL;
/* General sanity checks */
@@ -639,12 +640,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out;
if (user_alloc && (mem->userspace_addr & (PAGE_SIZE - 1)))
goto out;
- if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+ if (mem->slot >= KVM_MAX_MEM_SLOTS)
goto out;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
goto out;

- memslot = &kvm->memslots->memslots[mem->slot];
base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
npages = mem->memory_size >> PAGE_SHIFT;

@@ -655,7 +655,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!npages)
mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;

- new = old = *memslot;
+ if (mem->slot < kvm->memslots->nmemslots) {
+ memslot = &kvm->memslots->memslots[mem->slot];
+ new = old = *memslot;
+ }

new.id = mem->slot;
new.base_gfn = base_gfn;
@@ -669,7 +672,7 @@ int __kvm_set_memory_region(struct kvm *kvm,

/* Check for overlaps */
r = -EEXIST;
- for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
+ for (i = KVM_PRIVATE_MEM_SLOTS; i < kvm->memslots->nmemslots; ++i) {
struct kvm_memory_slot *s = &kvm->memslots->memslots[i];

if (s == memslot || !s->npages)
@@ -752,12 +755,19 @@ skip_lpage:

if (!npages) {
r = -ENOMEM;
- slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+
+ nmemslots = (mem->slot >= kvm->memslots->nmemslots) ?
+ mem->slot + 1 : kvm->memslots->nmemslots;
+
+ slots = kzalloc(sizeof(struct kvm_memslots) +
+ nmemslots * sizeof(struct kvm_memory_slot),
+ GFP_KERNEL);
if (!slots)
goto out_free;
- memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
- if (mem->slot >= slots->nmemslots)
- slots->nmemslots = mem->slot + 1;
+ memcpy(slots, kvm->memslots,
+ sizeof(struct kvm_memslots) + kvm->memslots->nmemslots *
+ sizeof(struct kvm_memory_slot));
+ slots->nmemslots = nmemslots;
slots->generation++;
slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;

@@ -787,12 +797,21 @@ skip_lpage:
}

r = -ENOMEM;
- slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+
+ if (mem->slot >= kvm->memslots->nmemslots) {
+ nmemslots = mem->slot + 1;
+ flush = true;
+ } else
+ nmemslots = kvm->memslots->nmemslots;
+
+ slots = kzalloc(sizeof(struct kvm_memslots) +
+ nmemslots * sizeof(struct kvm_memory_slot),
+ GFP_KERNEL);
if (!slots)
goto out_free;
- memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
- if (mem->slot >= slots->nmemslots)
- slots->nmemslots = mem->slot + 1;
+ memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots) +
+ kvm->memslots->nmemslots * sizeof(struct kvm_memory_slot));
+ slots->nmemslots = nmemslots;
slots->generation++;

/* actual memory is freed via old in kvm_free_physmem_slot below */
@@ -808,6 +827,9 @@ skip_lpage:
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);

+ if (flush)
+ kvm_arch_flush_shadow(kvm);
+
kvm_arch_commit_memory_region(kvm, mem, old, user_alloc);

kvm_free_physmem_slot(&old, &new);
@@ -841,7 +863,7 @@ int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
kvm_userspace_memory_region *mem,
int user_alloc)
{
- if (mem->slot >= KVM_MEMORY_SLOTS)
+ if (mem->slot >= KVM_MAX_MEM_SLOTS)
return -EINVAL;
return kvm_set_memory_region(kvm, mem, user_alloc);
}
@@ -855,7 +877,7 @@ int kvm_get_dirty_log(struct kvm *kvm,
unsigned long any = 0;

r = -EINVAL;
- if (log->slot >= KVM_MEMORY_SLOTS)
+ if (log->slot >= kvm->memslots->nmemslots)
goto out;

memslot = &kvm->memslots->memslots[log->slot];
@@ -947,7 +969,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
int i;
struct kvm_memslots *slots = kvm_memslots(kvm);

- for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
+ for (i = KVM_PRIVATE_MEM_SLOTS; i < slots->nmemslots; ++i) {
struct kvm_memory_slot *memslot = &slots->memslots[i];

if (memslot->flags & KVM_MEMSLOT_INVALID)
@@ -1832,6 +1854,8 @@ static long kvm_vm_ioctl(struct file *filp,
sizeof kvm_userspace_mem))
goto out;

+ kvm_userspace_mem.slot += KVM_PRIVATE_MEM_SLOTS;
+
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem, 1);
if (r)
goto out;
@@ -1843,6 +1867,9 @@ static long kvm_vm_ioctl(struct file *filp,
r = -EFAULT;
if (copy_from_user(&log, argp, sizeof log))
goto out;
+
+ log.slot += KVM_PRIVATE_MEM_SLOTS;
+
r = kvm_vm_ioctl_get_dirty_log(kvm, &log);
if (r)
goto out;
@@ -1937,7 +1964,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
if (copy_from_user(&compat_log, (void __user *)arg,
sizeof(compat_log)))
goto out;
- log.slot = compat_log.slot;
+ log.slot = compat_log.slot + KVM_PRIVATE_MEM_SLOTS;
log.padding1 = compat_log.padding1;
log.padding2 = compat_log.padding2;
log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);

2011-02-22 18:55:43

by Alex Williamson

[permalink] [raw]
Subject: [RFC PATCH 3/3] kvm: Use weight-balanced tree for memory slot management

Rather than search linearly through an unordered slot array, we
can embed a tree node into each element. We can still do RCU
tree updates by applying the pointer offset to each element
in the new copy of the tree.

Signed-off-by: Alex Williamson <[email protected]>
---

include/linux/kvm_host.h | 3 +
virt/kvm/kvm_main.c | 163 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7bbb36f..2a43f9b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -18,6 +18,7 @@
#include <linux/msi.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
+#include <linux/wbtree.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -171,6 +172,7 @@ struct kvm_lpage_info {
};

struct kvm_memory_slot {
+ struct wb_node wb_node;
gfn_t base_gfn;
unsigned long npages;
unsigned long flags;
@@ -225,6 +227,7 @@ struct kvm_irq_routing_table {};
struct kvm_memslots {
int nmemslots;
u64 generation;
+ struct wb_root wb_root;
struct kvm_memory_slot memslots[];
};

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a3a5bda..0a6ef96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -447,6 +447,7 @@ static struct kvm *kvm_create_vm(void)
kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!kvm->memslots)
goto out_err_nosrcu;
+ kvm->memslots->wb_root = WB_ROOT;
if (init_srcu_struct(&kvm->srcu))
goto out_err_nosrcu;
for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -611,6 +612,119 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
}
#endif /* !CONFIG_S390 */

+static struct kvm_memory_slot *kvm_memslot_search(gfn_t gfn,
+ struct kvm_memslots *slots)
+{
+ struct wb_node *node = slots->wb_root.wb_node;
+ struct kvm_memory_slot *slot;
+
+ while (node) {
+ slot = wb_entry(node, struct kvm_memory_slot, wb_node);
+
+ if (gfn < slot->base_gfn)
+ node = node->wb_left;
+ else if (gfn >= slot->base_gfn + slot->npages)
+ node = node->wb_right;
+ else
+ return slot;
+ }
+ return NULL;
+}
+
+static void print_tree(struct kvm_memslots *slots)
+{
+ int i;
+
+ printk("digraph wbtree {\nnode [shape = record];\n");
+
+ for (i = 0; i < slots->nmemslots; i++) {
+ struct kvm_memory_slot *s = &slots->memslots[i];
+ struct wb_node *n = &s->wb_node;
+
+ if (!s->npages)
+ continue;
+
+ printk("node%d [label=\"<l> | <n> %lx-%lx\\n(%ld) | <r>\"];\n",
+ i, (unsigned long)s->base_gfn,
+ (unsigned long)(s->base_gfn + s->npages), s->npages);
+ if (n->wb_left) {
+ struct kvm_memory_slot *sl;
+ int left;
+ sl = wb_entry(n->wb_left, struct kvm_memory_slot,
+ wb_node);
+ left = sl - slots->memslots;
+ printk("\"node%d\":l -> \"node%d\":n;\n", i, left);
+ }
+ if (n->wb_right) {
+ struct kvm_memory_slot *sr;
+ int right;
+ sr = wb_entry(n->wb_right, struct kvm_memory_slot,
+ wb_node);
+ right = sr - slots->memslots;
+ printk("\"node%d\":r -> \"node%d\":n;\n", i, right);
+ }
+ }
+ printk("}\n");
+}
+
+static int kvm_memslot_insert(struct kvm_memory_slot *slot,
+ struct kvm_memslots *slots)
+{
+ struct wb_node **node = &slots->wb_root.wb_node, *parent = NULL;
+ struct kvm_memory_slot *memslot;
+
+ wb_set_weight(&slot->wb_node, slot->npages);
+
+ while (*node) {
+ memslot = wb_entry(*node, struct kvm_memory_slot, wb_node);
+ parent = *node;
+
+ if (slot->base_gfn + slot->npages <= memslot->base_gfn)
+ node = &(*node)->wb_left;
+ else if (slot->base_gfn >= memslot->base_gfn + memslot->npages)
+ node = &(*node)->wb_right;
+ else
+ return -EINVAL;
+ }
+
+ wb_link_node(&slot->wb_node, parent, node);
+ wb_rebalance(&slot->wb_node, &slots->wb_root);
+ print_tree(slots);
+ return 0;
+}
+
+static void kvm_memslot_remove(struct kvm_memory_slot *slot,
+ struct kvm_memslots *slots)
+{
+ wb_erase(&slot->wb_node, &slots->wb_root);
+ print_tree(slots);
+}
+
+static inline void kvm_memslot_pointer_fix(void *old, void *new, void **ptr)
+{
+ if (*ptr) {
+ if (new > old)
+ *ptr += (new - old);
+ else
+ *ptr -= (old - new);
+ }
+}
+
+static void kvm_memslots_copy_fixup(struct kvm_memslots *old,
+ struct kvm_memslots *new)
+{
+ int i;
+
+ for (i = 0; i < old->nmemslots; i++) {
+ struct wb_node *node = &new->memslots[i].wb_node;
+ kvm_memslot_pointer_fix(old, new, (void **)&node->wb_left);
+ kvm_memslot_pointer_fix(old, new, (void **)&node->wb_right);
+ kvm_memslot_pointer_fix(old, new, (void **)&node->wb_parent);
+ }
+
+ kvm_memslot_pointer_fix(old, new, (void **)&new->wb_root.wb_node);
+}
+
/*
* Allocate some memory and give it an address in the guest physical address
* space.
@@ -767,6 +881,8 @@ skip_lpage:
memcpy(slots, kvm->memslots,
sizeof(struct kvm_memslots) + kvm->memslots->nmemslots *
sizeof(struct kvm_memory_slot));
+ kvm_memslots_copy_fixup(kvm->memslots, slots);
+ new.wb_node = slots->memslots[mem->slot].wb_node;
slots->nmemslots = nmemslots;
slots->generation++;
slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
@@ -823,6 +939,13 @@ skip_lpage:
}

slots->memslots[mem->slot] = new;
+ kvm_memslots_copy_fixup(kvm->memslots, slots);
+
+ if (npages)
+ kvm_memslot_insert(&slots->memslots[mem->slot], slots);
+ else
+ kvm_memslot_remove(&slots->memslots[mem->slot], slots);
+
old_memslots = kvm->memslots;
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
@@ -946,16 +1069,7 @@ EXPORT_SYMBOL_GPL(kvm_is_error_hva);
static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
gfn_t gfn)
{
- int i;
-
- for (i = 0; i < slots->nmemslots; ++i) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
-
- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- return memslot;
- }
- return NULL;
+ return kvm_memslot_search(gfn, slots);
}

struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
@@ -966,20 +1080,16 @@ EXPORT_SYMBOL_GPL(gfn_to_memslot);

int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
{
- int i;
struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *slot = kvm_memslot_search(gfn, slots);

- for (i = KVM_PRIVATE_MEM_SLOTS; i < slots->nmemslots; ++i) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
+ if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
+ return 0;

- if (memslot->flags & KVM_MEMSLOT_INVALID)
- continue;
+ if (slot - slots->memslots < KVM_PRIVATE_MEM_SLOTS)
+ return 0;

- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- return 1;
- }
- return 0;
+ return 1;
}
EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);

@@ -1009,19 +1119,10 @@ out:

int memslot_id(struct kvm *kvm, gfn_t gfn)
{
- int i;
struct kvm_memslots *slots = kvm_memslots(kvm);
- struct kvm_memory_slot *memslot = NULL;
-
- for (i = 0; i < slots->nmemslots; ++i) {
- memslot = &slots->memslots[i];
-
- if (gfn >= memslot->base_gfn
- && gfn < memslot->base_gfn + memslot->npages)
- break;
- }
+ struct kvm_memory_slot *slot = kvm_memslot_search(gfn, slots);

- return memslot - slots->memslots;
+ return slot ? slot - slots->memslots : slots->nmemslots;
}

static unsigned long gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,

2011-02-22 18:59:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Tue, 2011-02-22 at 11:54 -0700, Alex Williamson wrote:
> This series introduces a new weight-balanced binary tree (wbtree) for
> general use. It's largely leveraged from the rbtree, copying it's
> rotate functions, while introducing different rebalance and erase
> functions. This tree is particularly useful for managing memory
> ranges, where it's desirable to have the most likely targets (the
> largest ranges) at the top of each subtree.
>
> Patches 2 & 3 go on to convert the KVM memory slots to a growable
> array and make use of wbtree for efficient managment. Trying to
> exercise the worst case for this data structure, I ran netperf
> TCP_RR on an emulated rtl8139 NIC connected directly to the host
> via a tap. Both qemu-kvm and the netserver on the host were
> pinned to optimal CPUs with taskset. This series resulted in
> a 3% improvement for this test.
>
> Note that part of why this series is RFC is that the print_tree
> function in the last patch is debug code that generates output
> for dot. You can copy the output to a file and run:
>
> dot -Tpdf foo.dot > foo.pdf
>
> to generate a nice diagram of the tree currently in use. I'll
> follow-up with a few examples. Thanks,

Attached are examples of the memory slot tree for 2G, 4G, and 8G guests.
Thanks,

Alex


Attachments:
2G.pdf (48.49 kB)
4G.pdf (48.54 kB)
8G.pdf (48.50 kB)

2011-02-23 01:29:27

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 4/7] KVM: sort memslots and use binary search to search the right slot

On 02/22/2011 10:25 PM, Avi Kivity wrote:
> On 02/22/2011 10:12 AM, Xiao Guangrong wrote:
>> Sort memslots then search the slot with binary search to speed up the
>> slot searching
>>
>
> I'm not sure if a binary search is the right algorithm here. It introduces a lot of branches which may be mispredicted.
>
> Options we've discussed are:
>
> - Sort slots by size, use linear search (so the largest slots are found quickly)
> - Weighted balanced tree http://en.wikipedia.org/wiki/Weight-balanced_tree, use weight == slot size
>

Yeah, those are better choices.

> Both options still make the miss case (mmio) slow. We could cache the result of a miss in an spte by using a reserved bit, and checking the page fault error code (or seeing if we get an ept violation or ept misconfiguration), so if we get repeated mmio on a page, we don't need to search the slot list/tree.
>

Sounds good! Will do it, thanks!
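A minimal sketch of the miss-caching idea at the software level, rather than the reserved-SPTE-bit scheme Avi describes (which keeps the fault from ever reaching this code): remember the last gfn whose slot search failed, tagged with the memslots generation so the cache self-invalidates when slots change. The miss_gfn and miss_generation fields are hypothetical and not part of any posted patch.

/*
 * Illustrative only: a per-vcpu cache of the last failed slot search.
 * miss_gfn and miss_generation are hypothetical fields on struct
 * kvm_vcpu; the generation check piggybacks on slots->generation so a
 * memslot update invalidates the cached miss automatically.
 */
static struct kvm_memory_slot *gfn_to_memslot_cached(struct kvm_vcpu *vcpu,
						     gfn_t gfn)
{
	struct kvm_memslots *slots = kvm_memslots(vcpu->kvm);
	struct kvm_memory_slot *slot;

	/* repeated mmio access to the same page: skip the search entirely */
	if (vcpu->miss_gfn == gfn &&
	    vcpu->miss_generation == slots->generation)
		return NULL;

	slot = gfn_to_memslot(vcpu->kvm, gfn);
	if (!slot) {
		vcpu->miss_gfn = gfn;
		vcpu->miss_generation = slots->generation;
	}
	return slot;
}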

2011-02-23 01:37:36

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 7/7] KVM: MMU: cache guest page number to guest frame number

On 02/22/2011 10:32 PM, Avi Kivity wrote:
> On 02/22/2011 10:16 AM, Xiao Guangrong wrote:
>> Cache the guest page number to guest frame number mapping to avoid walking the
>> guest page tables frequently; the 'vtlb' idea is from Xen.
>>
>> Note:
>> we can't use the vtlb in EPT guests since the guest TLB invalidation operations
>> (CR3 reload, invlpg) are not intercepted; it also can't be used for L2 nested NPT
>> guests for the same reason, but we can use it to cache L1's NPT page tables.
>>
>
> I'm not so hot about introducing a new mechanism strictly for older hosts... EPT exists in three generations of Intel processors now (Sandy Bridge, Westmere, and Nehalem), and NPT is significantly older.
>

Um... so should we stop adding new features for softmmu, and only accept bug fixes? :-)

2011-02-23 01:56:24

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Tue, 2011-02-22 at 11:54 -0700, Alex Williamson wrote:
> This series introduces a new weight-balanced binary tree (wbtree) for
> general use. It's largely leveraged from the rbtree, copying it's
> rotate functions, while introducing different rebalance and erase
> functions. This tree is particularly useful for managing memory
> ranges, where it's desirable to have the most likely targets (the
> largest ranges) at the top of each subtree.
>
> Patches 2 & 3 go on to convert the KVM memory slots to a growable
> array and make use of wbtree for efficient managment. Trying to
> exercise the worst case for this data structure, I ran netperf
> TCP_RR on an emulated rtl8139 NIC connected directly to the host
> via a tap. Both qemu-kvm and the netserver on the host were
> pinned to optimal CPUs with taskset. This series resulted in
> a 3% improvement for this test.

Marcelo asked about kernbench with ept=0 on this series. Using a 4-way,
10G guest with pinned vcpus, build in tmpfs, 10x optimal run, I got:

before (stdev) after (stdev) %
--------+-----------------+----------------+-------
Elapsed | 42.809 (0.19) | 42.305 (0.23) | -1.18
User | 115.709 (0.22) | 114.720 (0.31) | -0.85
System | 41.605 (0.14) | 40.924 (0.20) | -1.64
%cpu | 366.9 (1.45) | 367.6 (1.51) | 0.19
context | 7272.3 (68.6) | 7249.5 (97.8) | -0.31
sleeps | 14826.2 (110.6) | 14798.5 (63.0) | -0.19

So, a small but measurable gain. Also, sorry for forgetting to address
Marcelo's comments from the original version of patch 2/3, I'll pick
them up in the next round. Thanks,

Alex

2011-02-23 09:28:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 7/7] KVM: MMU: cache guest page number to guest frame number

On 02/23/2011 03:38 AM, Xiao Guangrong wrote:
> On 02/22/2011 10:32 PM, Avi Kivity wrote:
> > On 02/22/2011 10:16 AM, Xiao Guangrong wrote:
> >> Cache the guest page number to guest frame number mapping to avoid walking the
> >> guest page tables frequently; the 'vtlb' idea is from Xen.
> >>
> >> Note:
> >> we can't use the vtlb in EPT guests since the guest TLB invalidation operations
> >> (CR3 reload, invlpg) are not intercepted; it also can't be used for L2 nested NPT
> >> guests for the same reason, but we can use it to cache L1's NPT page tables.
> >>
> >
> > I'm not so hot about introducing a new mechanism strictly for older hosts... EPT exists in three generations of Intel processors now (Sandy Bridge, Westmere, and Nehalem), and NPT is significantly older.
> >
>
> Um... so should we stop adding new features for softmmu, and only accept bug fixes? :-)

No. There is always a tradeoff between features and complexity. What
I'm saying is that I want to shift the tradeoff, for older processors,
towards reducing complexity. An improvement that is very simple, or
gives very large gains, will be accepted. A complex improvement that
gives small gains may be rejected (but if it's for newer processors, it
may be accepted). It's a way for the maintainers to manage the ever
growing complexity.

--
error compiling committee.c: too many arguments to function

2011-02-23 13:09:28

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] Weight-balanced tree

On 02/22/2011 08:55 PM, Alex Williamson wrote:
> Signed-off-by: Alex Williamson<[email protected]>
> ---
>
> include/linux/wbtree.h | 55 ++++++++++++++++
> lib/Makefile | 3 +
> lib/wbtree.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 227 insertions(+), 1 deletions(-)
> create mode 100644 include/linux/wbtree.h
> create mode 100644 lib/wbtree.c
>

Andrew, can we add something like this to the tree? Can it go through
the kvm tree?

> diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h
> new file mode 100644
> index 0000000..ef70f6f
> --- /dev/null
> +++ b/include/linux/wbtree.h
> @@ -0,0 +1,55 @@
> +/*
> + * Weight-balanced binary tree
> + *
> + * The balance of this tree is based on search probability. The
> + * heaviest weighted nodes (the ones we're most likely to hit) are
> + * at the top of each subtree.
> + *
> + * Copyright (C) 2011 Red Hat, Inc.
> + *
> + * Authors:
> + * Alex Williamson <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +#ifndef _LINUX_WBTREE_H
> +#define _LINUX_WBTREE_H
> +
> +#include <linux/kernel.h>
> +#include <linux/stddef.h>
> +
> +struct wb_node {
> + struct wb_node *wb_parent;
> + struct wb_node *wb_left;
> + struct wb_node *wb_right;
> + unsigned long wb_weight;
> +};
> +
> +struct wb_root {
> + struct wb_node *wb_node;
> +};
> +
> +#define WB_ROOT (struct wb_root) { NULL, }
> +#define wb_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +extern void wb_rebalance(struct wb_node *node, struct wb_root *root);
> +extern void wb_erase(struct wb_node *node, struct wb_root *root);
> +extern struct wb_node *wb_first(struct wb_root *root);
> +extern struct wb_node *wb_last(struct wb_root *root);
> +extern struct wb_node *wb_next(struct wb_node *node);
> +extern struct wb_node *wb_prev(struct wb_node *node);
> +
> +static inline void wb_set_weight(struct wb_node *node, unsigned long weight)
> +{
> + node->wb_weight = weight;
> +}
> +
> +static inline void wb_link_node(struct wb_node *node, struct wb_node *parent,
> + struct wb_node **wb_link)
> +{
> + node->wb_left = node->wb_right = NULL;
> + node->wb_parent = parent;
> + *wb_link = node;
> +}
> +#endif /* _LINUX_WBTREE_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index cbb774f..5c42e63 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -21,7 +21,8 @@ lib-y += kobject.o kref.o klist.o
>
> obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
> bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
> - string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o
> + string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o \
> + wbtree.o
>

obj-$(CONFIG_WEIGHT_BALANCED_TREE) += wbtree.o

then kvm can select it, and the impact on non-kvm kernels is removed.

> ifeq ($(CONFIG_DEBUG_KOBJECT),y)
> CFLAGS_kobject.o += -DDEBUG
> diff --git a/lib/wbtree.c b/lib/wbtree.c
> new file mode 100644
> index 0000000..54c3817
> --- /dev/null
> +++ b/lib/wbtree.c
> @@ -0,0 +1,170 @@
> +/*
> + * Weight-balanced binary tree
> + *
> + * The balance of this tree is based on search probability. The
> + * heaviest weighted nodes (the ones we're most likely to hit) are
> + * at the top of each subtree.
> + *
> + * Copyright (C) 2011 Red Hat, Inc.
> + *
> + * Authors:
> + * Alex Williamson <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +#include <linux/wbtree.h>
> +#include <linux/module.h>
> +
> +static void __wb_rotate_left(struct wb_node *node, struct wb_root *root)
> +{
> + struct wb_node *right = node->wb_right;
> + struct wb_node *parent = node->wb_parent;
> +
> + if ((node->wb_right = right->wb_left))
> + right->wb_left->wb_parent = node;
> + right->wb_left = node;
> +
> + right->wb_parent = parent;
> +
> + if (parent) {
> + if (node == parent->wb_left)
> + parent->wb_left = right;
> + else
> + parent->wb_right = right;
> + } else
> + root->wb_node = right;
> +
> + node->wb_parent = right;
> +}
> +
> +static void __wb_rotate_right(struct wb_node *node, struct wb_root *root)
> +{
> + struct wb_node *left = node->wb_left;
> + struct wb_node *parent = node->wb_parent;
> +
> + if ((node->wb_left = left->wb_right))
> + left->wb_right->wb_parent = node;
> + left->wb_right = node;
> +
> + left->wb_parent = parent;
> +
> + if (parent) {
> + if (node == parent->wb_right)
> + parent->wb_right = left;
> + else
> + parent->wb_left = left;
> + } else
> + root->wb_node = left;
> +
> + node->wb_parent = left;
> +}
> +
> +void wb_rebalance(struct wb_node *node, struct wb_root *root)
> +{
> + while (node->wb_parent && node->wb_parent->wb_weight < node->wb_weight) {
> + if (node == node->wb_parent->wb_left)
> + __wb_rotate_right(node->wb_parent, root);
> + else
> + __wb_rotate_left(node->wb_parent, root);
> + }
> +}
> +EXPORT_SYMBOL(wb_rebalance);
> +
> +void wb_erase(struct wb_node *node, struct wb_root *root)
> +{
> + while (node->wb_left || node->wb_right) {
> + if (!node->wb_left)
> + __wb_rotate_left(node, root);
> + else if (!node->wb_right)
> + __wb_rotate_right(node, root);
> + else {
> + if (node->wb_left->wb_weight >
> + node->wb_right->wb_weight)
> + __wb_rotate_right(node, root);
> + else
> + __wb_rotate_left(node, root);
> + }
> + }
> +
> + if (node->wb_parent) {
> + if (node->wb_parent->wb_left == node)
> + node->wb_parent->wb_left = NULL;
> + else
> + node->wb_parent->wb_right = NULL;
> + } else
> + root->wb_node = NULL;
> +}
> +EXPORT_SYMBOL(wb_erase);
> +
> +struct wb_node *wb_first(struct wb_root *root)
> +{
> + struct wb_node *node;
> +
> + node = root->wb_node;
> + if (!node)
> + return NULL;
> +
> + while (node->wb_left)
> + node = node->wb_left;
> +
> + return node;
> +}
> +EXPORT_SYMBOL(wb_first);
> +
> +struct wb_node *wb_last(struct wb_root *root)
> +{
> + struct wb_node *node;
> +
> + node = root->wb_node;
> + if (!node)
> + return NULL;
> +
> + while (node->wb_right)
> + node = node->wb_right;
> +
> + return node;
> +}
> +EXPORT_SYMBOL(wb_last);
> +
> +struct wb_node *wb_next(struct wb_node *node)
> +{
> + struct wb_node *parent;
> +
> + if (node->wb_parent == node)
> + return NULL;
> +
> + if (node->wb_right) {
> + node = node->wb_right;
> + while (node->wb_left)
> + node = node->wb_left;
> + return node;
> + }
> +
> + while ((parent = node->wb_parent) && node == parent->wb_right)
> + node = parent;
> +
> + return parent;
> +}
> +EXPORT_SYMBOL(wb_next);
> +
> +struct wb_node *wb_prev(struct wb_node *node)
> +{
> + struct wb_node *parent;
> +
> + if (node->wb_parent == node)
> + return NULL;
> +
> + if (node->wb_left) {
> + node = node->wb_left;
> + while (node->wb_right)
> + node = node->wb_right;
> + return node;
> + }
> +
> + while ((parent = node->wb_parent) && node == parent->wb_left)
> + node = parent;
> +
> + return parent;
> +}
> +EXPORT_SYMBOL(wb_prev);
>


--
error compiling committee.c: too many arguments to function

2011-02-23 13:12:48

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 02/22/2011 08:54 PM, Alex Williamson wrote:
> This series introduces a new weight-balanced binary tree (wbtree) for
> general use. It's largely leveraged from the rbtree, copying it's
> rotate functions, while introducing different rebalance and erase
> functions. This tree is particularly useful for managing memory
> ranges, where it's desirable to have the most likely targets (the
> largest ranges) at the top of each subtree.
>
> Patches 2 & 3 go on to convert the KVM memory slots to a growable
> array and make use of wbtree for efficient managment. Trying to
> exercise the worst case for this data structure, I ran netperf
> TCP_RR on an emulated rtl8139 NIC connected directly to the host
> via a tap. Both qemu-kvm and the netserver on the host were
> pinned to optimal CPUs with taskset. This series resulted in
> a 3% improvement for this test.
>

In this case, I think most of the faults (at least after the guest was
warmed up) missed the tree completely. In this case a weight balanced
tree is hardly optimal (it is optimized for hits), so I think you'll see
a bigger gain from the mmio fault optimization. You'll probably see
most of the gain running mmu intensive tests with ept=0.

> Note that part of why this series is RFC is that the print_tree
> function in the last patch is debug code that generates output
> for dot. You can copy the output to a file and run:
>
> dot -Tpdf foo.dot > foo.pdf
>
> to generate a nice diagram of the tree currently in use. I'll
> follow-up with a few examples. Thanks,
>
> Alex
>

--
error compiling committee.c: too many arguments to function

2011-02-23 17:02:52

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] Weight-balanced tree

On Wed, 2011-02-23 at 15:09 +0200, Avi Kivity wrote:
> On 02/22/2011 08:55 PM, Alex Williamson wrote:
> > Signed-off-by: Alex Williamson<[email protected]>
> > ---
> >
> > include/linux/wbtree.h | 55 ++++++++++++++++
> > lib/Makefile | 3 +
> > lib/wbtree.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 227 insertions(+), 1 deletions(-)
> > create mode 100644 include/linux/wbtree.h
> > create mode 100644 lib/wbtree.c
> >
>
> Andrew, can we add something like this to the tree? Can it go through
> the kvm tree?
>
> > diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h
> > new file mode 100644
> > index 0000000..ef70f6f
> > --- /dev/null
> > +++ b/include/linux/wbtree.h
> > @@ -0,0 +1,55 @@
> > +/*
> > + * Weight-balanced binary tree
> > + *
> > + * The balance of this tree is based on search probability. The
> > + * heaviest weighted nodes (the ones we're most likely to hit) are
> > + * at the top of each subtree.
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc.
> > + *
> > + * Authors:
> > + * Alex Williamson <[email protected]>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2. See
> > + * the COPYING file in the top-level directory.
> > + */
> > +#ifndef _LINUX_WBTREE_H
> > +#define _LINUX_WBTREE_H
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/stddef.h>
> > +
> > +struct wb_node {
> > + struct wb_node *wb_parent;
> > + struct wb_node *wb_left;
> > + struct wb_node *wb_right;
> > + unsigned long wb_weight;
> > +};
> > +
> > +struct wb_root {
> > + struct wb_node *wb_node;
> > +};
> > +
> > +#define WB_ROOT (struct wb_root) { NULL, }
> > +#define wb_entry(ptr, type, member) container_of(ptr, type, member)
> > +
> > +extern void wb_rebalance(struct wb_node *node, struct wb_root *root);
> > +extern void wb_erase(struct wb_node *node, struct wb_root *root);
> > +extern struct wb_node *wb_first(struct wb_root *root);
> > +extern struct wb_node *wb_last(struct wb_root *root);
> > +extern struct wb_node *wb_next(struct wb_node *node);
> > +extern struct wb_node *wb_prev(struct wb_node *node);
> > +
> > +static inline void wb_set_weight(struct wb_node *node, unsigned long weight)
> > +{
> > + node->wb_weight = weight;
> > +}
> > +
> > +static inline void wb_link_node(struct wb_node *node, struct wb_node *parent,
> > + struct wb_node **wb_link)
> > +{
> > + node->wb_left = node->wb_right = NULL;
> > + node->wb_parent = parent;
> > + *wb_link = node;
> > +}
> > +#endif /* _LINUX_WBTREE_H */
> > diff --git a/lib/Makefile b/lib/Makefile
> > index cbb774f..5c42e63 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -21,7 +21,8 @@ lib-y += kobject.o kref.o klist.o
> >
> > obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
> > bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
> > - string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o
> > + string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o \
> > + wbtree.o
> >
>
> obj-$(CONFIG_WEIGHT_BALANCED_TREE) += wbtree.o
>
> then kvm can select it, and the impact on non-kvm kernels is removed.

Then we'd have issues trying to build an external kvm module for a
pre-existing non-kvm kernel. Do we care? If we were to take such a
path, I think the default should be on, kvm would depend on it, but we
could add an option to disable it for EMBEDDED/EXPERT. Thanks,

Alex

2011-02-23 17:08:55

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] Weight-balanced tree

On 02/23/2011 07:02 PM, Alex Williamson wrote:
> >
> > obj-$(CONFIG_WEIGHT_BALANCED_TREE) += wbtree.o
> >
> > then kvm can select it, and the impact on non-kvm kernels is removed.
>
> Then we'd have issues trying to build an external kvm module for a
> pre-existing non-kvm kernel. Do we care?

Officially, no.

What we typically do in these cases is copy the code into the kvm-kmod
compatibility layer and compile it if the kernel doesn't supply it (like
all older kernels regardless of config).

> If we were to take such a
> path, I think the default should be on, kvm would depend on it, but we
> could add an option to disable it for EMBEDDED/EXPERT. Thanks,

That would work as well.

--
error compiling committee.c: too many arguments to function

2011-02-23 18:06:41

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Wed, 2011-02-23 at 15:12 +0200, Avi Kivity wrote:
> On 02/22/2011 08:54 PM, Alex Williamson wrote:
> > This series introduces a new weight-balanced binary tree (wbtree) for
> > general use. It's largely leveraged from the rbtree, copying it's
> > rotate functions, while introducing different rebalance and erase
> > functions. This tree is particularly useful for managing memory
> > ranges, where it's desirable to have the most likely targets (the
> > largest ranges) at the top of each subtree.
> >
> > Patches 2 & 3 go on to convert the KVM memory slots to a growable
> > array and make use of wbtree for efficient managment. Trying to
> > exercise the worst case for this data structure, I ran netperf
> > TCP_RR on an emulated rtl8139 NIC connected directly to the host
> > via a tap. Both qemu-kvm and the netserver on the host were
> > pinned to optimal CPUs with taskset. This series resulted in
> > a 3% improvement for this test.
> >
>
> In this case, I think most of the faults (at least after the guest was
> warmed up) missed the tree completely.

Except for the mmio faults for the NIC, which will traverse the entire
depth of that branch of the tree for every access.

> In this case a weight balanced
> tree is hardly optimal (it is optimized for hits), so I think you'll see
> a bigger gain from the mmio fault optimization. You'll probably see
> most of the gain running mmu intensive tests with ept=0.

Right, the gain expected by this test is that we're only traversing 6-7
tree nodes until we don't find a match, versus the full 32 entries of
the original memslot array. So it's effectively comparing worst case
scenarios for both data structures.

Hopefully the follow-up kernbench run with ept=0 shows that there's
also still a benefit in the data match scenario. The existing array
ends up being nearly optimal for memory hits since it registers memory
from 1M - 3.5G in slot0 and 4G - 10.5G in slot1. For the tree, we jump
straight to the bigger slot. I'll run one more set of kernbench tests
with the original code, just reversing slots 0&1 to see if we take much
of a hit from the tree overhead. Thanks,

Alex

2011-02-23 19:28:58

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Wed, 2011-02-23 at 11:06 -0700, Alex Williamson wrote:
> On Wed, 2011-02-23 at 15:12 +0200, Avi Kivity wrote:
> > On 02/22/2011 08:54 PM, Alex Williamson wrote:
> > > This series introduces a new weight-balanced binary tree (wbtree) for
> > > general use. It's largely leveraged from the rbtree, copying it's
> > > rotate functions, while introducing different rebalance and erase
> > > functions. This tree is particularly useful for managing memory
> > > ranges, where it's desirable to have the most likely targets (the
> > > largest ranges) at the top of each subtree.
> > >
> > > Patches 2 & 3 go on to convert the KVM memory slots to a growable
> > > array and make use of wbtree for efficient managment. Trying to
> > > exercise the worst case for this data structure, I ran netperf
> > > TCP_RR on an emulated rtl8139 NIC connected directly to the host
> > > via a tap. Both qemu-kvm and the netserver on the host were
> > > pinned to optimal CPUs with taskset. This series resulted in
> > > a 3% improvement for this test.
> > >
> >
> > In this case, I think most of the faults (at least after the guest was
> > warmed up) missed the tree completely.
>
> Except for the mmio faults for the NIC, which will traverse the entire
> depth of that branch of the tree for every access.
>
> > In this case a weight balanced
> > tree is hardly optimal (it is optimized for hits), so I think you'll see
> > a bigger gain from the mmio fault optimization. You'll probably see
> > most of the gain running mmu intensive tests with ept=0.
>
> Right, the gain expected by this test is that we're only traversing 6-7
> tree nodes until we don't find a match, versus the full 32 entries of
> the original memslot array. So it's effectively comparing worst case
> scenarios for both data structures.
>
> Hopefully the follow-up kernbench run with ept=0 shows that there's
> also still a benefit in the data match scenario. The existing array
> ends up being nearly optimal for memory hits since it registers memory
> from 1M - 3.5G in slot0 and 4G - 10.5G in slot1. For the tree, we jump
> straight to the bigger slot. I'll run one more set of kernbench tests
> with the original code, just reversing slots 0&1 to see if we take much
> of a hit from the tree overhead. Thanks,

I had forgotten about <1M mem, so actually the slot configuration was:

0: <1M
1: 1M - 3.5G
2: 4G+

I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
<1M), and got these kernbench results:

base (stdev) reorder (stdev) wbtree (stdev)
--------+-----------------+----------------+----------------+
Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
%cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |

So, wbtree is only slightly behind reordering, and the standard
deviation suggests the runs are mostly within the noise of each other.
Thanks,

Alex


2011-02-23 20:19:38

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] Weight-balanced tree

On Wed, 2011-02-23 at 19:08 +0200, Avi Kivity wrote:
> On 02/23/2011 07:02 PM, Alex Williamson wrote:
> > >
> > > obj-$(CONFIG_WEIGHT_BALANCED_TREE) += wbtree.o
> > >
> > > then kvm can select it, and the impact on non-kvm kernels is removed.
> >
> > Then we'd have issues trying to build an external kvm module for a
> > pre-existing non-kvm kernel. Do we care?
>
> Officially, no.
>
> What we typically do in these cases is copy the code into the kvm-kmod
> compatibility layer and compile it if the kernel doesn't supply it (like
> all older kernels regardless of config).
>
> > If we were to take such a
> > path, I think the default should be on, kvm would depend on it, but we
> > could add an option to disable it for EMBEDDED/EXPERT. Thanks,
>
> That would work as well.

In retrospect, I suppose it doesn't really make sense to have it
compiled in "just in case". I'll follow your first suggestion. Thanks,

Alex

2011-02-24 10:04:45

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 02/23/2011 08:06 PM, Alex Williamson wrote:
> On Wed, 2011-02-23 at 15:12 +0200, Avi Kivity wrote:
> > On 02/22/2011 08:54 PM, Alex Williamson wrote:
> > > This series introduces a new weight-balanced binary tree (wbtree) for
> > > general use. It's largely leveraged from the rbtree, copying it's
> > > rotate functions, while introducing different rebalance and erase
> > > functions. This tree is particularly useful for managing memory
> > > ranges, where it's desirable to have the most likely targets (the
> > > largest ranges) at the top of each subtree.
> > >
> > > Patches 2 & 3 go on to convert the KVM memory slots to a growable
> > > array and make use of wbtree for efficient managment. Trying to
> > > exercise the worst case for this data structure, I ran netperf
> > > TCP_RR on an emulated rtl8139 NIC connected directly to the host
> > > via a tap. Both qemu-kvm and the netserver on the host were
> > > pinned to optimal CPUs with taskset. This series resulted in
> > > a 3% improvement for this test.
> > >
> >
> > In this case, I think most of the faults (at least after the guest was
> > warmed up) missed the tree completely.
>
> Except for the mmio faults for the NIC, which will traverse the entire
> depth of that branch of the tree for every access.

That is exactly what I meant: most of the faults cause the search to
fail. What we want here is to cache the success/fail result of the
search so we don't do it in the first place.

> > In this case a weight balanced
> > tree is hardly optimal (it is optimized for hits), so I think you'll see
> > a bigger gain from the mmio fault optimization. You'll probably see
> > most of the gain running mmu intensive tests with ept=0.
>
> Right, the gain expected by this test is that we're only traversing 6-7
> tree nodes until we don't find a match, versus the full 32 entries of
> the original memslot array. So it's effectively comparing worst case
> scenarios for both data structures.

If we optimized the linear list we'd sort it by size, descending, and
limit it by the number of instantiated slots, which I think would beat
the tree.
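A minimal sketch of the lookup half of that alternative, assuming the memslots array is re-sorted by npages in descending order whenever a slot is added or removed (the sort itself is not shown); search_sorted_memslots() is an illustrative name, not code from the thread.

/*
 * Sketch only: linear scan over an array kept sorted by size,
 * descending.  Hits land in the first iteration or two for typical
 * guests; empty slots sort to the end, so a miss stops at the first
 * slot with npages == 0 instead of walking the whole array.
 */
static struct kvm_memory_slot *search_sorted_memslots(struct kvm_memslots *slots,
						      gfn_t gfn)
{
	int i;

	for (i = 0; i < slots->nmemslots; i++) {
		struct kvm_memory_slot *slot = &slots->memslots[i];

		if (!slot->npages)
			break;

		if (gfn >= slot->base_gfn &&
		    gfn < slot->base_gfn + slot->npages)
			return slot;
	}

	return NULL;
}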

--
error compiling committee.c: too many arguments to function

2011-02-24 10:07:03

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 02/23/2011 09:28 PM, Alex Williamson wrote:
> I had forgotten about <1M mem, so actually the slot configuration was:
>
> 0: <1M
> 1: 1M - 3.5G
> 2: 4G+
>
> I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> <1M), and got these kernbench results:
>
> base (stdev) reorder (stdev) wbtree (stdev)
> --------+-----------------+----------------+----------------+
> Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
>
> So, wbtree is only slightly behind reordering, and the standard
> deviation suggests the runs are mostly within the noise of each other.
> Thanks,

Doesn't this indicate we should use reordering, instead of a new data
structure?

--
error compiling committee.c: too many arguments to function

2011-02-24 10:39:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] kvm: Allow memory slot array to grow on demand

On 02/22/2011 08:55 PM, Alex Williamson wrote:
> Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
> to grow on demand. Private slots are now allocated at the
> front instead of the end. Only x86 seems to use private slots,
> so this is now zero for all other archs. The memslots pointer
> is already updated using rcu, so changing the size of the
> array when it's replaced is straightforward. x86 also keeps
> a bitmap of slots used by a kvm_mmu_page, which requires a
> shadow tlb flush whenever we increase the number of slots.
> This forces the pages to be rebuilt with the new bitmap size.
>
>
>
> #define KVM_PIO_PAGE_OFFSET 1
> #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> @@ -207,7 +206,7 @@ struct kvm_mmu_page {
> * One bit set per slot which has memory
> * in this shadow page.
> */
> - DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
> + unsigned long *slot_bitmap;

What about

union {
DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG);
unsigned long *indirect_slot_bitmap;
};

to make the hackery below more explicit?

>
> static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> +
> ASSERT(is_empty_shadow_page(sp->spt));
> hlist_del(&sp->hash_link);
> list_del(&sp->link);
> + if (unlikely(slots->nmemslots> sizeof(sp->slot_bitmap) * 8))
> + kfree(sp->slot_bitmap);
> __free_page(virt_to_page(sp->spt));
> if (!sp->role.direct)
> __free_page(virt_to_page(sp->gfns));
> @@ -1048,6 +1052,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> u64 *parent_pte, int direct)
> {
> struct kvm_mmu_page *sp;
> + struct kvm_memslots *slots = kvm_memslots(vcpu->kvm);
>
> sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache, sizeof *sp);
> sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
> @@ -1056,7 +1061,16 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> PAGE_SIZE);
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> list_add(&sp->link,&vcpu->kvm->arch.active_mmu_pages);
> - bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
> +
> + if (unlikely(slots->nmemslots> sizeof(sp->slot_bitmap) * 8)) {
> + sp->slot_bitmap = kzalloc(sizeof(long) *
> + BITS_TO_LONGS(slots->nmemslots),
> + GFP_KERNEL);
> + if (!sp->slot_bitmap)
> + return NULL;

We don't support failing kvm_mmu_get_page(). See
mmu_memory_cache_alloc() and mmu_topup_memory_caches().

> + } else
> + bitmap_zero((void *)&sp->slot_bitmap, slots->nmemslots);
> +
>



>
> static void mmu_convert_notrap(struct kvm_mmu_page *sp)
> @@ -3530,13 +3548,19 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
> void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> {
> struct kvm_mmu_page *sp;
> + struct kvm_memslots *slots = kvm_memslots(kvm);
>
> list_for_each_entry(sp,&kvm->arch.active_mmu_pages, link) {
> int i;
> u64 *pt;
>
> - if (!test_bit(slot, sp->slot_bitmap))
> - continue;
> + if (likely(slots->nmemslots<= sizeof(sp->slot_bitmap) * 8)) {
> + if (!test_bit(slot, (void *)&sp->slot_bitmap))
> + continue;
> + } else {
> + if (!test_bit(slot, sp->slot_bitmap))
> + continue;
> + }

That likely() would fail 100% for certain guests.

Neater to write

slot_bitmap = sp_slot_bitmap(sp);
if (!test_bit(slot, sp_slot_bitmap))
continue;

> +
> +/*
> + * Protect from malicious userspace by putting an upper bound on the number
> + * of memory slots. This is an arbitrarily large number that still allows
> + * us to make pseudo-guarantees about supporting 64 assigned devices with
> + * plenty of slots left over.
> + */
> +#ifndef KVM_MAX_MEM_SLOTS
> + #define KVM_MAX_MEM_SLOTS 512
> +#endif

The increase should be in a separate patch (after we optimize the
search-fail case).

>
> if (!npages) {
> r = -ENOMEM;
> - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
> +
> + nmemslots = (mem->slot>= kvm->memslots->nmemslots) ?
> + mem->slot + 1 : kvm->memslots->nmemslots;
> +
> + slots = kzalloc(sizeof(struct kvm_memslots) +
> + nmemslots * sizeof(struct kvm_memory_slot),
> + GFP_KERNEL);
> if (!slots)
> goto out_free;
> - memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> - if (mem->slot>= slots->nmemslots)
> - slots->nmemslots = mem->slot + 1;
> + memcpy(slots, kvm->memslots,
> + sizeof(struct kvm_memslots) + kvm->memslots->nmemslots *
> + sizeof(struct kvm_memory_slot));
> + slots->nmemslots = nmemslots;
> slots->generation++;
> slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
>
> @@ -787,12 +797,21 @@ skip_lpage:
> }
>
> r = -ENOMEM;
> - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
> +
> + if (mem->slot>= kvm->memslots->nmemslots) {
> + nmemslots = mem->slot + 1;
> + flush = true;

Isn't flush here a little too agressive? Shouldn't we flush only if we
cross the BITS_PER_LONG threshold?

> + } else
> + nmemslots = kvm->memslots->nmemslots;
> +
> + slots = kzalloc(sizeof(struct kvm_memslots) +
> + nmemslots * sizeof(struct kvm_memory_slot),
> + GFP_KERNEL);

Code duplication -> helper.

> if (!slots)
> goto out_free;
> - memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> - if (mem->slot>= slots->nmemslots)
> - slots->nmemslots = mem->slot + 1;
> + memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots) +
> + kvm->memslots->nmemslots * sizeof(struct kvm_memory_slot));
> + slots->nmemslots = nmemslots;
> slots->generation++;
>
> /* actual memory is freed via old in kvm_free_physmem_slot below */
> @@ -808,6 +827,9 @@ skip_lpage:
> rcu_assign_pointer(kvm->memslots, slots);
> synchronize_srcu_expedited(&kvm->srcu);
>
> + if (flush)
> + kvm_arch_flush_shadow(kvm);
> +

Need to flush before rcu_assign_pointer() so kvm_mmu_free_page() sees
the old slot count.

But even that is insufficient since we'll create direct and indirect
slot bitmaps concurrently. Need to store whether the bitmap is direct
or not in kvm_mmu_page.

> @@ -1832,6 +1854,8 @@ static long kvm_vm_ioctl(struct file *filp,
> sizeof kvm_userspace_mem))
> goto out;
>
> + kvm_userspace_mem.slot += KVM_PRIVATE_MEM_SLOTS;
> +

Slightly uneasy about this, but no real objection.

--
error compiling committee.c: too many arguments to function

2011-02-24 17:35:56

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote:
> On 02/23/2011 09:28 PM, Alex Williamson wrote:
> > I had forgotten about <1M mem, so actually the slot configuration was:
> >
> > 0: <1M
> > 1: 1M - 3.5G
> > 2: 4G+
> >
> > I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> > <1M), and got these kernbench results:
> >
> > base (stdev) reorder (stdev) wbtree (stdev)
> > --------+-----------------+----------------+----------------+
> > Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> > User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> > System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> > %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> > context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> > sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
> >
> > So, wbtree is only slightly behind reordering, and the standard
> > deviation suggests the runs are mostly within the noise of each other.
> > Thanks,
>
> Doesn't this indicate we should use reordering, instead of a new data
> structure?

The original problem that brought this on was scaling. The re-ordered
array still has O(N) scaling while the tree should have ~O(logN) (note
that it currently doesn't because it needs a compaction algorithm added
after insert and remove). So yes, it's hard to beat the results of a
test that hammers on the first couple entries of a sorted array, but I
think the tree has better than current performance and more predictable
when scaled performance.

If we knew when we were searching for which type of data, it would
perhaps be nice if we could use a sorted array for guest memory (since
it's nicely bounded into a small number of large chunks), and a tree for
mmio (where we expect the scaling to be a factor). Thanks,

Alex
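A sketch of the split lookup floated above, purely to illustrate the idea: a short array of the large RAM slots is scanned first, with the weight-balanced tree search from patch 3/3 as the fallback for the many small (mmio-heavy, assigned-device) slots. The ram_slots and nr_ram_slots fields are hypothetical additions to struct kvm_memslots.

/*
 * Illustrative only: try the handful of big RAM slots first, then fall
 * back to the tree search for everything else.  ram_slots[] would be
 * maintained alongside the tree when slots are registered.
 */
static struct kvm_memory_slot *split_gfn_to_memslot(struct kvm_memslots *slots,
						    gfn_t gfn)
{
	int i;

	for (i = 0; i < slots->nr_ram_slots; i++) {
		struct kvm_memory_slot *slot = slots->ram_slots[i];

		if (gfn >= slot->base_gfn &&
		    gfn < slot->base_gfn + slot->npages)
			return slot;
	}

	return kvm_memslot_search(gfn, slots);
}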

2011-02-24 18:08:26

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] kvm: Allow memory slot array to grow on demand

On Thu, 2011-02-24 at 12:39 +0200, Avi Kivity wrote:
> On 02/22/2011 08:55 PM, Alex Williamson wrote:
> > Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
> > to grow on demand. Private slots are now allocated at the
> > front instead of the end. Only x86 seems to use private slots,
> > so this is now zero for all other archs. The memslots pointer
> > is already updated using rcu, so changing the size of the
> > array when it's replaced is straightforward. x86 also keeps
> > a bitmap of slots used by a kvm_mmu_page, which requires a
> > shadow tlb flush whenever we increase the number of slots.
> > This forces the pages to be rebuilt with the new bitmap size.
> >
> >
> >
> > #define KVM_PIO_PAGE_OFFSET 1
> > #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> > @@ -207,7 +206,7 @@ struct kvm_mmu_page {
> > * One bit set per slot which has memory
> > * in this shadow page.
> > */
> > - DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
> > + unsigned long *slot_bitmap;
>
> What about
>
> union {
> DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG);
> unsigned long *indirect_slot_bitmap;
> };
>
> to make the hackery below more explicit?

Yeah, it needs something to make the hackery go down easier. I was
actually thinking about:

unsigned long *slot_bitmap;
DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG);

Where we'd then just set:

slot_bitmap = &direct_slot_bitmap;

It wastes 8 bytes, and pushes the cache a little harder, but still helps
the locality and makes the usage more consistent.

>
> >
> > static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > + struct kvm_memslots *slots = kvm_memslots(kvm);
> > +
> > ASSERT(is_empty_shadow_page(sp->spt));
> > hlist_del(&sp->hash_link);
> > list_del(&sp->link);
> > + if (unlikely(slots->nmemslots> sizeof(sp->slot_bitmap) * 8))
> > + kfree(sp->slot_bitmap);
> > __free_page(virt_to_page(sp->spt));
> > if (!sp->role.direct)
> > __free_page(virt_to_page(sp->gfns));
> > @@ -1048,6 +1052,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> > u64 *parent_pte, int direct)
> > {
> > struct kvm_mmu_page *sp;
> > + struct kvm_memslots *slots = kvm_memslots(vcpu->kvm);
> >
> > sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache, sizeof *sp);
> > sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
> > @@ -1056,7 +1061,16 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> > PAGE_SIZE);
> > set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > list_add(&sp->link,&vcpu->kvm->arch.active_mmu_pages);
> > - bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
> > +
> > + if (unlikely(slots->nmemslots> sizeof(sp->slot_bitmap) * 8)) {
> > + sp->slot_bitmap = kzalloc(sizeof(long) *
> > + BITS_TO_LONGS(slots->nmemslots),
> > + GFP_KERNEL);
> > + if (!sp->slot_bitmap)
> > + return NULL;
>
> We don't support failing kvm_mmu_get_page(). See
> mmu_memory_cache_alloc() and mmu_topup_memory_caches().

Hmm, apparently my search stopped at __direct_map() calling
kvm_mmu_get_page() and handling an error.

> > + } else
> > + bitmap_zero((void *)&sp->slot_bitmap, slots->nmemslots);
> > +
> >
>
>
>
> >
> > static void mmu_convert_notrap(struct kvm_mmu_page *sp)
> > @@ -3530,13 +3548,19 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
> > void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> > {
> > struct kvm_mmu_page *sp;
> > + struct kvm_memslots *slots = kvm_memslots(kvm);
> >
> > list_for_each_entry(sp,&kvm->arch.active_mmu_pages, link) {
> > int i;
> > u64 *pt;
> >
> > - if (!test_bit(slot, sp->slot_bitmap))
> > - continue;
> > + if (likely(slots->nmemslots<= sizeof(sp->slot_bitmap) * 8)) {
> > + if (!test_bit(slot, (void *)&sp->slot_bitmap))
> > + continue;
> > + } else {
> > + if (!test_bit(slot, sp->slot_bitmap))
> > + continue;
> > + }
>
> That likely() would fail 100% for certain guests.
>
> Neater to write
>
> slot_bitmap = sp_slot_bitmap(sp);
> if (!test_bit(slot, sp_slot_bitmap))
> continue;

OK

> > +
> > +/*
> > + * Protect from malicious userspace by putting an upper bound on the number
> > + * of memory slots. This is an arbitrarily large number that still allows
> > + * us to make pseudo-guarantees about supporting 64 assigned devices with
> > + * plenty of slots left over.
> > + */
> > +#ifndef KVM_MAX_MEM_SLOTS
> > + #define KVM_MAX_MEM_SLOTS 512
> > +#endif
>
> The increase should be in a separate patch (after we optimize the
> search-fail case).

Ok, I'll make this be 32 + PRIVATE_SLOTS for now

> >
> > if (!npages) {
> > r = -ENOMEM;
> > - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
> > +
> > + nmemslots = (mem->slot>= kvm->memslots->nmemslots) ?
> > + mem->slot + 1 : kvm->memslots->nmemslots;
> > +
> > + slots = kzalloc(sizeof(struct kvm_memslots) +
> > + nmemslots * sizeof(struct kvm_memory_slot),
> > + GFP_KERNEL);
> > if (!slots)
> > goto out_free;
> > - memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> > - if (mem->slot>= slots->nmemslots)
> > - slots->nmemslots = mem->slot + 1;
> > + memcpy(slots, kvm->memslots,
> > + sizeof(struct kvm_memslots) + kvm->memslots->nmemslots *
> > + sizeof(struct kvm_memory_slot));
> > + slots->nmemslots = nmemslots;
> > slots->generation++;
> > slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
> >
> > @@ -787,12 +797,21 @@ skip_lpage:
> > }
> >
> > r = -ENOMEM;
> > - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
> > +
> > + if (mem->slot>= kvm->memslots->nmemslots) {
> > + nmemslots = mem->slot + 1;
> > + flush = true;
>
> Isn't flush here a little too agressive? Shouldn't we flush only if we
> cross the BITS_PER_LONG threshold?

Perhaps, but is that overly exploiting our knowledge about the bitmap
implementation? I figured it's better to err too aggressively than too
lazily, since this is a rare event already.

> > + } else
> > + nmemslots = kvm->memslots->nmemslots;
> > +
> > + slots = kzalloc(sizeof(struct kvm_memslots) +
> > + nmemslots * sizeof(struct kvm_memory_slot),
> > + GFP_KERNEL);
>
> Code duplication -> helper.
>
> > if (!slots)
> > goto out_free;
> > - memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
> > - if (mem->slot>= slots->nmemslots)
> > - slots->nmemslots = mem->slot + 1;
> > + memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots) +
> > + kvm->memslots->nmemslots * sizeof(struct kvm_memory_slot));
> > + slots->nmemslots = nmemslots;
> > slots->generation++;
> >
> > /* actual memory is freed via old in kvm_free_physmem_slot below */
> > @@ -808,6 +827,9 @@ skip_lpage:
> > rcu_assign_pointer(kvm->memslots, slots);
> > synchronize_srcu_expedited(&kvm->srcu);
> >
> > + if (flush)
> > + kvm_arch_flush_shadow(kvm);
> > +
>
> Need to flush before rcu_assign_pointer() so kvm_mmu_free_page() sees
> the old slot count.
>
> But even that is insufficient since we'll create direct and indirect
> slot bitmaps concurrently. Need to store whether the bitmap is direct
> or not in kvm_mmu_page.

Ick. Ok, I'll investigate.

> > @@ -1832,6 +1854,8 @@ static long kvm_vm_ioctl(struct file *filp,
> > sizeof kvm_userspace_mem))
> > goto out;
> >
> > + kvm_userspace_mem.slot += KVM_PRIVATE_MEM_SLOTS;
> > +
>
> Slightly uneasy about this, but no real objection.

If you have better ideas, let me know. This reminds me to ask about
this chunk:

@@ -671,7 +674,7 @@ int __kvm_set_memory_region(struct kvm *kvm,

/* Check for overlaps */
r = -EEXIST;
- for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
+ for (i = KVM_PRIVATE_MEM_SLOTS; i < kvm->memslots->nmemslots; ++i) {
struct kvm_memory_slot *s = &kvm->memslots->memslots[i];

if (s == memslot || !s->npages)

I kept the same behavior as previous, but it highlights that we're not
checking for overlaps between private slots and anything else. Existing
bug? Thanks,

Alex

2011-02-24 23:05:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] Weight-balanced tree

On Tue, 22 Feb 2011 11:55:05 -0700
Alex Williamson <[email protected]> wrote:

> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> include/linux/wbtree.h | 55 ++++++++++++++++
> lib/Makefile | 3 +
> lib/wbtree.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 227 insertions(+), 1 deletions(-)
> create mode 100644 include/linux/wbtree.h
> create mode 100644 lib/wbtree.c
>
> diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h
> new file mode 100644
> index 0000000..ef70f6f
> --- /dev/null
> +++ b/include/linux/wbtree.h
> @@ -0,0 +1,55 @@
> +/*
> + * Weight-balanced binary tree
> + *
> + * The balance of this tree is based on search probability. The
> + * heaviest weighted nodes (the ones we're most likely to hit) are
> + * at the top of each subtree.
> + *
> + * Copyright (C) 2011 Red Hat, Inc.
> + *
> + * Authors:
> + * Alex Williamson <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +#ifndef _LINUX_WBTREE_H
> +#define _LINUX_WBTREE_H
> +
> +#include <linux/kernel.h>
> +#include <linux/stddef.h>
> +
> +struct wb_node {
> + struct wb_node *wb_parent;
> + struct wb_node *wb_left;
> + struct wb_node *wb_right;
> + unsigned long wb_weight;
> +};
> +
> +struct wb_root {
> + struct wb_node *wb_node;
> +};
> +
> +#define WB_ROOT (struct wb_root) { NULL, }
> +#define wb_entry(ptr, type, member) container_of(ptr, type, member)

Please implement this as a C function. That way we get typechecking
which container_of() is unable to provide.

Otherwise the code looks OK to me, apart from the total lack of
documentation. Please document at least all the exported interfaces.
The documentation should be designed to tell people how to maintain
this code, and how to use this code reliably and well from other subsystems.
Don't fall into the trap of just dumbly filling out templates.

The documentation should also carefully explain the locking
requirements which these functions place upon callers. (rwlocks used
how? rcu?)

Once that is all squared away, please merge the code via the KVM tree.
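To make the documentation request concrete, here is a minimal usage sketch of the interfaces as posted (struct range and range_insert() are illustrative, not from the patch). Locking is left entirely to callers; in the KVM patches the tree lives inside struct kvm_memslots, is only ever built on a private copy, and is published with rcu_assign_pointer()/synchronize_srcu_expedited(), so readers need no lock of their own.

#include <linux/wbtree.h>

struct range {
	struct wb_node node;
	unsigned long start;
	unsigned long len;
};

static struct wb_root range_root = WB_ROOT;

/* Insert a non-overlapping range, keyed by [start, start + len). */
static int range_insert(struct range *r)
{
	struct wb_node **p = &range_root.wb_node, *parent = NULL;

	/* weight == expected hit frequency; here simply the range size */
	wb_set_weight(&r->node, r->len);

	while (*p) {
		struct range *cur = wb_entry(*p, struct range, node);

		parent = *p;
		if (r->start + r->len <= cur->start)
			p = &(*p)->wb_left;
		else if (r->start >= cur->start + cur->len)
			p = &(*p)->wb_right;
		else
			return -EEXIST;	/* overlapping range */
	}

	wb_link_node(&r->node, parent, p);
	wb_rebalance(&r->node, &range_root);	/* heavy nodes float upward */
	return 0;
}

Removal is wb_erase(&r->node, &range_root); wb_first()/wb_next() give an in-order walk when one is needed.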

2011-02-27 09:45:00

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] kvm: Allow memory slot array to grow on demand

On 02/24/2011 08:08 PM, Alex Williamson wrote:
> > > @@ -207,7 +206,7 @@ struct kvm_mmu_page {
> > > * One bit set per slot which has memory
> > > * in this shadow page.
> > > */
> > > - DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
> > > + unsigned long *slot_bitmap;
> >
> > What about
> >
> > union {
> > DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG);
> > unsigned long *indirect_slot_bitmap;
> > };
> >
> > to make the hackery below more explicit?
>
> Yeah, it needs something to make the hackery go down easier. I was
> actually thinking about:
>
> unsigned long *slot_bitmap;
> DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG);
>
> Where we'd then just set:
>
> slot_bitmap =&direct_slot_bitmap;
>
> It wastes 8 bytes, and pushes the cache a little harder, but still helps
> the locality and makes the usage more consistent.

unsigned long *sp_slot_bitmap(struct kvm_mmu_page *sp) { ... }

gives you the best of both worlds.
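A sketch of one possible shape for that helper, assuming the union layout suggested earlier in the thread; the indirect_bitmap flag is hypothetical and stands in for the point made in the patch 2/3 review that whether the bitmap is indirect has to be recorded in kvm_mmu_page itself rather than derived from the current slot count.

/* Sketch only: hides the inline-vs-allocated bitmap distinction. */
static unsigned long *sp_slot_bitmap(struct kvm_mmu_page *sp)
{
	if (sp->indirect_bitmap)		/* hypothetical flag */
		return sp->indirect_slot_bitmap;
	return sp->direct_slot_bitmap;
}

Callers then read naturally, e.g. if (!test_bit(slot, sp_slot_bitmap(sp))) continue; regardless of how the bitmap was allocated.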

> >
> > We don't support failing kvm_mmu_get_page(). See
> > mmu_memory_cache_alloc() and mmu_topup_memory_caches().
>
> Hmm, apparently my search stopped at __direct_map() calling
> kvm_mmu_get_page() and handling an error.

That's dead code (was there from the very first commit into mmu.c).

> > >
> > > r = -ENOMEM;
> > > - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
> > > +
> > > + if (mem->slot >= kvm->memslots->nmemslots) {
> > > + nmemslots = mem->slot + 1;
> > > + flush = true;
> >
> > Isn't flush here a little too aggressive? Shouldn't we flush only if we
> > cross the BITS_PER_LONG threshold?
>
> Perhaps, but is that overly exploiting our knowledge about the bitmap
> implementation? I figured it's better to err on the aggressive side than
> the lazy one, since this is a rare event already.

I'm worried about the screen-clearing using the vga window at
0xa[08]000. If that works without too much flushing, then we're fine.

On second thoughts we're likely fine even if we do flush, since it's in
a tight loop so it takes very little work to reestablish the dropped sptes.
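
For reference, the less aggressive variant hinted at above could look
roughly like this in the quoted hunk (a sketch only; treating a word
boundary as the threshold is an assumption):

        if (mem->slot >= kvm->memslots->nmemslots) {
                nmemslots = mem->slot + 1;
                /*
                 * Only flush when the slot count crosses a word boundary,
                 * i.e. when per-shadow-page bitmaps would have to grow;
                 * ordinary growth then no longer drops sptes.
                 */
                flush = DIV_ROUND_UP(nmemslots, BITS_PER_LONG) >
                        DIV_ROUND_UP(kvm->memslots->nmemslots, BITS_PER_LONG);
        }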

> > > @@ -1832,6 +1854,8 @@ static long kvm_vm_ioctl(struct file *filp,
> > > sizeof kvm_userspace_mem))
> > > goto out;
> > >
> > > + kvm_userspace_mem.slot += KVM_PRIVATE_MEM_SLOTS;
> > > +
> >
> > Slightly uneasy about this, but no real objection.
>
> If you have better ideas, let me know. This reminds me to ask about
> this chunk:
>
> @@ -671,7 +674,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
> /* Check for overlaps */
> r = -EEXIST;
> - for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
> + for (i = KVM_PRIVATE_MEM_SLOTS; i < kvm->memslots->nmemslots; ++i) {
> struct kvm_memory_slot *s = &kvm->memslots->memslots[i];
>
> if (s == memslot || !s->npages)
>
> I kept the same behavior as previous, but it highlights that we're not
> checking for overlaps between private slots and anything else. Existing
> bug? Thanks,

Yes, possibly serious. Who knows what happens if we create a page using
one slot and remove it via another?

Let's go write some Visual Basic.

--
error compiling committee.c: too many arguments to function

2011-02-27 09:54:38

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 02/24/2011 07:35 PM, Alex Williamson wrote:
> On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote:
> > On 02/23/2011 09:28 PM, Alex Williamson wrote:
> > > I had forgotten about <1M mem, so actually the slot configuration was:
> > >
> > > 0: <1M
> > > 1: 1M - 3.5G
> > > 2: 4G+
> > >
> > > I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> > > <1M), and got these kernbench results:
> > >
> > > base (stdev) reorder (stdev) wbtree (stdev)
> > > --------+-----------------+----------------+----------------+
> > > Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> > > User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> > > System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> > > %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> > > context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> > > sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
> > >
> > > So, wbtree is only slightly behind reordering, and the standard
> > > deviation suggests the runs are mostly within the noise of each other.
> > > Thanks,
> >
> > Doesn't this indicate we should use reordering, instead of a new data
> > structure?
>
> The original problem that brought this on was scaling. The re-ordered
> array still has O(N) scaling while the tree should have ~O(logN) (note
> that it currently doesn't because it needs a compaction algorithm added
> after insert and remove). So yes, it's hard to beat the results of a
> test that hammers on the first couple entries of a sorted array, but I
> think the tree has better than current performance and more predictable
> when scaled performance.

Scaling doesn't matter, only actual performance. Even a guest with 512
slots would still hammer only on the first few slots, since these will
contain the bulk of memory.

> If we knew when we were searching for which type of data, it would
> perhaps be nice if we could use a sorted array for guest memory (since
> it's nicely bounded into a small number of large chunks), and a tree for
> mmio (where we expect the scaling to be a factor). Thanks,

We have three types of memory:

- RAM - a few large slots
- mapped mmio (for device assignment) - possibly many small slots
- non-mapped mmio (for emulated devices) - no slots

The first two are handled in exactly the same way - they're just memory
slots. We expect a lot more hits into the RAM slots, since they're much
bigger. But by far the majority of faults will be for the third
category - mapped memory will be hit once per page, then handled by
hardware until Linux memory management does something about the page,
which should hopefully be rare (with device assignment, rare == never,
since those pages are pinned).

Therefore our optimization priorities should be
- complete miss into the slot list
- hit into the RAM slots
- hit into the other slots (trailing far behind)

Of course worst-case performance matters. For example, we might (not
sure) be searching the list with the mmu spinlock held.

I think we still have a bit to go before we can justify the new data
structure.

--
error compiling committee.c: too many arguments to function

2011-02-28 23:04:36

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Sun, 2011-02-27 at 11:54 +0200, Avi Kivity wrote:
> On 02/24/2011 07:35 PM, Alex Williamson wrote:
> > On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote:
> > > On 02/23/2011 09:28 PM, Alex Williamson wrote:
> > > > I had forgotten about <1M mem, so actually the slot configuration was:
> > > >
> > > > 0: <1M
> > > > 1: 1M - 3.5G
> > > > 2: 4G+
> > > >
> > > > I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> > > > <1M), and got these kernbench results:
> > > >
> > > > base (stdev) reorder (stdev) wbtree (stdev)
> > > > --------+-----------------+----------------+----------------+
> > > > Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> > > > User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> > > > System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> > > > %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> > > > context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> > > > sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
> > > >
> > > > So, wbtree is only slightly behind reordering, and the standard
> > > > deviation suggests the runs are mostly within the noise of each other.
> > > > Thanks,
> > >
> > > Doesn't this indicate we should use reordering, instead of a new data
> > > structure?
> >
> > The original problem that brought this on was scaling. The re-ordered
> > array still has O(N) scaling while the tree should have ~O(logN) (note
> > that it currently doesn't because it needs a compaction algorithm added
> > after insert and remove). So yes, it's hard to beat the results of a
> > test that hammers on the first couple entries of a sorted array, but I
> > think the tree has better than current performance and more predictable
> > when scaled performance.
>
> Scaling doesn't matter, only actual performance. Even a guest with 512
> slots would still hammer only on the first few slots, since these will
> contain the bulk of memory.

It seems like we need a good mixed workload benchmark. So far we've
only tested worst case, with a pure emulated I/O test, and best case,
with a pure memory test. Ordering an array only helps the latter, and
only barely beats the tree, so I suspect overall performance would be
better with a tree.

> > If we knew when we were searching for which type of data, it would
> > perhaps be nice if we could use a sorted array for guest memory (since
> > it's nicely bounded into a small number of large chunks), and a tree for
> > mmio (where we expect the scaling to be a factor). Thanks,
>
> We have three types of memory:
>
> - RAM - a few large slots
> - mapped mmio (for device assignment) - possible many small slots
> - non-mapped mmio (for emulated devices) - no slots
>
> The first two are handled in exactly the same way - they're just memory
> slots. We expect a lot more hits into the RAM slots, since they're much
> bigger. But by far the majority of faults will be for the third
> category - mapped memory will be hit once per page, then handled by
> hardware until Linux memory management does something about the page,
> which should hopefully be rare (with device assignment, rare == never,
> since those pages are pinned).
>
> Therefore our optimization priorities should be
> - complete miss into the slot list

The tree is obviously the most time and space efficient for this and the
netperf results show a pretty clear win here. I think it's really only
a question of whether we'd be OK with slow, cache-thrashing searches
here if we could effectively cache the result for next time as you've
suggested. Even then, it seems like steady-state performance would be
prone to unusual slowdowns (e.g. having to flush sptes on every add; what's
the regeneration time to replace all those slow lookups?).

> - hit into the RAM slots

It's really just the indirection of the tree and slightly larger element
size that gives the sorted array an edge here.

> - hit into the other slots (trailing far behind)

Obviously an array sucks at this.

> Of course worst-case performance matters. For example, we might (not
> sure) be searching the list with the mmu spinlock held.
>
> I think we still have a bit to go before we can justify the new data
> structure.

Suggestions for a mixed workload benchmark? What else would you like to
see? Thanks,

Alex

2011-03-01 15:04:09

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 03/01/2011 01:04 AM, Alex Williamson wrote:
> > >
> > > The original problem that brought this on was scaling. The re-ordered
> > > array still has O(N) scaling while the tree should have ~O(logN) (note
> > > that it currently doesn't because it needs a compaction algorithm added
> > > after insert and remove). So yes, it's hard to beat the results of a
> > > test that hammers on the first couple entries of a sorted array, but I
> > > think the tree has better than current performance and more predictable
> > > when scaled performance.
> >
> > Scaling doesn't matter, only actual performance. Even a guest with 512
> > slots would still hammer only on the first few slots, since these will
> > contain the bulk of memory.
>
> It seems like we need a good mixed workload benchmark. So far we've
> only tested worst case, with a pure emulated I/O test, and best case,
> with a pure memory test. Ordering an array only helps the latter, and
> only barely beats the tree, so I suspect overall performance would be
> better with a tree.

But if we cache the missed-all-memslots result in the spte, we eliminate
the worst case, and are left with just the best case.
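
(Purely as an illustrative sketch of that idea, with an invented bit layout
and helper names rather than existing KVM code: the miss could be recorded
by installing a recognizable non-present pattern in the spte, so the next
fault on that gfn is sent straight to emulation without walking the
memslots again.)

/* Hypothetical marker: a reserved, non-present bit pattern. */
#define MMIO_SPTE_MARK  (3ULL << 52)

static void cache_mmio_miss(u64 *sptep)
{
        *sptep = MMIO_SPTE_MARK;        /* remembers "no memslot covers this gfn" */
}

static bool is_cached_mmio_miss(u64 spte)
{
        return (spte & MMIO_SPTE_MARK) == MMIO_SPTE_MARK;
}

The fault path would check is_cached_mmio_miss() on the old spte before
doing any memslot lookup, and the marker would be dropped along with
everything else whenever sptes are zapped.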

> > > If we knew when we were searching for which type of data, it would
> > > perhaps be nice if we could use a sorted array for guest memory (since
> > > it's nicely bounded into a small number of large chunks), and a tree for
> > > mmio (where we expect the scaling to be a factor). Thanks,
> >
> > We have three types of memory:
> >
> > - RAM - a few large slots
> > - mapped mmio (for device assignment) - possible many small slots
> > - non-mapped mmio (for emulated devices) - no slots
> >
> > The first two are handled in exactly the same way - they're just memory
> > slots. We expect a lot more hits into the RAM slots, since they're much
> > bigger. But by far the majority of faults will be for the third
> > category - mapped memory will be hit once per page, then handled by
> > hardware until Linux memory management does something about the page,
> > which should hopefully be rare (with device assignment, rare == never,
> > since those pages are pinned).
> >
> > Therefore our optimization priorities should be
> > - complete miss into the slot list
>
> The tree is obviously the most time and space efficient for this and the
> netperf results show a pretty clear win here. I think it's really only
> a question of whether we'd be ok with slow, cache thrashing, searches
> here if we could effectively cache the result for next time as you've
> suggested.

Yes.

> Even then, it seems like steady state performance would be
> prone to unusual slowdowns (ex. have to flush sptes on every add, what's
> the regeneration time to replace all those slow lookups?).

Those sptes would be flushed very rarely (never in normal operation).
When we add a slot, yes we drop those sptes, but also a few million
other sptes, so it's lost in the noise (regeneration time is just the
slot lookup itself).

> > - hit into the RAM slots
>
> It's really just the indirection of the tree and slightly larger element
> size that gives the sorted array an edge here.
>
> > - hit into the other slots (trailing far behind)
>
> Obviously an array sucks at this.
>
> > Of course worst-case performance matters. For example, we might (not
> > sure) be searching the list with the mmu spinlock held.
> >
> > I think we still have a bit to go before we can justify the new data
> > structure.
>
> Suggestions for a mixed workload benchmark? What else would you like to
> see? Thanks,

The problem here is that all workloads will cache all memslots very
quickly into sptes and all lookups will be misses. There are two cases
where we have lookups that hit the memslots structure: ept=0, and host
swap. Neither are things we want to optimize too heavily.

--
error compiling committee.c: too many arguments to function

2011-03-01 18:20:38

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Tue, 2011-03-01 at 17:03 +0200, Avi Kivity wrote:
> On 03/01/2011 01:04 AM, Alex Williamson wrote:
> > > >
> > > > The original problem that brought this on was scaling. The re-ordered
> > > > array still has O(N) scaling while the tree should have ~O(logN) (note
> > > > that it currently doesn't because it needs a compaction algorithm added
> > > > after insert and remove). So yes, it's hard to beat the results of a
> > > > test that hammers on the first couple entries of a sorted array, but I
> > > > think the tree has better than current performance and more predictable
> > > > when scaled performance.
> > >
> > > Scaling doesn't matter, only actual performance. Even a guest with 512
> > > slots would still hammer only on the first few slots, since these will
> > > contain the bulk of memory.
> >
> > It seems like we need a good mixed workload benchmark. So far we've
> > only tested worst case, with a pure emulated I/O test, and best case,
> > with a pure memory test. Ordering an array only helps the latter, and
> > only barely beats the tree, so I suspect overall performance would be
> > better with a tree.
>
> But if we cache the missed-all-memslots result in the spte, we eliminate
> the worst case, and are left with just the best case.

There are potentially a lot of entries between the best case and the worst case.

> > > > If we knew when we were searching for which type of data, it would
> > > > perhaps be nice if we could use a sorted array for guest memory (since
> > > > it's nicely bounded into a small number of large chunks), and a tree for
> > > > mmio (where we expect the scaling to be a factor). Thanks,
> > >
> > > We have three types of memory:
> > >
> > > - RAM - a few large slots
> > > - mapped mmio (for device assignment) - possible many small slots
> > > - non-mapped mmio (for emulated devices) - no slots
> > >
> > > The first two are handled in exactly the same way - they're just memory
> > > slots. We expect a lot more hits into the RAM slots, since they're much
> > > bigger. But by far the majority of faults will be for the third
> > > category - mapped memory will be hit once per page, then handled by
> > > hardware until Linux memory management does something about the page,
> > > which should hopefully be rare (with device assignment, rare == never,
> > > since those pages are pinned).
> > >
> > > Therefore our optimization priorities should be
> > > - complete miss into the slot list
> >
> > The tree is obviously the most time and space efficient for this and the
> > netperf results show a pretty clear win here. I think it's really only
> > a question of whether we'd be ok with slow, cache thrashing, searches
> > here if we could effectively cache the result for next time as you've
> > suggested.
>
> Yes.
>
> > Even then, it seems like steady state performance would be
> > prone to unusual slowdowns (ex. have to flush sptes on every add, what's
> > the regeneration time to replace all those slow lookups?).
>
> Those sptes would be flushed very rarely (never in normal operation).
> When we add a slot, yes we drop those sptes, but also a few million
> other sptes, so it's lost in the noise (regeneration time is just the
> slot lookup itself).
>
> > > - hit into the RAM slots
> >
> > It's really just the indirection of the tree and slightly larger element
> > size that gives the sorted array an edge here.
> >
> > > - hit into the other slots (trailing far behind)
> >
> > Obviously an array sucks at this.
> >
> > > Of course worst-case performance matters. For example, we might (not
> > > sure) be searching the list with the mmu spinlock held.
> > >
> > > I think we still have a bit to go before we can justify the new data
> > > structure.
> >
> > Suggestions for a mixed workload benchmark? What else would you like to
> > see? Thanks,
>
> The problem here is that all workloads will cache all memslots very
> quickly into sptes and all lookups will be misses. There are two cases
> where we have lookups that hit the memslots structure: ept=0, and host
> swap. Neither are things we want to optimize too heavily.

Which seems to suggest that:

A. making those misses fast = win
B. making those misses fast + caching misses = win++
C. we don't care if the sorted array is subtly faster for ept=0

Sound right? So is the question whether cached misses alone gets us 99%
of the improvement since hits are already getting cached in sptes for
cases we care about? Xiao, it sounded like you might have already
started the caching misses approach. I'd be interested to incorporate
and test if you have anything working. Thanks,

Alex

2011-03-01 19:47:58

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On Sun, Feb 27, 2011 at 11:54:29AM +0200, Avi Kivity wrote:
> On 02/24/2011 07:35 PM, Alex Williamson wrote:
> >On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote:
> >> On 02/23/2011 09:28 PM, Alex Williamson wrote:
> >> > I had forgotten about <1M mem, so actually the slot configuration was:
> >> >
> >> > 0: <1M
> >> > 1: 1M - 3.5G
> >> > 2: 4G+
> >> >
> >> > I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> >> > <1M), and got these kernbench results:
> >> >
> >> > base (stdev) reorder (stdev) wbtree (stdev)
> >> > --------+-----------------+----------------+----------------+
> >> > Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> >> > User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> >> > System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> >> > %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> >> > context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> >> > sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
> >> >
> >> > So, wbtree is only slightly behind reordering, and the standard
> >> > deviation suggests the runs are mostly within the noise of each other.
> >> > Thanks,
> >>
> >> Doesn't this indicate we should use reordering, instead of a new data
> >> structure?
> >
> >The original problem that brought this on was scaling. The re-ordered
> >array still has O(N) scaling while the tree should have ~O(logN) (note
> >that it currently doesn't because it needs a compaction algorithm added
> >after insert and remove). So yes, it's hard to beat the results of a
> >test that hammers on the first couple entries of a sorted array, but I
> >think the tree has better than current performance and more predictable
> >when scaled performance.
>
> Scaling doesn't matter, only actual performance. Even a guest with
> 512 slots would still hammer only on the first few slots, since
> these will contain the bulk of memory.
>
> >If we knew when we were searching for which type of data, it would
> >perhaps be nice if we could use a sorted array for guest memory (since
> >it's nicely bounded into a small number of large chunks), and a tree for
> >mmio (where we expect the scaling to be a factor). Thanks,
>
> We have three types of memory:
>
> - RAM - a few large slots
> - mapped mmio (for device assignment) - possible many small slots
> - non-mapped mmio (for emulated devices) - no slots
>
> The first two are handled in exactly the same way - they're just
> memory slots. We expect a lot more hits into the RAM slots, since
> they're much bigger. But by far the majority of faults will be for
> the third category - mapped memory will be hit once per page, then
> handled by hardware until Linux memory management does something
> about the page, which should hopefully be rare (with device
> assignment, rare == never, since those pages are pinned).
>
> Therefore our optimization priorities should be
> - complete miss into the slot list
> - hit into the RAM slots
> - hit into the other slots (trailing far behind)

Whatever ordering is considered optimal in one workload can be suboptimal
in another. The binary search reduces the number of slots inspected in the
average case, and using slot size as the weight biases the tree towards the
slots most likely to be hit.
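
The lookup being described is short; here is a sketch of a gfn search over
the tree, assuming each memory slot embeds a struct wb_node named wb_node
(an assumption about how the RFC wires the slots into the tree):

static struct kvm_memory_slot *wb_find_memslot(struct wb_root *root, gfn_t gfn)
{
        struct wb_node *node = root->wb_node;

        while (node) {
                struct kvm_memory_slot *slot =
                        wb_entry(node, struct kvm_memory_slot, wb_node);

                if (gfn < slot->base_gfn)
                        node = node->wb_left;
                else if (gfn >= slot->base_gfn + slot->npages)
                        node = node->wb_right;
                else
                        return slot;    /* gfn falls inside this slot */
        }
        return NULL;                    /* miss: mmio or bad address */
}

With the heaviest slots kept at the top of each subtree, the expected number
of nodes visited tracks hit probability rather than position in an array.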

> Of course worst-case performance matters. For example, we might
> (not sure) be searching the list with the mmu spinlock held.
>
> I think we still have a bit to go before we can justify the new data
> structure.

Intensive IDE disk IO on a guest with lots of assigned network devices, a 3%
improvement on netperf with rtl8139, a 1% improvement on kernbench?

I fail to see the justification for not using it.

2011-03-02 13:31:41

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 03/01/2011 08:20 PM, Alex Williamson wrote:
> > > It seems like we need a good mixed workload benchmark. So far we've
> > > only tested worst case, with a pure emulated I/O test, and best case,
> > > with a pure memory test. Ordering an array only helps the latter, and
> > > only barely beats the tree, so I suspect overall performance would be
> > > better with a tree.
> >
> > But if we cache the missed-all-memslots result in the spte, we eliminate
> > the worst case, and are left with just the best case.
>
> There's potentially a lot of entries between best case and worst case.

The mid case is where we have a lot of small slots which are
continuously flushed. That would be (ept=0 && new mappings continuously
established) || (lots of small mappings && lots of host paging
activity). I don't know of any guests that continuously reestablish BAR
mappings; and host paging activity doesn't apply to device assignment.
What are we left with?

> >
> > The problem here is that all workloads will cache all memslots very
> > quickly into sptes and all lookups will be misses. There are two cases
> > where we have lookups that hit the memslots structure: ept=0, and host
> > swap. Neither are things we want to optimize too heavily.
>
> Which seems to suggest that:
>
> A. making those misses fast = win
> B. making those misses fast + caching misses = win++
> C. we don't care if the sorted array is subtly faster for ept=0
>
> Sound right? So is the question whether cached misses alone gets us 99%
> of the improvement since hits are already getting cached in sptes for
> cases we care about?

Yes, that's my feeling. Caching those misses is a lot more important
than speeding them up, since the cache will stay valid for long periods,
and since the hit rate will be very high.

Cache+anything=O(1)
no-cache+tree=O(log(n))
no-cache+array=O(n)

--
error compiling committee.c: too many arguments to function

2011-03-02 13:35:00

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree

On 03/01/2011 09:47 PM, Marcelo Tosatti wrote:
> On Sun, Feb 27, 2011 at 11:54:29AM +0200, Avi Kivity wrote:
> > On 02/24/2011 07:35 PM, Alex Williamson wrote:
> > >On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote:
> > >> On 02/23/2011 09:28 PM, Alex Williamson wrote:
> > >> > I had forgotten about <1M mem, so actually the slot configuration was:
> > >> >
> > >> > 0: <1M
> > >> > 1: 1M - 3.5G
> > >> > 2: 4G+
> > >> >
> > >> > I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2:
> > >> > <1M), and got these kernbench results:
> > >> >
> > >> > base (stdev) reorder (stdev) wbtree (stdev)
> > >> > --------+-----------------+----------------+----------------+
> > >> > Elapsed | 42.809 (0.19) | 42.160 (0.22) | 42.305 (0.23) |
> > >> > User | 115.709 (0.22) | 114.358 (0.40) | 114.720 (0.31) |
> > >> > System | 41.605 (0.14) | 40.741 (0.22) | 40.924 (0.20) |
> > >> > %cpu | 366.9 (1.45) | 367.4 (1.17) | 367.6 (1.51) |
> > >> > context | 7272.3 (68.6) | 7248.1 (89.7) | 7249.5 (97.8) |
> > >> > sleeps | 14826.2 (110.6) | 14780.7 (86.9) | 14798.5 (63.0) |
> > >> >
> > >> > So, wbtree is only slightly behind reordering, and the standard
> > >> > deviation suggests the runs are mostly within the noise of each other.
> > >> > Thanks,
> > >>
> > >> Doesn't this indicate we should use reordering, instead of a new data
> > >> structure?
> > >
> > >The original problem that brought this on was scaling. The re-ordered
> > >array still has O(N) scaling while the tree should have ~O(logN) (note
> > >that it currently doesn't because it needs a compaction algorithm added
> > >after insert and remove). So yes, it's hard to beat the results of a
> > >test that hammers on the first couple entries of a sorted array, but I
> > >think the tree has better than current performance and more predictable
> > >when scaled performance.
> >
> > Scaling doesn't matter, only actual performance. Even a guest with
> > 512 slots would still hammer only on the first few slots, since
> > these will contain the bulk of memory.
> >
> > >If we knew when we were searching for which type of data, it would
> > >perhaps be nice if we could use a sorted array for guest memory (since
> > >it's nicely bounded into a small number of large chunks), and a tree for
> > >mmio (where we expect the scaling to be a factor). Thanks,
> >
> > We have three types of memory:
> >
> > - RAM - a few large slots
> > - mapped mmio (for device assignment) - possible many small slots
> > - non-mapped mmio (for emulated devices) - no slots
> >
> > The first two are handled in exactly the same way - they're just
> > memory slots. We expect a lot more hits into the RAM slots, since
> > they're much bigger. But by far the majority of faults will be for
> > the third category - mapped memory will be hit once per page, then
> > handled by hardware until Linux memory management does something
> > about the page, which should hopefully be rare (with device
> > assignment, rare == never, since those pages are pinned).
> >
> > Therefore our optimization priorities should be
> > - complete miss into the slot list
> > - hit into the RAM slots
> > - hit into the other slots (trailing far behind)
>
> Whatever ordering considered optimal in one workload can be suboptimal
> in another. The binary search reduces the number of slots inspected
> in the average case. Using slot size as weight favours probability.

It's really difficult to come up with a workload that causes many hits
to small slots.

> > Of course worst-case performance matters. For example, we might
> > (not sure) be searching the list with the mmu spinlock held.
> >
> > I think we still have a bit to go before we can justify the new data
> > structure.
>
> Intensive IDE disk IO on a guest with lots of assigned network devices, a 3%
> improvement on netperf with rtl8139, a 1% improvement on kernbench?
>
> I fail to see the justification for not using it.

By itself it's great, but the miss cache will cause the code to be
called very rarely. So I prefer the sorted array which is simpler (and
faster for the few-large-slots case).
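
For comparison, the sorted-array approach preferred here amounts to a linear
scan over slots kept ordered largest-first (a sketch; the helper name and the
exact layout of struct kvm_memslots are assumptions):

static struct kvm_memory_slot *search_memslots(struct kvm_memslots *slots,
                                               gfn_t gfn)
{
        int i;

        /* Largest slots first, so the common RAM hits terminate early. */
        for (i = 0; i < slots->nmemslots; i++) {
                struct kvm_memory_slot *slot = &slots->memslots[i];

                if (gfn >= slot->base_gfn &&
                    gfn < slot->base_gfn + slot->npages)
                        return slot;
        }
        return NULL;
}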

--
error compiling committee.c: too many arguments to function