2013-03-20 08:30:42

by Xiao Guangrong

Subject: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

Changelog:
V2:
- do not reset n_requested_mmu_pages and n_max_mmu_pages
- batch free root shadow pages to reduce vcpu notification and mmu-lock
contention
- remove the first patch that introduced kvm->arch.mmu_cache, since in this
version we only 'memset zero' the hashtable rather than all mmu-cache
members
- remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
even worse when the guest uses more memory or vcpus. It does not scale.

* Idea
Since all shadow pages will be zapped, we can directly reset the mmu-cache
and rmaps so that vcpus will fault on the new mmu-cache; after that, we can
directly free the memory used by the old mmu-cache.

The root shadow pages are a little special since they are currently in use
by vcpus and we can not directly free them. So, we zap the root shadow pages
and re-add them to the new mmu-cache.
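
Stripped of the KVM details, the core trick is just to detach the whole old
cache under the lock and free it after the lock is dropped. A minimal
userspace sketch of that pattern (illustrative only - the names and list
handling here are made up, not KVM code):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Toy "mmu cache": a singly linked list protected by a lock. */
struct node {
	struct node *next;
};

static struct node *cache_head;
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

static void zap_all(void)
{
	struct node *old, *next;

	pthread_mutex_lock(&cache_lock);
	old = cache_head;		/* detach the old cache ... */
	cache_head = NULL;		/* ... and reset it to empty */
	pthread_mutex_unlock(&cache_lock);

	while (old) {			/* free old entries outside the lock */
		next = old->next;
		free(old);
		old = next;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		struct node *n = malloc(sizeof(*n));

		n->next = cache_head;
		cache_head = n;
	}

	zap_all();
	printf("cache reset; old entries freed outside the lock\n");
	return 0;
}

In kvm_mmu_zap_all (patch 6/7) the same idea is applied to active_mmu_pages
and pte_list_descs, except that the root shadow pages are filtered out first
because vcpus are still using them.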

* TODO
(1): free root shadow pages by using generation-number
(2): drop unnecessary @npages from kvm_arch_create_memslot

* Performance
The testcase, which can be found at:
http://www.gossamer-threads.com/lists/engine?do=post_attachment;postatt_id=54896;list=linux
measures the time to delete / add a memslot. During the test all vcpus are
waiting, which means there is no mmu-lock contention. I believe the result
would look even better if other vcpus and mmu notifications needed to take
the mmu-lock.

Guest VCPU:6, Mem:2048M

before: Run 10 times, Avg time:46078825 ns.

after: Run 10 times, Avg time:21558774 ns. (+ 113%)

Xiao Guangrong (7):
KVM: MMU: introduce mmu_cache->pte_list_descs
KVM: x86: introduce memslot_set_lpage_disallowed
KVM: x86: introduce kvm_clear_all_gfn_page_info
KVM: MMU: delete shadow page from hash list in
kvm_mmu_prepare_zap_page
KVM: MMU: split kvm_mmu_prepare_zap_page
KVM: MMU: fast zap all shadow pages
KVM: MMU: drop unnecessary kvm_reload_remote_mmus after
kvm_mmu_zap_all

arch/x86/include/asm/kvm_host.h | 7 ++-
arch/x86/kvm/mmu.c | 105 ++++++++++++++++++++++++++++++++++-----
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/x86.c | 87 +++++++++++++++++++++++++-------
include/linux/kvm_host.h | 1 +
5 files changed, 166 insertions(+), 35 deletions(-)

--
1.7.7.6


2013-03-20 08:30:52

by Xiao Guangrong

Subject: [PATCH v2 2/7] KVM: x86: introduce memslot_set_lpage_disallowed

It is used to mark large pages as disallowed at the specified level of a
memslot; it will be used in a later patch.
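
As a concrete illustration of the alignment checks (a standalone sketch with
made-up example numbers, assuming 512 4K pages per 2M large page as on x86):

#include <stdio.h>

#define PAGES_PER_HPAGE 512UL	/* KVM_PAGES_PER_HPAGE(2): 2M / 4K on x86 */

int main(void)
{
	unsigned long base_gfn = 1000, npages = 4096;	/* example memslot */

	/* a slot that does not start on a 2M boundary can not map its
	 * first 2M-sized range with a large page */
	if (base_gfn & (PAGES_PER_HPAGE - 1))
		printf("first 2M page disallowed (base_gfn %lu)\n", base_gfn);

	/* likewise for a slot that does not end on a 2M boundary */
	if ((base_gfn + npages) & (PAGES_PER_HPAGE - 1))
		printf("last 2M page disallowed (end gfn %lu)\n",
		       base_gfn + npages);

	return 0;
}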

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++-----------------
1 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6622ac0..e0a7f67 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6868,12 +6868,45 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
}

+static void
+memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
+ unsigned long npages, int lpage_size,
+ int lpages)
+{
+ struct kvm_lpage_info *lpage_info;
+ unsigned long ugfn;
+ int level = lpage_size + 1;
+
+ WARN_ON(!lpage_size);
+
+ lpage_info = slot->arch.lpage_info[lpage_size - 1];
+
+ if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[0].write_count = 1;
+
+ if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[lpages - 1].write_count = 1;
+
+ ugfn = slot->userspace_addr >> PAGE_SHIFT;
+ /*
+ * If the gfn and userspace address are not aligned wrt each
+ * other, or if explicitly asked to, disable large page
+ * support for this slot
+ */
+ if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
+ !kvm_largepages_enabled()) {
+ unsigned long j;
+
+ for (j = 0; j < lpages; ++j)
+ lpage_info[j].write_count = 1;
+ }
+}
+
int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
{
int i;

for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
- unsigned long ugfn;
int lpages;
int level = i + 1;

@@ -6892,23 +6925,7 @@ int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
if (!slot->arch.lpage_info[i - 1])
goto out_free;

- if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][0].write_count = 1;
- if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][lpages - 1].write_count = 1;
- ugfn = slot->userspace_addr >> PAGE_SHIFT;
- /*
- * If the gfn and userspace address are not aligned wrt each
- * other, or if explicitly asked to, disable large page
- * support for this slot
- */
- if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
- !kvm_largepages_enabled()) {
- unsigned long j;
-
- for (j = 0; j < lpages; ++j)
- slot->arch.lpage_info[i - 1][j].write_count = 1;
- }
+ memslot_set_lpage_disallowed(slot, npages, i, lpages);
}

return 0;
--
1.7.7.6

2013-03-20 08:31:05

by Xiao Guangrong

Subject: [PATCH v2 7/7] KVM: MMU: drop unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

It is now the responsibility of kvm_mmu_zap_all to keep the mmu and the
TLBs consistent.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87d27a8..5bae962 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7047,7 +7047,6 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
kvm_mmu_zap_all(kvm);
- kvm_reload_remote_mmus(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
1.7.7.6

2013-03-20 08:31:30

by Xiao Guangrong

Subject: [PATCH v2 5/7] KVM: MMU: split kvm_mmu_prepare_zap_page

After the split, the new function __kvm_mmu_prepare_zap_page only zaps the
shadow page, without requesting KVM_REQ_MMU_RELOAD. Later, we will use it to
batch-free root shadow pages.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 20 +++++++++++++++-----
1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5578c91..e880cdd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2066,8 +2066,9 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
}

-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
+static int
+__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
{
int ret;

@@ -2088,15 +2089,24 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
ret++;
list_move(&sp->link, invalid_list);
kvm_mod_used_mmu_pages(kvm, -1);
- } else {
+ } else
list_move(&sp->link, &kvm->arch.active_mmu_pages);
- kvm_reload_remote_mmus(kvm);
- }

sp->role.invalid = 1;
return ret;
}

+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int ret = __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
+ if (sp->root_count)
+ kvm_reload_remote_mmus(kvm);
+
+ return ret;
+}
+
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
{
--
1.7.7.6

2013-03-20 08:31:00

by Xiao Guangrong

Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info

This function resets the rmaps and page info of all guest pages; it will be
used in a later patch.
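
For reference, the number of lpage_info entries per level is just the slot
size measured in large-page-sized chunks. A standalone sketch with example
numbers (gfn_to_index re-derived here for illustration, assuming 4K base
pages and x86 paging levels):

#include <stdio.h>

/* 9 bits of gfn per paging level on x86 with 4K base pages. */
static unsigned long gfn_to_index(unsigned long gfn, unsigned long base_gfn,
				  int level)
{
	unsigned int shift = (level - 1) * 9;

	return (gfn >> shift) - (base_gfn >> shift);
}

int main(void)
{
	unsigned long base_gfn = 1000, npages = 4096;	/* example memslot */
	int level;

	for (level = 2; level <= 3; level++)	/* 2M and 1G levels */
		printf("level %d: %lu lpage_info entries\n", level,
		       gfn_to_index(base_gfn + npages - 1, base_gfn, level) + 1);

	return 0;
}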

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 31 +++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 1 +
2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e0a7f67..87d27a8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6902,6 +6902,37 @@ memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
}
}

+static void clear_memslot_page_info(struct kvm_memory_slot *slot)
+{
+ int i;
+
+ for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
+ int lpages;
+ int level = i + 1;
+
+ lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+
+ memset(slot->arch.rmap[i], 0,
+ lpages * sizeof(*slot->arch.rmap[i]));
+
+ if (i) {
+ memset(slot->arch.lpage_info[i - 1], 0,
+ sizeof(*slot->arch.lpage_info[i - 1]));
+ memslot_set_lpage_disallowed(slot, slot->npages, i,
+ lpages);
+ }
+ }
+}
+
+void kvm_clear_all_gfn_page_info(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+
+ kvm_for_each_memslot(slot, kvm->memslots)
+ clear_memslot_page_info(slot);
+}
+
int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
{
int i;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0f4941a..1de9b79 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -491,6 +491,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
void kvm_arch_free_memslot(struct kvm_memory_slot *free,
struct kvm_memory_slot *dont);
+void kvm_clear_all_gfn_page_info(struct kvm *kvm);
int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
--
1.7.7.6

2013-03-20 08:31:53

by Xiao Guangrong

Subject: [PATCH v2 4/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page

Move the deletion of a shadow page from the hash list from
kvm_mmu_commit_zap_page to kvm_mmu_prepare_zap_page, so that we can free the
shadow page outside of mmu-lock.

Also, delete invalid shadow pages from the hash list since such pages can not
be reused anymore. This makes resetting the mmu-cache easier - we do not need
to care about stale hash entries after the reset.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc37512..5578c91 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1472,7 +1472,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
{
ASSERT(is_empty_shadow_page(sp->spt));
- hlist_del(&sp->hash_link);
+
list_del(&sp->link);
free_page((unsigned long)sp->spt);
if (!sp->role.direct)
@@ -1660,7 +1660,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,

#define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
for_each_gfn_sp(_kvm, _sp, _gfn) \
- if ((_sp)->role.direct || (_sp)->role.invalid) {} else
+ if ((_sp)->role.direct || \
+ ((_sp)->role.invalid && WARN_ON(1))) {} else

/* @sp->gfn should be write-protected at the call site */
static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -2079,6 +2080,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
unaccount_shadowed(kvm, sp->gfn);
if (sp->unsync)
kvm_unlink_unsync_page(kvm, sp);
+
+ hlist_del_init(&sp->hash_link);
+
if (!sp->root_count) {
/* Count self */
ret++;
--
1.7.7.6

2013-03-20 08:30:50

by Xiao Guangrong

Subject: [PATCH v2 1/7] KVM: MMU: introduce mmu_cache->pte_list_descs

This list links all the pte_list_desc structures used by the mmu cache, so
that we can easily free the memory used by gfns' rmaps and parent spte lists.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 ++++---
arch/x86/kvm/mmu.c | 14 +++++++++++++-
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/x86.c | 2 +-
4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3f205c6c..c8899c6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -526,15 +526,16 @@ struct kvm_apic_map {
};

struct kvm_arch {
+ /* MMU cache members. */
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ /* Hash table of struct kvm_mmu_page. */
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
- /*
- * Hash table of struct kvm_mmu_page.
- */
struct list_head active_mmu_pages;
+ struct list_head pte_list_descs;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c1a9b7b..dc37512 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -156,6 +156,7 @@ module_param(dbg, bool, 0644);
struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
struct pte_list_desc *more;
+ struct list_head list;
};

struct kvm_shadow_walk_iterator {
@@ -704,11 +705,16 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)

static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
{
- return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
+ struct pte_list_desc *desc;
+
+ desc = mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
+ list_add(&desc->list, &vcpu->kvm->arch.pte_list_descs);
+ return desc;
}

static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
{
+ list_del(&pte_list_desc->list);
kmem_cache_free(pte_list_desc_cache, pte_list_desc);
}

@@ -4337,6 +4343,12 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
}
EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);

+void kvm_mmu_cache_init(struct kvm *kvm)
+{
+ INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ INIT_LIST_HEAD(&kvm->arch.pte_list_descs);
+}
+
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
ASSERT(vcpu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3b1ad00..2923fd2 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -50,6 +50,7 @@
#define PFERR_RSVD_MASK (1U << 3)
#define PFERR_FETCH_MASK (1U << 4)

+void kvm_mmu_cache_init(struct kvm *kvm);
int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3c4787..6622ac0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6778,7 +6778,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (type)
return -EINVAL;

- INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm_mmu_cache_init(kvm);
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);

/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
--
1.7.7.6

2013-03-20 08:33:24

by Xiao Guangrong

Subject: [PATCH v2 6/7] KVM: MMU: fast zap all shadow pages

The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
even worse when the guest uses more memory or vcpus. It does not scale.

Since all shadow pages will be zapped, we can directly reset the mmu-cache
and rmaps so that vcpus will fault on the new mmu-cache; after that, we can
directly free the memory used by the old mmu-cache.

The root shadow pages are a little special since they are currently in use
by vcpus and we can not directly free them. So, we zap the root shadow pages
and re-add them to the new mmu-cache.

After this patch, kvm_mmu_zap_all is about 113% faster than before (the
average time drops from 46078825 ns to 21558774 ns in the cover-letter
benchmark).

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e880cdd..72e5bdb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4196,17 +4196,72 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)

void kvm_mmu_zap_all(struct kvm *kvm)
{
- struct kvm_mmu_page *sp, *node;
+ LIST_HEAD(root_mmu_pages);
LIST_HEAD(invalid_list);
+ struct list_head pte_list_descs;
+ struct kvm_mmu_page *sp, *node;
+ struct pte_list_desc *desc, *ndesc;
+ int root_sp = 0;

spin_lock(&kvm->mmu_lock);
+
restart:
+ /*
+ * The root shadow pages are being used on vcpus that can not
+ * directly removed, we filter them out and re-add them to the
+ * new mmu cache.
+ */
list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
- if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
- goto restart;
+ if (sp->root_count) {
+ int ret;
+
+ root_sp++;
+ ret = __kvm_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list);
+ list_move(&sp->link, &root_mmu_pages);
+ if (ret)
+ goto restart;
+ }
+
+ list_splice(&kvm->arch.active_mmu_pages, &invalid_list);
+ list_replace(&kvm->arch.pte_list_descs, &pte_list_descs);
+
+ /*
+ * Reset the mmu cache so that later vcpu will fault on the new
+ * mmu cache.
+ */
+ kvm->arch.indirect_shadow_pages = 0;
+ /* Root shadow pages will be added to the new mmu cache. */
+ kvm_mod_used_mmu_pages(kvm, -(kvm->arch.n_used_mmu_pages - root_sp));
+ memset(kvm->arch.mmu_page_hash, 0, sizeof(kvm->arch.mmu_page_hash));
+ kvm_mmu_cache_init(kvm);
+
+ /*
+ * Now, the mmu cache has been reset, we can re-add the root shadow
+ * pages into the cache.
+ */
+ list_replace(&root_mmu_pages, &kvm->arch.active_mmu_pages);
+
+ /* Reset gfn's rmap and lpage info. */
+ kvm_clear_all_gfn_page_info(kvm);
+
+ /*
+ * Notify all vcpus to reload and flush TLB if root shadow pages
+ * were zapped (KVM_REQ_MMU_RELOAD forces TLB to be flushed).
+ *
+ * The TLB need not be flushed if no root shadow page was found
+ * since no vcpu uses shadow page.
+ */
+ if (root_sp)
+ kvm_reload_remote_mmus(kvm);

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
spin_unlock(&kvm->mmu_lock);
+
+ list_for_each_entry_safe(sp, node, &invalid_list, link)
+ kvm_mmu_free_page(sp);
+
+ list_for_each_entry_safe(desc, ndesc, &pte_list_descs, list)
+ mmu_free_pte_list_desc(desc);
}

void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
--
1.7.7.6

2013-03-21 13:14:19

by Gleb Natapov

Subject: Re: [PATCH v2 4/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page

On Wed, Mar 20, 2013 at 04:30:24PM +0800, Xiao Guangrong wrote:
> Move deletion shadow page from the hash list from kvm_mmu_commit_zap_page to
> kvm_mmu_prepare_zap_page, we that we can free the shadow page out of mmu-lock.
>
> Also, delete the invalid shadow page from the hash list since this page can
> not be reused anymore. This makes reset mmu-cache more easier - we do not need
> to care all hash entries after reset mmu-cache
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index dc37512..5578c91 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1472,7 +1472,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> {
> ASSERT(is_empty_shadow_page(sp->spt));
> - hlist_del(&sp->hash_link);
> +
> list_del(&sp->link);
> free_page((unsigned long)sp->spt);
> if (!sp->role.direct)
> @@ -1660,7 +1660,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>
> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
> for_each_gfn_sp(_kvm, _sp, _gfn) \
> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> + if ((_sp)->role.direct || \
> + ((_sp)->role.invalid && WARN_ON(1))) {} else
>
> /* @sp->gfn should be write-protected at the call site */
> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -2079,6 +2080,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> unaccount_shadowed(kvm, sp->gfn);
> if (sp->unsync)
> kvm_unlink_unsync_page(kvm, sp);
> +
> + hlist_del_init(&sp->hash_link);
> +
Now we delete roots from the hash, but leave them on the active_mmu_pages
list. Is this OK?

> if (!sp->root_count) {
> /* Count self */
> ret++;
> --
> 1.7.7.6

--
Gleb.

2013-03-21 22:27:32

by Marcelo Tosatti

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Wed, Mar 20, 2013 at 04:30:20PM +0800, Xiao Guangrong wrote:
> Changlog:
> V2:
> - do not reset n_requested_mmu_pages and n_max_mmu_pages
> - batch free root shadow pages to reduce vcpu notification and mmu-lock
> contention
> - remove the first patch that introduce kvm->arch.mmu_cache since we only
> 'memset zero' on hashtable rather than all mmu cache members in this
> version
> - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all
>
> * Issue
> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> walk and zap all shadow pages one by one, also it need to zap all guest
> page's rmap and all shadow page's parent spte list. Particularly, things
> become worse if guest uses more memory or vcpus. It is not good for
> scalability.

Xiao,

The bulk removal of shadow pages from the mmu cache is unnerving - it
creates two codepaths to delete a data structure: the usual, single-entry
one and the bulk one.

There are two main usecases for kvm_mmu_zap_all(): to invalidate the
current mmu tree (from kvm_set_memory) and to tear down all pages
(VM shutdown).

The first usecase can use your idea of an invalid generation number
on shadow pages. That is, increment the VM generation number, nuke the root
pages and that's it.

The modifications should be contained to kvm_mmu_get_page() mostly,
correct? (would also have to keep counters to increase SLAB freeing
ratio, relative to number of outdated shadow pages).

And then have codepaths that nuke shadow pages break from the spinlock,
such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
That would also solve the current issues without using more memory
for pte_list_desc and without the delicate "Reset MMU cache" step.

What you think?

> * Idea
> Since all shadow page will be zapped, we can directly zap the mmu-cache
> and rmap so that vcpu will fault on the new mmu-cache, after that, we can
> directly free the memory used by old mmu-cache.
>
> The root shadow page is little especial since they are currently used by
> vcpus, we can not directly free them. So, we zap the root shadow pages and
> re-add them into the new mmu-cache.
>
> * TODO
> (1): free root shadow pages by using generation-number
> (2): drop unnecessary @npages from kvm_arch_create_memslot
>
> * Performance
> The testcase can be found at:
> http://www.gossamer-threads.com/lists/engine?do=post_attachment;postatt_id=54896;list=linux
> is used to measure the time of delete / add memslot. At that time, all vcpus
> are waiting, that means, no mmu-lock contention. I believe the result be more
> beautiful if other vcpus and mmu notification need to hold the mmu-lock.
>
> Guest VCPU:6, Mem:2048M
>
> before: Run 10 times, Avg time:46078825 ns.
>
> after: Run 10 times, Avg time:21558774 ns. (+ 113%)
>
> Xiao Guangrong (7):
> KVM: MMU: introduce mmu_cache->pte_list_descs
> KVM: x86: introduce memslot_set_lpage_disallowed
> KVM: x86: introduce kvm_clear_all_gfn_page_info
> KVM: MMU: delete shadow page from hash list in
> kvm_mmu_prepare_zap_page
> KVM: MMU: split kvm_mmu_prepare_zap_page
> KVM: MMU: fast zap all shadow pages
> KVM: MMU: drop unnecessary kvm_reload_remote_mmus after
> kvm_mmu_zap_all
>
> arch/x86/include/asm/kvm_host.h | 7 ++-
> arch/x86/kvm/mmu.c | 105 ++++++++++++++++++++++++++++++++++-----
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/x86.c | 87 +++++++++++++++++++++++++-------
> include/linux/kvm_host.h | 1 +
> 5 files changed, 166 insertions(+), 35 deletions(-)
>
> --
> 1.7.7.6

2013-03-22 02:11:27

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 06:21 AM, Marcelo Tosatti wrote:
> On Wed, Mar 20, 2013 at 04:30:20PM +0800, Xiao Guangrong wrote:
>> Changlog:
>> V2:
>> - do not reset n_requested_mmu_pages and n_max_mmu_pages
>> - batch free root shadow pages to reduce vcpu notification and mmu-lock
>> contention
>> - remove the first patch that introduce kvm->arch.mmu_cache since we only
>> 'memset zero' on hashtable rather than all mmu cache members in this
>> version
>> - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all
>>
>> * Issue
>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>> walk and zap all shadow pages one by one, also it need to zap all guest
>> page's rmap and all shadow page's parent spte list. Particularly, things
>> become worse if guest uses more memory or vcpus. It is not good for
>> scalability.
>
> Xiao,
>
> The bulk removal of shadow pages from mmu cache is nerving - it creates
> two codepaths to delete a data structure: the usual, single entry one
> and the bulk one.
>
> There are two main usecases for kvm_mmu_zap_all(): to invalidate the
> current mmu tree (from kvm_set_memory) and to tear down all pages
> (VM shutdown).
>
> The first usecase can use your idea of an invalid generation number
> on shadow pages. That is, increment the VM generation number, nuke the root
> pages and thats it.
>
> The modifications should be contained to kvm_mmu_get_page() mostly,
> correct? (would also have to keep counters to increase SLAB freeing
> ratio, relative to number of outdated shadow pages).

Yes.

>
> And then have codepaths that nuke shadow pages break from the spinlock,

I think this is not needed any more. We can let the mmu notifier use the
generation number to invalidate all shadow pages; then we only need to free
them after all vcpus are down and the mmu notifier is unregistered - at that
point there is no lock contention and we can directly free them.

> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).

BTW, to be honest, I do not think spin_needbreak is a good way - it does
not fix the hot-lock contention, it just burns more cpu time to avoid
possible soft lock-ups.

Especially, zap-all-shadow-pages makes other vcpus fault and contend for
mmu-lock; zap-all-shadow-pages then releases mmu-lock and waits while the
other vcpus build page tables again. zap-all-shadow-pages needs a long time
to finish, and in the worst case it never completes under intensive vcpu and
memory usage.

I still think the right way to fix this kind of thing is to optimize the
mmu-lock.

> That would also solve the current issues without using more memory
> for pte_list_desc and without the delicate "Reset MMU cache" step.
>
> What you think?

I agree with your point, Marcelo! I will redesign it. Thank you!

2013-03-22 02:16:14

by Xiao Guangrong

Subject: Re: [PATCH v2 4/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page

On 03/21/2013 09:14 PM, Gleb Natapov wrote:
> On Wed, Mar 20, 2013 at 04:30:24PM +0800, Xiao Guangrong wrote:
>> Move deletion shadow page from the hash list from kvm_mmu_commit_zap_page to
>> kvm_mmu_prepare_zap_page, we that we can free the shadow page out of mmu-lock.
>>
>> Also, delete the invalid shadow page from the hash list since this page can
>> not be reused anymore. This makes reset mmu-cache more easier - we do not need
>> to care all hash entries after reset mmu-cache
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 8 ++++++--
>> 1 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index dc37512..5578c91 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1472,7 +1472,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>> {
>> ASSERT(is_empty_shadow_page(sp->spt));
>> - hlist_del(&sp->hash_link);
>> +
>> list_del(&sp->link);
>> free_page((unsigned long)sp->spt);
>> if (!sp->role.direct)
>> @@ -1660,7 +1660,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>
>> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
>> for_each_gfn_sp(_kvm, _sp, _gfn) \
>> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
>> + if ((_sp)->role.direct || \
>> + ((_sp)->role.invalid && WARN_ON(1))) {} else
>>
>> /* @sp->gfn should be write-protected at the call site */
>> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> @@ -2079,6 +2080,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> unaccount_shadowed(kvm, sp->gfn);
>> if (sp->unsync)
>> kvm_unlink_unsync_page(kvm, sp);
>> +
>> + hlist_del_init(&sp->hash_link);
>> +
> Now we delete roots from hash, but leave it on active_mmu_pages list. Is
> this OK?

It is okay, I think. The hash list is only used to find a gfn's shadow pages.
An invalid shadow page does not contain any useful guest content and will be
freed soon after the vcpus reload.

IIRC, we did the same thing back when we used RCU to free shadow pages.

2013-03-22 10:04:12

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 10:11 AM, Xiao Guangrong wrote:

>> The modifications should be contained to kvm_mmu_get_page() mostly,
>> correct? (would also have to keep counters to increase SLAB freeing
>> ratio, relative to number of outdated shadow pages).
>
> Yes.
>
>>
>> And then have codepaths that nuke shadow pages break from the spinlock,
>
> I think this is not needed any more. We can let mmu_notify use the generation
> number to invalid all shadow pages, then we only need to free them after
> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> we can directly free them.

Sorry, this is wrong: after ->release() is called, the memory will be freed,
so zap-all-sp can not be delayed. I will think of a good way to handle this...

2013-03-22 10:56:46

by Marcelo Tosatti

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 10:11:17AM +0800, Xiao Guangrong wrote:
> On 03/22/2013 06:21 AM, Marcelo Tosatti wrote:
> > On Wed, Mar 20, 2013 at 04:30:20PM +0800, Xiao Guangrong wrote:
> >> Changlog:
> >> V2:
> >> - do not reset n_requested_mmu_pages and n_max_mmu_pages
> >> - batch free root shadow pages to reduce vcpu notification and mmu-lock
> >> contention
> >> - remove the first patch that introduce kvm->arch.mmu_cache since we only
> >> 'memset zero' on hashtable rather than all mmu cache members in this
> >> version
> >> - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all
> >>
> >> * Issue
> >> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> >> walk and zap all shadow pages one by one, also it need to zap all guest
> >> page's rmap and all shadow page's parent spte list. Particularly, things
> >> become worse if guest uses more memory or vcpus. It is not good for
> >> scalability.
> >
> > Xiao,
> >
> > The bulk removal of shadow pages from mmu cache is nerving - it creates
> > two codepaths to delete a data structure: the usual, single entry one
> > and the bulk one.
> >
> > There are two main usecases for kvm_mmu_zap_all(): to invalidate the
> > current mmu tree (from kvm_set_memory) and to tear down all pages
> > (VM shutdown).
> >
> > The first usecase can use your idea of an invalid generation number
> > on shadow pages. That is, increment the VM generation number, nuke the root
> > pages and thats it.
> >
> > The modifications should be contained to kvm_mmu_get_page() mostly,
> > correct? (would also have to keep counters to increase SLAB freeing
> > ratio, relative to number of outdated shadow pages).
>
> Yes.
>
> >
> > And then have codepaths that nuke shadow pages break from the spinlock,
>
> I think this is not needed any more. We can let mmu_notify use the generation
> number to invalid all shadow pages, then we only need to free them after
> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> we can directly free them.
>
> > such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
>
> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> not fix the hot-lock contention and it just occupies more cpu time to avoid
> possible soft lock-ups.
>
> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> create page tables again. zap-all-shadow-page need long time to be finished,
> the worst case is, it can not completed forever on intensive vcpu and memory
> usage.

Yes, but the suggestion is to use spin_needbreak on the VM shutdown
cases, where there is no detailed concern about performance. Such as
mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
most is that host remains unaffected (and that it finishes in a
reasonable time).

> I still think the right way to fix this kind of thing is optimization for
> mmu-lock.

And then for the cases where performance matters, just increase a VM-global
generation number, zap the roots, and then in kvm_mmu_get_page:

kvm_mmu_get_page() {
	sp = lookup_hash(gfn)
	if (sp->role = role) {
		if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
			kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
			kvm_mmu_init_page(sp);
			proceed as if the page was just allocated
		}
	}
}

It makes the kvm_mmu_zap_all path even faster than what you have now.
I suppose this was your idea with the generation number, correct?

2013-03-22 11:10:54

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:

>>
>>>
>>> And then have codepaths that nuke shadow pages break from the spinlock,
>>
>> I think this is not needed any more. We can let mmu_notify use the generation
>> number to invalid all shadow pages, then we only need to free them after
>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
>> we can directly free them.
>>
>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
>>
>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>> possible soft lock-ups.
>>
>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>> create page tables again. zap-all-shadow-page need long time to be finished,
>> the worst case is, it can not completed forever on intensive vcpu and memory
>> usage.
>
> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> cases, where there is no detailed concern about performance. Such as
> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> most is that host remains unaffected (and that it finishes in a
> reasonable time).

Okay. I agree with you, will give a try.

>
>> I still think the right way to fix this kind of thing is optimization for
>> mmu-lock.
>
> And then for the cases where performance matters just increase a
> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
>
> kvm_mmu_get_page() {
> sp = lookup_hash(gfn)
> if (sp->role = role) {
> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> kvm_mmu_init_page(sp);
> proceed as if the page was just allocated
> }
> }
> }
>
> It makes the kvm_mmu_zap_all path even faster than you have now.
> I suppose this was your idea correct with the generation number correct?

Wow, great minds think alike, this is exactly what I am doing. ;)

2013-03-22 11:28:16

by Gleb Natapov

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
>
> >>
> >>>
> >>> And then have codepaths that nuke shadow pages break from the spinlock,
> >>
> >> I think this is not needed any more. We can let mmu_notify use the generation
> >> number to invalid all shadow pages, then we only need to free them after
> >> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> >> we can directly free them.
> >>
> >>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> >>
> >> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >> possible soft lock-ups.
> >>
> >> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >> create page tables again. zap-all-shadow-page need long time to be finished,
> >> the worst case is, it can not completed forever on intensive vcpu and memory
> >> usage.
> >
> > Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> > cases, where there is no detailed concern about performance. Such as
> > mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> > most is that host remains unaffected (and that it finishes in a
> > reasonable time).
>
> Okay. I agree with you, will give a try.
>
> >
> >> I still think the right way to fix this kind of thing is optimization for
> >> mmu-lock.
> >
> > And then for the cases where performance matters just increase a
> > VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> >
> > kvm_mmu_get_page() {
> > sp = lookup_hash(gfn)
> > if (sp->role = role) {
> > if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> > kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> > kvm_mmu_init_page(sp);
> > proceed as if the page was just allocated
> > }
> > }
> > }
> >
> > It makes the kvm_mmu_zap_all path even faster than you have now.
> > I suppose this was your idea correct with the generation number correct?
>
> Wow, great minds think alike, this is exactly what i am doing. ;)
>
Not that I disagree with the above code, but why not make mmu_gen_number
part of the role and remove old pages in kvm_mmu_free_some_pages() whenever
the limit is reached, like we seem to be doing with role.invalid pages now.
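
In miniature, and with made-up types rather than the real KVM structures,
putting the generation into the lookup key means a stale entry simply never
matches any more (illustrative sketch only):

#include <stdio.h>

struct key {
	unsigned long gfn;
	unsigned int gen;
};

struct entry {
	struct key key;
	int valid;
};

static unsigned int current_gen;
static struct entry cache[16];

static struct entry *lookup(unsigned long gfn)
{
	struct entry *e = &cache[gfn % 16];

	/* the generation is part of the key: an entry created under an
	 * older generation no longer matches and is treated as a miss */
	if (e->valid && e->key.gfn == gfn && e->key.gen == current_gen)
		return e;
	return NULL;
}

int main(void)
{
	cache[5].key.gfn = 5;
	cache[5].key.gen = current_gen;
	cache[5].valid = 1;

	printf("before bump: %s\n", lookup(5) ? "hit" : "miss");
	current_gen++;			/* "zap everything" in O(1) */
	printf("after bump:  %s\n", lookup(5) ? "hit" : "miss");
	return 0;
}

Whether a stale entry is then recycled in place or left for
kvm_mmu_free_some_pages() to reclaim is exactly the question here.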

--
Gleb.

2013-03-22 11:39:34

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
>>
>>>>
>>>>>
>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
>>>>
>>>> I think this is not needed any more. We can let mmu_notify use the generation
>>>> number to invalid all shadow pages, then we only need to free them after
>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
>>>> we can directly free them.
>>>>
>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
>>>>
>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>>>> possible soft lock-ups.
>>>>
>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>>>> create page tables again. zap-all-shadow-page need long time to be finished,
>>>> the worst case is, it can not completed forever on intensive vcpu and memory
>>>> usage.
>>>
>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
>>> cases, where there is no detailed concern about performance. Such as
>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
>>> most is that host remains unaffected (and that it finishes in a
>>> reasonable time).
>>
>> Okay. I agree with you, will give a try.
>>
>>>
>>>> I still think the right way to fix this kind of thing is optimization for
>>>> mmu-lock.
>>>
>>> And then for the cases where performance matters just increase a
>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
>>>
>>> kvm_mmu_get_page() {
>>> sp = lookup_hash(gfn)
>>> if (sp->role = role) {
>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
>>> kvm_mmu_init_page(sp);
>>> proceed as if the page was just allocated
>>> }
>>> }
>>> }
>>>
>>> It makes the kvm_mmu_zap_all path even faster than you have now.
>>> I suppose this was your idea correct with the generation number correct?
>>
>> Wow, great minds think alike, this is exactly what i am doing. ;)
>>
> Not that I disagree with above code, but why not make mmu_gen_number to be
> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> limit is reached like we looks to be doing with role.invalid pages now.

These pages can be reused after we purge their entries and delete them from
their parents' lists, which reduces pressure on the memory allocator. Also,
we can move them to the head of the active_list so that the pages with an
invalid generation can be reclaimed first.

2013-03-22 11:47:20

by Gleb Natapov

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> > On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> >> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
> >>
> >>>>
> >>>>>
> >>>>> And then have codepaths that nuke shadow pages break from the spinlock,
> >>>>
> >>>> I think this is not needed any more. We can let mmu_notify use the generation
> >>>> number to invalid all shadow pages, then we only need to free them after
> >>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> >>>> we can directly free them.
> >>>>
> >>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> >>>>
> >>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >>>> possible soft lock-ups.
> >>>>
> >>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >>>> create page tables again. zap-all-shadow-page need long time to be finished,
> >>>> the worst case is, it can not completed forever on intensive vcpu and memory
> >>>> usage.
> >>>
> >>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> >>> cases, where there is no detailed concern about performance. Such as
> >>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> >>> most is that host remains unaffected (and that it finishes in a
> >>> reasonable time).
> >>
> >> Okay. I agree with you, will give a try.
> >>
> >>>
> >>>> I still think the right way to fix this kind of thing is optimization for
> >>>> mmu-lock.
> >>>
> >>> And then for the cases where performance matters just increase a
> >>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> >>>
> >>> kvm_mmu_get_page() {
> >>> sp = lookup_hash(gfn)
> >>> if (sp->role = role) {
> >>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> >>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> >>> kvm_mmu_init_page(sp);
> >>> proceed as if the page was just allocated
> >>> }
> >>> }
> >>> }
> >>>
> >>> It makes the kvm_mmu_zap_all path even faster than you have now.
> >>> I suppose this was your idea correct with the generation number correct?
> >>
> >> Wow, great minds think alike, this is exactly what i am doing. ;)
> >>
> > Not that I disagree with above code, but why not make mmu_gen_number to be
> > part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> > limit is reached like we looks to be doing with role.invalid pages now.
>
> These pages can be reused after purge its entries and delete it from parents
> list, it can reduce the pressure of memory allocator. Also, we can move it to
> the head of active_list so that the pages with invalid_gen can be reclaimed first.
>
You mean the tail of the active_list, since kvm_mmu_free_some_pages()
removes pages from the tail? Since pages with the new mmu_gen_number will be
put at the head of the list, it is natural that the tail will contain pages
with outdated generation numbers, without the need to explicitly move them.

--
Gleb.

2013-03-22 12:03:20

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 07:47 PM, Gleb Natapov wrote:
> On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
>> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
>>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
>>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
>>>>
>>>>>>
>>>>>>>
>>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
>>>>>>
>>>>>> I think this is not needed any more. We can let mmu_notify use the generation
>>>>>> number to invalid all shadow pages, then we only need to free them after
>>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
>>>>>> we can directly free them.
>>>>>>
>>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
>>>>>>
>>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>>>>>> possible soft lock-ups.
>>>>>>
>>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
>>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
>>>>>> usage.
>>>>>
>>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
>>>>> cases, where there is no detailed concern about performance. Such as
>>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
>>>>> most is that host remains unaffected (and that it finishes in a
>>>>> reasonable time).
>>>>
>>>> Okay. I agree with you, will give a try.
>>>>
>>>>>
>>>>>> I still think the right way to fix this kind of thing is optimization for
>>>>>> mmu-lock.
>>>>>
>>>>> And then for the cases where performance matters just increase a
>>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
>>>>>
>>>>> kvm_mmu_get_page() {
>>>>> sp = lookup_hash(gfn)
>>>>> if (sp->role = role) {
>>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
>>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
>>>>> kvm_mmu_init_page(sp);
>>>>> proceed as if the page was just allocated
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
>>>>> I suppose this was your idea correct with the generation number correct?
>>>>
>>>> Wow, great minds think alike, this is exactly what i am doing. ;)
>>>>
>>> Not that I disagree with above code, but why not make mmu_gen_number to be
>>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
>>> limit is reached like we looks to be doing with role.invalid pages now.
>>
>> These pages can be reused after purge its entries and delete it from parents
>> list, it can reduce the pressure of memory allocator. Also, we can move it to
>> the head of active_list so that the pages with invalid_gen can be reclaimed first.
>>
> You mean tail of the active_list, since kvm_mmu_free_some_pages()
> removes pages from tail? Since pages with new mmu_gen_number will be put

I mean: purge the invalid-generation page first, then update its generation
to the current one, then move it to the head of the active_list:

kvm_mmu_get_page() {
	sp = lookup_hash(gfn)
	if (sp->role = role) {
		if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
			kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
			sp->mmu_gen_number = kvm->arch.mmu_gen_number;
			@@@@@@ move sp to the head of active list @@@@@@
		}
	}
}


> at the head of the list it is natural that tail will contain pages with
> outdated generation numbers without need to explicitly move them.

Currently, only a newly allocated page is put at the head of the
active_list; existing pages are not moved by kvm_mmu_get_page. It looks like
a bug.

2013-03-22 12:15:34

by Gleb Natapov

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 08:03:04PM +0800, Xiao Guangrong wrote:
> On 03/22/2013 07:47 PM, Gleb Natapov wrote:
> > On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
> >> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> >>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> >>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
> >>>>>>
> >>>>>> I think this is not needed any more. We can let mmu_notify use the generation
> >>>>>> number to invalid all shadow pages, then we only need to free them after
> >>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> >>>>>> we can directly free them.
> >>>>>>
> >>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> >>>>>>
> >>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >>>>>> possible soft lock-ups.
> >>>>>>
> >>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
> >>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
> >>>>>> usage.
> >>>>>
> >>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> >>>>> cases, where there is no detailed concern about performance. Such as
> >>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> >>>>> most is that host remains unaffected (and that it finishes in a
> >>>>> reasonable time).
> >>>>
> >>>> Okay. I agree with you, will give a try.
> >>>>
> >>>>>
> >>>>>> I still think the right way to fix this kind of thing is optimization for
> >>>>>> mmu-lock.
> >>>>>
> >>>>> And then for the cases where performance matters just increase a
> >>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> >>>>>
> >>>>> kvm_mmu_get_page() {
> >>>>> sp = lookup_hash(gfn)
> >>>>> if (sp->role = role) {
> >>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> >>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> >>>>> kvm_mmu_init_page(sp);
> >>>>> proceed as if the page was just allocated
> >>>>> }
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
> >>>>> I suppose this was your idea correct with the generation number correct?
> >>>>
> >>>> Wow, great minds think alike, this is exactly what i am doing. ;)
> >>>>
> >>> Not that I disagree with above code, but why not make mmu_gen_number to be
> >>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> >>> limit is reached like we looks to be doing with role.invalid pages now.
> >>
> >> These pages can be reused after purge its entries and delete it from parents
> >> list, it can reduce the pressure of memory allocator. Also, we can move it to
> >> the head of active_list so that the pages with invalid_gen can be reclaimed first.
> >>
> > You mean tail of the active_list, since kvm_mmu_free_some_pages()
> > removes pages from tail? Since pages with new mmu_gen_number will be put
>
> I mean purge the invalid-gen page first, then update its valid-gen to current-gen,
> then move it to the head of active_list:
>
> kvm_mmu_get_page() {
> sp = lookup_hash(gfn)
> if (sp->role = role) {
> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
> sp->mmu_gen_number = kvm->arch.mmu_gen_number;
> @@@@@@ move sp to the head of active list @@@@@@
> }
> }
> }
>
>
And I am saying that if you make mmu_gen_number part of the role you do
not need to change kvm_mmu_get_page() at all. It will just work.

> > at the head of the list it is natural that tail will contain pages with
> > outdated generation numbers without need to explicitly move them.
>
> Currently, only the new allocated page can be moved to the head of
> active_list. The existing pages are not moved by kvm_mmu_get_page.
> It seems a bug.
Ideally it needs to be an LRU list based on accessed-bit scanning.

--
Gleb.

2013-03-22 12:37:44

by Xiao Guangrong

Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On 03/22/2013 08:12 PM, Gleb Natapov wrote:
> On Fri, Mar 22, 2013 at 08:03:04PM +0800, Xiao Guangrong wrote:
>> On 03/22/2013 07:47 PM, Gleb Natapov wrote:
>>> On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
>>>> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
>>>>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
>>>>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
>>>>>>>>
>>>>>>>> I think this is not needed any more. We can let mmu_notify use the generation
>>>>>>>> number to invalid all shadow pages, then we only need to free them after
>>>>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
>>>>>>>> we can directly free them.
>>>>>>>>
>>>>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
>>>>>>>>
>>>>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>>>>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>>>>>>>> possible soft lock-ups.
>>>>>>>>
>>>>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>>>>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>>>>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
>>>>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
>>>>>>>> usage.
>>>>>>>
>>>>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
>>>>>>> cases, where there is no detailed concern about performance. Such as
>>>>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
>>>>>>> most is that host remains unaffected (and that it finishes in a
>>>>>>> reasonable time).
>>>>>>
>>>>>> Okay. I agree with you, will give a try.
>>>>>>
>>>>>>>
>>>>>>>> I still think the right way to fix this kind of thing is optimization for
>>>>>>>> mmu-lock.
>>>>>>>
>>>>>>> And then for the cases where performance matters just increase a
>>>>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
>>>>>>>
>>>>>>> kvm_mmu_get_page() {
>>>>>>> sp = lookup_hash(gfn)
>>>>>>> if (sp->role = role) {
>>>>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
>>>>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
>>>>>>> kvm_mmu_init_page(sp);
>>>>>>> proceed as if the page was just allocated
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
>>>>>>> I suppose this was your idea correct with the generation number correct?
>>>>>>
>>>>>> Wow, great minds think alike, this is exactly what i am doing. ;)
>>>>>>
>>>>> Not that I disagree with above code, but why not make mmu_gen_number to be
>>>>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
>>>>> limit is reached like we looks to be doing with role.invalid pages now.
>>>>
>>>> These pages can be reused after purge its entries and delete it from parents
>>>> list, it can reduce the pressure of memory allocator. Also, we can move it to
>>>> the head of active_list so that the pages with invalid_gen can be reclaimed first.
>>>>
>>> You mean tail of the active_list, since kvm_mmu_free_some_pages()
>>> removes pages from tail? Since pages with new mmu_gen_number will be put
>>
>> I mean purge the invalid-gen page first, then update its valid-gen to current-gen,
>> then move it to the head of active_list:
>>
>> kvm_mmu_get_page() {
>> sp = lookup_hash(gfn)
>> if (sp->role = role) {
>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
>> kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
>> sp->mmu_gen_number = kvm->arch.mmu_gen_number;
>> @@@@@@ move sp to the head of active list @@@@@@
>> }
>> }
>> }
>>
>>
> And I am saying that if you make mmu_gen_number part of the role you do
> not need to change kvm_mmu_get_page() at all. It will just work.

Oh, I got what you said. But I want to reuse these pages (without freeing
and re-allocating them). What do you think about this?

>
>>> at the head of the list it is natural that tail will contain pages with
>>> outdated generation numbers without need to explicitly move them.
>>
>> Currently, only the new allocated page can be moved to the head of
>> active_list. The existing pages are not moved by kvm_mmu_get_page.
>> It seems a bug.
> Ideally it needs to be LRU list based on accessed bit scanning.

Yes, but unfortunately the accessed (A) bit is not supported on some Intel CPUs...
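
For illustration, here is a minimal userspace sketch of the purge-and-reuse idea
in the pseudocode above: on a hash hit with a stale generation number the page
is scrubbed and its generation bumped instead of being freed and reallocated.
All names, sizes and structures below are illustrative stand-ins, not the
kernel's.

/*
 * Toy model only: a real shadow page also has rmaps and parent spte lists
 * that would be torn down in the "purge" step, and collisions here simply
 * overwrite the hash slot.
 */
#include <stdlib.h>
#include <string.h>

#define NR_HASH     64
#define HASH(gfn)   ((gfn) % NR_HASH)

struct mmu_page {
    unsigned long gfn;
    unsigned int role;
    unsigned int gen;             /* generation this page was built for */
    unsigned long sptes[512];     /* stands in for the shadow entries   */
};

static struct mmu_page *hash_table[NR_HASH];
static unsigned int kvm_mmu_gen;  /* bumped by "zap all"                */

static struct mmu_page *mmu_get_page(unsigned long gfn, unsigned int role)
{
    struct mmu_page *sp = hash_table[HASH(gfn)];

    if (sp && sp->gfn == gfn && sp->role == role) {
        if (sp->gen != kvm_mmu_gen) {
            /* stale: unreachable since the zap, so no TLB flush needed;
             * purge the entries and reuse the page in place */
            memset(sp->sptes, 0, sizeof(sp->sptes));
            sp->gen = kvm_mmu_gen;
        }
        return sp;
    }

    sp = calloc(1, sizeof(*sp));
    if (!sp)
        return NULL;
    sp->gfn = gfn;
    sp->role = role;
    sp->gen = kvm_mmu_gen;
    hash_table[HASH(gfn)] = sp;
    return sp;
}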

2013-03-22 19:15:30

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 08:37:33PM +0800, Xiao Guangrong wrote:
> On 03/22/2013 08:12 PM, Gleb Natapov wrote:
> > On Fri, Mar 22, 2013 at 08:03:04PM +0800, Xiao Guangrong wrote:
> >> On 03/22/2013 07:47 PM, Gleb Natapov wrote:
> >>> On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
> >>>> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> >>>>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> >>>>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
> >>>>>>>>
> >>>>>>>> I think this is not needed any more. We can let mmu_notify use the generation
> >>>>>>>> number to invalid all shadow pages, then we only need to free them after
> >>>>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> >>>>>>>> we can directly free them.
> >>>>>>>>
> >>>>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> >>>>>>>>
> >>>>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >>>>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >>>>>>>> possible soft lock-ups.
> >>>>>>>>
> >>>>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >>>>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >>>>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
> >>>>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
> >>>>>>>> usage.
> >>>>>>>
> >>>>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> >>>>>>> cases, where there is no detailed concern about performance. Such as
> >>>>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> >>>>>>> most is that host remains unaffected (and that it finishes in a
> >>>>>>> reasonable time).
> >>>>>>
> >>>>>> Okay. I agree with you, will give a try.
> >>>>>>
> >>>>>>>
> >>>>>>>> I still think the right way to fix this kind of thing is optimization for
> >>>>>>>> mmu-lock.
> >>>>>>>
> >>>>>>> And then for the cases where performance matters just increase a
> >>>>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> >>>>>>>
> >>>>>>> kvm_mmu_get_page() {
> >>>>>>> sp = lookup_hash(gfn)
> >>>>>>> if (sp->role = role) {
> >>>>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> >>>>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> >>>>>>> kvm_mmu_init_page(sp);
> >>>>>>> proceed as if the page was just allocated
> >>>>>>> }
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
> >>>>>>> I suppose this was your idea correct with the generation number correct?
> >>>>>>
> >>>>>> Wow, great minds think alike, this is exactly what i am doing. ;)
> >>>>>>
> >>>>> Not that I disagree with above code, but why not make mmu_gen_number to be
> >>>>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> >>>>> limit is reached like we looks to be doing with role.invalid pages now.
> >>>>
> >>>> These pages can be reused after purge its entries and delete it from parents
> >>>> list, it can reduce the pressure of memory allocator. Also, we can move it to
> >>>> the head of active_list so that the pages with invalid_gen can be reclaimed first.
> >>>>
> >>> You mean tail of the active_list, since kvm_mmu_free_some_pages()
> >>> removes pages from tail? Since pages with new mmu_gen_number will be put
> >>
> >> I mean purge the invalid-gen page first, then update its valid-gen to current-gen,
> >> then move it to the head of active_list:
> >>
> >> kvm_mmu_get_page() {
> >> sp = lookup_hash(gfn)
> >> if (sp->role = role) {
> >> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> >> kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
> >> sp->mmu_gen_number = kvm->arch.mmu_gen_number;
> >> @@@@@@ move sp to the head of active list @@@@@@
> >> }
> >> }
> >> }
> >>
> >>
> > And I am saying that if you make mmu_gen_number part of the role you do
> > not need to change kvm_mmu_get_page() at all. It will just work.
>
> Oh, i got what your said. But i want to reuse these page (without
> free and re-allocate). What do you think about this?
>
We did not do that for sp->role.invalid pages although we could do what
is proposed above for them too (am I right?). If there is a measurable
advantage to reusing invalid pages in kvm_mmu_get_page() let's do it like
that, but if not then less code is better.

> >
> >>> at the head of the list it is natural that tail will contain pages with
> >>> outdated generation numbers without need to explicitly move them.
> >>
> >> Currently, only the new allocated page can be moved to the head of
> >> active_list. The existing pages are not moved by kvm_mmu_get_page.
> >> It seems a bug.
> > Ideally it needs to be LRU list based on accessed bit scanning.
>
> Yes, but unfortunately, A bit does not be supported on some intel cpus...
>
Yes, but there is not much we can do there.

--
Gleb.
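
For illustration, a rough model of the alternative Gleb suggests: if the
generation number is folded into the page role itself, the existing one-word
role comparison in the lookup path misses stale pages automatically and no new
branch is needed. The bit layout below is purely illustrative, not the layout
of the kernel's union kvm_mmu_page_role.

#include <stdint.h>

union page_role {
    uint32_t word;
    struct {
        uint32_t level    : 4;
        uint32_t direct   : 1;
        uint32_t access   : 3;
        uint32_t mmu_gen  : 8;    /* generation folded into the role */
        uint32_t unused   : 16;
    };
};

static inline int role_matches(union page_role a, union page_role b)
{
    /* a single word compare covers level, access bits and generation,
     * so pages built for an old generation simply never match */
    return a.word == b.word;
}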

2013-04-17 20:59:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Fri, Mar 22, 2013 at 09:15:24PM +0200, Gleb Natapov wrote:
> On Fri, Mar 22, 2013 at 08:37:33PM +0800, Xiao Guangrong wrote:
> > On 03/22/2013 08:12 PM, Gleb Natapov wrote:
> > > On Fri, Mar 22, 2013 at 08:03:04PM +0800, Xiao Guangrong wrote:
> > >> On 03/22/2013 07:47 PM, Gleb Natapov wrote:
> > >>> On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
> > >>>> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> > >>>>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> > >>>>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
> > >>>>>>
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
> > >>>>>>>>
> > >>>>>>>> I think this is not needed any more. We can let mmu_notify use the generation
> > >>>>>>>> number to invalid all shadow pages, then we only need to free them after
> > >>>>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> > >>>>>>>> we can directly free them.
> > >>>>>>>>
> > >>>>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> > >>>>>>>>
> > >>>>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> > >>>>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> > >>>>>>>> possible soft lock-ups.
> > >>>>>>>>
> > >>>>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> > >>>>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> > >>>>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
> > >>>>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
> > >>>>>>>> usage.
> > >>>>>>>
> > >>>>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> > >>>>>>> cases, where there is no detailed concern about performance. Such as
> > >>>>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> > >>>>>>> most is that host remains unaffected (and that it finishes in a
> > >>>>>>> reasonable time).
> > >>>>>>
> > >>>>>> Okay. I agree with you, will give a try.
> > >>>>>>
> > >>>>>>>
> > >>>>>>>> I still think the right way to fix this kind of thing is optimization for
> > >>>>>>>> mmu-lock.
> > >>>>>>>
> > >>>>>>> And then for the cases where performance matters just increase a
> > >>>>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> > >>>>>>>
> > >>>>>>> kvm_mmu_get_page() {
> > >>>>>>> sp = lookup_hash(gfn)
> > >>>>>>> if (sp->role = role) {
> > >>>>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> > >>>>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> > >>>>>>> kvm_mmu_init_page(sp);
> > >>>>>>> proceed as if the page was just allocated
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
> > >>>>>>> I suppose this was your idea correct with the generation number correct?
> > >>>>>>
> > >>>>>> Wow, great minds think alike, this is exactly what i am doing. ;)
> > >>>>>>
> > >>>>> Not that I disagree with above code, but why not make mmu_gen_number to be
> > >>>>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> > >>>>> limit is reached like we looks to be doing with role.invalid pages now.
> > >>>>
> > >>>> These pages can be reused after purge its entries and delete it from parents
> > >>>> list, it can reduce the pressure of memory allocator. Also, we can move it to
> > >>>> the head of active_list so that the pages with invalid_gen can be reclaimed first.
> > >>>>
> > >>> You mean tail of the active_list, since kvm_mmu_free_some_pages()
> > >>> removes pages from tail? Since pages with new mmu_gen_number will be put
> > >>
> > >> I mean purge the invalid-gen page first, then update its valid-gen to current-gen,
> > >> then move it to the head of active_list:
> > >>
> > >> kvm_mmu_get_page() {
> > >> sp = lookup_hash(gfn)
> > >> if (sp->role = role) {
> > >> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> > >> kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
> > >> sp->mmu_gen_number = kvm->arch.mmu_gen_number;
> > >> @@@@@@ move sp to the head of active list @@@@@@
> > >> }
> > >> }
> > >> }
> > >>
> > >>
> > > And I am saying that if you make mmu_gen_number part of the role you do
> > > not need to change kvm_mmu_get_page() at all. It will just work.
> >
> > Oh, i got what your said. But i want to reuse these page (without
> > free and re-allocate). What do you think about this?
> >
> We did not do that for sp->role.invalid pages although we could do what
> is proposed above for them too (am I right?). If there is measurable
> advantage of reusing invalid pages in kvm_mmu_get_page() lets do it like
> that, but if not then less code is better.

The number of sp->role.invalid=1 pages is small (only shadow roots). It
can grow but is bounded to a handful. No improvement visible there.

The number of shadow pages with old mmu_gen_number is potentially large.

Returning all shadow pages to the allocator is problematic because it
takes a long time (therefore the suggestion to postpone it).

Spreading the work of freeing (or reusing) those shadow pages across
individual page fault instances alleviates the mmu_lock hold time issue
without significantly slowing down post-kvm_mmu_zap_all operation (which
has to rebuild all pagetables anyway).

Do you prefer to modify the SLAB allocator to aggressively free these stale
shadow pages rather than have kvm_mmu_get_page reuse them?

2013-04-18 09:42:45

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Wed, Apr 17, 2013 at 05:39:04PM -0300, Marcelo Tosatti wrote:
> On Fri, Mar 22, 2013 at 09:15:24PM +0200, Gleb Natapov wrote:
> > On Fri, Mar 22, 2013 at 08:37:33PM +0800, Xiao Guangrong wrote:
> > > On 03/22/2013 08:12 PM, Gleb Natapov wrote:
> > > > On Fri, Mar 22, 2013 at 08:03:04PM +0800, Xiao Guangrong wrote:
> > > >> On 03/22/2013 07:47 PM, Gleb Natapov wrote:
> > > >>> On Fri, Mar 22, 2013 at 07:39:24PM +0800, Xiao Guangrong wrote:
> > > >>>> On 03/22/2013 07:28 PM, Gleb Natapov wrote:
> > > >>>>> On Fri, Mar 22, 2013 at 07:10:44PM +0800, Xiao Guangrong wrote:
> > > >>>>>> On 03/22/2013 06:54 PM, Marcelo Tosatti wrote:
> > > >>>>>>
> > > >>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> And then have codepaths that nuke shadow pages break from the spinlock,
> > > >>>>>>>>
> > > >>>>>>>> I think this is not needed any more. We can let mmu_notify use the generation
> > > >>>>>>>> number to invalid all shadow pages, then we only need to free them after
> > > >>>>>>>> all vcpus down and mmu_notify unregistered - at this point, no lock contention,
> > > >>>>>>>> we can directly free them.
> > > >>>>>>>>
> > > >>>>>>>>> such as kvm_mmu_slot_remove_write_access does now (spin_needbreak).
> > > >>>>>>>>
> > > >>>>>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> > > >>>>>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> > > >>>>>>>> possible soft lock-ups.
> > > >>>>>>>>
> > > >>>>>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> > > >>>>>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> > > >>>>>>>> create page tables again. zap-all-shadow-page need long time to be finished,
> > > >>>>>>>> the worst case is, it can not completed forever on intensive vcpu and memory
> > > >>>>>>>> usage.
> > > >>>>>>>
> > > >>>>>>> Yes, but the suggestion is to use spin_needbreak on the VM shutdown
> > > >>>>>>> cases, where there is no detailed concern about performance. Such as
> > > >>>>>>> mmu_notifier_release, kvm_destroy_vm, etc. In those cases what matters
> > > >>>>>>> most is that host remains unaffected (and that it finishes in a
> > > >>>>>>> reasonable time).
> > > >>>>>>
> > > >>>>>> Okay. I agree with you, will give a try.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>>> I still think the right way to fix this kind of thing is optimization for
> > > >>>>>>>> mmu-lock.
> > > >>>>>>>
> > > >>>>>>> And then for the cases where performance matters just increase a
> > > >>>>>>> VM global generetion number, zap the roots and then on kvm_mmu_get_page:
> > > >>>>>>>
> > > >>>>>>> kvm_mmu_get_page() {
> > > >>>>>>> sp = lookup_hash(gfn)
> > > >>>>>>> if (sp->role = role) {
> > > >>>>>>> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> > > >>>>>>> kvm_mmu_commit_zap_page(sp); (no need for TLB flushes as its unreachable)
> > > >>>>>>> kvm_mmu_init_page(sp);
> > > >>>>>>> proceed as if the page was just allocated
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> It makes the kvm_mmu_zap_all path even faster than you have now.
> > > >>>>>>> I suppose this was your idea correct with the generation number correct?
> > > >>>>>>
> > > >>>>>> Wow, great minds think alike, this is exactly what i am doing. ;)
> > > >>>>>>
> > > >>>>> Not that I disagree with above code, but why not make mmu_gen_number to be
> > > >>>>> part of a role and remove old pages in kvm_mmu_free_some_pages() whenever
> > > >>>>> limit is reached like we looks to be doing with role.invalid pages now.
> > > >>>>
> > > >>>> These pages can be reused after purge its entries and delete it from parents
> > > >>>> list, it can reduce the pressure of memory allocator. Also, we can move it to
> > > >>>> the head of active_list so that the pages with invalid_gen can be reclaimed first.
> > > >>>>
> > > >>> You mean tail of the active_list, since kvm_mmu_free_some_pages()
> > > >>> removes pages from tail? Since pages with new mmu_gen_number will be put
> > > >>
> > > >> I mean purge the invalid-gen page first, then update its valid-gen to current-gen,
> > > >> then move it to the head of active_list:
> > > >>
> > > >> kvm_mmu_get_page() {
> > > >> sp = lookup_hash(gfn)
> > > >> if (sp->role = role) {
> > > >> if (sp->mmu_gen_number != kvm->arch.mmu_gen_number) {
> > > >> kvm_mmu_purge_page(sp); (no need for TLB flushes as its unreachable)
> > > >> sp->mmu_gen_number = kvm->arch.mmu_gen_number;
> > > >> @@@@@@ move sp to the head of active list @@@@@@
> > > >> }
> > > >> }
> > > >> }
> > > >>
> > > >>
> > > > And I am saying that if you make mmu_gen_number part of the role you do
> > > > not need to change kvm_mmu_get_page() at all. It will just work.
> > >
> > > Oh, i got what your said. But i want to reuse these page (without
> > > free and re-allocate). What do you think about this?
> > >
> > We did not do that for sp->role.invalid pages although we could do what
> > is proposed above for them too (am I right?). If there is measurable
> > advantage of reusing invalid pages in kvm_mmu_get_page() lets do it like
> > that, but if not then less code is better.
>
> The number of sp->role.invalid=1 pages is small (only shadow roots). It
> can grow but is bounded to a handful. No improvement visible there.
>
> The number of shadow pages with old mmu_gen_number is potentially large.
>
> Returning all shadow pages to the allocator is problematic because it
> takes a long time (therefore the suggestion to postpone it).
>
> Spreading the work to free (or reuse) those shadow pages to individual
> page fault instances alleviates the mmu_lock hold time issue without
> significant reduction to post kvm_mmu_zap_all operation (which has to
> rebuilt all pagetables anyway).
>
> You prefer to modify SLAB allocator to aggressively free these stale
> shadow pages rather than kvm_mmu_get_page to reuse them?
Are you saying that what makes kvm_mmu_zap_all() slow is that we return
all the shadow pages to the SLAB allocator? As far as I understand, what
makes it slow is walking over a huge number of shadow pages via various
lists; actually releasing them to the SLAB is not an issue, otherwise
the problem could have been solved by just moving
kvm_mmu_commit_zap_page() out of the mmu_lock. If there is measurable
SLAB overhead from not reusing the pages I am all for reusing them, but
is this really the case or just premature optimization?

--
Gleb.
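
To make the point above concrete, here is a toy model of the zap-all path,
assuming the prepare/commit split the thread refers to: even with the actual
freeing (the "commit") pushed outside mmu_lock, every shadow page still has to
be visited and unlinked under the lock, and that walk is what dominates. Types,
names and locking below are userspace stand-ins, not the kernel's.

#include <pthread.h>
#include <stdlib.h>

struct shadow_page {
    struct shadow_page *next;
    /* rmaps, parent spte lists, ... elided; tearing these down is part
     * of the per-page work that must happen under the lock */
};

static struct shadow_page *active_list;
static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

static void zap_all(void)
{
    struct shadow_page *invalid_list = NULL, *sp, *next;

    pthread_mutex_lock(&mmu_lock);
    /* the expensive part ("prepare"): visit and unlink every page while
     * holding the lock; with many pages this walk is what takes so long */
    for (sp = active_list; sp; sp = next) {
        next = sp->next;
        sp->next = invalid_list;
        invalid_list = sp;
    }
    active_list = NULL;
    pthread_mutex_unlock(&mmu_lock);

    /* the comparatively cheap part ("commit"): release the memory with
     * the lock dropped */
    for (sp = invalid_list; sp; sp = next) {
        next = sp->next;
        free(sp);
    }
}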

2013-04-18 14:02:06

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Thu, Apr 18, 2013 at 12:42:39PM +0300, Gleb Natapov wrote:
> > > that, but if not then less code is better.
> >
> > The number of sp->role.invalid=1 pages is small (only shadow roots). It
> > can grow but is bounded to a handful. No improvement visible there.
> >
> > The number of shadow pages with old mmu_gen_number is potentially large.
> >
> > Returning all shadow pages to the allocator is problematic because it
> > takes a long time (therefore the suggestion to postpone it).
> >
> > Spreading the work to free (or reuse) those shadow pages to individual
> > page fault instances alleviates the mmu_lock hold time issue without
> > significant reduction to post kvm_mmu_zap_all operation (which has to
> > rebuilt all pagetables anyway).
> >
> > You prefer to modify SLAB allocator to aggressively free these stale
> > shadow pages rather than kvm_mmu_get_page to reuse them?
> Are you saying that what makes kvm_mmu_zap_all() slow is that we return
> all the shadow pages to the SLAB allocator? As far as I understand what
> makes it slow is walking over huge number of shadow pages via various
> lists, actually releasing them to the SLAB is not an issue, otherwise
> the problem could have been solved by just moving
> kvm_mmu_commit_zap_page() out of the mmu_lock. If there is measurable
> SLAB overhead from not reusing the pages I am all for reusing them, but
> is this really the case or just premature optimization?

Actually releasing them is not a problem. Walking all pages and lists, and
releasing them in the process, is part of the problem ("freeing them" would
have been clearer than "returning them to the allocator").

The point is that at some point you have to walk all pages and release their
data structures. With Xiao's scheme it is possible to avoid this lengthy
process by either:

1) reusing the pages with stale generation number
or
2) releasing them via the SLAB shrinker more aggressively

(another typo, I meant "SLAB shrinker", not "SLAB allocator").

But you seem to be concerned about 1) due to code complexity issues?
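
For illustration, a rough model of option 2): a shrinker-style reclaim pass
that walks the cold end of the active list and frees stale-generation pages
first, so the cost is paid lazily under memory pressure rather than in one
long zap-all. This is an illustrative sketch, not the kernel's mmu_shrink()
code; all names and list handling are stand-ins.

#include <stdlib.h>

struct sp_node {
    struct sp_node *prev, *next;    /* doubly linked active list */
    unsigned int gen;
};

static struct sp_node *active_head, *active_tail;   /* tail = coldest */
static unsigned int kvm_mmu_gen;

/* free up to @nr pages, preferring ones with a stale generation number */
static unsigned long shrink_stale_pages(unsigned long nr)
{
    struct sp_node *sp = active_tail;
    unsigned long freed = 0;

    while (sp && freed < nr) {
        struct sp_node *prev = sp->prev;

        if (sp->gen != kvm_mmu_gen) {
            /* unlink from the active list and release it */
            if (prev)
                prev->next = sp->next;
            else
                active_head = sp->next;
            if (sp->next)
                sp->next->prev = prev;
            else
                active_tail = prev;
            free(sp);
            freed++;
        }
        sp = prev;
    }
    return freed;
}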

2013-04-18 16:36:36

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Thu, Apr 18, 2013 at 11:01:18AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 12:42:39PM +0300, Gleb Natapov wrote:
> > > > that, but if not then less code is better.
> > >
> > > The number of sp->role.invalid=1 pages is small (only shadow roots). It
> > > can grow but is bounded to a handful. No improvement visible there.
> > >
> > > The number of shadow pages with old mmu_gen_number is potentially large.
> > >
> > > Returning all shadow pages to the allocator is problematic because it
> > > takes a long time (therefore the suggestion to postpone it).
> > >
> > > Spreading the work to free (or reuse) those shadow pages to individual
> > > page fault instances alleviates the mmu_lock hold time issue without
> > > significant reduction to post kvm_mmu_zap_all operation (which has to
> > > rebuilt all pagetables anyway).
> > >
> > > You prefer to modify SLAB allocator to aggressively free these stale
> > > shadow pages rather than kvm_mmu_get_page to reuse them?
> > Are you saying that what makes kvm_mmu_zap_all() slow is that we return
> > all the shadow pages to the SLAB allocator? As far as I understand what
> > makes it slow is walking over huge number of shadow pages via various
> > lists, actually releasing them to the SLAB is not an issue, otherwise
> > the problem could have been solved by just moving
> > kvm_mmu_commit_zap_page() out of the mmu_lock. If there is measurable
> > SLAB overhead from not reusing the pages I am all for reusing them, but
> > is this really the case or just premature optimization?
>
> Actually releasing them is not a problem. Walking all pages, lists and
> releasing in the process part of the problem ("returning them to the allocator"
> would have been clearer with "freeing them").
>
> Point is at some point you have to walk all pages and release their data
> structures. With Xiaos scheme its possible to avoid this lengthy process
> by either:
>
> 1) reusing the pages with stale generation number
> or
> 2) releasing them via the SLAB shrinker more aggressively
>
But is it really so? The number of allocated shadow pages is limited
via the n_max_mmu_pages mechanism, so I expect most freeing to happen in
make_mmu_pages_available(), which is called during page faults, so freeing
will be spread across page faults more or less equally. Doing
kvm_mmu_prepare_zap_page()/kvm_mmu_commit_zap_page() and zapping an unknown
number of shadow pages during kvm_mmu_get_page() just to reuse one does
not sound like a clear win to me.

> (another typo, i meant "SLAB shrinker" not "SLAB allocator").
>
> But you seem to be concerned for 1) due to code complexity issues?
>
It adds code that looks redundant to me. I may be wrong, of course; if
it is a demonstrable win I am all for it.

--
Gleb.
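
For illustration, a toy version of the reclaim behaviour described above:
with a hard cap standing in for n_max_mmu_pages, each fault that needs a new
shadow page first frees from the cold end of the active list, so
old-generation pages are released a few at a time across faults. Purely
illustrative; the kernel's make_mmu_pages_available() path is more involved.

#include <stdlib.h>

#define MAX_SHADOW_PAGES 128        /* stands in for n_max_mmu_pages */

struct page_node {
    struct page_node *newer;        /* toward the head (hottest) */
};

static struct page_node *coldest;   /* tail of the active list */
static unsigned long nr_used;

/* called at the top of the fault path before allocating a new page */
static void make_pages_available(void)
{
    while (nr_used >= MAX_SHADOW_PAGES && coldest) {
        struct page_node *victim = coldest;

        coldest = victim->newer;    /* old-generation pages sink to the
                                     * tail, so they are freed first */
        free(victim);
        nr_used--;
    }
}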

2013-04-18 17:44:23

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] KVM: MMU: fast zap all shadow pages

On Thu, Apr 18, 2013 at 07:36:03PM +0300, Gleb Natapov wrote:
> On Thu, Apr 18, 2013 at 11:01:18AM -0300, Marcelo Tosatti wrote:
> > On Thu, Apr 18, 2013 at 12:42:39PM +0300, Gleb Natapov wrote:
> > > > > that, but if not then less code is better.
> > > >
> > > > The number of sp->role.invalid=1 pages is small (only shadow roots). It
> > > > can grow but is bounded to a handful. No improvement visible there.
> > > >
> > > > The number of shadow pages with old mmu_gen_number is potentially large.
> > > >
> > > > Returning all shadow pages to the allocator is problematic because it
> > > > takes a long time (therefore the suggestion to postpone it).
> > > >
> > > > Spreading the work to free (or reuse) those shadow pages to individual
> > > > page fault instances alleviates the mmu_lock hold time issue without
> > > > significant reduction to post kvm_mmu_zap_all operation (which has to
> > > > rebuilt all pagetables anyway).
> > > >
> > > > You prefer to modify SLAB allocator to aggressively free these stale
> > > > shadow pages rather than kvm_mmu_get_page to reuse them?
> > > Are you saying that what makes kvm_mmu_zap_all() slow is that we return
> > > all the shadow pages to the SLAB allocator? As far as I understand what
> > > makes it slow is walking over huge number of shadow pages via various
> > > lists, actually releasing them to the SLAB is not an issue, otherwise
> > > the problem could have been solved by just moving
> > > kvm_mmu_commit_zap_page() out of the mmu_lock. If there is measurable
> > > SLAB overhead from not reusing the pages I am all for reusing them, but
> > > is this really the case or just premature optimization?
> >
> > Actually releasing them is not a problem. Walking all pages, lists and
> > releasing in the process part of the problem ("returning them to the allocator"
> > would have been clearer with "freeing them").
> >
> > Point is at some point you have to walk all pages and release their data
> > structures. With Xiaos scheme its possible to avoid this lengthy process
> > by either:
> >
> > 1) reusing the pages with stale generation number
> > or
> > 2) releasing them via the SLAB shrinker more aggressively
> >
> But is it really so? The number of allocated shadow pages are limited
> via n_max_mmu_pages mechanism, so I expect most freeing to happen in
> make_mmu_pages_available() which is called during page fault so freeing
> will be spread across page faults more or less equally. Doing
> kvm_mmu_prepare_zap_page()/kvm_mmu_commit_zap_page() and zapping unknown
> number of shadow pages during kvm_mmu_get_page() just to reuse one does
> not sound like a clear win to me.

Makes sense.

> > (another typo, i meant "SLAB shrinker" not "SLAB allocator").
> >
> > But you seem to be concerned for 1) due to code complexity issues?
> >
> It adds code that looks to me redundant. I may be wrong of course, if
> it is a demonstrable win I am all for it.

Ditto.