2013-04-27 03:13:33

by Xiao Guangrong

Subject: [PATCH v4 0/6] KVM: MMU: fast zap all shadow pages

This patchset is based on the current 'queue' branch of the kvm tree.

Changelog:

V4:
1): drop unmapping the invalid rmap outside of mmu-lock and use the lock-break
technique instead. Thanks to Gleb's comments.

2): no need to handle invalid-gen pages specially, since the page table is
always switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.

V3:
completely redesign the algorithm, please see below.

V2:
- do not reset n_requested_mmu_pages and n_max_mmu_pages
- batch free root shadow pages to reduce vcpu notification and mmu-lock
contention
- remove the first patch that introduce kvm->arch.mmu_cache since we only
'memset zero' on hashtable rather than all mmu cache members in this
version
- remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
even worse when the guest uses more memory or vcpus. It is not good for
scalability.

* Idea
KVM maintains a global mmu invalid generation-number which is stored in
kvm->arch.mmu_valid_gen and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global
generation-number and then reloads the root shadow pages on all vcpus. Each
vcpu will create a new shadow page table according to kvm's current
generation-number, which ensures the old pages are not used any more.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu cache until the page allocator reclaims them.
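
For illustration only, a minimal sketch of the idea (simplified from patch 4 of
this series; the memslot handling and all error paths are omitted):

    /* When a shadow page is created, tag it with the current generation: */
    sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;

    /* A shadow page is usable only while its generation is still current: */
    static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
    {
            return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
    }

    /* To invalidate everything, bump the generation and reload the roots: */
    spin_lock(&kvm->mmu_lock);
    kvm->arch.mmu_valid_gen++;
    kvm_reload_remote_mmus(kvm);    /* KVM_REQ_MMU_RELOAD on every vcpu */
    spin_unlock(&kvm->mmu_lock);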

* Challenges
Page invalidation is requested when a memslot is moved or deleted and when
kvm is being destroyed; these paths call zap_all_pages, which deletes all
shadow pages by using the memslot's rmap and lpage-info. After zap_all_pages
returns, the rmap and lpage-info are freed.

For the lpage-info, we clear all lpage counts when doing zap-all-pages, so
the invalid shadow pages are no longer counted in lpage-info; after that, the
lpage-info of the invalid memslot can be safely freed. This is also good for
performance - it allows the guest to use hugepages as far as possible.

For the rmap, we use the lock-break technique to zap all sptes linked on the
invalid rmap; it is not very effective but good for the first step (see the
sketch below).
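
Roughly, the lock-break walk over the invalid slot's rmap looks like this (a
simplified fragment of memslot_unmap_rmaps() from patch 4; only one page-size
level is shown):

    rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
    idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
                       slot->base_gfn, level) + 1;

    while (idx--) {
            kvm_unmap_rmapp(kvm, rmapp + idx, slot, 0);

            /* Give mmu-lock away if someone is waiting for it. */
            if (need_resched() || spin_needbreak(&kvm->mmu_lock))
                    cond_resched_lock(&kvm->mmu_lock);
    }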

* TODO
Unmap the invalid rmap outside of mmu-lock in a clean way.

Xiao Guangrong (6):
KVM: MMU: drop unnecessary kvm_reload_remote_mmus
KVM: x86: introduce memslot_set_lpage_disallowed
KVM: MMU: introduce kvm_clear_all_lpage_info
KVM: MMU: fast invalid all shadow pages
KVM: x86: use the fast way to invalid all pages
KVM: MMU: make kvm_mmu_zap_all preemptable

arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 88 ++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu.h | 2 +
arch/x86/kvm/x86.c | 87 ++++++++++++++++++++++++++++-----------
arch/x86/kvm/x86.h | 2 +
5 files changed, 155 insertions(+), 26 deletions(-)

--
1.7.7.6


2013-04-27 03:13:42

by Xiao Guangrong

Subject: [PATCH v4 1/6] KVM: MMU: drop unnecessary kvm_reload_remote_mmus

Keeping the mmu and the TLBs consistent is the responsibility of
kvm_mmu_zap_all itself. The reload is also unnecessary after zapping all
mmio sptes, since no mmio spte exists on a root shadow page and mmio sptes
cannot be cached in the TLB.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2a434bf..91dd9f4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7039,16 +7039,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* If memory slot is created, or moved, we need to clear all
* mmio sptes.
*/
- if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+ if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE))
kvm_mmu_zap_mmio_sptes(kvm);
- kvm_reload_remote_mmus(kvm);
- }
}

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
kvm_mmu_zap_all(kvm);
- kvm_reload_remote_mmus(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
1.7.7.6

2013-04-27 03:14:11

by Xiao Guangrong

Subject: [PATCH v4 5/6] KVM: x86: use the fast way to invalid all pages

Replace kvm_mmu_zap_all with kvm_mmu_invalid_memslot_pages, except on the
mmu_notifier->release() path, which will be fixed in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e4494c..809a053 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5483,7 +5483,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
* to ensure that the updated hypercall appears atomically across all
* VCPUs.
*/
- kvm_mmu_zap_all(vcpu->kvm);
+ kvm_mmu_invalid_memslot_pages(vcpu->kvm, NULL);

kvm_x86_ops->patch_hypercall(vcpu, instruction);

@@ -7093,7 +7093,7 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_arch_flush_shadow_all(kvm);
+ kvm_mmu_invalid_memslot_pages(kvm, slot);
}

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
--
1.7.7.6

2013-04-27 03:14:09

by Xiao Guangrong

Subject: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
even worse when the guest uses more memory or vcpus. It is not good for
scalability.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number which is stored in
kvm->arch.mmu_valid_gen and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global
generation-number and then reloads the root shadow pages on all vcpus. Each
vcpu will create a new shadow page table according to kvm's current
generation-number, which ensures the old pages are not used any more.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu cache until the page allocator reclaims them.

If the invalidation is due to a memslot change, its rmap and lpage-info
will be freed soon; in order to avoid using invalid memory, we unmap
all sptes on its rmap and always reset the lpage-info of all memslots so
that the rmap and lpage-info can be safely freed.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 77 ++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu.h | 2 +
3 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 18635ae..7adf8f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -220,6 +220,7 @@ struct kvm_mmu_page {
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes; /* Reverse mapping for parent_pte */
+ unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);

#ifdef CONFIG_X86_32
@@ -527,6 +528,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
* Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 004cc87..63110c7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
}

+static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
+}
+
static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
gva_t gaddr,
@@ -1864,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.quadrant = quadrant;
}
for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+ if (!is_valid_sp(vcpu->kvm, sp))
+ continue;
+
if (!need_sync && sp->unsync)
need_sync = true;

@@ -1900,6 +1908,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,

account_shadowed(vcpu->kvm, gfn);
}
+ sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
trace_kvm_mmu_get_page(sp, true);
return sp;
@@ -2070,8 +2079,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);
- if (!sp->role.invalid && !sp->role.direct)
+
+ if (!sp->role.invalid && !sp->role.direct &&
+ /* Invalid-gen pages are not accounted. */
+ is_valid_sp(kvm, sp))
unaccount_shadowed(kvm, sp->gfn);
+
if (sp->unsync)
kvm_unlink_unsync_page(kvm, sp);
if (!sp->root_count) {
@@ -4194,6 +4207,68 @@ restart:
spin_unlock(&kvm->mmu_lock);
}

+static void
+memslot_unmap_rmaps(struct kvm_memory_slot *slot, struct kvm *kvm)
+{
+ int level;
+
+ for (level = PT_PAGE_TABLE_LEVEL;
+ level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
+ unsigned long idx, *rmapp;
+
+ rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
+ idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+
+ while (idx--) {
+ kvm_unmap_rmapp(kvm, rmapp + idx, slot, 0);
+
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+ cond_resched_lock(&kvm->mmu_lock);
+ }
+ }
+}
+
+/*
+ * Fast-invalidate all shadow pages that belong to @slot.
+ *
+ * @slot != NULL means the invalidation is caused by the memslot specified
+ * by @slot being deleted; in this case, we must ensure that the rmap
+ * and lpage-info of @slot cannot be used after calling the function.
+ *
+ * @slot == NULL means the invalidation is due to other reasons; we need
+ * not care about the rmap and lpage-info since they are still valid after
+ * calling the function.
+ */
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.mmu_valid_gen++;
+
+ /*
+ * All shadow pages are now invalid; reset the large page info,
+ * then we can safely destroy the memslot. This is also good
+ * for large page usage.
+ */
+ kvm_clear_all_lpage_info(kvm);
+
+ /*
+ * Notify all vcpus to reload its shadow page table
+ * and flush TLB. Then all vcpus will switch to new
+ * shadow page table with the new mmu_valid_gen.
+ *
+ * Note: we should do this under the protection of
+ * mmu-lock, otherwise, vcpu would purge shadow page
+ * but miss tlb flush.
+ */
+ kvm_reload_remote_mmus(kvm);
+
+ if (slot)
+ memslot_unmap_rmaps(slot, kvm);
+ spin_unlock(&kvm->mmu_lock);
+}
+
void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..94670f0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -97,4 +97,6 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
}

+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
#endif
--
1.7.7.6

2013-04-27 03:14:14

by Xiao Guangrong

Subject: [PATCH v4 6/6] KVM: MMU: make kvm_mmu_zap_all preemptable

Now kvm_mmu_zap_all is only called on the mmu_notifier->release path; at
that time the vcpus have stopped, which means no new pages will be created,
so we can use the lock-break technique to avoid a potential soft lockup.

(Note: at this time, mmu-lock can still be contended between ->release
and the other mmu-notifier handlers.)

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 63110c7..46d1d47 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4197,12 +4197,21 @@ void kvm_mmu_zap_all(struct kvm *kvm)
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);

+ might_sleep();
+
spin_lock(&kvm->mmu_lock);
restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
+ list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
goto restart;

+ if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ cond_resched_lock(&kvm->mmu_lock);
+ goto restart;
+ }
+ }
+
kvm_mmu_commit_zap_page(kvm, &invalid_list);
spin_unlock(&kvm->mmu_lock);
}
--
1.7.7.6

2013-04-27 03:14:06

by Xiao Guangrong

Subject: [PATCH v4 3/6] KVM: MMU: introduce kvm_clear_all_lpage_info

This function is used to reset the large page info of all guest pages;
it will be used in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 25 +++++++++++++++++++++++++
arch/x86/kvm/x86.h | 2 ++
2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52b4e97..8e4494c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6951,6 +6951,31 @@ static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
}
}

+static void clear_memslot_lpage_info(struct kvm_memory_slot *slot)
+{
+ int i;
+
+ for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
+ int lpages;
+ int level = i + 1;
+
+ lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+
+ memset(slot->arch.lpage_info[i - 1], 0,
+ sizeof(*slot->arch.lpage_info[i - 1]));
+ memslot_set_lpage_disallowed(slot, slot->npages, i, lpages);
+ }
+}
+
+void kvm_clear_all_lpage_info(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+
+ kvm_for_each_memslot(slot, kvm->memslots)
+ clear_memslot_lpage_info(slot);
+}
+
int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
{
int i;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e224f7a..beae540 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -108,6 +108,8 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
return false;
}

+void kvm_clear_all_lpage_info(struct kvm *kvm);
+
void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
--
1.7.7.6

2013-04-27 03:14:02

by Xiao Guangrong

Subject: [PATCH v4 2/6] KVM: x86: introduce memslot_set_lpage_disallowed

It is used to mark large pages as disallowed at the specified level;
it will be used in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++-----------------
1 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91dd9f4..52b4e97 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6917,12 +6917,45 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
}

+static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
+ unsigned long npages,
+ int lpage_size, int lpages)
+{
+ struct kvm_lpage_info *lpage_info;
+ unsigned long ugfn;
+ int level = lpage_size + 1;
+
+ WARN_ON(!lpage_size);
+
+ lpage_info = slot->arch.lpage_info[lpage_size - 1];
+
+ if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[0].write_count = 1;
+
+ if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[lpages - 1].write_count = 1;
+
+ ugfn = slot->userspace_addr >> PAGE_SHIFT;
+
+ /*
+ * If the gfn and userspace address are not aligned wrt each
+ * other, or if explicitly asked to, disable large page
+ * support for this slot
+ */
+ if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
+ !kvm_largepages_enabled()) {
+ unsigned long j;
+
+ for (j = 0; j < lpages; ++j)
+ lpage_info[j].write_count = 1;
+ }
+}
+
int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
{
int i;

for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
- unsigned long ugfn;
int lpages;
int level = i + 1;

@@ -6941,23 +6974,7 @@ int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
if (!slot->arch.lpage_info[i - 1])
goto out_free;

- if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][0].write_count = 1;
- if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][lpages - 1].write_count = 1;
- ugfn = slot->userspace_addr >> PAGE_SHIFT;
- /*
- * If the gfn and userspace address are not aligned wrt each
- * other, or if explicitly asked to, disable large page
- * support for this slot
- */
- if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
- !kvm_largepages_enabled()) {
- unsigned long j;
-
- for (j = 0; j < lpages; ++j)
- slot->arch.lpage_info[i - 1][j].write_count = 1;
- }
+ memslot_set_lpage_disallowed(slot, npages, i, lpages);
}

return 0;
--
1.7.7.6

2013-05-03 01:06:46

by Marcelo Tosatti

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Sat, Apr 27, 2013 at 11:13:20AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> walk and zap all shadow pages one by one, also it need to zap all guest
> page's rmap and all shadow page's parent spte list. Particularly, things
> become worse if guest uses more memory or vcpus. It is not good for
> scalability.
>
> In this patch, we introduce a faster way to invalid all shadow pages.
> KVM maintains a global mmu invalid generation-number which is stored in
> kvm->arch.mmu_valid_gen and every shadow page stores the current global
> generation-number into sp->mmu_valid_gen when it is created.
>
> When KVM need zap all shadow pages sptes, it just simply increase the
> global generation-number then reload root shadow pages on all vcpus.
> Vcpu will create a new shadow page table according to current kvm's
> generation-number. It ensures the old pages are not used any more.
>
> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are keeped in mmu-cache until page allocator reclaims page.
>
> If the invalidation is due to memslot changed, its rmap amd lpage-info
> will be freed soon, in order to avoiding use invalid memory, we unmap
> all sptes on its rmap and always reset the large-info all memslots so
> that rmap and lpage info can be safely freed.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.c | 77 ++++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/mmu.h | 2 +
> 3 files changed, 80 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 18635ae..7adf8f8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -220,6 +220,7 @@ struct kvm_mmu_page {
> int root_count; /* Currently serving as active root */
> unsigned int unsync_children;
> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> + unsigned long mmu_valid_gen;
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> @@ -527,6 +528,7 @@ struct kvm_arch {
> unsigned int n_requested_mmu_pages;
> unsigned int n_max_mmu_pages;
> unsigned int indirect_shadow_pages;
> + unsigned long mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> /*
> * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 004cc87..63110c7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sp);
> }
>
> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> gfn_t gfn,
> gva_t gaddr,
> @@ -1864,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> + if (!is_valid_sp(vcpu->kvm, sp))
> + continue;
> +
> if (!need_sync && sp->unsync)
> need_sync = true;
>
> @@ -1900,6 +1908,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>
> account_shadowed(vcpu->kvm, gfn);
> }
> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> init_shadow_page_table(sp);
> trace_kvm_mmu_get_page(sp, true);
> return sp;
> @@ -2070,8 +2079,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> kvm_mmu_page_unlink_children(kvm, sp);
> kvm_mmu_unlink_parents(kvm, sp);
> - if (!sp->role.invalid && !sp->role.direct)
> +
> + if (!sp->role.invalid && !sp->role.direct &&
> + /* Invalid-gen pages are not accounted. */
> + is_valid_sp(kvm, sp))
> unaccount_shadowed(kvm, sp->gfn);
> +
> if (sp->unsync)
> kvm_unlink_unsync_page(kvm, sp);
> if (!sp->root_count) {
> @@ -4194,6 +4207,68 @@ restart:
> spin_unlock(&kvm->mmu_lock);
> }
>
> +static void
> +memslot_unmap_rmaps(struct kvm_memory_slot *slot, struct kvm *kvm)
> +{
> + int level;
> +
> + for (level = PT_PAGE_TABLE_LEVEL;
> + level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
> + unsigned long idx, *rmapp;
> +
> + rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
> + idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
> + slot->base_gfn, level) + 1;
> +
> + while (idx--) {
> + kvm_unmap_rmapp(kvm, rmapp + idx, slot, 0);
> +
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);
> + }
> + }
> +}
> +
> +/*
> + * Fast invalid all shadow pages belong to @slot.
> + *
> + * @slot != NULL means the invalidation is caused the memslot specified
> + * by @slot is being deleted, in this case, we should ensure that rmap
> + * and lpage-info of the @slot can not be used after calling the function.
> + *
> + * @slot == NULL means the invalidation due to other reasons, we need
> + * not care rmap and lpage-info since they are still valid after calling
> + * the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> + struct kvm_memory_slot *slot)
> +{
> + spin_lock(&kvm->mmu_lock);
> + kvm->arch.mmu_valid_gen++;
> +
> + /*
> + * All shadow paes are invalid, reset the large page info,
> + * then we can safely desotry the memslot, it is also good
> + * for large page used.
> + */
> + kvm_clear_all_lpage_info(kvm);

Xiao,

I understood it was agreed that simple mmu_lock lockbreak while
avoiding zapping of newly instantiated pages upon a

if(spin_needbreak)
cond_resched_lock()

cycle was enough as a first step? And then later introduce root zapping
along with measurements.

https://lkml.org/lkml/2013/4/22/544

2013-05-03 02:10:37

by Takuya Yoshikawa

Subject: Re: [PATCH v4 2/6] KVM: x86: introduce memslot_set_lpage_disallowed

On Sat, 27 Apr 2013 11:13:18 +0800
Xiao Guangrong <[email protected]> wrote:

> It is used to set disallowed large page on the specified level, can be
> used in later patch
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++-----------------
> 1 files changed, 35 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 91dd9f4..52b4e97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6917,12 +6917,45 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> }
> }
>
> +static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
> + unsigned long npages,
> + int lpage_size, int lpages)

What this function does is to disable large page support for this slot
as can be seen in the comment below.

Since setting lpage_info to something ("disallowed" ?) is an implementation
detail, we'd better hide such a thing from the function name.

Taking into account that we have "kvm_largepages_enabled()", something like
disable_largepages_memslot() may be a candidate.

But I want to see suggestions from others as well.

Takuya

2013-05-03 02:15:37

by Takuya Yoshikawa

Subject: Re: [PATCH v4 3/6] KVM: MMU: introduce kvm_clear_all_lpage_info

On Sat, 27 Apr 2013 11:13:19 +0800
Xiao Guangrong <[email protected]> wrote:

> This function is used to reset the large page info of all guest pages
> which will be used in later patch
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/x86.c | 25 +++++++++++++++++++++++++
> arch/x86/kvm/x86.h | 2 ++
> 2 files changed, 27 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 52b4e97..8e4494c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6951,6 +6951,31 @@ static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
> }
> }
>
> +static void clear_memslot_lpage_info(struct kvm_memory_slot *slot)
> +{
> + int i;
> +
> + for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
> + int lpages;
> + int level = i + 1;
> +
> + lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
> + slot->base_gfn, level) + 1;
> +
> + memset(slot->arch.lpage_info[i - 1], 0,
> + sizeof(*slot->arch.lpage_info[i - 1]));
> + memslot_set_lpage_disallowed(slot, slot->npages, i, lpages);

This does something other than clearing.
Any better name?

Takuya

2013-05-03 02:27:37

by Takuya Yoshikawa

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Sat, 27 Apr 2013 11:13:20 +0800
Xiao Guangrong <[email protected]> wrote:

> +/*
> + * Fast invalid all shadow pages belong to @slot.
> + *
> + * @slot != NULL means the invalidation is caused the memslot specified
> + * by @slot is being deleted, in this case, we should ensure that rmap
> + * and lpage-info of the @slot can not be used after calling the function.
> + *
> + * @slot == NULL means the invalidation due to other reasons, we need

The comment should explain what the "other reasons" are.
But this API may be better split into two separate functions; it depends
on the "other reasons".

> + * not care rmap and lpage-info since they are still valid after calling
> + * the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> + struct kvm_memory_slot *slot)

You yourself are explaining this as "invalidation" in the comment.
kvm_mmu_invalidate_shadow_pages_memslot() or something...

Can anybody think of a better name?

Takuya

2013-05-03 05:52:21

by Xiao Guangrong

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:

>> +
>> +/*
>> + * Fast invalid all shadow pages belong to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused the memslot specified
>> + * by @slot is being deleted, in this case, we should ensure that rmap
>> + * and lpage-info of the @slot can not be used after calling the function.
>> + *
>> + * @slot == NULL means the invalidation due to other reasons, we need
>> + * not care rmap and lpage-info since they are still valid after calling
>> + * the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> + struct kvm_memory_slot *slot)
>> +{
>> + spin_lock(&kvm->mmu_lock);
>> + kvm->arch.mmu_valid_gen++;
>> +
>> + /*
>> + * All shadow paes are invalid, reset the large page info,
>> + * then we can safely desotry the memslot, it is also good
>> + * for large page used.
>> + */
>> + kvm_clear_all_lpage_info(kvm);
>
> Xiao,
>
> I understood it was agreed that simple mmu_lock lockbreak while
> avoiding zapping of newly instantiated pages upon a
>
> if(spin_needbreak)
> cond_resched_lock()
>
> cycle was enough as a first step? And then later introduce root zapping
> along with measurements.
>
> https://lkml.org/lkml/2013/4/22/544

Yes, it is.

See the changelog in 0/0:

" we use lock-break technique to zap all sptes linked on the
invalid rmap, it is not very effective but good for the first step."

Thanks!

2013-05-03 05:56:08

by Xiao Guangrong

Subject: Re: [PATCH v4 2/6] KVM: x86: introduce memslot_set_lpage_disallowed

On 05/03/2013 10:10 AM, Takuya Yoshikawa wrote:
> On Sat, 27 Apr 2013 11:13:18 +0800
> Xiao Guangrong <[email protected]> wrote:
>
>> It is used to set disallowed large page on the specified level, can be
>> used in later patch
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++-----------------
>> 1 files changed, 35 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 91dd9f4..52b4e97 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6917,12 +6917,45 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
>> }
>> }
>>
>> +static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
>> + unsigned long npages,
>> + int lpage_size, int lpages)
>
> What this function does is to disable large page support for this slot
> as can be seen in the comment below.
>
> Since setting lpage_info to something ("disallowed" ?) is an implementation
> detail, we'd better hide such a thing from the function name.
>
> Taking into account that we have "kvm_largepages_enabled()", something like
> disable_largepages_memslot() may be a candidate.
>

No.

kvm_largepages_enabled acts on largepages_enabled; it is not related
to this function. Actually, I really do not care about the difference between
"disallowed" and "disabled".

2013-05-03 05:58:05

by Xiao Guangrong

Subject: Re: [PATCH v4 3/6] KVM: MMU: introduce kvm_clear_all_lpage_info

On 05/03/2013 10:15 AM, Takuya Yoshikawa wrote:
> On Sat, 27 Apr 2013 11:13:19 +0800
> Xiao Guangrong <[email protected]> wrote:
>
>> This function is used to reset the large page info of all guest pages
>> which will be used in later patch
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 25 +++++++++++++++++++++++++
>> arch/x86/kvm/x86.h | 2 ++
>> 2 files changed, 27 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 52b4e97..8e4494c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6951,6 +6951,31 @@ static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
>> }
>> }
>>
>> +static void clear_memslot_lpage_info(struct kvm_memory_slot *slot)
>> +{
>> + int i;
>> +
>> + for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
>> + int lpages;
>> + int level = i + 1;
>> +
>> + lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
>> + slot->base_gfn, level) + 1;
>> +
>> + memset(slot->arch.lpage_info[i - 1], 0,
>> + sizeof(*slot->arch.lpage_info[i - 1]));
>> + memslot_set_lpage_disallowed(slot, slot->npages, i, lpages);
>
> This does something other than clearing.

Aha, this API *clears* the counts set by the kvm mmu. It is meaningful enough,
I think.


2013-05-03 06:00:59

by Xiao Guangrong

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/03/2013 10:27 AM, Takuya Yoshikawa wrote:
> On Sat, 27 Apr 2013 11:13:20 +0800
> Xiao Guangrong <[email protected]> wrote:
>
>> +/*
>> + * Fast invalid all shadow pages belong to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused the memslot specified
>> + * by @slot is being deleted, in this case, we should ensure that rmap
>> + * and lpage-info of the @slot can not be used after calling the function.
>> + *
>> + * @slot == NULL means the invalidation due to other reasons, we need
>
> The comment should explain what the "other reasons" are.
> But this API may better be split into two separate functions; it depends
> on the "other reasons".

NO.

>
>> + * not care rmap and lpage-info since they are still valid after calling
>> + * the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> + struct kvm_memory_slot *slot)
>
> You yourself is explaining this as "invalidation" in the comment.
> kvm_mmu_invalidate_shadow_pages_memslot() or something...

Umm, invalidate is a better name. Will update after collecting Marcelo's,
Gleb's and the other guys' comments.

2013-05-03 15:53:32

by Marcelo Tosatti

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>
> >> +
> >> +/*
> >> + * Fast invalid all shadow pages belong to @slot.
> >> + *
> >> + * @slot != NULL means the invalidation is caused the memslot specified
> >> + * by @slot is being deleted, in this case, we should ensure that rmap
> >> + * and lpage-info of the @slot can not be used after calling the function.
> >> + *
> >> + * @slot == NULL means the invalidation due to other reasons, we need
> >> + * not care rmap and lpage-info since they are still valid after calling
> >> + * the function.
> >> + */
> >> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> + struct kvm_memory_slot *slot)
> >> +{
> >> + spin_lock(&kvm->mmu_lock);
> >> + kvm->arch.mmu_valid_gen++;
> >> +
> >> + /*
> >> + * All shadow paes are invalid, reset the large page info,
> >> + * then we can safely desotry the memslot, it is also good
> >> + * for large page used.
> >> + */
> >> + kvm_clear_all_lpage_info(kvm);
> >
> > Xiao,
> >
> > I understood it was agreed that simple mmu_lock lockbreak while
> > avoiding zapping of newly instantiated pages upon a
> >
> > if(spin_needbreak)
> > cond_resched_lock()
> >
> > cycle was enough as a first step? And then later introduce root zapping
> > along with measurements.
> >
> > https://lkml.org/lkml/2013/4/22/544
>
> Yes, it is.
>
> See the changelog in 0/0:
>
> " we use lock-break technique to zap all sptes linked on the
> invalid rmap, it is not very effective but good for the first step."
>
> Thanks!

Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
zapping the root? Only lock-break technique along with generation number
was what was agreed.

That is, having:

> >> + /*
> >> + * All shadow paes are invalid, reset the large page info,
> >> + * then we can safely desotry the memslot, it is also good
> >> + * for large page used.
> >> + */
> >> + kvm_clear_all_lpage_info(kvm);

Was that an optimization step that should be done only after it has been
shown to be an advantage?

It is more work, but it leads to a better understanding of the issues in
practice.

If you have reasons to do it now, then please have it in the final
patches, as an optimization on top of the first patches (where the
lockbreak technique plus generation numbers is introduced).

2013-05-03 16:51:17

by Xiao Guangrong

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>
>>>> +
>>>> +/*
>>>> + * Fast invalid all shadow pages belong to @slot.
>>>> + *
>>>> + * @slot != NULL means the invalidation is caused the memslot specified
>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
>>>> + * and lpage-info of the @slot can not be used after calling the function.
>>>> + *
>>>> + * @slot == NULL means the invalidation due to other reasons, we need
>>>> + * not care rmap and lpage-info since they are still valid after calling
>>>> + * the function.
>>>> + */
>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>> + struct kvm_memory_slot *slot)
>>>> +{
>>>> + spin_lock(&kvm->mmu_lock);
>>>> + kvm->arch.mmu_valid_gen++;
>>>> +
>>>> + /*
>>>> + * All shadow paes are invalid, reset the large page info,
>>>> + * then we can safely desotry the memslot, it is also good
>>>> + * for large page used.
>>>> + */
>>>> + kvm_clear_all_lpage_info(kvm);
>>>
>>> Xiao,
>>>
>>> I understood it was agreed that simple mmu_lock lockbreak while
>>> avoiding zapping of newly instantiated pages upon a
>>>
>>> if(spin_needbreak)
>>> cond_resched_lock()
>>>
>>> cycle was enough as a first step? And then later introduce root zapping
>>> along with measurements.
>>>
>>> https://lkml.org/lkml/2013/4/22/544
>>
>> Yes, it is.
>>
>> See the changelog in 0/0:
>>
>> " we use lock-break technique to zap all sptes linked on the
>> invalid rmap, it is not very effective but good for the first step."
>>
>> Thanks!
>
> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> zapping the root? Only lock-break technique along with generation number
> was what was agreed.

Marcelo,

Please Wait... I am completely confused. :(

Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
Are these the changes you wanted?

void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
                                   struct kvm_memory_slot *slot)
{
        spin_lock(&kvm->mmu_lock);
        kvm->arch.mmu_valid_gen++;

        /* Zero all root pages.*/
restart:
        list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
                if (!sp->root_count)
                        continue;

                if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
                        goto restart;
        }

        /*
         * All shadow pages are now invalid; reset the large page info,
         * then we can safely destroy the memslot. This is also good
         * for large page usage.
         */
        kvm_clear_all_lpage_info(kvm);

        kvm_mmu_commit_zap_page(kvm, &invalid_list);
        spin_unlock(&kvm->mmu_lock);
}

static void rmap_remove(struct kvm *kvm, u64 *spte)
{
        struct kvm_mmu_page *sp;
        gfn_t gfn;
        unsigned long *rmapp;

        sp = page_header(__pa(spte));
+
+       /* Let invalid sp do not access its rmap. */
+       if (!sp_is_valid(sp))
+               return;
+
        gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
        rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
        pte_list_remove(spte, rmapp);
}

If yes, there is a reason why we cannot do this, which I mentioned before:

after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be destroyed.
Later, if the host reclaims a page, the mmu-notifier handlers ->invalidate_page and
->invalidate_range_start cannot find any spte using the host page, so the
Accessed/Dirty state of the host page is no longer tracked
(kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).

What's your idea?

And I should apologize for my poor communication; really sorry for that...

2013-05-04 00:52:31

by Marcelo Tosatti

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> > On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>
> >>>> +
> >>>> +/*
> >>>> + * Fast invalid all shadow pages belong to @slot.
> >>>> + *
> >>>> + * @slot != NULL means the invalidation is caused the memslot specified
> >>>> + * by @slot is being deleted, in this case, we should ensure that rmap
> >>>> + * and lpage-info of the @slot can not be used after calling the function.
> >>>> + *
> >>>> + * @slot == NULL means the invalidation due to other reasons, we need
> >>>> + * not care rmap and lpage-info since they are still valid after calling
> >>>> + * the function.
> >>>> + */
> >>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>> + struct kvm_memory_slot *slot)
> >>>> +{
> >>>> + spin_lock(&kvm->mmu_lock);
> >>>> + kvm->arch.mmu_valid_gen++;
> >>>> +
> >>>> + /*
> >>>> + * All shadow paes are invalid, reset the large page info,
> >>>> + * then we can safely desotry the memslot, it is also good
> >>>> + * for large page used.
> >>>> + */
> >>>> + kvm_clear_all_lpage_info(kvm);
> >>>
> >>> Xiao,
> >>>
> >>> I understood it was agreed that simple mmu_lock lockbreak while
> >>> avoiding zapping of newly instantiated pages upon a
> >>>
> >>> if(spin_needbreak)
> >>> cond_resched_lock()
> >>>
> >>> cycle was enough as a first step? And then later introduce root zapping
> >>> along with measurements.
> >>>
> >>> https://lkml.org/lkml/2013/4/22/544
> >>
> >> Yes, it is.
> >>
> >> See the changelog in 0/0:
> >>
> >> " we use lock-break technique to zap all sptes linked on the
> >> invalid rmap, it is not very effective but good for the first step."
> >>
> >> Thanks!
> >
> > Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> > zapping the root? Only lock-break technique along with generation number
> > was what was agreed.
>
> Marcelo,
>
> Please Wait... I am completely confused. :(
>
> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> Are these changes you wanted?
>
> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> spin_lock(&kvm->mmu_lock);
> kvm->arch.mmu_valid_gen++;
>
> /* Zero all root pages.*/
> restart:
> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> if (!sp->root_count)
> continue;
>
> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> goto restart;
> }
>
> /*
> * All shadow paes are invalid, reset the large page info,
> * then we can safely desotry the memslot, it is also good
> * for large page used.
> */
> kvm_clear_all_lpage_info(kvm);
>
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> spin_unlock(&kvm->mmu_lock);
> }
>
> static void rmap_remove(struct kvm *kvm, u64 *spte)
> {
> struct kvm_mmu_page *sp;
> gfn_t gfn;
> unsigned long *rmapp;
>
> sp = page_header(__pa(spte));
> +
> + /* Let invalid sp do not access its rmap. */
> + if (!sp_is_valid(sp))
> + return;
> +
> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> pte_list_remove(spte, rmapp);
> }
>
> If yes, there is the reason why we can not do this that i mentioned before:
>
> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
> ->invalidate_range_start, can not find any spte using the host page, then
> Accessed/Dirty for host page is missing tracked.
> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
>
> What's your idea?


Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
releases mmu_lock and reacquires it again, only shadow pages
from the generation with which kvm_mmu_zap_all started are zapped (this
guarantees forward progress and eventual termination).

kvm_mmu_zap_generation()
        spin_lock(mmu_lock)
        int generation = kvm->arch.mmu_generation;

        for_each_shadow_page(sp) {
                if (sp->generation == kvm->arch.mmu_generation)
                        zap_page(sp)
                if (spin_needbreak(mmu_lock)) {
                        kvm->arch.mmu_generation++;
                        cond_resched_lock(mmu_lock);
                }
        }

kvm_mmu_zap_all()
        spin_lock(mmu_lock)
        for_each_shadow_page(sp) {
                if (spin_needbreak(mmu_lock)) {
                        cond_resched_lock(mmu_lock);
                }
        }

Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
Use kvm_mmu_zap_all for kvm_mmu_notifier_release and kvm_destroy_vm.

This addresses the main problem: excessively long hold times
of kvm_mmu_zap_all with very large guests.

Do you see any problem with this logic? This is what I was thinking we had
agreed on.

Step 2) Show that the optimization to zap only the roots is worthwhile
via benchmarking, and implement it.

What do you say?

2013-05-04 00:57:16

by Marcelo Tosatti

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Fri, May 03, 2013 at 09:52:01PM -0300, Marcelo Tosatti wrote:
> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> > On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> > > On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> > >> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> > >>
> > >>>> +
> > >>>> +/*
> > >>>> + * Fast invalid all shadow pages belong to @slot.
> > >>>> + *
> > >>>> + * @slot != NULL means the invalidation is caused the memslot specified
> > >>>> + * by @slot is being deleted, in this case, we should ensure that rmap
> > >>>> + * and lpage-info of the @slot can not be used after calling the function.
> > >>>> + *
> > >>>> + * @slot == NULL means the invalidation due to other reasons, we need
> > >>>> + * not care rmap and lpage-info since they are still valid after calling
> > >>>> + * the function.
> > >>>> + */
> > >>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> > >>>> + struct kvm_memory_slot *slot)
> > >>>> +{
> > >>>> + spin_lock(&kvm->mmu_lock);
> > >>>> + kvm->arch.mmu_valid_gen++;
> > >>>> +
> > >>>> + /*
> > >>>> + * All shadow paes are invalid, reset the large page info,
> > >>>> + * then we can safely desotry the memslot, it is also good
> > >>>> + * for large page used.
> > >>>> + */
> > >>>> + kvm_clear_all_lpage_info(kvm);
> > >>>
> > >>> Xiao,
> > >>>
> > >>> I understood it was agreed that simple mmu_lock lockbreak while
> > >>> avoiding zapping of newly instantiated pages upon a
> > >>>
> > >>> if(spin_needbreak)
> > >>> cond_resched_lock()
> > >>>
> > >>> cycle was enough as a first step? And then later introduce root zapping
> > >>> along with measurements.
> > >>>
> > >>> https://lkml.org/lkml/2013/4/22/544
> > >>
> > >> Yes, it is.
> > >>
> > >> See the changelog in 0/0:
> > >>
> > >> " we use lock-break technique to zap all sptes linked on the
> > >> invalid rmap, it is not very effective but good for the first step."
> > >>
> > >> Thanks!
> > >
> > > Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> > > zapping the root? Only lock-break technique along with generation number
> > > was what was agreed.
> >
> > Marcelo,
> >
> > Please Wait... I am completely confused. :(
> >
> > Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> > Are these changes you wanted?
> >
> > void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> > struct kvm_memory_slot *slot)
> > {
> > spin_lock(&kvm->mmu_lock);
> > kvm->arch.mmu_valid_gen++;
> >
> > /* Zero all root pages.*/
> > restart:
> > list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> > if (!sp->root_count)
> > continue;
> >
> > if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > goto restart;
> > }
> >
> > /*
> > * All shadow paes are invalid, reset the large page info,
> > * then we can safely desotry the memslot, it is also good
> > * for large page used.
> > */
> > kvm_clear_all_lpage_info(kvm);
> >
> > kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > spin_unlock(&kvm->mmu_lock);
> > }
> >
> > static void rmap_remove(struct kvm *kvm, u64 *spte)
> > {
> > struct kvm_mmu_page *sp;
> > gfn_t gfn;
> > unsigned long *rmapp;
> >
> > sp = page_header(__pa(spte));
> > +
> > + /* Let invalid sp do not access its rmap. */
> > + if (!sp_is_valid(sp))
> > + return;
> > +
> > gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> > rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> > pte_list_remove(spte, rmapp);
> > }
> >
> > If yes, there is the reason why we can not do this that i mentioned before:
> >
> > after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
> > Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
> > ->invalidate_range_start, can not find any spte using the host page, then
> > Accessed/Dirty for host page is missing tracked.
> > (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
> >
> > What's your idea?
>
>
> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> releases mmu_lock and reacquires it again, only shadow pages
> from the generation with which kvm_mmu_zap_all started are zapped (this
> guarantees forward progress and eventual termination).
>
> kvm_mmu_zap_generation()
> spin_lock(mmu_lock)
> int generation = kvm->arch.mmu_generation;
>
> for_each_shadow_page(sp) {
> if (sp->generation == kvm->arch.mmu_generation)
> zap_page(sp)
> if (spin_needbreak(mmu_lock)) {
> kvm->arch.mmu_generation++;
> cond_resched_lock(mmu_lock);
> }
> }
>
> kvm_mmu_zap_all()
> spin_lock(mmu_lock)
> for_each_shadow_page(sp) {
> if (spin_needbreak(mmu_lock)) {
> cond_resched_lock(mmu_lock);
> }
> }
>
> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>
> This addresses the main problem: excessively long hold times
> of kvm_mmu_zap_all with very large guests.
>
> Do you see any problem with this logic? This was what i was thinking
> we agreed.
>
> Step 2) Show that the optimization to zap only the roots is worthwhile
> via benchmarking, and implement it.
>
> What do you say?

One concern you had earlier was:

"BTW, to my honest, i do not think spin_needbreak is a good way - it
does not fix the hot-lock contention and it just occupies more cpu time
to avoid possible soft lock-ups.

Especially, zap-all-shadow-pages can let other vcpus fault and vcpus
contest mmu-lock, then zap-all-shadow-pages release mmu-lock and wait,
other vcpus create page tables again. zap-all-shadow-page need long
time to be finished, the worst case is, it can not completed forever on
intensive vcpu and memory usage."

But with generation numbers you can guarantee termination (as long as
new pages are added to one side of the active_mmu_pages list while
kvm_mmu_zap_all begins walking from the other).
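
For illustration, a rough sketch of such a generation-bounded walk (it reuses
the series' mmu_valid_gen field; the function name below is made up, and it
assumes new pages are only ever added at the head of active_mmu_pages, so a
tail-first walk sees old-generation pages first):

static void kvm_mmu_zap_old_generation(struct kvm *kvm)
{
        struct kvm_mmu_page *sp, *node;
        LIST_HEAD(invalid_list);

        spin_lock(&kvm->mmu_lock);
        kvm->arch.mmu_valid_gen++;
restart:
        list_for_each_entry_safe_reverse(sp, node,
                                &kvm->arch.active_mmu_pages, link) {
                /* Pages created after the bump carry the new generation. */
                if (sp->mmu_valid_gen == kvm->arch.mmu_valid_gen)
                        break;

                if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
                        goto restart;

                if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
                        kvm_mmu_commit_zap_page(kvm, &invalid_list);
                        cond_resched_lock(&kvm->mmu_lock);
                        goto restart;
                }
        }
        kvm_mmu_commit_zap_page(kvm, &invalid_list);
        spin_unlock(&kvm->mmu_lock);
}

Old-generation pages only ever leave the list and new ones only appear at the
head, so each restart makes forward progress and the walk terminates.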

2013-05-06 05:15:24

by Xiao Guangrong

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
>> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
>>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>>>
>>>>>> +
>>>>>> +/*
>>>>>> + * Fast invalid all shadow pages belong to @slot.
>>>>>> + *
>>>>>> + * @slot != NULL means the invalidation is caused the memslot specified
>>>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
>>>>>> + * and lpage-info of the @slot can not be used after calling the function.
>>>>>> + *
>>>>>> + * @slot == NULL means the invalidation due to other reasons, we need
>>>>>> + * not care rmap and lpage-info since they are still valid after calling
>>>>>> + * the function.
>>>>>> + */
>>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>>>> + struct kvm_memory_slot *slot)
>>>>>> +{
>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>> + kvm->arch.mmu_valid_gen++;
>>>>>> +
>>>>>> + /*
>>>>>> + * All shadow paes are invalid, reset the large page info,
>>>>>> + * then we can safely desotry the memslot, it is also good
>>>>>> + * for large page used.
>>>>>> + */
>>>>>> + kvm_clear_all_lpage_info(kvm);
>>>>>
>>>>> Xiao,
>>>>>
>>>>> I understood it was agreed that simple mmu_lock lockbreak while
>>>>> avoiding zapping of newly instantiated pages upon a
>>>>>
>>>>> if(spin_needbreak)
>>>>> cond_resched_lock()
>>>>>
>>>>> cycle was enough as a first step? And then later introduce root zapping
>>>>> along with measurements.
>>>>>
>>>>> https://lkml.org/lkml/2013/4/22/544
>>>>
>>>> Yes, it is.
>>>>
>>>> See the changelog in 0/0:
>>>>
>>>> " we use lock-break technique to zap all sptes linked on the
>>>> invalid rmap, it is not very effective but good for the first step."
>>>>
>>>> Thanks!
>>>
>>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
>>> zapping the root? Only lock-break technique along with generation number
>>> was what was agreed.
>>
>> Marcelo,
>>
>> Please Wait... I am completely confused. :(
>>
>> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
>> Are these changes you wanted?
>>
>> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> struct kvm_memory_slot *slot)
>> {
>> spin_lock(&kvm->mmu_lock);
>> kvm->arch.mmu_valid_gen++;
>>
>> /* Zero all root pages.*/
>> restart:
>> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
>> if (!sp->root_count)
>> continue;
>>
>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>> goto restart;
>> }
>>
>> /*
>> * All shadow paes are invalid, reset the large page info,
>> * then we can safely desotry the memslot, it is also good
>> * for large page used.
>> */
>> kvm_clear_all_lpage_info(kvm);
>>
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> spin_unlock(&kvm->mmu_lock);
>> }
>>
>> static void rmap_remove(struct kvm *kvm, u64 *spte)
>> {
>> struct kvm_mmu_page *sp;
>> gfn_t gfn;
>> unsigned long *rmapp;
>>
>> sp = page_header(__pa(spte));
>> +
>> + /* Let invalid sp do not access its rmap. */
>> + if (!sp_is_valid(sp))
>> + return;
>> +
>> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
>> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
>> pte_list_remove(spte, rmapp);
>> }
>>
>> If yes, there is the reason why we can not do this that i mentioned before:
>>
>> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
>> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
>> ->invalidate_range_start, can not find any spte using the host page, then
>> Accessed/Dirty for host page is missing tracked.
>> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
>>
>> What's your idea?
>
>
> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> releases mmu_lock and reacquires it again, only shadow pages
> from the generation with which kvm_mmu_zap_all started are zapped (this
> guarantees forward progress and eventual termination).
>
> kvm_mmu_zap_generation()
> spin_lock(mmu_lock)
> int generation = kvm->arch.mmu_generation;
>
> for_each_shadow_page(sp) {
> if (sp->generation == kvm->arch.mmu_generation)
> zap_page(sp)
> if (spin_needbreak(mmu_lock)) {
> kvm->arch.mmu_generation++;
> cond_resched_lock(mmu_lock);
> }
> }
>
> kvm_mmu_zap_all()
> spin_lock(mmu_lock)
> for_each_shadow_page(sp) {
> if (spin_needbreak(mmu_lock)) {
> cond_resched_lock(mmu_lock);
> }
> }
>
> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>
> This addresses the main problem: excessively long hold times
> of kvm_mmu_zap_all with very large guests.
>
> Do you see any problem with this logic? This was what i was thinking
> we agreed.

No. I understand it and it can work.

Actually, it is similar to Gleb's idea of "zapping stale shadow pages
(using the lock-break technique)"; after some discussion, we thought "only zap
shadow pages that are reachable from the slot's rmap" was better, which is
what this patchset does.
(https://lkml.org/lkml/2013/4/23/73)

>
> Step 2) Show that the optimization to zap only the roots is worthwhile
> via benchmarking, and implement it.

This is what I am confused about. I cannot understand how "zap only the roots"
works. Do you mean this change?

kvm_mmu_zap_generation()
        spin_lock(mmu_lock)
        int generation = kvm->arch.mmu_generation;

        for_each_shadow_page(sp) {
                /* Change here. */
=>              if ((sp->generation == kvm->arch.mmu_generation) &&
=>                  sp->root_count)
                        zap_page(sp)

                if (spin_needbreak(mmu_lock)) {
                        kvm->arch.mmu_generation++;
                        cond_resched_lock(mmu_lock);
                }
        }

If we do this, there will be shadow pages that are still linked to the invalid
memslot's rmap. How do we handle these pages and the mmu-notifier issue?

Thanks!

2013-05-06 12:36:32

by Gleb Natapov

Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> > On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> >>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>>>
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Fast invalid all shadow pages belong to @slot.
> >>>>>> + *
> >>>>>> + * @slot != NULL means the invalidation is caused the memslot specified
> >>>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
> >>>>>> + * and lpage-info of the @slot can not be used after calling the function.
> >>>>>> + *
> >>>>>> + * @slot == NULL means the invalidation due to other reasons, we need
> >>>>>> + * not care rmap and lpage-info since they are still valid after calling
> >>>>>> + * the function.
> >>>>>> + */
> >>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>>>> + struct kvm_memory_slot *slot)
> >>>>>> +{
> >>>>>> + spin_lock(&kvm->mmu_lock);
> >>>>>> + kvm->arch.mmu_valid_gen++;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * All shadow paes are invalid, reset the large page info,
> >>>>>> + * then we can safely desotry the memslot, it is also good
> >>>>>> + * for large page used.
> >>>>>> + */
> >>>>>> + kvm_clear_all_lpage_info(kvm);
> >>>>>
> >>>>> Xiao,
> >>>>>
> >>>>> I understood it was agreed that simple mmu_lock lockbreak while
> >>>>> avoiding zapping of newly instantiated pages upon a
> >>>>>
> >>>>> if(spin_needbreak)
> >>>>> cond_resched_lock()
> >>>>>
> >>>>> cycle was enough as a first step? And then later introduce root zapping
> >>>>> along with measurements.
> >>>>>
> >>>>> https://lkml.org/lkml/2013/4/22/544
> >>>>
> >>>> Yes, it is.
> >>>>
> >>>> See the changelog in 0/0:
> >>>>
> >>>> " we use lock-break technique to zap all sptes linked on the
> >>>> invalid rmap, it is not very effective but good for the first step."
> >>>>
> >>>> Thanks!
> >>>
> >>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> >>> zapping the root? Only lock-break technique along with generation number
> >>> was what was agreed.
> >>
> >> Marcelo,
> >>
> >> Please Wait... I am completely confused. :(
> >>
> >> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> >> Are these changes you wanted?
> >>
> >> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> struct kvm_memory_slot *slot)
> >> {
> >> spin_lock(&kvm->mmu_lock);
> >> kvm->arch.mmu_valid_gen++;
> >>
> >> /* Zero all root pages.*/
> >> restart:
> >> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> >> if (!sp->root_count)
> >> continue;
> >>
> >> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> goto restart;
> >> }
> >>
> >> /*
> >> * All shadow paes are invalid, reset the large page info,
> >> * then we can safely desotry the memslot, it is also good
> >> * for large page used.
> >> */
> >> kvm_clear_all_lpage_info(kvm);
> >>
> >> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >> spin_unlock(&kvm->mmu_lock);
> >> }
> >>
> >> static void rmap_remove(struct kvm *kvm, u64 *spte)
> >> {
> >> struct kvm_mmu_page *sp;
> >> gfn_t gfn;
> >> unsigned long *rmapp;
> >>
> >> sp = page_header(__pa(spte));
> >> +
> >> + /* Let invalid sp do not access its rmap. */
> >> + if (!sp_is_valid(sp))
> >> + return;
> >> +
> >> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> >> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> >> pte_list_remove(spte, rmapp);
> >> }
> >>
> >> If yes, there is the reason why we can not do this that i mentioned before:
> >>
> >> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
> >> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
> >> ->invalidate_range_start, can not find any spte using the host page, then
> >> Accessed/Dirty for host page is missing tracked.
> >> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
> >>
> >> What's your idea?
> >
> >
> > Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> > releases mmu_lock and reacquires it again, only shadow pages
> > from the generation with which kvm_mmu_zap_all started are zapped (this
> > guarantees forward progress and eventual termination).
> >
> > kvm_mmu_zap_generation()
> > spin_lock(mmu_lock)
> > int generation = kvm->arch.mmu_generation;
> >
> > for_each_shadow_page(sp) {
> > if (sp->generation == kvm->arch.mmu_generation)
> > zap_page(sp)
> > if (spin_needbreak(mmu_lock)) {
> > kvm->arch.mmu_generation++;
> > cond_resched_lock(mmu_lock);
> > }
> > }
> >
> > kvm_mmu_zap_all()
> > spin_lock(mmu_lock)
> > for_each_shadow_page(sp) {
> > if (spin_needbreak(mmu_lock)) {
> > cond_resched_lock(mmu_lock);
> > }
> > }
> >
> > Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >
> > This addresses the main problem: excessively long hold times
> > of kvm_mmu_zap_all with very large guests.
> >
> > Do you see any problem with this logic? This was what i was thinking
> > we agreed.
>
> No. I understand it and it can work.
>
> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> (and uses lock break technique)", after some discussion, we thought "only zap
> shadow pages that are reachable from the slot's rmap" is better, that is this
> patchset does.
> (https://lkml.org/lkml/2013/4/23/73)
>
But this is not what the patch is doing. Close, but not the same :)
Instead of zapping the shadow pages reachable from the slot's rmap, the
patch does kvm_unmap_rmapp(), which drops all sptes without zapping the
shadow pages. That is why you need special code to re-init lpage_info.
What I proposed was to call zap_page() on all shadow pages reachable
from the rmap. This will take care of the lpage_info counters. Does this
make sense?
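
Roughly this, as pseudocode only (the restart handling that real zapping
needs after kvm_mmu_prepare_zap_page() is omitted):

	for_each_rmap(rmapp, slot) {
		for_each_spte(spte, rmapp) {
			struct kvm_mmu_page *sp = page_header(__pa(spte));

			/* zapping the whole sp goes through the normal
			 * unaccounting path, so lpage_info is updated */
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
		}
		if (spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
		}
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);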

> >
> > Step 2) Show that the optimization to zap only the roots is worthwhile
> > via benchmarking, and implement it.
>
> This is what i am confused. I can not understand how "zap only the roots"
> works. You mean these change?
>
> kvm_mmu_zap_generation()
> spin_lock(mmu_lock)
> int generation = kvm->arch.mmu_generation;
>
> for_each_shadow_page(sp) {
> /* Change here. */
> => if ((sp->generation == kvm->arch.mmu_generation) &&
> => sp->root_count)
> zap_page(sp)
>
> if (spin_needbreak(mmu_lock)) {
> kvm->arch.mmu_generation++;
> cond_resched_lock(mmu_lock);
> }
> }
>
> If we do this, there will have shadow pages that are linked to invalid memslot's
> rmap. How do we handle these pages and the mmu-notify issue?
>
> Thanks!
>

--
Gleb.

2013-05-06 13:10:26

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/06/2013 08:36 PM, Gleb Natapov wrote:

>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
>>> releases mmu_lock and reacquires it again, only shadow pages
>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>> guarantees forward progress and eventual termination).
>>>
>>> kvm_mmu_zap_generation()
>>> spin_lock(mmu_lock)
>>> int generation = kvm->arch.mmu_generation;
>>>
>>> for_each_shadow_page(sp) {
>>> if (sp->generation == kvm->arch.mmu_generation)
>>> zap_page(sp)
>>> if (spin_needbreak(mmu_lock)) {
>>> kvm->arch.mmu_generation++;
>>> cond_resched_lock(mmu_lock);
>>> }
>>> }
>>>
>>> kvm_mmu_zap_all()
>>> spin_lock(mmu_lock)
>>> for_each_shadow_page(sp) {
>>> if (spin_needbreak(mmu_lock)) {
>>> cond_resched_lock(mmu_lock);
>>> }
>>> }
>>>
>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>
>>> This addresses the main problem: excessively long hold times
>>> of kvm_mmu_zap_all with very large guests.
>>>
>>> Do you see any problem with this logic? This was what i was thinking
>>> we agreed.
>>
>> No. I understand it and it can work.
>>
>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
>> (and uses lock break technique)", after some discussion, we thought "only zap
>> shadow pages that are reachable from the slot's rmap" is better, that is this
>> patchset does.
>> (https://lkml.org/lkml/2013/4/23/73)
>>
> But this is not what the patch is doing. Close, but not the same :)

Okay. :)

> Instead of zapping shadow pages reachable from slot's rmap the patch
> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> That is why you need special code to re-init lpage_info. What I proposed
> was to call zap_page() on all shadow pages reachable from rmap. This
> will take care of lpage_info counters. Does this make sense?

Unfortunately, no! We still need to care about lpage_info. lpage_info is
used to count the number of guest page tables in the memslot.

For example, there is a memslot:
memslot[0].base_gfn = 0, memslot[0].npages = 100,

and there is a shadow page:
sp->role.direct = 0, sp->role.level = 4, sp->gfn = 10.

This sp is counted in memslot[0], but it cannot be found by walking
memslot[0]->rmap since there is no last-level mapping in this shadow page.
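
Roughly, the accounting happens when the sp is created, while the rmap
only ever links last-level sptes. As a sketch (the call sites below are
from memory, so treat them as illustrative):

	/* creating an indirect sp that shadows a guest page table at @gfn */
	kvm_mmu_get_page(...)
		if (!sp->role.direct)
			account_shadowed(kvm, gfn);	/* counted in @gfn's slot */

	/* only leaf mappings are added to memslot->rmap */
	mmu_set_spte(...)
		rmap_add(vcpu, sptep, gfn);

So the level-4 sp above is accounted in memslot[0]'s lpage_info but never
shows up when we walk memslot[0]->rmap.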

2013-05-06 17:25:01

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
>
> >>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> >>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> >>> releases mmu_lock and reacquires it again, only shadow pages
> >>> from the generation with which kvm_mmu_zap_all started are zapped (this
> >>> guarantees forward progress and eventual termination).
> >>>
> >>> kvm_mmu_zap_generation()
> >>> spin_lock(mmu_lock)
> >>> int generation = kvm->arch.mmu_generation;
> >>>
> >>> for_each_shadow_page(sp) {
> >>> if (sp->generation == kvm->arch.mmu_generation)
> >>> zap_page(sp)
> >>> if (spin_needbreak(mmu_lock)) {
> >>> kvm->arch.mmu_generation++;
> >>> cond_resched_lock(mmu_lock);
> >>> }
> >>> }
> >>>
> >>> kvm_mmu_zap_all()
> >>> spin_lock(mmu_lock)
> >>> for_each_shadow_page(sp) {
> >>> if (spin_needbreak(mmu_lock)) {
> >>> cond_resched_lock(mmu_lock);
> >>> }
> >>> }
> >>>
> >>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> >>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >>>
> >>> This addresses the main problem: excessively long hold times
> >>> of kvm_mmu_zap_all with very large guests.
> >>>
> >>> Do you see any problem with this logic? This was what i was thinking
> >>> we agreed.
> >>
> >> No. I understand it and it can work.
> >>
> >> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> >> (and uses lock break technique)", after some discussion, we thought "only zap
> >> shadow pages that are reachable from the slot's rmap" is better, that is this
> >> patchset does.
> >> (https://lkml.org/lkml/2013/4/23/73)
> >>
> > But this is not what the patch is doing. Close, but not the same :)
>
> Okay. :)
>
> > Instead of zapping shadow pages reachable from slot's rmap the patch
> > does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> > That is why you need special code to re-init lpage_info. What I proposed
> > was to call zap_page() on all shadow pages reachable from rmap. This
> > will take care of lpage_info counters. Does this make sense?
>
> Unfortunately, no! We still need to care lpage_info. lpage_info is used
> to count the number of guest page tables in the memslot.
>
> For example, there is a memslot:
> memslot[0].based_gfn = 0, memslot[0].npages = 100,
>
> and there is a shadow page:
> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
>
> this sp is counted in the memslot[0] but it can not be found by walking
> memslot[0]->rmap since there is no last mapping in this shadow page.
>
Right, so what about walking mmu_page_hash for each gfn belonging to the
slot that is in the process of being removed, to find those?

--
Gleb.

2013-05-06 17:46:06

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
>>
>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
>>>>> releases mmu_lock and reacquires it again, only shadow pages
>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>>>> guarantees forward progress and eventual termination).
>>>>>
>>>>> kvm_mmu_zap_generation()
>>>>> spin_lock(mmu_lock)
>>>>> int generation = kvm->arch.mmu_generation;
>>>>>
>>>>> for_each_shadow_page(sp) {
>>>>> if (sp->generation == kvm->arch.mmu_generation)
>>>>> zap_page(sp)
>>>>> if (spin_needbreak(mmu_lock)) {
>>>>> kvm->arch.mmu_generation++;
>>>>> cond_resched_lock(mmu_lock);
>>>>> }
>>>>> }
>>>>>
>>>>> kvm_mmu_zap_all()
>>>>> spin_lock(mmu_lock)
>>>>> for_each_shadow_page(sp) {
>>>>> if (spin_needbreak(mmu_lock)) {
>>>>> cond_resched_lock(mmu_lock);
>>>>> }
>>>>> }
>>>>>
>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>>>
>>>>> This addresses the main problem: excessively long hold times
>>>>> of kvm_mmu_zap_all with very large guests.
>>>>>
>>>>> Do you see any problem with this logic? This was what i was thinking
>>>>> we agreed.
>>>>
>>>> No. I understand it and it can work.
>>>>
>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
>>>> (and uses lock break technique)", after some discussion, we thought "only zap
>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
>>>> patchset does.
>>>> (https://lkml.org/lkml/2013/4/23/73)
>>>>
>>> But this is not what the patch is doing. Close, but not the same :)
>>
>> Okay. :)
>>
>>> Instead of zapping shadow pages reachable from slot's rmap the patch
>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
>>> That is why you need special code to re-init lpage_info. What I proposed
>>> was to call zap_page() on all shadow pages reachable from rmap. This
>>> will take care of lpage_info counters. Does this make sense?
>>
>> Unfortunately, no! We still need to care lpage_info. lpage_info is used
>> to count the number of guest page tables in the memslot.
>>
>> For example, there is a memslot:
>> memslot[0].based_gfn = 0, memslot[0].npages = 100,
>>
>> and there is a shadow page:
>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
>>
>> this sp is counted in the memslot[0] but it can not be found by walking
>> memslot[0]->rmap since there is no last mapping in this shadow page.
>>
> Right, so what about walking mmu_page_hash for each gfn belonging to the
> slot that is in process to be removed to find those?

That will cost lots of time. The size of the hashtable is 1 << 10. If the
memslot has 4M of memory, it will walk all the entries; the cost is about
the same as walking the active_list (maybe a little more). And a memslot
with 4M of memory is the normal case, I think.
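
i.e. the suggested walk would look something like this (sketch only):

	/* one hash lookup per gfn in the slot, just to find the few
	 * indirect sps whose sp->gfn happens to fall inside it */
	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages; gfn++) {
		for_each_gfn_indirect_valid_sp(kvm, sp, gfn)
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

		if (spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
		}
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);

For a 4M slot that is already ~1K iterations, comparable to walking the
whole active_list.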

Another point is that lpage_info stops the mmu from using large pages.
If we do not reset lpage_info, the mmu keeps using 4K pages until the
invalid sp is zapped.

2013-05-06 20:13:09

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> > On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> >>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>>>
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Fast invalid all shadow pages belong to @slot.
> >>>>>> + *
> >>>>>> + * @slot != NULL means the invalidation is caused the memslot specified
> >>>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
> >>>>>> + * and lpage-info of the @slot can not be used after calling the function.
> >>>>>> + *
> >>>>>> + * @slot == NULL means the invalidation due to other reasons, we need
> >>>>>> + * not care rmap and lpage-info since they are still valid after calling
> >>>>>> + * the function.
> >>>>>> + */
> >>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>>>> + struct kvm_memory_slot *slot)
> >>>>>> +{
> >>>>>> + spin_lock(&kvm->mmu_lock);
> >>>>>> + kvm->arch.mmu_valid_gen++;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * All shadow paes are invalid, reset the large page info,
> >>>>>> + * then we can safely desotry the memslot, it is also good
> >>>>>> + * for large page used.
> >>>>>> + */
> >>>>>> + kvm_clear_all_lpage_info(kvm);
> >>>>>
> >>>>> Xiao,
> >>>>>
> >>>>> I understood it was agreed that simple mmu_lock lockbreak while
> >>>>> avoiding zapping of newly instantiated pages upon a
> >>>>>
> >>>>> if(spin_needbreak)
> >>>>> cond_resched_lock()
> >>>>>
> >>>>> cycle was enough as a first step? And then later introduce root zapping
> >>>>> along with measurements.
> >>>>>
> >>>>> https://lkml.org/lkml/2013/4/22/544
> >>>>
> >>>> Yes, it is.
> >>>>
> >>>> See the changelog in 0/0:
> >>>>
> >>>> " we use lock-break technique to zap all sptes linked on the
> >>>> invalid rmap, it is not very effective but good for the first step."
> >>>>
> >>>> Thanks!
> >>>
> >>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> >>> zapping the root? Only lock-break technique along with generation number
> >>> was what was agreed.
> >>
> >> Marcelo,
> >>
> >> Please Wait... I am completely confused. :(
> >>
> >> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> >> Are these changes you wanted?
> >>
> >> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> struct kvm_memory_slot *slot)
> >> {
> >> spin_lock(&kvm->mmu_lock);
> >> kvm->arch.mmu_valid_gen++;
> >>
> >> /* Zero all root pages.*/
> >> restart:
> >> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> >> if (!sp->root_count)
> >> continue;
> >>
> >> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> goto restart;
> >> }
> >>
> >> /*
> >> * All shadow paes are invalid, reset the large page info,
> >> * then we can safely desotry the memslot, it is also good
> >> * for large page used.
> >> */
> >> kvm_clear_all_lpage_info(kvm);
> >>
> >> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >> spin_unlock(&kvm->mmu_lock);
> >> }
> >>
> >> static void rmap_remove(struct kvm *kvm, u64 *spte)
> >> {
> >> struct kvm_mmu_page *sp;
> >> gfn_t gfn;
> >> unsigned long *rmapp;
> >>
> >> sp = page_header(__pa(spte));
> >> +
> >> + /* Let invalid sp do not access its rmap. */
> >> + if (!sp_is_valid(sp))
> >> + return;
> >> +
> >> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> >> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> >> pte_list_remove(spte, rmapp);
> >> }
> >>
> >> If yes, there is the reason why we can not do this that i mentioned before:
> >>
> >> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
> >> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
> >> ->invalidate_range_start, can not find any spte using the host page, then
> >> Accessed/Dirty for host page is missing tracked.
> >> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
> >>
> >> What's your idea?
> >
> >
> > Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> > releases mmu_lock and reacquires it again, only shadow pages
> > from the generation with which kvm_mmu_zap_all started are zapped (this
> > guarantees forward progress and eventual termination).
> >
> > kvm_mmu_zap_generation()
> > spin_lock(mmu_lock)
> > int generation = kvm->arch.mmu_generation;
> >
> > for_each_shadow_page(sp) {
> > if (sp->generation == kvm->arch.mmu_generation)
> > zap_page(sp)
> > if (spin_needbreak(mmu_lock)) {
> > kvm->arch.mmu_generation++;
> > cond_resched_lock(mmu_lock);
> > }
> > }
> >
> > kvm_mmu_zap_all()
> > spin_lock(mmu_lock)
> > for_each_shadow_page(sp) {
> > if (spin_needbreak(mmu_lock)) {
> > cond_resched_lock(mmu_lock);
> > }
> > }
> >
> > Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >
> > This addresses the main problem: excessively long hold times
> > of kvm_mmu_zap_all with very large guests.
> >
> > Do you see any problem with this logic? This was what i was thinking
> > we agreed.
>
> No. I understand it and it can work.
>
> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> (and uses lock break technique)", after some discussion, we thought "only zap
> shadow pages that are reachable from the slot's rmap" is better, that is this
> patchset does.
> (https://lkml.org/lkml/2013/4/23/73)
>
> >
> > Step 2) Show that the optimization to zap only the roots is worthwhile
> > via benchmarking, and implement it.
>
> This is what i am confused. I can not understand how "zap only the roots"
> works. You mean these change?
>
> kvm_mmu_zap_generation()
> spin_lock(mmu_lock)
> int generation = kvm->arch.mmu_generation;
>
> for_each_shadow_page(sp) {
> /* Change here. */
> => if ((sp->generation == kvm->arch.mmu_generation) &&
> => sp->root_count)
> zap_page(sp)
>
> if (spin_needbreak(mmu_lock)) {
> kvm->arch.mmu_generation++;
> cond_resched_lock(mmu_lock);
> }
> }
>
> If we do this, there will have shadow pages that are linked to invalid memslot's
> rmap. How do we handle these pages and the mmu-notify issue?
>
> Thanks!

By "zap only roots" i mean zapping roots plus generation number on
shadow pages. But this as a second step, after it has been demonstrated
its worthwhile.

2013-05-07 03:40:10

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/07/2013 03:50 AM, Marcelo Tosatti wrote:
> On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
>> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
>>> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
>>>> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
>>>>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>>>>>
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Fast invalid all shadow pages belong to @slot.
>>>>>>>> + *
>>>>>>>> + * @slot != NULL means the invalidation is caused the memslot specified
>>>>>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
>>>>>>>> + * and lpage-info of the @slot can not be used after calling the function.
>>>>>>>> + *
>>>>>>>> + * @slot == NULL means the invalidation due to other reasons, we need
>>>>>>>> + * not care rmap and lpage-info since they are still valid after calling
>>>>>>>> + * the function.
>>>>>>>> + */
>>>>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>>>>>> + struct kvm_memory_slot *slot)
>>>>>>>> +{
>>>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>>>> + kvm->arch.mmu_valid_gen++;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * All shadow paes are invalid, reset the large page info,
>>>>>>>> + * then we can safely desotry the memslot, it is also good
>>>>>>>> + * for large page used.
>>>>>>>> + */
>>>>>>>> + kvm_clear_all_lpage_info(kvm);
>>>>>>>
>>>>>>> Xiao,
>>>>>>>
>>>>>>> I understood it was agreed that simple mmu_lock lockbreak while
>>>>>>> avoiding zapping of newly instantiated pages upon a
>>>>>>>
>>>>>>> if(spin_needbreak)
>>>>>>> cond_resched_lock()
>>>>>>>
>>>>>>> cycle was enough as a first step? And then later introduce root zapping
>>>>>>> along with measurements.
>>>>>>>
>>>>>>> https://lkml.org/lkml/2013/4/22/544
>>>>>>
>>>>>> Yes, it is.
>>>>>>
>>>>>> See the changelog in 0/0:
>>>>>>
>>>>>> " we use lock-break technique to zap all sptes linked on the
>>>>>> invalid rmap, it is not very effective but good for the first step."
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
>>>>> zapping the root? Only lock-break technique along with generation number
>>>>> was what was agreed.
>>>>
>>>> Marcelo,
>>>>
>>>> Please Wait... I am completely confused. :(
>>>>
>>>> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
>>>> Are these changes you wanted?
>>>>
>>>> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>> struct kvm_memory_slot *slot)
>>>> {
>>>> spin_lock(&kvm->mmu_lock);
>>>> kvm->arch.mmu_valid_gen++;
>>>>
>>>> /* Zero all root pages.*/
>>>> restart:
>>>> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
>>>> if (!sp->root_count)
>>>> continue;
>>>>
>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>> goto restart;
>>>> }
>>>>
>>>> /*
>>>> * All shadow paes are invalid, reset the large page info,
>>>> * then we can safely desotry the memslot, it is also good
>>>> * for large page used.
>>>> */
>>>> kvm_clear_all_lpage_info(kvm);
>>>>
>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>> spin_unlock(&kvm->mmu_lock);
>>>> }
>>>>
>>>> static void rmap_remove(struct kvm *kvm, u64 *spte)
>>>> {
>>>> struct kvm_mmu_page *sp;
>>>> gfn_t gfn;
>>>> unsigned long *rmapp;
>>>>
>>>> sp = page_header(__pa(spte));
>>>> +
>>>> + /* Let invalid sp do not access its rmap. */
>>>> + if (!sp_is_valid(sp))
>>>> + return;
>>>> +
>>>> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
>>>> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
>>>> pte_list_remove(spte, rmapp);
>>>> }
>>>>
>>>> If yes, there is the reason why we can not do this that i mentioned before:
>>>>
>>>> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
>>>> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
>>>> ->invalidate_range_start, can not find any spte using the host page, then
>>>> Accessed/Dirty for host page is missing tracked.
>>>> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
>>>>
>>>> What's your idea?
>>>
>>>
>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
>>> releases mmu_lock and reacquires it again, only shadow pages
>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>> guarantees forward progress and eventual termination).
>>>
>>> kvm_mmu_zap_generation()
>>> spin_lock(mmu_lock)
>>> int generation = kvm->arch.mmu_generation;
>>>
>>> for_each_shadow_page(sp) {
>>> if (sp->generation == kvm->arch.mmu_generation)
>>> zap_page(sp)
>>> if (spin_needbreak(mmu_lock)) {
>>> kvm->arch.mmu_generation++;
>>> cond_resched_lock(mmu_lock);
>>> }
>>> }
>>>
>>> kvm_mmu_zap_all()
>>> spin_lock(mmu_lock)
>>> for_each_shadow_page(sp) {
>>> if (spin_needbreak(mmu_lock)) {
>>> cond_resched_lock(mmu_lock);
>>> }
>>> }
>>>
>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>
>>> This addresses the main problem: excessively long hold times
>>> of kvm_mmu_zap_all with very large guests.
>>>
>>> Do you see any problem with this logic? This was what i was thinking
>>> we agreed.
>>
>> No. I understand it and it can work.
>>
>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
>> (and uses lock break technique)", after some discussion, we thought "only zap
>> shadow pages that are reachable from the slot's rmap" is better, that is this
>> patchset does.
>> (https://lkml.org/lkml/2013/4/23/73)
>>
>>>
>>> Step 2) Show that the optimization to zap only the roots is worthwhile
>>> via benchmarking, and implement it.
>>
>> This is what i am confused. I can not understand how "zap only the roots"
>> works. You mean these change?
>>
>> kvm_mmu_zap_generation()
>> spin_lock(mmu_lock)
>> int generation = kvm->arch.mmu_generation;
>>
>> for_each_shadow_page(sp) {
>> /* Change here. */
>> => if ((sp->generation == kvm->arch.mmu_generation) &&
>> => sp->root_count)
>> zap_page(sp)
>>
>> if (spin_needbreak(mmu_lock)) {
>> kvm->arch.mmu_generation++;
>> cond_resched_lock(mmu_lock);
>> }
>> }
>>
>> If we do this, there will have shadow pages that are linked to invalid memslot's
>> rmap. How do we handle these pages and the mmu-notify issue?
>>
>> Thanks!
>
> By "zap only roots" i mean zapping roots plus generation number on
> shadow pages. But this as a second step, after it has been demonstrated
> its worthwhile.

Marcelo,

Sorry for my stupidity, still do not understand. Could you please show me the
pseudocode and answer my questions above?




2013-05-07 08:58:55

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> >> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> >>
> >>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> >>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> >>>>> releases mmu_lock and reacquires it again, only shadow pages
> >>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
> >>>>> guarantees forward progress and eventual termination).
> >>>>>
> >>>>> kvm_mmu_zap_generation()
> >>>>> spin_lock(mmu_lock)
> >>>>> int generation = kvm->arch.mmu_generation;
> >>>>>
> >>>>> for_each_shadow_page(sp) {
> >>>>> if (sp->generation == kvm->arch.mmu_generation)
> >>>>> zap_page(sp)
> >>>>> if (spin_needbreak(mmu_lock)) {
> >>>>> kvm->arch.mmu_generation++;
> >>>>> cond_resched_lock(mmu_lock);
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> kvm_mmu_zap_all()
> >>>>> spin_lock(mmu_lock)
> >>>>> for_each_shadow_page(sp) {
> >>>>> if (spin_needbreak(mmu_lock)) {
> >>>>> cond_resched_lock(mmu_lock);
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> >>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >>>>>
> >>>>> This addresses the main problem: excessively long hold times
> >>>>> of kvm_mmu_zap_all with very large guests.
> >>>>>
> >>>>> Do you see any problem with this logic? This was what i was thinking
> >>>>> we agreed.
> >>>>
> >>>> No. I understand it and it can work.
> >>>>
> >>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> >>>> (and uses lock break technique)", after some discussion, we thought "only zap
> >>>> shadow pages that are reachable from the slot's rmap" is better, that is this
> >>>> patchset does.
> >>>> (https://lkml.org/lkml/2013/4/23/73)
> >>>>
> >>> But this is not what the patch is doing. Close, but not the same :)
> >>
> >> Okay. :)
> >>
> >>> Instead of zapping shadow pages reachable from slot's rmap the patch
> >>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> >>> That is why you need special code to re-init lpage_info. What I proposed
> >>> was to call zap_page() on all shadow pages reachable from rmap. This
> >>> will take care of lpage_info counters. Does this make sense?
> >>
> >> Unfortunately, no! We still need to care lpage_info. lpage_info is used
> >> to count the number of guest page tables in the memslot.
> >>
> >> For example, there is a memslot:
> >> memslot[0].based_gfn = 0, memslot[0].npages = 100,
> >>
> >> and there is a shadow page:
> >> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
> >>
> >> this sp is counted in the memslot[0] but it can not be found by walking
> >> memslot[0]->rmap since there is no last mapping in this shadow page.
> >>
> > Right, so what about walking mmu_page_hash for each gfn belonging to the
> > slot that is in process to be removed to find those?
>
> That will cost lots of time. The size of hashtable is 1 << 10. If the
> memslot has 4M memory, it will walk all the entries, the cost is the same
> as walking active_list (maybe litter more). And a memslot has 4M memory is
> the normal case i think.
>
Memslots will be much bigger with memory hotplug. Lock break should
obviously be used while walking mmu_page_hash, but iterating over the
entire memslot gfn space to find the few gfns that may be there is still
suboptimal. We can keep a list of them in the memslot itself.
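
Something along these lines, purely as a sketch (none of this exists
today; the field and link names are made up):

	struct kvm_memory_slot {
		...
		/* hypothetical: indirect sps whose sp->gfn is in this slot */
		struct list_head shadowed_sps;
	};

	/* added where the sp is created and account_shadowed() is done */
	list_add(&sp->slot_link, &slot->shadowed_sps);

	/* on slot removal, the pages to zap are found directly */
	list_for_each_entry_safe(sp, n, &slot->shadowed_sps, slot_link)
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	kvm_mmu_commit_zap_page(kvm, &invalid_list);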

> Another point is that lpage_info stops mmu to use large page. If we
> do not reset lpage_info, mmu is using 4K page until the invalid-sp is
> zapped.
>
I do not think this is a big issue. If lpage_info prevented the use of
large pages for some memory ranges before we zapped all the shadow pages,
it was probably for a reason, so a new shadow page will prevent large
pages from being created for the same memory ranges.

--
Gleb.

2013-05-07 09:41:48

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
>> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
>>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
>>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
>>>>
>>>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
>>>>>>> releases mmu_lock and reacquires it again, only shadow pages
>>>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>>>>>> guarantees forward progress and eventual termination).
>>>>>>>
>>>>>>> kvm_mmu_zap_generation()
>>>>>>> spin_lock(mmu_lock)
>>>>>>> int generation = kvm->arch.mmu_generation;
>>>>>>>
>>>>>>> for_each_shadow_page(sp) {
>>>>>>> if (sp->generation == kvm->arch.mmu_generation)
>>>>>>> zap_page(sp)
>>>>>>> if (spin_needbreak(mmu_lock)) {
>>>>>>> kvm->arch.mmu_generation++;
>>>>>>> cond_resched_lock(mmu_lock);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> kvm_mmu_zap_all()
>>>>>>> spin_lock(mmu_lock)
>>>>>>> for_each_shadow_page(sp) {
>>>>>>> if (spin_needbreak(mmu_lock)) {
>>>>>>> cond_resched_lock(mmu_lock);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>>>>>
>>>>>>> This addresses the main problem: excessively long hold times
>>>>>>> of kvm_mmu_zap_all with very large guests.
>>>>>>>
>>>>>>> Do you see any problem with this logic? This was what i was thinking
>>>>>>> we agreed.
>>>>>>
>>>>>> No. I understand it and it can work.
>>>>>>
>>>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
>>>>>> (and uses lock break technique)", after some discussion, we thought "only zap
>>>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
>>>>>> patchset does.
>>>>>> (https://lkml.org/lkml/2013/4/23/73)
>>>>>>
>>>>> But this is not what the patch is doing. Close, but not the same :)
>>>>
>>>> Okay. :)
>>>>
>>>>> Instead of zapping shadow pages reachable from slot's rmap the patch
>>>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
>>>>> That is why you need special code to re-init lpage_info. What I proposed
>>>>> was to call zap_page() on all shadow pages reachable from rmap. This
>>>>> will take care of lpage_info counters. Does this make sense?
>>>>
>>>> Unfortunately, no! We still need to care lpage_info. lpage_info is used
>>>> to count the number of guest page tables in the memslot.
>>>>
>>>> For example, there is a memslot:
>>>> memslot[0].based_gfn = 0, memslot[0].npages = 100,
>>>>
>>>> and there is a shadow page:
>>>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
>>>>
>>>> this sp is counted in the memslot[0] but it can not be found by walking
>>>> memslot[0]->rmap since there is no last mapping in this shadow page.
>>>>
>>> Right, so what about walking mmu_page_hash for each gfn belonging to the
>>> slot that is in process to be removed to find those?
>>
>> That will cost lots of time. The size of hashtable is 1 << 10. If the
>> memslot has 4M memory, it will walk all the entries, the cost is the same
>> as walking active_list (maybe litter more). And a memslot has 4M memory is
>> the normal case i think.
>>
> Memslots will be much bigger with memory hotplug. Lock break should be
> used while walking mmu_page_hash obviously, but still iterating over
> entire memslot gfn space to find a few gfn that may be there is
> suboptimal. We can keep a list of them in the memslot itself.

It sounds good to me.

BTW, this approach looks more complex and uses more memory (a new
list_head added to every shadow page), so why do you dislike clearing
lpage_info? ;)

>
>> Another point is that lpage_info stops mmu to use large page. If we
>> do not reset lpage_info, mmu is using 4K page until the invalid-sp is
>> zapped.
>>
> I do not think this is a big issue. If lpage_info prevented the use of
> large pages for some memory ranges before we zapped entire shadow pages
> it was probably for a reason, so new shadow page will prevent large
> pages from been created for the same memory ranges.

I am still worried, but I will try it if Marcelo does not object.
Thanks a lot for your valuable suggestion, Gleb!

Now, I am trying my best to catch Marcelo's idea of "zapping root
pages", but......


2013-05-07 10:01:11

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> >>>>
> >>>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> >>>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> >>>>>>> releases mmu_lock and reacquires it again, only shadow pages
> >>>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
> >>>>>>> guarantees forward progress and eventual termination).
> >>>>>>>
> >>>>>>> kvm_mmu_zap_generation()
> >>>>>>> spin_lock(mmu_lock)
> >>>>>>> int generation = kvm->arch.mmu_generation;
> >>>>>>>
> >>>>>>> for_each_shadow_page(sp) {
> >>>>>>> if (sp->generation == kvm->arch.mmu_generation)
> >>>>>>> zap_page(sp)
> >>>>>>> if (spin_needbreak(mmu_lock)) {
> >>>>>>> kvm->arch.mmu_generation++;
> >>>>>>> cond_resched_lock(mmu_lock);
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> kvm_mmu_zap_all()
> >>>>>>> spin_lock(mmu_lock)
> >>>>>>> for_each_shadow_page(sp) {
> >>>>>>> if (spin_needbreak(mmu_lock)) {
> >>>>>>> cond_resched_lock(mmu_lock);
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> >>>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >>>>>>>
> >>>>>>> This addresses the main problem: excessively long hold times
> >>>>>>> of kvm_mmu_zap_all with very large guests.
> >>>>>>>
> >>>>>>> Do you see any problem with this logic? This was what i was thinking
> >>>>>>> we agreed.
> >>>>>>
> >>>>>> No. I understand it and it can work.
> >>>>>>
> >>>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> >>>>>> (and uses lock break technique)", after some discussion, we thought "only zap
> >>>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
> >>>>>> patchset does.
> >>>>>> (https://lkml.org/lkml/2013/4/23/73)
> >>>>>>
> >>>>> But this is not what the patch is doing. Close, but not the same :)
> >>>>
> >>>> Okay. :)
> >>>>
> >>>>> Instead of zapping shadow pages reachable from slot's rmap the patch
> >>>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> >>>>> That is why you need special code to re-init lpage_info. What I proposed
> >>>>> was to call zap_page() on all shadow pages reachable from rmap. This
> >>>>> will take care of lpage_info counters. Does this make sense?
> >>>>
> >>>> Unfortunately, no! We still need to care lpage_info. lpage_info is used
> >>>> to count the number of guest page tables in the memslot.
> >>>>
> >>>> For example, there is a memslot:
> >>>> memslot[0].based_gfn = 0, memslot[0].npages = 100,
> >>>>
> >>>> and there is a shadow page:
> >>>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
> >>>>
> >>>> this sp is counted in the memslot[0] but it can not be found by walking
> >>>> memslot[0]->rmap since there is no last mapping in this shadow page.
> >>>>
> >>> Right, so what about walking mmu_page_hash for each gfn belonging to the
> >>> slot that is in process to be removed to find those?
> >>
> >> That will cost lots of time. The size of hashtable is 1 << 10. If the
> >> memslot has 4M memory, it will walk all the entries, the cost is the same
> >> as walking active_list (maybe litter more). And a memslot has 4M memory is
> >> the normal case i think.
> >>
> > Memslots will be much bigger with memory hotplug. Lock break should be
> > used while walking mmu_page_hash obviously, but still iterating over
> > entire memslot gfn space to find a few gfn that may be there is
> > suboptimal. We can keep a list of them in the memslot itself.
>
> It sounds good to me.
>
> BTW, this approach looks more complex and use more memory (new list_head
> added into every shadow page) used, why you dislike clearing lpage_info? ;)
>
Looks a little bit hackish, but now that I see we do not have an easy
way to find all the shadow pages counted in lpage_info, I am not entirely
against it. If you convince Marcelo that clearing lpage_info like that is
a good idea, I may reconsider. Regardless of how lpage_info is tracked,
though, I think having a way to find all shadow pages that reference a
memslot is a good thing.

> >
> >> Another point is that lpage_info stops mmu to use large page. If we
> >> do not reset lpage_info, mmu is using 4K page until the invalid-sp is
> >> zapped.
> >>
> > I do not think this is a big issue. If lpage_info prevented the use of
> > large pages for some memory ranges before we zapped entire shadow pages
> > it was probably for a reason, so new shadow page will prevent large
> > pages from been created for the same memory ranges.
>
> Still worried, but I will try it if Marcelo does not have objects.
> Thanks a lot for your valuable suggestion, Gleb!
>
> Now, i am trying my best to catch Marcelo's idea of "zapping root
> pages", but......
>
Yes, I am missing what Marcelo means there too. We cannot free the
memslot until we unmap its rmap one way or another.

--
Gleb.

2013-05-07 14:34:31

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 01:00:51PM +0300, Gleb Natapov wrote:
> On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> > On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> > >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> > >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> > >>>>
> > >>>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > >>>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> > >>>>>>> releases mmu_lock and reacquires it again, only shadow pages
> > >>>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
> > >>>>>>> guarantees forward progress and eventual termination).
> > >>>>>>>
> > >>>>>>> kvm_mmu_zap_generation()
> > >>>>>>> spin_lock(mmu_lock)
> > >>>>>>> int generation = kvm->arch.mmu_generation;
> > >>>>>>>
> > >>>>>>> for_each_shadow_page(sp) {
> > >>>>>>> if (sp->generation == kvm->arch.mmu_generation)
> > >>>>>>> zap_page(sp)
> > >>>>>>> if (spin_needbreak(mmu_lock)) {
> > >>>>>>> kvm->arch.mmu_generation++;
> > >>>>>>> cond_resched_lock(mmu_lock);
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> kvm_mmu_zap_all()
> > >>>>>>> spin_lock(mmu_lock)
> > >>>>>>> for_each_shadow_page(sp) {
> > >>>>>>> if (spin_needbreak(mmu_lock)) {
> > >>>>>>> cond_resched_lock(mmu_lock);
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > >>>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> > >>>>>>>
> > >>>>>>> This addresses the main problem: excessively long hold times
> > >>>>>>> of kvm_mmu_zap_all with very large guests.
> > >>>>>>>
> > >>>>>>> Do you see any problem with this logic? This was what i was thinking
> > >>>>>>> we agreed.
> > >>>>>>
> > >>>>>> No. I understand it and it can work.
> > >>>>>>
> > >>>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> > >>>>>> (and uses lock break technique)", after some discussion, we thought "only zap
> > >>>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
> > >>>>>> patchset does.
> > >>>>>> (https://lkml.org/lkml/2013/4/23/73)
> > >>>>>>
> > >>>>> But this is not what the patch is doing. Close, but not the same :)
> > >>>>
> > >>>> Okay. :)
> > >>>>
> > >>>>> Instead of zapping shadow pages reachable from slot's rmap the patch
> > >>>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> > >>>>> That is why you need special code to re-init lpage_info. What I proposed
> > >>>>> was to call zap_page() on all shadow pages reachable from rmap. This
> > >>>>> will take care of lpage_info counters. Does this make sense?
> > >>>>
> > >>>> Unfortunately, no! We still need to care lpage_info. lpage_info is used
> > >>>> to count the number of guest page tables in the memslot.
> > >>>>
> > >>>> For example, there is a memslot:
> > >>>> memslot[0].based_gfn = 0, memslot[0].npages = 100,
> > >>>>
> > >>>> and there is a shadow page:
> > >>>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
> > >>>>
> > >>>> this sp is counted in the memslot[0] but it can not be found by walking
> > >>>> memslot[0]->rmap since there is no last mapping in this shadow page.
> > >>>>
> > >>> Right, so what about walking mmu_page_hash for each gfn belonging to the
> > >>> slot that is in process to be removed to find those?
> > >>
> > >> That will cost lots of time. The size of hashtable is 1 << 10. If the
> > >> memslot has 4M memory, it will walk all the entries, the cost is the same
> > >> as walking active_list (maybe litter more). And a memslot has 4M memory is
> > >> the normal case i think.
> > >>
> > > Memslots will be much bigger with memory hotplug. Lock break should be
> > > used while walking mmu_page_hash obviously, but still iterating over
> > > entire memslot gfn space to find a few gfn that may be there is
> > > suboptimal. We can keep a list of them in the memslot itself.
> >
> > It sounds good to me.
> >
> > BTW, this approach looks more complex and use more memory (new list_head
> > added into every shadow page) used, why you dislike clearing lpage_info? ;)
> >
> Looks a little bit hackish, but now that I see we do not have easy way
> to find all shadow pages counted in lpage_info I am not entirely against
> it. If you convince Marcelo that clearing lpage_info like that is a good
> idea I may reconsider. I think, regardless of tracking lpage_info,
> having a way to find all shadow pages that reference a memslot is a good
> thing though.
>
> > >
> > >> Another point is that lpage_info stops mmu to use large page. If we
> > >> do not reset lpage_info, mmu is using 4K page until the invalid-sp is
> > >> zapped.
> > >>
> > > I do not think this is a big issue. If lpage_info prevented the use of
> > > large pages for some memory ranges before we zapped entire shadow pages
> > > it was probably for a reason, so new shadow page will prevent large
> > > pages from been created for the same memory ranges.
> >
> > Still worried, but I will try it if Marcelo does not have objects.
> > Thanks a lot for your valuable suggestion, Gleb!
> >
> > Now, i am trying my best to catch Marcelo's idea of "zapping root
> > pages", but......
> >
> Yes, I am missing what Marcelo means there too. We cannot free memslot
> until we unmap its rmap one way or the other.

I do not understand what you are optimizing for, given the four possible
cases we discussed at

https://lkml.org/lkml/2013/4/18/280

That is, why is a simple for_each_all_shadow_page(zap_page) not sufficient?

2013-05-07 14:43:36

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 11:39:59AM +0800, Xiao Guangrong wrote:
> >>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> >>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> >>> releases mmu_lock and reacquires it again, only shadow pages
> >>> from the generation with which kvm_mmu_zap_all started are zapped (this
> >>> guarantees forward progress and eventual termination).
> >>>
> >>> kvm_mmu_zap_generation()
> >>> spin_lock(mmu_lock)
> >>> int generation = kvm->arch.mmu_generation;
> >>>
> >>> for_each_shadow_page(sp) {
> >>> if (sp->generation == kvm->arch.mmu_generation)
> >>> zap_page(sp)
> >>> if (spin_needbreak(mmu_lock)) {
> >>> kvm->arch.mmu_generation++;
> >>> cond_resched_lock(mmu_lock);
> >>> }
> >>> }
> >>>
> >>> kvm_mmu_zap_all()
> >>> spin_lock(mmu_lock)
> >>> for_each_shadow_page(sp) {
> >>> if (spin_needbreak(mmu_lock)) {
> >>> cond_resched_lock(mmu_lock);
> >>> }
> >>> }
> >>>
> >>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> >>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >>>
> >>> This addresses the main problem: excessively long hold times
> >>> of kvm_mmu_zap_all with very large guests.
> >>>
> >>> Do you see any problem with this logic? This was what i was thinking
> >>> we agreed.
> >>
> >> No. I understand it and it can work.
> >>
> >> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> >> (and uses lock break technique)", after some discussion, we thought "only zap
> >> shadow pages that are reachable from the slot's rmap" is better, that is this
> >> patchset does.
> >> (https://lkml.org/lkml/2013/4/23/73)
> >>
> >>>
> >>> Step 2) Show that the optimization to zap only the roots is worthwhile
> >>> via benchmarking, and implement it.
> >>
> >> This is what i am confused. I can not understand how "zap only the roots"
> >> works. You mean these change?
> >>
> >> kvm_mmu_zap_generation()
> >> spin_lock(mmu_lock)
> >> int generation = kvm->arch.mmu_generation;
> >>
> >> for_each_shadow_page(sp) {
> >> /* Change here. */
> >> => if ((sp->generation == kvm->arch.mmu_generation) &&
> >> => sp->root_count)
> >> zap_page(sp)
> >>
> >> if (spin_needbreak(mmu_lock)) {
> >> kvm->arch.mmu_generation++;
> >> cond_resched_lock(mmu_lock);
> >> }
> >> }
> >>
> >> If we do this, there will have shadow pages that are linked to invalid memslot's
> >> rmap. How do we handle these pages and the mmu-notify issue?

No, this is a full kvm_mmu_zap_page().

In step 2, after demonstrating and understanding kvm_mmu_zap_page()'s
inefficiency (which we are not certain about, given the four use cases of
slot deletion/move/create), use something smarter than a plain
kvm_mmu_zap_page().

> >> Thanks!
> >
> > By "zap only roots" i mean zapping roots plus generation number on
> > shadow pages. But this as a second step, after it has been demonstrated
> > its worthwhile.
>
> Marcelo,
>
> Sorry for my stupidity, still do not understand. Could you please show me the
> pseudocode and answer my questions above?

Hopefully it's clear now?

2013-05-07 14:57:21

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 11:33:29AM -0300, Marcelo Tosatti wrote:
> On Tue, May 07, 2013 at 01:00:51PM +0300, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> > > On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > > > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> > > >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > > >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> > > >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> > > >>>>
> > > >>>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > > >>>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> > > >>>>>>> releases mmu_lock and reacquires it again, only shadow pages
> > > >>>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
> > > >>>>>>> guarantees forward progress and eventual termination).
> > > >>>>>>>
> > > >>>>>>> kvm_mmu_zap_generation()
> > > >>>>>>> spin_lock(mmu_lock)
> > > >>>>>>> int generation = kvm->arch.mmu_generation;
> > > >>>>>>>
> > > >>>>>>> for_each_shadow_page(sp) {
> > > >>>>>>> if (sp->generation == kvm->arch.mmu_generation)
> > > >>>>>>> zap_page(sp)
> > > >>>>>>> if (spin_needbreak(mmu_lock)) {
> > > >>>>>>> kvm->arch.mmu_generation++;
> > > >>>>>>> cond_resched_lock(mmu_lock);
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> kvm_mmu_zap_all()
> > > >>>>>>> spin_lock(mmu_lock)
> > > >>>>>>> for_each_shadow_page(sp) {
> > > >>>>>>> if (spin_needbreak(mmu_lock)) {
> > > >>>>>>> cond_resched_lock(mmu_lock);
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > > >>>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> > > >>>>>>>
> > > >>>>>>> This addresses the main problem: excessively long hold times
> > > >>>>>>> of kvm_mmu_zap_all with very large guests.
> > > >>>>>>>
> > > >>>>>>> Do you see any problem with this logic? This was what i was thinking
> > > >>>>>>> we agreed.
> > > >>>>>>
> > > >>>>>> No. I understand it and it can work.
> > > >>>>>>
> > > >>>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> > > >>>>>> (and uses lock break technique)", after some discussion, we thought "only zap
> > > >>>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
> > > >>>>>> patchset does.
> > > >>>>>> (https://lkml.org/lkml/2013/4/23/73)
> > > >>>>>>
> > > >>>>> But this is not what the patch is doing. Close, but not the same :)
> > > >>>>
> > > >>>> Okay. :)
> > > >>>>
> > > >>>>> Instead of zapping shadow pages reachable from slot's rmap the patch
> > > >>>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> > > >>>>> That is why you need special code to re-init lpage_info. What I proposed
> > > >>>>> was to call zap_page() on all shadow pages reachable from rmap. This
> > > >>>>> will take care of lpage_info counters. Does this make sense?
> > > >>>>
> > > >>>> Unfortunately, no! We still need to care lpage_info. lpage_info is used
> > > >>>> to count the number of guest page tables in the memslot.
> > > >>>>
> > > >>>> For example, there is a memslot:
> > > >>>> memslot[0].based_gfn = 0, memslot[0].npages = 100,
> > > >>>>
> > > >>>> and there is a shadow page:
> > > >>>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
> > > >>>>
> > > >>>> this sp is counted in the memslot[0] but it can not be found by walking
> > > >>>> memslot[0]->rmap since there is no last mapping in this shadow page.
> > > >>>>
> > > >>> Right, so what about walking mmu_page_hash for each gfn belonging to the
> > > >>> slot that is in process to be removed to find those?
> > > >>
> > > >> That will cost lots of time. The size of hashtable is 1 << 10. If the
> > > >> memslot has 4M memory, it will walk all the entries, the cost is the same
> > > >> as walking active_list (maybe litter more). And a memslot has 4M memory is
> > > >> the normal case i think.
> > > >>
> > > > Memslots will be much bigger with memory hotplug. Lock break should be
> > > > used while walking mmu_page_hash obviously, but still iterating over
> > > > entire memslot gfn space to find a few gfn that may be there is
> > > > suboptimal. We can keep a list of them in the memslot itself.
> > >
> > > It sounds good to me.
> > >
> > > BTW, this approach looks more complex and use more memory (new list_head
> > > added into every shadow page) used, why you dislike clearing lpage_info? ;)
> > >
> > Looks a little bit hackish, but now that I see we do not have easy way
> > to find all shadow pages counted in lpage_info I am not entirely against
> > it. If you convince Marcelo that clearing lpage_info like that is a good
> > idea I may reconsider. I think, regardless of tracking lpage_info,
> > having a way to find all shadow pages that reference a memslot is a good
> > thing though.
> >
> > > >
> > > >> Another point is that lpage_info stops mmu to use large page. If we
> > > >> do not reset lpage_info, mmu is using 4K page until the invalid-sp is
> > > >> zapped.
> > > >>
> > > > I do not think this is a big issue. If lpage_info prevented the use of
> > > > large pages for some memory ranges before we zapped entire shadow pages
> > > > it was probably for a reason, so new shadow page will prevent large
> > > > pages from been created for the same memory ranges.
> > >
> > > Still worried, but I will try it if Marcelo does not have objects.
> > > Thanks a lot for your valuable suggestion, Gleb!
> > >
> > > Now, i am trying my best to catch Marcelo's idea of "zapping root
> > > pages", but......
> > >
> > Yes, I am missing what Marcelo means there too. We cannot free memslot
> > until we unmap its rmap one way or the other.
>
> I do not understand what are you optimizing for, given the four possible
> cases we discussed at
>
> https://lkml.org/lkml/2013/4/18/280
>
We are optimizing mmu_lock holding time for all of those cases.

But you cannot just do "zap roots + sp generation number increase" on
slot deletion, because you need to transfer the access/dirty information
from the rmap that is going to be deleted to the actual page before
kvm_set_memory_region() returns to the caller.

> That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
With a lock break? It is. We tried to optimize that by zapping only the
pages that reference the memslot that is going to be deleted, and zapping
all the others later when recycling old sps, but if you think this is a
premature optimization I am fine with it.
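
To be explicit about the access/dirty part: dropping an spte is what
propagates the A/D bits back to the page, roughly like this (a sketch of
what mmu_spte_clear_track_bits() does, written from memory):

	drop_spte(kvm, sptep)
		old_spte = mmu_spte_clear(sptep);
		if (old_spte & shadow_accessed_mask)
			kvm_set_pfn_accessed(spte_to_pfn(old_spte));
		if (old_spte & shadow_dirty_mask)
			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
		rmap_remove(kvm, sptep);

So whatever scheme we pick, those sptes have to be visited before the
slot's rmap is freed.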

--
Gleb.

2013-05-07 15:10:42

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:
> > > Yes, I am missing what Marcelo means there too. We cannot free memslot
> > > until we unmap its rmap one way or the other.
> >
> > I do not understand what are you optimizing for, given the four possible
> > cases we discussed at
> >
> > https://lkml.org/lkml/2013/4/18/280
> >
> We are optimizing mmu_lock holding time for all of those cases.
>
> But you cannot just "zap roots + sp gen number increase." on slot
> deletion because you need to transfer access/dirty information from rmap
> that is going to be deleted to actual page before
> kvm_set_memory_region() returns to a caller.
>
> > That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
> With a lock break? It is. We tried to optimize that by zapping only pages
> that reference memslot that is going to be deleted and zap all other
> later when recycling old sps, but if you think this is premature
> optimization I am fine with it.

If it can be shown that it is not a premature optimization, I am fine
with it.

AFAICS all cases are 1) rare and 2) not latency sensitive (as in there
is no requirement for those cases to finish in a short period of time).

2013-05-08 10:41:07

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

On Tue, May 07, 2013 at 12:09:29PM -0300, Marcelo Tosatti wrote:
> On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:
> > > > Yes, I am missing what Marcelo means there too. We cannot free memslot
> > > > until we unmap its rmap one way or the other.
> > >
> > > I do not understand what are you optimizing for, given the four possible
> > > cases we discussed at
> > >
> > > https://lkml.org/lkml/2013/4/18/280
> > >
> > We are optimizing mmu_lock holding time for all of those cases.
> >
> > But you cannot just "zap roots + sp gen number increase." on slot
> > deletion because you need to transfer access/dirty information from rmap
> > that is going to be deleted to actual page before
> > kvm_set_memory_region() returns to a caller.
> >
> > > That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
> > With a lock break? It is. We tried to optimize that by zapping only pages
> > that reference memslot that is going to be deleted and zap all other
> > later when recycling old sps, but if you think this is premature
> > optimization I am fine with it.
>
> If it can be shown that its not premature optimization, I am fine with
> it.
>
> AFAICS all cases are 1) rare and 2) not latency sensitive (as in there
> is no requirement for those cases to finish in a short period of time).
OK, let's start with a simple version. The one that goes through the
rmap turned out to be more complicated than we expected.

--
Gleb.