2013-05-22 19:56:14

by Xiao Guangrong

Subject: [PATCH v7 00/11] KVM: MMU: fast zap all shadow pages

Changelog:
V7:
1): separate some optimizations into two patches (do not reuse obsolete
pages; collapse TLB flushes), as suggested by Marcelo.

2): rebase on Gleb's diff which reduces KVM_REQ_MMU_RELOAD when the root
page is being zapped.

3): remove the call to kvm_mmu_zap_all when patching a hypercall, as
investigated by Gleb.

4): drop the patch which deleted the page from the hash list at "prepare"
time, since that can break walks based on the hash list.

5): rename kvm_mmu_invalidate_all_pages to kvm_mmu_invalidate_zap_all_pages.

6): introduce kvm_mmu_prepare_zap_obsolete_page, which is used to zap
obsolete pages so that TLB flushes can be collapsed.

V6:
1): walk the active list in reverse to skip newly created pages, based
on the comments from Gleb and Paolo.

2): completely replace kvm_mmu_zap_all with kvm_mmu_invalidate_all_pages
based on Gleb's comments.

3): improve the parameters of kvm_mmu_invalidate_all_pages based on
Gleb's comments.

4): rename kvm_mmu_invalidate_memslot_pages to kvm_mmu_invalidate_all_pages
5): rename zap_invalid_pages to kvm_zap_obsolete_pages

V5:
1): rename is_valid_sp to is_obsolete_sp
2): use the lock-break technique to zap all old pages instead of only the
pages linked on the invalid slot's rmap, as suggested by Marcelo.
3): trace invalid pages and kvm_mmu_invalidate_memslot_pages()
4): rename kvm_mmu_invalid_memslot_pages to kvm_mmu_invalidate_memslot_pages
according to Takuya's comments.

V4:
1): drop unmapping the invalid rmap outside of mmu-lock and use the
lock-break technique instead. Thanks to Gleb's comments.

2): no need to handle invalid-gen pages specially, since the page table is
always switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.

V3:
completely redesign the algorithm, please see below.

V2:
- do not reset n_requested_mmu_pages and n_max_mmu_pages
- batch free root shadow pages to reduce vcpu notification and mmu-lock
contention
- remove the first patch that introduced kvm->arch.mmu_cache, since in this
version we only 'memset zero' the hashtable rather than all mmu cache
members
- remove the unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
particularly bad if the guest uses more memory or vcpus. It does not scale
well.

* Idea
KVM maintains a global mmu generation-number which is stored in
kvm->arch.mmu_valid_gen, and every shadow page records the current global
generation-number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global
generation-number and then reloads the root shadow pages on all vcpus. Each
vcpu will create a new shadow page table according to the current
generation-number, which ensures the old pages are no longer used.

Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped using the lock-break technique.
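
For illustration, the core of the scheme ends up looking roughly like the
sketch below (this is only a sketch; the real code is introduced by the
patches in this series):

static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
        /* pages created before the last invalidation are obsolete */
        return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
{
        spin_lock(&kvm->mmu_lock);
        /* every existing shadow page becomes obsolete */
        kvm->arch.mmu_valid_gen++;

        /* force all vcpus onto new roots so no obsolete page stays in use */
        kvm_reload_remote_mmus(kvm);

        /* zap the obsolete pages with the lock-break technique */
        kvm_zap_obsolete_pages(kvm);
        spin_unlock(&kvm->mmu_lock);
}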

Gleb Natapov (1):
KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped

Xiao Guangrong (10):
KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall
KVM: MMU: drop unnecessary kvm_reload_remote_mmus
KVM: MMU: fast invalidate all pages
KVM: MMU: zap pages in batch
KVM: x86: use the fast way to invalidate all pages
KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
KVM: MMU: do not reuse the obsolete page
KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page
KVM: MMU: collapse TLB flushes when zap all pages

arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 134 ++++++++++++++++++++++++++++++++++++---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmutrace.h | 42 +++++++++---
arch/x86/kvm/x86.c | 16 +----
5 files changed, 162 insertions(+), 33 deletions(-)

--
1.7.7.6


2013-05-22 19:56:24

by Xiao Guangrong

Subject: [PATCH v7 07/11] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages

It is useful for debugging and development
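
The tracepoint can be enabled at run time like any other kvmmmu event, for
example (assuming debugfs is mounted at /sys/kernel/debug):

  echo 1 > /sys/kernel/debug/tracing/events/kvmmmu/kvm_mmu_invalidate_zap_all_pages/enable
  cat /sys/kernel/debug/tracing/trace_pipe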

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 1 +
arch/x86/kvm/mmutrace.h | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c010ace..3a3e6c5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4245,6 +4245,7 @@ restart:
void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
{
spin_lock(&kvm->mmu_lock);
+ trace_kvm_mmu_invalidate_zap_all_pages(kvm);
kvm->arch.mmu_valid_gen++;

kvm_zap_obsolete_pages(kvm);
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 697f466..eb444dd 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -276,6 +276,26 @@ TRACE_EVENT(
__spte_satisfied(old_spte), __spte_satisfied(new_spte)
)
);
+
+TRACE_EVENT(
+ kvm_mmu_invalidate_zap_all_pages,
+ TP_PROTO(struct kvm *kvm),
+ TP_ARGS(kvm),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, mmu_valid_gen)
+ __field(unsigned int, mmu_used_pages)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+ __entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+ ),
+
+ TP_printk("kvm-mmu-valid-gen %lx used_pages %x",
+ __entry->mmu_valid_gen, __entry->mmu_used_pages
+ )
+);
#endif /* _TRACE_KVMMMU_H */

#undef TRACE_INCLUDE_PATH
--
1.7.7.6

2013-05-22 19:56:32

by Xiao Guangrong

Subject: [PATCH v7 11/11] KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped

From: Gleb Natapov <[email protected]>

Quote Gleb's mail:
| why don't we check for sp->role.invalid in
| kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()?

and

| Actually we can add check for is_obsolete_sp() there too since
| kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
| after incrementing mmu_valid_gen.

[ Xiao: add some comments and the check of is_obsolete_sp() ]

Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e34056..055d675 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2097,7 +2097,13 @@ __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
kvm_mod_used_mmu_pages(kvm, -1);
} else {
list_move(&sp->link, &kvm->arch.active_mmu_pages);
- kvm_reload_remote_mmus(kvm);
+
+ /*
+ * The obsolete pages can not be used on any vcpus.
+ * See the comments in kvm_mmu_invalidate_zap_all_pages().
+ */
+ if (!sp->role.invalid && !is_obsolete_sp(kvm, sp))
+ kvm_reload_remote_mmus(kvm);
}

sp->role.invalid = 1;
--
1.7.7.6

2013-05-22 19:56:30

by Xiao Guangrong

Subject: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

kvm_zap_obsolete_pages uses the lock-break technique to zap pages;
currently it flushes the TLB every time it breaks the lock.

We can instead reload the mmu on all vcpus right after updating the
generation number, so that the obsolete pages are not in use on any vcpu;
after that we do not need to flush the TLB every time obsolete pages
are zapped.

Note: kvm_mmu_commit_zap_page is still needed before freeing the pages,
since other vcpus may be doing lockless shadow page walks.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e676356..5e34056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
restart:
list_for_each_entry_safe_reverse(sp, node,
&kvm->arch.active_mmu_pages, link) {
- int ret;
-
/*
* No obsolete page exists before new created page since
* active_mmu_pages is the FIFO list.
@@ -4254,21 +4252,24 @@ restart:
if (sp->role.invalid)
continue;

+ /*
+ * Need not flush tlb since we only zap the sp with invalid
+ * generation number.
+ */
if (batch >= BATCH_ZAP_PAGES &&
- (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+ cond_resched_lock(&kvm->mmu_lock)) {
batch = 0;
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- cond_resched_lock(&kvm->mmu_lock);
goto restart;
}

- ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- batch += ret;
-
- if (ret)
- goto restart;
+ batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
+ &invalid_list);
}

+ /*
+ * Should flush tlb before free page tables since lockless-walking
+ * may use the pages.
+ */
kvm_mmu_commit_zap_page(kvm, &invalid_list);
}

@@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
trace_kvm_mmu_invalidate_zap_all_pages(kvm);
kvm->arch.mmu_valid_gen++;

+ /*
+ * Notify all vcpus to reload its shadow page table
+ * and flush TLB. Then all vcpus will switch to new
+ * shadow page table with the new mmu_valid_gen.
+ *
+ * Note: we should do this under the protection of
+ * mmu-lock, otherwise, vcpu would purge shadow page
+ * but miss tlb flush.
+ */
+ kvm_reload_remote_mmus(kvm);
+
kvm_zap_obsolete_pages(kvm);
spin_unlock(&kvm->mmu_lock);
}
--
1.7.7.6

2013-05-22 19:57:04

by Xiao Guangrong

Subject: [PATCH v7 08/11] KVM: MMU: do not reuse the obsolete page

The obsolete page will be zapped soon; do not reuse it, so as to reduce
future page faults.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a3e6c5..9b57faa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1869,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.quadrant = quadrant;
}
for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+ if (is_obsolete_sp(vcpu->kvm, sp))
+ continue;
+
if (!need_sync && sp->unsync)
need_sync = true;

--
1.7.7.6

2013-05-22 19:57:34

by Xiao Guangrong

Subject: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

It is only used to zap obsolete pages. Since an obsolete page will not be
used again, we need not spend time finding its unsync children. Also, we
delete the page from the shadow page cache so that the page is completely
isolated after calling this function.

A later patch will use it to collapse TLB flushes.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b57faa..e676356 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
{
ASSERT(is_empty_shadow_page(sp->spt));
- hlist_del(&sp->hash_link);
+ hlist_del_init(&sp->hash_link);
list_del(&sp->link);
free_page((unsigned long)sp->spt);
if (!sp->role.direct)
@@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
}

-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct list_head *invalid_list)
+static int
+__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool zap_unsync_children,
+ struct list_head *invalid_list)
{
- int ret;
+ int ret = 0;

trace_kvm_mmu_prepare_zap_page(sp);
++kvm->stat.mmu_shadow_zapped;
- ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
+ if (likely(zap_unsync_children))
+ ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);

@@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
return ret;
}

+/*
+ * The obsolete page will not be used, we need not spend time to find
+ * its unsync children out. Also, we delete the page from shadow page
+ * cache so that the page is completely isolated after call this
+ * function.
+ *
+ * Note: if we use this function in for_each_gfn_xxx macros, we should
+ * re-walk the list when it successfully zaps one page.
+ */
+static int
+kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int ret;
+
+ WARN_ON(!is_obsolete_sp(kvm, sp));
+
+ ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
+ if (ret)
+ hlist_del_init(&sp->hash_link);
+
+ WARN_ON(ret > 1);
+ return ret;
+}
+
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
+}
+
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
{
--
1.7.7.6

2013-05-22 19:56:20

by Xiao Guangrong

Subject: [PATCH v7 04/11] KVM: MMU: zap pages in batch

Zap at least 10 pages before releasing mmu-lock to reduce the overhead
caused by re-acquiring the lock.

After this patch, kvm_zap_obsolete_pages can always make forward progress,
so update the comments accordingly.

[ It improves kernel building by 0.6% ~ 1% ]

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 35 +++++++++++------------------------
1 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f302540..688e755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4203,14 +4203,18 @@ restart:
spin_unlock(&kvm->mmu_lock);
}

+#define BATCH_ZAP_PAGES 10
static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);
+ int batch = 0;

restart:
list_for_each_entry_safe_reverse(sp, node,
&kvm->arch.active_mmu_pages, link) {
+ int ret;
+
/*
* No obsolete page exists before new created page since
* active_mmu_pages is the FIFO list.
@@ -4219,28 +4223,6 @@ restart:
break;

/*
- * Do not repeatedly zap a root page to avoid unnecessary
- * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
- * progress:
- * vcpu 0 vcpu 1
- * call vcpu_enter_guest():
- * 1): handle KVM_REQ_MMU_RELOAD
- * and require mmu-lock to
- * load mmu
- * repeat:
- * 1): zap root page and
- * send KVM_REQ_MMU_RELOAD
- *
- * 2): if (cond_resched_lock(mmu-lock))
- *
- * 2): hold mmu-lock and load mmu
- *
- * 3): see KVM_REQ_MMU_RELOAD bit
- * on vcpu->requests is set
- * then return 1 to call
- * vcpu_enter_guest() again.
- * goto repeat;
- *
* Since we are reversely walking the list and the invalid
* list will be moved to the head, skip the invalid page
* can help us to avoid the infinity list walking.
@@ -4248,13 +4230,18 @@ restart:
if (sp->role.invalid)
continue;

- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ if (batch >= BATCH_ZAP_PAGES &&
+ (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+ batch = 0;
kvm_mmu_commit_zap_page(kvm, &invalid_list);
cond_resched_lock(&kvm->mmu_lock);
goto restart;
}

- if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+ ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ batch += ret;
+
+ if (ret)
goto restart;
}

--
1.7.7.6

2013-05-22 19:57:53

by Xiao Guangrong

Subject: [PATCH v7 06/11] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints

Show sp->mmu_valid_gen

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmutrace.h | 22 ++++++++++++----------
1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b8f6172..697f466 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -7,16 +7,18 @@
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kvmmmu

-#define KVM_MMU_PAGE_FIELDS \
- __field(__u64, gfn) \
- __field(__u32, role) \
- __field(__u32, root_count) \
+#define KVM_MMU_PAGE_FIELDS \
+ __field(unsigned long, mmu_valid_gen) \
+ __field(__u64, gfn) \
+ __field(__u32, role) \
+ __field(__u32, root_count) \
__field(bool, unsync)

-#define KVM_MMU_PAGE_ASSIGN(sp) \
- __entry->gfn = sp->gfn; \
- __entry->role = sp->role.word; \
- __entry->root_count = sp->root_count; \
+#define KVM_MMU_PAGE_ASSIGN(sp) \
+ __entry->mmu_valid_gen = sp->mmu_valid_gen; \
+ __entry->gfn = sp->gfn; \
+ __entry->role = sp->role.word; \
+ __entry->root_count = sp->root_count; \
__entry->unsync = sp->unsync;

#define KVM_MMU_PAGE_PRINTK() ({ \
@@ -28,8 +30,8 @@
\
role.word = __entry->role; \
\
- trace_seq_printf(p, "sp gfn %llx %u%s q%u%s %s%s" \
- " %snxe root %u %s%c", \
+ trace_seq_printf(p, "sp gen %lx gfn %llx %u%s q%u%s %s%s" \
+ " %snxe root %u %s%c", __entry->mmu_valid_gen, \
__entry->gfn, role.level, \
role.cr4_pae ? " pae" : "", \
role.quadrant, \
--
1.7.7.6

2013-05-22 19:58:17

by Xiao Guangrong

Subject: [PATCH v7 05/11] KVM: x86: use the fast way to invalidate all pages

Replace kvm_mmu_zap_all with kvm_mmu_invalidate_zap_all_pages

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 15 ---------------
arch/x86/kvm/x86.c | 4 ++--
2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 688e755..c010ace 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4188,21 +4188,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
spin_unlock(&kvm->mmu_lock);
}

-void kvm_mmu_zap_all(struct kvm *kvm)
-{
- struct kvm_mmu_page *sp, *node;
- LIST_HEAD(invalid_list);
-
- spin_lock(&kvm->mmu_lock);
-restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
- if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
- goto restart;
-
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- spin_unlock(&kvm->mmu_lock);
-}
-
#define BATCH_ZAP_PAGES 10
static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3758ff9..15e10f7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7066,13 +7066,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_mmu_zap_all(kvm);
+ kvm_mmu_invalidate_zap_all_pages(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_arch_flush_shadow_all(kvm);
+ kvm_mmu_invalidate_zap_all_pages(kvm);
}

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
--
1.7.7.6

2013-05-22 19:56:17

by Xiao Guangrong

Subject: [PATCH v7 02/11] KVM: MMU: drop unnecessary kvm_reload_remote_mmus

It is the responsibility of kvm_mmu_zap_all to keep the mmu and the TLBs
consistent. The reload is also unnecessary after zapping all mmio sptes,
since no mmio spte lives in the root shadow page and an mmio spte cannot
be cached in the TLB.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6739b1d..3758ff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7060,16 +7060,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* If memory slot is created, or moved, we need to clear all
* mmio sptes.
*/
- if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+ if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE))
kvm_mmu_zap_mmio_sptes(kvm);
- kvm_reload_remote_mmus(kvm);
- }
}

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
kvm_mmu_zap_all(kvm);
- kvm_reload_remote_mmus(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
1.7.7.6

2013-05-22 19:58:47

by Xiao Guangrong

Subject: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

The current kvm_mmu_zap_all is really slow - it holds mmu-lock while it
walks and zaps all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things become
particularly bad if the guest uses more memory or vcpus. It does not scale
well.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu generation-number which is stored in
kvm->arch.mmu_valid_gen, and every shadow page records the current global
generation-number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the global
generation-number and then reloads the root shadow pages on all vcpus. Each
vcpu will create a new shadow page table according to the current
generation-number, which ensures the old pages are no longer used.
Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped using the lock-break technique.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 1 +
3 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..bff7d46 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ struct kvm_mmu_page {
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes; /* Reverse mapping for parent_pte */
+ unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);

#ifdef CONFIG_X86_32
@@ -529,6 +530,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
* Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f8ca2f3..f302540 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
}

+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
gva_t gaddr,
@@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,

account_shadowed(vcpu->kvm, gfn);
}
+ sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
trace_kvm_mmu_get_page(sp, true);
return sp;
@@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);
+
if (!sp->role.invalid && !sp->role.direct)
unaccount_shadowed(kvm, sp->gfn);
+
if (sp->unsync)
kvm_unlink_unsync_page(kvm, sp);
if (!sp->root_count) {
@@ -4195,6 +4203,82 @@ restart:
spin_unlock(&kvm->mmu_lock);
}

+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ LIST_HEAD(invalid_list);
+
+restart:
+ list_for_each_entry_safe_reverse(sp, node,
+ &kvm->arch.active_mmu_pages, link) {
+ /*
+ * No obsolete page exists before new created page since
+ * active_mmu_pages is the FIFO list.
+ */
+ if (!is_obsolete_sp(kvm, sp))
+ break;
+
+ /*
+ * Do not repeatedly zap a root page to avoid unnecessary
+ * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
+ * progress:
+ * vcpu 0 vcpu 1
+ * call vcpu_enter_guest():
+ * 1): handle KVM_REQ_MMU_RELOAD
+ * and require mmu-lock to
+ * load mmu
+ * repeat:
+ * 1): zap root page and
+ * send KVM_REQ_MMU_RELOAD
+ *
+ * 2): if (cond_resched_lock(mmu-lock))
+ *
+ * 2): hold mmu-lock and load mmu
+ *
+ * 3): see KVM_REQ_MMU_RELOAD bit
+ * on vcpu->requests is set
+ * then return 1 to call
+ * vcpu_enter_guest() again.
+ * goto repeat;
+ *
+ * Since we are reversely walking the list and the invalid
+ * list will be moved to the head, skip the invalid page
+ * can help us to avoid the infinity list walking.
+ */
+ if (sp->role.invalid)
+ continue;
+
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ cond_resched_lock(&kvm->mmu_lock);
+ goto restart;
+ }
+
+ if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+ goto restart;
+ }
+
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+}
+
+/*
+ * Fast invalidate all shadow pages and use lock-break technique
+ * to zap obsolete pages.
+ *
+ * It's required when memslot is being deleted or VM is being
+ * destroyed, in these cases, we should ensure that KVM MMU does
+ * not use any resource of the being-deleted slot or all slots
+ * after calling the function.
+ */
+void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
+{
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.mmu_valid_gen++;
+
+ kvm_zap_obsolete_pages(kvm);
+ spin_unlock(&kvm->mmu_lock);
+}
+
void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..922bfae 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -97,4 +97,5 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
}

+void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
#endif
--
1.7.7.6

2013-05-22 19:56:15

by Xiao Guangrong

Subject: [PATCH v7 01/11] KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall

Quote Gleb's mail:

| Back then kvm->lock protected memslot access so code like:
|
| mutex_lock(&vcpu->kvm->lock);
| kvm_mmu_zap_all(vcpu->kvm);
| mutex_unlock(&vcpu->kvm->lock);
|
| which is what 7aa81cc0 does was enough to guaranty that no vcpu will
| run while code is patched. This is no longer the case and
| mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
| so now kvm_mmu_zap_all() there is useless and the code is incorrect.

So we drop it; the underlying problem will be fixed later

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 7 -------
1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d28810..6739b1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5523,13 +5523,6 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
char instruction[3];
unsigned long rip = kvm_rip_read(vcpu);

- /*
- * Blow out the MMU to ensure that no other VCPU has an active mapping
- * to ensure that the updated hypercall appears atomically across all
- * VCPUs.
- */
- kvm_mmu_zap_all(vcpu->kvm);
-
kvm_x86_ops->patch_hypercall(vcpu, instruction);

return emulator_write_emulated(ctxt, rip, instruction, 3, NULL);
--
1.7.7.6

2013-05-23 05:57:33

by Gleb Natapov

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> It is only used to zap the obsolete page. Since the obsolete page
> will not be used, we need not spend time to find its unsync children
> out. Also, we delete the page from shadow page cache so that the page
> is completely isolated after call this function.
>
> The later patch will use it to collapse tlb flushes
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9b57faa..e676356 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> {
> ASSERT(is_empty_shadow_page(sp->spt));
> - hlist_del(&sp->hash_link);
> + hlist_del_init(&sp->hash_link);
Why do you need hlist_del_init() here? Why not move it into
kvm_mmu_prepare_zap_page(), like we discussed here:
https://patchwork.kernel.org/patch/2580351/, instead of doing
it differently for obsolete and non-obsolete pages?

> list_del(&sp->link);
> free_page((unsigned long)sp->spt);
> if (!sp->role.direct)
> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
> return zapped;
> }
>
> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> - struct list_head *invalid_list)
> +static int
> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + bool zap_unsync_children,
> + struct list_head *invalid_list)
> {
> - int ret;
> + int ret = 0;
>
> trace_kvm_mmu_prepare_zap_page(sp);
> ++kvm->stat.mmu_shadow_zapped;
> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> +
> + if (likely(zap_unsync_children))
> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> +
> kvm_mmu_page_unlink_children(kvm, sp);
> kvm_mmu_unlink_parents(kvm, sp);
>
> @@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> return ret;
> }
>
> +/*
> + * The obsolete page will not be used, we need not spend time to find
> + * its unsync children out. Also, we delete the page from shadow page
> + * cache so that the page is completely isolated after call this
> + * function.
> + *
> + * Note: if we use this function in for_each_gfn_xxx macros, we should
> + * re-walk the list when it successfully zaps one page.
> + */
> +static int
> +kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + struct list_head *invalid_list)
> +{
> + int ret;
> +
> + WARN_ON(!is_obsolete_sp(kvm, sp));
> +
> + ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
> + if (ret)
> + hlist_del_init(&sp->hash_link);
Why is hlist_del() not enough?

> +
> + WARN_ON(ret > 1);
> + return ret;
> +}
> +
> +static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + struct list_head *invalid_list)
> +{
> + return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
> +}
> +
> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> struct list_head *invalid_list)
> {
> --
> 1.7.7.6

--
Gleb.

2013-05-23 06:12:54

by Gleb Natapov

Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> it will flush tlb every time when it does lock-break
>
> We can reload mmu on all vcpus after updating the generation
> number so that the obsolete pages are not used on any vcpus,
> after that we do not need to flush tlb when obsolete pages
> are zapped
>
> Note: kvm_mmu_commit_zap_page is still needed before free
> the pages since other vcpus may be doing locklessly shadow
> page walking
>
Since obsolete pages are not accessible to lockless page walking after all
roots are reloaded, I do not understand why the additional tlb flush is
needed. Also, why should a tlb flush prevent lockless walking from using
the page? Making the page unreachable from root_hpa does that, no?

> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
> 1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> restart:
> list_for_each_entry_safe_reverse(sp, node,
> &kvm->arch.active_mmu_pages, link) {
> - int ret;
> -
> /*
> * No obsolete page exists before new created page since
> * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
> if (sp->role.invalid)
> continue;
>
> + /*
> + * Need not flush tlb since we only zap the sp with invalid
> + * generation number.
> + */
> if (batch >= BATCH_ZAP_PAGES &&
> - (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> + cond_resched_lock(&kvm->mmu_lock)) {
> batch = 0;
> - kvm_mmu_commit_zap_page(kvm, &invalid_list);
> - cond_resched_lock(&kvm->mmu_lock);
> goto restart;
> }
>
> - ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> - batch += ret;
> -
> - if (ret)
> - goto restart;
> + batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> + &invalid_list);
> }
>
> + /*
> + * Should flush tlb before free page tables since lockless-walking
> + * may use the pages.
> + */
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> }
>
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
> trace_kvm_mmu_invalidate_zap_all_pages(kvm);
> kvm->arch.mmu_valid_gen++;
>
> + /*
> + * Notify all vcpus to reload its shadow page table
> + * and flush TLB. Then all vcpus will switch to new
> + * shadow page table with the new mmu_valid_gen.
> + *
> + * Note: we should do this under the protection of
> + * mmu-lock, otherwise, vcpu would purge shadow page
> + * but miss tlb flush.
> + */
> + kvm_reload_remote_mmus(kvm);
> +
> kvm_zap_obsolete_pages(kvm);
> spin_unlock(&kvm->mmu_lock);
> }
> --
> 1.7.7.6

--
Gleb.

2013-05-23 06:13:23

by Xiao Guangrong

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>> It is only used to zap the obsolete page. Since the obsolete page
>> will not be used, we need not spend time to find its unsync children
>> out. Also, we delete the page from shadow page cache so that the page
>> is completely isolated after call this function.
>>
>> The later patch will use it to collapse tlb flushes
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9b57faa..e676356 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>> {
>> ASSERT(is_empty_shadow_page(sp->spt));
>> - hlist_del(&sp->hash_link);
>> + hlist_del_init(&sp->hash_link);
> Why do you need hlist_del_init() here? Why not move it into

Because the entry may be deleted twice. We use it like this:

kvm_mmu_prepare_zap_obsolete_page(page, list);
kvm_mmu_commit_zap_page(list);
kvm_mmu_free_page(page);

The first deletion happens in kvm_mmu_prepare_zap_obsolete_page(page), which
has already removed the page from the hash list.

> kvm_mmu_prepare_zap_page() like we discussed it here:
> https://patchwork.kernel.org/patch/2580351/ instead of doing
> it differently for obsolete and non obsolete pages?

That can break hash-list walking: we would have to rescan the
hash list every time a page is prepare-zapped.

I mentioned it in the changelog:

4): drop the patch which deleted the page from the hash list at "prepare"
time, since that can break walks based on the hash list.
>
>> list_del(&sp->link);
>> free_page((unsigned long)sp->spt);
>> if (!sp->role.direct)
>> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
>> return zapped;
>> }
>>
>> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> - struct list_head *invalid_list)
>> +static int
>> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> + bool zap_unsync_children,
>> + struct list_head *invalid_list)
>> {
>> - int ret;
>> + int ret = 0;
>>
>> trace_kvm_mmu_prepare_zap_page(sp);
>> ++kvm->stat.mmu_shadow_zapped;
>> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>> +
>> + if (likely(zap_unsync_children))
>> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>> +
>> kvm_mmu_page_unlink_children(kvm, sp);
>> kvm_mmu_unlink_parents(kvm, sp);
>>
>> @@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> return ret;
>> }
>>
>> +/*
>> + * The obsolete page will not be used, we need not spend time to find
>> + * its unsync children out. Also, we delete the page from shadow page
>> + * cache so that the page is completely isolated after call this
>> + * function.
>> + *
>> + * Note: if we use this function in for_each_gfn_xxx macros, we should
>> + * re-walk the list when it successfully zaps one page.
>> + */
>> +static int
>> +kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> + struct list_head *invalid_list)
>> +{
>> + int ret;
>> +
>> + WARN_ON(!is_obsolete_sp(kvm, sp));
>> +
>> + ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
>> + if (ret)
>> + hlist_del_init(&sp->hash_link);
> Why hlist_del() is not enough?

Because the entry will be deleted again in kvm_mmu_free_page().
I am not sure if there is a better way to do this.

2013-05-23 06:18:46

by Gleb Natapov

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >> It is only used to zap the obsolete page. Since the obsolete page
> >> will not be used, we need not spend time to find its unsync children
> >> out. Also, we delete the page from shadow page cache so that the page
> >> is completely isolated after call this function.
> >>
> >> The later patch will use it to collapse tlb flushes
> >>
> >> Signed-off-by: Xiao Guangrong <[email protected]>
> >> ---
> >> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 9b57faa..e676356 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >> {
> >> ASSERT(is_empty_shadow_page(sp->spt));
> >> - hlist_del(&sp->hash_link);
> >> + hlist_del_init(&sp->hash_link);
> > Why do you need hlist_del_init() here? Why not move it into
>
> Since the hlist will be double freed. We will it like this:
>
> kvm_mmu_prepare_zap_obsolete_page(page, list);
> kvm_mmu_commit_zap_page(list);
> kvm_mmu_free_page(page);
>
> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
> deleted the hash list.
>
> > kvm_mmu_prepare_zap_page() like we discussed it here:
> > https://patchwork.kernel.org/patch/2580351/ instead of doing
> > it differently for obsolete and non obsolete pages?
>
> It is can break the hash-list walking: we should rescan the
> hash list once the page is prepared-ly zapped.
>
> I mentioned it in the changelog:
>
> 4): drop the patch which deleted page from hash list at the "prepare"
> time since it can break the walk based on hash list.
Can you elaborate on how this can happen?

--
Gleb.

2013-05-23 06:27:12

by Xiao Guangrong

Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>> it will flush tlb every time when it does lock-break
>>
>> We can reload mmu on all vcpus after updating the generation
>> number so that the obsolete pages are not used on any vcpus,
>> after that we do not need to flush tlb when obsolete pages
>> are zapped
>>
>> Note: kvm_mmu_commit_zap_page is still needed before free
>> the pages since other vcpus may be doing locklessly shadow
>> page walking
>>
> Since obsolete pages are not accessible for lockless page walking after
> reload of all roots I do not understand why additional tlb flush is

kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a vcpu is
not running in guest mode, it does nothing except set the request bit. So
that vcpu can still start a lockless page walk after kvm_reload_remote_mmus()
has returned on another vcpu.

Like this scenario:

    VCPU 0                              VCPU 1
                                        exit when it encounters #PF

    kvm_reload_remote_mmus() {
        set vcpu1->request bit;

        do not send IPI due to
        vcpu 1 not running in
        guest mode
                                        call page-fault handler then
                                        go lockless walking !!!
        return
    }


> needed. Also why tlb flush should prevent lockless-walking from using
> the page? Making page unreachable from root_hpa does that, no?

Lockless walking disables interrupts and sets the vcpu state to
READING_SHADOW_PAGE_TABLES; that state is treated like guest mode, so
kvm_flush_remote_tlbs() will send an IPI to such a vcpu.
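
(For reference, the lockless walk entry in mmu.c looks roughly like the
sketch below; this is a paraphrase, not an exact quote of the source:)

static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
        /*
         * Keep the vcpu out of OUTSIDE_GUEST_MODE while it walks the
         * shadow page tables, so kvm_flush_remote_tlbs() sends an IPI
         * to it and, because interrupts are disabled here, has to wait
         * until the walk has finished before the IPI is handled.
         */
        local_irq_disable();
        vcpu->mode = READING_SHADOW_PAGE_TABLES;
        /* make the mode change visible before any spte is read */
        smp_mb();
}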

2013-05-23 06:32:08

by Xiao Guangrong

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 02:18 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>> will not be used, we need not spend time to find its unsync children
>>>> out. Also, we delete the page from shadow page cache so that the page
>>>> is completely isolated after call this function.
>>>>
>>>> The later patch will use it to collapse tlb flushes
>>>>
>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>> ---
>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 9b57faa..e676356 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>> {
>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>> - hlist_del(&sp->hash_link);
>>>> + hlist_del_init(&sp->hash_link);
>>> Why do you need hlist_del_init() here? Why not move it into
>>
>> Since the hlist will be double freed. We will it like this:
>>
>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>> kvm_mmu_commit_zap_page(list);
>> kvm_mmu_free_page(page);
>>
>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>> deleted the hash list.
>>
>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>> it differently for obsolete and non obsolete pages?
>>
>> It is can break the hash-list walking: we should rescan the
>> hash list once the page is prepared-ly zapped.
>>
>> I mentioned it in the changelog:
>>
>> 4): drop the patch which deleted page from hash list at the "prepare"
>> time since it can break the walk based on hash list.
> Can you elaborate on how this can happen?

Here is an example:

int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_mmu_page *sp;
        LIST_HEAD(invalid_list);
        int r;

        pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
        r = 0;
        spin_lock(&kvm->mmu_lock);
        for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
                pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
                         sp->role.word);
                r = 1;
                kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
        }
        kvm_mmu_commit_zap_page(kvm, &invalid_list);
        spin_unlock(&kvm->mmu_lock);

        return r;
}

It works fine today since kvm_mmu_prepare_zap_page does not touch the hash
list. If we deleted the hlist entry in kvm_mmu_prepare_zap_page(), this kind
of code would have to be changed to:

restart:
        for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
                pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
                         sp->role.word);
                r = 1;
                if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
                        goto restart;
        }
        kvm_mmu_commit_zap_page(kvm, &invalid_list);

2013-05-23 07:24:55

by Gleb Natapov

Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> >> it will flush tlb every time when it does lock-break
> >>
> >> We can reload mmu on all vcpus after updating the generation
> >> number so that the obsolete pages are not used on any vcpus,
> >> after that we do not need to flush tlb when obsolete pages
> >> are zapped
> >>
> >> Note: kvm_mmu_commit_zap_page is still needed before free
> >> the pages since other vcpus may be doing locklessly shadow
> >> page walking
> >>
> > Since obsolete pages are not accessible for lockless page walking after
> > reload of all roots I do not understand why additional tlb flush is
>
> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
> vcpu is not running on guest mode, it does nothing except set the request
> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
> return on other vcpu.
>
> Like this scenario:
>
> VCPU 0 VCPU 1
> exit when it encounters #PF
>
> kvm_reload_remote_mmus(){
> set vcpu1->request bit;
>
> do not send IPI due to
> vcpu 1 not running on guest mode
>
> call page-fault handler then go lockless walking !!!
> return
> }
>
>
> > needed. Also why tlb flush should prevent lockless-walking from using
> > the page? Making page unreachable from root_hpa does that, no?
>
> lockless-walking disables the interrupt and makes the vcpu state as
> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.

kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
kvm_reload_remote_mmus() does, so why can't the same scenario you describe
above happen with kvm_flush_remote_tlbs()?

--
Gleb.

2013-05-23 07:37:15

by Gleb Natapov

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
> >> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> >>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >>>> It is only used to zap the obsolete page. Since the obsolete page
> >>>> will not be used, we need not spend time to find its unsync children
> >>>> out. Also, we delete the page from shadow page cache so that the page
> >>>> is completely isolated after call this function.
> >>>>
> >>>> The later patch will use it to collapse tlb flushes
> >>>>
> >>>> Signed-off-by: Xiao Guangrong <[email protected]>
> >>>> ---
> >>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >>>> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>> index 9b57faa..e676356 100644
> >>>> --- a/arch/x86/kvm/mmu.c
> >>>> +++ b/arch/x86/kvm/mmu.c
> >>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >>>> {
> >>>> ASSERT(is_empty_shadow_page(sp->spt));
> >>>> - hlist_del(&sp->hash_link);
> >>>> + hlist_del_init(&sp->hash_link);
> >>> Why do you need hlist_del_init() here? Why not move it into
> >>
> >> Since the hlist will be double freed. We will it like this:
> >>
> >> kvm_mmu_prepare_zap_obsolete_page(page, list);
> >> kvm_mmu_commit_zap_page(list);
> >> kvm_mmu_free_page(page);
> >>
> >> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
> >> deleted the hash list.
> >>
> >>> kvm_mmu_prepare_zap_page() like we discussed it here:
> >>> https://patchwork.kernel.org/patch/2580351/ instead of doing
> >>> it differently for obsolete and non obsolete pages?
> >>
> >> It is can break the hash-list walking: we should rescan the
> >> hash list once the page is prepared-ly zapped.
> >>
> >> I mentioned it in the changelog:
> >>
> >> 4): drop the patch which deleted page from hash list at the "prepare"
> >> time since it can break the walk based on hash list.
> > Can you elaborate on how this can happen?
>
> There is a example:
>
> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
> {
> struct kvm_mmu_page *sp;
> LIST_HEAD(invalid_list);
> int r;
>
> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
> r = 0;
> spin_lock(&kvm->mmu_lock);
> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> sp->role.word);
> r = 1;
> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> }
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> spin_unlock(&kvm->mmu_lock);
>
> return r;
> }
>
> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
> be changed to:
>
> restart:
> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> sp->role.word);
> r = 1;
> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> goto restart;
> }
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>
Hmm, yes. So let's leave it as is and either always commit invalid_list
before releasing the lock in kvm_zap_obsolete_pages(), or skip obsolete
pages while walking the hash table. The former is clearer, I think.

--
Gleb.

2013-05-23 07:37:29

by Xiao Guangrong

Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>>> it will flush tlb every time when it does lock-break
>>>>
>>>> We can reload mmu on all vcpus after updating the generation
>>>> number so that the obsolete pages are not used on any vcpus,
>>>> after that we do not need to flush tlb when obsolete pages
>>>> are zapped
>>>>
>>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>>> the pages since other vcpus may be doing locklessly shadow
>>>> page walking
>>>>
>>> Since obsolete pages are not accessible for lockless page walking after
>>> reload of all roots I do not understand why additional tlb flush is
>>
>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
>> vcpu is not running on guest mode, it does nothing except set the request
>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
>> return on other vcpu.
>>
>> Like this scenario:
>>
>> VCPU 0 VCPU 1
>> exit when it encounters #PF
>>
>> kvm_reload_remote_mmus(){
>> set vcpu1->request bit;
>>
>> do not send IPI due to
>> vcpu 1 not running on guest mode
>>
>> call page-fault handler then go lockless walking !!!
>> return
>> }
>>
>>
>>> needed. Also why tlb flush should prevent lockless-walking from using
>>> the page? Making page unreachable from root_hpa does that, no?
>>
>> lockless-walking disables the interrupt and makes the vcpu state as
>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
>
> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> kvm_reload_remote_mmus() does, so why the same scenario you describe
> above cannot happen with kvm_flush_remote_tlbs()?


After calling kvm_flush_remote_tlbs(), the page still exists under
vcpu->root, so we cannot guarantee that it is not being used by another
vcpu.

But before calling kvm_mmu_commit_zap_page(), the page has already been
removed from the vcpu's page tables, so after kvm_flush_remote_tlbs() we
can ensure that other vcpus cannot find these pages.
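
(Roughly, kvm_mmu_commit_zap_page() does the following; again a paraphrase,
not the exact source:)

static void kvm_mmu_commit_zap_page(struct kvm *kvm,
                                    struct list_head *invalid_list)
{
        struct kvm_mmu_page *sp, *nsp;

        if (list_empty(invalid_list))
                return;

        /*
         * The flush waits for every vcpu that is in guest mode or in a
         * lockless walk (READING_SHADOW_PAGE_TABLES), so after it
         * returns no vcpu can still be touching these page tables.
         */
        kvm_flush_remote_tlbs(kvm);

        list_for_each_entry_safe(sp, nsp, invalid_list, link)
                kvm_mmu_free_page(sp);
}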

2013-05-23 07:39:03

by Xiao Guangrong

Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> On 05/23/2013 03:24 PM, Gleb Natapov wrote:
>> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>>>> it will flush tlb every time when it does lock-break
>>>>>
>>>>> We can reload mmu on all vcpus after updating the generation
>>>>> number so that the obsolete pages are not used on any vcpus,
>>>>> after that we do not need to flush tlb when obsolete pages
>>>>> are zapped
>>>>>
>>>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>>>> the pages since other vcpus may be doing locklessly shadow
>>>>> page walking
>>>>>
>>>> Since obsolete pages are not accessible for lockless page walking after
>>>> reload of all roots I do not understand why additional tlb flush is
>>>
>>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
>>> vcpu is not running on guest mode, it does nothing except set the request
>>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
>>> return on other vcpu.
>>>
>>> Like this scenario:
>>>
>>> VCPU 0 VCPU 1
>>> exit when it encounters #PF
>>>
>>> kvm_reload_remote_mmus(){
>>> set vcpu1->request bit;
>>>
>>> do not send IPI due to
>>> vcpu 1 not running on guest mode
>>>
>>> call page-fault handler then go lockless walking !!!
>>> return
>>> }
>>>
>>>
>>>> needed. Also why tlb flush should prevent lockless-walking from using
>>>> the page? Making page unreachable from root_hpa does that, no?
>>>
>>> lockless-walking disables the interrupt and makes the vcpu state as
>>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
>>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
>>
>> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
>> kvm_reload_remote_mmus() does, so why the same scenario you describe
>> above cannot happen with kvm_flush_remote_tlbs()?
>
>
> After call kvm_flush_remote_tlbs(), the page still exists on vcpu->root,

Sorry, should be kvm_reload_remote_mmus() here.

> so we can not protect the page is being used by other vcpu.
>
> But before call kvm_mmu_commit_zap_page(), the page has been deleted from
> vcpu's page table, after call kvm_flush_remote_tlbs(), we can ensure that
> other vcpus can not find these pages.
>
>

2013-05-23 07:50:29

by Xiao Guangrong

Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 03:37 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>>>> will not be used, we need not spend time to find its unsync children
>>>>>> out. Also, we delete the page from shadow page cache so that the page
>>>>>> is completely isolated after call this function.
>>>>>>
>>>>>> The later patch will use it to collapse tlb flushes
>>>>>>
>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>>>> ---
>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>> index 9b57faa..e676356 100644
>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>>>> {
>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>>>> - hlist_del(&sp->hash_link);
>>>>>> + hlist_del_init(&sp->hash_link);
>>>>> Why do you need hlist_del_init() here? Why not move it into
>>>>
>>>> Since the hlist will be double freed. We will it like this:
>>>>
>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>>>> kvm_mmu_commit_zap_page(list);
>>>> kvm_mmu_free_page(page);
>>>>
>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>>>> deleted the hash list.
>>>>
>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>>>> it differently for obsolete and non obsolete pages?
>>>>
>>>> It is can break the hash-list walking: we should rescan the
>>>> hash list once the page is prepared-ly zapped.
>>>>
>>>> I mentioned it in the changelog:
>>>>
>>>> 4): drop the patch which deleted page from hash list at the "prepare"
>>>> time since it can break the walk based on hash list.
>>> Can you elaborate on how this can happen?
>>
>> There is a example:
>>
>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
>> {
>> struct kvm_mmu_page *sp;
>> LIST_HEAD(invalid_list);
>> int r;
>>
>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
>> r = 0;
>> spin_lock(&kvm->mmu_lock);
>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>> sp->role.word);
>> r = 1;
>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>> }
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> spin_unlock(&kvm->mmu_lock);
>>
>> return r;
>> }
>>
>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
>> be changed to:
>>
>> restart:
>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>> sp->role.word);
>> r = 1;
>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>> goto restart;
>> }
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>
> Hmm, yes. So lets leave it as is and always commit invalid_list before

So, you mean we should drop this patch and the patch
"KVM: MMU: collapse TLB flushes when zap all pages"?

But this patch introduces very little new code; most of it reuses
the code of __kvm_mmu_prepare_zap_page...

Furthermore, though it is maybe not related to this patch, I do not think
calling mmu_zap_unsync_children() in kvm_mmu_prepare_zap_page() is necessary,
but I need to test that very carefully. Why not keep
kvm_mmu_prepare_zap_obsolete_page as a first step? :(

> releasing lock in kvm_zap_obsolete_pages() or skip obsolete pages while
> walking hash table. Former is clearer I think.
>
> --
> Gleb.
>
>
>

2013-05-23 07:56:29

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Thu, May 23, 2013 at 03:38:49PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> > On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> >> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> >>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> >>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> >>>>> it will flush tlb every time when it does lock-break
> >>>>>
> >>>>> We can reload mmu on all vcpus after updating the generation
> >>>>> number so that the obsolete pages are not used on any vcpus,
> >>>>> after that we do not need to flush tlb when obsolete pages
> >>>>> are zapped
> >>>>>
> >>>>> Note: kvm_mmu_commit_zap_page is still needed before free
> >>>>> the pages since other vcpus may be doing locklessly shadow
> >>>>> page walking
> >>>>>
> >>>> Since obsolete pages are not accessible for lockless page walking after
> >>>> reload of all roots I do not understand why additional tlb flush is
> >>>
> >>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
> >>> vcpu is not running on guest mode, it does nothing except set the request
> >>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
> >>> return on other vcpu.
> >>>
> >>> Like this scenario:
> >>>
> >>> VCPU 0 VCPU 1
> >>> exit when it encounters #PF
> >>>
> >>> kvm_reload_remote_mmus(){
> >>> set vcpu1->request bit;
> >>>
> >>> do not send IPI due to
> >>> vcpu 1 not running on guest mode
> >>>
> >>> call page-fault handler then go lockless walking !!!
> >>> return
> >>> }
> >>>
> >>>
> >>>> needed. Also why tlb flush should prevent lockless-walking from using
> >>>> the page? Making page unreachable from root_hpa does that, no?
> >>>
> >>> lockless-walking disables the interrupt and makes the vcpu state as
> >>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
> >>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
> >>
> >> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> >> kvm_reload_remote_mmus() does, so why the same scenario you describe
> >> above cannot happen with kvm_flush_remote_tlbs()?
> >
> >
> > After call kvm_flush_remote_tlbs(), the page still exists on vcpu->root,
>
> Sorry, should be kvm_reload_remote_mmus() here.
>
> > so we can not protect the page is being used by other vcpu.
> >
> > But before call kvm_mmu_commit_zap_page(), the page has been deleted from
> > vcpu's page table, after call kvm_flush_remote_tlbs(), we can ensure that
> > other vcpus can not find these pages.
> >
Ah, I see, so the barrier is needed after the page is unlinked from the
vcpu->root hierarchy.
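
To make that ordering concrete, a minimal sketch of the commit path under
discussion (an illustration of the ordering only, not the exact upstream code;
the helper name commit_zap_sketch is invented here):

/* Sketch: why the TLB flush is the barrier for lockless walkers. */
static void commit_zap_sketch(struct kvm *kvm, struct list_head *invalid_list)
{
	struct kvm_mmu_page *sp, *nsp;

	if (list_empty(invalid_list))
		return;

	/*
	 * Every page on invalid_list was already unlinked from its parent
	 * sptes by the "prepare" step, so no walker can newly reach it
	 * through vcpu->root after this point.
	 */

	/*
	 * kvm_flush_remote_tlbs() IPIs vcpus that are in guest mode or in
	 * READING_SHADOW_PAGE_TABLES, so any lockless walk that started
	 * before the unlink has finished once this returns.
	 */
	kvm_flush_remote_tlbs(kvm);

	/* Only now is it safe to actually free the pages. */
	list_for_each_entry_safe(sp, nsp, invalid_list, link)
		kvm_mmu_free_page(sp);
}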

--
Gleb.

2013-05-23 08:09:31

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
> >> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
> >>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
> >>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >>>>>> It is only used to zap the obsolete page. Since the obsolete page
> >>>>>> will not be used, we need not spend time to find its unsync children
> >>>>>> out. Also, we delete the page from shadow page cache so that the page
> >>>>>> is completely isolated after call this function.
> >>>>>>
> >>>>>> The later patch will use it to collapse tlb flushes
> >>>>>>
> >>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
> >>>>>> ---
> >>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>>>> index 9b57faa..e676356 100644
> >>>>>> --- a/arch/x86/kvm/mmu.c
> >>>>>> +++ b/arch/x86/kvm/mmu.c
> >>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >>>>>> {
> >>>>>> ASSERT(is_empty_shadow_page(sp->spt));
> >>>>>> - hlist_del(&sp->hash_link);
> >>>>>> + hlist_del_init(&sp->hash_link);
> >>>>> Why do you need hlist_del_init() here? Why not move it into
> >>>>
> >>>> Since the hlist will be double freed. We will it like this:
> >>>>
> >>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
> >>>> kvm_mmu_commit_zap_page(list);
> >>>> kvm_mmu_free_page(page);
> >>>>
> >>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
> >>>> deleted the hash list.
> >>>>
> >>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
> >>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
> >>>>> it differently for obsolete and non obsolete pages?
> >>>>
> >>>> It is can break the hash-list walking: we should rescan the
> >>>> hash list once the page is prepared-ly zapped.
> >>>>
> >>>> I mentioned it in the changelog:
> >>>>
> >>>> 4): drop the patch which deleted page from hash list at the "prepare"
> >>>> time since it can break the walk based on hash list.
> >>> Can you elaborate on how this can happen?
> >>
> >> There is a example:
> >>
> >> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
> >> {
> >> struct kvm_mmu_page *sp;
> >> LIST_HEAD(invalid_list);
> >> int r;
> >>
> >> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
> >> r = 0;
> >> spin_lock(&kvm->mmu_lock);
> >> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >> sp->role.word);
> >> r = 1;
> >> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >> }
> >> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >> spin_unlock(&kvm->mmu_lock);
> >>
> >> return r;
> >> }
> >>
> >> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
> >> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
> >> be changed to:
> >>
> >> restart:
> >> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >> sp->role.word);
> >> r = 1;
> >> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> goto restart;
> >> }
> >> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>
> > Hmm, yes. So lets leave it as is and always commit invalid_list before
>
> So, you mean drop this patch and the patch of
> KVM: MMU: collapse TLB flushes when zap all pages?
>
We still want to add kvm_reload_remote_mmus() to
kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
optimization here. So maybe skipping obsolete pages while walking
the hashtable is the better solution.

> But, we only introduced less code in this patch, most of them is reusing
> the code of __kvm_mmu_prepare_zap_page...
>
> Furthermore, maybe not related to this patch, i do not think calling
> mmu_zap_unsync_children() in kvm_mmu_prepare_zap_page() is necessary,
> but i need to test it very carefully. Why not let
> kvm_mmu_prepare_zap_obsolete_page for the first step? :(

Yes, I want Marcelo's opinion on skipping mmu_zap_unsync_children() first.

> > releasing lock in kvm_zap_obsolete_pages() or skip obsolete pages while
> > walking hash table. Former is clearer I think.
> >
> > --
> > Gleb.
> >
> >
> >

--
Gleb.

2013-05-23 08:33:56

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 04:09 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
>>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
>>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>>>>>> will not be used, we need not spend time to find its unsync children
>>>>>>>> out. Also, we delete the page from shadow page cache so that the page
>>>>>>>> is completely isolated after call this function.
>>>>>>>>
>>>>>>>> The later patch will use it to collapse tlb flushes
>>>>>>>>
>>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>>>>>> ---
>>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>>> index 9b57faa..e676356 100644
>>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>>>>>> {
>>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>>>>>> - hlist_del(&sp->hash_link);
>>>>>>>> + hlist_del_init(&sp->hash_link);
>>>>>>> Why do you need hlist_del_init() here? Why not move it into
>>>>>>
>>>>>> Since the hlist will be double freed. We will it like this:
>>>>>>
>>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>>>>>> kvm_mmu_commit_zap_page(list);
>>>>>> kvm_mmu_free_page(page);
>>>>>>
>>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>>>>>> deleted the hash list.
>>>>>>
>>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>>>>>> it differently for obsolete and non obsolete pages?
>>>>>>
>>>>>> It is can break the hash-list walking: we should rescan the
>>>>>> hash list once the page is prepared-ly zapped.
>>>>>>
>>>>>> I mentioned it in the changelog:
>>>>>>
>>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
>>>>>> time since it can break the walk based on hash list.
>>>>> Can you elaborate on how this can happen?
>>>>
>>>> There is a example:
>>>>
>>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
>>>> {
>>>> struct kvm_mmu_page *sp;
>>>> LIST_HEAD(invalid_list);
>>>> int r;
>>>>
>>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
>>>> r = 0;
>>>> spin_lock(&kvm->mmu_lock);
>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>> sp->role.word);
>>>> r = 1;
>>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>>>> }
>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>> spin_unlock(&kvm->mmu_lock);
>>>>
>>>> return r;
>>>> }
>>>>
>>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
>>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
>>>> be changed to:
>>>>
>>>> restart:
>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>> sp->role.word);
>>>> r = 1;
>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>> goto restart;
>>>> }
>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>
>>> Hmm, yes. So lets leave it as is and always commit invalid_list before
>>
>> So, you mean drop this patch and the patch of
>> KVM: MMU: collapse TLB flushes when zap all pages?
>>
> We still want to add kvm_reload_remote_mmus() to
> kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
> optimization here. So may be skipping obsolete pages while walking
> hashtable is better solution.

Okay.

I will update this patch and the later one.

>
>> But, we only introduced less code in this patch, most of them is reusing
>> the code of __kvm_mmu_prepare_zap_page...
>>
>> Furthermore, maybe not related to this patch, i do not think calling
>> mmu_zap_unsync_children() in kvm_mmu_prepare_zap_page() is necessary,
>> but i need to test it very carefully. Why not let
>> kvm_mmu_prepare_zap_obsolete_page for the first step? :(
>
> Yes, I want Marcelo opinion on skipping mmu_zap_unsync_children() first.

Okay. Thank you!

2013-05-23 11:14:11

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 04:09 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
>>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
>>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>>>>>> will not be used, we need not spend time to find its unsync children
>>>>>>>> out. Also, we delete the page from shadow page cache so that the page
>>>>>>>> is completely isolated after call this function.
>>>>>>>>
>>>>>>>> The later patch will use it to collapse tlb flushes
>>>>>>>>
>>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>>>>>> ---
>>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>>> index 9b57faa..e676356 100644
>>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>>>>>> {
>>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>>>>>> - hlist_del(&sp->hash_link);
>>>>>>>> + hlist_del_init(&sp->hash_link);
>>>>>>> Why do you need hlist_del_init() here? Why not move it into
>>>>>>
>>>>>> Since the hlist will be double freed. We will it like this:
>>>>>>
>>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>>>>>> kvm_mmu_commit_zap_page(list);
>>>>>> kvm_mmu_free_page(page);
>>>>>>
>>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>>>>>> deleted the hash list.
>>>>>>
>>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>>>>>> it differently for obsolete and non obsolete pages?
>>>>>>
>>>>>> It is can break the hash-list walking: we should rescan the
>>>>>> hash list once the page is prepared-ly zapped.
>>>>>>
>>>>>> I mentioned it in the changelog:
>>>>>>
>>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
>>>>>> time since it can break the walk based on hash list.
>>>>> Can you elaborate on how this can happen?
>>>>
>>>> There is a example:
>>>>
>>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
>>>> {
>>>> struct kvm_mmu_page *sp;
>>>> LIST_HEAD(invalid_list);
>>>> int r;
>>>>
>>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
>>>> r = 0;
>>>> spin_lock(&kvm->mmu_lock);
>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>> sp->role.word);
>>>> r = 1;
>>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>>>> }
>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>> spin_unlock(&kvm->mmu_lock);
>>>>
>>>> return r;
>>>> }
>>>>
>>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
>>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
>>>> be changed to:
>>>>
>>>> restart:
>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>> sp->role.word);
>>>> r = 1;
>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>> goto restart;
>>>> }
>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>
>>> Hmm, yes. So lets leave it as is and always commit invalid_list before
>>
>> So, you mean drop this patch and the patch of
>> KVM: MMU: collapse TLB flushes when zap all pages?
>>
> We still want to add kvm_reload_remote_mmus() to
> kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
> optimization here. So may be skipping obsolete pages while walking
> hashtable is better solution.

I am willing to go this way instead, but it looks worse than this
patch:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b57faa..810410c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
{
ASSERT(is_empty_shadow_page(sp->spt));
- hlist_del(&sp->hash_link);
+ hlist_del_init(&sp->hash_link);
list_del(&sp->link);
free_page((unsigned long)sp->spt);
if (!sp->role.direct)
@@ -1648,14 +1648,20 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list);

+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
#define for_each_gfn_sp(_kvm, _sp, _gfn) \
hlist_for_each_entry(_sp, \
&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
- if ((_sp)->gfn != (_gfn)) {} else
+ if ((_sp)->gfn != (_gfn) || is_obsolete_sp(_kvm, _sp)) {} else

#define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
for_each_gfn_sp(_kvm, _sp, _gfn) \
- if ((_sp)->role.direct || (_sp)->role.invalid) {} else
+ if ((_sp)->role.direct || \
+ (_sp)->role.invalid || is_obsolete_sp(_kvm, _sp)) {} else

/* @sp->gfn should be write-protected at the call site */
static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -1838,11 +1844,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
}

-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
gva_t gaddr,
@@ -2085,11 +2086,15 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,

if (sp->unsync)
kvm_unlink_unsync_page(kvm, sp);
+
if (!sp->root_count) {
/* Count self */
ret++;
list_move(&sp->link, invalid_list);
kvm_mod_used_mmu_pages(kvm, -1);
+
+ if (unlikely(is_obsolete_sp(kvm, sp)))
+ hlist_del_init(&sp->hash_link);
} else {
list_move(&sp->link, &kvm->arch.active_mmu_pages);
kvm_reload_remote_mmus(kvm);

isn't it?

2013-05-23 12:40:17

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 07:13:58PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 04:09 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
> >> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
> >>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
> >>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
> >>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
> >>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> >>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
> >>>>>>>> will not be used, we need not spend time to find its unsync children
> >>>>>>>> out. Also, we delete the page from shadow page cache so that the page
> >>>>>>>> is completely isolated after call this function.
> >>>>>>>>
> >>>>>>>> The later patch will use it to collapse tlb flushes
> >>>>>>>>
> >>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
> >>>>>>>> ---
> >>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>>>>>> index 9b57faa..e676356 100644
> >>>>>>>> --- a/arch/x86/kvm/mmu.c
> >>>>>>>> +++ b/arch/x86/kvm/mmu.c
> >>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >>>>>>>> {
> >>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
> >>>>>>>> - hlist_del(&sp->hash_link);
> >>>>>>>> + hlist_del_init(&sp->hash_link);
> >>>>>>> Why do you need hlist_del_init() here? Why not move it into
> >>>>>>
> >>>>>> Since the hlist will be double freed. We will it like this:
> >>>>>>
> >>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
> >>>>>> kvm_mmu_commit_zap_page(list);
> >>>>>> kvm_mmu_free_page(page);
> >>>>>>
> >>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
> >>>>>> deleted the hash list.
> >>>>>>
> >>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
> >>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
> >>>>>>> it differently for obsolete and non obsolete pages?
> >>>>>>
> >>>>>> It is can break the hash-list walking: we should rescan the
> >>>>>> hash list once the page is prepared-ly zapped.
> >>>>>>
> >>>>>> I mentioned it in the changelog:
> >>>>>>
> >>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
> >>>>>> time since it can break the walk based on hash list.
> >>>>> Can you elaborate on how this can happen?
> >>>>
> >>>> There is a example:
> >>>>
> >>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
> >>>> {
> >>>> struct kvm_mmu_page *sp;
> >>>> LIST_HEAD(invalid_list);
> >>>> int r;
> >>>>
> >>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
> >>>> r = 0;
> >>>> spin_lock(&kvm->mmu_lock);
> >>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >>>> sp->role.word);
> >>>> r = 1;
> >>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >>>> }
> >>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>>> spin_unlock(&kvm->mmu_lock);
> >>>>
> >>>> return r;
> >>>> }
> >>>>
> >>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
> >>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
> >>>> be changed to:
> >>>>
> >>>> restart:
> >>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >>>> sp->role.word);
> >>>> r = 1;
> >>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >>>> goto restart;
> >>>> }
> >>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>>>
> >>> Hmm, yes. So lets leave it as is and always commit invalid_list before
> >>
> >> So, you mean drop this patch and the patch of
> >> KVM: MMU: collapse TLB flushes when zap all pages?
> >>
> > We still want to add kvm_reload_remote_mmus() to
> > kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
> > optimization here. So may be skipping obsolete pages while walking
> > hashtable is better solution.
>
> I am willing to use this way instead, but it looks worse than this
> patch:
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9b57faa..810410c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> {
> ASSERT(is_empty_shadow_page(sp->spt));
> - hlist_del(&sp->hash_link);
> + hlist_del_init(&sp->hash_link);
Why not drop this

> list_del(&sp->link);
> free_page((unsigned long)sp->spt);
> if (!sp->role.direct)
> @@ -1648,14 +1648,20 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> struct list_head *invalid_list);
>
> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
> #define for_each_gfn_sp(_kvm, _sp, _gfn) \
> hlist_for_each_entry(_sp, \
> &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
> - if ((_sp)->gfn != (_gfn)) {} else
> + if ((_sp)->gfn != (_gfn) || is_obsolete_sp(_kvm, _sp)) {} else
>
> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
> for_each_gfn_sp(_kvm, _sp, _gfn) \
> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> + if ((_sp)->role.direct || \
> + (_sp)->role.invalid || is_obsolete_sp(_kvm, _sp)) {} else
>
> /* @sp->gfn should be write-protected at the call site */
> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -1838,11 +1844,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sp);
> }
>
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> - return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> gfn_t gfn,
> gva_t gaddr,
> @@ -2085,11 +2086,15 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>
> if (sp->unsync)
> kvm_unlink_unsync_page(kvm, sp);
> +
> if (!sp->root_count) {
> /* Count self */
> ret++;
> list_move(&sp->link, invalid_list);
> kvm_mod_used_mmu_pages(kvm, -1);
> +
> + if (unlikely(is_obsolete_sp(kvm, sp)))
> + hlist_del_init(&sp->hash_link);
and this.

Since we check for obsolete pages while searching the hashtable, why delete
them here?

> } else {
> list_move(&sp->link, &kvm->arch.active_mmu_pages);
> kvm_reload_remote_mmus(kvm);
>
> isn't it?

--
Gleb.

2013-05-23 13:04:02

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 08:39 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 07:13:58PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 04:09 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
>>>> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
>>>>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
>>>>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>>>>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>>>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>>>>>>>> will not be used, we need not spend time to find its unsync children
>>>>>>>>>> out. Also, we delete the page from shadow page cache so that the page
>>>>>>>>>> is completely isolated after call this function.
>>>>>>>>>>
>>>>>>>>>> The later patch will use it to collapse tlb flushes
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>>>>> index 9b57faa..e676356 100644
>>>>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>>>>>>>> {
>>>>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>>>>>>>> - hlist_del(&sp->hash_link);
>>>>>>>>>> + hlist_del_init(&sp->hash_link);
>>>>>>>>> Why do you need hlist_del_init() here? Why not move it into
>>>>>>>>
>>>>>>>> Since the hlist will be double freed. We will it like this:
>>>>>>>>
>>>>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>>>>>>>> kvm_mmu_commit_zap_page(list);
>>>>>>>> kvm_mmu_free_page(page);
>>>>>>>>
>>>>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>>>>>>>> deleted the hash list.
>>>>>>>>
>>>>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>>>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>>>>>>>> it differently for obsolete and non obsolete pages?
>>>>>>>>
>>>>>>>> It is can break the hash-list walking: we should rescan the
>>>>>>>> hash list once the page is prepared-ly zapped.
>>>>>>>>
>>>>>>>> I mentioned it in the changelog:
>>>>>>>>
>>>>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
>>>>>>>> time since it can break the walk based on hash list.
>>>>>>> Can you elaborate on how this can happen?
>>>>>>
>>>>>> There is a example:
>>>>>>
>>>>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
>>>>>> {
>>>>>> struct kvm_mmu_page *sp;
>>>>>> LIST_HEAD(invalid_list);
>>>>>> int r;
>>>>>>
>>>>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
>>>>>> r = 0;
>>>>>> spin_lock(&kvm->mmu_lock);
>>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>>>> sp->role.word);
>>>>>> r = 1;
>>>>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>>>>>> }
>>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>>> spin_unlock(&kvm->mmu_lock);
>>>>>>
>>>>>> return r;
>>>>>> }
>>>>>>
>>>>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
>>>>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
>>>>>> be changed to:
>>>>>>
>>>>>> restart:
>>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>>>> sp->role.word);
>>>>>> r = 1;
>>>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>>>> goto restart;
>>>>>> }
>>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>>>
>>>>> Hmm, yes. So lets leave it as is and always commit invalid_list before
>>>>
>>>> So, you mean drop this patch and the patch of
>>>> KVM: MMU: collapse TLB flushes when zap all pages?
>>>>
>>> We still want to add kvm_reload_remote_mmus() to
>>> kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
>>> optimization here. So may be skipping obsolete pages while walking
>>> hashtable is better solution.
>>
>> I am willing to use this way instead, but it looks worse than this
>> patch:
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9b57faa..810410c 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>> {
>> ASSERT(is_empty_shadow_page(sp->spt));
>> - hlist_del(&sp->hash_link);
>> + hlist_del_init(&sp->hash_link);
> Why not drop this
>
>> list_del(&sp->link);
>> free_page((unsigned long)sp->spt);
>> if (!sp->role.direct)
>> @@ -1648,14 +1648,20 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>> struct list_head *invalid_list);
>>
>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>> +{
>> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>> +}
>> +
>> #define for_each_gfn_sp(_kvm, _sp, _gfn) \
>> hlist_for_each_entry(_sp, \
>> &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
>> - if ((_sp)->gfn != (_gfn)) {} else
>> + if ((_sp)->gfn != (_gfn) || is_obsolete_sp(_kvm, _sp)) {} else
>>
>> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
>> for_each_gfn_sp(_kvm, _sp, _gfn) \
>> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
>> + if ((_sp)->role.direct || \
>> + (_sp)->role.invalid || is_obsolete_sp(_kvm, _sp)) {} else
>>
>> /* @sp->gfn should be write-protected at the call site */
>> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>> @@ -1838,11 +1844,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
>> __clear_sp_write_flooding_count(sp);
>> }
>>
>> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>> -{
>> - return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>> -}
>> -
>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>> gfn_t gfn,
>> gva_t gaddr,
>> @@ -2085,11 +2086,15 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>
>> if (sp->unsync)
>> kvm_unlink_unsync_page(kvm, sp);
>> +
>> if (!sp->root_count) {
>> /* Count self */
>> ret++;
>> list_move(&sp->link, invalid_list);
>> kvm_mod_used_mmu_pages(kvm, -1);
>> +
>> + if (unlikely(is_obsolete_sp(kvm, sp)))
>> + hlist_del_init(&sp->hash_link);
> and this.
>
> Since we check for obsolete while searching hashtable why delete it
> here?

In order to zap obsolete pages without a TLB flush, we should delete them from
the hash list at "prepare" time. Here, we only delete the obsolete pages, so
that the hashtable-walking functions, like kvm_mmu_unprotect_page(), can work
properly by skipping obsolete pages.

Also, kvm_mmu_prepare_zap_page() is recursive:
kvm_mmu_prepare_zap_page() -> mmu_zap_unsync_children() -> kvm_mmu_prepare_zap_page(),
so it seems to be the only place where this can be done. For example, code like
the following is not allowed in kvm_zap_obsolete_pages(), since it would miss
the pages zapped through the recursion:

if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
	hlist_del(&sp->hash_link);

Or did I miss your suggestion?

2013-05-23 15:58:09

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 09:03:50PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 08:39 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 07:13:58PM +0800, Xiao Guangrong wrote:
> >> On 05/23/2013 04:09 PM, Gleb Natapov wrote:
> >>> On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
> >>>> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
> >>>>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
> >>>>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
> >>>>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
> >>>>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
> >>>>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >>>>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
> >>>>>>>>>> will not be used, we need not spend time to find its unsync children
> >>>>>>>>>> out. Also, we delete the page from shadow page cache so that the page
> >>>>>>>>>> is completely isolated after call this function.
> >>>>>>>>>>
> >>>>>>>>>> The later patch will use it to collapse tlb flushes
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
> >>>>>>>>>> ---
> >>>>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >>>>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>>>>>>>>> index 9b57faa..e676356 100644
> >>>>>>>>>> --- a/arch/x86/kvm/mmu.c
> >>>>>>>>>> +++ b/arch/x86/kvm/mmu.c
> >>>>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >>>>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >>>>>>>>>> {
> >>>>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
> >>>>>>>>>> - hlist_del(&sp->hash_link);
> >>>>>>>>>> + hlist_del_init(&sp->hash_link);
> >>>>>>>>> Why do you need hlist_del_init() here? Why not move it into
> >>>>>>>>
> >>>>>>>> Since the hlist will be double freed. We will it like this:
> >>>>>>>>
> >>>>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
> >>>>>>>> kvm_mmu_commit_zap_page(list);
> >>>>>>>> kvm_mmu_free_page(page);
> >>>>>>>>
> >>>>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
> >>>>>>>> deleted the hash list.
> >>>>>>>>
> >>>>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
> >>>>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
> >>>>>>>>> it differently for obsolete and non obsolete pages?
> >>>>>>>>
> >>>>>>>> It is can break the hash-list walking: we should rescan the
> >>>>>>>> hash list once the page is prepared-ly zapped.
> >>>>>>>>
> >>>>>>>> I mentioned it in the changelog:
> >>>>>>>>
> >>>>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
> >>>>>>>> time since it can break the walk based on hash list.
> >>>>>>> Can you elaborate on how this can happen?
> >>>>>>
> >>>>>> There is a example:
> >>>>>>
> >>>>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
> >>>>>> {
> >>>>>> struct kvm_mmu_page *sp;
> >>>>>> LIST_HEAD(invalid_list);
> >>>>>> int r;
> >>>>>>
> >>>>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
> >>>>>> r = 0;
> >>>>>> spin_lock(&kvm->mmu_lock);
> >>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >>>>>> sp->role.word);
> >>>>>> r = 1;
> >>>>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >>>>>> }
> >>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>>>>> spin_unlock(&kvm->mmu_lock);
> >>>>>>
> >>>>>> return r;
> >>>>>> }
> >>>>>>
> >>>>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
> >>>>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
> >>>>>> be changed to:
> >>>>>>
> >>>>>> restart:
> >>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> >>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
> >>>>>> sp->role.word);
> >>>>>> r = 1;
> >>>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >>>>>> goto restart;
> >>>>>> }
> >>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >>>>>>
> >>>>> Hmm, yes. So lets leave it as is and always commit invalid_list before
> >>>>
> >>>> So, you mean drop this patch and the patch of
> >>>> KVM: MMU: collapse TLB flushes when zap all pages?
> >>>>
> >>> We still want to add kvm_reload_remote_mmus() to
> >>> kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
> >>> optimization here. So may be skipping obsolete pages while walking
> >>> hashtable is better solution.
> >>
> >> I am willing to use this way instead, but it looks worse than this
> >> patch:
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 9b57faa..810410c 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >> {
> >> ASSERT(is_empty_shadow_page(sp->spt));
> >> - hlist_del(&sp->hash_link);
> >> + hlist_del_init(&sp->hash_link);
> > Why not drop this
> >
> >> list_del(&sp->link);
> >> free_page((unsigned long)sp->spt);
> >> if (!sp->role.direct)
> >> @@ -1648,14 +1648,20 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> >> struct list_head *invalid_list);
> >>
> >> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >> +{
> >> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> >> +}
> >> +
> >> #define for_each_gfn_sp(_kvm, _sp, _gfn) \
> >> hlist_for_each_entry(_sp, \
> >> &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
> >> - if ((_sp)->gfn != (_gfn)) {} else
> >> + if ((_sp)->gfn != (_gfn) || is_obsolete_sp(_kvm, _sp)) {} else
> >>
> >> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
> >> for_each_gfn_sp(_kvm, _sp, _gfn) \
> >> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> >> + if ((_sp)->role.direct || \
> >> + (_sp)->role.invalid || is_obsolete_sp(_kvm, _sp)) {} else
> >>
> >> /* @sp->gfn should be write-protected at the call site */
> >> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >> @@ -1838,11 +1844,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >> __clear_sp_write_flooding_count(sp);
> >> }
> >>
> >> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >> -{
> >> - return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> >> -}
> >> -
> >> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >> gfn_t gfn,
> >> gva_t gaddr,
> >> @@ -2085,11 +2086,15 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >>
> >> if (sp->unsync)
> >> kvm_unlink_unsync_page(kvm, sp);
> >> +
> >> if (!sp->root_count) {
> >> /* Count self */
> >> ret++;
> >> list_move(&sp->link, invalid_list);
> >> kvm_mod_used_mmu_pages(kvm, -1);
> >> +
> >> + if (unlikely(is_obsolete_sp(kvm, sp)))
> >> + hlist_del_init(&sp->hash_link);
> > and this.
> >
> > Since we check for obsolete while searching hashtable why delete it
> > here?
>
> In order to zap obsolete pages without tlb flush, we should delete them from
> hash list at the "prepare" time. Here, we only delete the obsolete pages so
> that the hashtable walking functions, like kvm_mmu_unprotect_page(), can work
> properly by skipping obsolete page.
>
Why do we have to delete them from the hash at "prepare" time? If the hash walk
ignores them, they are as good as deleted, no?

> And, kvm_mmu_prepare_zap_page() is a recursion function:
> kvm_mmu_prepare_zap_page() -> zap_unsync_children -> kvm_mmu_prepare_zap_page().
> It seems it is the only place to do this thing. For example, below code is not
> allowed in kvm_zap_obsolete_pages():
>
> if (kvm_mmu_prepare_zap_page(sp, list))
> hlist_del(sp->hlist);
>
> Or, i missed your suggestion?
My assumption is that we can leave obsolete shadow pages on the hashtable
till commit_zap time.

BTW, is it such a good idea to call kvm_mmu_commit_zap_page() once on all
obsolete pages? We basically loop over all of them under the lock
without a lock break.

--
Gleb.

2013-05-24 05:39:46

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/23/2013 11:57 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 09:03:50PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 08:39 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 07:13:58PM +0800, Xiao Guangrong wrote:
>>>> On 05/23/2013 04:09 PM, Gleb Natapov wrote:
>>>>> On Thu, May 23, 2013 at 03:50:16PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/23/2013 03:37 PM, Gleb Natapov wrote:
>>>>>>> On Thu, May 23, 2013 at 02:31:47PM +0800, Xiao Guangrong wrote:
>>>>>>>> On 05/23/2013 02:18 PM, Gleb Natapov wrote:
>>>>>>>>> On Thu, May 23, 2013 at 02:13:06PM +0800, Xiao Guangrong wrote:
>>>>>>>>>> On 05/23/2013 01:57 PM, Gleb Natapov wrote:
>>>>>>>>>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>>>>>>>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>>>>>>>>>> will not be used, we need not spend time to find its unsync children
>>>>>>>>>>>> out. Also, we delete the page from shadow page cache so that the page
>>>>>>>>>>>> is completely isolated after call this function.
>>>>>>>>>>>>
>>>>>>>>>>>> The later patch will use it to collapse tlb flushes
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>>>>>>>>>> ---
>>>>>>>>>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>>>>>>> index 9b57faa..e676356 100644
>>>>>>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>>>>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>>>>>>>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>>>>>>>>>> {
>>>>>>>>>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>>>>>>>>>> - hlist_del(&sp->hash_link);
>>>>>>>>>>>> + hlist_del_init(&sp->hash_link);
>>>>>>>>>>> Why do you need hlist_del_init() here? Why not move it into
>>>>>>>>>>
>>>>>>>>>> Since the hlist will be double freed. We will it like this:
>>>>>>>>>>
>>>>>>>>>> kvm_mmu_prepare_zap_obsolete_page(page, list);
>>>>>>>>>> kvm_mmu_commit_zap_page(list);
>>>>>>>>>> kvm_mmu_free_page(page);
>>>>>>>>>>
>>>>>>>>>> The first place is kvm_mmu_prepare_zap_obsolete_page(page), which have
>>>>>>>>>> deleted the hash list.
>>>>>>>>>>
>>>>>>>>>>> kvm_mmu_prepare_zap_page() like we discussed it here:
>>>>>>>>>>> https://patchwork.kernel.org/patch/2580351/ instead of doing
>>>>>>>>>>> it differently for obsolete and non obsolete pages?
>>>>>>>>>>
>>>>>>>>>> It is can break the hash-list walking: we should rescan the
>>>>>>>>>> hash list once the page is prepared-ly zapped.
>>>>>>>>>>
>>>>>>>>>> I mentioned it in the changelog:
>>>>>>>>>>
>>>>>>>>>> 4): drop the patch which deleted page from hash list at the "prepare"
>>>>>>>>>> time since it can break the walk based on hash list.
>>>>>>>>> Can you elaborate on how this can happen?
>>>>>>>>
>>>>>>>> There is a example:
>>>>>>>>
>>>>>>>> int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
>>>>>>>> {
>>>>>>>> struct kvm_mmu_page *sp;
>>>>>>>> LIST_HEAD(invalid_list);
>>>>>>>> int r;
>>>>>>>>
>>>>>>>> pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
>>>>>>>> r = 0;
>>>>>>>> spin_lock(&kvm->mmu_lock);
>>>>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>>>>>> sp->role.word);
>>>>>>>> r = 1;
>>>>>>>> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>>>>>>>> }
>>>>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>>>>> spin_unlock(&kvm->mmu_lock);
>>>>>>>>
>>>>>>>> return r;
>>>>>>>> }
>>>>>>>>
>>>>>>>> It works fine since kvm_mmu_prepare_zap_page does not touch the hash list.
>>>>>>>> If we delete hlist in kvm_mmu_prepare_zap_page(), this kind of codes should
>>>>>>>> be changed to:
>>>>>>>>
>>>>>>>> restart:
>>>>>>>> for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
>>>>>>>> pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
>>>>>>>> sp->role.word);
>>>>>>>> r = 1;
>>>>>>>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>>>>>>>> goto restart;
>>>>>>>> }
>>>>>>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>>>>>>>
>>>>>>> Hmm, yes. So lets leave it as is and always commit invalid_list before
>>>>>>
>>>>>> So, you mean drop this patch and the patch of
>>>>>> KVM: MMU: collapse TLB flushes when zap all pages?
>>>>>>
>>>>> We still want to add kvm_reload_remote_mmus() to
>>>>> kvm_mmu_invalidate_zap_all_pages(). But yes, we disable a nice
>>>>> optimization here. So may be skipping obsolete pages while walking
>>>>> hashtable is better solution.
>>>>
>>>> I am willing to use this way instead, but it looks worse than this
>>>> patch:
>>>>
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 9b57faa..810410c 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>> {
>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>> - hlist_del(&sp->hash_link);
>>>> + hlist_del_init(&sp->hash_link);
>>> Why not drop this
>>>
>>>> list_del(&sp->link);
>>>> free_page((unsigned long)sp->spt);
>>>> if (!sp->role.direct)
>>>> @@ -1648,14 +1648,20 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>>> struct list_head *invalid_list);
>>>>
>>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>> +{
>>>> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>>>> +}
>>>> +
>>>> #define for_each_gfn_sp(_kvm, _sp, _gfn) \
>>>> hlist_for_each_entry(_sp, \
>>>> &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
>>>> - if ((_sp)->gfn != (_gfn)) {} else
>>>> + if ((_sp)->gfn != (_gfn) || is_obsolete_sp(_kvm, _sp)) {} else
>>>>
>>>> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
>>>> for_each_gfn_sp(_kvm, _sp, _gfn) \
>>>> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
>>>> + if ((_sp)->role.direct || \
>>>> + (_sp)->role.invalid || is_obsolete_sp(_kvm, _sp)) {} else
>>>>
>>>> /* @sp->gfn should be write-protected at the call site */
>>>> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>>> @@ -1838,11 +1844,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>> __clear_sp_write_flooding_count(sp);
>>>> }
>>>>
>>>> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>> -{
>>>> - return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>>>> -}
>>>> -
>>>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>> gfn_t gfn,
>>>> gva_t gaddr,
>>>> @@ -2085,11 +2086,15 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>>
>>>> if (sp->unsync)
>>>> kvm_unlink_unsync_page(kvm, sp);
>>>> +
>>>> if (!sp->root_count) {
>>>> /* Count self */
>>>> ret++;
>>>> list_move(&sp->link, invalid_list);
>>>> kvm_mod_used_mmu_pages(kvm, -1);
>>>> +
>>>> + if (unlikely(is_obsolete_sp(kvm, sp)))
>>>> + hlist_del_init(&sp->hash_link);
>>> and this.
>>>
>>> Since we check for obsolete while searching hashtable why delete it
>>> here?
>>
>> In order to zap obsolete pages without tlb flush, we should delete them from
>> hash list at the "prepare" time. Here, we only delete the obsolete pages so
>> that the hashtable walking functions, like kvm_mmu_unprotect_page(), can work
>> properly by skipping obsolete page.
>>
> Why we have to delete them from the hash at "prepare" time? I hash walk
> ignores them they are as good as deleted, no?
>
>> And, kvm_mmu_prepare_zap_page() is a recursion function:
>> kvm_mmu_prepare_zap_page() -> zap_unsync_children -> kvm_mmu_prepare_zap_page().
>> It seems it is the only place to do this thing. For example, below code is not
>> allowed in kvm_zap_obsolete_pages():
>>
>> if (kvm_mmu_prepare_zap_page(sp, list))
>> hlist_del(sp->hlist);
>>
>> Or, i missed your suggestion?
> My assumption is that we can leave obsolete shadow pages on hashtable
> till commit_zap time.

Ah, I see.

Yes, I agree with your idea. I think we can only skip obsolete-and-invalid
pages, since an obsolete-but-unzapped page still affects the mmu's behaviour;
for example, it can keep a page write-protected, so kvm_mmu_unprotect_page()
cannot work correctly if it skips unzapped obsolete pages.
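
As a rough sketch of that idea, adapting the for_each_gfn_sp() diff posted
earlier in this thread so that only obsolete-and-invalid pages are skipped
(the sp_is_obsolete_and_invalid helper is invented here for illustration;
this is not the final patch):

/* Skip only pages that are both obsolete and already prepared-zapped. */
#define sp_is_obsolete_and_invalid(_kvm, _sp)				\
	((_sp)->role.invalid && is_obsolete_sp(_kvm, _sp))

#define for_each_gfn_sp(_kvm, _sp, _gfn)				\
	hlist_for_each_entry(_sp,					\
	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
		if ((_sp)->gfn != (_gfn) ||				\
		    sp_is_obsolete_and_invalid(_kvm, _sp)) {} else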

>
> BTW is it such a good idea to call kvm_mmu_commit_zap_page() once on all

If other choices are available, we can try.

> obsolete pages? We basically loop over all of them under the lock
> without lock break.

It seems not. :)
Since we have reloaded the mmu before zapping the obsolete pages, the mmu-lock
is easily contended. I did some simple tracking:

+ int num = 0;
restart:
list_for_each_entry_safe_reverse(sp, node,
&kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
if (batch >= BATCH_ZAP_PAGES &&
cond_resched_lock(&kvm->mmu_lock)) {
batch = 0;
+ num++;
goto restart;
}

@@ -4277,6 +4278,7 @@ restart:
* may use the pages.
*/
kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ printk("lock-break: %d.\n", num);
}

I read the PCI ROM while doing a kernel build in the guest, which
has 1G memory and 4 vcpus with EPT enabled; this is a normal
workload and a normal configuration.

# dmesg
[ 2338.759099] lock-break: 8.
[ 2339.732442] lock-break: 5.
[ 2340.904446] lock-break: 3.
[ 2342.513514] lock-break: 3.
[ 2343.452229] lock-break: 3.
[ 2344.981599] lock-break: 4.

Basically, we need to break many times.

2013-05-24 05:53:56

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/24/2013 01:39 PM, Xiao Guangrong wrote:

>>> if (kvm_mmu_prepare_zap_page(sp, list))
>>> hlist_del(sp->hlist);
>>>
>>> Or, i missed your suggestion?
>> My assumption is that we can leave obsolete shadow pages on hashtable
>> till commit_zap time.
>
> Ah, i see.
>
> Yes, i agree with your idea. I think we can only skip the obsolete-and-invalid
> page since the obsolete-but-unzapped page still affects the mmu's behaviour,
> for example, it can cause page write-protect, kvm_mmu_unprotect_page()
> can not work by skipping unzapped-obsolete pages.

kvm_mmu_unprotect_page() can still work: we can skip obsolete pages too when
detecting whether a page needs to be write-protected, and it is easier to make
the page writable when zapping obsolete pages.

I will update it following your idea, sorry for the noise.

2013-05-24 20:23:28

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

Hi Xiao,

On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> walk and zap all shadow pages one by one, also it need to zap all guest
> page's rmap and all shadow page's parent spte list. Particularly, things
> become worse if guest uses more memory or vcpus. It is not good for
> scalability
>
> In this patch, we introduce a faster way to invalidate all shadow pages.
> KVM maintains a global mmu invalid generation-number which is stored in
> kvm->arch.mmu_valid_gen and every shadow page stores the current global
> generation-number into sp->mmu_valid_gen when it is created
>
> When KVM need zap all shadow pages sptes, it just simply increase the
> global generation-number then reload root shadow pages on all vcpus.
> Vcpu will create a new shadow page table according to current kvm's
> generation-number. It ensures the old pages are not used any more.
> Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are zapped by using lock-break technique
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu.h | 1 +
> 3 files changed, 87 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..bff7d46 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> int root_count; /* Currently serving as active root */
> unsigned int unsync_children;
> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> + unsigned long mmu_valid_gen;
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> @@ -529,6 +530,7 @@ struct kvm_arch {
> unsigned int n_requested_mmu_pages;
> unsigned int n_max_mmu_pages;
> unsigned int indirect_shadow_pages;
> + unsigned long mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> /*
> * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f8ca2f3..f302540 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sp);
> }
>
> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> gfn_t gfn,
> gva_t gaddr,
> @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>
> account_shadowed(vcpu->kvm, gfn);
> }
> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> init_shadow_page_table(sp);
> trace_kvm_mmu_get_page(sp, true);
> return sp;
> @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> kvm_mmu_page_unlink_children(kvm, sp);
> kvm_mmu_unlink_parents(kvm, sp);
> +
> if (!sp->role.invalid && !sp->role.direct)
> unaccount_shadowed(kvm, sp->gfn);
> +
> if (sp->unsync)
> kvm_unlink_unsync_page(kvm, sp);
> if (!sp->root_count) {
> @@ -4195,6 +4203,82 @@ restart:
> spin_unlock(&kvm->mmu_lock);
> }
>
> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> +{
> + struct kvm_mmu_page *sp, *node;
> + LIST_HEAD(invalid_list);
> +
> +restart:
> + list_for_each_entry_safe_reverse(sp, node,
> + &kvm->arch.active_mmu_pages, link) {
> + /*
> + * No obsolete page exists before new created page since
> + * active_mmu_pages is the FIFO list.
> + */
> + if (!is_obsolete_sp(kvm, sp))
> + break;

Can you add a comment to list_add(x, active_mmu_pages) callsites
mentioning this case?

Because it'll break silently if people do list_add_tail().
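
For reference, a minimal sketch of what such a comment could look like at the
kvm_mmu_alloc_page() call site (the exact location and wording are assumptions,
not part of the posted patch):

	/*
	 * active_mmu_pages must stay in FIFO order (new pages at the head)
	 * so that kvm_zap_obsolete_pages() can stop its reverse walk at the
	 * first non-obsolete page.  Do not convert this to list_add_tail()
	 * without updating that logic.
	 */
	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);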

> + /*
> + * Do not repeatedly zap a root page to avoid unnecessary
> + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> + * progress:
> + * vcpu 0 vcpu 1
> + * call vcpu_enter_guest():
> + * 1): handle KVM_REQ_MMU_RELOAD
> + * and require mmu-lock to
> + * load mmu
> + * repeat:
> + * 1): zap root page and
> + * send KVM_REQ_MMU_RELOAD
> + *
> + * 2): if (cond_resched_lock(mmu-lock))
> + *
> + * 2): hold mmu-lock and load mmu
> + *
> + * 3): see KVM_REQ_MMU_RELOAD bit
> + * on vcpu->requests is set
> + * then return 1 to call
> + * vcpu_enter_guest() again.
> + * goto repeat;
> + *
> + * Since we are reversely walking the list and the invalid
> + * list will be moved to the head, skip the invalid page
> + * can help us to avoid the infinity list walking.
> + */
> + if (sp->role.invalid)
> + continue;

But this allows completing (that is, returning) with pages that should
be zapped still present (even though they are invalid).

Is another pass needed at the end to take care of the invalid pages,
which at that point must have had their root_count decreased?

> +
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> + kvm_mmu_commit_zap_page(kvm, &invalid_list);
> + cond_resched_lock(&kvm->mmu_lock);
> + goto restart;
> + }
> +
> + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> + goto restart;
> + }
> +
> + kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +}
> +
> +/*
> + * Fast invalidate all shadow pages and use lock-break technique
> + * to zap obsolete pages.
> + *
> + * It's required when memslot is being deleted or VM is being
> + * destroyed, in these cases, we should ensure that KVM MMU does
> + * not use any resource of the being-deleted slot or all slots
> + * after calling the function.
> + */
> +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
> +{
> + spin_lock(&kvm->mmu_lock);
> + kvm->arch.mmu_valid_gen++;
> +
> + kvm_zap_obsolete_pages(kvm);
> + spin_unlock(&kvm->mmu_lock);
> +}
> +

Also this function should be serialized, that is, it should not allow
simultaneous kvm_mmu_invalidate_zap_all_pages calls. If that's so,
assert(mutex_is_locked(kvm->lock)) would help.

Probably fine to have simultaneous users, but not necessary AFAICS.
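
A minimal sketch of the suggested assertion (the lock choice is an assumption;
which lock actually serializes callers is settled in the follow-up discussion
below):

void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
{
	/* sketch: callers are expected to be serialized */
	WARN_ON(!mutex_is_locked(&kvm->lock));

	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;

	kvm_zap_obsolete_pages(kvm);
	spin_unlock(&kvm->mmu_lock);
}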

2013-05-24 20:34:51

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> caused by requiring lock
>
> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> so update the comments
>
> [ It improves kernel building 0.6% ~ 1% ]

Can you please describe the overload in more detail? Under what scenario
is kernel building improved?

2013-05-26 08:27:03

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote:
> Hi Xiao,
>
> On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote:
> > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> > walk and zap all shadow pages one by one, also it need to zap all guest
> > page's rmap and all shadow page's parent spte list. Particularly, things
> > become worse if guest uses more memory or vcpus. It is not good for
> > scalability
> >
> > In this patch, we introduce a faster way to invalidate all shadow pages.
> > KVM maintains a global mmu invalid generation-number which is stored in
> > kvm->arch.mmu_valid_gen and every shadow page stores the current global
> > generation-number into sp->mmu_valid_gen when it is created
> >
> > When KVM need zap all shadow pages sptes, it just simply increase the
> > global generation-number then reload root shadow pages on all vcpus.
> > Vcpu will create a new shadow page table according to current kvm's
> > generation-number. It ensures the old pages are not used any more.
> > Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > are zapped by using lock-break technique
> >
> > Signed-off-by: Xiao Guangrong <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +
> > arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/mmu.h | 1 +
> > 3 files changed, 87 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 3741c65..bff7d46 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> > int root_count; /* Currently serving as active root */
> > unsigned int unsync_children;
> > unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> > + unsigned long mmu_valid_gen;
> > DECLARE_BITMAP(unsync_child_bitmap, 512);
> >
> > #ifdef CONFIG_X86_32
> > @@ -529,6 +530,7 @@ struct kvm_arch {
> > unsigned int n_requested_mmu_pages;
> > unsigned int n_max_mmu_pages;
> > unsigned int indirect_shadow_pages;
> > + unsigned long mmu_valid_gen;
> > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > /*
> > * Hash table of struct kvm_mmu_page.
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index f8ca2f3..f302540 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > __clear_sp_write_flooding_count(sp);
> > }
> >
> > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > +}
> > +
> > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > gfn_t gfn,
> > gva_t gaddr,
> > @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >
> > account_shadowed(vcpu->kvm, gfn);
> > }
> > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > init_shadow_page_table(sp);
> > trace_kvm_mmu_get_page(sp, true);
> > return sp;
> > @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > kvm_mmu_page_unlink_children(kvm, sp);
> > kvm_mmu_unlink_parents(kvm, sp);
> > +
> > if (!sp->role.invalid && !sp->role.direct)
> > unaccount_shadowed(kvm, sp->gfn);
> > +
> > if (sp->unsync)
> > kvm_unlink_unsync_page(kvm, sp);
> > if (!sp->root_count) {
> > @@ -4195,6 +4203,82 @@ restart:
> > spin_unlock(&kvm->mmu_lock);
> > }
> >
> > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > +{
> > + struct kvm_mmu_page *sp, *node;
> > + LIST_HEAD(invalid_list);
> > +
> > +restart:
> > + list_for_each_entry_safe_reverse(sp, node,
> > + &kvm->arch.active_mmu_pages, link) {
> > + /*
> > + * No obsolete page exists before new created page since
> > + * active_mmu_pages is the FIFO list.
> > + */
> > + if (!is_obsolete_sp(kvm, sp))
> > + break;
>
> Can you add a comment to list_add(x, active_mmu_pages) callsites
> mentioning this case?
>
> Because it'll break silently if people do list_add_tail().
>
> > + /*
> > + * Do not repeatedly zap a root page to avoid unnecessary
> > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > + * progress:
> > + * vcpu 0 vcpu 1
> > + * call vcpu_enter_guest():
> > + * 1): handle KVM_REQ_MMU_RELOAD
> > + * and require mmu-lock to
> > + * load mmu
> > + * repeat:
> > + * 1): zap root page and
> > + * send KVM_REQ_MMU_RELOAD
> > + *
> > + * 2): if (cond_resched_lock(mmu-lock))
> > + *
> > + * 2): hold mmu-lock and load mmu
> > + *
> > + * 3): see KVM_REQ_MMU_RELOAD bit
> > + * on vcpu->requests is set
> > + * then return 1 to call
> > + * vcpu_enter_guest() again.
> > + * goto repeat;
> > + *
> > + * Since we are reversely walking the list and the invalid
> > + * list will be moved to the head, skip the invalid page
> > + * can help us to avoid the infinity list walking.
> > + */
> > + if (sp->role.invalid)
> > + continue;
>
> But this allows completing (that is returning), with page that should
> be zapped still present (even though its invalid).
>
> Is another pass needed at the end to take care of the invalid pages?
> Which at that point must have their root_count decreased.
>
It is no different from how it works now. Invalid pages can still be left
unzapped after zap_all() completes; they are ignored by all relevant code
paths.
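
For context, the regular gfn hash-list lookups already skip such pages; the
walk looks roughly like this (an approximation of the mmu.c macro of that
time, not quoted from this series):

#define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
	for_each_gfn_sp(_kvm, _sp, _gfn)				\
		if ((_sp)->role.direct || (_sp)->role.invalid) {} else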

> > +
> > + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> > + kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > + cond_resched_lock(&kvm->mmu_lock);
> > + goto restart;
> > + }
> > +
> > + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > + goto restart;
> > + }
> > +
> > + kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +}
> > +
> > +/*
> > + * Fast invalidate all shadow pages and use lock-break technique
> > + * to zap obsolete pages.
> > + *
> > + * It's required when memslot is being deleted or VM is being
> > + * destroyed, in these cases, we should ensure that KVM MMU does
> > + * not use any resource of the being-deleted slot or all slots
> > + * after calling the function.
> > + */
> > +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
> > +{
> > + spin_lock(&kvm->mmu_lock);
> > + kvm->arch.mmu_valid_gen++;
> > +
> > + kvm_zap_obsolete_pages(kvm);
> > + spin_unlock(&kvm->mmu_lock);
> > +}
> > +
>
> Also this function should be serialized, that is, should not allow
> simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so
> assert(mutex_is_locked(kvm->lock)) would help.
>
> Probably fine to have simultaneous users, but not necessary AFAICS.
I raised this point on a previous patch submission. The function can be
executed simultaneously only during vmexit, when zap_all is called by mm
notifiers. During regular work the function is only called by slot
manipulation functions, and those are serialized by the slots lock.

--
Gleb.

2013-05-27 02:02:43

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

Hi Marcelo,

On 05/25/2013 04:23 AM, Marcelo Tosatti wrote:

>> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
>> +{
>> + struct kvm_mmu_page *sp, *node;
>> + LIST_HEAD(invalid_list);
>> +
>> +restart:
>> + list_for_each_entry_safe_reverse(sp, node,
>> + &kvm->arch.active_mmu_pages, link) {
>> + /*
>> + * No obsolete page exists before new created page since
>> + * active_mmu_pages is the FIFO list.
>> + */
>> + if (!is_obsolete_sp(kvm, sp))
>> + break;
>
> Can you add a comment to list_add(x, active_mmu_pages) callsites
> mentioning this case?
>
> Because it'll break silently if people do list_add_tail().

Sure, I will do it in the next version.

And I totally agree with Gleb's points, which answer your other questions
about this patch.

Thank you all!

2013-05-27 02:20:25

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>> caused by requiring lock
>>
>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>> so update the comments
>>
>> [ It improves kernel building 0.6% ~ 1% ]
>
> Can you please describe the overload in more detail? Under what scenario
> is kernel building improved?

Yes.

The scenario is: we do a kernel build in the guest and, meanwhile, repeatedly
read the PCI ROM once per second.

[
echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
]

2013-05-27 22:29:17

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

On Sun, May 26, 2013 at 11:26:49AM +0300, Gleb Natapov wrote:
> On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote:
> > Hi Xiao,
> >
> > On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote:
> > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> > > walk and zap all shadow pages one by one, also it need to zap all guest
> > > page's rmap and all shadow page's parent spte list. Particularly, things
> > > become worse if guest uses more memory or vcpus. It is not good for
> > > scalability
> > >
> > > In this patch, we introduce a faster way to invalidate all shadow pages.
> > > KVM maintains a global mmu invalid generation-number which is stored in
> > > kvm->arch.mmu_valid_gen and every shadow page stores the current global
> > > generation-number into sp->mmu_valid_gen when it is created
> > >
> > > When KVM need zap all shadow pages sptes, it just simply increase the
> > > global generation-number then reload root shadow pages on all vcpus.
> > > Vcpu will create a new shadow page table according to current kvm's
> > > generation-number. It ensures the old pages are not used any more.
> > > Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > > are zapped by using lock-break technique
> > >
> > > Signed-off-by: Xiao Guangrong <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 2 +
> > > arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/mmu.h | 1 +
> > > 3 files changed, 87 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3741c65..bff7d46 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> > > int root_count; /* Currently serving as active root */
> > > unsigned int unsync_children;
> > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> > > + unsigned long mmu_valid_gen;
> > > DECLARE_BITMAP(unsync_child_bitmap, 512);
> > >
> > > #ifdef CONFIG_X86_32
> > > @@ -529,6 +530,7 @@ struct kvm_arch {
> > > unsigned int n_requested_mmu_pages;
> > > unsigned int n_max_mmu_pages;
> > > unsigned int indirect_shadow_pages;
> > > + unsigned long mmu_valid_gen;
> > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > /*
> > > * Hash table of struct kvm_mmu_page.
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index f8ca2f3..f302540 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > > __clear_sp_write_flooding_count(sp);
> > > }
> > >
> > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > +{
> > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > > +}
> > > +
> > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > gfn_t gfn,
> > > gva_t gaddr,
> > > @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >
> > > account_shadowed(vcpu->kvm, gfn);
> > > }
> > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > > init_shadow_page_table(sp);
> > > trace_kvm_mmu_get_page(sp, true);
> > > return sp;
> > > @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > > kvm_mmu_page_unlink_children(kvm, sp);
> > > kvm_mmu_unlink_parents(kvm, sp);
> > > +
> > > if (!sp->role.invalid && !sp->role.direct)
> > > unaccount_shadowed(kvm, sp->gfn);
> > > +
> > > if (sp->unsync)
> > > kvm_unlink_unsync_page(kvm, sp);
> > > if (!sp->root_count) {
> > > @@ -4195,6 +4203,82 @@ restart:
> > > spin_unlock(&kvm->mmu_lock);
> > > }
> > >
> > > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > > +{
> > > + struct kvm_mmu_page *sp, *node;
> > > + LIST_HEAD(invalid_list);
> > > +
> > > +restart:
> > > + list_for_each_entry_safe_reverse(sp, node,
> > > + &kvm->arch.active_mmu_pages, link) {
> > > + /*
> > > + * No obsolete page exists before new created page since
> > > + * active_mmu_pages is the FIFO list.
> > > + */
> > > + if (!is_obsolete_sp(kvm, sp))
> > > + break;
> >
> > Can you add a comment to list_add(x, active_mmu_pages) callsites
> > mentioning this case?
> >
> > Because it'll break silently if people do list_add_tail().
> >
> > > + /*
> > > + * Do not repeatedly zap a root page to avoid unnecessary
> > > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > > + * progress:
> > > + * vcpu 0 vcpu 1
> > > + * call vcpu_enter_guest():
> > > + * 1): handle KVM_REQ_MMU_RELOAD
> > > + * and require mmu-lock to
> > > + * load mmu
> > > + * repeat:
> > > + * 1): zap root page and
> > > + * send KVM_REQ_MMU_RELOAD
> > > + *
> > > + * 2): if (cond_resched_lock(mmu-lock))
> > > + *
> > > + * 2): hold mmu-lock and load mmu
> > > + *
> > > + * 3): see KVM_REQ_MMU_RELOAD bit
> > > + * on vcpu->requests is set
> > > + * then return 1 to call
> > > + * vcpu_enter_guest() again.
> > > + * goto repeat;
> > > + *
> > > + * Since we are reversely walking the list and the invalid
> > > + * list will be moved to the head, skip the invalid page
> > > + * can help us to avoid the infinity list walking.
> > > + */
> > > + if (sp->role.invalid)
> > > + continue;
> >
> > But this allows completing (that is returning), with page that should
> > be zapped still present (even though its invalid).
> >
> > Is another pass needed at the end to take care of the invalid pages?
> > Which at that point must have their root_count decreased.
> >
> It is not different from how it work now. Invalid page can still be not
> zapped after zap_all() completes. They are ignored by all relevant code
> paths.

It is different. kvm_mmu_zap_all() returns with 0 pages at
mmu_active_pages today.

It is probably worthwhile to maintain that behaviour (instead of relying on
every other code path to do the right thing).

> > Also this function should be serialized, that is, should not allow
> > simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so
> > assert(mutex_is_locked(kvm->lock)) would help.
> >
> > Probably fine to have simultaneous users, but not necessary AFAICS.
> I raced this point on previous patch submission. The function can be
> executed simultaneously only during vmexit when zap_all is called by mm
> notifiers. During regular work the function is only called by slot
> manipulation functions and those are serialized by slot lock.

How about fix hypercall vs kvm_set_memory?

2013-05-27 22:59:35

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

On 05/27/2013 04:37 AM, Marcelo Tosatti wrote:
> On Sun, May 26, 2013 at 11:26:49AM +0300, Gleb Natapov wrote:
>> On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote:
>>> Hi Xiao,
>>>
>>> On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote:
>>>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>>>> walk and zap all shadow pages one by one, also it need to zap all guest
>>>> page's rmap and all shadow page's parent spte list. Particularly, things
>>>> become worse if guest uses more memory or vcpus. It is not good for
>>>> scalability
>>>>
>>>> In this patch, we introduce a faster way to invalidate all shadow pages.
>>>> KVM maintains a global mmu invalid generation-number which is stored in
>>>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
>>>> generation-number into sp->mmu_valid_gen when it is created
>>>>
>>>> When KVM need zap all shadow pages sptes, it just simply increase the
>>>> global generation-number then reload root shadow pages on all vcpus.
>>>> Vcpu will create a new shadow page table according to current kvm's
>>>> generation-number. It ensures the old pages are not used any more.
>>>> Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
>>>> are zapped by using lock-break technique
>>>>
>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>> ---
>>>> arch/x86/include/asm/kvm_host.h | 2 +
>>>> arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++
>>>> arch/x86/kvm/mmu.h | 1 +
>>>> 3 files changed, 87 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 3741c65..bff7d46 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>>>> int root_count; /* Currently serving as active root */
>>>> unsigned int unsync_children;
>>>> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
>>>> + unsigned long mmu_valid_gen;
>>>> DECLARE_BITMAP(unsync_child_bitmap, 512);
>>>>
>>>> #ifdef CONFIG_X86_32
>>>> @@ -529,6 +530,7 @@ struct kvm_arch {
>>>> unsigned int n_requested_mmu_pages;
>>>> unsigned int n_max_mmu_pages;
>>>> unsigned int indirect_shadow_pages;
>>>> + unsigned long mmu_valid_gen;
>>>> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>>> /*
>>>> * Hash table of struct kvm_mmu_page.
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index f8ca2f3..f302540 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>> __clear_sp_write_flooding_count(sp);
>>>> }
>>>>
>>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>> +{
>>>> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>>>> +}
>>>> +
>>>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>> gfn_t gfn,
>>>> gva_t gaddr,
>>>> @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>
>>>> account_shadowed(vcpu->kvm, gfn);
>>>> }
>>>> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>>>> init_shadow_page_table(sp);
>>>> trace_kvm_mmu_get_page(sp, true);
>>>> return sp;
>>>> @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>> kvm_mmu_page_unlink_children(kvm, sp);
>>>> kvm_mmu_unlink_parents(kvm, sp);
>>>> +
>>>> if (!sp->role.invalid && !sp->role.direct)
>>>> unaccount_shadowed(kvm, sp->gfn);
>>>> +
>>>> if (sp->unsync)
>>>> kvm_unlink_unsync_page(kvm, sp);
>>>> if (!sp->root_count) {
>>>> @@ -4195,6 +4203,82 @@ restart:
>>>> spin_unlock(&kvm->mmu_lock);
>>>> }
>>>>
>>>> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
>>>> +{
>>>> + struct kvm_mmu_page *sp, *node;
>>>> + LIST_HEAD(invalid_list);
>>>> +
>>>> +restart:
>>>> + list_for_each_entry_safe_reverse(sp, node,
>>>> + &kvm->arch.active_mmu_pages, link) {
>>>> + /*
>>>> + * No obsolete page exists before new created page since
>>>> + * active_mmu_pages is the FIFO list.
>>>> + */
>>>> + if (!is_obsolete_sp(kvm, sp))
>>>> + break;
>>>
>>> Can you add a comment to list_add(x, active_mmu_pages) callsites
>>> mentioning this case?
>>>
>>> Because it'll break silently if people do list_add_tail().
>>>
>>>> + /*
>>>> + * Do not repeatedly zap a root page to avoid unnecessary
>>>> + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
>>>> + * progress:
>>>> + * vcpu 0 vcpu 1
>>>> + * call vcpu_enter_guest():
>>>> + * 1): handle KVM_REQ_MMU_RELOAD
>>>> + * and require mmu-lock to
>>>> + * load mmu
>>>> + * repeat:
>>>> + * 1): zap root page and
>>>> + * send KVM_REQ_MMU_RELOAD
>>>> + *
>>>> + * 2): if (cond_resched_lock(mmu-lock))
>>>> + *
>>>> + * 2): hold mmu-lock and load mmu
>>>> + *
>>>> + * 3): see KVM_REQ_MMU_RELOAD bit
>>>> + * on vcpu->requests is set
>>>> + * then return 1 to call
>>>> + * vcpu_enter_guest() again.
>>>> + * goto repeat;
>>>> + *
>>>> + * Since we are reversely walking the list and the invalid
>>>> + * list will be moved to the head, skip the invalid page
>>>> + * can help us to avoid the infinity list walking.
>>>> + */
>>>> + if (sp->role.invalid)
>>>> + continue;
>>>
>>> But this allows completing (that is returning), with page that should
>>> be zapped still present (even though its invalid).
>>>
>>> Is another pass needed at the end to take care of the invalid pages?
>>> Which at that point must have their root_count decreased.
>>>
>> It is not different from how it work now. Invalid page can still be not
>> zapped after zap_all() completes. They are ignored by all relevant code
>> paths.
>
> It is different. kvm_mmu_zap_all() returns with 0 pages at
> mmu_active_pages today.

I do not think so.

kvm_mmu_zap_all() also keeps role.root && role.invalid pages on mmu_active_pages;
such pages have to be zapped later, in vcpu context.

>
> It is probably worthwhile to maintain behaviour (instead of every other
> code path does the right thing).
>
>>> Also this function should be serialized, that is, should not allow
>>> simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so
>>> assert(mutex_is_locked(kvm->lock)) would help.
>>>
>>> Probably fine to have simultaneous users, but not necessary AFAICS.
>> I raced this point on previous patch submission. The function can be
>> executed simultaneously only during vmexit when zap_all is called by mm
>> notifiers. During regular work the function is only called by slot
>> manipulation functions and those are serialized by slot lock.
>
> How about fix hypercall vs kvm_set_memory?

In this version, we have dropped the call to kvm_mmu_zap_all when patching
the hypercall, since it is useless.

(Patching the hypercall has been broken for a long time:
https://lkml.org/lkml/2013/5/22/388)

2013-05-28 13:24:47

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> It is only used to zap the obsolete page. Since the obsolete page
> will not be used, we need not spend time to find its unsync children
> out. Also, we delete the page from shadow page cache so that the page
> is completely isolated after call this function.
>
> The later patch will use it to collapse tlb flushes
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9b57faa..e676356 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> {
> ASSERT(is_empty_shadow_page(sp->spt));
> - hlist_del(&sp->hash_link);
> + hlist_del_init(&sp->hash_link);
> list_del(&sp->link);
> free_page((unsigned long)sp->spt);
> if (!sp->role.direct)
> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
> return zapped;
> }
>
> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> - struct list_head *invalid_list)
> +static int
> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + bool zap_unsync_children,
> + struct list_head *invalid_list)
> {
> - int ret;
> + int ret = 0;
>
> trace_kvm_mmu_prepare_zap_page(sp);
> ++kvm->stat.mmu_shadow_zapped;
> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> +
> + if (likely(zap_unsync_children))
> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> +

Why is this an important case to be optimized?

1) shadow is the uncommon, obsolete case.
2) mmu_zap_unsync_children has

if (parent->role.level == PT_PAGE_TABLE_LEVEL)
return 0;

So the large majority of pages are already optimized.

2013-05-28 13:24:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> it will flush tlb every time when it does lock-break
>
> We can reload mmu on all vcpus after updating the generation
> number so that the obsolete pages are not used on any vcpus,
> after that we do not need to flush tlb when obsolete pages
> are zapped

After that point batching is also not relevant anymore?


Still concerned about a similar case mentioned earlier:

"
Note the account for pages freed step after pages are actually
freed: as discussed with Takuya, having pages freed and freed page
accounting out of sync across mmu_lock is potentially problematic:
kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
cause problems for SLAB freeing and page allocation throttling.
"

This is a real problem: if you decrease n_used_mmu_pages at
kvm_mmu_prepare_zap_page, but only actually free the pages later
at kvm_mmu_commit_zap_page, there is the possibility of allowing
a huge number of pages to be retained. There should be a maximum number
of pages on invalid_list.

(even higher possibility if you schedule without freeing pages reported
as released!).

> Note: kvm_mmu_commit_zap_page is still needed before free
> the pages since other vcpus may be doing locklessly shadow
> page walking
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
> 1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> restart:
> list_for_each_entry_safe_reverse(sp, node,
> &kvm->arch.active_mmu_pages, link) {
> - int ret;
> -
> /*
> * No obsolete page exists before new created page since
> * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
> if (sp->role.invalid)
> continue;
>
> + /*
> + * Need not flush tlb since we only zap the sp with invalid
> + * generation number.
> + */
> if (batch >= BATCH_ZAP_PAGES &&
> - (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> + cond_resched_lock(&kvm->mmu_lock)) {
> batch = 0;
> - kvm_mmu_commit_zap_page(kvm, &invalid_list);
> - cond_resched_lock(&kvm->mmu_lock);
> goto restart;
> }
>
> - ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> - batch += ret;
> -
> - if (ret)
> - goto restart;
> + batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> + &invalid_list);
> }
>
> + /*
> + * Should flush tlb before free page tables since lockless-walking
> + * may use the pages.
> + */
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> }
>
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
> trace_kvm_mmu_invalidate_zap_all_pages(kvm);
> kvm->arch.mmu_valid_gen++;
>
> + /*
> + * Notify all vcpus to reload its shadow page table
> + * and flush TLB. Then all vcpus will switch to new
> + * shadow page table with the new mmu_valid_gen.
> + *
> + * Note: we should do this under the protection of
> + * mmu-lock, otherwise, vcpu would purge shadow page
> + * but miss tlb flush.
> + */
> + kvm_reload_remote_mmus(kvm);
> +
> kvm_zap_obsolete_pages(kvm);
> spin_unlock(&kvm->mmu_lock);
> }
> --
> 1.7.7.6

2013-05-28 13:25:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> > On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >> caused by requiring lock
> >>
> >> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >> so update the comments
> >>
> >> [ It improves kernel building 0.6% ~ 1% ]
> >
> > Can you please describe the overload in more detail? Under what scenario
> > is kernel building improved?
>
> Yes.
>
> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> every one second.
>
> [
> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> ]

I can't see why this reflects a real world scenario (or a real world
scenario with the same characteristics regarding kvm_mmu_zap_all vs faults).

The point is, it would be good to understand why this change
is improving performance. What are the cases where we break out of
kvm_mmu_zap_all due to either (need_resched || spin_needbreak) with fewer
than 10 pages zapped?


2013-05-28 14:51:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/28/2013 08:13 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>> It is only used to zap the obsolete page. Since the obsolete page
>> will not be used, we need not spend time to find its unsync children
>> out. Also, we delete the page from shadow page cache so that the page
>> is completely isolated after call this function.
>>
>> The later patch will use it to collapse tlb flushes
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9b57faa..e676356 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>> {
>> ASSERT(is_empty_shadow_page(sp->spt));
>> - hlist_del(&sp->hash_link);
>> + hlist_del_init(&sp->hash_link);
>> list_del(&sp->link);
>> free_page((unsigned long)sp->spt);
>> if (!sp->role.direct)
>> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
>> return zapped;
>> }
>>
>> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> - struct list_head *invalid_list)
>> +static int
>> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> + bool zap_unsync_children,
>> + struct list_head *invalid_list)
>> {
>> - int ret;
>> + int ret = 0;
>>
>> trace_kvm_mmu_prepare_zap_page(sp);
>> ++kvm->stat.mmu_shadow_zapped;
>> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>> +
>> + if (likely(zap_unsync_children))
>> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>> +
>
> Why is this an important case to be optimized?
>
> 1) shadow is the uncommon, obsolete case.
> 2) mmu_zap_unsync_children has
>
> if (parent->role.level == PT_PAGE_TABLE_LEVEL)
> return 0;
>
> So the large majority of pages are already optimized.

Hmm, if we zap a high-level page (e.g. level = 4), it has to walk its
children and its children's children. That is a lot of overhead.
(IMHO, this trivial optimization is still worthwhile, especially since the
change is really small.)

And there is another point I mentioned in the changelog:
"Also, we delete the page from shadow page cache so that the page
is completely isolated after call this function."
Skipping the zapping of unsync children ensures that only one page is
zapped, so we can use "hlist_del_init(&sp->hash_link)" to completely
remove the page from the mmu cache.

Now Gleb and I have agreed that skipping obsolete pages when
walking the hash list is a better way.

BTW, zapping unsync children is unnecessary, isn't it?
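
Presumably the rest of patch 09 then wraps the helper roughly like this (a
sketch based only on the hunk quoted above; the exact wrappers are not shown
in this thread):

static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
				    struct list_head *invalid_list)
{
	/* normal path: also zap any unsync children */
	return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
}

static int kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm,
					     struct kvm_mmu_page *sp,
					     struct list_head *invalid_list)
{
	/* obsolete pages will never be used again: skip unsync children */
	return __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
}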

2013-05-28 15:02:21

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>> caused by requiring lock
>>>>
>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>> so update the comments
>>>>
>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>
>>> Can you please describe the overload in more detail? Under what scenario
>>> is kernel building improved?
>>
>> Yes.
>>
>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>> every one second.
>>
>> [
>> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>> ]
>
> Can't see why it reflects real world scenario (or a real world
> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>
> Point is, it would be good to understand why this change
> is improving performance? What are these cases where breaking out of
> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> < 10 ?

When the guest reads the ROM, QEMU updates the memory slot to map the
device's firmware; that is why kvm_mmu_zap_all can be called in this scenario.

The reasons why it hurts performance are:
1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient,
all other vcpus have to wait a long time to do I/O.

2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
requests from other vcpus.

Is that enough?


2013-05-28 15:19:20

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>> it will flush tlb every time when it does lock-break
>>
>> We can reload mmu on all vcpus after updating the generation
>> number so that the obsolete pages are not used on any vcpus,
>> after that we do not need to flush tlb when obsolete pages
>> are zapped
>
> After that point batching is also not relevant anymore?

No... without batching, we do not know how much time we will
spend zapping pages. That is not good for the case where
zap_all_pages is called in vcpu context.
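
For reference, the batching being discussed boils down to something like the
following inside kvm_zap_obsolete_pages() (a condensed sketch of the hunks
quoted above, after the patch 10 changes):

#define BATCH_ZAP_PAGES	10

		/* only consider a lock break after a full batch was zapped */
		if (batch >= BATCH_ZAP_PAGES &&
		      cond_resched_lock(&kvm->mmu_lock)) {
			batch = 0;
			goto restart;
		}

		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
							   &invalid_list);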

>
>
> Still concerned about a similar case mentioned earlier:
>
> "
> Note the account for pages freed step after pages are actually
> freed: as discussed with Takuya, having pages freed and freed page
> accounting out of sync across mmu_lock is potentially problematic:
> kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
> cause problems for SLAB freeing and page allocation throttling.
> "
>
> This is a real problem, if you decrease n_used_mmu_pages at
> kvm_mmu_prepare_zap_page, but only actually free pages later
> at kvm_mmu_commit_zap_page, there is the possibility of allowing
> a huge number to be retained. There should be a maximum number of pages
> at invalid_list.
>
> (even higher possibility if you schedule without freeing pages reported
> as released!).
>
>> Note: kvm_mmu_commit_zap_page is still needed before free
>> the pages since other vcpus may be doing locklessly shadow
>> page walking

Ah, yes, I agree with you.

We can introduce a list, say kvm->arch.obsolete_pages, to link all of the
zapped pages; the page shrinker will free the pages on that list first.

Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you please
let them be merged first, and I will add the comments and the tlb optimization later?
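
A very rough sketch of that idea (purely hypothetical; the field and function
names are invented for illustration and this is not part of the posted series):

	/* in struct kvm_arch: pages prepared for zap but not yet freed */
	struct list_head obsolete_pages;

/* shrinker path: free already-prepared pages before zapping live ones */
static void kvm_mmu_free_obsolete_pages(struct kvm *kvm)
{
	LIST_HEAD(invalid_list);

	spin_lock(&kvm->mmu_lock);
	list_splice_init(&kvm->arch.obsolete_pages, &invalid_list);
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}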

2013-05-29 03:03:30

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/28/2013 11:19 PM, Xiao Guangrong wrote:
> On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>> it will flush tlb every time when it does lock-break
>>>
>>> We can reload mmu on all vcpus after updating the generation
>>> number so that the obsolete pages are not used on any vcpus,
>>> after that we do not need to flush tlb when obsolete pages
>>> are zapped
>>
>> After that point batching is also not relevant anymore?
>
> no... without batching, we do not know how much time we will
> spend to zap pages. It is not good for the case that
> zap_all_pages is called in the vcpu context.
>
>>
>>
>> Still concerned about a similar case mentioned earlier:
>>
>> "
>> Note the account for pages freed step after pages are actually
>> freed: as discussed with Takuya, having pages freed and freed page
>> accounting out of sync across mmu_lock is potentially problematic:
>> kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
>> cause problems for SLAB freeing and page allocation throttling.
>> "
>>
>> This is a real problem, if you decrease n_used_mmu_pages at
>> kvm_mmu_prepare_zap_page, but only actually free pages later
>> at kvm_mmu_commit_zap_page, there is the possibility of allowing
>> a huge number to be retained. There should be a maximum number of pages
>> at invalid_list.
>>
>> (even higher possibility if you schedule without freeing pages reported
>> as released!).
>>
>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>> the pages since other vcpus may be doing locklessly shadow
>>> page walking
>
> Ah, yes, i agree with you.
>
> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> zapped-page, the page-shrink will free the page on that list first.
>
> Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
> let them merged first, and do add some comments and tlb optimization later?

Exclude patch 11 please, since it depends on the "collapse" optimization.

2013-05-29 12:40:29

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>> the pages since other vcpus may be doing locklessly shadow
> >>> page walking
> >
> > Ah, yes, i agree with you.
> >
> > We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > zapped-page, the page-shrink will free the page on that list first.
> >
> > Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
> > let them merged first, and do add some comments and tlb optimization later?
>
> Exclude patch 11 please, since it depends on the "collapse" optimization.

I'm fine with patch 1 being merged. I think the remaining patches need better
understanding or explanation. The problems i see are:

1) The magic number "10" to zap before considering reschedule is
annoying. It would be good to understand why it is needed at all.

But then again, the testcase is measuring kvm_mmu_zap_all performance
alone which we know is not a common operation, so perhaps there is
no need for that minimum-pages-to-zap-before-reschedule.

2) The problem above (retention of large number of pages while zapping)
can be fatal, it can lead to OOM and host crash.

3) Make sure that the introduction of obsolete pages cannot lead to a
huge number of shadow pages hanging around (the correct reason you gave for
not merging https://patchwork.kernel.org/patch/2309641/ is not true anymore
with obsolete pages).

Other than these points, i'm fine with obsolete pages optimization
to speed up kvm_mmu_zap_all and the rest of the patchset.

2013-05-29 12:40:28

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> > On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >>>> caused by requiring lock
> >>>>
> >>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >>>> so update the comments
> >>>>
> >>>> [ It improves kernel building 0.6% ~ 1% ]
> >>>
> >>> Can you please describe the overload in more detail? Under what scenario
> >>> is kernel building improved?
> >>
> >> Yes.
> >>
> >> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> >> every one second.
> >>
> >> [
> >> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >> ]
> >
> > Can't see why it reflects real world scenario (or a real world
> > scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> >
> > Point is, it would be good to understand why this change
> > is improving performance? What are these cases where breaking out of
> > kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> > < 10 ?
>
> When guest read ROM, qemu will set the memory to map the device's firmware,
> that is why kvm_mmu_zap_all can be called in the scenario.
>
> The reasons why it heart the performance are:
> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
> when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
> other vcpus need wait a long time to do I/O.
>
> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
> request from other vcpus.
>
> Is it enough?

That is no problem. The problem is why you chose "10" as the minimum number of
pages to zap before considering a reschedule. I would expect the need to
reschedule to be rare enough that one kvm_mmu_zap_all instance (between
schedule in and schedule out) would be able to release no less than a
thousand pages.

So I'd like to understand better what the driver for this change is (this
was the original question).

2013-05-29 12:40:26

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On Tue, May 28, 2013 at 10:51:38PM +0800, Xiao Guangrong wrote:
> On 05/28/2013 08:13 AM, Marcelo Tosatti wrote:
> > On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
> >> It is only used to zap the obsolete page. Since the obsolete page
> >> will not be used, we need not spend time to find its unsync children
> >> out. Also, we delete the page from shadow page cache so that the page
> >> is completely isolated after call this function.
> >>
> >> The later patch will use it to collapse tlb flushes
> >>
> >> Signed-off-by: Xiao Guangrong <[email protected]>
> >> ---
> >> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> >> 1 files changed, 41 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 9b57faa..e676356 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> >> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> >> {
> >> ASSERT(is_empty_shadow_page(sp->spt));
> >> - hlist_del(&sp->hash_link);
> >> + hlist_del_init(&sp->hash_link);
> >> list_del(&sp->link);
> >> free_page((unsigned long)sp->spt);
> >> if (!sp->role.direct)
> >> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
> >> return zapped;
> >> }
> >>
> >> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >> - struct list_head *invalid_list)
> >> +static int
> >> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >> + bool zap_unsync_children,
> >> + struct list_head *invalid_list)
> >> {
> >> - int ret;
> >> + int ret = 0;
> >>
> >> trace_kvm_mmu_prepare_zap_page(sp);
> >> ++kvm->stat.mmu_shadow_zapped;
> >> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> >> +
> >> + if (likely(zap_unsync_children))
> >> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> >> +
> >
> > Why is this an important case to be optimized?
> >
> > 1) shadow is the uncommon, obsolete case.
> > 2) mmu_zap_unsync_children has
> >
> > if (parent->role.level == PT_PAGE_TABLE_LEVEL)
> > return 0;
> >
> > So the large majority of pages are already optimized.
>
> Hmm, if we zap the high level page (e.g level = 4), it should walk its
> children and its children's children. It is high overload.
> (IMHO, trivial optimization is still necessary, especially, the change
> is really slight.)
>
> And, there is another point me mentioned in the changelog:
> "Also, we delete the page from shadow page cache so that the page
> is completely isolated after call this function."
> Skipping zapping unsync-children can ensure that only one page is
> zapped so that we can use "hlist_del_init(&sp->hash_link)" to completely
> remove the page from mmu-cache.
>
> Now, Gleb and i got a agreement that skipping obsolete page when
> walking hash list is a better way.
>
> BTW, zapping unsync-children is unnecessary, is it?

It is necessary that, if an unsync page exists, invlpg emulation
is able to reach it, or that it is synchronized at kvm_mmu_get_page
time.

You transfer the synchronization work to pagefault time, which directly
affects guest performance, while it could have been done by the host
(this was the reason for zapping unsync children).

2013-05-29 13:09:21

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>>>> caused by requiring lock
>>>>>>
>>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>>>> so update the comments
>>>>>>
>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>
>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>> is kernel building improved?
>>>>
>>>> Yes.
>>>>
>>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>>>> every one second.
>>>>
>>>> [
>>>> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>> ]
>>>
>>> Can't see why it reflects real world scenario (or a real world
>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>
>>> Point is, it would be good to understand why this change
>>> is improving performance? What are these cases where breaking out of
>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>> < 10 ?
>>
>> When guest read ROM, qemu will set the memory to map the device's firmware,
>> that is why kvm_mmu_zap_all can be called in the scenario.
>>
>> The reasons why it heart the performance are:
>> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>> when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>> other vcpus need wait a long time to do I/O.
>>
>> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>> request from other vcpus.
>>
>> Is it enough?
>
> That is no problem. The problem is why you chose "10" as the minimum number of
> pages to zap before considering reschedule. I would expect the need to

Well, my description above explained why batch zapping is needed - we do
not want the vcpu to spend lots of time zapping all pages, because it hurts
the other running vcpus.

But why the batch size is "10"... I cannot really answer this; I just guessed
that '10' keeps the vcpu from spending too long in zap_all_pages while not
starving the mmu-lock. "10" is a speculative value and I am not sure it is
the best value, but at least I think it works.

> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> schedule in and schedule out) to be able to release no less than a
> thousand pages.

Unfortunately, no.

This is the information from my reply to Gleb, in the mail where he raised
the question of why the "collapse tlb flush" is needed:

======
It seems no.
Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
is easily contended. I did the simple track:

+ int num = 0;
restart:
list_for_each_entry_safe_reverse(sp, node,
&kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
if (batch >= BATCH_ZAP_PAGES &&
cond_resched_lock(&kvm->mmu_lock)) {
batch = 0;
+ num++;
goto restart;
}

@@ -4277,6 +4278,7 @@ restart:
* may use the pages.
*/
kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ printk("lock-break: %d.\n", num);
}

I do read pci rom when doing kernel building in the guest which
has 1G memory and 4vcpus with ept enabled, this is the normal
workload and normal configuration.

# dmesg
[ 2338.759099] lock-break: 8.
[ 2339.732442] lock-break: 5.
[ 2340.904446] lock-break: 3.
[ 2342.513514] lock-break: 3.
[ 2343.452229] lock-break: 3.
[ 2344.981599] lock-break: 4.

Basically, we need to break many times.
======

You can see we had to break at least 3 times to zap all pages even though we
zapped 10 pages per batch. Obviously it would need to break even more times
without batch zapping.



2013-05-29 13:19:50

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
>>>>> the pages since other vcpus may be doing locklessly shadow
>>>>> page walking
>>>
>>> Ah, yes, i agree with you.
>>>
>>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
>>> zapped-page, the page-shrink will free the page on that list first.
>>>
>>> Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
>>> let them merged first, and do add some comments and tlb optimization later?
>>
>> Exclude patch 11 please, since it depends on the "collapse" optimization.
>
> I'm fine with patch 1 being merged. I think the remaining patches need better
> understanding or explanation. The problems i see are:
>
> 1) The magic number "10" to zap before considering reschedule is
> annoying. It would be good to understand why it is needed at all.

......

>
> But then again, the testcase is measuring kvm_mmu_zap_all performance
> alone which we know is not a common operation, so perhaps there is
> no need for that minimum-pages-to-zap-before-reschedule.

Well, although this is not a common operation, it can be triggered by a
VCPU - if one VCPU takes a long time in zap-all-pages, the other vcpus are
left missing IPI syncs or missing I/O. This easily causes soft lockups if
the vcpu is doing memslot-related things.

>
> 2) The problem above (retention of large number of pages while zapping)
> can be fatal, it can lead to OOM and host crash.

This problem is introduced by this patch (patch 10); without this patch,
the pages are always zapped before the mmu-lock is released.

>
> 3) Make sure that introduction of obsolete pages can not lead to a
> huge number of shadow pages around (the correct reason you gave for not merging
> https://patchwork.kernel.org/patch/2309641/ is not true anymore
> obsolete pages).

Actually, this question is the same as 2); it is also the page-reclaim
problem. Without patch #10, there is no problem.

>
> Other than these points, i'm fine with obsolete pages optimization
> to speed up kvm_mmu_zap_all and the rest of the patchset.

Thank you!

2013-05-29 13:34:09

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> > On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> >> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> >>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
> >>>>>> caused by requiring lock
> >>>>>>
> >>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
> >>>>>> so update the comments
> >>>>>>
> >>>>>> [ It improves kernel building 0.6% ~ 1% ]
> >>>>>
> >>>>> Can you please describe the overload in more detail? Under what scenario
> >>>>> is kernel building improved?
> >>>>
> >>>> Yes.
> >>>>
> >>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
> >>>> every one second.
> >>>>
> >>>> [
> >>>> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>>> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >>>> ]
> >>>
> >>> Can't see why it reflects real world scenario (or a real world
> >>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> >>>
> >>> Point is, it would be good to understand why this change
> >>> is improving performance? What are these cases where breaking out of
> >>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
> >>> < 10 ?
> >>
> >> When guest read ROM, qemu will set the memory to map the device's firmware,
> >> that is why kvm_mmu_zap_all can be called in the scenario.
> >>
> >> The reasons why it heart the performance are:
> >> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
> >> when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
> >> other vcpus need wait a long time to do I/O.
> >>
> >> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
> >> request from other vcpus.
> >>
> >> Is it enough?
> >
> > That is no problem. The problem is why you chose "10" as the minimum number of
> > pages to zap before considering reschedule. I would expect the need to
>
> Well, my description above explained why batch-zapping is needed - we do
> not want the vcpu spend lots of time to zap all pages because it hurts other
> vcpus running.
>
> But, why the batch page number is "10"... I can not answer this, i just guessed
> that '10' can make vcpu do not spend long time on zap_all_pages and do
> not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
> it is the best value but at lease, i think it can work.
>
> > reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> > schedule in and schedule out) to be able to release no less than a
> > thousand pages.
>
> Unfortunately, no.
>
> This information is I replied Gleb in his mail where he raced a question that
> why "collapse tlb flush is needed":
>
> ======
> It seems no.
> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did the simple track:
>
> + int num = 0;
> restart:
> list_for_each_entry_safe_reverse(sp, node,
> &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
> if (batch >= BATCH_ZAP_PAGES &&
> cond_resched_lock(&kvm->mmu_lock)) {
> batch = 0;
> + num++;
> goto restart;
> }
>
> @@ -4277,6 +4278,7 @@ restart:
> * may use the pages.
> */
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> + printk("lock-break: %d.\n", num);
> }
>
> I do read pci rom when doing kernel building in the guest which
> has 1G memory and 4vcpus with ept enabled, this is the normal
> workload and normal configuration.
>
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
>
> Basically, we need to break many times.
> ======
>
> You can see we should break 3 times to zap all pages even if we have zapoed
> 10 pages in batch. It is obviously that it need break more times without
> batch-zapping.

Yes, but this is not a real scenario, nor does it even describe a real
scenario, as far as i know.

Are you sure this minimum-batching-before-considering-reschedule is needed
even after the obsolete pages optimization?

I fail to see why.

2013-05-29 13:39:21

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> This information is I replied Gleb in his mail where he raced a question that
> why "collapse tlb flush is needed":
>
> ======
> It seems no.
> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did the simple track:
>
> + int num = 0;
> restart:
> list_for_each_entry_safe_reverse(sp, node,
> &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
> if (batch >= BATCH_ZAP_PAGES &&
> cond_resched_lock(&kvm->mmu_lock)) {
> batch = 0;
> + num++;
> goto restart;
> }
>
> @@ -4277,6 +4278,7 @@ restart:
> * may use the pages.
> */
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> + printk("lock-break: %d.\n", num);
> }
>
> I do read pci rom when doing kernel building in the guest which
> has 1G memory and 4vcpus with ept enabled, this is the normal
> workload and normal configuration.
>
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
>
> Basically, we need to break many times.

Should measure kvm_mmu_zap_all latency.

> ======
>
> You can see we should break 3 times to zap all pages even if we have zapoed
> 10 pages in batch. It is obviously that it need break more times without
> batch-zapping.

Again, breaking should be no problem, what matters is latency. Please
measure kvm_mmu_zap_all latency after all optimizations to justify
this minimum batching.

2013-05-29 13:43:39

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

On 05/29/2013 08:25 PM, Marcelo Tosatti wrote:
> On Tue, May 28, 2013 at 10:51:38PM +0800, Xiao Guangrong wrote:
>> On 05/28/2013 08:13 AM, Marcelo Tosatti wrote:
>>> On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
>>>> It is only used to zap the obsolete page. Since the obsolete page
>>>> will not be used, we need not spend time to find its unsync children
>>>> out. Also, we delete the page from shadow page cache so that the page
>>>> is completely isolated after call this function.
>>>>
>>>> The later patch will use it to collapse tlb flushes
>>>>
>>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>>> ---
>>>> arch/x86/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
>>>> 1 files changed, 41 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 9b57faa..e676356 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>>> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>>>> {
>>>> ASSERT(is_empty_shadow_page(sp->spt));
>>>> - hlist_del(&sp->hash_link);
>>>> + hlist_del_init(&sp->hash_link);
>>>> list_del(&sp->link);
>>>> free_page((unsigned long)sp->spt);
>>>> if (!sp->role.direct)
>>>> @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
>>>> return zapped;
>>>> }
>>>>
>>>> -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>> - struct list_head *invalid_list)
>>>> +static int
>>>> +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>> + bool zap_unsync_children,
>>>> + struct list_head *invalid_list)
>>>> {
>>>> - int ret;
>>>> + int ret = 0;
>>>>
>>>> trace_kvm_mmu_prepare_zap_page(sp);
>>>> ++kvm->stat.mmu_shadow_zapped;
>>>> - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>> +
>>>> + if (likely(zap_unsync_children))
>>>> + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>> +
>>>
>>> Why is this an important case to be optimized?
>>>
>>> 1) shadow is the uncommon, obsolete case.
>>> 2) mmu_zap_unsync_children has
>>>
>>> if (parent->role.level == PT_PAGE_TABLE_LEVEL)
>>> return 0;
>>>
>>> So the large majority of pages are already optimized.
>>
>> Hmm, if we zap the high level page (e.g level = 4), it should walk its
>> children and its children's children. It is high overload.
>> (IMHO, trivial optimization is still necessary, especially, the change
>> is really slight.)
>>
>> And, there is another point me mentioned in the changelog:
>> "Also, we delete the page from shadow page cache so that the page
>> is completely isolated after call this function."
>> Skipping zapping unsync-children can ensure that only one page is
>> zapped so that we can use "hlist_del_init(&sp->hash_link)" to completely
>> remove the page from mmu-cache.
>>
>> Now, Gleb and i got a agreement that skipping obsolete page when
>> walking hash list is a better way.
>>
>> BTW, zapping unsync-children is unnecessary, is it?
>
> It is necessary that if an unsync page exists, that
> invlpg emulation is able to reach it, or that at kvm_mmu_get_page
> time they are synchronized.

Hmmm? It is not always better.

If an unsync page is zapped, the mmu will map a newly allocated page whose
entries are all nonpresent. That can cause more #PF than syncing the page.
Especially in the invlpg case, you would zap a page that is still mapped in
another vcpu's page table and is currently being used.

Also, it does possibly-unneeded work up front - it spends time walking and
zapping all the unsync children even though they may never be used at all.
Delaying that work until the pages are actually used is better.

>
> You transfer the synchronization work to pagefault time, which directly
> affects guest performance, while it could have been done by the host
> (this was the reason for zapping unsync children).

It seems not - in most cases zap_page is done in vcpu context, not in the host.
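
For reference, the hash-list direction Gleb and i agreed on above is roughly
the sketch below. is_obsolete_sp() is the helper introduced earlier in this
series; the loop only illustrates how kvm_mmu_get_page() could skip obsolete
pages while walking the hash list - it is a sketch, not a tested patch:

static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
        /* pages created before the last invalidation carry a stale generation */
        return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

        /* in kvm_mmu_get_page(), when searching the hash list: */
        for_each_gfn_sp(vcpu->kvm, sp, gfn) {
                if (is_obsolete_sp(vcpu->kvm, sp))
                        continue;

                /* reuse the existing shadow page as before */
        }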

2013-05-29 14:00:40

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/29/2013 09:21 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
>>> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>>>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>>>>>> caused by requiring lock
>>>>>>>>
>>>>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>>>>>> so update the comments
>>>>>>>>
>>>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>>>
>>>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>>>> is kernel building improved?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>>>>>> every one second.
>>>>>>
>>>>>> [
>>>>>> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>>> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>>>> ]
>>>>>
>>>>> Can't see why it reflects real world scenario (or a real world
>>>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>>>
>>>>> Point is, it would be good to understand why this change
>>>>> is improving performance? What are these cases where breaking out of
>>>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>>>> < 10 ?
>>>>
>>>> When guest read ROM, qemu will set the memory to map the device's firmware,
>>>> that is why kvm_mmu_zap_all can be called in the scenario.
>>>>
>>>> The reasons why it heart the performance are:
>>>> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>>>> when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>>>> other vcpus need wait a long time to do I/O.
>>>>
>>>> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>>>> request from other vcpus.
>>>>
>>>> Is it enough?
>>>
>>> That is no problem. The problem is why you chose "10" as the minimum number of
>>> pages to zap before considering reschedule. I would expect the need to
>>
>> Well, my description above explained why batch-zapping is needed - we do
>> not want the vcpu spend lots of time to zap all pages because it hurts other
>> vcpus running.
>>
>> But, why the batch page number is "10"... I can not answer this, i just guessed
>> that '10' can make vcpu do not spend long time on zap_all_pages and do
>> not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
>> it is the best value but at lease, i think it can work.
>>
>>> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
>>> schedule in and schedule out) to be able to release no less than a
>>> thousand pages.
>>
>> Unfortunately, no.
>>
>> This information is I replied Gleb in his mail where he raced a question that
>> why "collapse tlb flush is needed":
>>
>> ======
>> It seems no.
>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>> is easily contended. I did the simple track:
>>
>> + int num = 0;
>> restart:
>> list_for_each_entry_safe_reverse(sp, node,
>> &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>> if (batch >= BATCH_ZAP_PAGES &&
>> cond_resched_lock(&kvm->mmu_lock)) {
>> batch = 0;
>> + num++;
>> goto restart;
>> }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>> * may use the pages.
>> */
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> + printk("lock-break: %d.\n", num);
>> }
>>
>> I do read pci rom when doing kernel building in the guest which
>> has 1G memory and 4vcpus with ept enabled, this is the normal
>> workload and normal configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
>> ======
>>
>> You can see we should break 3 times to zap all pages even if we have zapoed
>> 10 pages in batch. It is obviously that it need break more times without
>> batch-zapping.
>
> Yes, but this is not a real scenario, nor does it even describe a real
> scenario, as far as i know.

Aha.

Okay, maybe "read rom" is not the common case, but a vcpu can trigger it, or a
guest driver may do so in the future. What happens if a vcpu triggers it? The
worst case is that one vcpu keeps breaking out of mmu-lock because another vcpu
is doing intense memory access, while the rest of the vcpus are waiting for IO
or an IPI. That easily leads to a soft lockup.

Even worse, if host memory is really low and the host keeps trying to reclaim
qemu's memory, the lock stays hot and we cannot zap even one page before being
rescheduled.

>
> Are you sure this minimum-batching-before-considering-reschedule is needed
> even after the obsolete pages optimization?

Yes, this trace was taken after all the patches in this series were applied.

2013-05-29 14:02:24

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/29/2013 09:32 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> This information is I replied Gleb in his mail where he raced a question that
>> why "collapse tlb flush is needed":
>>
>> ======
>> It seems no.
>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>> is easily contended. I did the simple track:
>>
>> + int num = 0;
>> restart:
>> list_for_each_entry_safe_reverse(sp, node,
>> &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>> if (batch >= BATCH_ZAP_PAGES &&
>> cond_resched_lock(&kvm->mmu_lock)) {
>> batch = 0;
>> + num++;
>> goto restart;
>> }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>> * may use the pages.
>> */
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> + printk("lock-break: %d.\n", num);
>> }
>>
>> I do read pci rom when doing kernel building in the guest which
>> has 1G memory and 4vcpus with ept enabled, this is the normal
>> workload and normal configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
>
> Should measure kvm_mmu_zap_all latency.
>
>> ======
>>
>> You can see we should break 3 times to zap all pages even if we have zapoed
>> 10 pages in batch. It is obviously that it need break more times without
>> batch-zapping.
>
> Again, breaking should be no problem, what matters is latency. Please
> measure kvm_mmu_zap_all latency after all optimizations to justify
> this minimum batching.

Okay, okay. I will benchmark the latency.

2013-05-29 16:04:05

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On 05/29/2013 10:02 PM, Xiao Guangrong wrote:
> On 05/29/2013 09:32 PM, Marcelo Tosatti wrote:
>> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>>> This information is I replied Gleb in his mail where he raced a question that
>>> why "collapse tlb flush is needed":
>>>
>>> ======
>>> It seems no.
>>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>>> is easily contended. I did the simple track:
>>>
>>> + int num = 0;
>>> restart:
>>> list_for_each_entry_safe_reverse(sp, node,
>>> &kvm->arch.active_mmu_pages, link) {
>>> @@ -4265,6 +4265,7 @@ restart:
>>> if (batch >= BATCH_ZAP_PAGES &&
>>> cond_resched_lock(&kvm->mmu_lock)) {
>>> batch = 0;
>>> + num++;
>>> goto restart;
>>> }
>>>
>>> @@ -4277,6 +4278,7 @@ restart:
>>> * may use the pages.
>>> */
>>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>>> + printk("lock-break: %d.\n", num);
>>> }
>>>
>>> I do read pci rom when doing kernel building in the guest which
>>> has 1G memory and 4vcpus with ept enabled, this is the normal
>>> workload and normal configuration.
>>>
>>> # dmesg
>>> [ 2338.759099] lock-break: 8.
>>> [ 2339.732442] lock-break: 5.
>>> [ 2340.904446] lock-break: 3.
>>> [ 2342.513514] lock-break: 3.
>>> [ 2343.452229] lock-break: 3.
>>> [ 2344.981599] lock-break: 4.
>>>
>>> Basically, we need to break many times.
>>
>> Should measure kvm_mmu_zap_all latency.
>>
>>> ======
>>>
>>> You can see we should break 3 times to zap all pages even if we have zapoed
>>> 10 pages in batch. It is obviously that it need break more times without
>>> batch-zapping.
>>
>> Again, breaking should be no problem, what matters is latency. Please
>> measure kvm_mmu_zap_all latency after all optimizations to justify
>> this minimum batching.
>
> Okay, okay. I will benchmark the latency.

Okay, I have done the test. The test environment is the same as before:
reading the pci rom while doing a kernel build in a guest which has 1G memory
and 4 vcpus with ept enabled - a normal workload and normal configuration.

Batch-zapped:
Guest:
# cat /sys/bus/pci/devices/0000\:00\:03.0/rom
# free -m
total used free shared buffers cached
Mem: 975 793 181 0 6 438
-/+ buffers/cache: 347 627
Swap: 2015 43 1972

Host shows:
[ 2229.918558] lock-break: 5.
[ 2229.918564] kvm_mmu_invalidate_zap_all_pages: 174706e.


No-batch:
Guest:
# cat /sys/bus/pci/devices/0000\:00\:03.0/rom
# free -m
total used free shared buffers cached
Mem: 975 843 131 0 17 476
-/+ buffers/cache: 348 626
Swap: 2015 2

Host shows:
[ 2931.675285] lock-break: 13.
[ 2931.675291] kvm_mmu_invalidate_zap_all_pages: 69c1676.

That means, with nearly the same amount of memory accessed in the guest:
- batch-zapped needs to break 5 times; the latency is 174706e.
- no-batch needs to break 13 times; the latency is 69c1676.
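
(Assuming local_clock() reports nanoseconds here - the printk below prints the
delta with %llx, i.e. in hex - the two figures decode to roughly:

    0x174706e ns =  24,408,174 ns ~=  24.4 ms   (batch-zapped)
    0x69c1676 ns = 110,892,662 ns ~= 110.9 ms   (no-batch)

so batching cuts the zap-all latency by about 4.5x in this run.)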

The code change to track the latency:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 055d675..a66f21b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4233,13 +4233,13 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
         spin_unlock(&kvm->mmu_lock);
 }

-#define BATCH_ZAP_PAGES 10
+#define BATCH_ZAP_PAGES 0
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
         struct kvm_mmu_page *sp, *node;
         LIST_HEAD(invalid_list);
         int batch = 0;
-
+        int num = 0;
 restart:
         list_for_each_entry_safe_reverse(sp, node,
               &kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
                 if (batch >= BATCH_ZAP_PAGES &&
                       cond_resched_lock(&kvm->mmu_lock)) {
                         batch = 0;
+                        num++;
                         goto restart;
                 }

@@ -4277,6 +4278,7 @@ restart:
          * may use the pages.
          */
         kvm_mmu_commit_zap_page(kvm, &invalid_list);
+        printk("lock-break: %d.\n", num);
 }

 /*
@@ -4290,7 +4292,12 @@ restart:
  */
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
 {
+        u64 start;
+
         spin_lock(&kvm->mmu_lock);
+
+        start = local_clock();
+
         trace_kvm_mmu_invalidate_zap_all_pages(kvm);
         kvm->arch.mmu_valid_gen++;

@@ -4306,6 +4313,9 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
         kvm_reload_remote_mmus(kvm);

         kvm_zap_obsolete_pages(kvm);
+
+        printk("%s: %llx.\n", __FUNCTION__, local_clock() - start);
+
         spin_unlock(&kvm->mmu_lock);
 }


2013-05-30 00:54:14

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>>>> the pages since other vcpus may be doing locklessly shadow
> >>>>> page walking
> >>>
> >>> Ah, yes, i agree with you.
> >>>
> >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> >>> zapped-page, the page-shrink will free the page on that list first.
> >>>
> >>> Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
> >>> let them merged first, and do add some comments and tlb optimization later?
> >>
> >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> >
> > I'm fine with patch 1 being merged. I think the remaining patches need better
> > understanding or explanation. The problems i see are:
> >
> > 1) The magic number "10" to zap before considering reschedule is
> > annoying. It would be good to understand why it is needed at all.
>
> ......
>
> >
> > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > alone which we know is not a common operation, so perhaps there is
> > no need for that minimum-pages-to-zap-before-reschedule.
>
> Well. Although this is not a common operation, it can be triggered by a
> VCPU - if one VCPU takes a long time on zap-all-pages, the other vcpus miss
> IPI-sync or miss IO. This can easily cause soft lockups if the vcpu is doing
> memslot-related things.
>
+1. If it is triggerable by a guest it may slow down the guest, but we
should not allow it to slow down the host.

--
Gleb.

2013-05-30 16:24:59

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Thu, 30 May 2013 03:53:38 +0300
Gleb Natapov <[email protected]> wrote:

> On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> > >>>>> the pages since other vcpus may be doing locklessly shadow
> > >>>>> page walking
> > >>>
> > >>> Ah, yes, i agree with you.
> > >>>
> > >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > >>> zapped-page, the page-shrink will free the page on that list first.
> > >>>
> > >>> Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
> > >>> let them merged first, and do add some comments and tlb optimization later?
> > >>
> > >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> > >
> > > I'm fine with patch 1 being merged. I think the remaining patches need better
> > > understanding or explanation. The problems i see are:
> > >
> > > 1) The magic number "10" to zap before considering reschedule is
> > > annoying. It would be good to understand why it is needed at all.
> >
> > ......
> >
> > >
> > > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > > alone which we know is not a common operation, so perhaps there is
> > > no need for that minimum-pages-to-zap-before-reschedule.
> >
> > Well. Although this is not a common operation, it can be triggered by a
> > VCPU - if one VCPU takes a long time on zap-all-pages, the other vcpus miss
> > IPI-sync or miss IO. This can easily cause soft lockups if the vcpu is doing
> > memslot-related things.
> >
> +1. If it is triggerable by a guest it may slow down the guest, but we
> should not allow it to slow down the host.
>

Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
itself, but if you're going to take patch 4, please at least add a warning
in the changelog that the magic number "10" was selected without good enough
reasoning.

"[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
others to change the number later.

I actually once tried to do a similar thing for other code. So I have a
possible reasoning for this, and 10 should probably be changed later.

Takuya

2013-05-30 17:11:11

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

On Fri, 31 May 2013 01:24:43 +0900
Takuya Yoshikawa <[email protected]> wrote:

> On Thu, 30 May 2013 03:53:38 +0300
> Gleb Natapov <[email protected]> wrote:
>
> > On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > > On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > > > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> > > >>>>> the pages since other vcpus may be doing locklessly shadow
> > > >>>>> page walking
> > > >>>
> > > >>> Ah, yes, i agree with you.
> > > >>>
> > > >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > > >>> zapped-page, the page-shrink will free the page on that list first.
> > > >>>
> > > >>> Marcelo, if you do not have objection on patch 1 ~ 8 and 11, could you please
> > > >>> let them merged first, and do add some comments and tlb optimization later?
> > > >>
> > > >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> > > >
> > > > I'm fine with patch 1 being merged. I think the remaining patches need better
> > > > understanding or explanation. The problems i see are:
> > > >
> > > > 1) The magic number "10" to zap before considering reschedule is
> > > > annoying. It would be good to understand why it is needed at all.
> > >
> > > ......
> > >
> > > >
> > > > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > > > alone which we know is not a common operation, so perhaps there is
> > > > no need for that minimum-pages-to-zap-before-reschedule.
> > >
> > > Well. Although this is not a common operation, it can be triggered by a
> > > VCPU - if one VCPU takes a long time on zap-all-pages, the other vcpus miss
> > > IPI-sync or miss IO. This can easily cause soft lockups if the vcpu is doing
> > > memslot-related things.
> > >
> > +1. If it is triggerable by a guest it may slow down the guest, but we
> > should not allow it to slow down the host.
> >
>
> Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
> itself, but if you're going to take patch 4, please at least add a warning
> in the changelog that the magic number "10" was selected without good enough
> reasoning.
>
> "[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
> others to change the number later.
>
> I actually once tried to do a similar thing for other code. So I have a
> possible reasoning for this, and 10 should probably be changed later.
>

In this case, the solution seems to be very simple: just drop spin_needbreak()
and leave need_resched() alone.

This way we can guarantee that zap-all will get a fair amount of CPU time for
each scheduling from the host scheduler's point of view. Of course this can
block other VCPU threads waiting for mmu_lock during that time slice, but
should be much better than blocking them for some magical number of zappings.
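
Roughly, in kvm_zap_obsolete_pages() that would look something like the sketch
below (just an illustration, not a tested patch; I commit the pending
invalid_list before dropping mmu_lock only to keep the sketch obviously safe):

        if (need_resched()) {
                /*
                 * Break only when the host scheduler wants the CPU back;
                 * mmu_lock contention alone no longer causes a break, so
                 * no magic batch number is needed.
                 */
                kvm_mmu_commit_zap_page(kvm, &invalid_list);
                spin_unlock(&kvm->mmu_lock);
                cond_resched();
                spin_lock(&kvm->mmu_lock);
                goto restart;
        }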

We also need to remember that spin_needbreak() does not do anything for some
preempt config settings.

Takuya