2013-04-16 06:33:07

by Xiao Guangrong

Subject: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

This patchset is based on my previous two patchsets:
[PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
(https://lkml.org/lkml/2013/4/1/2)

[PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
(https://lkml.org/lkml/2013/4/1/134)

Changelog:
V3:
completely redesigned the algorithm, please see below.

V2:
- do not reset n_requested_mmu_pages and n_max_mmu_pages
- batch-free root shadow pages to reduce vcpu notification and mmu-lock
contention
- remove the first patch that introduced kvm->arch.mmu_cache, since in this
version we only 'memset zero' the hash table rather than all mmu cache
members
- remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow: it holds mmu-lock while walking
and zapping all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things get
worse as the guest uses more memory or vcpus, so this approach does not
scale.

* Idea
KVM maintains a global mmu invalid generation number, stored in
kvm->arch.mmu_valid_gen, and every shadow page records the current global
generation number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow pages' sptes, it simply increases the
global generation number and then reloads the root shadow pages on all
vcpus. Each vcpu then builds a new shadow page table according to KVM's
current generation number, which ensures the old pages are never used
again.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu cache until the page allocator reclaims them.
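
To illustrate, here is a minimal standalone sketch of the idea, not the
kernel code; the struct names and helpers below are made up for clarity:

/* Simplified model of generation-based invalidation (illustration only). */
struct mmu_state {
	unsigned long mmu_valid_gen;	/* global generation number */
};

struct shadow_page {
	unsigned long mmu_valid_gen;	/* generation recorded at creation */
};

/* A shadow page is usable only if it was created in the current generation. */
static int sp_is_valid(struct mmu_state *mmu, struct shadow_page *sp)
{
	return sp->mmu_valid_gen == mmu->mmu_valid_gen;
}

/* Page creation stamps the current generation. */
static void sp_init(struct mmu_state *mmu, struct shadow_page *sp)
{
	sp->mmu_valid_gen = mmu->mmu_valid_gen;
}

/*
 * "Zap all": one increment makes every existing page stale at once; vcpus
 * are then asked to reload their roots and build pages in the new generation.
 */
static void invalidate_all_pages(struct mmu_state *mmu)
{
	mmu->mmu_valid_gen++;
}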

* Challenges
Page invalidation is requested when a memslot is moved or deleted, or when
kvm is being destroyed; these paths call zap_all_pages, which deletes all
shadow pages via their rmap and lpage-info, and after zap_all_pages returns
the rmap and lpage-info are freed. So we need a fast way to detach shadow
pages from the rmap and lpage-info.

For the lpage-info, we clear all large-page counts when doing
zap-all-pages, so invalid shadow pages are no longer counted in lpage-info
and the lpage-info of the invalid memslot can then be safely freed. This is
also good for performance: it lets the guest use hugepages as much as
possible.

For the rmap, we introduce a way to unmap the rmap outside of mmu-lock.
To do that, we must solve these problems:
1) do not corrupt the rmap
2) keep pte-list-descs available
3) keep shadow pages available

Resolve 1):
We make the invalid rmap remove-only, meaning that sptes can only be
deleted and cleared from the rmap and no new sptes can be added to it.
This is reasonable since kvm cannot do address translation on an invalid
rmap (gfn_to_pfn fails on an invalid memslot) and sptes on an invalid rmap
cannot be reused (they belong to invalid shadow pages).

Resolve 2):
Instead of freeing pte-list-descs and moving sptes, we use a placeholder
(PTE_LIST_SPTE_SKIP) to mark a spte that has been deleted from the rmap.
The pte-list-desc entries therefore stay in place while the rmap is
unmapped concurrently; the pte-list-descs themselves are freed once the
memslot is no longer visible to any vcpu.
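
A rough sketch of the remove-only clearing (simplified and standalone; the
three-entry descriptor and the SPTE_SKIP value below only mimic the shape
of the real pte_list_desc and PTE_LIST_SPTE_SKIP):

#include <stddef.h>
#include <stdint.h>

/* Placeholder for a deleted entry; entries are never moved or freed here. */
#define SPTE_SKIP	((uint64_t *)~1ul)
#define DESC_ENTRIES	3

struct pte_list_desc {
	uint64_t *sptes[DESC_ENTRIES];	/* pointers to sptes */
	struct pte_list_desc *more;	/* next descriptor in the chain */
};

/*
 * Remove-only clear: overwrite the matching entry with the placeholder
 * instead of compacting the list or freeing the descriptor, so a concurrent
 * walker never sees an entry move or a descriptor disappear under it.
 */
static void desc_clear_entry(struct pte_list_desc *desc, uint64_t *spte)
{
	size_t i;

	for (; desc; desc = desc->more)
		for (i = 0; i < DESC_ENTRIES; i++)
			if (desc->sptes[i] == spte) {
				desc->sptes[i] = SPTE_SKIP;
				return;
			}
}

/* Walkers simply skip placeholder entries. */
static int entry_is_live(uint64_t *spte)
{
	return spte != NULL && spte != SPTE_SKIP;
}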

Resolve 3):
We protect the lifecycle of the shadow page with the following scheme:

unmap-rmap-out-of-mmu-lock():
    for-each-rmap-in-slot:
        preempt_disable
        kvm->arch.being_unmapped_rmap = rmapp

        clear spte and reset rmap entry

        kvm->arch.being_unmapped_rmap = NULL
        preempt_enable

Other paths, such as zap-sp and mmu-notify, which are protected
by mmu-lock:
    clear spte and reset rmap entry
retry:
    if (kvm->arch.being_unmapped_rmap == rmapp)
        goto retry
    (the wait is very rare and clearing one rmap is very fast, so
    it is not bad even if a wait is needed)

Then we can be sure that the spte is always available while we
concurrently unmap the rmap.
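
As a rough userspace analogue of this handshake (C11 atomics and
sched_yield() stand in for the smp_mb()/preempt_disable()/cpu_relax() used
by the real patches; this only models the ordering, it is not the kernel
code):

#include <sched.h>
#include <stdatomic.h>

/* The rmap currently being cleared outside of mmu-lock, or NULL. */
static _Atomic(unsigned long *) being_unmapped_rmap;

/* Lock-free side: publish the rmap, clear its entries, then unpublish. */
static void unmap_rmap_nolock(unsigned long *rmapp)
{
	atomic_store(&being_unmapped_rmap, rmapp);	/* publish before touching sptes */
	/* ... clear sptes and reset rmap entries (remove-only) ... */
	atomic_store(&being_unmapped_rmap, NULL);	/* unpublish when finished */
}

/* mmu-lock side: after clearing its own entry, wait out a concurrent walker. */
static void sync_being_unmapped_rmap(unsigned long *rmapp)
{
	while (atomic_load(&being_unmapped_rmap) == rmapp)
		sched_yield();	/* the window is tiny and rarely hit */
}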


* TODO
Use a better algorithm to free the pte-list-descs; for example, we could
link them together via desc->more.

* Performance
We noticeably reduce mmu-lock contention and make the invalidation
preemptible.

Xiao Guangrong (15):
KVM: x86: clean up and optimize for kvm_arch_free_memslot
KVM: fold kvm_arch_create_memslot into kvm_arch_prepare_memory_region
KVM: x86: do not reuse rmap when memslot is moved
KVM: MMU: abstract memslot rmap related operations
KVM: MMU: allow per-rmap operations
KVM: MMU: allow concurrently clearing spte on remove-only pte-list
KVM: MMU: introduce invalid rmap handlers
KVM: MMU: allow unmap invalid rmap out of mmu-lock
KVM: MMU: introduce free_meslot_rmap_desc_nolock
KVM: x86: introduce memslot_set_lpage_disallowed
KVM: MMU: introduce kvm_clear_all_lpage_info
KVM: MMU: fast invalid all shadow pages
KVM: x86: use the fast way to invalid all pages
KVM: move srcu_read_lock/srcu_read_unlock to arch-specified code
KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

arch/arm/kvm/arm.c | 5 -
arch/ia64/kvm/kvm-ia64.c | 5 -
arch/powerpc/kvm/powerpc.c | 8 +-
arch/s390/kvm/kvm-s390.c | 5 -
arch/x86/include/asm/kvm_host.h | 7 +-
arch/x86/kvm/mmu.c | 493 +++++++++++++++++++++++++++++++++++----
arch/x86/kvm/mmu.h | 21 ++
arch/x86/kvm/mmu_audit.c | 10 +-
arch/x86/kvm/x86.c | 122 +++++++---
arch/x86/kvm/x86.h | 2 +
include/linux/kvm_host.h | 1 -
virt/kvm/kvm_main.c | 11 +-
12 files changed, 576 insertions(+), 114 deletions(-)

--
1.7.7.6


2013-04-16 06:33:13

by Xiao Guangrong

Subject: [PATCH v3 02/15] KVM: fold kvm_arch_create_memslot into kvm_arch_prepare_memory_region

This removes an arch-specific interface and also removes unnecessary
empty functions on some architectures.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/arm/kvm/arm.c | 5 -----
arch/ia64/kvm/kvm-ia64.c | 5 -----
arch/powerpc/kvm/powerpc.c | 8 ++++++--
arch/s390/kvm/kvm-s390.c | 5 -----
arch/x86/kvm/x86.c | 7 ++++++-
include/linux/kvm_host.h | 1 -
virt/kvm/kvm_main.c | 8 ++------
7 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e4ad0bb..c76e63e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -159,11 +159,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
{
}

-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
- return 0;
-}
-
/**
* kvm_arch_destroy_vm - destroy the VM data structure
* @kvm: pointer to the KVM struct
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7a54455..fcfb03b 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1553,11 +1553,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
{
}

-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
- return 0;
-}
-
int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_userspace_memory_region *mem,
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..aab8039 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -405,9 +405,9 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
kvmppc_core_free_memslot(free, dont);
}

-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
+static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
{
- return kvmppc_core_create_memslot(slot, npages);
+ return kvmppc_core_create_memslot(slot, slot->npages);
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -415,6 +415,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
+ if (change == KVM_MR_CREATE)
+ if (kvm_arch_create_memslot(memslot))
+ return -ENOMEM;
+
return kvmppc_core_prepare_memory_region(kvm, memslot, mem);
}

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 33161b4..7bfd6f6 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -967,11 +967,6 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
{
}

-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
-{
- return 0;
-}
-
/* Section: memory related */
int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0be7ec..447789c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6875,8 +6875,9 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
}

-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
+static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
{
+ unsigned long npages = slot->npages;
int i;

for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
@@ -6938,6 +6939,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
+ if (change == KVM_MR_CREATE)
+ if (kvm_arch_create_memslot(memslot))
+ return -ENOMEM;
+
/*
* Only private memory slots need to be mapped here since
* KVM_SET_MEMORY_REGION ioctl is no longer supported.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..f39ec18 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -493,7 +493,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
void kvm_arch_free_memslot(struct kvm_memory_slot *free,
struct kvm_memory_slot *dont);
-int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_memory_slot *memslot,
struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d21694a..acc9f30 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -825,13 +825,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
new.dirty_bitmap = NULL;

- r = -ENOMEM;
- if (change == KVM_MR_CREATE) {
- new.userspace_addr = mem->userspace_addr;
+ new.userspace_addr = mem->userspace_addr;

- if (kvm_arch_create_memslot(&new, npages))
- goto out_free;
- }
+ r = -ENOMEM;

/* Allocate page dirty bitmap if needed */
if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
--
1.7.7.6

2013-04-16 06:33:24

by Xiao Guangrong

Subject: [PATCH v3 05/15] KVM: MMU: allow per-rmap operations

Introduce rmap_operations to allow an rmap to have different operations;
this lets us handle invalid rmaps specially.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 31 ++++++++++++++++++++++++-------
arch/x86/kvm/mmu.h | 16 ++++++++++++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4e1f7cb..5fd6ed1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,7 @@ struct kvm_lpage_info {
};

struct kvm_arch_memory_slot {
+ struct rmap_operations *ops;
unsigned long *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
};
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 514f5b1..99ad2a4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1055,13 +1055,13 @@ static int slot_rmap_add(struct kvm_memory_slot *slot,
struct kvm_vcpu *vcpu, unsigned long *rmapp,
u64 *spte)
{
- return pte_list_add(vcpu, spte, rmapp);
+ return slot->arch.ops->rmap_add(vcpu, spte, rmapp);
}

static void slot_rmap_remove(struct kvm_memory_slot *slot,
unsigned long *rmapp, u64 *spte)
{
- pte_list_remove(spte, rmapp);
+ slot->arch.ops->rmap_remove(spte, rmapp);
}

static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -1238,7 +1238,7 @@ static bool slot_rmap_write_protect(struct kvm_memory_slot *slot,
struct kvm *kvm, unsigned long *rmapp,
bool pt_protect)
{
- return __rmap_write_protect(kvm, rmapp, pt_protect);
+ return slot->arch.ops->rmap_write_protect(kvm, rmapp, pt_protect);
}

/**
@@ -1306,7 +1306,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
static int slot_rmap_unmap(struct kvm *kvm, unsigned long *rmapp,
struct kvm_memory_slot *slot, unsigned long data)
{
- return kvm_unmap_rmapp(kvm, rmapp);
+ return slot->arch.ops->rmap_unmap(kvm, rmapp);
}

static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
@@ -1353,7 +1353,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
static int slot_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
struct kvm_memory_slot *slot, unsigned long data)
{
- return kvm_set_pte_rmapp(kvm, rmapp, (pte_t *)data);
+ return slot->arch.ops->rmap_set_pte(kvm, rmapp, (pte_t *)data);
}

static int kvm_handle_hva_range(struct kvm *kvm,
@@ -1470,7 +1470,7 @@ out:
static int slot_rmap_age(struct kvm *kvm, unsigned long *rmapp,
struct kvm_memory_slot *slot, unsigned long data)
{
- int young = kvm_age_rmapp(kvm, rmapp);
+ int young = slot->arch.ops->rmap_age(kvm, rmapp);

/* @data has hva passed to kvm_age_hva(). */
trace_kvm_age_page(data, slot, young);
@@ -1508,7 +1508,7 @@ static int slot_rmap_test_age(struct kvm *kvm, unsigned long *rmapp,
struct kvm_memory_slot *slot,
unsigned long data)
{
- return kvm_test_age_rmapp(kvm, rmapp);
+ return slot->arch.ops->rmap_test_age(kvm, rmapp);
}

#define RMAP_RECYCLE_THRESHOLD 1000
@@ -1537,6 +1537,23 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
}

+static struct rmap_operations normal_rmap_ops = {
+ .rmap_add = pte_list_add,
+ .rmap_remove = pte_list_remove,
+
+ .rmap_write_protect = __rmap_write_protect,
+
+ .rmap_set_pte = kvm_set_pte_rmapp,
+ .rmap_age = kvm_age_rmapp,
+ .rmap_test_age = kvm_test_age_rmapp,
+ .rmap_unmap = kvm_unmap_rmapp
+};
+
+void init_memslot_rmap_ops(struct kvm_memory_slot *slot)
+{
+ slot->arch.ops = &normal_rmap_ops;
+}
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ffd40d1..bb2b22e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -114,4 +114,20 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
}

+struct rmap_operations {
+ int (*rmap_add)(struct kvm_vcpu *vcpu, u64 *spte,
+ unsigned long *rmap);
+ void (*rmap_remove)(u64 *spte, unsigned long *rmap);
+
+ bool (*rmap_write_protect)(struct kvm *kvm, unsigned long *rmap,
+ bool pt_protect);
+
+ int (*rmap_set_pte)(struct kvm *kvm, unsigned long *rmap,
+ pte_t *ptep);
+ int (*rmap_age)(struct kvm *kvm, unsigned long *rmap);
+ int (*rmap_test_age)(struct kvm *kvm, unsigned long *rmap);
+ int (*rmap_unmap)(struct kvm *kvm, unsigned long *rmapp);
+};
+
+void init_memslot_rmap_ops(struct kvm_memory_slot *slot);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 839e666..bec83cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6919,6 +6919,7 @@ static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
}
}

+ init_memslot_rmap_ops(slot);
return 0;

out_free:
--
1.7.7.6

2013-04-16 06:33:42

by Xiao Guangrong

Subject: [PATCH v3 04/15] KVM: MMU: abstract memslot rmap related operations

Introduce slot_rmap_* functions to abstract memslot rmap related
operations, which makes the later patches clearer.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 108 +++++++++++++++++++++++++++++++++-------------
arch/x86/kvm/mmu_audit.c | 10 +++--
2 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dcc059c..514f5b1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1033,14 +1033,14 @@ static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
}

/*
- * Take gfn and return the reverse mapping to it.
+ * Take gfn and return the memslot and reverse mapping to it.
*/
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+static unsigned long *gfn_to_rmap(struct kvm *kvm,
+ struct kvm_memory_slot **slot,
+ gfn_t gfn, int level)
{
- struct kvm_memory_slot *slot;
-
- slot = gfn_to_memslot(kvm, gfn);
- return __gfn_to_rmap(gfn, level, slot);
+ *slot = gfn_to_memslot(kvm, gfn);
+ return __gfn_to_rmap(gfn, level, *slot);
}

static bool rmap_can_add(struct kvm_vcpu *vcpu)
@@ -1051,27 +1051,42 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
return mmu_memory_cache_free_objects(cache);
}

+static int slot_rmap_add(struct kvm_memory_slot *slot,
+ struct kvm_vcpu *vcpu, unsigned long *rmapp,
+ u64 *spte)
+{
+ return pte_list_add(vcpu, spte, rmapp);
+}
+
+static void slot_rmap_remove(struct kvm_memory_slot *slot,
+ unsigned long *rmapp, u64 *spte)
+{
+ pte_list_remove(spte, rmapp);
+}
+
static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
{
+ struct kvm_memory_slot *slot;
struct kvm_mmu_page *sp;
unsigned long *rmapp;

sp = page_header(__pa(spte));
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
- return pte_list_add(vcpu, spte, rmapp);
+ rmapp = gfn_to_rmap(vcpu->kvm, &slot, gfn, sp->role.level);
+ return slot_rmap_add(slot, vcpu, rmapp, spte);
}

static void rmap_remove(struct kvm *kvm, u64 *spte)
{
+ struct kvm_memory_slot *slot;
struct kvm_mmu_page *sp;
gfn_t gfn;
unsigned long *rmapp;

sp = page_header(__pa(spte));
gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
- rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
- pte_list_remove(spte, rmapp);
+ rmapp = gfn_to_rmap(kvm, &slot, gfn, sp->role.level);
+ slot_rmap_remove(slot, rmapp, spte);
}

/*
@@ -1219,6 +1234,13 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
return flush;
}

+static bool slot_rmap_write_protect(struct kvm_memory_slot *slot,
+ struct kvm *kvm, unsigned long *rmapp,
+ bool pt_protect)
+{
+ return __rmap_write_protect(kvm, rmapp, pt_protect);
+}
+
/**
* kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
* @kvm: kvm instance
@@ -1238,7 +1260,7 @@ void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmapp, false);
+ slot_rmap_write_protect(slot, kvm, rmapp, false);

/* clear the first set bit */
mask &= mask - 1;
@@ -1257,14 +1279,14 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmapp, true);
+ write_protected |= slot_rmap_write_protect(slot, kvm, rmapp,
+ true);
}

return write_protected;
}

-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
- struct kvm_memory_slot *slot, unsigned long data)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1281,14 +1303,19 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
return need_tlb_flush;
}

+static int slot_rmap_unmap(struct kvm *kvm, unsigned long *rmapp,
+ struct kvm_memory_slot *slot, unsigned long data)
+{
+ return kvm_unmap_rmapp(kvm, rmapp);
+}
+
static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
- struct kvm_memory_slot *slot, unsigned long data)
+ pte_t *ptep)
{
u64 *sptep;
struct rmap_iterator iter;
int need_flush = 0;
u64 new_spte;
- pte_t *ptep = (pte_t *)data;
pfn_t new_pfn;

WARN_ON(pte_huge(*ptep));
@@ -1323,6 +1350,12 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
return 0;
}

+static int slot_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
+ struct kvm_memory_slot *slot, unsigned long data)
+{
+ return kvm_set_pte_rmapp(kvm, rmapp, (pte_t *)data);
+}
+
static int kvm_handle_hva_range(struct kvm *kvm,
unsigned long start,
unsigned long end,
@@ -1388,21 +1421,20 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,

int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+ return kvm_handle_hva(kvm, hva, 0, slot_rmap_unmap);
}

int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
{
- return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+ return kvm_handle_hva_range(kvm, start, end, 0, slot_rmap_unmap);
}

void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
{
- kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+ kvm_handle_hva(kvm, hva, (unsigned long)&pte, slot_rmap_set_pte);
}

-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
- struct kvm_memory_slot *slot, unsigned long data)
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator uninitialized_var(iter);
@@ -1417,7 +1449,7 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
* out actively used pages or breaking up actively used hugepages.
*/
if (!shadow_accessed_mask) {
- young = kvm_unmap_rmapp(kvm, rmapp, slot, data);
+ young = kvm_unmap_rmapp(kvm, rmapp);
goto out;
}

@@ -1432,13 +1464,20 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
}
}
out:
+ return young;
+}
+
+static int slot_rmap_age(struct kvm *kvm, unsigned long *rmapp,
+ struct kvm_memory_slot *slot, unsigned long data)
+{
+ int young = kvm_age_rmapp(kvm, rmapp);
+
/* @data has hva passed to kvm_age_hva(). */
trace_kvm_age_page(data, slot, young);
return young;
}

-static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
- struct kvm_memory_slot *slot, unsigned long data)
+static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1465,29 +1504,37 @@ out:
return young;
}

+static int slot_rmap_test_age(struct kvm *kvm, unsigned long *rmapp,
+ struct kvm_memory_slot *slot,
+ unsigned long data)
+{
+ return kvm_test_age_rmapp(kvm, rmapp);
+}
+
#define RMAP_RECYCLE_THRESHOLD 1000

static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
{
- unsigned long *rmapp;
+ struct kvm_memory_slot *slot;
struct kvm_mmu_page *sp;
+ unsigned long *rmapp;

sp = page_header(__pa(spte));

- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
+ rmapp = gfn_to_rmap(vcpu->kvm, &slot, gfn, sp->role.level);

- kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, 0);
+ slot_rmap_unmap(vcpu->kvm, rmapp, slot, 0);
kvm_flush_remote_tlbs(vcpu->kvm);
}

int kvm_age_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, hva, kvm_age_rmapp);
+ return kvm_handle_hva(kvm, hva, hva, slot_rmap_age);
}

int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
{
- return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+ return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
}

#ifdef MMU_DEBUG
@@ -4224,7 +4271,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)

for (index = 0; index <= last_index; ++index, ++rmapp) {
if (*rmapp)
- __rmap_write_protect(kvm, rmapp, false);
+ slot_rmap_write_protect(memslot, kvm, rmapp,
+ false);

if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
kvm_flush_remote_tlbs(kvm);
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index daff69e..ab53647 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -129,8 +129,9 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
{
static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
- unsigned long *rmapp;
+ struct kvm_memory_slot *slot;
struct kvm_mmu_page *rev_sp;
+ unsigned long *rmapp;
gfn_t gfn;

rev_sp = page_header(__pa(sptep));
@@ -146,7 +147,7 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
return;
}

- rmapp = gfn_to_rmap(kvm, gfn, rev_sp->role.level);
+ rmapp = gfn_to_rmap(kvm, &slot, gfn, rev_sp->role.level);
if (!*rmapp) {
if (!__ratelimit(&ratelimit_state))
return;
@@ -188,14 +189,15 @@ static void check_mappings_rmap(struct kvm *kvm, struct kvm_mmu_page *sp)

static void audit_write_protection(struct kvm *kvm, struct kvm_mmu_page *sp)
{
+ struct kvm_memory_slot *slot;
+ struct rmap_iterator iter;
unsigned long *rmapp;
u64 *sptep;
- struct rmap_iterator iter;

if (sp->role.direct || sp->unsync || sp->role.invalid)
return;

- rmapp = gfn_to_rmap(kvm, sp->gfn, PT_PAGE_TABLE_LEVEL);
+ rmapp = gfn_to_rmap(kvm, &slot, sp->gfn, PT_PAGE_TABLE_LEVEL);

for (sptep = rmap_get_first(*rmapp, &iter); sptep;
sptep = rmap_get_next(&iter)) {
--
1.7.7.6

2013-04-16 06:33:11

by Xiao Guangrong

Subject: [PATCH v3 01/15] KVM: x86: clean up and optimize for kvm_arch_free_memslot

The memslot rmap and lpage-info are never partially reused, and nothing
needs to be freed when a new memslot is created.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4be4733..b0be7ec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6856,19 +6856,22 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
{
int i;

+ if (dont && dont->arch.rmap[0] == free->arch.rmap[0])
+ return;
+
+ /* It is an empty memslot. */
+ if (!free->arch.rmap[0])
+ return;
+
for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
- if (!dont || free->arch.rmap[i] != dont->arch.rmap[i]) {
- kvm_kvfree(free->arch.rmap[i]);
- free->arch.rmap[i] = NULL;
- }
+ kvm_kvfree(free->arch.rmap[i]);
+ free->arch.rmap[i] = NULL;
+
if (i == 0)
continue;

- if (!dont || free->arch.lpage_info[i - 1] !=
- dont->arch.lpage_info[i - 1]) {
- kvm_kvfree(free->arch.lpage_info[i - 1]);
- free->arch.lpage_info[i - 1] = NULL;
- }
+ kvm_kvfree(free->arch.lpage_info[i - 1]);
+ free->arch.lpage_info[i - 1] = NULL;
}
}

--
1.7.7.6

2013-04-16 06:35:19

by Xiao Guangrong

Subject: [PATCH v3 10/15] KVM: x86: introduce memslot_set_lpage_disallowed

It is used to mark large pages as disallowed at the specified level and
will be used in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++-----------------
1 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bec83cd..0c5bb2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6875,13 +6875,46 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
}
}

+static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
+ unsigned long npages,
+ int lpage_size, int lpages)
+{
+ struct kvm_lpage_info *lpage_info;
+ unsigned long ugfn;
+ int level = lpage_size + 1;
+
+ WARN_ON(!lpage_size);
+
+ lpage_info = slot->arch.lpage_info[lpage_size - 1];
+
+ if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[0].write_count = 1;
+
+ if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+ lpage_info[lpages - 1].write_count = 1;
+
+ ugfn = slot->userspace_addr >> PAGE_SHIFT;
+
+ /*
+ * If the gfn and userspace address are not aligned wrt each
+ * other, or if explicitly asked to, disable large page
+ * support for this slot
+ */
+ if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
+ !kvm_largepages_enabled()) {
+ unsigned long j;
+
+ for (j = 0; j < lpages; ++j)
+ lpage_info[j].write_count = 1;
+ }
+}
+
static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
{
unsigned long npages = slot->npages;
int i;

for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
- unsigned long ugfn;
int lpages;
int level = i + 1;

@@ -6900,23 +6933,7 @@ static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
if (!slot->arch.lpage_info[i - 1])
goto out_free;

- if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][0].write_count = 1;
- if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][lpages - 1].write_count = 1;
- ugfn = slot->userspace_addr >> PAGE_SHIFT;
- /*
- * If the gfn and userspace address are not aligned wrt each
- * other, or if explicitly asked to, disable large page
- * support for this slot
- */
- if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
- !kvm_largepages_enabled()) {
- unsigned long j;
-
- for (j = 0; j < lpages; ++j)
- slot->arch.lpage_info[i - 1][j].write_count = 1;
- }
+ memslot_set_lpage_disallowed(slot, npages, i, lpages);
}

init_memslot_rmap_ops(slot);
--
1.7.7.6

2013-04-16 06:35:31

by Xiao Guangrong

Subject: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

The current kvm_mmu_zap_all is really slow: it holds mmu-lock while walking
and zapping all shadow pages one by one, and it also needs to zap every
guest page's rmap and every shadow page's parent spte list. Things get
worse as the guest uses more memory or vcpus, so this approach does not
scale.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation number, stored in
kvm->arch.mmu_valid_gen, and every shadow page records the current global
generation number in sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow pages' sptes, it simply increases the
global generation number and then reloads the root shadow pages on all
vcpus. Each vcpu then builds a new shadow page table according to KVM's
current generation number, which ensures the old pages are never used
again.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu cache until the page allocator reclaims them.

If the invalidation is due to a memslot change, the slot's rmap and
lpage-info will be freed soon; to avoid using invalid memory, we unmap all
sptes on its rmap and always reset the lpage-info of all memslots so that
the rmap and lpage-info can be safely freed.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 85 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu.h | 4 ++
arch/x86/kvm/x86.c | 6 +++
4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1ad9a34..6f8ee18 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -223,6 +223,7 @@ struct kvm_mmu_page {
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes; /* Reverse mapping for parent_pte */
+ unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);

#ifdef CONFIG_X86_32
@@ -531,6 +532,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
* Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9ac584f..12129b7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
.rmap_unmap = kvm_unmap_invalid_rmapp
};

+static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
+{
+ slot->arch.ops = &invalid_rmap_ops;
+}
+
typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
handle_rmap_fun fun, void *data)
@@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
}

+/*
+ * Fast-invalidate all shadow pages belonging to @slot.
+ *
+ * @slot != NULL means the invalidation is caused by the memslot specified
+ * by @slot being deleted; in this case, we must ensure that the rmap
+ * and lpage-info of @slot cannot be used after calling this function.
+ * Specially, if @slot is INVALID_ALL_SLOTS, all slots will be deleted
+ * soon; this always happens when kvm is being destroyed.
+ *
+ * @slot == NULL means the invalidation is due to other reasons; we need
+ * not care about rmap and lpage-info since they are still valid after
+ * calling this function.
+ */
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+ struct kvm_memory_slot *each_slot;
+
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.mmu_valid_gen++;
+
+ if (slot == INVALID_ALL_SLOTS)
+ kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+ init_invalid_memslot_rmap_ops(each_slot);
+ else if (slot)
+ init_invalid_memslot_rmap_ops(slot);
+
+ /*
+ * All shadow pages are invalid, so reset the large page info;
+ * then we can safely destroy the memslot. This is also good
+ * for large page usage.
+ */
+ kvm_clear_all_lpage_info(kvm);
+
+ /*
+ * Notify all vcpus to reload its shadow page table
+ * and flush TLB. Then all vcpus will switch to new
+ * shadow page table with the new mmu_valid_gen.
+ *
+ * Note: we should do this under the protection of
+ * mmu-lock, otherwise, vcpu would purge shadow page
+ * but miss tlb flush.
+ */
+ kvm_reload_remote_mmus(kvm);
+ spin_unlock(&kvm->mmu_lock);
+
+ if (slot == INVALID_ALL_SLOTS)
+ kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
+ unmap_memslot_rmap_nolock(kvm, each_slot);
+ else if (slot)
+ unmap_memslot_rmap_nolock(kvm, slot);
+
+ /*
+ * To ensure that all vcpus and mmu-notify are not clearing
+ * spte and rmap entry.
+ */
+ synchronize_srcu_expedited(&kvm->srcu);
+}
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
@@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
}

+static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
+}
+
static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
gva_t gaddr,
@@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.quadrant = quadrant;
}
for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+ if (!is_valid_sp(vcpu->kvm, sp))
+ continue;
+
if (!need_sync && sp->unsync)
need_sync = true;

@@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,

account_shadowed(vcpu->kvm, gfn);
}
+ sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
trace_kvm_mmu_get_page(sp, true);
return sp;
@@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);
- if (!sp->role.invalid && !sp->role.direct)
+
+ if (!sp->role.invalid && !sp->role.direct &&
+ /* Invalid-gen pages are not accounted. */
+ is_valid_sp(kvm, sp))
unaccount_shadowed(kvm, sp->gfn);
+
if (sp->unsync)
kvm_unlink_unsync_page(kvm, sp);
if (!sp->root_count) {
@@ -3256,7 +3333,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)

sp = page_header(root);
--sp->root_count;
- if (!sp->root_count && sp->role.invalid) {
+ if (!is_valid_sp(vcpu->kvm, sp) ||
+ (!sp->root_count && sp->role.invalid)) {
kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
}
@@ -3271,7 +3349,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
root &= PT64_BASE_ADDR_MASK;
sp = page_header(root);
--sp->root_count;
- if (!sp->root_count && sp->role.invalid)
+ if (!is_valid_sp(vcpu->kvm, sp) ||
+ (!sp->root_count && sp->role.invalid))
kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
&invalid_list);
}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ab434b7..f01c5ce 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -131,4 +131,8 @@ struct rmap_operations {

void init_memslot_rmap_ops(struct kvm_memory_slot *slot);
void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot);
+
+#define INVALID_ALL_SLOTS ((struct kvm_memory_slot *)(~0))
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc4956c..6dbb80c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6863,6 +6863,12 @@ void kvm_arch_free_memslot(struct kvm_memory_slot *free,
if (!free->arch.rmap[0])
return;

+ /*
+ * Vcpus and mmu-notify can no longer see the rmap of this slot,
+ * so it can be safely freed.
+ */
+ free_meslot_rmap_desc_nolock(free);
+
for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
kvm_kvfree(free->arch.rmap[i]);
free->arch.rmap[i] = NULL;
--
1.7.7.6

2013-04-16 06:35:38

by Xiao Guangrong

Subject: [PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all and rename
kvm_mmu_zap_all to kvm_mmu_free_all, which is used to free all memory used
by the kvm mmu when the VM is being destroyed. At that point no vcpu exists
and mmu-notify has been unregistered, so we can free the shadow pages
outside of mmu-lock.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu.c | 15 ++-------------
arch/x86/kvm/x86.c | 9 ++++-----
3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6f8ee18..a336055 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -771,7 +771,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask);
-void kvm_mmu_zap_all(struct kvm *kvm);
+void kvm_mmu_free_all(struct kvm *kvm);
void kvm_arch_init_generation(struct kvm *kvm);
void kvm_mmu_invalid_mmio_sptes(struct kvm *kvm);
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 12129b7..10c43ea 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4639,28 +4639,17 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
spin_unlock(&kvm->mmu_lock);
}

-void kvm_mmu_zap_all(struct kvm *kvm)
+void kvm_mmu_free_all(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);

- might_sleep();
-
- spin_lock(&kvm->mmu_lock);
restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+ list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
goto restart;

- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- cond_resched_lock(&kvm->mmu_lock);
- goto restart;
- }
- }
-
kvm_mmu_commit_zap_page(kvm, &invalid_list);
- spin_unlock(&kvm->mmu_lock);
}

static void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3dd0d5..4bb88f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6840,6 +6840,7 @@ void kvm_arch_sync_events(struct kvm *kvm)

void kvm_arch_destroy_vm(struct kvm *kvm)
{
+ kvm_mmu_free_all(kvm);
kvm_iommu_unmap_guest(kvm);
kfree(kvm->arch.vpic);
kfree(kvm->arch.vioapic);
@@ -7056,11 +7057,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- int idx;
-
- idx = srcu_read_lock(&kvm->srcu);
- kvm_mmu_zap_all(kvm);
- srcu_read_unlock(&kvm->srcu, idx);
+ mutex_lock(&kvm->slots_lock);
+ kvm_mmu_invalid_memslot_pages(kvm, INVALID_ALL_SLOTS);
+ mutex_unlock(&kvm->slots_lock);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
1.7.7.6

2013-04-16 06:35:56

by Xiao Guangrong

Subject: [PATCH v3 14/15] KVM: move srcu_read_lock/srcu_read_unlock to arch-specified code

Move the srcu_read_lock/srcu_read_unlock pair from kvm_mmu_notifier_release
into kvm_arch_flush_shadow_all, since we will hold slots_lock instead of
srcu.

Only ARM, POWERPC and x86 use mmu-notify, and kvm_arch_flush_shadow_all on
ARM and POWERPC does nothing, so we only need to modify the x86 code.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++++
virt/kvm/kvm_main.c | 3 ---
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e7c85b..d3dd0d5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7056,7 +7056,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
kvm_mmu_zap_all(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acc9f30..f48eef9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -418,11 +418,8 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int idx;

- idx = srcu_read_lock(&kvm->srcu);
kvm_arch_flush_shadow_all(kvm);
- srcu_read_unlock(&kvm->srcu, idx);
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
--
1.7.7.6

2013-04-16 06:36:17

by Xiao Guangrong

Subject: [PATCH v3 13/15] KVM: x86: use the fast way to invalid all pages

Replace kvm_mmu_zap_all with kvm_mmu_invalid_all_pages everywhere except
on the mmu_notifier->release() path, which will be converted in a later
patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6dbb80c..6e7c85b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5465,7 +5465,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
* to ensure that the updated hypercall appears atomically across all
* VCPUs.
*/
- kvm_mmu_zap_all(vcpu->kvm);
+ kvm_mmu_invalid_memslot_pages(vcpu->kvm, NULL);

kvm_x86_ops->patch_hypercall(vcpu, instruction);

@@ -7062,7 +7062,7 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
- kvm_arch_flush_shadow_all(kvm);
+ kvm_mmu_invalid_memslot_pages(kvm, slot);
}

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
--
1.7.7.6

2013-04-16 06:39:09

by Xiao Guangrong

Subject: [PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

pte_list_clear_concurrently allows us to reset pte-list-desc entries
outside of mmu-lock. We can also reset sptes outside of mmu-lock if we can
protect the lifecycle of the shadow page; we use the following scheme to
achieve that:

unmap_memslot_rmap_nolock():
    for-each-rmap-in-slot:
        preempt_disable
        kvm->arch.being_unmapped_rmap = rmapp
        clear spte and reset rmap entry
        kvm->arch.being_unmapped_rmap = NULL
        preempt_enable

Other paths, such as zap-sp and mmu-notify, which are protected
by mmu-lock:
    clear spte and reset rmap entry
retry:
    if (kvm->arch.being_unmapped_rmap == rmapp)
        goto retry
    (the wait is very rare and clearing one rmap is very fast, so
    it is not bad even if a wait is needed)

Then we can be sure that the spte is always available while
unmap_memslot_rmap_nolock runs.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 114 ++++++++++++++++++++++++++++++++++++---
arch/x86/kvm/mmu.h | 2 +-
3 files changed, 110 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5fd6ed1..1ad9a34 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,6 +536,8 @@ struct kvm_arch {
* Hash table of struct kvm_mmu_page.
*/
struct list_head active_mmu_pages;
+ unsigned long *being_unmapped_rmap;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a7a5d0..e6414d2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1104,10 +1104,10 @@ static int slot_rmap_add(struct kvm_memory_slot *slot,
return slot->arch.ops->rmap_add(vcpu, spte, rmapp);
}

-static void slot_rmap_remove(struct kvm_memory_slot *slot,
+static void slot_rmap_remove(struct kvm_memory_slot *slot, struct kvm *kvm,
unsigned long *rmapp, u64 *spte)
{
- slot->arch.ops->rmap_remove(spte, rmapp);
+ slot->arch.ops->rmap_remove(kvm, spte, rmapp);
}

static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -1132,7 +1132,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
sp = page_header(__pa(spte));
gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
rmapp = gfn_to_rmap(kvm, &slot, gfn, sp->role.level);
- slot_rmap_remove(slot, rmapp, spte);
+ slot_rmap_remove(slot, kvm, rmapp, spte);
}

/*
@@ -1589,9 +1589,14 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
}

+static void rmap_remove_spte(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
+{
+ pte_list_remove(spte, rmapp);
+}
+
static struct rmap_operations normal_rmap_ops = {
.rmap_add = pte_list_add,
- .rmap_remove = pte_list_remove,
+ .rmap_remove = rmap_remove_spte,

.rmap_write_protect = __rmap_write_protect,

@@ -1613,9 +1618,27 @@ static int invalid_rmap_add(struct kvm_vcpu *vcpu, u64 *spte,
return 0;
}

-static void invalid_rmap_remove(u64 *spte, unsigned long *rmapp)
+static void sync_being_unmapped_rmap(struct kvm *kvm, unsigned long *rmapp)
+{
+ /*
+ * Ensure all the sptes on the rmap have been zapped and
+ * the rmap's entries have been reset so that
+ * unmap_invalid_rmap_nolock can not get any spte from the
+ * rmap after calling sync_being_unmapped_rmap().
+ */
+ smp_mb();
+retry:
+ if (unlikely(ACCESS_ONCE(kvm->arch.being_unmapped_rmap) == rmapp)) {
+ cpu_relax();
+ goto retry;
+ }
+}
+
+static void
+invalid_rmap_remove(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
{
pte_list_clear_concurrently(spte, rmapp);
+ sync_being_unmapped_rmap(kvm, rmapp);
}

static bool invalid_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1635,7 +1658,11 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
if (sptep == PTE_LIST_SPTE_SKIP)
continue;

- /* Do not call .rmap_remove(). */
+ /*
+ * Do not call .rmap_remove() since we do not want to wait
+ * on sync_being_unmapped_rmap() when all sptes should be
+ * removed from the rmap.
+ */
if (mmu_spte_clear_track_bits(sptep))
pte_list_clear_concurrently(sptep, rmapp);
}
@@ -1645,7 +1672,10 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)

static int kvm_unmap_invalid_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
- return __kvm_unmap_invalid_rmapp(rmapp);
+ int ret = __kvm_unmap_invalid_rmapp(rmapp);
+
+ sync_being_unmapped_rmap(kvm, rmapp);
+ return ret;
}

static int invalid_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
@@ -1686,6 +1716,76 @@ static struct rmap_operations invalid_rmap_ops = {
.rmap_unmap = kvm_unmap_invalid_rmapp
};

+typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
+static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
+ handle_rmap_fun fun, void *data)
+{
+ int level;
+
+ for (level = PT_PAGE_TABLE_LEVEL;
+ level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
+ unsigned long idx, *rmapp;
+
+ rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
+ idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+ /*
+ * Walk the rmap from high index to low index to reduce the
+ * possible wait in sync_being_unmapped_rmap().
+ */
+ while (idx--)
+ fun(rmapp + idx, data);
+ }
+}
+
+static void unmap_rmap_no_lock_begin(struct kvm *kvm, unsigned long *rmapp)
+{
+ preempt_disable();
+ kvm->arch.being_unmapped_rmap = rmapp;
+
+ /*
+ * Setting being_unmapped_rmap must come before reading or writing
+ * any sptes on the rmaps.
+ * See the comment in sync_being_unmapped_rmap().
+ */
+ smp_mb();
+}
+
+static void unmap_rmap_no_lock_end(struct kvm *kvm)
+{
+ /*
+ * Ensure that clearing sptes and resetting the rmap's entries
+ * has finished.
+ * See the comment in sync_being_unmapped_rmap().
+ */
+ smp_mb();
+ kvm->arch.being_unmapped_rmap = NULL;
+ preempt_enable();
+}
+
+static void unmap_invalid_rmap_nolock(unsigned long *rmapp, void *data)
+{
+ struct kvm *kvm = (struct kvm *)data;
+
+ if (!ACCESS_ONCE(*rmapp))
+ return;
+
+ unmap_rmap_no_lock_begin(kvm, rmapp);
+ __kvm_unmap_invalid_rmapp(rmapp);
+ unmap_rmap_no_lock_end(kvm);
+}
+
+static void
+unmap_memslot_rmap_nolock(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ /* Only invalid rmaps can be unmapped out of mmu-lock. */
+ WARN_ON(slot->arch.ops != &invalid_rmap_ops);
+ /* Use slots_lock to protect kvm->arch.being_unmapped_rmap. */
+ WARN_ON(!mutex_is_locked(&kvm->slots_lock));
+
+ walk_memslot_rmap_nolock(slot, unmap_invalid_rmap_nolock, kvm);
+}
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bb2b22e..d6aa31a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,7 +117,7 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
struct rmap_operations {
int (*rmap_add)(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *rmap);
- void (*rmap_remove)(u64 *spte, unsigned long *rmap);
+ void (*rmap_remove)(struct kvm *kvm, u64 *spte, unsigned long *rmap);

bool (*rmap_write_protect)(struct kvm *kvm, unsigned long *rmap,
bool pt_protect);
--
1.7.7.6

2013-04-16 06:39:23

by Xiao Guangrong

Subject: [PATCH v3 11/15] KVM: MMU: introduce kvm_clear_all_lpage_info

This function is used to reset the large page info of all guest pages;
it will be used in a later patch.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 25 +++++++++++++++++++++++++
arch/x86/kvm/x86.h | 2 ++
2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c5bb2c..fc4956c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6909,6 +6909,31 @@ static void memslot_set_lpage_disallowed(struct kvm_memory_slot *slot,
}
}

+static void clear_memslot_lpage_info(struct kvm_memory_slot *slot)
+{
+ int i;
+
+ for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
+ int lpages;
+ int level = i + 1;
+
+ lpages = gfn_to_index(slot->base_gfn + slot->npages - 1,
+ slot->base_gfn, level) + 1;
+
+ memset(slot->arch.lpage_info[i - 1], 0,
+ sizeof(*slot->arch.lpage_info[i - 1]));
+ memslot_set_lpage_disallowed(slot, slot->npages, i, lpages);
+ }
+}
+
+void kvm_clear_all_lpage_info(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+
+ kvm_for_each_memslot(slot, kvm->memslots)
+ clear_memslot_lpage_info(slot);
+}
+
static int kvm_arch_create_memslot(struct kvm_memory_slot *slot)
{
unsigned long npages = slot->npages;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e224f7a..beae540 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -108,6 +108,8 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
return false;
}

+void kvm_clear_all_lpage_info(struct kvm *kvm);
+
void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
--
1.7.7.6

2013-04-16 06:39:42

by Xiao Guangrong

Subject: [PATCH v3 09/15] KVM: MMU: introduce free_meslot_rmap_desc_nolock

It frees the pte-list-descs used by the memslot rmap after the memslot
update is completed.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 26 ++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 1 +
2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e6414d2..9ac584f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1049,6 +1049,22 @@ static void pte_list_clear_concurrently(u64 *spte, unsigned long *pte_list)
return;
}

+static void pte_list_free_desc(unsigned long *pte_list)
+{
+ struct pte_list_desc *desc, *next;
+ unsigned long pte_value = *pte_list;
+
+ if (!(pte_value & 1))
+ return;
+
+ desc = (struct pte_list_desc *)(pte_value & ~1ul);
+ do {
+ next = desc->more;
+ mmu_free_pte_list_desc(desc);
+ desc = next;
+ } while (desc);
+}
+
typedef void (*pte_list_walk_fn) (u64 *spte);
static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
{
@@ -1786,6 +1802,16 @@ unmap_memslot_rmap_nolock(struct kvm *kvm, struct kvm_memory_slot *slot)
walk_memslot_rmap_nolock(slot, unmap_invalid_rmap_nolock, kvm);
}

+static void free_rmap_desc_nolock(unsigned long *rmapp, void *data)
+{
+ pte_list_free_desc(rmapp);
+}
+
+void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
+{
+ walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
+}
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d6aa31a..ab434b7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -130,4 +130,5 @@ struct rmap_operations {
};

void init_memslot_rmap_ops(struct kvm_memory_slot *slot);
+void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot);
#endif
--
1.7.7.6

2013-04-16 06:40:03

by Xiao Guangrong

Subject: [PATCH v3 06/15] KVM: MMU: allow concurrently clearing spte on remove-only pte-list

This patch introduces PTE_LIST_SPTE_SKIP, a placeholder that is set in the
pte-list after a spte is removed so that the other sptes on the pte-list
are not moved and the pte-list-descs on the pte-list are not freed.

If the vcpu cannot add sptes to the pte-list (e.g. the rmap of an invalid
memslot) and sptes cannot be freed during the pte-list walk, we can
concurrently clear sptes on the pte-list; the worst case is that we double
zap a spte, which is safe.

This patch only ensures that concurrently zapping a pte-list is safe; the
later patches will keep sptes available during concurrent clearing.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 99ad2a4..850eab5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -900,6 +900,18 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
}

/*
+ * It is the placeholder and it will be set on pte-list after removing
+ * a spte so that other sptes on this pte_list are not moved and the
+ * pte-list-descs on the pte-list are not freed.
+ *
+ * If vcpu can not add spte to the pte-list (e.g. the rmap on invalid
+ * memslot) and spte can not be freed during pte-list walk, we can
+ * concurrently clear sptes on the pte-list, the worst case is, we double
+ * zap a spte that is safe.
+ */
+#define PTE_LIST_SPTE_SKIP (u64 *)((~0x0ul) & (~1))
+
+/*
* Pte mapping structures:
*
* If pte_list bit zero is zero, then pte_list point to the spte.
@@ -1003,6 +1015,40 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
}
}

+static void pte_list_clear_concurrently(u64 *spte, unsigned long *pte_list)
+{
+ struct pte_list_desc *desc;
+ unsigned long pte_value = *pte_list;
+ int i;
+
+ /* Empty pte list stores nothing. */
+ WARN_ON(!pte_value);
+
+ if (!(pte_value & 1)) {
+ if ((u64 *)pte_value == spte) {
+ *pte_list = (unsigned long)PTE_LIST_SPTE_SKIP;
+ return;
+ }
+
+ /* someone has already cleared it. */
+ WARN_ON(pte_value != (unsigned long)PTE_LIST_SPTE_SKIP);
+ return;
+ }
+
+ desc = (struct pte_list_desc *)(pte_value & ~1ul);
+ while (desc) {
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ if (desc->sptes[i] == spte) {
+ desc->sptes[i] = PTE_LIST_SPTE_SKIP;
+ return;
+ }
+
+ desc = desc->more;
+ }
+
+ return;
+}
+
typedef void (*pte_list_walk_fn) (u64 *spte);
static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
{
@@ -1214,6 +1260,12 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
return false;
}

+/* PTE_LIST_SPTE_SKIP is only used on invalid rmap. */
+static void check_valid_sptep(u64 *sptep)
+{
+ WARN_ON(sptep == PTE_LIST_SPTE_SKIP || !is_rmap_spte(*sptep));
+}
+
static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
bool pt_protect)
{
@@ -1222,7 +1274,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
bool flush = false;

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
- BUG_ON(!(*sptep & PT_PRESENT_MASK));
+ check_valid_sptep(sptep);
if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
sptep = rmap_get_first(*rmapp, &iter);
continue;
@@ -1293,7 +1345,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
int need_tlb_flush = 0;

while ((sptep = rmap_get_first(*rmapp, &iter))) {
- BUG_ON(!(*sptep & PT_PRESENT_MASK));
+ check_valid_sptep(sptep);
rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", sptep, *sptep);

drop_spte(kvm, sptep);
@@ -1322,7 +1374,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
new_pfn = pte_pfn(*ptep);

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
- BUG_ON(!is_shadow_present_pte(*sptep));
+ check_valid_sptep(sptep);
rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", sptep, *sptep);

need_flush = 1;
@@ -1455,7 +1507,7 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)

for (sptep = rmap_get_first(*rmapp, &iter); sptep;
sptep = rmap_get_next(&iter)) {
- BUG_ON(!is_shadow_present_pte(*sptep));
+ check_valid_sptep(sptep);

if (*sptep & shadow_accessed_mask) {
young = 1;
@@ -1493,7 +1545,7 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp)

for (sptep = rmap_get_first(*rmapp, &iter); sptep;
sptep = rmap_get_next(&iter)) {
- BUG_ON(!is_shadow_present_pte(*sptep));
+ check_valid_sptep(sptep);

if (*sptep & shadow_accessed_mask) {
young = 1;
--
1.7.7.6

2013-04-16 06:40:10

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 07/15] KVM: MMU: introduce invalid rmap handlers

An invalid rmap is the rmap of an invalid memslot, i.e. a memslot that is
being deleted; in particular, we can treat all rmaps as invalid when kvm
is being destroyed since all memslots will be deleted soon. The MMU should
remove all sptes on these rmaps before the invalid memslot is fully
deleted.

The reason we handle invalid rmaps separately is that we want to unmap
them out of mmu-lock, so that unmapping scales on guests with intensive
memory use and many vcpus.

This patch makes every operation on an invalid rmap clear the spte and
reset the rmap's entry. A later patch will introduce the out-of-mmu-lock
path that unmaps invalid rmaps.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 850eab5..2a7a5d0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1606,6 +1606,86 @@ void init_memslot_rmap_ops(struct kvm_memory_slot *slot)
slot->arch.ops = &normal_rmap_ops;
}

+static int invalid_rmap_add(struct kvm_vcpu *vcpu, u64 *spte,
+ unsigned long *pte_list)
+{
+ WARN_ON(1);
+ return 0;
+}
+
+static void invalid_rmap_remove(u64 *spte, unsigned long *rmapp)
+{
+ pte_list_clear_concurrently(spte, rmapp);
+}
+
+static bool invalid_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
+ bool pt_protect)
+{
+ WARN_ON(1);
+ return false;
+}
+
+static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ for (sptep = rmap_get_first(*rmapp, &iter); sptep;
+ sptep = rmap_get_next(&iter)) {
+ if (sptep == PTE_LIST_SPTE_SKIP)
+ continue;
+
+ /* Do not call .rmap_remove(). */
+ if (mmu_spte_clear_track_bits(sptep))
+ pte_list_clear_concurrently(sptep, rmapp);
+ }
+
+ return 0;
+}
+
+static int kvm_unmap_invalid_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+ return __kvm_unmap_invalid_rmapp(rmapp);
+}
+
+static int invalid_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
+ pte_t *ptep)
+{
+ return kvm_unmap_invalid_rmapp(kvm, rmapp);
+}
+
+/*
+ * An invalid rmap is the rmap of an invalid memslot which is being
+ * deleted; in particular, we can treat all rmaps as invalid when
+ * kvm is being destroyed since all memslots will be deleted soon.
+ * The MMU should remove all sptes on these rmaps before the invalid
+ * memslot is fully deleted.
+ *
+ * VCPUs can not do address translation on invalid memslots, which
+ * means no sptes can be added to their rmaps and no shadow page
+ * can be created in their memory regions, so rmap_add and
+ * rmap_write_protect on an invalid memslot should never happen.
+ * Any sptes on invalid rmaps are stale and can not be reused;
+ * we drop all sptes on any other operation. So, all handlers
+ * on an invalid rmap do the same thing - remove and zap sptes on
+ * the rmap.
+ *
+ * KVM uses pte_list_clear_concurrently to clear sptes on an invalid
+ * rmap, which resets the rmap's entries but keeps the rmap's memory.
+ * The rmap is fully destroyed when the invalid memslot is freed.
+ */
+static struct rmap_operations invalid_rmap_ops = {
+ .rmap_add = invalid_rmap_add,
+ .rmap_remove = invalid_rmap_remove,
+
+ .rmap_write_protect = invalid_rmap_write_protect,
+
+ .rmap_set_pte = invalid_rmap_set_pte,
+ .rmap_age = kvm_unmap_invalid_rmapp,
+ .rmap_test_age = kvm_unmap_invalid_rmapp,
+ .rmap_unmap = kvm_unmap_invalid_rmapp
+};
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
--
1.7.7.6

2013-04-16 06:42:20

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 03/15] KVM: x86: do not reuse rmap when memslot is moved

Do not let kvm reuse the rmap of a memslot that is being moved; then the
rmap of a moved or deleted memslot can only be unmapped and no new spte
can be added to it.

This makes it possible to unmap the rmap out of mmu-lock in later patches.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/x86.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447789c..839e666 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6939,7 +6939,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
- if (change == KVM_MR_CREATE)
+ if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
if (kvm_arch_create_memslot(memslot))
return -ENOMEM;

--
1.7.7.6

2013-04-17 23:38:33

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 07/15] KVM: MMU: introduce invalid rmap handlers

On Tue, Apr 16, 2013 at 02:32:45PM +0800, Xiao Guangrong wrote:
> Invalid rmaps is the rmap of the invalid memslot which is being
> deleted, especially, we can treat all rmaps are invalid when
> kvm is being destroyed since all memslot will be deleted soon.
> MMU should remove all sptes on these rmaps before the invalid
> memslot fully deleted
>
> The reason why we separately handle invalid rmap is we want to
> unmap invalid-rmap out of mmu-lock to achieve scale performance
> on intensive memory and vcpu used guest

Better try to make the code simpler, and introduce complexity only
if necessary.

The idea to zap the roots is very elegant and apparently effective. What
are its problems?

2013-04-18 02:38:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

On Tue, Apr 16, 2013 at 02:32:50PM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> walk and zap all shadow pages one by one, also it need to zap all guest
> page's rmap and all shadow page's parent spte list. Particularly, things
> become worse if guest uses more memory or vcpus. It is not good for
> scalability.
>
> In this patch, we introduce a faster way to invalid all shadow pages.
> KVM maintains a global mmu invalid generation-number which is stored in
> kvm->arch.mmu_valid_gen and every shadow page stores the current global
> generation-number into sp->mmu_valid_gen when it is created.
>
> When KVM need zap all shadow pages sptes, it just simply increase the
> global generation-number then reload root shadow pages on all vcpus.
> Vcpu will create a new shadow page table according to current kvm's
> generation-number. It ensures the old pages are not used any more.
>
> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are keeped in mmu-cache until page allocator reclaims page.
>
> If the invalidation is due to memslot changed, its rmap amd lpage-info
> will be freed soon, in order to avoiding use invalid memory, we unmap
> all sptes on its rmap and always reset the large-info all memslots so
> that rmap and lpage info can be safely freed.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.c | 85 +++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/mmu.h | 4 ++
> arch/x86/kvm/x86.c | 6 +++
> 4 files changed, 94 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1ad9a34..6f8ee18 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -223,6 +223,7 @@ struct kvm_mmu_page {
> int root_count; /* Currently serving as active root */
> unsigned int unsync_children;
> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> + unsigned long mmu_valid_gen;
> DECLARE_BITMAP(unsync_child_bitmap, 512);
>
> #ifdef CONFIG_X86_32
> @@ -531,6 +532,7 @@ struct kvm_arch {
> unsigned int n_requested_mmu_pages;
> unsigned int n_max_mmu_pages;
> unsigned int indirect_shadow_pages;
> + unsigned long mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> /*
> * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9ac584f..12129b7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
> .rmap_unmap = kvm_unmap_invalid_rmapp
> };
>
> +static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
> +{
> + slot->arch.ops = &invalid_rmap_ops;
> +}
> +
> typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
> static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
> handle_rmap_fun fun, void *data)
> @@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
> walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
> }
>
> +/*
> + * Fast invalid all shadow pages belong to @slot.
> + *
> + * @slot != NULL means the invalidation is caused the memslot specified
> + * by @slot is being deleted, in this case, we should ensure that rmap
> + * and lpage-info of the @slot can not be used after calling the function.
> + * Specially, if @slot is INVALID_ALL_SLOTS, all slots will be deleted
> + * soon, it always happens when kvm is being destroyed.
> + *
> + * @slot == NULL means the invalidation due to other reasons, we need
> + * not care rmap and lpage-info since they are still valid after calling
> + * the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> + struct kvm_memory_slot *slot)
> +{
> + struct kvm_memory_slot *each_slot;
> +
> + spin_lock(&kvm->mmu_lock);
> + kvm->arch.mmu_valid_gen++;
> +
> + if (slot == INVALID_ALL_SLOTS)
> + kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
> + init_invalid_memslot_rmap_ops(each_slot);
> + else if (slot)
> + init_invalid_memslot_rmap_ops(slot);
> +
> + /*
> + * All shadow paes are invalid, reset the large page info,
> + * then we can safely desotry the memslot, it is also good
> + * for large page used.
> + */
> + kvm_clear_all_lpage_info(kvm);
> +
> + /*
> + * Notify all vcpus to reload its shadow page table
> + * and flush TLB. Then all vcpus will switch to new
> + * shadow page table with the new mmu_valid_gen.
> + *
> + * Note: we should do this under the protection of
> + * mmu-lock, otherwise, vcpu would purge shadow page
> + * but miss tlb flush.
> + */
> + kvm_reload_remote_mmus(kvm);
> + spin_unlock(&kvm->mmu_lock);
> +
> + if (slot == INVALID_ALL_SLOTS)
> + kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
> + unmap_memslot_rmap_nolock(kvm, each_slot);
> + else if (slot)
> + unmap_memslot_rmap_nolock(kvm, slot);

What is the justification for this?

> +
> + /*
> + * To ensure that all vcpus and mmu-notify are not clearing
> + * spte and rmap entry.
> + */
> + synchronize_srcu_expedited(&kvm->srcu);
> +}
> +
> #ifdef MMU_DEBUG
> static int is_empty_shadow_page(u64 *spt)
> {
> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> __clear_sp_write_flooding_count(sp);
> }
>
> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> gfn_t gfn,
> gva_t gaddr,
> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> role.quadrant = quadrant;
> }
> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> + if (!is_valid_sp(vcpu->kvm, sp))
> + continue;
> +
> if (!need_sync && sp->unsync)
> need_sync = true;
>
> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>
> account_shadowed(vcpu->kvm, gfn);
> }
> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> init_shadow_page_table(sp);
> trace_kvm_mmu_get_page(sp, true);
> return sp;
> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> kvm_mmu_page_unlink_children(kvm, sp);

The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
accessed (as they have been freed already).

I suppose the is_valid_sp() conditional below should be moved earlier,
before kvm_mmu_unlink_parents or any other rmap access.

This is fine: the !is_valid_sp() shadow pages are only reachable
by SLAB and the hypervisor itself.

> kvm_mmu_unlink_parents(kvm, sp);
> - if (!sp->role.invalid && !sp->role.direct)
> +
> + if (!sp->role.invalid && !sp->role.direct &&
> + /* Invalid-gen pages are not accounted. */
> + is_valid_sp(kvm, sp))
> unaccount_shadowed(kvm, sp->gfn);
> +
> if (sp->unsync)
> kvm_unlink_unsync_page(kvm, sp);
> if (!sp->root_count) {
> @@ -3256,7 +3333,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
>
> sp = page_header(root);
> --sp->root_count;
> - if (!sp->root_count && sp->role.invalid) {
> + if (!is_valid_sp(vcpu->kvm, sp) ||
> + (!sp->root_count && sp->role.invalid)) {
> kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);

Why can't kvm_mmu_prepare_zap_page+kvm_mmu_commit_zap_page handle
invalid shadow pages?

Perhaps another name is useful to avoid confusion with role.invalid.
Maybe 'defunct'.

2013-04-18 02:39:09

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

On Tue, Apr 16, 2013 at 02:32:53PM +0800, Xiao Guangrong wrote:
> Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all and
> rename kvm_zap_all to kvm_free_all which is used to free all
> memmory used by kvm mmu when vm is being destroyed, at this time,
> no vcpu exists and mmu-notify has been unregistered, so we can
> free the shadow pages out of mmu-lock

Since there is no contention for mmu-lock, it's also not a problem to
grab the lock, right?

Automated verification of locking/srcu might complain.

2013-04-18 03:15:49

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 07/15] KVM: MMU: introduce invalid rmap handlers

On 04/18/2013 07:38 AM, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2013 at 02:32:45PM +0800, Xiao Guangrong wrote:
>> Invalid rmaps is the rmap of the invalid memslot which is being
>> deleted, especially, we can treat all rmaps are invalid when
>> kvm is being destroyed since all memslot will be deleted soon.
>> MMU should remove all sptes on these rmaps before the invalid
>> memslot fully deleted
>>
>> The reason why we separately handle invalid rmap is we want to
>> unmap invalid-rmap out of mmu-lock to achieve scale performance
>> on intensive memory and vcpu used guest
>
> Better try to make the code simpler, and introduce complexity only
> if necessary.

Marcelo,

This code is necessary to implement "unmap invalid rmap out of mmu-lock";
the reason we need it is that ...

>
> The idea to zap the roots is very elegant and apparently effective. What
> are its problems?

I mentioned it in 00/15:

* Challenges
Some page invalidation is requested when memslot is moved or deleted
and kvm is being destroy who call zap_all_pages to delete all sp using
their rmap and lpage-info, after call zap_all_pages, the rmap and lpage-info
will be freed. So, we should implement a fast way to delete sp from the rmap
and lpage-info.

2013-04-18 04:00:55

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

On 04/18/2013 08:05 AM, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2013 at 02:32:50PM +0800, Xiao Guangrong wrote:
>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>> walk and zap all shadow pages one by one, also it need to zap all guest
>> page's rmap and all shadow page's parent spte list. Particularly, things
>> become worse if guest uses more memory or vcpus. It is not good for
>> scalability.
>>
>> In this patch, we introduce a faster way to invalid all shadow pages.
>> KVM maintains a global mmu invalid generation-number which is stored in
>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
>> generation-number into sp->mmu_valid_gen when it is created.
>>
>> When KVM need zap all shadow pages sptes, it just simply increase the
>> global generation-number then reload root shadow pages on all vcpus.
>> Vcpu will create a new shadow page table according to current kvm's
>> generation-number. It ensures the old pages are not used any more.
>>
>> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
>> are keeped in mmu-cache until page allocator reclaims page.
>>
>> If the invalidation is due to memslot changed, its rmap amd lpage-info
>> will be freed soon, in order to avoiding use invalid memory, we unmap
>> all sptes on its rmap and always reset the large-info all memslots so
>> that rmap and lpage info can be safely freed.
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 2 +
>> arch/x86/kvm/mmu.c | 85 +++++++++++++++++++++++++++++++++++++-
>> arch/x86/kvm/mmu.h | 4 ++
>> arch/x86/kvm/x86.c | 6 +++
>> 4 files changed, 94 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 1ad9a34..6f8ee18 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -223,6 +223,7 @@ struct kvm_mmu_page {
>> int root_count; /* Currently serving as active root */
>> unsigned int unsync_children;
>> unsigned long parent_ptes; /* Reverse mapping for parent_pte */
>> + unsigned long mmu_valid_gen;
>> DECLARE_BITMAP(unsync_child_bitmap, 512);
>>
>> #ifdef CONFIG_X86_32
>> @@ -531,6 +532,7 @@ struct kvm_arch {
>> unsigned int n_requested_mmu_pages;
>> unsigned int n_max_mmu_pages;
>> unsigned int indirect_shadow_pages;
>> + unsigned long mmu_valid_gen;
>> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>> /*
>> * Hash table of struct kvm_mmu_page.
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9ac584f..12129b7 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1732,6 +1732,11 @@ static struct rmap_operations invalid_rmap_ops = {
>> .rmap_unmap = kvm_unmap_invalid_rmapp
>> };
>>
>> +static void init_invalid_memslot_rmap_ops(struct kvm_memory_slot *slot)
>> +{
>> + slot->arch.ops = &invalid_rmap_ops;
>> +}
>> +
>> typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
>> static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
>> handle_rmap_fun fun, void *data)
>> @@ -1812,6 +1817,65 @@ void free_meslot_rmap_desc_nolock(struct kvm_memory_slot *slot)
>> walk_memslot_rmap_nolock(slot, free_rmap_desc_nolock, NULL);
>> }
>>
>> +/*
>> + * Fast invalid all shadow pages belong to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused the memslot specified
>> + * by @slot is being deleted, in this case, we should ensure that rmap
>> + * and lpage-info of the @slot can not be used after calling the function.
>> + * Specially, if @slot is INVALID_ALL_SLOTS, all slots will be deleted
>> + * soon, it always happens when kvm is being destroyed.
>> + *
>> + * @slot == NULL means the invalidation due to other reasons, we need
>> + * not care rmap and lpage-info since they are still valid after calling
>> + * the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> + struct kvm_memory_slot *slot)
>> +{
>> + struct kvm_memory_slot *each_slot;
>> +
>> + spin_lock(&kvm->mmu_lock);
>> + kvm->arch.mmu_valid_gen++;
>> +
>> + if (slot == INVALID_ALL_SLOTS)
>> + kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
>> + init_invalid_memslot_rmap_ops(each_slot);
>> + else if (slot)
>> + init_invalid_memslot_rmap_ops(slot);
>> +
>> + /*
>> + * All shadow paes are invalid, reset the large page info,
>> + * then we can safely desotry the memslot, it is also good
>> + * for large page used.
>> + */
>> + kvm_clear_all_lpage_info(kvm);
>> +
>> + /*
>> + * Notify all vcpus to reload its shadow page table
>> + * and flush TLB. Then all vcpus will switch to new
>> + * shadow page table with the new mmu_valid_gen.
>> + *
>> + * Note: we should do this under the protection of
>> + * mmu-lock, otherwise, vcpu would purge shadow page
>> + * but miss tlb flush.
>> + */
>> + kvm_reload_remote_mmus(kvm);
>> + spin_unlock(&kvm->mmu_lock);
>> +
>> + if (slot == INVALID_ALL_SLOTS)
>> + kvm_for_each_memslot(each_slot, kvm_memslots(kvm))
>> + unmap_memslot_rmap_nolock(kvm, each_slot);
>> + else if (slot)
>> + unmap_memslot_rmap_nolock(kvm, slot);
>
> What is the justification for this?

We want the rmap of the memslot being deleted to be remove-only; that is
needed for unmapping the rmap out of mmu-lock.

======
1) do not corrupt the rmap
2) keep pte-list-descs available
3) keep shadow page available

Resolve 1):
we make the invalid rmap be remove-only that means we only delete and
clear spte from the rmap, no new sptes can be added to it.
This is reasonable since kvm can not do address translation on invalid rmap
(gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
not be reused (they belong to invalid shadow page).
======

clear_flush_young / test_young / change_pte of mmu-notify can rewrite the
rmap with a present spte (P bit set), so we must unmap the rmap in these
handlers instead.
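
Roughly, the per-slot ops table is what enforces that. A simplified sketch
(the wrapper name and shape below are my own illustration, not the exact
helper in the patches):

static int slot_rmap_age(struct kvm *kvm, struct kvm_memory_slot *slot,
			 unsigned long *rmapp)
{
	/*
	 * A normal slot clears the accessed bit in place; once the slot
	 * has been switched to invalid_rmap_ops, .rmap_age points at
	 * kvm_unmap_invalid_rmapp, so the aging notifier can only remove
	 * sptes - it never writes a present spte back into the rmap.
	 */
	return slot->arch.ops->rmap_age(kvm, rmapp);
}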

>
>> +
>> + /*
>> + * To ensure that all vcpus and mmu-notify are not clearing
>> + * spte and rmap entry.
>> + */
>> + synchronize_srcu_expedited(&kvm->srcu);
>> +}
>> +
>> #ifdef MMU_DEBUG
>> static int is_empty_shadow_page(u64 *spt)
>> {
>> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>> __clear_sp_write_flooding_count(sp);
>> }
>>
>> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>> +{
>> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
>> +}
>> +
>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>> gfn_t gfn,
>> gva_t gaddr,
>> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>> role.quadrant = quadrant;
>> }
>> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
>> + if (!is_valid_sp(vcpu->kvm, sp))
>> + continue;
>> +
>> if (!need_sync && sp->unsync)
>> need_sync = true;
>>
>> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>
>> account_shadowed(vcpu->kvm, gfn);
>> }
>> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>> init_shadow_page_table(sp);
>> trace_kvm_mmu_get_page(sp, true);
>> return sp;
>> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>> kvm_mmu_page_unlink_children(kvm, sp);
>
> The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
> accessed (as they have been freed already).
>
> I suppose the is_valid_sp() conditional below should be moved earlier,
> before kvm_mmu_unlink_parents or any other rmap access.
>
> This is fine: the !is_valid_sp() shadow pages are only reachable
> by SLAB and the hypervisor itself.

Unfortunately we can not do this. :(

The sptes in a shadow page can be linked to many slots. If a spte is
linked to the rmap of the memslot being deleted, it is ok; otherwise, the
rmap of a still-used memslot would miss the update.

For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
and sp->spte[1] is linked on slot[1].rmap. If we do not access the rmaps of
this 'sp', the already-freed spte[1] stays linked on slot[1].rmap.

We could let kvm update the rmap for sp->spte[1] and not unlink sp->spte[0].
This is also not allowed, since mmu-notify can access the invalid rmap
before the memslot is destroyed: mmu-notify would then find an already-freed
spte on the rmap, or the page's Accessed/Dirty bits would be mis-tracked
(if we kept mmu-notify away from the invalid rmap).

>
>> kvm_mmu_unlink_parents(kvm, sp);
>> - if (!sp->role.invalid && !sp->role.direct)
>> +
>> + if (!sp->role.invalid && !sp->role.direct &&
>> + /* Invalid-gen pages are not accounted. */
>> + is_valid_sp(kvm, sp))
>> unaccount_shadowed(kvm, sp->gfn);
>> +
>> if (sp->unsync)
>> kvm_unlink_unsync_page(kvm, sp);
>> if (!sp->root_count) {
>> @@ -3256,7 +3333,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
>>
>> sp = page_header(root);
>> --sp->root_count;
>> - if (!sp->root_count && sp->role.invalid) {
>> + if (!is_valid_sp(vcpu->kvm, sp) ||
>> + (!sp->root_count && sp->role.invalid)) {
>> kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
>> kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
>
> Why can't kvm_mmu_prepare_zap_page+kvm_mmu_commit_zap_page handle
> invalid shadow pages?

Oh, this is unnecessary. Anyway, the vcpu can reset its root when it is
handling KVM_REQ_MMU_RELOAD.

>
> Perhaps another name is useful to avoid confusion with role.invalid.
> Maybe 'defunct'.

Yes, this is better.

2013-04-18 04:04:06

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

On 04/18/2013 08:08 AM, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2013 at 02:32:53PM +0800, Xiao Guangrong wrote:
>> Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all and
>> rename kvm_zap_all to kvm_free_all which is used to free all
>> memmory used by kvm mmu when vm is being destroyed, at this time,
>> no vcpu exists and mmu-notify has been unregistered, so we can
>> free the shadow pages out of mmu-lock
>
> Since there is no contention for mmu-lock its also not a problem to
> grab the lock right?

This still has contention: other mmu-notify calls can happen while we
handle ->release(). On the other hand, spin-locks are not preemptible.

>
> Automated verification of locking/srcu might complain.

We hold the slots-lock to free shadow pages out of mmu-lock; that can
avoid the complaint, no?


2013-04-18 11:00:47

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

On Tue, Apr 16, 2013 at 02:32:46PM +0800, Xiao Guangrong wrote:
> pte_list_clear_concurrently allows us to reset pte-desc entry
> out of mmu-lock. We can reset spte out of mmu-lock if we can protect the
> lifecycle of sp, we use this way to achieve the goal:
>
> unmap_memslot_rmap_nolock():
> for-each-rmap-in-slot:
> preempt_disable
> kvm->arch.being_unmapped_rmap = rmapp
> clear spte and reset rmap entry
> kvm->arch.being_unmapped_rmap = NULL
> preempt_enable
>
> Other patch like zap-sp and mmu-notify which are protected
> by mmu-lock:
> clear spte and reset rmap entry
> retry:
> if (kvm->arch.being_unmapped_rmap == rmap)
> goto retry
> (the wait is very rare and clear one rmap is very fast, it
> is not bad even if wait is needed)
>
I do not understand how this achieves the goal. Suppose that rmap
== X and kvm->arch.being_unmapped_rmap == NULL so "goto retry" is skipped,
but a moment later unmap_memslot_rmap_nolock() does
kvm->arch.being_unmapped_rmap = X.

> Then, we can sure the spte is always available when we do
> unmap_memslot_rmap_nolock
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/mmu.c | 114 ++++++++++++++++++++++++++++++++++++---
> arch/x86/kvm/mmu.h | 2 +-
> 3 files changed, 110 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5fd6ed1..1ad9a34 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -536,6 +536,8 @@ struct kvm_arch {
> * Hash table of struct kvm_mmu_page.
> */
> struct list_head active_mmu_pages;
> + unsigned long *being_unmapped_rmap;
> +
> struct list_head assigned_dev_head;
> struct iommu_domain *iommu_domain;
> int iommu_flags;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2a7a5d0..e6414d2 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1104,10 +1104,10 @@ static int slot_rmap_add(struct kvm_memory_slot *slot,
> return slot->arch.ops->rmap_add(vcpu, spte, rmapp);
> }
>
> -static void slot_rmap_remove(struct kvm_memory_slot *slot,
> +static void slot_rmap_remove(struct kvm_memory_slot *slot, struct kvm *kvm,
> unsigned long *rmapp, u64 *spte)
> {
> - slot->arch.ops->rmap_remove(spte, rmapp);
> + slot->arch.ops->rmap_remove(kvm, spte, rmapp);
> }
>
> static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> @@ -1132,7 +1132,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
> sp = page_header(__pa(spte));
> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> rmapp = gfn_to_rmap(kvm, &slot, gfn, sp->role.level);
> - slot_rmap_remove(slot, rmapp, spte);
> + slot_rmap_remove(slot, kvm, rmapp, spte);
> }
>
> /*
> @@ -1589,9 +1589,14 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
> return kvm_handle_hva(kvm, hva, 0, slot_rmap_test_age);
> }
>
> +static void rmap_remove_spte(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
> +{
> + pte_list_remove(spte, rmapp);
> +}
> +
> static struct rmap_operations normal_rmap_ops = {
> .rmap_add = pte_list_add,
> - .rmap_remove = pte_list_remove,
> + .rmap_remove = rmap_remove_spte,
>
> .rmap_write_protect = __rmap_write_protect,
>
> @@ -1613,9 +1618,27 @@ static int invalid_rmap_add(struct kvm_vcpu *vcpu, u64 *spte,
> return 0;
> }
>
> -static void invalid_rmap_remove(u64 *spte, unsigned long *rmapp)
> +static void sync_being_unmapped_rmap(struct kvm *kvm, unsigned long *rmapp)
> +{
> + /*
> + * Ensure all the sptes on the rmap have been zapped and
> + * the rmap's entries have been reset so that
> + * unmap_invalid_rmap_nolock can not get any spte from the
> + * rmap after calling sync_being_unmapped_rmap().
> + */
> + smp_mb();
> +retry:
> + if (unlikely(ACCESS_ONCE(kvm->arch.being_unmapped_rmap) == rmapp)) {
> + cpu_relax();
> + goto retry;
> + }
> +}
> +
> +static void
> +invalid_rmap_remove(struct kvm *kvm, u64 *spte, unsigned long *rmapp)
> {
> pte_list_clear_concurrently(spte, rmapp);
> + sync_being_unmapped_rmap(kvm, rmapp);
> }
>
> static bool invalid_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1635,7 +1658,11 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
> if (sptep == PTE_LIST_SPTE_SKIP)
> continue;
>
> - /* Do not call .rmap_remove(). */
> + /*
> + * Do not call .rmap_remove() since we do not want to wait
> + * on sync_being_unmapped_rmap() when all sptes should be
> + * removed from the rmap.
> + */
> if (mmu_spte_clear_track_bits(sptep))
> pte_list_clear_concurrently(sptep, rmapp);
> }
> @@ -1645,7 +1672,10 @@ static int __kvm_unmap_invalid_rmapp(unsigned long *rmapp)
>
> static int kvm_unmap_invalid_rmapp(struct kvm *kvm, unsigned long *rmapp)
> {
> - return __kvm_unmap_invalid_rmapp(rmapp);
> + int ret = __kvm_unmap_invalid_rmapp(rmapp);
> +
> + sync_being_unmapped_rmap(kvm, rmapp);
> + return ret;
> }
>
> static int invalid_rmap_set_pte(struct kvm *kvm, unsigned long *rmapp,
> @@ -1686,6 +1716,76 @@ static struct rmap_operations invalid_rmap_ops = {
> .rmap_unmap = kvm_unmap_invalid_rmapp
> };
>
> +typedef void (*handle_rmap_fun)(unsigned long *rmapp, void *data);
> +static void walk_memslot_rmap_nolock(struct kvm_memory_slot *slot,
> + handle_rmap_fun fun, void *data)
> +{
> + int level;
> +
> + for (level = PT_PAGE_TABLE_LEVEL;
> + level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
> + unsigned long idx, *rmapp;
> +
> + rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
> + idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
> + slot->base_gfn, level) + 1;
> + /*
> + * Walk ramp from the high index to low index to reduce
> + * possible wait in sync_being_unmapped_rmap().
> + */
> + while (idx--)
> + fun(rmapp + idx, data);
> + }
> +}
> +
> +static void unmap_rmap_no_lock_begin(struct kvm *kvm, unsigned long *rmapp)
> +{
> + preempt_disable();
> + kvm->arch.being_unmapped_rmap = rmapp;
> +
> + /*
> + * Set being_unmapped_rmap should be before read/write any
> + * sptes on the rmaps.
> + * See the comment in sync_being_unmapped_rmap().
> + */
> + smp_mb();
> +}
> +
> +static void unmap_rmap_no_lock_end(struct kvm *kvm)
> +{
> + /*
> + * Ensure clearing spte and resetting rmap's entries has
> + * been finished.
> + * See the comment in sync_being_unmapped_rmap().
> + */
> + smp_mb();
> + kvm->arch.being_unmapped_rmap = NULL;
> + preempt_enable();
> +}
> +
> +static void unmap_invalid_rmap_nolock(unsigned long *rmapp, void *data)
> +{
> + struct kvm *kvm = (struct kvm *)data;
> +
> + if (!ACCESS_ONCE(*rmapp))
> + return;
> +
> + unmap_rmap_no_lock_begin(kvm, rmapp);
> + __kvm_unmap_invalid_rmapp(rmapp);
> + unmap_rmap_no_lock_end(kvm);
> +}
> +
> +static void
> +unmap_memslot_rmap_nolock(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + /* Only invalid rmaps can be unmapped out of mmu-lock. */
> + WARN_ON(slot->arch.ops != &invalid_rmap_ops);
> + /* Use slots_lock to protect kvm->arch.being_unmapped_rmap. */
> + WARN_ON(!mutex_is_locked(&kvm->slots_lock));
> +
> + walk_memslot_rmap_nolock(slot, unmap_invalid_rmap_nolock, kvm);
> +}
> +
> #ifdef MMU_DEBUG
> static int is_empty_shadow_page(u64 *spt)
> {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bb2b22e..d6aa31a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -117,7 +117,7 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
> struct rmap_operations {
> int (*rmap_add)(struct kvm_vcpu *vcpu, u64 *spte,
> unsigned long *rmap);
> - void (*rmap_remove)(u64 *spte, unsigned long *rmap);
> + void (*rmap_remove)(struct kvm *kvm, u64 *spte, unsigned long *rmap);
>
> bool (*rmap_write_protect)(struct kvm *kvm, unsigned long *rmap,
> bool pt_protect);
> --
> 1.7.7.6

--
Gleb.

2013-04-18 11:26:45

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

On 04/18/2013 07:00 PM, Gleb Natapov wrote:
> On Tue, Apr 16, 2013 at 02:32:46PM +0800, Xiao Guangrong wrote:
>> pte_list_clear_concurrently allows us to reset pte-desc entry
>> out of mmu-lock. We can reset spte out of mmu-lock if we can protect the
>> lifecycle of sp, we use this way to achieve the goal:
>>
>> unmap_memslot_rmap_nolock():
>> for-each-rmap-in-slot:
>> preempt_disable
>> kvm->arch.being_unmapped_rmap = rmapp
>> clear spte and reset rmap entry
>> kvm->arch.being_unmapped_rmap = NULL
>> preempt_enable
>>
>> Other patch like zap-sp and mmu-notify which are protected
>> by mmu-lock:
>> clear spte and reset rmap entry
>> retry:
>> if (kvm->arch.being_unmapped_rmap == rmap)
>> goto retry
>> (the wait is very rare and clear one rmap is very fast, it
>> is not bad even if wait is needed)
>>
> I do not understand what how this achieve the goal. Suppose that rmap
> == X and kvm->arch.being_unmapped_rmap == NULL so "goto retry" is skipped,
> but moment later unmap_memslot_rmap_nolock() does
> vm->arch.being_unmapped_rmap = X.

Accessing the rmap is always safe since the rmap and its entries stay
valid until the memslot is destroyed.

This algorithm protects the spte, which can only be freed under the
protection of mmu-lock.

In your scenario:

======
CPU 1 (vcpu / mmu-notify, accessing         CPU 2 (unmapping the rmap out of
the RMAP under mmu-lock)                    mmu-lock, under slots-lock)

zap spte1
clear RMAP entry

sees kvm->arch.being_unmapped_rmap == NULL,
so does not wait

free spte1

                                            set kvm->arch.being_unmapped_rmap = RMAP
                                            walk the RMAP and do not see spte1
                                            (its entry has been reset by CPU 1)
                                            set kvm->arch.being_unmapped_rmap = NULL
======

That ensures CPU 2 can not access the freed spte.
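
Putting the two sides of the handshake next to each other (condensed and
slightly simplified from patch 08; see unmap_rmap_no_lock_begin/end() and
sync_being_unmapped_rmap() there for the real code):

	/* No-lock side (under slots-lock), for each rmap in the invalid slot: */
	preempt_disable();
	kvm->arch.being_unmapped_rmap = rmapp;
	smp_mb();                           /* publish before touching sptes */
	__kvm_unmap_invalid_rmapp(rmapp);   /* clear sptes, reset rmap entries */
	smp_mb();                           /* finish before unpublishing */
	kvm->arch.being_unmapped_rmap = NULL;
	preempt_enable();

	/* mmu-lock side, after clearing the spte and resetting the rmap entry: */
	smp_mb();
	while (ACCESS_ONCE(kvm->arch.being_unmapped_rmap) == rmapp)
		cpu_relax();        /* wait until the no-lock walker has left
				     * this rmap; only then is the spte freed */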

2013-04-18 11:38:42

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

On Thu, Apr 18, 2013 at 07:22:23PM +0800, Xiao Guangrong wrote:
> On 04/18/2013 07:00 PM, Gleb Natapov wrote:
> > On Tue, Apr 16, 2013 at 02:32:46PM +0800, Xiao Guangrong wrote:
> >> pte_list_clear_concurrently allows us to reset pte-desc entry
> >> out of mmu-lock. We can reset spte out of mmu-lock if we can protect the
> >> lifecycle of sp, we use this way to achieve the goal:
> >>
> >> unmap_memslot_rmap_nolock():
> >> for-each-rmap-in-slot:
> >> preempt_disable
> >> kvm->arch.being_unmapped_rmap = rmapp
> >> clear spte and reset rmap entry
> >> kvm->arch.being_unmapped_rmap = NULL
> >> preempt_enable
> >>
> >> Other patch like zap-sp and mmu-notify which are protected
> >> by mmu-lock:
> >> clear spte and reset rmap entry
> >> retry:
> >> if (kvm->arch.being_unmapped_rmap == rmap)
> >> goto retry
> >> (the wait is very rare and clear one rmap is very fast, it
> >> is not bad even if wait is needed)
> >>
> > I do not understand what how this achieve the goal. Suppose that rmap
> > == X and kvm->arch.being_unmapped_rmap == NULL so "goto retry" is skipped,
> > but moment later unmap_memslot_rmap_nolock() does
> > vm->arch.being_unmapped_rmap = X.
>
> Access rmap is always safe since rmap and its entries are valid until
> memslot is destroyed.
>
> This algorithm protects spte since it can be freed in the protection of mmu-lock.
>
> In your scenario:
>
> ======
> CPU 1 CPU 2
>
> vcpu / mmu-notify access the RMAP unmap rmap out of mmu-lock which is under
> which is under mmu-lock slot-lock
>
> zap spte1
> clear RMAP entry
>
> kvm->arch.being_unmapped_rmap = NULL,
> do not wait
>
> free spte1
>
> set kvm->arch.being_unmapped_rmap = RMAP
> walking RMAP and do not see spet1 on RMAP
> (the entry of spte 1 has been reset by CPU 1)
and what prevents this from happening concurrently with "clear RMAP
entry"? Is it safe?

> set kvm->arch.being_unmapped_rmap = NULL
> ======
>
> That protect CPU 2 can not access the freed-spte.
>

--
Gleb.

2013-04-18 12:10:35

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 08/15] KVM: MMU: allow unmap invalid rmap out of mmu-lock

On 04/18/2013 07:38 PM, Gleb Natapov wrote:
> On Thu, Apr 18, 2013 at 07:22:23PM +0800, Xiao Guangrong wrote:
>> On 04/18/2013 07:00 PM, Gleb Natapov wrote:
>>> On Tue, Apr 16, 2013 at 02:32:46PM +0800, Xiao Guangrong wrote:
>>>> pte_list_clear_concurrently allows us to reset pte-desc entry
>>>> out of mmu-lock. We can reset spte out of mmu-lock if we can protect the
>>>> lifecycle of sp, we use this way to achieve the goal:
>>>>
>>>> unmap_memslot_rmap_nolock():
>>>> for-each-rmap-in-slot:
>>>> preempt_disable
>>>> kvm->arch.being_unmapped_rmap = rmapp
>>>> clear spte and reset rmap entry
>>>> kvm->arch.being_unmapped_rmap = NULL
>>>> preempt_enable
>>>>
>>>> Other patch like zap-sp and mmu-notify which are protected
>>>> by mmu-lock:
>>>> clear spte and reset rmap entry
>>>> retry:
>>>> if (kvm->arch.being_unmapped_rmap == rmap)
>>>> goto retry
>>>> (the wait is very rare and clear one rmap is very fast, it
>>>> is not bad even if wait is needed)
>>>>
>>> I do not understand what how this achieve the goal. Suppose that rmap
>>> == X and kvm->arch.being_unmapped_rmap == NULL so "goto retry" is skipped,
>>> but moment later unmap_memslot_rmap_nolock() does
>>> vm->arch.being_unmapped_rmap = X.
>>
>> Access rmap is always safe since rmap and its entries are valid until
>> memslot is destroyed.
>>
>> This algorithm protects spte since it can be freed in the protection of mmu-lock.
>>
>> In your scenario:
>>
>> ======
>> CPU 1 CPU 2
>>
>> vcpu / mmu-notify access the RMAP unmap rmap out of mmu-lock which is under
>> which is under mmu-lock slot-lock
>>
>> zap spte1
>> clear RMAP entry
>>
>> kvm->arch.being_unmapped_rmap = NULL,
>> do not wait
>>
>> free spte1
>>
>> set kvm->arch.being_unmapped_rmap = RMAP
>> walking RMAP and do not see spet1 on RMAP
>> (the entry of spte 1 has been reset by CPU 1)
> and what prevents this from happening concurrently with "clear RMAP
> entry"? Is it safe?

All possible changes to a RMAP entry go from a valid spte to
PTE_LIST_SPTE_SKIP (never from one valid spte to another, and never from
empty to a new spte).

There are three possible cases:
case 1): both paths see the valid spte.
the worst case is that the host page's Accessed/Dirty state is tracked
twice (kvm_set_pfn_accessed/kvm_set_pfn_dirty called more than once),
which is safe.

case 2): only the path under the protection of mmu-lock sees the valid spte.
this is safe since the RMAP and the spte are always valid under mmu-lock.

case 3): only the path out of mmu-lock sees the valid spte.
then the path under mmu-lock waits until the no-lock path has finished.
The spte is still valid, so the no-lock path is safe to call
kvm_set_pfn_accessed/kvm_set_pfn_dirty.

Do you see any potential issue?
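
For reference, both walkers execute essentially this per-entry step
(condensed from __kvm_unmap_invalid_rmapp() / invalid_rmap_remove() in
patches 06-08; only the comments are mine):

	for (sptep = rmap_get_first(*rmapp, &iter); sptep;
	     sptep = rmap_get_next(&iter)) {
		/* The other walker already cleared this entry. */
		if (sptep == PTE_LIST_SPTE_SKIP)
			continue;

		/* Zap the spte and propagate its Accessed/Dirty bits, */
		if (mmu_spte_clear_track_bits(sptep))
			/* then mark the entry as removed without freeing it. */
			pte_list_clear_concurrently(sptep, rmapp);
	}

An entry only ever goes from a valid spte to PTE_LIST_SPTE_SKIP, never
back, so at worst both walkers zap the same spte (case 1 above).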

2013-04-18 13:29:46

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

On Thu, Apr 18, 2013 at 10:03:06AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
> > >
> > > What is the justification for this?
> >
> > We want the rmap of being deleted memslot is removed-only that is
> > needed for unmapping rmap out of mmu-lock.
> >
> > ======
> > 1) do not corrupt the rmap
> > 2) keep pte-list-descs available
> > 3) keep shadow page available
> >
> > Resolve 1):
> > we make the invalid rmap be remove-only that means we only delete and
> > clear spte from the rmap, no new sptes can be added to it.
> > This is reasonable since kvm can not do address translation on invalid rmap
> > (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
> > not be reused (they belong to invalid shadow page).
> > ======
> >
> > clear_flush_young / test_young / change_pte of mmu-notify can rewrite
> > rmap with the present-spte (P bit is set), we should umap rmap in
> > these handlers.
> >
> > >
> > >> +
> > >> + /*
> > >> + * To ensure that all vcpus and mmu-notify are not clearing
> > >> + * spte and rmap entry.
> > >> + */
> > >> + synchronize_srcu_expedited(&kvm->srcu);
> > >> +}
> > >> +
> > >> #ifdef MMU_DEBUG
> > >> static int is_empty_shadow_page(u64 *spt)
> > >> {
> > >> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > >> __clear_sp_write_flooding_count(sp);
> > >> }
> > >>
> > >> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > >> +{
> > >> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> > >> +}
> > >> +
> > >> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >> gfn_t gfn,
> > >> gva_t gaddr,
> > >> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >> role.quadrant = quadrant;
> > >> }
> > >> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > >> + if (!is_valid_sp(vcpu->kvm, sp))
> > >> + continue;
> > >> +
> > >> if (!need_sync && sp->unsync)
> > >> need_sync = true;
> > >>
> > >> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >>
> > >> account_shadowed(vcpu->kvm, gfn);
> > >> }
> > >> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > >> init_shadow_page_table(sp);
> > >> trace_kvm_mmu_get_page(sp, true);
> > >> return sp;
> > >> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > >> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > >> kvm_mmu_page_unlink_children(kvm, sp);
> > >
> > > The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
> > > accessed (as they have been freed already).
> > >
> > > I suppose the is_valid_sp() conditional below should be moved earlier,
> > > before kvm_mmu_unlink_parents or any other rmap access.
> > >
> > > This is fine: the !is_valid_sp() shadow pages are only reachable
> > > by SLAB and the hypervisor itself.
> >
> > Unfortunately we can not do this. :(
> >
> > The sptes in shadow pape can linked to many slots, if the spte is linked
> > to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
> > still used memslot is miss updated.
> >
> > For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
> > sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 'sp',
> > the already-freed spte[1] is still linked on slot[1].rmap.
> >
> > We can let kvm update the rmap for sp->spte[1] and do not unlink sp->spte[0].
> > This is also not allowed since mmu-notify can access the invalid rmap before
> > the memslot is destroyed, then mmu-notify will get already-freed spte on
> > the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not access
> > the invalid rmap).
>
> Why not release all rmaps?
>
> Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info
>
> This function is used to reset the rmaps and page info of all guest page
> which will be used in later patch
>
> Signed-off-by: Xiao Guangrong <[email protected]>

Which you have later in the patchset.

So, what is the justification for the zap root + generation number increase
to work on a per memslot basis, given that

	/*
	 * If memory slot is created, or moved, we need to clear all
	 * mmio sptes.
	 */
	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
		kvm_mmu_zap_mmio_sptes(kvm);
		kvm_reload_remote_mmus(kvm);
	}

Is this going to be dealt with by the generation-number-on-mmio-spte idea?

Note that at the moment all shadow pages are zapped on deletion / move,
and there is no performance complaint for those cases.

In fact, which case does the generation number on mmio sptes optimize for?
The cases where slots are deleted/moved/created on a live guest are:

- Legacy VGA mode operation where VGA slots are created/deleted. Zapping
all shadow pages is not a performance issue in that case.
- Device hotplug (not performance critical).
- Remapping of PCI memory regions (not a performance issue).
- Memory hotplug (not a performance issue).

These are all rare events in which there is no particular concern about
rebuilding shadow pages to resume cruise-speed operation.

So from this POV (please correct if not accurate), avoiding problems
with a huge number of shadow pages is all that's being asked for.

Which is handled nicely by zap roots + sp gen number increase.

2013-04-18 13:45:56

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
> >
> > What is the justification for this?
>
> We want the rmap of being deleted memslot is removed-only that is
> needed for unmapping rmap out of mmu-lock.
>
> ======
> 1) do not corrupt the rmap
> 2) keep pte-list-descs available
> 3) keep shadow page available
>
> Resolve 1):
> we make the invalid rmap be remove-only that means we only delete and
> clear spte from the rmap, no new sptes can be added to it.
> This is reasonable since kvm can not do address translation on invalid rmap
> (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
> not be reused (they belong to invalid shadow page).
> ======
>
> clear_flush_young / test_young / change_pte of mmu-notify can rewrite
> rmap with the present-spte (P bit is set), we should umap rmap in
> these handlers.
>
> >
> >> +
> >> + /*
> >> + * To ensure that all vcpus and mmu-notify are not clearing
> >> + * spte and rmap entry.
> >> + */
> >> + synchronize_srcu_expedited(&kvm->srcu);
> >> +}
> >> +
> >> #ifdef MMU_DEBUG
> >> static int is_empty_shadow_page(u64 *spt)
> >> {
> >> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >> __clear_sp_write_flooding_count(sp);
> >> }
> >>
> >> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >> +{
> >> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> >> +}
> >> +
> >> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >> gfn_t gfn,
> >> gva_t gaddr,
> >> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >> role.quadrant = quadrant;
> >> }
> >> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> >> + if (!is_valid_sp(vcpu->kvm, sp))
> >> + continue;
> >> +
> >> if (!need_sync && sp->unsync)
> >> need_sync = true;
> >>
> >> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >>
> >> account_shadowed(vcpu->kvm, gfn);
> >> }
> >> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> >> init_shadow_page_table(sp);
> >> trace_kvm_mmu_get_page(sp, true);
> >> return sp;
> >> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> >> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> >> kvm_mmu_page_unlink_children(kvm, sp);
> >
> > The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
> > accessed (as they have been freed already).
> >
> > I suppose the is_valid_sp() conditional below should be moved earlier,
> > before kvm_mmu_unlink_parents or any other rmap access.
> >
> > This is fine: the !is_valid_sp() shadow pages are only reachable
> > by SLAB and the hypervisor itself.
>
> Unfortunately we can not do this. :(
>
> The sptes in shadow pape can linked to many slots, if the spte is linked
> to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
> still used memslot is miss updated.
>
> For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
> sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 'sp',
> the already-freed spte[1] is still linked on slot[1].rmap.
>
> We can let kvm update the rmap for sp->spte[1] and do not unlink sp->spte[0].
> This is also not allowed since mmu-notify can access the invalid rmap before
> the memslot is destroyed, then mmu-notify will get already-freed spte on
> the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not access
> the invalid rmap).

Why not release all rmaps?

Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info

This function is used to reset the rmaps and page info of all guest page
which will be used in later patch

Signed-off-by: Xiao Guangrong <[email protected]>

2013-04-18 15:21:12

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] KVM: MMU: fast invalid all shadow pages

On 04/18/2013 09:29 PM, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 10:03:06AM -0300, Marcelo Tosatti wrote:
>> On Thu, Apr 18, 2013 at 12:00:16PM +0800, Xiao Guangrong wrote:
>>>>
>>>> What is the justification for this?
>>>
>>> We want the rmap of being deleted memslot is removed-only that is
>>> needed for unmapping rmap out of mmu-lock.
>>>
>>> ======
>>> 1) do not corrupt the rmap
>>> 2) keep pte-list-descs available
>>> 3) keep shadow page available
>>>
>>> Resolve 1):
>>> we make the invalid rmap be remove-only that means we only delete and
>>> clear spte from the rmap, no new sptes can be added to it.
>>> This is reasonable since kvm can not do address translation on invalid rmap
>>> (gfn_to_pfn is failed on invalid memslot) and all sptes on invalid rmap can
>>> not be reused (they belong to invalid shadow page).
>>> ======
>>>
>>> clear_flush_young / test_young / change_pte of mmu-notify can rewrite
>>> rmap with the present-spte (P bit is set), we should umap rmap in
>>> these handlers.
>>>
>>>>
>>>>> +
>>>>> + /*
>>>>> + * To ensure that all vcpus and mmu-notify are not clearing
>>>>> + * spte and rmap entry.
>>>>> + */
>>>>> + synchronize_srcu_expedited(&kvm->srcu);
>>>>> +}
>>>>> +
>>>>> #ifdef MMU_DEBUG
>>>>> static int is_empty_shadow_page(u64 *spt)
>>>>> {
>>>>> @@ -2219,6 +2283,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>>> __clear_sp_write_flooding_count(sp);
>>>>> }
>>>>>
>>>>> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>>> +{
>>>>> + return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
>>>>> +}
>>>>> +
>>>>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>> gfn_t gfn,
>>>>> gva_t gaddr,
>>>>> @@ -2245,6 +2314,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>> role.quadrant = quadrant;
>>>>> }
>>>>> for_each_gfn_sp(vcpu->kvm, sp, gfn) {
>>>>> + if (!is_valid_sp(vcpu->kvm, sp))
>>>>> + continue;
>>>>> +
>>>>> if (!need_sync && sp->unsync)
>>>>> need_sync = true;
>>>>>
>>>>> @@ -2281,6 +2353,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>>
>>>>> account_shadowed(vcpu->kvm, gfn);
>>>>> }
>>>>> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>>>>> init_shadow_page_table(sp);
>>>>> trace_kvm_mmu_get_page(sp, true);
>>>>> return sp;
>>>>> @@ -2451,8 +2524,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>>>>> ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>>>>> kvm_mmu_page_unlink_children(kvm, sp);
>>>>
>>>> The rmaps[] arrays linked to !is_valid_sp() shadow pages should not be
>>>> accessed (as they have been freed already).
>>>>
>>>> I suppose the is_valid_sp() conditional below should be moved earlier,
>>>> before kvm_mmu_unlink_parents or any other rmap access.
>>>>
>>>> This is fine: the !is_valid_sp() shadow pages are only reachable
>>>> by SLAB and the hypervisor itself.
>>>
>>> Unfortunately we can not do this. :(
>>>
>>> The sptes in shadow pape can linked to many slots, if the spte is linked
>>> to the rmap of being deleted memslot, it is ok, otherwise, the rmap of
>>> still used memslot is miss updated.
>>>
>>> For example, slot 0 is being deleted, sp->spte[0] is linked on slot[0].rmap,
>>> sp->spte[1] is linked on slot[1].rmap. If we do not access rmap of this 'sp',
>>> the already-freed spte[1] is still linked on slot[1].rmap.
>>>
>>> We can let kvm update the rmap for sp->spte[1] and do not unlink sp->spte[0].
>>> This is also not allowed since mmu-notify can access the invalid rmap before
>>> the memslot is destroyed, then mmu-notify will get already-freed spte on
>>> the rmap or page Access/Dirty is miss tracked (if let mmu-notify do not access
>>> the invalid rmap).
>>
>> Why not release all rmaps?
>>
>> Subject: [PATCH v2 3/7] KVM: x86: introduce kvm_clear_all_gfn_page_info
>>
>> This function is used to reset the rmaps and page info of all guest page
>> which will be used in later patch
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>
> Which you have later in patchset.

The patch you mentioned is old (v2); in v3 it only resets the lpage-info,
not the rmaps:

======
[PATCH v3 11/15] KVM: MMU: introduce kvm_clear_all_lpage_info

This function is used to reset the large page info of all guest page
which will be used in later patch

Signed-off-by: Xiao Guangrong <[email protected]>
======

We can not release all rmaps. If we did, ->invalidate_page and
->invalidate_range_start could not find any spte using the host page,
which means Accessed/Dirty tracking for the host page would be missed
(kvm_set_pfn_accessed and kvm_set_pfn_dirty would not be called properly).

Furthermore, when we drop an invalid-gen spte, we would call
kvm_set_pfn_accessed/kvm_set_pfn_dirty for an already-freed host page,
since mmu-notify can not find the spte via the rmap.
(We could skip drop-spte for the invalid-gen sp, but then A/D for the host
page can be missed.)

That is why I introduced unmap_invalid_rmap out of mmu-lock.

>
> So, what is the justification for the zap root + generation number increase
> to work on a per memslot basis, given that
>
> /*
> * If memory slot is created, or moved, we need to clear all
> * mmio sptes.
> */
> if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
> kvm_mmu_zap_mmio_sptes(kvm);
> kvm_reload_remote_mmus(kvm);
> }
>
> Is going to be dealt with generation number on mmio spte idea?

Yes. Actually, this patchset (v3) is based on two other patchsets:

======
This patchset is based on my previous two patchset:
[PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
(https://lkml.org/lkml/2013/4/1/2)

[PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
(https://lkml.org/lkml/2013/4/1/134)
======

We did that in [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes.

>
> Note at the moment all shadows pages are zapped on deletion / move,
> and there is no performance complaint for those cases.

Yes, zap-mmio is only needed for MEMSLOT_CREATE.

>
> In fact, for what case is generation number on mmio spte optimizes for?
> The cases are where slots are deleted/moved/created on a live guest
> are:
>
> - Legacy VGA mode operation where VGA slots are created/deleted. Zapping
> all shadow not a performance issue in that case.
> - Device hotplug (not performance critical).
> - Remapping of PCI memory regions (not a performance issue).
> - Memory hotplug (not a performance issue).
>
> These are all rare events in which there is no particular concern about
> rebuilding shadow pages to resume cruise speed operation.
>
> So from this POV (please correct if not accurate) avoiding problems
> with huge number of shadow pages is all thats being asked for.
>
> Which is handled nicely by zap roots + sp gen number increase.

So, can we use "zap roots + sp gen number" instead of the current
zap-mmio-sp? If yes, I totally agree with you.


2013-04-20 17:49:08

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

On Thu, Apr 18, 2013 at 12:03:45PM +0800, Xiao Guangrong wrote:
> On 04/18/2013 08:08 AM, Marcelo Tosatti wrote:
> > On Tue, Apr 16, 2013 at 02:32:53PM +0800, Xiao Guangrong wrote:
> >> Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all and
> >> rename kvm_zap_all to kvm_free_all which is used to free all
>> memory used by kvm mmu when vm is being destroyed, at this time,
> >> no vcpu exists and mmu-notify has been unregistered, so we can
> >> free the shadow pages out of mmu-lock
> >
> > Since there is no contention for mmu-lock its also not a problem to
> > grab the lock right?
>
> This still has contention. Other mmu-notify can happen when we handle
> ->release(). On the other hand, spin-lock is not preemptible.

Don't think so:

kvm_coalesced_mmio_free(kvm);
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
#else
kvm_arch_flush_shadow_all(kvm);
#endif
kvm_arch_destroy_vm(kvm);

> > Automated verification of locking/srcu might complain.
>
>> We hold slot-lock to free shadow pages out of mmu-lock; that can avoid
>> the complaint. No?

Not if it realizes srcu is required to access the data structures.

2013-04-21 06:59:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 15/15] KVM: MMU: replace kvm_zap_all with kvm_mmu_invalid_all_pages

On 04/21/2013 01:18 AM, Marcelo Tosatti wrote:
> On Thu, Apr 18, 2013 at 12:03:45PM +0800, Xiao Guangrong wrote:
>> On 04/18/2013 08:08 AM, Marcelo Tosatti wrote:
>>> On Tue, Apr 16, 2013 at 02:32:53PM +0800, Xiao Guangrong wrote:
>>>> Use kvm_mmu_invalid_all_pages in kvm_arch_flush_shadow_all and
>>>> rename kvm_zap_all to kvm_free_all which is used to free all
>>>> memory used by kvm mmu when vm is being destroyed, at this time,
>>>> no vcpu exists and mmu-notify has been unregistered, so we can
>>>> free the shadow pages out of mmu-lock
>>>
>>> Since there is no contention for mmu-lock its also not a problem to
>>> grab the lock right?
>>
>> This still has contention. Other mmu-notify can happen when we handle
>> ->release(). On the other hand, spin-lock is not preemptible.
>
> Don't think so:

Hi Marcelo,

The comment of ->release() says:

/*
* Called either by mmu_notifier_unregister or when the mm is
* being destroyed by exit_mmap, always before all pages are
* freed. This can run concurrently with other mmu notifier
* methods (the ones invoked outside the mm context)
>
> kvm_coalesced_mmio_free(kvm);
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
> #else
> kvm_arch_flush_shadow_all(kvm);
> #endif
> kvm_arch_destroy_vm(kvm);

The contention does not exist in the code you listed above. It happens when
the VM exits abnormally (for example, when the VM is killed). Please refer to
commit 3ad3d90 (mm: mmu_notifier: fix freed page still mapped in secondary MMU).
The current mmu-notify code is wrong and I have posted a patch to fix it, which
can be found at:
http://marc.info/?l=kvm&m=136609583232031&w=2

Maybe I misunderstood your meaning. This patch tries to use kvm_mmu_invalid_all_pages
in ->release and rename kvm_zap_all to kvm_free_all. Do you mean we can still use
mmu-lock in kvm_free_all()? If yes, I do not have a strong opinion on this point and
will keep kvm_free_all under the protection of mmu-lock.

>
>>> Automated verification of locking/srcu might complain.
>>
>> We hold slot-lock to free shadow pages out of mmu-lock; that can avoid
>> the complaint. No?
>
> Not if it realizes srcu is required to access the data structures.

It seems that kvm->srcu is only used to protect kvm->memslots, in kvm_memslots:

static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
{
	return rcu_dereference_check(kvm->memslots,
			srcu_read_lock_held(&kvm->srcu)
			|| lockdep_is_held(&kvm->slots_lock));
}

kvm->memslots can be safely accessed while holding kvm->srcu _or_ slots_lock.

Thanks!

2013-04-21 13:03:53

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> This patchset is based on my previous two patchset:
> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> (https://lkml.org/lkml/2013/4/1/2)
>
> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> (https://lkml.org/lkml/2013/4/1/134)
>
> Changlog:
> V3:
> completely redesign the algorithm, please see below.
>
This looks pretty complicated. Is it still needed in order to avoid soft
lockups after "avoid potential soft lockup and unneeded mmu reload" patch?

--
Gleb.

2013-04-21 14:09:39

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On 04/21/2013 09:03 PM, Gleb Natapov wrote:
> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
>> This patchset is based on my previous two patchset:
>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
>> (https://lkml.org/lkml/2013/4/1/2)
>>
>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
>> (https://lkml.org/lkml/2013/4/1/134)
>>
>> Changlog:
>> V3:
>> completely redesign the algorithm, please see below.
>>
> This looks pretty complicated. Is it still needed in order to avoid soft
> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?

Yes.

I discussed this point with Marcelo:

======
BTW, to be honest, I do not think spin_needbreak is a good way - it does
not fix the hot-lock contention, and it just burns more cpu time to avoid
possible soft lock-ups.

Especially, zap-all-shadow-pages lets other vcpus fault and contend for
mmu-lock; while zap-all-shadow-pages releases mmu-lock and waits, the other
vcpus create page tables again. zap-all-shadow-pages needs a long time to
finish; in the worst case, it may never complete under intensive vcpu and
memory usage.

I still think the right way to fix this kind of thing is to optimize
mmu-lock.
======

Which parts scare you? Let's find a way to optimize them. ;) For example,
if you do not like unmap_memslot_rmap_nolock(), we can simplify it - we can
use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
protect the sptes instead of kvm->being_unmapped_rmap.

Thanks!

2013-04-21 23:42:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Sun, Apr 21, 2013 at 12:27:51PM -0300, Marcelo Tosatti wrote:
> On Sun, Apr 21, 2013 at 04:03:46PM +0300, Gleb Natapov wrote:
> > On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> > > This patchset is based on my previous two patchset:
> > > [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> > > (https://lkml.org/lkml/2013/4/1/2)
> > >
> > > [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> > > (https://lkml.org/lkml/2013/4/1/134)
> > >
> > > Changlog:
> > > V3:
> > > completely redesign the algorithm, please see below.
> > >
> > This looks pretty complicated. Is it still needed in order to avoid soft
> > lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>
> Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
> susceptible to:
>
> vcpu 1               | kvm_set_memory
> create shadow page   |
>                      | nuke shadow page
> create shadow page   |
>                      | nuke shadow page
>
> Which is guest-triggerable behavior with the spinlock preemption algorithm.

Not only guest triggerable in the sense of a malicious guest; the condition
above can also be induced by a host workload with a non-malicious guest
system.

Also, kvm_set_memory being relatively fast with huge-memory guests
is nice (which is what Xiao's idea allows).

2013-04-21 23:42:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
> > On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> >> This patchset is based on my previous two patchset:
> >> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> >> (https://lkml.org/lkml/2013/4/1/2)
> >>
> >> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> >> (https://lkml.org/lkml/2013/4/1/134)
> >>
> >> Changlog:
> >> V3:
> >> completely redesign the algorithm, please see below.
> >>
> > This looks pretty complicated. Is it still needed in order to avoid soft
> > lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>
> Yes.
>
> I discussed this point with Marcelo:
>
> ======
> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> not fix the hot-lock contention and it just occupies more cpu time to avoid
> possible soft lock-ups.
>
> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> create page tables again. zap-all-shadow-page need long time to be finished,
> the worst case is, it can not completed forever on intensive vcpu and memory
> usage.
>
> I still think the right way to fix this kind of thing is optimization for
> mmu-lock.
> ======
>
> Which parts scare you? Let's find a way to optimize for it. ;). For example,
> if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
> use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
> protect spte instead of kvm->being_unmaped_rmap.
>
> Thanks!

Xiao,

You can just remove all shadow rmaps now that you've agreed per-memslot
flushes are not necessary. That then gets rid of the need for lockless
rmap accesses. Right?

2013-04-21 23:42:17

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Sun, Apr 21, 2013 at 04:03:46PM +0300, Gleb Natapov wrote:
> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> > This patchset is based on my previous two patchset:
> > [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> > (https://lkml.org/lkml/2013/4/1/2)
> >
> > [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> > (https://lkml.org/lkml/2013/4/1/134)
> >
> > Changlog:
> > V3:
> > completely redesign the algorithm, please see below.
> >
> This looks pretty complicated. Is it still needed in order to avoid soft
> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?

Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
susceptible to:

vcpu 1               | kvm_set_memory
create shadow page   |
                     | nuke shadow page
create shadow page   |
                     | nuke shadow page

Which is guest-triggerable behavior with the spinlock preemption algorithm.

2013-04-22 02:50:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On 04/21/2013 11:24 PM, Marcelo Tosatti wrote:
> On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
>> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
>>> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
>>>> This patchset is based on my previous two patchset:
>>>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
>>>> (https://lkml.org/lkml/2013/4/1/2)
>>>>
>>>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
>>>> (https://lkml.org/lkml/2013/4/1/134)
>>>>
>>>> Changlog:
>>>> V3:
>>>> completely redesign the algorithm, please see below.
>>>>
>>> This looks pretty complicated. Is it still needed in order to avoid soft
>>> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>>
>> Yes.
>>
>> I discussed this point with Marcelo:
>>
>> ======
>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>> possible soft lock-ups.
>>
>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>> create page tables again. zap-all-shadow-page need long time to be finished,
>> the worst case is, it can not completed forever on intensive vcpu and memory
>> usage.
>>
>> I still think the right way to fix this kind of thing is optimization for
>> mmu-lock.
>> ======
>>
>> Which parts scare you? Let's find a way to optimize for it. ;). For example,
>> if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
>> use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
>> protect spte instead of kvm->being_unmaped_rmap.
>>
>> Thanks!
>
> Xiao,
>
> You can just remove all shadow rmaps now that you've agreed per-memslot
> flushes are not necessary. Which then gets rid of necessity for lockless
> rmap accesses. Right?

Hi Marcelo,

I am worried about:

======
We can not release all rmaps. If we did, ->invalidate_page and
->invalidate_range_start could not find any spte mapping the host page,
which means the Accessed/Dirty state of the host page would not be tracked
(kvm_set_pfn_accessed and kvm_set_pfn_dirty would not be called properly).

[https://lkml.org/lkml/2013/4/18/358]
======

Do you think this is an issue? What's your idea?

Thanks!

2013-04-22 09:21:25

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
> > On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> >> This patchset is based on my previous two patchset:
> >> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> >> (https://lkml.org/lkml/2013/4/1/2)
> >>
> >> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> >> (https://lkml.org/lkml/2013/4/1/134)
> >>
> >> Changlog:
> >> V3:
> >> completely redesign the algorithm, please see below.
> >>
> > This looks pretty complicated. Is it still needed in order to avoid soft
> > lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>
> Yes.
>
> I discussed this point with Marcelo:
>
> ======
> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> not fix the hot-lock contention and it just occupies more cpu time to avoid
> possible soft lock-ups.
>
> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> create page tables again. zap-all-shadow-page need long time to be finished,
> the worst case is, it can not completed forever on intensive vcpu and memory
> usage.
>
So what about a mixed approach: use generation numbers and reload roots to
quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
pages with a stale generation number (and uses the lock break technique). It
may traverse active_mmu_pages from tail to head, since new shadow pages
will be added to the head of the list, or it may use the invalid slot's rmap
to find exactly what should be invalidated.
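
A rough model of this mixed approach might look like the sketch below: bump
the generation, then walk the page list from the tail (old pages) toward the
head (new pages), freeing obsolete entries and pretending to break the lock
every few pages. Everything here - the struct names, ZAP_BATCH, the list
handling - is an illustrative user-space stand-in, not the KVM implementation.

#include <stdio.h>
#include <stdlib.h>

struct sp {
	unsigned long mmu_valid_gen;
	struct sp *prev, *next;		/* stand-in for active_mmu_pages */
};

struct mmu {
	unsigned long mmu_valid_gen;
	struct sp *head, *tail;		/* new pages are added at the head */
};

#define ZAP_BATCH 2			/* pretend lock-break interval */

static void add_page(struct mmu *mmu, unsigned long gen)
{
	struct sp *sp = calloc(1, sizeof(*sp));

	sp->mmu_valid_gen = gen;
	sp->next = mmu->head;
	if (mmu->head)
		mmu->head->prev = sp;
	else
		mmu->tail = sp;
	mmu->head = sp;
}

/* walk from the tail: the oldest (obsolete) pages sit there, and pages
 * created after the generation bump are only ever added at the head */
static void zap_all_invalid(struct mmu *mmu)
{
	struct sp *sp = mmu->tail;
	int batch = 0;

	while (sp) {
		struct sp *prev = sp->prev;

		if (sp->mmu_valid_gen == mmu->mmu_valid_gen)
			break;		/* everything further up is current */

		/* unlink and free the obsolete page */
		if (prev)
			prev->next = sp->next;
		else
			mmu->head = sp->next;
		if (sp->next)
			sp->next->prev = prev;
		else
			mmu->tail = prev;
		free(sp);

		if (++batch == ZAP_BATCH) {
			/* real code would break/retake the mmu lock here */
			batch = 0;
		}
		sp = prev;
	}
}

int main(void)
{
	struct mmu mmu = { .mmu_valid_gen = 0 };
	int left = 0;

	for (int i = 0; i < 3; i++)
		add_page(&mmu, 0);		/* old-generation pages */
	mmu.mmu_valid_gen++;			/* the "invalidate all" step */
	add_page(&mmu, mmu.mmu_valid_gen);	/* a fresh page at the head */

	zap_all_invalid(&mmu);
	for (struct sp *sp = mmu.head; sp; sp = sp->next)
		left++;
	printf("pages left after zap: %d\n", left);	/* only the fresh one */
	return 0;
}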

> I still think the right way to fix this kind of thing is optimization for
> mmu-lock.
> ======
>
> Which parts scare you? Let's find a way to optimize for it. ;). For example,
> if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
> use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
> protect spte instead of kvm->being_unmaped_rmap.
>

kvm->being_unmapped_rmap is particularly tricky, although it looks
correct. The additional indirection with rmap ops also does not help in
following the code. I'd rather have if (slot is invalid) in a couple of places
where things should be done differently. In most places it will be
WARN_ON(slot is invalid).

--
Gleb.

2013-04-22 12:40:18

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Sun, Apr 21, 2013 at 12:35:08PM -0300, Marcelo Tosatti wrote:
> On Sun, Apr 21, 2013 at 12:27:51PM -0300, Marcelo Tosatti wrote:
> > On Sun, Apr 21, 2013 at 04:03:46PM +0300, Gleb Natapov wrote:
> > > On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> > > > This patchset is based on my previous two patchset:
> > > > [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> > > > (https://lkml.org/lkml/2013/4/1/2)
> > > >
> > > > [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> > > > (https://lkml.org/lkml/2013/4/1/134)
> > > >
> > > > Changlog:
> > > > V3:
> > > > completely redesign the algorithm, please see below.
> > > >
> > > This looks pretty complicated. Is it still needed in order to avoid soft
> > > lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
> >
> > Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
> > suspectible to:
> >
> > vcpu 1 | kvm_set_memory
> > create shadow page
> > nuke shadow page
> > create shadow page
> > nuke shadow page
> >
> > Which is guest triggerable behavior with spinlock preemption algorithm.
>
> Not only guest triggerable as in the sense of a malicious guest,
> but condition above can be induced by host workload with non-malicious
> guest system.
>
Is the problem that newly created shadow pages are immediately zapped?
Shouldn't the generation number/kvm_mmu_zap_all_invalid() idea described here
https://lkml.org/lkml/2013/4/22/111 solve this?

> Also kvm_set_memory being relatively fast with huge memory guests
> is nice (which is what Xiaos idea allows).
>

--
Gleb.

2013-04-22 13:46:02

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Mon, 22 Apr 2013 15:39:38 +0300
Gleb Natapov <[email protected]> wrote:

> > > Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
> > > suspectible to:
> > >
> > > vcpu 1 | kvm_set_memory
> > > create shadow page
> > > nuke shadow page
> > > create shadow page
> > > nuke shadow page
> > >
> > > Which is guest triggerable behavior with spinlock preemption algorithm.
> >
> > Not only guest triggerable as in the sense of a malicious guest,
> > but condition above can be induced by host workload with non-malicious
> > guest system.
> >
> Is the problem that newly created shadow pages are immediately zapped?
> Shouldn't generation number/kvm_mmu_zap_all_invalid() idea described here
> https://lkml.org/lkml/2013/4/22/111 solve this?

I guess so. That's what Avi described when he tried to achieve
lockless TLB flushes. Mixing that idea with Xiao's approach will
achieve reasonably nice performance, I think.

Various improvements should be added later on top of that if needed.

> > Also kvm_set_memory being relatively fast with huge memory guests
> > is nice (which is what Xiaos idea allows).

I agree with this point. But if so, it should be actually measured on
such guests, even if the algorithm looks promising.

Takuya

2013-04-23 00:19:16

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On 04/22/2013 05:21 PM, Gleb Natapov wrote:
> On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
>> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
>>> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
>>>> This patchset is based on my previous two patchset:
>>>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
>>>> (https://lkml.org/lkml/2013/4/1/2)
>>>>
>>>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
>>>> (https://lkml.org/lkml/2013/4/1/134)
>>>>
>>>> Changlog:
>>>> V3:
>>>> completely redesign the algorithm, please see below.
>>>>
>>> This looks pretty complicated. Is it still needed in order to avoid soft
>>> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>>
>> Yes.
>>
>> I discussed this point with Marcelo:
>>
>> ======
>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>> possible soft lock-ups.
>>
>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>> create page tables again. zap-all-shadow-page need long time to be finished,
>> the worst case is, it can not completed forever on intensive vcpu and memory
>> usage.
>>
> So what about mixed approach: use generation numbers and reload roots to
> quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
> kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
> pages with stale generation number (and uses lock break technique). It
> may traverse active_mmu_pages from tail to head since new shadow pages
> will be added to the head of the list or it may use invalid slot rmap to
> find exactly what should be invalidated.

I prefer unmapping the invalid rmap instead of zapping stale shadow pages
in kvm_mmu_zap_all_invalid(); the former is faster.

This way may help, but it is not ideal: after the mmu is reloaded with the new
generation number, all of the vcpus will fault for a long time, and trying to
hold mmu-lock is still not good even with the lock break technique.

I think we can do this step first, then move the unmapping of the invalid rmap
out of mmu-lock later.

>
>> I still think the right way to fix this kind of thing is optimization for
>> mmu-lock.
>> ======
>>
>> Which parts scare you? Let's find a way to optimize for it. ;). For example,
>> if you do not like unmap_memslot_rmap_nolock(), we can simplify it - We can
>> use walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() to
>> protect spte instead of kvm->being_unmaped_rmap.
>>
>
> kvm->being_unmaped_rmap is particularly tricky, although looks

Okay. Will use walk_shadow_page_lockless_begin() and
walk_shadow_page_lockless_end() instead.

> correct. Additional indirection with rmap ops also does not help following
> the code. I'd rather have if(slot is invalid) in a couple of places where
> things should be done differently. In most places it will be WARN_ON(slot
> is invalid).

Less change, good to me, will do. ;)

Thanks!

2013-04-23 02:04:26

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Mon, Apr 22, 2013 at 10:45:53PM +0900, Takuya Yoshikawa wrote:
> On Mon, 22 Apr 2013 15:39:38 +0300
> Gleb Natapov <[email protected]> wrote:
>
> > > > Do not want kvm_set_memory (cases: DELETE/MOVE/CREATES) to be
> > > > suspectible to:
> > > >
> > > > vcpu 1 | kvm_set_memory
> > > > create shadow page
> > > > nuke shadow page
> > > > create shadow page
> > > > nuke shadow page
> > > >
> > > > Which is guest triggerable behavior with spinlock preemption algorithm.
> > >
> > > Not only guest triggerable as in the sense of a malicious guest,
> > > but condition above can be induced by host workload with non-malicious
> > > guest system.
> > >
> > Is the problem that newly created shadow pages are immediately zapped?
> > Shouldn't generation number/kvm_mmu_zap_all_invalid() idea described here
> > https://lkml.org/lkml/2013/4/22/111 solve this?
>
> I guess so. That's what Avi described when he tried to achieve
> lockless TLB flushes. Mixing that idea with Xiao's approach will
> achieve reasonably nice performance, I think.

Yes.

> Various improvements should be added later on top of that if needed.
>
> > > Also kvm_set_memory being relatively fast with huge memory guests
> > > is nice (which is what Xiaos idea allows).
>
> I agree with this point. But if so, it should be actually measured on
> such guests, even if the algorithm looks promising.

Works for me.

2013-04-23 06:28:21

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Tue, Apr 23, 2013 at 08:19:02AM +0800, Xiao Guangrong wrote:
> On 04/22/2013 05:21 PM, Gleb Natapov wrote:
> > On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
> >> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
> >>> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> >>>> This patchset is based on my previous two patchset:
> >>>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> >>>> (https://lkml.org/lkml/2013/4/1/2)
> >>>>
> >>>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> >>>> (https://lkml.org/lkml/2013/4/1/134)
> >>>>
> >>>> Changlog:
> >>>> V3:
> >>>> completely redesign the algorithm, please see below.
> >>>>
> >>> This looks pretty complicated. Is it still needed in order to avoid soft
> >>> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
> >>
> >> Yes.
> >>
> >> I discussed this point with Marcelo:
> >>
> >> ======
> >> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >> possible soft lock-ups.
> >>
> >> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >> create page tables again. zap-all-shadow-page need long time to be finished,
> >> the worst case is, it can not completed forever on intensive vcpu and memory
> >> usage.
> >>
> > So what about mixed approach: use generation numbers and reload roots to
> > quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
> > kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
> > pages with stale generation number (and uses lock break technique). It
> > may traverse active_mmu_pages from tail to head since new shadow pages
> > will be added to the head of the list or it may use invalid slot rmap to
> > find exactly what should be invalidated.
>
> I prefer to unmapping the invalid rmap instead of zapping stale shadow pages
> in kvm_mmu_zap_all_invalid(), the former is faster.
>
Not sure what you mean here. What is "unmapping the invalid rmap"?

> This way may help but not good, after reload mmu with the new generation number,
> all of the vcpu will fault in a long time, try to hold mmu-lock is not good even
> if use lock break technique.
If kvm_mmu_zap_all_invalid(slot) will only zap shadow pages that are
reachable from the slot's rmap, as opposed to zapping all invalid
shadow pages, it will have much less work to do. The slots that we
add/remove during hot plug are usually small. To guarantee reasonable
forward progress we can break the lock only after a certain number of
shadow pages have been invalidated. All other invalid shadow pages will be
zapped in make_mmu_pages_available(), so the zapping will be spread across
page faults.
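
The slot-scoped variant can be sketched as follows: walk only the deleted
slot's rmap, drop the sptes it reaches, and (pretend to) break the lock after
a fixed batch, leaving anything not reached to be reclaimed lazily later, for
example from make_mmu_pages_available(). The names and structures are
illustrative stand-ins, not the real KVM ones.

#include <stdio.h>

#define SLOT_PAGES 8
#define ZAP_BATCH  4			/* pretend lock-break interval */

/* per-gfn spte counts standing in for the deleted slot's rmap array */
struct slot_rmap {
	int nr_sptes[SLOT_PAGES];
};

static int zap_slot_rmap(struct slot_rmap *rmap)
{
	int zapped = 0;

	for (int gfn = 0; gfn < SLOT_PAGES; gfn++) {
		while (rmap->nr_sptes[gfn] > 0) {
			rmap->nr_sptes[gfn]--;		/* drop one spte */
			if (++zapped % ZAP_BATCH == 0) {
				/* real code would drop and retake the mmu
				 * lock here; breaking only every ZAP_BATCH
				 * sptes keeps forward progress guaranteed */
			}
		}
	}
	return zapped;
}

int main(void)
{
	struct slot_rmap rmap = { .nr_sptes = { 2, 0, 3, 1, 0, 0, 4, 1 } };

	printf("zapped %d sptes for the deleted slot\n", zap_slot_rmap(&rmap));
	return 0;
}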

>
> I think we can do this step first, then unmapping invalid rmap out of mmu-lock
> later.
>

--
Gleb.

2013-04-23 07:20:59

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On 04/23/2013 02:28 PM, Gleb Natapov wrote:
> On Tue, Apr 23, 2013 at 08:19:02AM +0800, Xiao Guangrong wrote:
>> On 04/22/2013 05:21 PM, Gleb Natapov wrote:
>>> On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
>>>> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
>>>>> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
>>>>>> This patchset is based on my previous two patchset:
>>>>>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
>>>>>> (https://lkml.org/lkml/2013/4/1/2)
>>>>>>
>>>>>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
>>>>>> (https://lkml.org/lkml/2013/4/1/134)
>>>>>>
>>>>>> Changlog:
>>>>>> V3:
>>>>>> completely redesign the algorithm, please see below.
>>>>>>
>>>>> This looks pretty complicated. Is it still needed in order to avoid soft
>>>>> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
>>>>
>>>> Yes.
>>>>
>>>> I discussed this point with Marcelo:
>>>>
>>>> ======
>>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
>>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
>>>> possible soft lock-ups.
>>>>
>>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
>>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
>>>> create page tables again. zap-all-shadow-page need long time to be finished,
>>>> the worst case is, it can not completed forever on intensive vcpu and memory
>>>> usage.
>>>>
>>> So what about mixed approach: use generation numbers and reload roots to
>>> quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
>>> kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
>>> pages with stale generation number (and uses lock break technique). It
>>> may traverse active_mmu_pages from tail to head since new shadow pages
>>> will be added to the head of the list or it may use invalid slot rmap to
>>> find exactly what should be invalidated.
>>
>> I prefer to unmapping the invalid rmap instead of zapping stale shadow pages
>> in kvm_mmu_zap_all_invalid(), the former is faster.
>>
> Not sure what do you mean here. What is "unmapping the invalid rmap"?

It is like you said below:
======
kvm_mmu_zap_all_invalid(slot) will only zap shadow pages that are
reachable from the slot's rmap
======
My suggestion is zapping the sptes that are linked in the slot's rmap.

>
>> This way may help but not good, after reload mmu with the new generation number,
>> all of the vcpu will fault in a long time, try to hold mmu-lock is not good even
>> if use lock break technique.
> If kvm_mmu_zap_all_invalid(slot) will only zap shadow pages that are
> reachable from the slot's rmap, as opposite to zapping all invalid
> shadow pages, it will have much less work to do. The slots that we
> add/remove during hot plug are usually small. To guaranty reasonable
> forward progress we can break the lock only after certain amount of
> shadow pages are invalidated. All other invalid shadow pages will be
> zapped in make_mmu_pages_available() and zapping will be spread between
> page faults.

Not interested in memory hot-remove?

BTW, could you please review my previous patchsets and apply them if they
look ok? ;)

[PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
(https://lkml.org/lkml/2013/4/1/2)

[PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
(https://lkml.org/lkml/2013/4/1/134)

Thanks!

2013-04-23 07:33:23

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] KVM: MMU: fast zap all shadow pages

On Tue, Apr 23, 2013 at 03:20:28PM +0800, Xiao Guangrong wrote:
> On 04/23/2013 02:28 PM, Gleb Natapov wrote:
> > On Tue, Apr 23, 2013 at 08:19:02AM +0800, Xiao Guangrong wrote:
> >> On 04/22/2013 05:21 PM, Gleb Natapov wrote:
> >>> On Sun, Apr 21, 2013 at 10:09:29PM +0800, Xiao Guangrong wrote:
> >>>> On 04/21/2013 09:03 PM, Gleb Natapov wrote:
> >>>>> On Tue, Apr 16, 2013 at 02:32:38PM +0800, Xiao Guangrong wrote:
> >>>>>> This patchset is based on my previous two patchset:
> >>>>>> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> >>>>>> (https://lkml.org/lkml/2013/4/1/2)
> >>>>>>
> >>>>>> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> >>>>>> (https://lkml.org/lkml/2013/4/1/134)
> >>>>>>
> >>>>>> Changlog:
> >>>>>> V3:
> >>>>>> completely redesign the algorithm, please see below.
> >>>>>>
> >>>>> This looks pretty complicated. Is it still needed in order to avoid soft
> >>>>> lockups after "avoid potential soft lockup and unneeded mmu reload" patch?
> >>>>
> >>>> Yes.
> >>>>
> >>>> I discussed this point with Marcelo:
> >>>>
> >>>> ======
> >>>> BTW, to my honest, i do not think spin_needbreak is a good way - it does
> >>>> not fix the hot-lock contention and it just occupies more cpu time to avoid
> >>>> possible soft lock-ups.
> >>>>
> >>>> Especially, zap-all-shadow-pages can let other vcpus fault and vcpus contest
> >>>> mmu-lock, then zap-all-shadow-pages release mmu-lock and wait, other vcpus
> >>>> create page tables again. zap-all-shadow-page need long time to be finished,
> >>>> the worst case is, it can not completed forever on intensive vcpu and memory
> >>>> usage.
> >>>>
> >>> So what about mixed approach: use generation numbers and reload roots to
> >>> quickly invalidate all shadow pages and then do kvm_mmu_zap_all_invalid().
> >>> kvm_mmu_zap_all_invalid() is a new function that invalidates only shadow
> >>> pages with stale generation number (and uses lock break technique). It
> >>> may traverse active_mmu_pages from tail to head since new shadow pages
> >>> will be added to the head of the list or it may use invalid slot rmap to
> >>> find exactly what should be invalidated.
> >>
> >> I prefer to unmapping the invalid rmap instead of zapping stale shadow pages
> >> in kvm_mmu_zap_all_invalid(), the former is faster.
> >>
> > Not sure what do you mean here. What is "unmapping the invalid rmap"?
>
> it is like you said below:
> ======
> kvm_mmu_zap_all_invalid(slot) will only zap shadow pages that are
> reachable from the slot's rmap
> ======
> My suggestion is zapping the spte that are linked in the slot's rmap.
>
OK, so we are on the same page.

> >
> >> This way may help but not good, after reload mmu with the new generation number,
> >> all of the vcpu will fault in a long time, try to hold mmu-lock is not good even
> >> if use lock break technique.
> > If kvm_mmu_zap_all_invalid(slot) will only zap shadow pages that are
> > reachable from the slot's rmap, as opposite to zapping all invalid
> > shadow pages, it will have much less work to do. The slots that we
> > add/remove during hot plug are usually small. To guaranty reasonable
> > forward progress we can break the lock only after certain amount of
> > shadow pages are invalidated. All other invalid shadow pages will be
> > zapped in make_mmu_pages_available() and zapping will be spread between
> > page faults.
>
> No interested in hot-remove memory?
>
I am, good point. Still, I think that with guaranteed forward progress the
slot removal time should be bounded by something reasonable. At least we
should have evidence to the contrary before optimizing for it. Hot
memory removal is not instantaneous from the guest's point of view either;
the guest needs to move memory around to make it possible.

> BTW, could you please review my previous patchsets and apply them if its
> looks ok? ;)
>
I need Marcelo's acks on them too :)

> [PATCH 0/2] KVM: x86: avoid potential soft lockup and unneeded mmu reload
> (https://lkml.org/lkml/2013/4/1/2)
>
But you yourself said that with this patch slot removal may never
complete under a high memory usage workload.

> [PATCH v2 0/6] KVM: MMU: fast invalid all mmio sptes
> (https://lkml.org/lkml/2013/4/1/134)
>
Missed this one. Will review.

--
Gleb.