Changelog v2:
- the changes from Gleb's review:
1) fix counting the number of sptes in pte_list_add()
2) set iter->desc to NULL when a nulls desc is met, to clean up the code of
rmap_get_next()
3) fix hlist corruption due to accessing sp->hash_link out of mmu-lock
4) use rcu functions to access the rcu protected pointer
5) a spte would be missed by the lockless walker if it was moved within a desc
(removing a spte from a rmap that uses only one desc). Fix it by walking
the desc bottom-up
- the changes from Paolo's review
1) make the ordering and memory barriers between updating the spte / adding
the spte into the rmap and setting the dirty log clearer
- the changes from Marcelo's review:
1) let fast page fault only fix the sptes on the last level (level = 1)
2) improve some changelogs and comments
- the changes from Takuya's review:
move the patch "flush tlb if the spte can be locklessly modified" forward
so that it can be merged more easily
Thank all of you very much for your time and patience on this patchset!
Since we use rcu_assign_pointer() to update the pointers in the desc even if
dirty logging is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 36G memory
- migrate-perf (benchmark the time of get-dirty-log)
before: Run 10 times, Avg time: 9009483 ns.
after: Run 10 times, Avg time: 4807343 ns.
- kernbench
Guest: 12 VCPUs + 8G memory
before:
EPT is enabled:
# cat 09-05-origin-ept | grep real
real 85.58
real 83.47
real 82.95
EPT is disabled:
# cat 09-05-origin-shadow | grep real
real 138.77
real 138.99
real 139.55
after:
EPT is enabled:
# cat 09-05-lockless-ept | grep real
real 83.40
real 82.81
real 83.39
EPT is disabled:
# cat 09-05-lockless-shadow | grep real
real 138.91
real 139.71
real 138.94
No performance regression!
Background
==========
Currently, when we mark a memslot as dirty-logged or get the dirty pages, we
need to write-protect a large amount of guest memory. That is heavy work,
especially because we need to hold the mmu-lock, which is also required by the
vcpu to fix its page table faults and by the mmu-notifier when a host page is
being changed. For a guest with extreme cpu / memory usage, this becomes a
scalability issue.
This patchset introduces a way to locklessly write-protect guest memory.
Idea
==========
These are the challenges we met and the ideas used to resolve them.
1) How to locklessly walk rmap?
The first idea we had to prevent a "desc" from being freed while we are walking
the rmap was to use RCU. But when the vcpu runs in shadow page mode or nested
mmu mode, it updates the rmap very frequently.
So we use SLAB_DESTROY_BY_RCU to manage the "desc" instead, which allows the
object to be reused more quickly. We also store a "nulls" value in the last
"desc" (desc->more), which helps us detect whether the "desc" has been moved to
another rmap, so we can re-walk the rmap if that happened. I learned this idea
from the nulls-list.
Another issue is that when a spte is deleted from a "desc", another spte in the
last "desc" is moved into its position to replace the deleted one. If the
deleted one has already been visited and we do not visit the replacement, the
replacement is missed by the lockless walk.
To fix this case, we do not backward-move the spte; instead, we forward-move
the entry: when a spte is deleted, we move an entry from the first desc into
that position. A sketch of the resulting lockless walk follows.
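The sketch below is a simplified version of the pte_list_walk_lockless()
added later in this series (it omits the extra barrier() used there); the
*_sketch name is only for illustration, and desc_is_a_nulls() /
desc_get_nulls_value() are the helpers introduced by the nulls-desc patch:

/* Illustration only; see pte_list_walk_lockless() in the patches below. */
static void pte_list_walk_lockless_sketch(unsigned long *pte_list,
					  pte_list_walk_fn fn)
{
	struct pte_list_desc *desc;
	unsigned long value;
	int i;

restart:
	value = *rcu_dereference(pte_list);
	if (!value)
		return;

	if (!(value & 1))	/* only a single spte, no desc is used */
		return fn((u64 *)value);

	desc = (struct pte_list_desc *)(value & ~1ul);
	while (!desc_is_a_nulls(desc)) {
		/* scan from the highest index, matching the forward-move rule */
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (desc->sptes[i])
				fn(desc->sptes[i]);

		desc = rcu_dereference(desc->more);
		/* the desc is being initialized, restart the walk */
		if (unlikely(!desc))
			goto restart;
	}

	/* the nulls shows the desc was moved to another rmap: re-walk */
	if (unlikely(desc_get_nulls_value(desc) != pte_list))
		goto restart;
}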
2) How to locklessly access shadow page table?
It is easy if the handler is in the vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in the
ioctl context, and the paths we are optimizing have a heavy workload, so
disabling interrupts is not good for system performance.
Instead, we add an indicator into the kvm struct (kvm->arch.rcu_free_shadow_page)
and use call_rcu() to free the shadow page if that indicator is set. Setting
and clearing the indicator is protected by the slots-lock, so it need not be
atomic and does not hurt performance or scalability. A usage sketch follows.
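Below is a usage sketch of this pattern; kvm_mmu_rcu_free_page_begin()/_end()
and rmap_write_protect_lockless() are introduced by the patches in this series,
the *_sketch name is only for illustration, and the caller is assumed to hold
kvm->slots_lock:

static void write_protect_rmap_lockless_sketch(struct kvm *kvm,
					       unsigned long *rmapp)
{
	/* rcu_read_lock() + set kvm->arch.rcu_free_shadow_page */
	kvm_mmu_rcu_free_page_begin(kvm);

	/*
	 * Shadow pages touched here can not be released under us:
	 * kvm_mmu_commit_zap_page() sees the indicator and frees them
	 * via call_rcu() instead.
	 */
	rmap_write_protect_lockless(rmapp);

	/* clear the indicator + rcu_read_unlock() */
	kvm_mmu_rcu_free_page_end(kvm);
}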
3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit on the spte, and the other is dropping the spte when
it points to a large page. The former is easy since we only need to atomically
clear a bit, but the latter is hard since we need to remove the spte from the
rmap. So we unify these two behaviors and only make the spte readonly. Making
a large spte readonly instead of nonpresent is also good for reducing jitter.
We also need to pay attention to the order of making the spte writable, adding
the spte into the rmap and setting the corresponding bit in the dirty bitmap:
since kvm_vm_ioctl_get_dirty_log() write-protects the sptes based on the dirty
bitmap, we should ensure that the writable spte can be found in the rmap before
the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page. A writer-side sketch of this ordering follows.
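The writer-side order is sketched below; it mirrors the set_spte() change in
this series (the *_sketch name is only for illustration), while the reader
side, kvm_vm_ioctl_get_dirty_log(), relies on the implicit full barrier of
xchg() when it grabs the dirty bitmap:

/* Sketch of the required order; the real code is in set_spte() below. */
static void set_spte_ordering_sketch(struct kvm_vcpu *vcpu, u64 *sptep,
				     u64 spte, gfn_t gfn)
{
	mmu_spte_update(sptep, spte);	/* 1) make the spte writable */
	rmap_add(vcpu, sptep, gfn);	/* 2) make it visible in the rmap */

	/* 1) and 2) must be visible before the dirty bit is set */
	smp_wmb();

	mark_page_dirty(vcpu->kvm, gfn);	/* 3) set the dirty bitmap */
}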
Performance result
====================
The performance result and the benchmark can be found at:
http://permalink.gmane.org/gmane.linux.kernel/1534876
Xiao Guangrong (15):
KVM: MMU: fix the count of spte number
KVM: MMU: properly check last spte in fast_page_fault()
KVM: MMU: lazily drop large spte
KVM: MMU: flush tlb if the spte can be locklessly modified
KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
KVM: MMU: update spte and add it into rmap before dirty log
KVM: MMU: redesign the algorithm of pte_list
KVM: MMU: introduce nulls desc
KVM: MMU: introduce pte-list lockless walker
KVM: MMU: initialize the pointers in pte_list_desc properly
KVM: MMU: reintroduce kvm_mmu_isolate_page()
KVM: MMU: allow locklessly access shadow page table out of vcpu thread
KVM: MMU: locklessly write-protect the page
KVM: MMU: clean up spte_write_protect
KVM: MMU: use rcu functions to access the pointer
arch/x86/include/asm/kvm_host.h | 10 +-
arch/x86/kvm/mmu.c | 566 ++++++++++++++++++++++++++++++----------
arch/x86/kvm/mmu.h | 28 ++
arch/x86/kvm/x86.c | 34 ++-
4 files changed, 491 insertions(+), 147 deletions(-)
--
1.8.1.4
If the desc is the last one and it is full, its sptes are not counted
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6e2d2c8..7714fd8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -948,6 +948,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
count += PTE_LIST_EXT;
}
if (desc->sptes[PTE_LIST_EXT-1]) {
+ count += PTE_LIST_EXT;
desc->more = mmu_alloc_pte_list_desc(vcpu);
desc = desc->more;
}
--
1.8.1.4
It is easy if the handler is in the vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in the
ioctl context, and the paths we are optimizing have a heavy workload, so
disabling interrupts is not good for system performance.
Instead, we add an indicator into the kvm struct (kvm->arch.rcu_free_shadow_page)
and use call_rcu() to free the shadow page if that indicator is set. Setting
and clearing the indicator is protected by the slots-lock, so it need not be
atomic and does not hurt performance or scalability.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +++++-
arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 22 ++++++++++++++++++++++
3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..8e4ca0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,7 +226,10 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
unsigned long mmu_valid_gen;
- DECLARE_BITMAP(unsync_child_bitmap, 512);
+ union {
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
+ struct rcu_head rcu;
+ };
#ifdef CONFIG_X86_32
/*
@@ -554,6 +557,7 @@ struct kvm_arch {
*/
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+ bool rcu_free_shadow_page;
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2bf450a..f551fc7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
return ret;
}
+static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
+{
+ struct kvm_mmu_page *sp;
+
+ list_for_each_entry(sp, invalid_list, link)
+ kvm_mmu_isolate_page(sp);
+}
+
+static void free_pages_rcu(struct rcu_head *head)
+{
+ struct kvm_mmu_page *next, *sp;
+
+ sp = container_of(head, struct kvm_mmu_page, rcu);
+ while (sp) {
+ if (!list_empty(&sp->link))
+ next = list_first_entry(&sp->link,
+ struct kvm_mmu_page, link);
+ else
+ next = NULL;
+ kvm_mmu_free_page(sp);
+ sp = next;
+ }
+}
+
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
{
@@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
*/
kvm_flush_remote_tlbs(kvm);
+ if (kvm->arch.rcu_free_shadow_page) {
+ kvm_mmu_isolate_pages(invalid_list);
+ sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+ list_del_init(invalid_list);
+ call_rcu(&sp->rcu, free_pages_rcu);
+ return;
+ }
+
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
kvm_mmu_isolate_page(sp);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 77e044a..61217f3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
}
void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+/*
+ * The caller should ensure that these two functions should be
+ * serially called.
+ */
+static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
+{
+ rcu_read_lock();
+
+ kvm->arch.rcu_free_shadow_page = true;
+ /* Set the indicator before access shadow page. */
+ smp_mb();
+}
+
+static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
+{
+ /* Make sure that access shadow page has finished. */
+ smp_mb();
+ kvm->arch.rcu_free_shadow_page = false;
+
+ rcu_read_unlock();
+}
#endif
--
1.8.1.4
Use rcu_assign_pointer() to update all the pointers in the desc
and use rcu_dereference() to locklessly read the pointers.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3f17a0..808c2d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -927,12 +927,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
}
+#define rcu_assign_pte_list(pte_list_p, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value) \
+ rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
unsigned long marker;
marker = (unsigned long)pte_list | 1UL;
- desc->more = (struct pte_list_desc *)marker;
+ rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
}
static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -989,10 +1000,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
}
-#define rcu_assign_pte_list(pte_list_p, value) \
- rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
- (unsigned long *)(value))
-
/*
* Pte mapping structures:
*
@@ -1019,8 +1026,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
- desc->sptes[0] = (u64 *)*pte_list;
- desc->sptes[1] = spte;
+ rcu_assign_spte(desc->sptes[0], *pte_list);
+ rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1033,13 +1040,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
- new_desc->more = desc;
+ rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
free_pos = find_first_free(desc);
- desc->sptes[free_pos] = spte;
+ rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
}
@@ -1057,8 +1064,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* Move the entry from the first desc to this position we want
* to remove.
*/
- desc->sptes[i] = first_desc->sptes[last_used];
- first_desc->sptes[last_used] = NULL;
+ rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+ rcu_assign_spte(first_desc->sptes[last_used], NULL);
/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1070,7 +1077,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));
mmu_free_pte_list_desc(first_desc);
- *pte_list = (unsigned long)next_desc | 1ul;
+ rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}
@@ -1079,8 +1086,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* then the desc can be freed.
*/
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
- *pte_list = (unsigned long)first_desc->sptes[0];
- first_desc->sptes[0] = NULL;
+ rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+ rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
}
@@ -1102,7 +1109,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
pr_err("pte_list_remove: %p 1->BUG\n", spte);
BUG();
}
- *pte_list = 0;
+ rcu_assign_pte_list(pte_list, 0);
return;
}
@@ -1174,9 +1181,12 @@ restart:
* used in the rmap when a spte is removed. Otherwise the
* moved entry will be missed.
*/
- for (i = PTE_LIST_EXT - 1; i >= 0; i--)
- if (desc->sptes[i])
- fn(desc->sptes[i]);
+ for (i = PTE_LIST_EXT - 1; i >= 0; i--) {
+ u64 *sptep = rcu_dereference(desc->sptes[i]);
+
+ if (sptep)
+ fn(sptep);
+ }
desc = rcu_dereference(desc->more);
--
1.8.1.4
Now, the only user of spte_write_protect() is rmap_write_protect(), which
always calls spte_write_protect() with pt_protect = true, so drop
pt_protect and the unused parameter @kvm.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 44b7822..f3f17a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1330,8 +1330,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
}
/*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1342,25 +1341,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
*
* Return true if tlb need be flushed.
*/
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
{
u64 spte = *sptep;
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
- if (pt_protect)
- spte &= ~SPTE_MMU_WRITEABLE;
- spte = spte & ~PT_WRITABLE_MASK;
+ spte &= ~SPTE_MMU_WRITEABLE;
+ spte &= ~PT_WRITABLE_MASK;
return mmu_spte_update(sptep, spte);
}
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
- bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1369,7 +1366,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- flush |= spte_write_protect(kvm, sptep, pt_protect);
+ flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}
@@ -1438,7 +1435,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmapp, true);
+ write_protected |= __rmap_write_protect(rmapp);
}
return write_protected;
--
1.8.1.4
Currently, when we mark a memslot as dirty-logged or get the dirty pages, we
need to write-protect a large amount of guest memory. That is heavy work,
especially because we need to hold the mmu-lock, which is also required by the
vcpu to fix its page table faults and by the mmu-notifier when a host page is
being changed. For a guest with extreme cpu / memory usage, this becomes a
scalability issue.
This patch introduces a way to locklessly write-protect guest memory.
Now that lockless rmap walk, lockless shadow page table access and lockless
spte write-protection are ready, it is time to implement page
write-protection out of the mmu-lock.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ----
arch/x86/kvm/mmu.c | 53 +++++++++++++++++++++++++++++------------
arch/x86/kvm/mmu.h | 6 +++++
arch/x86/kvm/x86.c | 11 ++++-----
4 files changed, 49 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e4ca0d..00b44b1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -789,10 +789,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask);
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f551fc7..44b7822 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1376,8 +1376,31 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
return flush;
}
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+ u64 spte;
+ int level = page_header(__pa(sptep))->role.level;
+
+retry:
+ spte = mmu_spte_get_lockless(sptep);
+ if (unlikely(!is_last_spte(spte, level) || !is_writable_pte(spte)))
+ return;
+
+ if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+ return;
+
+ goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+ pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
* @kvm: kvm instance
* @slot: slot to protect
* @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1386,16 +1409,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
* Used when we do not need to care about huge page mappings: e.g. during dirty
* logging we do not have any such mappings.
*/
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
{
unsigned long *rmapp;
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_write_protect_lockless(rmapp);
/* clear the first set bit */
mask &= mask - 1;
@@ -4547,7 +4571,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
return init_kvm_mmu(vcpu);
}
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
{
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4556,8 +4580,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
- spin_lock(&kvm->mmu_lock);
-
+ kvm_mmu_rcu_free_page_begin(kvm);
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4567,15 +4590,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
for (index = 0; index <= last_index; ++index, ++rmapp) {
- if (*rmapp)
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_write_protect_lockless(rmapp);
- if (need_resched() || spin_needbreak(&kvm->mmu_lock))
- cond_resched_lock(&kvm->mmu_lock);
+ if (need_resched()) {
+ kvm_mmu_rcu_free_page_end(kvm);
+ kvm_mmu_rcu_free_page_begin(kvm);
+ }
}
}
-
- spin_unlock(&kvm->mmu_lock);
+ kvm_mmu_rcu_free_page_end(kvm);
/*
* We can flush all the TLBs out of the mmu lock without TLB
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 61217f3..2ac649e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -139,4 +139,10 @@ static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
rcu_read_unlock();
}
+
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5582b66..33d483a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3543,8 +3543,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
memset(dirty_bitmap_buffer, 0, n);
- spin_lock(&kvm->mmu_lock);
-
+ kvm_mmu_rcu_free_page_begin(kvm);
for (i = 0; i < n / sizeof(long); i++) {
unsigned long mask;
gfn_t offset;
@@ -3568,10 +3567,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
dirty_bitmap_buffer[i] = mask;
offset = i * BITS_PER_LONG;
- kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+ kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
+ offset, mask);
}
-
- spin_unlock(&kvm->mmu_lock);
+ kvm_mmu_rcu_free_page_end(kvm);
/*
* All the TLBs can be flushed out of mmu lock, see the comments in
@@ -7231,7 +7230,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* See the comments in fast_page_fault().
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
- kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+ kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
}
void kvm_arch_flush_shadow_all(struct kvm *kvm)
--
1.8.1.4
It was removed by commit 834be0d83. Now we need it to do lockless shadow
page walking protected by rcu, so reintroduce it.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fe80019..2bf450a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1675,14 +1675,30 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
}
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+/*
+ * Remove the sp from shadow page cache, after call it,
+ * we can not find this sp from the cache, and the shadow
+ * page table is still valid.
+ *
+ * It should be under the protection of mmu lock.
+ */
+static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
{
ASSERT(is_empty_shadow_page(sp->spt));
+
hlist_del(&sp->hash_link);
- list_del(&sp->link);
- free_page((unsigned long)sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
+}
+
+/*
+ * Free the shadow page table and the sp, we can do it
+ * out of the protection of mmu lock.
+ */
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+{
+ list_del(&sp->link);
+ free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
}
@@ -2361,6 +2377,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
+ kvm_mmu_isolate_page(sp);
kvm_mmu_free_page(sp);
}
}
--
1.8.1.4
Relax the tlb flush condition since we will write-protect the spte out of the
mmu lock. Note that lockless write-protection only marks a writable spte
readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which are tested by spte_is_locklessly_modifiable()).
This patch is used to avoid this kind of race:
VCPU 0                          VCPU 1
lockless write protection:
set spte.w = 0
                                lock mmu-lock
                                write-protect the spte to sync the shadow page,
                                see spte.w = 0, so do not flush the tlb
                                unlock mmu-lock
                                !!! At this point, the shadow page can still be
                                writable due to the stale tlb entry
Flush all TLBs
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 88107ee..7488229 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* we always atomicly update it, see the comments in
* spte_has_volatile_bits().
*/
- if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+ if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;
if (!shadow_accessed_mask)
--
1.8.1.4
Since pte_list_desc will be locklessly accessed, we need to atomically
initialize its pointers so that the lockless walker can not see a partial
value of a pointer.
In this patch we initialize the pointers with plain pointer assignment, which
is always atomic, instead of using kmem_cache_zalloc().
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e1432f..fe80019 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
{
void *obj;
if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
- obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+ obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
- pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+ pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+ GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
- mmu_page_header_cache, 4);
+ mmu_page_header_cache, 4,
+ GFP_KERNEL | __GFP_ZERO);
out:
return r;
}
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}
+static void pte_list_desc_ctor(void *p)
+{
+ struct pte_list_desc *desc = p;
+ int i;
+
+ for (i = 0; i < PTE_LIST_EXT; i++)
+ desc->sptes[i] = NULL;
+
+ desc->more = NULL;
+}
+
static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
*/
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+ first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
}
@@ -4699,8 +4714,8 @@ static void mmu_destroy_caches(void)
int kvm_mmu_module_init(void)
{
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
- sizeof(struct pte_list_desc),
- 0, SLAB_DESTROY_BY_RCU, NULL);
+ sizeof(struct pte_list_desc),
+ 0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;
--
1.8.1.4
The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list.
Note, we should do a bottom-up walk in the desc since we always move
the bottom entry to the deleted position. A desc only has 3 entries
in the current code, so it is not a problem now, but the issue would
be triggered if we expand the size of the desc in future development.
Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c5f1b27..3e1432f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
}
+#define rcu_assign_pte_list(pte_list_p, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
+ (unsigned long *)(value))
+
/*
* Pte mapping structures:
*
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
- *pte_list = (unsigned long)spte;
+ rcu_assign_pte_list(pte_list, spte);
return 0;
}
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
- *pte_list = (unsigned long)desc | 1;
+ rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
- *pte_list = (unsigned long)desc | 1;
+ rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
}
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+ pte_list_walk_fn fn)
+{
+ struct pte_list_desc *desc;
+ unsigned long pte_list_value;
+ int i;
+
+restart:
+ /*
+ * Force the pte_list to be reloaded.
+ *
+ * See the comments in hlist_nulls_for_each_entry_rcu().
+ */
+ barrier();
+ pte_list_value = *rcu_dereference(pte_list);
+ if (!pte_list_value)
+ return;
+
+ if (!(pte_list_value & 1))
+ return fn((u64 *)pte_list_value);
+
+ desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+ while (!desc_is_a_nulls(desc)) {
+ /*
+ * We should do top-down walk since we always use the higher
+ * indices to replace the deleted entry if only one desc is
+ * used in the rmap when a spte is removed. Otherwise the
+ * moved entry will be missed.
+ */
+ for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+ if (desc->sptes[i])
+ fn(desc->sptes[i]);
+
+ desc = rcu_dereference(desc->more);
+
+ /* It is being initialized. */
+ if (unlikely(!desc))
+ goto restart;
+ }
+
+ if (unlikely(desc_get_nulls_value(desc) != pte_list))
+ goto restart;
+}
+
static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
{
@@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
{
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
- 0, 0, NULL);
+ 0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;
--
1.8.1.4
Use sp->role.level instead of @level since @level is not obtained from the
page table hierarchy.
There is no issue in the current code since fast page fault currently only
fixes the faults caused by dirty-log, which are always on the last level
(level = 1).
This patch makes the code more readable and avoids a potential issue in
future development.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7714fd8..869f1db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code)
}
static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep, u64 spte)
{
- struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
WARN_ON(!sp->role.direct);
@@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
u32 error_code)
{
struct kvm_shadow_walk_iterator iterator;
+ struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;
@@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;
}
- if (!is_last_spte(spte, level))
+ sp = page_header(__pa(iterator.sptep));
+ if (!is_last_spte(spte, sp->role.level))
goto exit;
/*
@@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
*/
- ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+ ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
spte, ret);
--
1.8.1.4
This is like a nulls list: we use the pte-list as the nulls value, which helps
us detect whether the "desc" has been moved to another rmap so that we can
re-walk the rmap if that happened.
kvm->slots_lock is held when we do the lockless walking, which prevents a rmap
from being reused (freeing a rmap needs to hold that lock), so we can not see
the same nulls value used on different rmaps.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 08fb4e2..c5f1b27 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+ unsigned long marker;
+
+ marker = (unsigned long)pte_list | 1UL;
+ desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+ return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+ return (unsigned long *)((unsigned long)desc & ~1);
+}
+
static int __find_first_free(struct pte_list_desc *desc)
{
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
first_free = __find_first_free(desc);
- for (desc_num = 0; desc->more; desc = desc->more)
+ for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;
return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+ desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
* Only one entry existing but still use a desc to store it?
*/
- WARN_ON(!next_desc);
+ WARN_ON(desc_is_a_nulls(next_desc));
mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* Only one entry in this desc, move the entry to the head
* then the desc can be freed.
*/
- if (!first_desc->sptes[1] && !first_desc->more) {
+ if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
rmap_printk("pte_list_remove: %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
return fn((u64 *)*pte_list);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+ WARN_ON(desc_get_nulls_value(desc) != pte_list);
}
static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+ WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
}
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}
- iter->desc = iter->desc->more;
+ iter->desc = desc_is_a_nulls(iter->desc->more) ?
+ NULL : iter->desc->more;
if (iter->desc) {
iter->pos = 0;
--
1.8.1.4
Change the algorithm to:
1) always add the new desc as the first desc (pointed to by parent_ptes/rmap),
which is good for implementing rcu-nulls-list-like lockless rmap
walking
2) always move an entry from the first desc into the position we want
to remove when deleting a spte from the parent_ptes/rmap (forward-move).
This is good for implementing the lockless rmap walk since, in the current
code, when a spte is deleted from a "desc", another spte in the last
"desc" is moved to this position to replace the deleted one. If the
deleted one has been visited and we do not visit the replacement, the
replacement is missed when we do the lockless walk.
To fix this case, we do not backward-move the spte; instead, we forward-move
the entry: when a spte is deleted, we move an entry from the first
desc into that position.
Both of these also reduce cache misses.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 180 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 123 insertions(+), 57 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8ea54d9..08fb4e2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}
+static int __find_first_free(struct pte_list_desc *desc)
+{
+ int i;
+
+ for (i = 0; i < PTE_LIST_EXT; i++)
+ if (!desc->sptes[i])
+ break;
+ return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+ int free = __find_first_free(desc);
+
+ WARN_ON(free >= PTE_LIST_EXT);
+ return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+ int used = __find_first_free(desc) - 1;
+
+ WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+ return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+ int first_free, desc_num;
+
+ first_free = __find_first_free(desc);
+
+ for (desc_num = 0; desc->more; desc = desc->more)
+ desc_num++;
+
+ return first_free + desc_num * PTE_LIST_EXT;
+}
+
/*
* Pte mapping structures:
*
@@ -923,99 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
*
* Returns the number of pte entries before the spte was added or zero if
* the spte was not added.
- *
*/
static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
{
struct pte_list_desc *desc;
- int i, count = 0;
+ int free_pos;
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
- } else if (!(*pte_list & 1)) {
+ return 0;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
- ++count;
- } else {
- rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
- desc = desc->more;
- count += PTE_LIST_EXT;
- }
- if (desc->sptes[PTE_LIST_EXT-1]) {
- count += PTE_LIST_EXT;
- desc->more = mmu_alloc_pte_list_desc(vcpu);
- desc = desc->more;
- }
- for (i = 0; desc->sptes[i]; ++i)
- ++count;
- desc->sptes[i] = spte;
+ return 1;
}
- return count;
+
+ rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+ /* No empty entry in the desc. */
+ if (desc->sptes[PTE_LIST_EXT - 1]) {
+ struct pte_list_desc *new_desc;
+ new_desc = mmu_alloc_pte_list_desc(vcpu);
+ new_desc->more = desc;
+ desc = new_desc;
+ *pte_list = (unsigned long)desc | 1;
+ }
+
+ free_pos = find_first_free(desc);
+ desc->sptes[free_pos] = spte;
+ return count_spte_number(desc) - 1;
}
static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
- int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+ struct pte_list_desc *desc, int i)
{
- int j;
+ struct pte_list_desc *first_desc;
+ int last_used;
- for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
- ;
- desc->sptes[i] = desc->sptes[j];
- desc->sptes[j] = NULL;
- if (j != 0)
+ first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ last_used = find_last_used(first_desc);
+
+ /*
+ * Move the entry from the first desc to this position we want
+ * to remove.
+ */
+ desc->sptes[i] = first_desc->sptes[last_used];
+ first_desc->sptes[last_used] = NULL;
+
+ /* No valid entry in this desc, we can free this desc now. */
+ if (!first_desc->sptes[0]) {
+ struct pte_list_desc *next_desc = first_desc->more;
+
+ /*
+ * Only one entry existing but still use a desc to store it?
+ */
+ WARN_ON(!next_desc);
+
+ mmu_free_pte_list_desc(first_desc);
+ *pte_list = (unsigned long)next_desc | 1ul;
return;
- if (!prev_desc && !desc->more)
- *pte_list = (unsigned long)desc->sptes[0];
- else
- if (prev_desc)
- prev_desc->more = desc->more;
- else
- *pte_list = (unsigned long)desc->more | 1;
- mmu_free_pte_list_desc(desc);
+ }
+
+ /*
+ * Only one entry in this desc, move the entry to the head
+ * then the desc can be freed.
+ */
+ if (!first_desc->sptes[1] && !first_desc->more) {
+ *pte_list = (unsigned long)first_desc->sptes[0];
+ mmu_free_pte_list_desc(first_desc);
+ }
}
static void pte_list_remove(u64 *spte, unsigned long *pte_list)
{
struct pte_list_desc *desc;
- struct pte_list_desc *prev_desc;
int i;
if (!*pte_list) {
- printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
+ pr_err("pte_list_remove: %p 0->BUG\n", spte);
BUG();
- } else if (!(*pte_list & 1)) {
+ return;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_remove: %p 1->0\n", spte);
if ((u64 *)*pte_list != spte) {
- printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
+ pr_err("pte_list_remove: %p 1->BUG\n", spte);
BUG();
}
*pte_list = 0;
- } else {
- rmap_printk("pte_list_remove: %p many->many\n", spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- prev_desc = NULL;
- while (desc) {
- for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
- if (desc->sptes[i] == spte) {
- pte_list_desc_remove_entry(pte_list,
- desc, i,
- prev_desc);
- return;
- }
- prev_desc = desc;
- desc = desc->more;
- }
- pr_err("pte_list_remove: %p many->many\n", spte);
- BUG();
+ return;
}
+
+ rmap_printk("pte_list_remove: %p many->many\n", spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ while (desc) {
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ if (desc->sptes[i] == spte) {
+ pte_list_desc_remove_entry(pte_list,
+ desc, i);
+ return;
+ }
+ desc = desc->more;
+ }
+
+ pr_err("pte_list_remove: %p many->many\n", spte);
+ BUG();
}
typedef void (*pte_list_walk_fn) (u64 *spte);
--
1.8.1.4
kvm_vm_ioctl_get_dirty_log() write-protects the sptes based on its dirty
bitmap, so we should ensure that the writable spte can be found in the rmap
before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but
fail to write-protect the page, which is detailed in the comments of this patch.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 84 ++++++++++++++++++++++++++++++++++++++++++------------
arch/x86/kvm/x86.c | 10 +++++++
2 files changed, 76 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a983570..8ea54d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2428,6 +2428,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
{
u64 spte;
int ret = 0;
+ bool remap = is_rmap_spte(*sptep);
if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2489,12 +2490,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
- if (pte_access & ACC_WRITE_MASK)
- mark_page_dirty(vcpu->kvm, gfn);
-
set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+ if (!remap) {
+ if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+ rmap_recycle(vcpu, sptep, gfn);
+
+ if (level > PT_PAGE_TABLE_LEVEL)
+ ++vcpu->kvm->stat.lpages;
+ }
+
+ /*
+ * The orders we require are:
+ * 1) set spte to writable __before__ set the dirty bitmap.
+ * It makes sure that dirty-logging is not missed when do
+ * live migration at the final step where kvm should stop
+ * the guest and push the remaining dirty pages got from
+ * dirty-bitmap to the destination. The similar cases are
+ * in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+ *
+ * 2) add the spte into rmap __before__ set the dirty bitmap.
+ *
+ * They can ensure we can find the writable spte on the rmap
+ * when we do lockless write-protection since
+ * kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+ * on its dirty-bitmap, otherwise these cases will happen:
+ *
+ * CPU 0 CPU 1
+ * kvm ioctl doing get-dirty-pages
+ * mark_page_dirty(gfn) which
+ * set the gfn on the dirty maps
+ * mask = xchg(dirty_bitmap, 0)
+ *
+ * try to write-protect gfns which
+ * are set on "mask" then walk then
+ * rmap, see no spte on that rmap
+ * add the spte into rmap
+ *
+ * !!!!!! Then the page can be freely wrote but not recorded in
+ * the dirty bitmap.
+ *
+ * And:
+ *
+ * VCPU 0 CPU 1
+ * kvm ioctl doing get-dirty-pages
+ * mark_page_dirty(gfn) which
+ * set the gfn on the dirty maps
+ *
+ * add spte into rmap
+ * mask = xchg(dirty_bitmap, 0)
+ *
+ * try to write-protect gfns which
+ * are set on "mask" then walk then
+ * rmap, see spte is on the ramp
+ * but it is readonly or nonpresent
+ * Mark spte writable
+ *
+ * !!!!!! Then the page can be freely wrote but not recorded in the
+ * dirty bitmap.
+ *
+ * See the comments in kvm_vm_ioctl_get_dirty_log().
+ */
+ smp_wmb();
+
+ if (pte_access & ACC_WRITE_MASK)
+ mark_page_dirty(vcpu->kvm, gfn);
done:
return ret;
}
@@ -2504,9 +2566,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
int level, gfn_t gfn, pfn_t pfn, bool speculative,
bool host_writable)
{
- int was_rmapped = 0;
- int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);
@@ -2528,8 +2587,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
- } else
- was_rmapped = 1;
+ }
}
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2547,16 +2605,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
is_large_pte(*sptep)? "2MB" : "4kB",
*sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
*sptep, sptep);
- if (!was_rmapped && is_large_pte(*sptep))
- ++vcpu->kvm->stat.lpages;
-
- if (is_shadow_present_pte(*sptep)) {
- if (!was_rmapped) {
- rmap_count = rmap_add(vcpu, sptep, gfn);
- if (rmap_count > RMAP_RECYCLE_THRESHOLD)
- rmap_recycle(vcpu, sptep, gfn);
- }
- }
kvm_release_pfn_clean(pfn);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 72f1487..5582b66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3555,6 +3555,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
is_dirty = true;
mask = xchg(&dirty_bitmap[i], 0);
+ /*
+ * smp_rmb();
+ *
+ * The read-barrier is implied in xchg() acting as a
+ * full barrier and it ensures getting dirty bitmap
+ * is before walking the rmap and spte write-protection.
+ *
+ * See the comments in set_spte().
+ */
+
dirty_bitmap_buffer[i] = mask;
offset = i * BITS_PER_LONG;
--
1.8.1.4
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
just change the spte from writable to readonly, so we only need to care about
the case of changing the spte from present to present (changing the spte from
present to nonpresent flushes all the TLBs immediately); in other words,
the only case we need to care about is mmu_spte_update()
- in mmu_spte_update(), we check
SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
means it does not depend on PT_WRITABLE_MASK anymore
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 18 ++++++++++++++----
arch/x86/kvm/x86.c | 9 +++++++--
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7488229..a983570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
- kvm_flush_remote_tlbs(kvm);
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
- }
}
}
- kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+ /*
+ * We can flush all the TLBs out of the mmu lock without TLB
+ * corruption since we just change the spte from writable to
+ * readonly so that we only need to care the case of changing
+ * spte from present to present (changing the spte from present
+ * to nonpresent will flush all the TLBs immediately), in other
+ * words, the only case we care is mmu_spte_update() where we
+ * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+ * instead of PT_WRITABLE_MASK, that means it does not depend
+ * on PT_WRITABLE_MASK anymore.
+ */
+ kvm_flush_remote_tlbs(kvm);
}
#define BATCH_ZAP_PAGES 10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ad0c07..72f1487 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3560,11 +3560,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
- if (is_dirty)
- kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+ /*
+ * All the TLBs can be flushed out of mmu lock, see the comments in
+ * kvm_mmu_slot_remove_write_access().
+ */
+ if (is_dirty)
+ kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
--
1.8.1.4
Currently, kvm zaps the large spte if write-protection is needed, so a later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making it non-present, so the page fault caused by a read access
can be avoided.
The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.
This version fixes the issue reported in 6b73a9606. The reason for that
issue is that fast_page_fault() directly sets a readonly large spte to
writable but only dirties the first page in the dirty-bitmap, which means
other pages are missed. Fix it by allowing only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) to be fast fixed.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
arch/x86/kvm/x86.c | 8 ++++++--
2 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 869f1db..88107ee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
/*
* Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
* - for spte protection, the spte can be writable only after unsync-ing
* shadow page.
*
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
*/
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
u64 spte = *sptep;
@@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
- if (__drop_large_spte(kvm, sptep)) {
- *flush |= true;
- return true;
- }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
- *flush |= mmu_spte_update(sptep, spte);
- return false;
+ return mmu_spte_update(sptep, spte);
}
static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1222,11 +1214,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
- sptep = rmap_get_first(*rmapp, &iter);
- continue;
- }
+ flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}
@@ -2675,6 +2664,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
break;
}
+ drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
@@ -2876,6 +2867,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;
/*
+ * Do not fix write-permission on the large spte since we only dirty
+ * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+ * that means other pages are missed if its slot is dirty-logged.
+ *
+ * Instead, we let the slow page fault path create a normal spte to
+ * fix the access.
+ *
+ * See the comments in kvm_arch_commit_memory_region().
+ */
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ goto exit;
+
+ /*
* Currently, fast page fault only works for direct mapping since
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..6ad0c07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7208,8 +7208,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
* Write protect all pages for dirty logging.
- * Existing largepage mappings are destroyed here and new ones will
- * not be created until the end of the logging.
+ *
+ * All the sptes including the large sptes which point to this
+ * slot are set to readonly. We can not create any new large
+ * spte on this slot until the end of the logging.
+ *
+ * See the comments in fast_page_fault().
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
kvm_mmu_slot_remove_write_access(kvm, mem->slot);
--
1.8.1.4
On Sep 5, 2013, at 6:29 PM, Xiao Guangrong <[email protected]> wrote:
>
> A desc only has 3 entries
> in the current code so it is not a problem now, but the issue will
> be triggered if we expend the size of desc in the further development
Sorry, this description is obviously wrong; the bug exists even with 3 entries
in the desc. I do not know what I was thinking when I wrote this down...
On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
> If the desc is the last one and it is full, its sptes is not counted
>
Hmm, if the desc is not full but it is not the last one, all sptes after that
desc are not counted either.
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 6e2d2c8..7714fd8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -948,6 +948,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> count += PTE_LIST_EXT;
> }
> if (desc->sptes[PTE_LIST_EXT-1]) {
> + count += PTE_LIST_EXT;
> desc->more = mmu_alloc_pte_list_desc(vcpu);
> desc = desc->more;
> }
> --
> 1.8.1.4
--
Gleb.
On Sep 8, 2013, at 8:19 PM, Gleb Natapov <[email protected]> wrote:
> On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
>> If the desc is the last one and it is full, its sptes is not counted
>>
> Hmm, if desc is not full but it is not the last one all sptes after the
> desc are not counted too.
But the desc must be the last one if it's not full, since we always add
new entries to, or delete entries from, the last desc.
On Sun, Sep 08, 2013 at 09:55:04PM +0800, Xiao Guangrong wrote:
>
> On Sep 8, 2013, at 8:19 PM, Gleb Natapov <[email protected]> wrote:
>
> > On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
> >> If the desc is the last one and it is full, its sptes is not counted
> >>
> > Hmm, if desc is not full but it is not the last one all sptes after the
> > desc are not counted too.
>
> But the desc must be the last one if it's not full since we always add
> new entry or delete entry from the last desc.
>
Why do we always delete entries from the last desc? We delete them from the
desc we found them in. The current code does not try to move entries between
descs, only inside a desc.
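For reference, the in-desc move described here is the one done by
pte_list_desc_remove_entry(); roughly (a sketch from memory of the code of
that era, not a verbatim quote):

	/*
	 * Removing desc->sptes[i]: find the last used slot j in this same
	 * desc and move it into slot i; entries never migrate to or from
	 * another desc.
	 */
	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
		;
	desc->sptes[i] = desc->sptes[j];
	desc->sptes[j] = NULL;
	/* The desc itself is unlinked and freed only once it becomes empty. */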
--
Gleb.
On Sep 8, 2013, at 10:01 PM, Gleb Natapov <[email protected]> wrote:
> On Sun, Sep 08, 2013 at 09:55:04PM +0800, Xiao Guangrong wrote:
>>
>> On Sep 8, 2013, at 8:19 PM, Gleb Natapov <[email protected]> wrote:
>>
>>> On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
>>>> If the desc is the last one and it is full, its sptes is not counted
>>>>
>>> Hmm, if desc is not full but it is not the last one all sptes after the
>>> desc are not counted too.
>>
>> But the desc must be the last one if it's not full since we always add
>> new entry or delete entry from the last desc.
>>
> Why do we alway delete entries from last desc? We delete them from the
> desc we found them in. Current code does not try to move entries between
> descs, only inside a desc.
Oh, yes. Sorry, my memory is wrong... :(
So, currently there are some gaps in the descs and they waste memory. They can
not be fixed with a simple change, and I think it is not worth fixing them
separately since after my new algorithm these should all be fixed, so how about
just dropping this fix?
On Sun, Sep 08, 2013 at 10:24:05PM +0800, Xiao Guangrong wrote:
>
> On Sep 8, 2013, at 10:01 PM, Gleb Natapov <[email protected]> wrote:
>
> > On Sun, Sep 08, 2013 at 09:55:04PM +0800, Xiao Guangrong wrote:
> >>
> >> On Sep 8, 2013, at 8:19 PM, Gleb Natapov <[email protected]> wrote:
> >>
> >>> On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
> >>>> If the desc is the last one and it is full, its sptes is not counted
> >>>>
> >>> Hmm, if desc is not full but it is not the last one all sptes after the
> >>> desc are not counted too.
> >>
> >> But the desc must be the last one if it's not full since we always add
> >> new entry or delete entry from the last desc.
> >>
> > Why do we alway delete entries from last desc? We delete them from the
> > desc we found them in. Current code does not try to move entries between
> > descs, only inside a desc.
>
> Oh, yes. Sorry, my memory is wrong? :(
>
> So, currently there has some gaps in desc and it wastes memory. Can not fix
> them with simple change and i think it is not worthy to fix them separately since
> after my new algorithm, these should all be fixed? so how about just drop this
> fix?
>
Yes, if we are going to change the algorithm anyway, there is no need to spend
time fixing what we have now.
--
Gleb.
Marcelo, do you feel your comments are addressed in patches 3 and 5 of
this series?
On Thu, Sep 05, 2013 at 06:29:03PM +0800, Xiao Guangrong wrote:
> Changelog v2:
>
> - the changes from Gleb's review:
> 1) fix calculating the number of spte in the pte_list_add()
> 2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
> rmap_get_next()
> 3) fix hlist corruption due to accessing sp->hlish out of mmu-lock
> 4) use rcu functions to access the rcu protected pointer
> 5) spte will be missed in lockless walker if the spte is moved in a desc
> (remove a spte from the rmap using only one desc). Fix it by bottom-up
> walking the desc
>
> - the changes from Paolo's review
> 1) make the order and memory barriers between update spte / add spte into
> rmap and dirty-log more clear
>
> - the changes from Marcelo's review:
> 1) let fast page fault only fix the spts on the last level (level = 1)
> 2) improve some changelogs and comments
>
> - the changes from Takuya's review:
> move the patch "flush tlb if the spte can be locklessly modified" forward
> to make it's more easily merged
>
> Thank all of you very much for your time and patience on this patchset!
>
> Since we use rcu_assign_pointer() to update the points in desc even if dirty
> log is disabled, i have measured the performance:
> Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 36G memory
>
> - migrate-perf (benchmark the time of get-dirty-log)
> before: Run 10 times, Avg time:9009483 ns.
> after: Run 10 times, Avg time:4807343 ns.
>
> - kerbench
> Guest: 12 VCPUs + 8G memory
> before:
> EPT is enabled:
> # cat 09-05-origin-ept | grep real
> real 85.58
> real 83.47
> real 82.95
>
> EPT is disabled:
> # cat 09-05-origin-shadow | grep real
> real 138.77
> real 138.99
> real 139.55
>
> after:
> EPT is enabled:
> # cat 09-05-lockless-ept | grep real
> real 83.40
> real 82.81
> real 83.39
>
> EPT is disabled:
> # cat 09-05-lockless-shadow | grep real
> real 138.91
> real 139.71
> real 138.94
>
> No performance regression!
>
>
>
> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
>
> This patchset introduces a way to locklessly write-protect guest memory.
>
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
>
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
>
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
>
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
>
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
>
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
>
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
>
> Performance result
> ====================
> The performance result and the benchmark can be found at:
> http://permalink.gmane.org/gmane.linux.kernel/1534876
>
> Xiao Guangrong (15):
> KVM: MMU: fix the count of spte number
> KVM: MMU: properly check last spte in fast_page_fault()
> KVM: MMU: lazily drop large spte
> KVM: MMU: flush tlb if the spte can be locklessly modified
> KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
> KVM: MMU: update spte and add it into rmap before dirty log
> KVM: MMU: redesign the algorithm of pte_list
> KVM: MMU: introduce nulls desc
> KVM: MMU: introduce pte-list lockless walker
> KVM: MMU: initialize the pointers in pte_list_desc properly
> KVM: MMU: reintroduce kvm_mmu_isolate_page()
> KVM: MMU: allow locklessly access shadow page table out of vcpu thread
> KVM: MMU: locklessly write-protect the page
> KVM: MMU: clean up spte_write_protect
> KVM: MMU: use rcu functions to access the pointer
>
> arch/x86/include/asm/kvm_host.h | 10 +-
> arch/x86/kvm/mmu.c | 566 ++++++++++++++++++++++++++++++----------
> arch/x86/kvm/mmu.h | 28 ++
> arch/x86/kvm/x86.c | 34 ++-
> 4 files changed, 491 insertions(+), 147 deletions(-)
>
> --
> 1.8.1.4
--
Gleb.
On Thu, Sep 05, 2013 at 06:29:12PM +0800, Xiao Guangrong wrote:
> The basic idea is from nulls list which uses a nulls to indicate
> whether the desc is moved to different pte-list
>
> Note, we should do bottom-up walk in the desc since we always move
> the bottom entry to the deleted position. A desc only has 3 entries
> in the current code so it is not a problem now, but the issue will
> be triggered if we expend the size of desc in the further development
>
> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c5f1b27..3e1432f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
> return first_free + desc_num * PTE_LIST_EXT;
> }
>
> +#define rcu_assign_pte_list(pte_list_p, value) \
> + rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
> + (unsigned long *)(value))
> +
> /*
> * Pte mapping structures:
> *
> @@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>
> if (!*pte_list) {
> rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
> - *pte_list = (unsigned long)spte;
> + rcu_assign_pte_list(pte_list, spte);
> return 0;
> }
>
> @@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> desc->sptes[0] = (u64 *)*pte_list;
> desc->sptes[1] = spte;
> desc_mark_nulls(pte_list, desc);
> - *pte_list = (unsigned long)desc | 1;
> + rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
> return 1;
> }
>
> @@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> new_desc = mmu_alloc_pte_list_desc(vcpu);
> new_desc->more = desc;
> desc = new_desc;
> - *pte_list = (unsigned long)desc | 1;
> + rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
> }
>
> free_pos = find_first_free(desc);
> @@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> WARN_ON(desc_get_nulls_value(desc) != pte_list);
> }
>
> +/* The caller should hold rcu lock. */
> +static void pte_list_walk_lockless(unsigned long *pte_list,
> + pte_list_walk_fn fn)
> +{
> + struct pte_list_desc *desc;
> + unsigned long pte_list_value;
> + int i;
> +
> +restart:
> + /*
> + * Force the pte_list to be reloaded.
> + *
> + * See the comments in hlist_nulls_for_each_entry_rcu().
> + */
> + barrier();
> + pte_list_value = *rcu_dereference(pte_list);
> + if (!pte_list_value)
> + return;
> +
> + if (!(pte_list_value & 1))
> + return fn((u64 *)pte_list_value);
> +
> + desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
> + while (!desc_is_a_nulls(desc)) {
> + /*
> + * We should do top-down walk since we always use the higher
> + * indices to replace the deleted entry if only one desc is
> + * used in the rmap when a spte is removed. Otherwise the
> + * moved entry will be missed.
> + */
> + for (i = PTE_LIST_EXT - 1; i >= 0; i--)
> + if (desc->sptes[i])
> + fn(desc->sptes[i]);
> +
> + desc = rcu_dereference(desc->more);
> +
> + /* It is being initialized. */
> + if (unlikely(!desc))
> + goto restart;
> + }
> +
> + if (unlikely(desc_get_nulls_value(desc) != pte_list))
> + goto restart;
> +}
> +
> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> struct kvm_memory_slot *slot)
> {
> @@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
> {
> pte_list_desc_cache = kmem_cache_create("pte_list_desc",
> sizeof(struct pte_list_desc),
> - 0, 0, NULL);
> + 0, SLAB_DESTROY_BY_RCU, NULL);
Haven't we agreed that a constructor is needed for the cache?
> if (!pte_list_desc_cache)
> goto nomem;
>
> --
> 1.8.1.4
--
Gleb.
Hi Gleb,
On 09/16/2013 08:42 PM, Gleb Natapov wrote:
>> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>> struct kvm_memory_slot *slot)
>> {
>> @@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
>> {
>> pte_list_desc_cache = kmem_cache_create("pte_list_desc",
>> sizeof(struct pte_list_desc),
>> - 0, 0, NULL);
>> + 0, SLAB_DESTROY_BY_RCU, NULL);
> Haven't we agreed that constructor is needed for the cache?
Yes. I've made it as a separate patch:
[PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly
On Mon, Sep 16, 2013 at 09:52:26PM +0800, Xiao Guangrong wrote:
> Hi Gleb,
>
> On 09/16/2013 08:42 PM, Gleb Natapov wrote:
>
> >> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> >> struct kvm_memory_slot *slot)
> >> {
> >> @@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
> >> {
> >> pte_list_desc_cache = kmem_cache_create("pte_list_desc",
> >> sizeof(struct pte_list_desc),
> >> - 0, 0, NULL);
> >> + 0, SLAB_DESTROY_BY_RCU, NULL);
> > Haven't we agreed that constructor is needed for the cache?
>
> Yes. I've made it as a separate patch:
> [PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly
>
Ah, I haven't got there yet :)
--
Gleb.
On Thu, Sep 05, 2013 at 06:29:05PM +0800, Xiao Guangrong wrote:
> Using sp->role.level instead of @level since @level is not got from the
> page table hierarchy
>
> There is no issue in current code since the fast page fault currently only
> fixes the fault caused by dirty-log that is always on the last level
> (level = 1)
>
> This patch makes the code more readable and avoids potential issue in the
> further development
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7714fd8..869f1db 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code)
> }
>
> static bool
> -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
> +fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> + u64 *sptep, u64 spte)
> {
> - struct kvm_mmu_page *sp = page_header(__pa(sptep));
> gfn_t gfn;
>
> WARN_ON(!sp->role.direct);
> @@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> u32 error_code)
> {
> struct kvm_shadow_walk_iterator iterator;
> + struct kvm_mmu_page *sp;
> bool ret = false;
> u64 spte = 0ull;
>
> @@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> goto exit;
> }
>
> - if (!is_last_spte(spte, level))
> + sp = page_header(__pa(iterator.sptep));
> + if (!is_last_spte(spte, sp->role.level))
> goto exit;
>
> /*
> @@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> * the gfn is not stable for indirect shadow page.
> * See Documentation/virtual/kvm/locking.txt to get more detail.
> */
> - ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
> + ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
> exit:
> trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
> spte, ret);
> --
> 1.8.1.4
Unrelated to this patch:
If vcpu->mode = OUTSIDE_GUEST_MODE, no IPI is sent
by kvm_flush_remote_tlbs.
So how is this supposed to work again?
	/*
	 * Wait for all vcpus to exit guest mode and/or lockless shadow
	 * page table walks.
	 */
	kvm_flush_remote_tlbs(kvm);
Patch looks fine.
On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
>
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter. This removes the need for the return value.
>
> This version has fixed the issue reported in 6b73a9606, the reason of that
> issue is that fast_page_fault() directly sets the readonly large spte to
> writable but only dirty the first page into the dirty-bitmap that means
> other pages are missed. Fixed it by only the normal sptes (on the
> PT_PAGE_TABLE_LEVEL level) can be fast fixed
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
> arch/x86/kvm/x86.c | 8 ++++++--
> 2 files changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 869f1db..88107ee 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>
> /*
> * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
> *
> * Note: write protection is difference between drity logging and spte
> * protection:
> @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> * - for spte protection, the spte can be writable only after unsync-ing
> * shadow page.
> *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
> */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
> {
> u64 spte = *sptep;
>
> @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>
> rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
> if (pt_protect)
> spte &= ~SPTE_MMU_WRITEABLE;
> spte = spte & ~PT_WRITABLE_MASK;
>
> - *flush |= mmu_spte_update(sptep, spte);
> - return false;
> + return mmu_spte_update(sptep, spte);
> }
Is it necessary for kvm_mmu_unprotect_page to search an entire large page
range now, instead of a 4k page?
On Thu, Sep 05, 2013 at 06:29:08PM +0800, Xiao Guangrong wrote:
> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
> write-proect the sptes, it is because:
> - we have marked large sptes readonly instead of dropping them that means we
> just change the spte from writable to readonly so that we only need to care
> the case of changing spte from present to present (changing the spte from
> present to nonpresent will flush all the TLBs immediately), in other words,
> the only case we need to care is mmu_spte_update()
>
> - in mmu_spte_update(), we have checked
> SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
> means it does not depend on PT_WRITABLE_MASK anymore
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 18 ++++++++++++++----
> arch/x86/kvm/x86.c | 9 +++++++--
> 2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7488229..a983570 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> if (*rmapp)
> __rmap_write_protect(kvm, rmapp, false);
>
> - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> - kvm_flush_remote_tlbs(kvm);
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> cond_resched_lock(&kvm->mmu_lock);
> - }
> }
> }
>
> - kvm_flush_remote_tlbs(kvm);
> spin_unlock(&kvm->mmu_lock);
> +
> + /*
> + * We can flush all the TLBs out of the mmu lock without TLB
> + * corruption since we just change the spte from writable to
> + * readonly so that we only need to care the case of changing
> + * spte from present to present (changing the spte from present
> + * to nonpresent will flush all the TLBs immediately), in other
> + * words, the only case we care is mmu_spte_update() where we
> + * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
> + * instead of PT_WRITABLE_MASK, that means it does not depend
> + * on PT_WRITABLE_MASK anymore.
> + */
> + kvm_flush_remote_tlbs(kvm);
> }
What about need_remote_flush?
On Oct 1, 2013, at 5:23 AM, Marcelo Tosatti <[email protected]> wrote:
>>
>
> Unrelated to this patch:
>
> If vcpu->mode = OUTSIDE_GUEST_MODE, no IPI is sent
> by kvm_flush_remote_tlbs.
Yes.
>
> So how is this supposed to work again?
>
> /*
> * Wait for all vcpus to exit guest mode and/or lockless shadow
> * page table walks.
> */
On the lockless walking path, we change vcpu->mode to
READING_SHADOW_PAGE_TABLES, so that the IPI is sent and the flusher waits
for the walker to finish. Or did I miss your question?
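For reference, the vcpu-context helpers that make this work look roughly like
the following (a sketch of the mmu.c code of that era, not quoted from this
patchset):

static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
	/*
	 * Make any page-table freer wait for us: kvm_flush_remote_tlbs()
	 * IPIs every vcpu that is not in OUTSIDE_GUEST_MODE and waits for
	 * the ack, which cannot arrive while our irqs are off.
	 */
	local_irq_disable();
	vcpu->mode = READING_SHADOW_PAGE_TABLES;
	/* Keep spte reads from being reordered before the mode change. */
	smp_mb();
}

static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
	/* Keep the mode change from being reordered before the spte reads. */
	smp_mb();
	vcpu->mode = OUTSIDE_GUEST_MODE;
	local_irq_enable();
}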
> kvm_flush_remote_tlbs(kvm);
>
> Patch looks fine.
Thank you, Marcelo!
On Oct 1, 2013, at 6:39 AM, Marcelo Tosatti <[email protected]> wrote:
> On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>>
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter. This removes the need for the return value.
>>
>> This version has fixed the issue reported in 6b73a9606, the reason of that
>> issue is that fast_page_fault() directly sets the readonly large spte to
>> writable but only dirty the first page into the dirty-bitmap that means
>> other pages are missed. Fixed it by only the normal sptes (on the
>> PT_PAGE_TABLE_LEVEL level) can be fast fixed
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
>> arch/x86/kvm/x86.c | 8 ++++++--
>> 2 files changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 869f1db..88107ee 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>>
>> /*
>> * Write-protect on the specified @sptep, @pt_protect indicates whether
>> - * spte writ-protection is caused by protecting shadow page table.
>> - * @flush indicates whether tlb need be flushed.
>> + * spte write-protection is caused by protecting shadow page table.
>> *
>> * Note: write protection is difference between drity logging and spte
>> * protection:
>> @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>> * - for spte protection, the spte can be writable only after unsync-ing
>> * shadow page.
>> *
>> - * Return true if the spte is dropped.
>> + * Return true if tlb need be flushed.
>> */
>> -static bool
>> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>> {
>> u64 spte = *sptep;
>>
>> @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
>>
>> rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>>
>> - if (__drop_large_spte(kvm, sptep)) {
>> - *flush |= true;
>> - return true;
>> - }
>> -
>> if (pt_protect)
>> spte &= ~SPTE_MMU_WRITEABLE;
>> spte = spte & ~PT_WRITABLE_MASK;
>>
>> - *flush |= mmu_spte_update(sptep, spte);
>> - return false;
>> + return mmu_spte_update(sptep, spte);
>> }
>
> Is it necessary for kvm_mmu_unprotect_page to search for an entire range large
> page range now, instead of a 4k page?
It is unnecessary. kvm_mmu_unprotect_page is used to delete the gfn's shadow pages,
then the vcpu will try to re-fault. If any gfn in the large range has a shadow page,
it will stop using the large mapping, so the mapping will be split into small
mappings when the vcpu re-faults.
On Oct 1, 2013, at 7:05 AM, Marcelo Tosatti <[email protected]> wrote:
> On Thu, Sep 05, 2013 at 06:29:08PM +0800, Xiao Guangrong wrote:
>> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
>> write-proect the sptes, it is because:
>> - we have marked large sptes readonly instead of dropping them that means we
>> just change the spte from writable to readonly so that we only need to care
>> the case of changing spte from present to present (changing the spte from
>> present to nonpresent will flush all the TLBs immediately), in other words,
>> the only case we need to care is mmu_spte_update()
>>
>> - in mmu_spte_update(), we have checked
>> SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
>> means it does not depend on PT_WRITABLE_MASK anymore
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 18 ++++++++++++++----
>> arch/x86/kvm/x86.c | 9 +++++++--
>> 2 files changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 7488229..a983570 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>> if (*rmapp)
>> __rmap_write_protect(kvm, rmapp, false);
>>
>> - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> - kvm_flush_remote_tlbs(kvm);
>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> cond_resched_lock(&kvm->mmu_lock);
>> - }
>> }
>> }
>>
>> - kvm_flush_remote_tlbs(kvm);
>> spin_unlock(&kvm->mmu_lock);
>> +
>> + /*
>> + * We can flush all the TLBs out of the mmu lock without TLB
>> + * corruption since we just change the spte from writable to
>> + * readonly so that we only need to care the case of changing
>> + * spte from present to present (changing the spte from present
>> + * to nonpresent will flush all the TLBs immediately), in other
>> + * words, the only case we care is mmu_spte_update() where we
>> + * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
>> + * instead of PT_WRITABLE_MASK, that means it does not depend
>> + * on PT_WRITABLE_MASK anymore.
>> + */
>> + kvm_flush_remote_tlbs(kvm);
>> }
>
> What about need_remote_flush?
It is safe because, before need_remote_flush is called, mmu_pte_write_new_pte is called to
update the spte, which finally calls set_spte() where the tlb flush has been properly checked.
On Thu, Sep 05, 2013 at 06:29:15PM +0800, Xiao Guangrong wrote:
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page being freed. But we are on the ioctl
> context and the paths we are optimizing for have heavy workload, disabling
> interrupt is not good for the system performance
>
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then
> use call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 +++++-
> arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++++++++++++
> arch/x86/kvm/mmu.h | 22 ++++++++++++++++++++++
> 3 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74..8e4ca0d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -226,7 +226,10 @@ struct kvm_mmu_page {
> /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
> unsigned long mmu_valid_gen;
>
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct rcu_head rcu;
> + };
>
> #ifdef CONFIG_X86_32
> /*
> @@ -554,6 +557,7 @@ struct kvm_arch {
> */
> struct list_head active_mmu_pages;
> struct list_head zapped_obsolete_pages;
> + bool rcu_free_shadow_page;
>
> struct list_head assigned_dev_head;
> struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2bf450a..f551fc7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> return ret;
> }
>
> +static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
> +{
> + struct kvm_mmu_page *sp;
> +
> + list_for_each_entry(sp, invalid_list, link)
> + kvm_mmu_isolate_page(sp);
> +}
> +
> +static void free_pages_rcu(struct rcu_head *head)
> +{
> + struct kvm_mmu_page *next, *sp;
> +
> + sp = container_of(head, struct kvm_mmu_page, rcu);
> + while (sp) {
> + if (!list_empty(&sp->link))
> + next = list_first_entry(&sp->link,
> + struct kvm_mmu_page, link);
> + else
> + next = NULL;
> + kvm_mmu_free_page(sp);
> + sp = next;
> + }
> +}
> +
> static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> struct list_head *invalid_list)
> {
> @@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> */
> kvm_flush_remote_tlbs(kvm);
>
> + if (kvm->arch.rcu_free_shadow_page) {
> + kvm_mmu_isolate_pages(invalid_list);
> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> + list_del_init(invalid_list);
> + call_rcu(&sp->rcu, free_pages_rcu);
> + return;
> + }
This is unbounded (there was a similar problem with early fast page fault
implementations):
From RCU/checklist.txt:
" An especially important property of the synchronize_rcu()
primitive is that it automatically self-limits: if grace periods
are delayed for whatever reason, then the synchronize_rcu()
primitive will correspondingly delay updates. In contrast,
code using call_rcu() should explicitly limit update rate in
cases where grace periods are delayed, as failing to do so can
result in excessive realtime latencies or even OOM conditions.
"
Moreover, freeing pages differently depending on some state should
be avoided.
Alternatives:
- Disable interrupts at write protect sites.
- Rate limit the number of pages freed via call_rcu
per grace period.
- Some better alternative.
Hi Marcelo,
On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti <[email protected]> wrote:
>>
>> + if (kvm->arch.rcu_free_shadow_page) {
>> + kvm_mmu_isolate_pages(invalid_list);
>> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> + list_del_init(invalid_list);
>> + call_rcu(&sp->rcu, free_pages_rcu);
>> + return;
>> + }
>
> This is unbounded (there was a similar problem with early fast page fault
> implementations):
>
> From RCU/checklist.txt:
>
> " An especially important property of the synchronize_rcu()
> primitive is that it automatically self-limits: if grace periods
> are delayed for whatever reason, then the synchronize_rcu()
> primitive will correspondingly delay updates. In contrast,
> code using call_rcu() should explicitly limit update rate in
> cases where grace periods are delayed, as failing to do so can
> result in excessive realtime latencies or even OOM conditions.
> "
I understand what you are worrying about... Hmm, can it be avoided by
just using kvm->arch.rcu_free_shadow_page in a small window? Then there is
only a slight chance that the page needs to be freed by call_rcu.
>
> Moreover, freeing pages differently depending on some state should
> be avoided.
>
> Alternatives:
>
> - Disable interrupts at write protect sites.
The write-protection can be triggered by a KVM ioctl that is not in the VCPU
context; if we do this, we also need to send an IPI to the KVM thread when doing
the TLB flush. And we can not do much work while interrupts are disabled due to
interrupt latency.
> - Rate limit the number of pages freed via call_rcu
> per grace period.
Seems complex. :(
> - Some better alternative.
Gleb has an idea that uses SLAB_DESTROY_BY_RCU to protect the shadow page table
and encodes the page level into the spte (since we need to check whether the spte
is the last spte). How about this?
I planned to do it after this patchset is merged; if you like it, and if you think
that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
the issue, I am happy to do it in the next version. :)
Thanks, Marcelo!
On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
>
> Hi Marcelo,
>
> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti <[email protected]> wrote:
>
> >>
> >> + if (kvm->arch.rcu_free_shadow_page) {
> >> + kvm_mmu_isolate_pages(invalid_list);
> >> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >> + list_del_init(invalid_list);
> >> + call_rcu(&sp->rcu, free_pages_rcu);
> >> + return;
> >> + }
> >
> > This is unbounded (there was a similar problem with early fast page fault
> > implementations):
> >
> > From RCU/checklist.txt:
> >
> > " An especially important property of the synchronize_rcu()
> > primitive is that it automatically self-limits: if grace periods
> > are delayed for whatever reason, then the synchronize_rcu()
> > primitive will correspondingly delay updates. In contrast,
> > code using call_rcu() should explicitly limit update rate in
> > cases where grace periods are delayed, as failing to do so can
> > result in excessive realtime latencies or even OOM conditions.
> > "
>
> I understand what you are worrying about… Hmm, can it be avoided by
> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> there are slight chance that the page need to be freed by call_rcu.
The point that must be addressed is that you cannot allow an unlimited
number of sp's to be freed via call_rcu between two grace periods.
So something like:
- For every 17MB worth of shadow pages.
- Guarantee a grace period has passed.
If you control kvm->arch.rcu_free_shadow_page, you could periodically
verify how many MBs worth of shadow pages are in the queue for RCU
freeing and force a grace period after a certain number.
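A minimal sketch of that throttling idea (all names below, the helper itself,
sp_pending_rcu_free and SP_RCU_FREE_LIMIT, are hypothetical, invented only to
illustrate the shape of such a fix):

/*
 * Hypothetical sketch: bound how much memory can sit behind call_rcu().
 * Must run outside mmu_lock, since synchronize_rcu() sleeps.
 */
static void kvm_mmu_throttle_rcu_free(struct kvm *kvm, int nr_freed)
{
	kvm->arch.sp_pending_rcu_free += nr_freed;
	if (kvm->arch.sp_pending_rcu_free > SP_RCU_FREE_LIMIT) {
		/* Unlike call_rcu(), this self-limits the updater. */
		synchronize_rcu();
		kvm->arch.sp_pending_rcu_free = 0;
	}
}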
> > Moreover, freeing pages differently depending on some state should
> > be avoided.
> >
> > Alternatives:
> >
> > - Disable interrupts at write protect sites.
>
> The write-protection can be triggered by KVM ioctl that is not in the VCPU
> context, if we do this, we also need to send IPI to the KVM thread when do
> TLB flush.
Yes. However for the case being measured, simultaneous page freeing by vcpus
should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).
> And we can not do much work while interrupt is disabled due to
> interrupt latency.
>
> > - Rate limit the number of pages freed via call_rcu
> > per grace period.
>
> Seems complex. :(
>
> > - Some better alternative.
>
> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> and encodes the page-level into the spte (since we need to check if the spte
> is the last-spte. ). How about this?
Pointer please? Why is SLAB_DESTROY_BY_RCU any safer than call_rcu with
regard to this limitation? (maybe it is).
> I planned to do it after this patchset merged, if you like it and if you think
> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
> the issue, i am happy to do it in the next version. :)
Unfortunately the window can be large (as it depends on the size of the
memslot), so it would be best if this problem were addressed before
merging. What is your idea for reducing the rcu_free_shadow_page=1 window?
Thank you for the good work.
On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti <[email protected]> wrote:
>>
>>>>
>>>> + if (kvm->arch.rcu_free_shadow_page) {
>>>> + kvm_mmu_isolate_pages(invalid_list);
>>>> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>>>> + list_del_init(invalid_list);
>>>> + call_rcu(&sp->rcu, free_pages_rcu);
>>>> + return;
>>>> + }
>>>
>>> This is unbounded (there was a similar problem with early fast page fault
>>> implementations):
>>>
>>> From RCU/checklist.txt:
>>>
>>> " An especially important property of the synchronize_rcu()
>>> primitive is that it automatically self-limits: if grace periods
>>> are delayed for whatever reason, then the synchronize_rcu()
>>> primitive will correspondingly delay updates. In contrast,
>>> code using call_rcu() should explicitly limit update rate in
>>> cases where grace periods are delayed, as failing to do so can
>>> result in excessive realtime latencies or even OOM conditions.
>>> "
>>
>> I understand what you are worrying about… Hmm, can it be avoided by
>> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
>> there are slight chance that the page need to be freed by call_rcu.
>
> The point that must be addressed is that you cannot allow an unlimited
> number of sp's to be freed via call_rcu between two grace periods.
>
> So something like:
>
> - For every 17MB worth of shadow pages.
> - Guarantee a grace period has passed.
Hmm, the 'qhimark' in rcu is 10000, which means rcu allows 10000 call_rcu
callbacks to be pending in a grace period without slowdown. Can we really
reach this number while rcu_free_shadow_page is set? Anyway, if it can,
we can use rcu tech-break to avoid it, can't we?
>
> If you control kvm->arch.rcu_free_shadow_page, you could periodically
> verify how many MBs worth of shadow pages are in the queue for RCU
> freeing and force grace period after a certain number.
I have no idea how to force a grace period for a cpu which is running
on the rcu read side. IIUC, only dyntick-idle and offline CPUs can be forced,
see rcu_gp_fqs().
>
>>> Moreover, freeing pages differently depending on some state should
>>> be avoided.
>>>
>>> Alternatives:
>>>
>>> - Disable interrupts at write protect sites.
>>
>> The write-protection can be triggered by KVM ioctl that is not in the VCPU
>> context, if we do this, we also need to send IPI to the KVM thread when do
>> TLB flush.
>
> Yes. However for the case being measured, simultaneous page freeing by vcpus
> should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).
I agree, but write-protection will cost lots of time, so we would need to either:
1) do the write-protection with irqs disabled, which is not good for devices,
or
2) do the work in pieces, then enable irqs and disable them again to continue
the work. Enabling and disabling irqs many times is not cheap on x86.
No?
>
>> And we can not do much work while interrupt is disabled due to
>> interrupt latency.
>>
>>> - Rate limit the number of pages freed via call_rcu
>>> per grace period.
>>
>> Seems complex. :(
>>
>>> - Some better alternative.
>>
>> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
>> and encodes the page-level into the spte (since we need to check if the spte
>> is the last-spte. ). How about this?
>
> Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> regards to limitation? (maybe it is).
In my experience, freeing shadow pages and allocating shadow pages are balanced;
we can check it with the following (make -j12 on a guest with 4 vcpus):
# echo > trace
[root@eric-desktop tracing]# cat trace > ~/log | sleep 3
[root@eric-desktop tracing]# cat ~/log | grep new | wc -l
10816
[root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
10656
[root@eric-desktop tracing]# cat set_event
kvmmmu:kvm_mmu_get_page
kvmmmu:kvm_mmu_prepare_zap_page
alloc VS. free = 10816 : 10656
So mostly all allocating and freeing are done in the slab's
cache, and the slab frees shadow pages very slowly; there is no rcu issue.
>
>> I planned to do it after this patchset merged, if you like it and if you think
>> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
>> the issue, i am happy to do it in the next version. :)
>
> Unfortunately the window can be large (as it depends on the size of the
> memslot), so it would be best if this problem can be addressed before
> merging. What is your idea for reducing rcu_free_shadow_page=1 window?
>
I mean something like:
	for (i = 0; i <= page_nr; i++) {
		rcu_free_shadow_page = 1;
		write_protect(&rmap[i]);
		rcu_free_shadow_page = 0;
	}
That way we write-protect fewer entries at a time rather than a whole memslot.
BTW, I think rcu_need_break() would be useful for this case, so that the
rcu read side does not delay synchronize_rcu() too much.
> Thank you for the good work.
I really appreciate your, Gleb's and the other guys' time and inspiration
on it. :)
On Wed, Oct 09, 2013 at 06:45:47PM +0800, Xiao Guangrong wrote:
> On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> > On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti <[email protected]> wrote:
> >>
> >>>>
> >>>> + if (kvm->arch.rcu_free_shadow_page) {
> >>>> + kvm_mmu_isolate_pages(invalid_list);
> >>>> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >>>> + list_del_init(invalid_list);
> >>>> + call_rcu(&sp->rcu, free_pages_rcu);
> >>>> + return;
> >>>> + }
> >>>
> >>> This is unbounded (there was a similar problem with early fast page fault
> >>> implementations):
> >>>
> >>> From RCU/checklist.txt:
> >>>
> >>> " An especially important property of the synchronize_rcu()
> >>> primitive is that it automatically self-limits: if grace periods
> >>> are delayed for whatever reason, then the synchronize_rcu()
> >>> primitive will correspondingly delay updates. In contrast,
> >>> code using call_rcu() should explicitly limit update rate in
> >>> cases where grace periods are delayed, as failing to do so can
> >>> result in excessive realtime latencies or even OOM conditions.
> >>> "
> >>
> >> I understand what you are worrying about… Hmm, can it be avoided by
> >> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> >> there are slight chance that the page need to be freed by call_rcu.
> >
> > The point that must be addressed is that you cannot allow an unlimited
> > number of sp's to be freed via call_rcu between two grace periods.
> >
> > So something like:
> >
> > - For every 17MB worth of shadow pages.
> > - Guarantee a grace period has passed.
>
> Hmm, the 'qhimark' in rcu is 10000, that means rcu allows call_rcu
> to pend 10000 times in a grace-period without slowdown. Can we really
> reach this number while rcu_free_shadow_page is set? Anyway, if it can,
> we can use rcu tech-break to avoid it, can't we?
Yes.
> > If you control kvm->arch.rcu_free_shadow_page, you could periodically
> > verify how many MBs worth of shadow pages are in the queue for RCU
> > freeing and force grace period after a certain number.
>
> I have no idea how to froce grace-period for the cpu which is running
> in rcu-read side. IIUC, only dyntick-idle and offline CPU can be froced,
> see rcu_gp_fqs().
synchronize_rcu.
> >>> Moreover, freeing pages differently depending on some state should
> >>> be avoided.
> >>>
> >>> Alternatives:
> >>>
> >>> - Disable interrupts at write protect sites.
> >>
> >> The write-protection can be triggered by KVM ioctl that is not in the VCPU
> >> context, if we do this, we also need to send IPI to the KVM thread when do
> >> TLB flush.
> >
> > Yes. However for the case being measured, simultaneous page freeing by vcpus
> > should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).
>
> I agree, but write-protection will cost lots of time, we need to:
> 1) do write-protection under irq disabled, that is not good for device
> Or
> 2) do pieces of works, then enable irq and disable it agian to continue
> the work. Enabling and disabing irq many times are not cheap for x86.
>
> no?
It's getting cheaper (see the performance optimization guide).
> >> And we can not do much work while interrupt is disabled due to
> >> interrupt latency.
> >>
> >>> - Rate limit the number of pages freed via call_rcu
> >>> per grace period.
> >>
> >> Seems complex. :(
> >>
> >>> - Some better alternative.
> >>
> >> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> >> and encodes the page-level into the spte (since we need to check if the spte
> >> is the last-spte. ). How about this?
> >
> > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > regards to limitation? (maybe it is).
>
> For my experience, freeing shadow page and allocing shadow page are balanced,
> we can check it by (make -j12 on a guest with 4 vcpus and):
>
> # echo > trace
> [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> 10816
> [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> 10656
> [root@eric-desktop tracing]# cat set_event
> kvmmmu:kvm_mmu_get_page
> kvmmmu:kvm_mmu_prepare_zap_page
>
> alloc VS. free = 10816 : 10656
>
> So that, mostly all allocing and freeing are done in the slab's
> cache and the slab frees shdadow pages very slowly, there is no rcu issue.
A more detailed test case would be:
- cpu0-vcpu0 releasing pages as fast as possible
- cpu1 executing get_dirty_log
Think of a very large guest.
> >> I planned to do it after this patchset merged, if you like it and if you think
> >> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
> >> the issue, i am happy to do it in the next version. :)
> >
> > Unfortunately the window can be large (as it depends on the size of the
> > memslot), so it would be best if this problem can be addressed before
> > merging. What is your idea for reducing rcu_free_shadow_page=1 window?
> >
>
> I mean something like:
>
> for (i =0; i <= page_nr; i++) {
> rcu_free_shadow_page=1;
> write_protect(&rmap[i]);
> rcu_free_shadow_page=0;
> }
>
> we write-protect less entries per time rather than a memslot.
>
> BTW, i think rcu_need_break() would be usefull for this case, then the
> rcu_read side do not dealy synchronize_rcu() too much.
Yes, that works, you'd have to synchronize_sched eventually, depending on the
count of pending pages to be freed via RCU (so that rate limiting is
performed).
Or, you could bound the problematic callsites, those that can free a
large number of shadow pages (mmu_shrink, kvm_mmu_change_mmu_pages,
...), and limit there:
	if (spin_needbreak(mmu_lock) ||
	    (sp_pending_rcu_free > sp_pending_rcu_free_max)) {
		if (sp_pending...)
			synchronize_rcu();
	}
And any other code path that can possibly free pages, outside locks of
course (even those that can free 1 page), in a similar fashion to how
the per-vcpu mmu object slabs are grown.
On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > >> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> > >> and encodes the page-level into the spte (since we need to check if the spte
> > >> is the last-spte. ). How about this?
> > >
> > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > regards to limitation? (maybe it is).
> >
> > For my experience, freeing shadow page and allocing shadow page are balanced,
> > we can check it by (make -j12 on a guest with 4 vcpus and):
> >
> > # echo > trace
> > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > 10816
> > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > 10656
> > [root@eric-desktop tracing]# cat set_event
> > kvmmmu:kvm_mmu_get_page
> > kvmmmu:kvm_mmu_prepare_zap_page
> >
> > alloc VS. free = 10816 : 10656
> >
> > So that, mostly all allocing and freeing are done in the slab's
> > cache and the slab frees shdadow pages very slowly, there is no rcu issue.
>
> A more detailed test case would be:
>
> - cpu0-vcpu0 releasing pages as fast as possible
> - cpu1 executing get_dirty_log
>
> Think of a very large guest.
>
The number of shadow pages allocated from the slab will be bounded by
n_max_mmu_pages, and, in addition, a page released to the slab is immediately
available for allocation, with no need to wait for a grace period. RCU comes
into play only when the slab is shrunk, which should happen almost never. If a
SLAB_DESTROY_BY_RCU slab does not rate limit how it frees its pages, that
is for the slab to fix, not for its users.
--
Gleb.
On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > >> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> > > >> and encodes the page-level into the spte (since we need to check if the spte
> > > >> is the last-spte. ). How about this?
> > > >
> > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > regards to limitation? (maybe it is).
> > >
> > > For my experience, freeing shadow page and allocing shadow page are balanced,
> > > we can check it by (make -j12 on a guest with 4 vcpus and):
> > >
> > > # echo > trace
> > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > 10816
> > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > 10656
> > > [root@eric-desktop tracing]# cat set_event
> > > kvmmmu:kvm_mmu_get_page
> > > kvmmmu:kvm_mmu_prepare_zap_page
> > >
> > > alloc VS. free = 10816 : 10656
> > >
> > > So that, mostly all allocing and freeing are done in the slab's
> > > cache and the slab frees shdadow pages very slowly, there is no rcu issue.
> >
> > A more detailed test case would be:
> >
> > - cpu0-vcpu0 releasing pages as fast as possible
> > - cpu1 executing get_dirty_log
> >
> > Think of a very large guest.
> >
> The number of shadow pages allocated from slab will be bounded by
> n_max_mmu_pages,
Correct, but that limit is not suitable (maximum number of mmu pages
should be larger than number of mmu pages freeable in a rcu grace
period).
> and, in addition, page released to slab is immediately
> available for allocation, no need to wait for grace period.
See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> RCU comes
> into play only when slab is shrunk, which should be almost never. If
> SLAB_DESTROY_BY_RCU slab does not rate limit how it frees its pages this
> is for slab to fix, not for its users.
Agree.
On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > >> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> > > > >> and encodes the page-level into the spte (since we need to check if the spte
> > > > >> is the last-spte. ). How about this?
> > > > >
> > > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > > regards to limitation? (maybe it is).
> > > >
> > > > For my experience, freeing shadow page and allocing shadow page are balanced,
> > > > we can check it by (make -j12 on a guest with 4 vcpus and):
> > > >
> > > > # echo > trace
> > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > 10816
> > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > 10656
> > > > [root@eric-desktop tracing]# cat set_event
> > > > kvmmmu:kvm_mmu_get_page
> > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > >
> > > > alloc VS. free = 10816 : 10656
> > > >
> > > > So that, mostly all allocing and freeing are done in the slab's
> > > > cache and the slab frees shdadow pages very slowly, there is no rcu issue.
> > >
> > > A more detailed test case would be:
> > >
> > > - cpu0-vcpu0 releasing pages as fast as possible
> > > - cpu1 executing get_dirty_log
> > >
> > > Think of a very large guest.
> > >
> > The number of shadow pages allocated from slab will be bounded by
> > n_max_mmu_pages,
>
> Correct, but that limit is not suitable (maximum number of mmu pages
> should be larger than number of mmu pages freeable in a rcu grace
> period).
>
I am not sure I understand what you mean here. What I was saying is that if
we change the code to allocate sp->spt from a slab, this slab will never have
more than n_max_mmu_pages objects in it.
> > and, in addition, page released to slab is immediately
> > available for allocation, no need to wait for grace period.
>
> See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
>
This comment is exactly what I was referring to in the code you quoted. Do
you see anything problematic in what the comment describes?
--
Gleb.
On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
> On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > > >> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
> > > > > >> and encodes the page-level into the spte (since we need to check if the spte
> > > > > >> is the last-spte. ). How about this?
> > > > > >
> > > > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > > > regards to limitation? (maybe it is).
> > > > >
> > > > > For my experience, freeing shadow page and allocing shadow page are balanced,
> > > > > we can check it by (make -j12 on a guest with 4 vcpus and):
> > > > >
> > > > > # echo > trace
> > > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > > 10816
> > > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > > 10656
> > > > > [root@eric-desktop tracing]# cat set_event
> > > > > kvmmmu:kvm_mmu_get_page
> > > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > > >
> > > > > alloc VS. free = 10816 : 10656
> > > > >
> > > > > So that, mostly all allocing and freeing are done in the slab's
> > > > > cache and the slab frees shdadow pages very slowly, there is no rcu issue.
> > > >
> > > > A more detailed test case would be:
> > > >
> > > > - cpu0-vcpu0 releasing pages as fast as possible
> > > > - cpu1 executing get_dirty_log
> > > >
> > > > Think of a very large guest.
> > > >
> > > The number of shadow pages allocated from slab will be bounded by
> > > n_max_mmu_pages,
> >
> > Correct, but that limit is not suitable (maximum number of mmu pages
> > should be larger than number of mmu pages freeable in a rcu grace
> > period).
> >
> I am not sure I understand what you mean here. What I was sating is that if
> we change code to allocate sp->spt from slab, this slab will never have
> more then n_max_mmu_pages objects in it.
n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
RCU (it's too large). If the free memory watermarks are smaller than
n_max_mmu_pages for all guests, OOM is possible.
> > > and, in addition, page released to slab is immediately
> > > available for allocation, no need to wait for grace period.
> >
> > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> >
> This comment is exactly what I was referring to in the code you quoted. Do
> you see anything problematic in what comment describes?
"This delays freeing the SLAB page by a grace period, it does _NOT_
delay object freeing." The page is not available for allocation.
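
The slab.h comment also obliges a lockless reader to revalidate whatever it
finds, since the object itself can be freed and reallocated within the grace
period; only the memory stays type-stable. A hedged sketch of that pattern -
the struct, slot and helper below are illustrative, not KVM code:

#include <linux/rcupdate.h>
#include <linux/kvm_types.h>

/* Illustrative object kept in a SLAB_DESTROY_BY_RCU cache. */
struct spt_desc {
	gfn_t gfn;
};

static bool desc_matches_gfn(struct spt_desc __rcu **slot, gfn_t gfn)
{
	struct spt_desc *d;
	bool match = false;

	rcu_read_lock();
	d = rcu_dereference(*slot);
	/*
	 * The memory is type-stable for the duration of the read-side
	 * critical section, but the object may have been freed and
	 * reallocated for another gfn: revalidate before trusting it,
	 * and do not use it after rcu_read_unlock().
	 */
	if (d && d->gfn == gfn)
		match = true;
	rcu_read_unlock();

	return match;
}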
On Thu, Oct 10, 2013 at 06:03:01PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > > > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > > > >> Gleb has an idea that uses SLAB_DESTROY_BY_RCU to protect the shadow page table
> > > > > > >> and encodes the page level into the spte (since we need to check whether the spte
> > > > > > >> is the last-level spte). How about this?
> > > > > > >
> > > > > > > Pointer please? Why is SLAB_DESTROY_BY_RCU any safer than call_rcu with
> > > > > > > regard to the limitation? (maybe it is).
> > > > > >
> > > > > > In my experience, freeing shadow pages and allocating shadow pages are balanced;
> > > > > > we can check it by running make -j12 on a guest with 4 vcpus and:
> > > > > >
> > > > > > # echo > trace
> > > > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > > > 10816
> > > > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > > > 10656
> > > > > > [root@eric-desktop tracing]# cat set_event
> > > > > > kvmmmu:kvm_mmu_get_page
> > > > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > > > >
> > > > > > alloc VS. free = 10816 : 10656
> > > > > >
> > > > > > So almost all allocating and freeing is done within the slab's
> > > > > > cache, and the slab frees shadow pages very slowly, so there is no RCU issue.
> > > > >
> > > > > A more detailed test case would be:
> > > > >
> > > > > - cpu0-vcpu0 releasing pages as fast as possible
> > > > > - cpu1 executing get_dirty_log
> > > > >
> > > > > Think of a very large guest.
> > > > >
> > > > The number of shadow pages allocated from slab will be bounded by
> > > > n_max_mmu_pages,
> > >
> > > Correct, but that limit is not suitable (maximum number of mmu pages
> > > should be larger than number of mmu pages freeable in a rcu grace
> > > period).
> > >
> > I am not sure I understand what you mean here. What I was saying is that if
> > we change the code to allocate sp->spt from slab, this slab will never have
> > more than n_max_mmu_pages objects in it.
>
> n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> RCU (its too large). If the free memory watermarks are smaller than
> n_max_mmu_pages for all guests, OOM is possible.
>
Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
that the slab size will be bounded, so hopefully the shrinker will touch it
rarely.
> > > > and, in addition, page released to slab is immediately
> > > > available for allocation, no need to wait for grace period.
> > >
> > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > >
> > This comment is exactly what I was referring to in the code you quoted. Do
> > you see anything problematic in what comment describes?
>
> "This delays freeing the SLAB page by a grace period, it does _NOT_
> delay object freeing." The page is not available for allocation.
By "page" I mean "spt page" which is a slab object. So "spt page"
AKA slab object will be available fo allocation immediately.
--
Gleb.
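
In other words, if sp->spt came from the cache sketched earlier, the alloc/free
pair would look roughly like this (the helper names are made up); the freed
table is immediately allocatable again by the MMU, and only the underlying
slab page waits for RCU:

static u64 *mmu_alloc_spt(void)
{
	/* The returned page may be a just-freed spt page, reused at once. */
	return kmem_cache_zalloc(mmu_spt_cache, GFP_KERNEL);
}

static void mmu_free_spt(u64 *spt)
{
	/* The object goes back to the cache immediately, no grace period here. */
	kmem_cache_free(mmu_spt_cache, spt);
}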
On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > RCU (its too large). If the free memory watermarks are smaller than
> > n_max_mmu_pages for all guests, OOM is possible.
> >
> Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> that slab size will be bound, so hopefully shrinker will touch it
> rarely.
>
> > > > > and, in addition, page released to slab is immediately
> > > > > available for allocation, no need to wait for grace period.
> > > >
> > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > >
> > > This comment is exactly what I was referring to in the code you quoted. Do
> > > you see anything problematic in what comment describes?
> >
> > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > delay object freeing." The page is not available for allocation.
> By "page" I mean "spt page" which is a slab object. So "spt page"
> AKA slab object will be available fo allocation immediately.
The object is reusable within that SLAB cache only, not the
entire system (therefore it does not prevent OOM condition).
OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling
is still necessary, as described in the RCU documentation.
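
For completeness, the kind of throttling the RCU documentation has in mind when
frees can outpace grace periods is simply to fall back to synchronize_rcu()
every so often, since it self-limits the producer. A hedged sketch, not code
from the patchset - the rcu_head member, the final-free helper and the batch
size are all assumptions:

#include <linux/rcupdate.h>
#include <linux/atomic.h>

#define MMU_FREE_BATCH	512	/* illustrative threshold */

static atomic_t mmu_pending_frees;

static void mmu_free_sp_rcu(struct rcu_head *head)
{
	/* 'rcu' is an assumed rcu_head member of struct kvm_mmu_page. */
	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page, rcu);

	atomic_dec(&mmu_pending_frees);
	kvm_mmu_free_page(sp);		/* assumed final-free helper */
}

static void mmu_free_sp_deferred(struct kvm_mmu_page *sp)
{
	/*
	 * Throttle: once too many pages are waiting for a grace period,
	 * block until one elapses so the backlog cannot grow without
	 * bound (and trigger OOM) on a callback flood.
	 */
	if (atomic_inc_return(&mmu_pending_frees) >= MMU_FREE_BATCH)
		synchronize_rcu();

	call_rcu(&sp->rcu, mmu_free_sp_rcu);
}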
On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > > RCU (its too large). If the free memory watermarks are smaller than
> > > n_max_mmu_pages for all guests, OOM is possible.
> > >
> > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > that slab size will be bound, so hopefully shrinker will touch it
> > rarely.
> >
> > > > > > and, in addition, page released to slab is immediately
> > > > > > available for allocation, no need to wait for grace period.
> > > > >
> > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > >
> > > > This comment is exactly what I was referring to in the code you quoted. Do
> > > > you see anything problematic in what comment describes?
> > >
> > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > delay object freeing." The page is not available for allocation.
> > By "page" I mean "spt page" which is a slab object. So "spt page"
> > AKA slab object will be available fo allocation immediately.
>
> The object is reusable within that SLAB cache only, not the
> entire system (therefore it does not prevent OOM condition).
>
Since an object is allocatable immediately by the shadow paging code, the number
of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
memory for n_max_mmu_pages, an OOM condition can happen anyway, since the shadow
paging code will usually have exactly n_max_mmu_pages allocated.
> OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling
> is still necessary, as described in the RCU documentation.
>
I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
comes into play only when the SLAB cache is shrunk, and that happens far from
the kvm code.
--
Gleb.
On Sat, Oct 12, 2013 at 08:53:56AM +0300, Gleb Natapov wrote:
> On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> > On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > > > RCU (its too large). If the free memory watermarks are smaller than
> > > > n_max_mmu_pages for all guests, OOM is possible.
> > > >
> > > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > > that slab size will be bound, so hopefully shrinker will touch it
> > > rarely.
> > >
> > > > > > > and, in addition, page released to slab is immediately
> > > > > > > available for allocation, no need to wait for grace period.
> > > > > >
> > > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > > >
> > > > > This comment is exactly what I was referring to in the code you quoted. Do
> > > > > you see anything problematic in what comment describes?
> > > >
> > > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > > delay object freeing." The page is not available for allocation.
> > > By "page" I mean "spt page" which is a slab object. So "spt page"
> > > AKA slab object will be available fo allocation immediately.
> >
> > The object is reusable within that SLAB cache only, not the
> > entire system (therefore it does not prevent OOM condition).
> >
> Since an object is allocatable immediately by the shadow paging code, the number
> of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
> memory for n_max_mmu_pages, an OOM condition can happen anyway, since the shadow
> paging code will usually have exactly n_max_mmu_pages allocated.
>
> > OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling
> > is still necessary, as described in the RCU documentation.
> >
> I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
> comes into play only when SLAB cache is shrunk and it happens far from
> kvm code.
You are right.
Why is it safe to allow access, by the lockless page write protect
side, to the spt pointer for shadow page A when it can change into the
spt pointer of shadow page B?
Do you write protect the spte of any page at will? Or do you verify
that it is in fact the shadow page you want to write protect?
Note that the spte value might be the same for different shadow pages,
so cmpxchg succeeding does not guarantee it is the same shadow page that
has been protected.
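
For concreteness, a hedged sketch of the lockless write protect step being
questioned here, assuming KVM's existing is_writable_pte() and
PT_WRITABLE_MASK helpers; the function itself is illustrative:

static bool spte_write_protect_lockless(u64 *sptep)
{
	u64 spte = ACCESS_ONCE(*sptep);

	if (!is_writable_pte(spte))
		return false;

	/*
	 * A successful cmpxchg only proves the value matched; the page
	 * holding *sptep may meanwhile belong to a different shadow page,
	 * which is exactly the concern raised above.
	 */
	return cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte;
}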
On Mon, Oct 14, 2013 at 04:29:45PM -0300, Marcelo Tosatti wrote:
> On Sat, Oct 12, 2013 at 08:53:56AM +0300, Gleb Natapov wrote:
> > On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> > > On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > > > > RCU (its too large). If the free memory watermarks are smaller than
> > > > > n_max_mmu_pages for all guests, OOM is possible.
> > > > >
> > > > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > > > that slab size will be bound, so hopefully shrinker will touch it
> > > > rarely.
> > > >
> > > > > > > > and, in addition, page released to slab is immediately
> > > > > > > > available for allocation, no need to wait for grace period.
> > > > > > >
> > > > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > > > >
> > > > > > This comment is exactly what I was referring to in the code you quoted. Do
> > > > > > you see anything problematic in what comment describes?
> > > > >
> > > > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > > > delay object freeing." The page is not available for allocation.
> > > > By "page" I mean "spt page" which is a slab object. So "spt page"
> > > > AKA slab object will be available fo allocation immediately.
> > >
> > > The object is reusable within that SLAB cache only, not the
> > > entire system (therefore it does not prevent OOM condition).
> > >
> > Since an object is allocatable immediately by the shadow paging code, the number
> > of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
> > memory for n_max_mmu_pages, an OOM condition can happen anyway, since the shadow
> > paging code will usually have exactly n_max_mmu_pages allocated.
> >
> > > OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling
> > > is still necessary, as described in the RCU documentation.
> > >
> > I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
> > comes into play only when SLAB cache is shrunk and it happens far from
> > kvm code.
>
> You are right.
>
> Why is it safe to allow access, by the lockless page write protect
> side, to spt pointer for shadow page A that can change to a shadow page
> pointer of shadow page B?
>
> Write protect spte of any page at will? Or verify that in fact thats the
> shadow you want to write protect?
>
> Note that spte value might be the same for different shadow pages,
> > so cmpxchg succeeding does not guarantee it is the same shadow page that
> has been protected.
>
Two things can happen: the spte that we accidentally write protect is some
other last-level spte - this is benign, it will be unprotected on the next
fault. If the spte is not last level this is a problem, and Xiao proposes to
fix it by encoding the spte level into the spte itself. Another way to fix it
is to handle the fault that is caused by write-protected middle sptes in KVM -
just unprotect them and go back to the guest.
--
Gleb.
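
A hedged sketch of the level-encoding idea mentioned above; the bit positions
and helpers are invented for illustration and need not match what an actual
patch would do:

#define SPTE_LEVEL_SHIFT	52	/* software-available spte bits (illustrative) */
#define SPTE_LEVEL_MASK		(7ULL << SPTE_LEVEL_SHIFT)

static u64 spte_set_level(u64 spte, int level)
{
	return (spte & ~SPTE_LEVEL_MASK) | ((u64)level << SPTE_LEVEL_SHIFT);
}

static int spte_level(u64 spte)
{
	return (spte & SPTE_LEVEL_MASK) >> SPTE_LEVEL_SHIFT;
}

/*
 * The lockless side then refuses to write protect anything that does
 * not claim to be a last-level (PT_PAGE_TABLE_LEVEL) spte, so a
 * recycled middle spte cannot be protected by mistake.
 */
static bool spte_safe_to_write_protect_locklessly(u64 spte)
{
	return spte_level(spte) == PT_PAGE_TABLE_LEVEL;
}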
On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> >
> > Why is it safe to allow access, by the lockless page write protect
> > side, to spt pointer for shadow page A that can change to a shadow page
> > pointer of shadow page B?
> >
> > Write protect spte of any page at will? Or verify that in fact thats the
> > shadow you want to write protect?
> >
> > Note that spte value might be the same for different shadow pages,
> > > so cmpxchg succeeding does not guarantee it is the same shadow page that
> > has been protected.
> >
> Two things can happen: spte that we accidentally write protect is some
> other last level spte - this is benign, it will be unprotected on next
> fault.
Nothing forbids two identical writable sptes from pointing to the same pfn. How
do you know you are write protecting the correct one (the proper gfn)?
Lockless walk sounds interesting. By the time you get to the lower
level, that might be a different spte.
All of this to avoid throttling - is it worthwhile?
> If the spte is not last level this is a problem, and Xiao proposes to
> fix it by encoding the spte level into the spte itself. Another way to fix it
> is to handle the fault that is caused by write-protected middle sptes in KVM -
> just unprotect them and go back to the guest.
>
> --
> Gleb.
On Oct 16, 2013, at 6:21 AM, Marcelo Tosatti <[email protected]> wrote:
> On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
>>>
>>> Why is it safe to allow access, by the lockless page write protect
>>> side, to spt pointer for shadow page A that can change to a shadow page
>>> pointer of shadow page B?
>>>
>>> Write protect spte of any page at will? Or verify that in fact thats the
>>> shadow you want to write protect?
>>>
>>> Note that spte value might be the same for different shadow pages,
>>> so cmpxchg succeeding does not guarantee it is the same shadow page that
>>> has been protected.
>>>
>> Two things can happen: spte that we accidentally write protect is some
>> other last level spte - this is benign, it will be unprotected on next
>> fault.
>
> Nothing forbids two identical writable sptes to point to a same pfn. How
> do you know you are write protecting the correct one? (the proper gfn).
>
> Lockless walk sounds interesting. By the time you get to the lower
> level, that might be a different spte.
That's safe. Since get-dirty-log is serialized by the slots lock, the dirty bit
can not be lost - even if we write-protect on a different memslot
(the dirty bit is still set). The worst case is that we write-protect an
unnecessary spte and cause an extra #PF, but that is really rare.
And the lockless rmap walker can detect the new spte, so that
write protection on the memslot is not missed.
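
The "extra #PF" worst case is cheap because the fault side can repair it
locklessly, much as the existing fast page fault path does for sptes that were
write protected only for dirty logging. The standalone helper below is an
illustration, with spte_is_locklessly_modifiable() being the existing KVM
check:

static bool fixup_spurious_write_protect(u64 *sptep)
{
	u64 spte = ACCESS_ONCE(*sptep);

	/* Only sptes write protected purely for dirty logging qualify. */
	if (!spte_is_locklessly_modifiable(spte))
		return false;

	/* Restore the writable bit and let the vcpu retry the write. */
	return cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte;
}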
On Tue, Oct 15, 2013 at 07:21:19PM -0300, Marcelo Tosatti wrote:
> On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> > >
> > > Why is it safe to allow access, by the lockless page write protect
> > > side, to spt pointer for shadow page A that can change to a shadow page
> > > pointer of shadow page B?
> > >
> > > Write protect spte of any page at will? Or verify that in fact thats the
> > > shadow you want to write protect?
> > >
> > > Note that spte value might be the same for different shadow pages,
> > > > so cmpxchg succeeding does not guarantee it is the same shadow page that
> > > has been protected.
> > >
> > Two things can happen: spte that we accidentally write protect is some
> > other last level spte - this is benign, it will be unprotected on next
> > fault.
>
> Nothing forbids two identical writable sptes to point to a same pfn. How
> do you know you are write protecting the correct one? (the proper gfn).
>
I am not sure I understand what you mean. If the spt was freed and reallocated
while the lockless shadow page walk happened, we are definitely write protecting
the incorrect one, but this is only a problem if the spte that is write
protected is not last level, and there are ways to solve that.
> Lockless walk sounds interesting. By the time you get to the lower
> level, that might be a different spte.
>
> All of this to avoid throttling, is it worthwhile?
>
I do not like kvm->arch.rcu_free_shadow_page, it feels like a hack. I
proposed the slab RCU solution before, but it needs some more work to encode
the pt level into the spte, so Xiao wanted to do it on top. But since something,
either throttling or slab RCU, needs to be implemented anyway, I prefer
the latter.
--
Gleb.
On Wed, Oct 16, 2013 at 12:12:11PM +0300, Gleb Natapov wrote:
> On Tue, Oct 15, 2013 at 07:21:19PM -0300, Marcelo Tosatti wrote:
> > On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> > > >
> > > > Why is it safe to allow access, by the lockless page write protect
> > > > side, to spt pointer for shadow page A that can change to a shadow page
> > > > pointer of shadow page B?
> > > >
> > > > Write protect spte of any page at will? Or verify that in fact thats the
> > > > shadow you want to write protect?
> > > >
> > > > Note that spte value might be the same for different shadow pages,
> > > > > so cmpxchg succeeding does not guarantee it is the same shadow page that
> > > > has been protected.
> > > >
> > > Two things can happen: spte that we accidentally write protect is some
> > > other last level spte - this is benign, it will be unprotected on next
> > > fault.
> >
> > Nothing forbids two identical writable sptes to point to a same pfn. How
> > do you know you are write protecting the correct one? (the proper gfn).
> >
> I am not sure I understand what you mean. If the spt was freed and reallocated
> while the lockless shadow page walk happened, we are definitely write protecting
> the incorrect one, but this is only a problem if the spte that is write
> protected is not last level, and there are ways to solve that.
I was assuming that cmpxchg success on the wrong spte would be problematic, but
Xiao says it's detectable on the lockless rmap path.
> > Lockless walk sounds interesting. By the time you get to the lower
> > level, that might be a different spte.
> >
> > All of this to avoid throttling, is it worthwhile?
> >
> I do not like kvm->arch.rcu_free_shadow_page, it feels like a hack. I
> proposed the slab RCU solution before, but it needs some more work to encode
> the pt level into the spte, so Xiao wanted to do it on top. But since something,
> either throttling or slab RCU, needs to be implemented anyway, I prefer
> the latter.
Yes, seems OK to me.