2013-10-23 13:29:58

by Xiao Guangrong

Subject: [PATCH v3 00/15] KVM: MMU: locklessly write-protect

Changelog v3:
- the changes from Gleb's review:
1) drop the patch that fixed the spte count in the rmap since it
can not be easily fixed and the problem is gone after applying this patchset

- ideas from Gleb and the discussion with Marcelo are also very much appreciated:
2) change the way shadow pages are accessed locklessly - use SLAB_DESTROY_BY_RCU
to protect shadow pages instead of conditionally using call_rcu()
3) improve is_last_spte() so that it checks for a last spte using only some bits
in the spte itself; it can then be used safely when we locklessly write-protect
the shadow page table

Changelog v2:

- the changes from Gleb's review:
1) fix the calculation of the number of sptes in pte_list_add()
2) set iter->desc to NULL when a nulls desc is met to clean up the code of
rmap_get_next()
3) fix hlist corruption due to accessing sp->hlist outside of mmu-lock
4) use rcu functions to access the rcu-protected pointer
5) a spte can be missed by the lockless walker if it is moved within a desc
(removing a spte from a rmap that uses only one desc). Fix it by walking
the desc bottom-up

- the changes from Paolo's review
1) make the ordering and memory barriers between updating the spte / adding the
spte into the rmap and the dirty log clearer

- the changes from Marcelo's review:
1) let fast page fault only fix the sptes on the last level (level = 1)
2) improve some changelogs and comments

- the changes from Takuya's review:
move the patch "flush tlb if the spte can be locklessly modified" forward
to make it easier to merge

Thank you all very much for your time and patience with this patchset!

Since we use rcu_assign_pointer() to update the pointers in the desc even if dirty
logging is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 36G memory

- migrate-perf (benchmark the time of get-dirty-log)
before: Run 10 times, Avg time:9009483 ns.
after: Run 10 times, Avg time:4807343 ns.

- kernbench
Guest: 12 VCPUs + 8G memory
before:
EPT is enabled:
# cat 09-05-origin-ept | grep real
real 85.58
real 83.47
real 82.95

EPT is disabled:
# cat 09-05-origin-shadow | grep real
real 138.77
real 138.99
real 139.55

after:
EPT is enabled:
# cat 09-05-lockless-ept | grep real
real 83.40
real 82.81
real 83.39

EPT is disabled:
# cat 09-05-lockless-shadow | grep real
real 138.91
real 139.71
real 138.94

No performance regression!



Background
==========
Currently, when we mark a memslot as dirty-logged or get the dirty pages, we need
to write-protect a large amount of guest memory. This is heavy work, especially
since we need to hold mmu-lock, which is also required by the vcpu to fix its page
table faults and by the mmu-notifier when a host page is being changed. For a
guest with extreme cpu / memory usage, this becomes a scalability issue.

This patchset introduces a way to locklessly write-protect guest memory.

Idea
==========
Here are the challenges we met and the ideas used to resolve them.

1) How to locklessly walk the rmap?
Our first idea for preventing a "desc" from being freed while we are walking the
rmap was to use RCU. But when a vcpu runs in shadow page mode or nested mmu mode,
it updates the rmap very frequently.

So we use SLAB_DESTROY_BY_RCU to manage the "desc" instead, which allows the
object to be reused more quickly. We also store a "nulls" marker in the last
"desc" (desc->more) which helps us detect whether the "desc" has been moved to
another rmap, so that we can re-walk the rmap if that happened. I learned this
idea from the nulls-list.

Another issue is that when a spte is deleted from a "desc", another spte in the
last "desc" is moved to this position to replace the deleted one. If the deleted
one has already been visited and we do not visit the replaced one, the replaced
one is missed by the lockless walk.
To fix this case, we do not move the spte backward; instead, we move the entry
forward: when a spte is deleted, we move the entry in the first desc to that
position. A minimal sketch of the resulting lockless walk is shown below.
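
To make the re-walk idea concrete, here is a minimal sketch of the nulls-based
lockless walk. It mirrors the pte_list_walk_lockless() helper added later in this
series (patch 08); the function name rmap_walk_lockless() and the inlined helpers
are only for illustration, and the caller is assumed to hold rcu_read_lock():

static bool desc_is_a_nulls(struct pte_list_desc *desc)
{
	/* the low bit marks a nulls value instead of a real desc pointer */
	return (unsigned long)desc & 1;
}

static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
{
	return (unsigned long *)((unsigned long)desc & ~1ul);
}

static void rmap_walk_lockless(unsigned long *pte_list, void (*fn)(u64 *sptep))
{
	struct pte_list_desc *desc;
	unsigned long value;
	int i;

restart:
	value = *rcu_dereference(pte_list);
	if (!value)
		return;

	if (!(value & 1)) {		/* a single spte, no desc */
		fn((u64 *)value);
		return;
	}

	desc = (struct pte_list_desc *)(value & ~1ul);
	while (!desc_is_a_nulls(desc)) {
		/* walk from the higher indices, see the comment in patch 08 */
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (desc->sptes[i])
				fn(desc->sptes[i]);

		desc = rcu_dereference(desc->more);
		if (unlikely(!desc))	/* the desc is being re-initialized */
			goto restart;
	}

	/* the nulls value points to another rmap: the desc was reused, re-walk */
	if (desc_get_nulls_value(desc) != pte_list)
		goto restart;
}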

2) How to locklessly access the shadow page table?
It is easy if the handler is in vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in ioctl
context and the paths we are optimizing have a heavy workload, so disabling
interrupts is not good for system performance.

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then use
call_rcu() to free the shadow page if that indicator is set. Setting/clearing the
indicator is protected by slots_lock, so it need not be atomic and does not hurt
performance or scalability. A rough sketch of this idea follows.
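
For reference, a rough sketch of that indicator scheme (the helper names, the
sp->rcu field and free_shadow_page_rcu() are made up for illustration; note the
v3 changelog above says this was ultimately replaced by protecting shadow pages
with SLAB_DESTROY_BY_RCU):

/* called with slots_lock held around the lockless write-protection path */
static void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
{
	kvm->arch.rcu_free_shadow_page = true;	/* plain store: slots_lock held */
	rcu_read_lock();
}

static void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
{
	rcu_read_unlock();
	kvm->arch.rcu_free_shadow_page = false;
}

/* on the zap side, under mmu-lock */
static void mmu_free_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	if (kvm->arch.rcu_free_shadow_page)
		/* sp->rcu and free_shadow_page_rcu() are illustrative only */
		call_rcu(&sp->rcu, free_shadow_page_rcu);
	else
		kvm_mmu_free_page(sp);
}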

3) How to locklessly write-protect guest memory?
Currently, there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit on the spte and the other is dropping the spte when it
points to a large page. The former is easy, we only need to atomically clear a
bit, but the latter is hard since we need to remove the spte from the rmap. So we
unify these two behaviors and only make the spte readonly. Making a large spte
readonly instead of nonpresent is also good for reducing jitter.

We also need to pay attention to the order of making the spte writable, adding
the spte into the rmap and setting the corresponding bit in the dirty bitmap,
since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the dirty
bitmap: we should ensure the writable spte can be found in the rmap before the
dirty bitmap is visible. Otherwise, we would clear the dirty bitmap but fail to
write-protect the page. This ordering is sketched below.
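
A condensed sketch of the required ordering (it abstracts what patch 05 does in
set_spte() and kvm_vm_ioctl_get_dirty_log(); the two wrapper functions and
write_protect_gfns_in_mask() are illustrative names only):

/* fault side: runs when a writable spte is installed */
static void fault_side(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte, gfn_t gfn)
{
	mmu_spte_update(sptep, spte);      /* 1) make the spte writable        */
	rmap_add(vcpu, sptep, gfn);        /* 2) make it reachable via rmap    */
	smp_wmb();                         /*    order 1) and 2) before 3)     */
	mark_page_dirty(vcpu->kvm, gfn);   /* 3) set the bit in dirty bitmap   */
}

/* get-dirty-log side: for each word of the dirty bitmap */
static void dirty_log_side(struct kvm *kvm, unsigned long *dirty_bitmap, int i)
{
	unsigned long mask;

	/* xchg() is a full barrier, it implies the needed smp_rmb() */
	mask = xchg(&dirty_bitmap[i], 0);

	/*
	 * Only now walk the rmaps covered by "mask" and write-protect the
	 * sptes; the ordering above guarantees every writable spte whose
	 * dirty bit we just collected is already visible in the rmap.
	 */
	write_protect_gfns_in_mask(kvm, mask, i);   /* illustrative helper */
}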

Performance result
====================
The performance result and the benchmark can be found at:
http://permalink.gmane.org/gmane.linux.kernel/1534876

Xiao Guangrong (15):
KVM: MMU: properly check last spte in fast_page_fault()
KVM: MMU: lazily drop large spte
KVM: MMU: flush tlb if the spte can be locklessly modified
KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
KVM: MMU: update spte and add it into rmap before dirty log
KVM: MMU: redesign the algorithm of pte_list
KVM: MMU: introduce nulls desc
KVM: MMU: introduce pte-list lockless walker
KVM: MMU: initialize the pointers in pte_list_desc properly
KVM: MMU: allocate shadow pages from slab
KVM: MMU: locklessly access shadow page under rcu protection
KVM: MMU: check last spte with unawareness of mapping level
KVM: MMU: locklessly write-protect the page
KVM: MMU: clean up spte_write_protect
KVM: MMU: use rcu functions to access the pointer

arch/x86/include/asm/kvm_host.h | 7 +-
arch/x86/kvm/mmu.c | 586 ++++++++++++++++++++++++++++++----------
arch/x86/kvm/mmu.h | 6 +
arch/x86/kvm/mmu_audit.c | 6 +-
arch/x86/kvm/paging_tmpl.h | 6 +-
arch/x86/kvm/x86.c | 34 ++-
6 files changed, 475 insertions(+), 170 deletions(-)

--
1.8.1.4


2013-10-23 13:30:01

by Xiao Guangrong

Subject: [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list

Change the algorithm to:
1) always add the new desc to the first desc (pointed to by parent_ptes/rmap),
which is good for implementing rcu-nulls-list-like lockless rmap
walking

2) always move an entry in the first desc to the position we want
to remove when deleting a spte from the parent_ptes/rmap.
This is good for implementing the lockless rmap walk since in the current
code, when a spte is deleted from a "desc", another spte in the last
"desc" is moved to this position to replace the deleted one. If the
deleted one has already been visited and we do not visit the replaced one,
the replaced one is missed by the lockless walk.
To fix this case, we do not move the spte backward; instead, we move
the entry forward: when a spte is deleted, we move the entry in the first
desc to that position

Both of these also reduce cache misses

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 179 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 123 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e85eed6..5cce039 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}

+static int __find_first_free(struct pte_list_desc *desc)
+{
+ int i;
+
+ for (i = 0; i < PTE_LIST_EXT; i++)
+ if (!desc->sptes[i])
+ break;
+ return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+ int free = __find_first_free(desc);
+
+ WARN_ON(free >= PTE_LIST_EXT);
+ return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+ int used = __find_first_free(desc) - 1;
+
+ WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+ return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+ int first_free, desc_num;
+
+ first_free = __find_first_free(desc);
+
+ for (desc_num = 0; desc->more; desc = desc->more)
+ desc_num++;
+
+ return first_free + desc_num * PTE_LIST_EXT;
+}
+
/*
* Pte mapping structures:
*
@@ -923,98 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
*
* Returns the number of pte entries before the spte was added or zero if
* the spte was not added.
- *
*/
static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
{
struct pte_list_desc *desc;
- int i, count = 0;
+ int free_pos;

if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
- } else if (!(*pte_list & 1)) {
+ return 0;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
- ++count;
- } else {
- rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
- desc = desc->more;
- count += PTE_LIST_EXT;
- }
- if (desc->sptes[PTE_LIST_EXT-1]) {
- desc->more = mmu_alloc_pte_list_desc(vcpu);
- desc = desc->more;
- }
- for (i = 0; desc->sptes[i]; ++i)
- ++count;
- desc->sptes[i] = spte;
+ return 1;
}
- return count;
+
+ rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+ /* No empty entry in the desc. */
+ if (desc->sptes[PTE_LIST_EXT - 1]) {
+ struct pte_list_desc *new_desc;
+ new_desc = mmu_alloc_pte_list_desc(vcpu);
+ new_desc->more = desc;
+ desc = new_desc;
+ *pte_list = (unsigned long)desc | 1;
+ }
+
+ free_pos = find_first_free(desc);
+ desc->sptes[free_pos] = spte;
+ return count_spte_number(desc) - 1;
}

static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
- int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+ struct pte_list_desc *desc, int i)
{
- int j;
+ struct pte_list_desc *first_desc;
+ int last_used;

- for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
- ;
- desc->sptes[i] = desc->sptes[j];
- desc->sptes[j] = NULL;
- if (j != 0)
+ first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ last_used = find_last_used(first_desc);
+
+ /*
+ * Move the entry from the first desc to this position we want
+ * to remove.
+ */
+ desc->sptes[i] = first_desc->sptes[last_used];
+ first_desc->sptes[last_used] = NULL;
+
+ /* No valid entry in this desc, we can free this desc now. */
+ if (!first_desc->sptes[0]) {
+ struct pte_list_desc *next_desc = first_desc->more;
+
+ /*
+ * Only one entry existing but still use a desc to store it?
+ */
+ WARN_ON(!next_desc);
+
+ mmu_free_pte_list_desc(first_desc);
+ *pte_list = (unsigned long)next_desc | 1ul;
return;
- if (!prev_desc && !desc->more)
- *pte_list = (unsigned long)desc->sptes[0];
- else
- if (prev_desc)
- prev_desc->more = desc->more;
- else
- *pte_list = (unsigned long)desc->more | 1;
- mmu_free_pte_list_desc(desc);
+ }
+
+ /*
+ * Only one entry in this desc, move the entry to the head
+ * then the desc can be freed.
+ */
+ if (!first_desc->sptes[1] && !first_desc->more) {
+ *pte_list = (unsigned long)first_desc->sptes[0];
+ mmu_free_pte_list_desc(first_desc);
+ }
}

static void pte_list_remove(u64 *spte, unsigned long *pte_list)
{
struct pte_list_desc *desc;
- struct pte_list_desc *prev_desc;
int i;

if (!*pte_list) {
- printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
+ pr_err("pte_list_remove: %p 0->BUG\n", spte);
BUG();
- } else if (!(*pte_list & 1)) {
+ return;
+ }
+
+ if (!(*pte_list & 1)) {
rmap_printk("pte_list_remove: %p 1->0\n", spte);
if ((u64 *)*pte_list != spte) {
- printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
+ pr_err("pte_list_remove: %p 1->BUG\n", spte);
BUG();
}
*pte_list = 0;
- } else {
- rmap_printk("pte_list_remove: %p many->many\n", spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- prev_desc = NULL;
- while (desc) {
- for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
- if (desc->sptes[i] == spte) {
- pte_list_desc_remove_entry(pte_list,
- desc, i,
- prev_desc);
- return;
- }
- prev_desc = desc;
- desc = desc->more;
- }
- pr_err("pte_list_remove: %p many->many\n", spte);
- BUG();
+ return;
}
+
+ rmap_printk("pte_list_remove: %p many->many\n", spte);
+ desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ while (desc) {
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ if (desc->sptes[i] == spte) {
+ pte_list_desc_remove_entry(pte_list,
+ desc, i);
+ return;
+ }
+ desc = desc->more;
+ }
+
+ pr_err("pte_list_remove: %p many->many\n", spte);
+ BUG();
}

typedef void (*pte_list_walk_fn) (u64 *spte);
--
1.8.1.4

2013-10-23 13:30:09

by Xiao Guangrong

Subject: [PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly

Since pte_list_desc will be accessed locklessly, we need to atomically initialize
its pointers so that the lockless walker cannot read a partial value from a
pointer.

In this patch we initialize the pointers with plain pointer assignments, which
are always atomic, instead of using kmem_cache_zalloc.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a864140..f3ae74e6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}

static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
{
void *obj;

if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
- obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+ obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;

r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
- pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+ pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+ GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
- mmu_page_header_cache, 4);
+ mmu_page_header_cache, 4,
+ GFP_KERNEL | __GFP_ZERO);
out:
return r;
}
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}

+static void pte_list_desc_ctor(void *p)
+{
+ struct pte_list_desc *desc = p;
+ int i;
+
+ for (i = 0; i < PTE_LIST_EXT; i++)
+ desc->sptes[i] = NULL;
+
+ desc->more = NULL;
+}
+
static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
*/
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+ first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
}
@@ -4663,8 +4678,8 @@ static void mmu_destroy_caches(void)
int kvm_mmu_module_init(void)
{
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
- sizeof(struct pte_list_desc),
- 0, SLAB_DESTROY_BY_RCU, NULL);
+ sizeof(struct pte_list_desc),
+ 0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;

--
1.8.1.4

2013-10-23 13:30:14

by Xiao Guangrong

Subject: [PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer

Use rcu_assign_pointer() to update all the pointers in the desc
and use rcu_dereference() to locklessly read the pointers.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e4b941..68dac26 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -937,12 +937,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
}

+#define rcu_assign_pte_list(pte_list_p, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value) \
+ rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
unsigned long marker;

marker = (unsigned long)pte_list | 1UL;
- desc->more = (struct pte_list_desc *)marker;
+ rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
}

static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -999,10 +1010,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
}

-#define rcu_assign_pte_list(pte_list_p, value) \
- rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
- (unsigned long *)(value))
-
/*
* Pte mapping structures:
*
@@ -1029,8 +1036,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
- desc->sptes[0] = (u64 *)*pte_list;
- desc->sptes[1] = spte;
+ rcu_assign_spte(desc->sptes[0], *pte_list);
+ rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1043,13 +1050,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
- new_desc->more = desc;
+ rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}

free_pos = find_first_free(desc);
- desc->sptes[free_pos] = spte;
+ rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
}

@@ -1067,8 +1074,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* Move the entry from the first desc to this position we want
* to remove.
*/
- desc->sptes[i] = first_desc->sptes[last_used];
- first_desc->sptes[last_used] = NULL;
+ rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+ rcu_assign_spte(first_desc->sptes[last_used], NULL);

/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1080,7 +1087,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));

mmu_free_pte_list_desc(first_desc);
- *pte_list = (unsigned long)next_desc | 1ul;
+ rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}

@@ -1089,8 +1096,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* then the desc can be freed.
*/
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
- *pte_list = (unsigned long)first_desc->sptes[0];
- first_desc->sptes[0] = NULL;
+ rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+ rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
}
@@ -1112,7 +1119,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
pr_err("pte_list_remove: %p 1->BUG\n", spte);
BUG();
}
- *pte_list = 0;
+ rcu_assign_pte_list(pte_list, 0);
return;
}

@@ -1184,9 +1191,12 @@ restart:
* used in the rmap when a spte is removed. Otherwise the
* moved entry will be missed.
*/
- for (i = PTE_LIST_EXT - 1; i >= 0; i--)
- if (desc->sptes[i])
- fn(desc->sptes[i]);
+ for (i = PTE_LIST_EXT - 1; i >= 0; i--) {
+ u64 *sptep = rcu_dereference(desc->sptes[i]);
+
+ if (sptep)
+ fn(sptep);
+ }

desc = rcu_dereference(desc->more);

--
1.8.1.4

2013-10-23 13:30:48

by Xiao Guangrong

Subject: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

Currently, when we mark a memslot as dirty-logged or get the dirty pages, we need
to write-protect a large amount of guest memory. This is heavy work, especially
since we need to hold mmu-lock, which is also required by the vcpu to fix its
page table faults and by the mmu-notifier when a host page is being changed. For
a guest with extreme cpu / memory usage, this becomes a scalability issue.

This patch introduces a way to locklessly write-protect guest memory.

Now that the lockless rmap walk, lockless shadow page table access and lockless
spte write-protection are ready, it is time to implement page
write-protection out of mmu-lock.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ---
arch/x86/kvm/mmu.c | 59 ++++++++++++++++++++++++++++++-----------
arch/x86/kvm/mmu.h | 6 +++++
arch/x86/kvm/x86.c | 11 ++++----
4 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df9ae10..cdb6f29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);

void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask);
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8b96d96..d82bbec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
return flush;
}

-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+ u64 spte;
+
+retry:
+ /*
+ * Note we may partially read the spte on a 32bit host; however, we
+ * allow this case because:
+ * - we do not access the page obtained from the spte.
+ * - cmpxchg64 can detect that case and avoid setting a wrong value
+ * to the sptep.
+ */
+ spte = *rcu_dereference(sptep);
+ if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
+ return;
+
+ if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+ return;
+
+ goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+ pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
* @kvm: kvm instance
* @slot: slot to protect
* @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
* Used when we do not need to care about huge page mappings: e.g. during dirty
* logging we do not have any such mappings.
*/
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
{
unsigned long *rmapp;

while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_write_protect_lockless(rmapp);

/* clear the first set bit */
mask &= mask - 1;
@@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
}

-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
{
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;

- spin_lock(&kvm->mmu_lock);
-
+ rcu_read_lock();
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);

for (index = 0; index <= last_index; ++index, ++rmapp) {
- if (*rmapp)
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_write_protect_lockless(rmapp);

- if (need_resched() || spin_needbreak(&kvm->mmu_lock))
- cond_resched_lock(&kvm->mmu_lock);
+ if (need_resched()) {
+ rcu_read_unlock();
+ cond_resched();
+ rcu_read_lock();
+ }
}
}
-
- spin_unlock(&kvm->mmu_lock);
+ rcu_read_unlock();

/*
* We can flush all the TLBs out of the mmu lock without TLB
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..33f313b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,10 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
}

void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ac3a27..c6233e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3554,8 +3554,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
memset(dirty_bitmap_buffer, 0, n);

- spin_lock(&kvm->mmu_lock);
-
+ rcu_read_lock();
for (i = 0; i < n / sizeof(long); i++) {
unsigned long mask;
gfn_t offset;
@@ -3579,10 +3578,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
dirty_bitmap_buffer[i] = mask;

offset = i * BITS_PER_LONG;
- kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+ kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
+ offset, mask);
}
-
- spin_unlock(&kvm->mmu_lock);
+ rcu_read_unlock();

/*
* All the TLBs can be flushed out of mmu lock, see the comments in
@@ -7246,7 +7245,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* See the comments in fast_page_fault().
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
- kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+ kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
}

void kvm_arch_flush_shadow_all(struct kvm *kvm)
--
1.8.1.4

2013-10-23 13:31:13

by Xiao Guangrong

Subject: [PATCH v3 14/15] KVM: MMU: clean up spte_write_protect

Now the only user of spte_write_protect() is rmap_write_protect(), which
always calls spte_write_protect() with pt_protect = true, so drop
@pt_protect and the unused parameter @kvm

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d82bbec..3e4b941 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1340,8 +1340,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
}

/*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1352,25 +1351,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
*
* Return true if tlb need be flushed.
*/
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
{
u64 spte = *sptep;

if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (pt_protect)
- spte &= ~SPTE_MMU_WRITEABLE;
- spte = spte & ~PT_WRITABLE_MASK;
+ spte &= ~SPTE_MMU_WRITEABLE;
+ spte &= ~PT_WRITABLE_MASK;

return mmu_spte_update(sptep, spte);
}

-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
- bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1379,7 +1376,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));

- flush |= spte_write_protect(kvm, sptep, pt_protect);
+ flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}

@@ -1454,7 +1451,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmapp, true);
+ write_protected |= __rmap_write_protect(rmapp);
}

return write_protected;
--
1.8.1.4

2013-10-23 13:31:50

by Xiao Guangrong

Subject: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

Allocate shadow pages from a slab instead of the page allocator, so that
frequent shadow page allocations and frees can hit the slab cache; this is
very useful for the shadow mmu

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++-------
2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5cbf316..df9ae10 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;

struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
+ struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

@@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
{
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

- return (struct kvm_mmu_page *)page_private(page);
+ return (struct kvm_mmu_page *)(page->mapping);
}

static inline u16 kvm_read_ldt(void)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3ae74e6..1bcc8c8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
__shadow_walk_next(&(_walker), spte))

static struct kmem_cache *pte_list_desc_cache;
+static struct kmem_cache *mmu_shadow_page_cache;
static struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;

@@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
GFP_KERNEL);
if (r)
goto out;
- r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
+
+ r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ mmu_shadow_page_cache, 4,
+ GFP_KERNEL);
+ if (r)
+ goto out;
+
+ r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
@@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
pte_list_desc_cache);
+ mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ mmu_shadow_page_cache);
mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
mmu_page_header_cache);
@@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
}

+static void set_page_header(struct kvm_mmu_page *sp)
+{
+ struct page *page = virt_to_page(sp->spt);
+
+ WARN_ON(page->mapping);
+ page->mapping = (struct address_space *)sp;
+}
+
+static void clear_page_header(struct kvm_mmu_page *sp)
+{
+ struct page *page = virt_to_page(sp->spt);
+
+ page->mapping = NULL;
+}
+
static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
{
ASSERT(is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
list_del(&sp->link);
- free_page((unsigned long)sp->spt);
+ clear_page_header(sp);
+ kmem_cache_free(mmu_shadow_page_cache, sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
kmem_cache_free(mmu_page_header_cache, sp);
@@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp;

sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
+ sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
if (!direct)
sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
- set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+ set_page_header(sp);

/*
* The active_mmu_pages list is the FIFO list, do not move the
@@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
}
}

-static void init_shadow_page_table(struct kvm_mmu_page *sp)
+static void init_shadow_page_table(void *p)
{
+ u64 *sptp = (u64 *)p;
int i;

for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
- sp->spt[i] = 0ull;
+ sptp[i] = 0ull;
}

static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2137,7 +2164,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
account_shadowed(vcpu->kvm, gfn);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
- init_shadow_page_table(sp);
trace_kvm_mmu_get_page(sp, true);
return sp;
}
@@ -4683,6 +4709,12 @@ int kvm_mmu_module_init(void)
if (!pte_list_desc_cache)
goto nomem;

+ mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
+ PAGE_SIZE, PAGE_SIZE, 0,
+ init_shadow_page_table);
+ if (!mmu_shadow_page_cache)
+ goto nomem;
+
mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
sizeof(struct kvm_mmu_page),
0, 0, NULL);
--
1.8.1.4

2013-10-23 13:32:17

by Xiao Guangrong

Subject: [PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection

Use SLAB_DESTROY_BY_RCU to prevent shadow pages from being freed out of the slab
while a lockless walker may still be accessing them, so that they can be
accessed locklessly while holding the rcu read lock

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1bcc8c8..5b42858 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4710,8 +4710,8 @@ int kvm_mmu_module_init(void)
goto nomem;

mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
- PAGE_SIZE, PAGE_SIZE, 0,
- init_shadow_page_table);
+ PAGE_SIZE, PAGE_SIZE, SLAB_DESTROY_BY_RCU,
+ init_shadow_page_table);
if (!mmu_shadow_page_cache)
goto nomem;

--
1.8.1.4

2013-10-23 13:32:43

by Xiao Guangrong

Subject: [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level

The sptes on the middle levels obey these rules:
- they are always writable
- they do not point to a process page, so SPTE_HOST_WRITEABLE has
no chance to be set on them

So we can check for a last spte using PT_WRITABLE_MASK and SPTE_HOST_WRITEABLE,
which can be read from the spte itself, and then is_last_spte() no longer
depends on the mapping level.

This is important for implementing lockless write-protection since only the
spte is available at that time.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 25 ++++++++++++-------------
arch/x86/kvm/mmu_audit.c | 6 +++---
arch/x86/kvm/paging_tmpl.h | 6 ++----
3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5b42858..8b96d96 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -337,13 +337,13 @@ static int is_rmap_spte(u64 pte)
return is_shadow_present_pte(pte);
}

-static int is_last_spte(u64 pte, int level)
+static int is_last_spte(u64 pte)
{
- if (level == PT_PAGE_TABLE_LEVEL)
- return 1;
- if (is_large_pte(pte))
- return 1;
- return 0;
+ /*
+ * All the sptes on the middle level are writable but
+ * SPTE_HOST_WRITEABLE is not set.
+ */
+ return !(is_writable_pte(pte) && !(pte & SPTE_HOST_WRITEABLE));
}

static pfn_t spte_to_pfn(u64 pte)
@@ -2203,7 +2203,7 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
u64 spte)
{
- if (is_last_spte(spte, iterator->level)) {
+ if (is_last_spte(spte)) {
iterator->level = 0;
return;
}
@@ -2255,15 +2255,14 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}

-static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
- u64 *spte)
+static bool mmu_page_zap_pte(struct kvm *kvm, u64 *spte)
{
u64 pte;
struct kvm_mmu_page *child;

pte = *spte;
if (is_shadow_present_pte(pte)) {
- if (is_last_spte(pte, sp->role.level)) {
+ if (is_last_spte(pte)) {
drop_spte(kvm, spte);
if (is_large_pte(pte))
--kvm->stat.lpages;
@@ -2286,7 +2285,7 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
unsigned i;

for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
- mmu_page_zap_pte(kvm, sp, sp->spt + i);
+ mmu_page_zap_pte(kvm, sp->spt + i);
}

static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
@@ -3068,7 +3067,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
}

sp = page_header(__pa(iterator.sptep));
- if (!is_last_spte(spte, sp->role.level))
+ if (!is_last_spte(spte))
goto exit;

/*
@@ -4316,7 +4315,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
local_flush = true;
while (npte--) {
entry = *spte;
- mmu_page_zap_pte(vcpu->kvm, sp, spte);
+ mmu_page_zap_pte(vcpu->kvm, spte);
if (gentry &&
!((sp->role.word ^ vcpu->arch.mmu.base_role.word)
& mask.word) && rmap_can_add(vcpu))
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index daff69e..d54e2ad 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -45,7 +45,7 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
fn(vcpu, ent + i, level);

if (is_shadow_present_pte(ent[i]) &&
- !is_last_spte(ent[i], level)) {
+ !is_last_spte(ent[i])) {
struct kvm_mmu_page *child;

child = page_header(ent[i] & PT64_BASE_ADDR_MASK);
@@ -110,7 +110,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
}
}

- if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep, level))
+ if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep))
return;

gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
@@ -158,7 +158,7 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)

static void audit_sptes_have_rmaps(struct kvm_vcpu *vcpu, u64 *sptep, int level)
{
- if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep, level))
+ if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep))
inspect_spte_has_rmap(vcpu->kvm, sptep);
}

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ad75d77..33f0216 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -809,7 +809,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
{
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
- int level;
u64 *sptep;

vcpu_clear_mmio_info(vcpu, gva);
@@ -822,11 +821,10 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)

spin_lock(&vcpu->kvm->mmu_lock);
for_each_shadow_entry(vcpu, gva, iterator) {
- level = iterator.level;
sptep = iterator.sptep;

sp = page_header(__pa(sptep));
- if (is_last_spte(*sptep, level)) {
+ if (is_last_spte(*sptep)) {
pt_element_t gpte;
gpa_t pte_gpa;

@@ -836,7 +834,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
pte_gpa = FNAME(get_level1_sp_gpa)(sp);
pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);

- if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
+ if (mmu_page_zap_pte(vcpu->kvm, sptep))
kvm_flush_remote_tlbs(vcpu->kvm);

if (!rmap_can_add(vcpu))
--
1.8.1.4

2013-10-23 13:33:23

by Xiao Guangrong

Subject: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

This is like a nulls list: we use the pte-list pointer as the nulls value, which
helps us detect whether the "desc" has been moved to another rmap, so that we can
re-walk the rmap if that happened.

kvm->slots_lock is held when we do the lockless walk, which prevents the rmap
from being reused (freeing a rmap needs to hold that lock), so we cannot see the
same nulls value used on different rmaps.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
return level - 1;
}

+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+ unsigned long marker;
+
+ marker = (unsigned long)pte_list | 1UL;
+ desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+ return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+ return (unsigned long *)((unsigned long)desc & ~1);
+}
+
static int __find_first_free(struct pte_list_desc *desc)
{
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)

first_free = __find_first_free(desc);

- for (desc_num = 0; desc->more; desc = desc->more)
+ for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;

return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+ desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
* Only one entry existing but still use a desc to store it?
*/
- WARN_ON(!next_desc);
+ WARN_ON(desc_is_a_nulls(next_desc));

mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
* Only one entry in this desc, move the entry to the head
* then the desc can be freed.
*/
- if (!first_desc->sptes[1] && !first_desc->more) {
+ if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)

rmap_printk("pte_list_remove: %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
return fn((u64 *)*pte_list);

desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
+ while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+ WARN_ON(desc_get_nulls_value(desc) != pte_list);
}

static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)

iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+ WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
}

@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}

- iter->desc = iter->desc->more;
+ iter->desc = desc_is_a_nulls(iter->desc->more) ?
+ NULL : iter->desc->more;

if (iter->desc) {
iter->pos = 0;
--
1.8.1.4

2013-10-23 13:33:21

by Xiao Guangrong

Subject: [PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker

The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list.

Note that we should do a bottom-up walk in the desc since we always move
the bottom entry to the deleted position.

Thanks to SLAB_DESTROY_BY_RCU, the desc can be reused quickly.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4687329..a864140 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
}

+#define rcu_assign_pte_list(pte_list_p, value) \
+ rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), \
+ (unsigned long *)(value))
+
/*
* Pte mapping structures:
*
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,

if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
- *pte_list = (unsigned long)spte;
+ rcu_assign_pte_list(pte_list, spte);
return 0;
}

@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
- *pte_list = (unsigned long)desc | 1;
+ rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}

@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
- *pte_list = (unsigned long)desc | 1;
+ rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}

free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
}

+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+ pte_list_walk_fn fn)
+{
+ struct pte_list_desc *desc;
+ unsigned long pte_list_value;
+ int i;
+
+restart:
+ /*
+ * Force the pte_list to be reloaded.
+ *
+ * See the comments in hlist_nulls_for_each_entry_rcu().
+ */
+ barrier();
+ pte_list_value = *rcu_dereference(pte_list);
+ if (!pte_list_value)
+ return;
+
+ if (!(pte_list_value & 1))
+ return fn((u64 *)pte_list_value);
+
+ desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+ while (!desc_is_a_nulls(desc)) {
+ /*
+ * We should do top-down walk since we always use the higher
+ * indices to replace the deleted entry if only one desc is
+ * used in the rmap when a spte is removed. Otherwise the
+ * moved entry will be missed.
+ */
+ for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+ if (desc->sptes[i])
+ fn(desc->sptes[i]);
+
+ desc = rcu_dereference(desc->more);
+
+ /* It is being initialized. */
+ if (unlikely(!desc))
+ goto restart;
+ }
+
+ if (unlikely(desc_get_nulls_value(desc) != pte_list))
+ goto restart;
+}
+
static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
{
@@ -4615,7 +4664,7 @@ int kvm_mmu_module_init(void)
{
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
- 0, 0, NULL);
+ 0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;

--
1.8.1.4

2013-10-23 13:34:09

by Xiao Guangrong

Subject: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
just change the spte from writable to readonly, so we only need to care about
the case of changing a spte from present to present (changing a spte from
present to nonpresent flushes all the TLBs immediately); in other words,
the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we check
SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 18 ++++++++++++++----
arch/x86/kvm/x86.c | 9 +++++++--
2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);

- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
- kvm_flush_remote_tlbs(kvm);
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
- }
}
}

- kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+ /*
+ * We can flush all the TLBs out of the mmu lock without TLB
+ * corruption since we just change the spte from writable to
+ * readonly so that we only need to care the case of changing
+ * spte from present to present (changing the spte from present
+ * to nonpresent will flush all the TLBs immediately), in other
+ * words, the only case we care about is mmu_spte_update() where we
+ * have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+ * instead of PT_WRITABLE_MASK, that means it does not depend
+ * on PT_WRITABLE_MASK anymore.
+ */
+ kvm_flush_remote_tlbs(kvm);
}

#define BATCH_ZAP_PAGES 10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
- if (is_dirty)
- kvm_flush_remote_tlbs(kvm);

spin_unlock(&kvm->mmu_lock);

+ /*
+ * All the TLBs can be flushed out of mmu lock, see the comments in
+ * kvm_mmu_slot_remove_write_access().
+ */
+ if (is_dirty)
+ kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
--
1.8.1.4

2013-10-23 13:34:31

by Xiao Guangrong

Subject: [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log

kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the dirty
bitmap, so we should ensure the writable spte can be found in the rmap before the
dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page, as detailed in the comments in this patch

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 84 ++++++++++++++++++++++++++++++++++++++++++------------
arch/x86/kvm/x86.c | 10 +++++++
2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 337d173..e85eed6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2427,6 +2427,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
{
u64 spte;
int ret = 0;
+ bool remap = is_rmap_spte(*sptep);

if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2488,12 +2489,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}

- if (pte_access & ACC_WRITE_MASK)
- mark_page_dirty(vcpu->kvm, gfn);
-
set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+ if (!remap) {
+ if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+ rmap_recycle(vcpu, sptep, gfn);
+
+ if (level > PT_PAGE_TABLE_LEVEL)
+ ++vcpu->kvm->stat.lpages;
+ }
+
+ /*
+ * The orders we require are:
+ * 1) set spte to writable __before__ set the dirty bitmap.
+ * It makes sure that dirty-logging is not missed when do
+ * live migration at the final step where kvm should stop
+ * the guest and push the remaining dirty pages got from
+ * dirty-bitmap to the destination. The similar cases are
+ * in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+ *
+ * 2) add the spte into the rmap __before__ setting the dirty bitmap.
+ *
+ * These ensure we can find the writable spte in the rmap
+ * when we do lockless write-protection, since
+ * kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+ * on the dirty bitmap; otherwise these cases can happen:
+ *
+ * CPU 0 CPU 1
+ * kvm ioctl doing get-dirty-pages
+ * mark_page_dirty(gfn) which
+ * set the gfn on the dirty maps
+ * mask = xchg(dirty_bitmap, 0)
+ *
+ * try to write-protect gfns which
+ * are set in "mask", then walk the
+ * rmap, see no spte on that rmap
+ * add the spte into rmap
+ *
+ * !!!!!! Then the page can be freely written but not recorded in
+ * the dirty bitmap.
+ *
+ * And:
+ *
+ * VCPU 0 CPU 1
+ * kvm ioctl doing get-dirty-pages
+ * mark_page_dirty(gfn) which
+ * set the gfn on the dirty maps
+ *
+ * add spte into rmap
+ * mask = xchg(dirty_bitmap, 0)
+ *
+ * try to write-protect gfns which
+ * are set in "mask", then walk the
+ * rmap, see the spte is on the rmap
+ * but it is readonly or nonpresent
+ * Mark spte writable
+ *
+ * !!!!!! Then the page can be freely written but not recorded in the
+ * dirty bitmap.
+ *
+ * See the comments in kvm_vm_ioctl_get_dirty_log().
+ */
+ smp_wmb();
+
+ if (pte_access & ACC_WRITE_MASK)
+ mark_page_dirty(vcpu->kvm, gfn);
done:
return ret;
}
@@ -2503,9 +2565,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
int level, gfn_t gfn, pfn_t pfn, bool speculative,
bool host_writable)
{
- int was_rmapped = 0;
- int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);

@@ -2527,8 +2586,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
- } else
- was_rmapped = 1;
+ }
}

if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2546,16 +2604,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
is_large_pte(*sptep)? "2MB" : "4kB",
*sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
*sptep, sptep);
- if (!was_rmapped && is_large_pte(*sptep))
- ++vcpu->kvm->stat.lpages;
-
- if (is_shadow_present_pte(*sptep)) {
- if (!was_rmapped) {
- rmap_count = rmap_add(vcpu, sptep, gfn);
- if (rmap_count > RMAP_RECYCLE_THRESHOLD)
- rmap_recycle(vcpu, sptep, gfn);
- }
- }

kvm_release_pfn_clean(pfn);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 573c6b3..4ac3a27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3566,6 +3566,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
is_dirty = true;

mask = xchg(&dirty_bitmap[i], 0);
+ /*
+ * smp_rmb();
+ *
+ * The read-barrier is implied in xchg() acting as a
+ * full barrier and it ensures getting dirty bitmap
+ * is before walking the rmap and spte write-protection.
+ *
+ * See the comments in set_spte().
+ */
+
dirty_bitmap_buffer[i] = mask;

offset = i * BITS_PER_LONG;
--
1.8.1.4

2013-10-23 13:29:56

by Xiao Guangrong

Subject: [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault()

Use sp->role.level instead of @level since @level is not obtained from the
page table hierarchy.

There is no issue in the current code since fast page fault currently only
fixes faults caused by dirty logging, which is always on the last level
(level = 1).

This patch makes the code more readable and avoids a potential issue in
further development.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40772ef..d2aacc2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2798,9 +2798,9 @@ static bool page_fault_can_be_fast(u32 error_code)
}

static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep, u64 spte)
{
- struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;

WARN_ON(!sp->role.direct);
@@ -2826,6 +2826,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
u32 error_code)
{
struct kvm_shadow_walk_iterator iterator;
+ struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;

@@ -2846,7 +2847,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;
}

- if (!is_last_spte(spte, level))
+ sp = page_header(__pa(iterator.sptep));
+ if (!is_last_spte(spte, sp->role.level))
goto exit;

/*
@@ -2872,7 +2874,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
*/
- ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+ ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
spte, ret);
--
1.8.1.4

2013-10-23 13:35:17

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 02/15] KVM: MMU: lazily drop large spte

Currently, kvm zaps the large spte if write protection is needed, so a later
read can fault on that spte. Instead, we can make the large spte readonly
rather than un-present, so the page fault caused by read access can be
avoided

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

This version fixes the issue reported in 6b73a9606: fast_page_fault()
directly set the readonly large spte writable but dirtied only the first
page in the dirty bitmap, so the other pages covered by the large spte were
missed. Fix it by allowing only normal sptes (on the PT_PAGE_TABLE_LEVEL
level) to be fast fixed
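
For reference, the write-fix path records only a single gfn; a simplified
sketch of fast_pf_fix_direct_spte() (the exact code in mmu.c may differ
slightly):

static bool
fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                        u64 *sptep, u64 spte)
{
        gfn_t gfn;

        /* Atomically restore write access; bail out if the spte changed. */
        if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
                return false;

        /*
         * Only the single gfn backing this sptep is recorded in the dirty
         * bitmap, which is why a large (2MB) spte must not be fixed here:
         * the other pages it covers would be missed.
         */
        gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
        mark_page_dirty(vcpu->kvm, gfn);

        return true;
}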

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
arch/x86/kvm/x86.c | 8 ++++++--
2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d2aacc2..8739208 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1176,8 +1176,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

/*
* Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1186,10 +1185,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
* - for spte protection, the spte can be writable only after unsync-ing
* shadow page.
*
- * Return true if the spte is dropped.
+ * Return true if the tlb needs to be flushed.
*/
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
u64 spte = *sptep;

@@ -1199,17 +1197,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (__drop_large_spte(kvm, sptep)) {
- *flush |= true;
- return true;
- }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;

- *flush |= mmu_spte_update(sptep, spte);
- return false;
+ return mmu_spte_update(sptep, spte);
}

static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1221,11 +1213,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
- sptep = rmap_get_first(*rmapp, &iter);
- continue;
- }

+ flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}

@@ -2669,6 +2658,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
break;
}

+ drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;

@@ -2870,6 +2861,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;

/*
+ * Do not fix write permission on the large spte since we only dirty
+ * the first page into the dirty-bitmap in fast_pf_fix_direct_spte(),
+ * which means other pages are missed if its slot is dirty-logged.
+ *
+ * Instead, we let the slow page fault path create a normal spte to
+ * fix the access.
+ *
+ * See the comments in kvm_arch_commit_memory_region().
+ */
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ goto exit;
+
+ /*
* Currently, fast page fault only works for direct mapping since
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edf2a07..b3aa650 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7223,8 +7223,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
* Write protect all pages for dirty logging.
- * Existing largepage mappings are destroyed here and new ones will
- * not be created until the end of the logging.
+ *
+ * All the sptes including the large sptes which point to this
+ * slot are set to readonly. We can not create any new large
+ * spte on this slot until the end of the logging.
+ *
+ * See the comments in fast_page_fault().
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
kvm_mmu_slot_remove_write_access(kvm, mem->slot);
--
1.8.1.4

2013-10-23 13:35:14

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified

Relax the tlb flush condition since we will write-protect the spte outside of
mmu-lock. Note that lockless write-protection only marks a writable spte
readonly, and a spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is tested by spte_is_locklessly_modifiable())
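
For reference, the predicate mentioned above looks roughly like this
(a sketch; see mmu.c for the exact definition):

static bool spte_is_locklessly_modifiable(u64 spte)
{
        /* Both bits must be set for the lockless path to restore write access. */
        return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
                (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}

So the flush decision below keys off this test rather than is_writable_pte(),
because a currently-readonly spte may still be made writable again without
holding mmu-lock.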

This patch is used to avoid this kind of race:

VCPU 0 VCPU 1
lockless write protection:
set spte.w = 0
lock mmu-lock

write protect the spte to sync the shadow page,
see spte.w = 0, then skip the tlb flush

unlock mmu-lock

!!! At this point, the shadow page can still be
writable due to the stale tlb entry
Flush all TLB

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8739208..62f18ec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* we always atomicly update it, see the comments in
* spte_has_volatile_bits().
*/
- if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+ if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;

if (!shadow_accessed_mask)
--
1.8.1.4

2013-10-24 09:17:40

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

On Wed, Oct 23, 2013 at 09:29:31PM +0800, Xiao Guangrong wrote:
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we
> need to hold mmu-lock which is also required by vcpu to fix its page table
> fault and mmu-notifier when host page is being changed. In the extreme
> cpu / memory used guest, it becomes a scalability issue
>
> This patch introduces a way to locklessly write-protect guest memory
>
> Now, lockless rmap walk, lockless shadow page table access and lockless
> spte wirte-protection are ready, it is the time to implements page
> write-protection out of mmu-lock
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ---
> arch/x86/kvm/mmu.c | 59 ++++++++++++++++++++++++++++++-----------
> arch/x86/kvm/mmu.h | 6 +++++
> arch/x86/kvm/x86.c | 11 ++++----
> 4 files changed, 55 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index df9ae10..cdb6f29 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> u64 dirty_mask, u64 nx_mask, u64 x_mask);
>
> void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - gfn_t gfn_offset, unsigned long mask);
> void kvm_mmu_zap_all(struct kvm *kvm);
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
> unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 8b96d96..d82bbec 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> return flush;
> }
>
> -/**
> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> +static void __rmap_write_protect_lockless(u64 *sptep)
> +{
> + u64 spte;
> +
> +retry:
> + /*
> + * Note we may partly read the sptep on 32bit host, however, we
> + * allow this case because:
> + * - we do not access the page got from the sptep.
> + * - cmpxchg64 can detect that case and avoid setting a wrong value
> + * to the sptep.
> + */
> + spte = *rcu_dereference(sptep);
> + if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
is_last_spte gets two parameters.

> + return;
> +
> + if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
> + return;
> +
> + goto retry;
> +}
> +
> +static void rmap_write_protect_lockless(unsigned long *rmapp)
> +{
> + pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
> +}
> +
> +/*
> + * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
> + * pages out of mmu-lock.
> + *
> * @kvm: kvm instance
> * @slot: slot to protect
> * @gfn_offset: start of the BITS_PER_LONG pages we care about
> @@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> * Used when we do not need to care about huge page mappings: e.g. during dirty
> * logging we do not have any such mappings.
> */
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> - struct kvm_memory_slot *slot,
> - gfn_t gfn_offset, unsigned long mask)
> +void
> +kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn_offset, unsigned long mask)
> {
> unsigned long *rmapp;
>
> while (mask) {
> rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
> PT_PAGE_TABLE_LEVEL, slot);
> - __rmap_write_protect(kvm, rmapp, false);
> + rmap_write_protect_lockless(rmapp);
>
> /* clear the first set bit */
> mask &= mask - 1;
> @@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
> init_kvm_mmu(vcpu);
> }
>
> -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> +void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
> {
> struct kvm_memory_slot *memslot;
> gfn_t last_gfn;
> @@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> memslot = id_to_memslot(kvm->memslots, slot);
> last_gfn = memslot->base_gfn + memslot->npages - 1;
>
> - spin_lock(&kvm->mmu_lock);
> -
> + rcu_read_lock();
> for (i = PT_PAGE_TABLE_LEVEL;
> i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
> unsigned long *rmapp;
> @@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
>
> for (index = 0; index <= last_index; ++index, ++rmapp) {
> - if (*rmapp)
> - __rmap_write_protect(kvm, rmapp, false);
> + rmap_write_protect_lockless(rmapp);
>
> - if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> - cond_resched_lock(&kvm->mmu_lock);
> + if (need_resched()) {
> + rcu_read_lock();
> + rcu_read_unlock();
> + }
> }
> }
> -
> - spin_unlock(&kvm->mmu_lock);
> + rcu_read_unlock();
>
> /*
> * We can flush all the TLBs out of the mmu lock without TLB
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 2926152..33f313b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -117,4 +117,10 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
> }
>
> void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
> +
> +void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
> +void
> +kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn_offset, unsigned long mask);
> #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4ac3a27..c6233e1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3554,8 +3554,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
> dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> memset(dirty_bitmap_buffer, 0, n);
>
> - spin_lock(&kvm->mmu_lock);
> -
> + rcu_read_lock();
> for (i = 0; i < n / sizeof(long); i++) {
> unsigned long mask;
> gfn_t offset;
> @@ -3579,10 +3578,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
> dirty_bitmap_buffer[i] = mask;
>
> offset = i * BITS_PER_LONG;
> - kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
> + kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
> + offset, mask);
> }
> -
> - spin_unlock(&kvm->mmu_lock);
> + rcu_read_unlock();
>
> /*
> * All the TLBs can be flushed out of mmu lock, see the comments in
> @@ -7246,7 +7245,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> * See the comments in fast_page_fault().
> */
> if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> - kvm_mmu_slot_remove_write_access(kvm, mem->slot);
> + kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
> }
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> --
> 1.8.1.4

--
Gleb.

2013-10-24 09:19:13

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On Wed, Oct 23, 2013 at 09:29:28PM +0800, Xiao Guangrong wrote:
> Allocate shadow pages from slab instead of page-allocator, frequent
> shadow page allocation and free can be hit in the slab cache, it is
> very useful for shadow mmu
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu.c | 46 ++++++++++++++++++++++++++++++++++-------
> 2 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5cbf316..df9ae10 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu *walk_mmu;
>
> struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> + struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> struct kvm_mmu_memory_cache mmu_page_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> {
> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>
> - return (struct kvm_mmu_page *)page_private(page);
> + return (struct kvm_mmu_page *)(page->mapping);
Why?

> }
>
> static inline u16 kvm_read_ldt(void)
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f3ae74e6..1bcc8c8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
> __shadow_walk_next(&(_walker), spte))
>
> static struct kmem_cache *pte_list_desc_cache;
> +static struct kmem_cache *mmu_shadow_page_cache;
> static struct kmem_cache *mmu_page_header_cache;
> static struct percpu_counter kvm_total_used_mmu_pages;
>
> @@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
> GFP_KERNEL);
> if (r)
> goto out;
> - r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
> +
> + r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + mmu_shadow_page_cache, 4,
> + GFP_KERNEL);
> + if (r)
> + goto out;
> +
> + r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
> if (r)
> goto out;
> r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> @@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> pte_list_desc_cache);
> + mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> + mmu_shadow_page_cache);
> mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
> mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
> mmu_page_header_cache);
> @@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> }
>
> +static void set_page_header(struct kvm_mmu_page *sp)
> +{
> + struct page *page = virt_to_page(sp->spt);
> +
> + WARN_ON(page->mapping);
> + page->mapping = (struct address_space *)sp;
> +}
> +
> +static void clear_page_header(struct kvm_mmu_page *sp)
> +{
> + struct page *page = virt_to_page(sp->spt);
> +
> + page->mapping = NULL;
> +}
> +
> static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> {
> ASSERT(is_empty_shadow_page(sp->spt));
> hlist_del(&sp->hash_link);
> list_del(&sp->link);
> - free_page((unsigned long)sp->spt);
> + clear_page_header(sp);
> + kmem_cache_free(mmu_shadow_page_cache, sp->spt);
> if (!sp->role.direct)
> free_page((unsigned long)sp->gfns);
> kmem_cache_free(mmu_page_header_cache, sp);
> @@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp;
>
> sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
> + sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> if (!direct)
> sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
> - set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> + set_page_header(sp);
>
> /*
> * The active_mmu_pages list is the FIFO list, do not move the
> @@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> }
> }
>
> -static void init_shadow_page_table(struct kvm_mmu_page *sp)
> +static void init_shadow_page_table(void *p)
> {
> + u64 *sptp = (u64 *)p;
> int i;
>
> for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
> - sp->spt[i] = 0ull;
> + sptp[i] = 0ull;
> }
>
> static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
> @@ -2137,7 +2164,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> account_shadowed(vcpu->kvm, gfn);
> }
> sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> - init_shadow_page_table(sp);
> trace_kvm_mmu_get_page(sp, true);
> return sp;
> }
> @@ -4683,6 +4709,12 @@ int kvm_mmu_module_init(void)
> if (!pte_list_desc_cache)
> goto nomem;
>
> + mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
> + PAGE_SIZE, PAGE_SIZE, 0,
> + init_shadow_page_table);
> + if (!mmu_shadow_page_cache)
> + goto nomem;
> +
> mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
> sizeof(struct kvm_mmu_page),
> 0, 0, NULL);
> --
> 1.8.1.4

--
Gleb.

2013-10-24 09:29:53

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On 10/24/2013 05:19 PM, Gleb Natapov wrote:

>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>> {
>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>
>> - return (struct kvm_mmu_page *)page_private(page);
>> + return (struct kvm_mmu_page *)(page->mapping);
> Why?

That's because page->private has been used by slab:

/* Remainder is not double word aligned */
union {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
#if USE_SPLIT_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
struct page *first_page; /* Compound tail pages */
};

2013-10-24 09:32:40

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

On Thu, Oct 24, 2013 at 05:24:12PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:17 PM, Gleb Natapov wrote:
>
> >>
> >> -/**
> >> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> >> +static void __rmap_write_protect_lockless(u64 *sptep)
> >> +{
> >> + u64 spte;
> >> +
> >> +retry:
> >> + /*
> >> + * Note we may partly read the sptep on 32bit host, however, we
> >> + * allow this case because:
> >> + * - we do not access the page got from the sptep.
> >> + * - cmpxchg64 can detect that case and avoid setting a wrong value
> >> + * to the sptep.
> >> + */
> >> + spte = *rcu_dereference(sptep);
> >> + if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
> > is_last_spte gets two parameters.
>
> In patch [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level,
> we have removed the 'level' from the parameter list.
Ah, I haven't got 12/15 and git am did not complain :(

--
Gleb.

2013-10-24 09:35:11

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

On 10/24/2013 05:17 PM, Gleb Natapov wrote:

>>
>> -/**
>> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
>> +static void __rmap_write_protect_lockless(u64 *sptep)
>> +{
>> + u64 spte;
>> +
>> +retry:
>> + /*
>> + * Note we may partly read the sptep on 32bit host, however, we
>> + * allow this case because:
>> + * - we do not access the page got from the sptep.
>> + * - cmpxchg64 can detect that case and avoid setting a wrong value
>> + * to the sptep.
>> + */
>> + spte = *rcu_dereference(sptep);
>> + if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
> is_last_spte gets two parameters.

In patch [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level,
we have removed the 'level' from the parameter list.

2013-10-24 09:52:10

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>
> >> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >> {
> >> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>
> >> - return (struct kvm_mmu_page *)page_private(page);
> >> + return (struct kvm_mmu_page *)(page->mapping);
> > Why?
>
> That's because page->private has been used by slab:
>
But does the lockless path actually look at it?

> /* Remainder is not double word aligned */
> union {
> unsigned long private; /* Mapping-private opaque data:
> * usually used for buffer_heads
> * if PagePrivate set; used for
> * swp_entry_t if PageSwapCache;
> * indicates order in the buddy
> * system if PG_buddy is set.
> */
> #if USE_SPLIT_PTLOCKS
> spinlock_t ptl;
> #endif
> struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
> struct page *first_page; /* Compound tail pages */
> };

--
Gleb.

2013-10-24 10:10:56

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>
>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>> {
>>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>
>>>> - return (struct kvm_mmu_page *)page_private(page);
>>>> + return (struct kvm_mmu_page *)(page->mapping);
>>> Why?
>>
>> That's because page->private has been used by slab:
>>
> But does lockless path actually looks at it?

The lockless path does not use it; however, it is used by page_header():

static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
{
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

return (struct kvm_mmu_page *)(page->mapping);
}

which is used in the common code.

2013-10-24 10:39:32

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> > On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> >> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
> >>
> >>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >>>> {
> >>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>>>
> >>>> - return (struct kvm_mmu_page *)page_private(page);
> >>>> + return (struct kvm_mmu_page *)(page->mapping);
> >>> Why?
> >>
> >> That's because page->private has been used by slab:
> >>
> > But does lockless path actually looks at it?
>
> Lockless path does not use it, however, it is used by kvm_mmu_page():
>
> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> {
> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>
> return (struct kvm_mmu_page *)(page->mapping);
> }
>
> which is used in the common code.
Ah, so the pointer is not available even after the object is allocated.
Makes sense since we allocate an object, not a page, here, but is it safe to
use mapping like that?

--
Gleb.

2013-10-24 11:01:58

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On 10/24/2013 06:39 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>
>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>>>> {
>>>>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>
>>>>>> - return (struct kvm_mmu_page *)page_private(page);
>>>>>> + return (struct kvm_mmu_page *)(page->mapping);
>>>>> Why?
>>>>
>>>> That's because page->private has been used by slab:
>>>>
>>> But does lockless path actually looks at it?
>>
>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>
>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>> {
>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>
>> return (struct kvm_mmu_page *)(page->mapping);
>> }
>>
>> which is used in the common code.
> Ah, so the pointer is not available even after object is allocated.
> Make sense since we allocate object, not page here, but is it safe to
> use mapping like that?

The comment says:

struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.

It seems 'mapping' is used for address_space or anonymous memory; in
our case, the page is used by slab, so I guess it is ok. And the check
I put in set_page_header() was not triggered on either slab or slub.

2013-10-24 12:32:46

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On Thu, Oct 24, 2013 at 07:01:49PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 06:39 PM, Gleb Natapov wrote:
> > On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
> >> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> >>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> >>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
> >>>>
> >>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >>>>>> {
> >>>>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>>>>>
> >>>>>> - return (struct kvm_mmu_page *)page_private(page);
> >>>>>> + return (struct kvm_mmu_page *)(page->mapping);
> >>>>> Why?
> >>>>
> >>>> That's because page->private has been used by slab:
> >>>>
> >>> But does lockless path actually looks at it?
> >>
> >> Lockless path does not use it, however, it is used by kvm_mmu_page():
> >>
> >> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >> {
> >> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>
> >> return (struct kvm_mmu_page *)(page->mapping);
> >> }
> >>
> >> which is used in the common code.
> > Ah, so the pointer is not available even after object is allocated.
> > Make sense since we allocate object, not page here, but is it safe to
> > use mapping like that?
>
> The commens says:
>
> struct address_space *mapping; /* If low bit clear, points to
> * inode address_space, or NULL.
> * If page mapped as anonymous
> * memory, low bit is set, and
> * it points to anon_vma object:
> * see PAGE_MAPPING_ANON below.
>
> It seems mapping is used for address_space or anonymous memory, in
> our case, the page is used by slab, so I guess it is ok. And the bug
> i put in set_page_header() was not tiggered on both slab and slub.
>
Yeah, I also think so. I asked Andrea (copied) and he thinks that it is
safe too currently, but things change fast in this area. Andrea?
Another option is to save the slab_cache pointer and reset it before
freeing the object, but it looks ugly.

--
Gleb.

2013-10-28 03:16:17

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

On 10/24/2013 08:32 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 07:01:49PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 06:39 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>>>
>>>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>>>>>> {
>>>>>>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>>>
>>>>>>>> - return (struct kvm_mmu_page *)page_private(page);
>>>>>>>> + return (struct kvm_mmu_page *)(page->mapping);
>>>>>>> Why?
>>>>>>
>>>>>> That's because page->private has been used by slab:
>>>>>>
>>>>> But does lockless path actually looks at it?
>>>>
>>>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>>>
>>>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>> {
>>>> struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>
>>>> return (struct kvm_mmu_page *)(page->mapping);
>>>> }
>>>>
>>>> which is used in the common code.
>>> Ah, so the pointer is not available even after object is allocated.
>>> Make sense since we allocate object, not page here, but is it safe to
>>> use mapping like that?
>>
>> The commens says:
>>
>> struct address_space *mapping; /* If low bit clear, points to
>> * inode address_space, or NULL.
>> * If page mapped as anonymous
>> * memory, low bit is set, and
>> * it points to anon_vma object:
>> * see PAGE_MAPPING_ANON below.
>>
>> It seems mapping is used for address_space or anonymous memory, in
>> our case, the page is used by slab, so I guess it is ok. And the bug
>> i put in set_page_header() was not tiggered on both slab and slub.
>>
> Yeah, I also think so. I asked Andrea (copied) and he thinks that it is
> safe too currently, but things changes fast in this area. Andrea?
> Another option is too save slab_cache pointer and reset it before
> freeing the object but it looks ugly.

It's ugly but it isn't too bad. :)

Since kvm is currently used extensively to test/measure the linux kernel,
and the BUG_ON() in set_page_header() can help us detect the potential
issue, it is easy for us to fix the possible bug in the development cycle
if 'mapping' is ever used by slab. If that really happens, maybe we can
switch to your way instead.
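
For illustration only, the "save and reset the slab_cache pointer"
alternative mentioned above might look roughly like the sketch below;
saved_slab_cache is an assumed new field in struct kvm_mmu_page, not
existing code:

static void set_page_header(struct kvm_mmu_page *sp)
{
        struct page *page = virt_to_page(sp->spt);

        /* page->private aliases page->slab_cache, so save the slab pointer. */
        sp->saved_slab_cache = page->slab_cache;
        set_page_private(page, (unsigned long)sp);
}

static void clear_page_header(struct kvm_mmu_page *sp)
{
        struct page *page = virt_to_page(sp->spt);

        /* Restore the slab pointer before the object goes back to the slab. */
        page->slab_cache = sp->saved_slab_cache;
}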