2010-06-30 08:05:55

by Xiao Guangrong

Subject: [PATCH v3 1/11] KVM: MMU: fix writable sync sp mapping

While we sync many unsync sps at one time (in mmu_sync_children()), we
may map an spte writable. This is dangerous if a gfn mapped by one
unsync sp is the gfn of another unsync page.

For example:

SP1.pte[0] = P
SP2.gfn's pfn = P
[SP1.pte[0] = SP2.gfn's pfn]

First, we write-protect SP1 and SP2, but SP1 and SP2 are still unsync
sps.

Then we sync SP1 first. It sees that SP1.pte[0].gfn has only one unsync
sp, namely SP2, so it maps it writable. But we plan to sync SP2 soon, so
at this point SP2->unsync is not reliable: later we do sync SP2, but
SP2->gfn is already mapped writable.

So the final result is: SP2 is a synced page, but SP2.gfn is mapped writable.

This bug can corrupt the guest's page tables. Fix it by making the
mapping read-only if the mapped gfn has shadow pages.
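
For reference, with this change the check loop in mmu_need_write_protect()
ends up looking roughly like this (a sketch reassembled from the diff
below, not necessarily the verbatim upstream code):

  for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
          if (!can_unsync)
                  return 1;       /* gfn is shadowed: keep it write-protected */

          if (s->role.level != PT_PAGE_TABLE_LEVEL)
                  return 1;       /* only last-level sps may be left unsync */

          if (!need_unsync && !s->unsync) {
                  if (!oos_shadow)
                          return 1;
                  need_unsync = true;
          }
  }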

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 045a0f9..24290f8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1810,11 +1810,14 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
bool need_unsync = false;

for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
+ if (!can_unsync)
+ return 1;
+
if (s->role.level != PT_PAGE_TABLE_LEVEL)
return 1;

if (!need_unsync && !s->unsync) {
- if (!can_unsync || !oos_shadow)
+ if (!oos_shadow)
return 1;
need_unsync = true;
}
--
1.6.1.2


2010-06-30 08:06:36

by Xiao Guangrong

Subject: [PATCH v3 2/11] KVM: MMU: fix conflict access permissions in direct sp

With non-direct (shadow) mapping, we mark an sp as 'direct' when we map a
guest large page, but its access is encoded only from the upper
page-structure entries and does not include the last-level mapping, which
can cause an access conflict.

For example, consider this mapping:
          [W]
        / PDE1 -> |---|
  P[W]            |   |  LPA
        \ PDE2 -> |---|
          [R]

P has two children, PDE1 and PDE2; both PDE1 and PDE2 map the same
large page (LPA). P's access is WR, PDE1's access is WR, and PDE2's
access is RO (considering only read/write permissions here).

When the guest accesses through PDE1, we create a direct sp for LPA; the
sp's access comes from P and is W, so we mark the ptes in this sp writable.

Then, when the guest accesses through PDE2, we find LPA's existing shadow
page (the same one as PDE1's) and mark the ptes RO.

So, if the guest then accesses through PDE1, an incorrect #PF occurs.

Fix this by encoding the last-level mapping's access into the direct shadow page.
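
With the fix, the access computed for the direct sp also folds in the
last-level gpte's permissions, so PDE1 and PDE2 above end up with direct
sps that have different sp->role.access and therefore do not share the
same shadow page. Roughly (a sketch of the post-patch code, combining
this hunk with its surrounding context):

  if (level <= gw->level) {
          int delta = level - gw->level + 1;

          direct = 1;
          if (!is_dirty_gpte(gw->ptes[level - delta]))
                  access &= ~ACC_WRITE_MASK;
          access &= gw->pte_access;   /* new: include the last-level access */
          ...
  }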

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 37c26cb..28c8493 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -339,6 +339,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
direct = 1;
if (!is_dirty_gpte(gw->ptes[level - delta]))
access &= ~ACC_WRITE_MASK;
+ access &= gw->pte_access;
+
/*
* It is a large guest pages backed by small host pages,
* So we set @direct(@sp->role.direct)=1, and set
--
1.6.1.2

2010-06-30 08:07:18

by Xiao Guangrong

Subject: [PATCH v3 3/11] KVM: MMU: fix direct sp's access corrupted

If the mapping is writable but the dirty flag is not set, we find the
read-only direct sp and set up the mapping there. Then, if a write #PF
occurs, we mark this mapping writable inside the read-only direct sp, and
other genuinely read-only mappings will happily write through it without
a #PF.

This may hurt the guest's COW.

Fix this by re-installing the mapping when the write #PF occurs.
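
The re-install works by recomputing, from the current gpte, the access the
direct child sp should have and dropping the link whenever it no longer
matches, so the next walk finds (or creates) a direct sp with the correct
access. Schematically (a condensed sketch of the hunk below; the full hunk
also checks that we are at the guest's large-page mapping level):

  direct_access = gw->pt_access & gw->pte_access;
  if (!is_dirty_gpte(gw->ptes[gw->level - 1]))
          direct_access &= ~ACC_WRITE_MASK;

  child = page_header(*sptep & PT64_BASE_ADDR_MASK);
  if (child->role.access != direct_access) {
          /* stale access: unlink the child and rebuild the mapping */
          mmu_page_remove_parent_pte(child, sptep);
          __set_spte(sptep, shadow_trap_nonpresent_pte);
          kvm_flush_remote_tlbs(vcpu->kvm);
  }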

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 28c8493..f28f09d 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -325,8 +325,32 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
break;
}

- if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
- continue;
+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+ struct kvm_mmu_page *child;
+ unsigned direct_access;
+
+ if (level != gw->level)
+ continue;
+
+ /*
+ * For the direct sp, if the guest pte's dirty bit
+ * changed form clean to dirty, it will corrupt the
+ * sp's access: allow writable in the read-only sp,
+ * so we should update the spte at this point to get
+ * a new sp with the correct access.
+ */
+ direct_access = gw->pt_access & gw->pte_access;
+ if (!is_dirty_gpte(gw->ptes[gw->level - 1]))
+ direct_access &= ~ACC_WRITE_MASK;
+
+ child = page_header(*sptep & PT64_BASE_ADDR_MASK);
+ if (child->role.access == direct_access)
+ continue;
+
+ mmu_page_remove_parent_pte(child, sptep);
+ __set_spte(sptep, shadow_trap_nonpresent_pte);
+ kvm_flush_remote_tlbs(vcpu->kvm);
+ }

if (is_large_pte(*sptep)) {
rmap_remove(vcpu->kvm, sptep);
--
1.6.1.2

2010-06-30 08:07:54

by Xiao Guangrong

Subject: [PATCH v3 4/11] KVM: MMU: fix forgetting to flush all vcpus' TLBs

After removing an rmap, we should flush all vcpus' TLBs.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24290f8..a0c5c31 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1941,6 +1941,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
pgprintk("hfn old %lx new %lx\n",
spte_to_pfn(*sptep), pfn);
rmap_remove(vcpu->kvm, sptep);
+ __set_spte(sptep, shadow_trap_nonpresent_pte);
+ kvm_flush_remote_tlbs(vcpu->kvm);
} else
was_rmapped = 1;
}
--
1.6.1.2

2010-06-30 08:08:48

by Xiao Guangrong

Subject: [PATCH v3 5/11] KVM: MMU: cleanup FNAME(fetch)() functions

Clean up this function now that we already have the direct sp's access.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 19 ++++++++-----------
1 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f28f09d..3350c02 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -306,12 +306,18 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
gfn_t table_gfn;
int r;
int level;
+ bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]);
+ unsigned direct_access;
pt_element_t curr_pte;
struct kvm_shadow_walk_iterator iterator;

if (!is_present_gpte(gw->ptes[gw->level - 1]))
return NULL;

+ direct_access = gw->pt_access & gw->pte_access;
+ if (!dirty)
+ direct_access &= ~ACC_WRITE_MASK;
+
for_each_shadow_entry(vcpu, addr, iterator) {
level = iterator.level;
sptep = iterator.sptep;
@@ -319,15 +325,13 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
mmu_set_spte(vcpu, sptep, access,
gw->pte_access & access,
user_fault, write_fault,
- is_dirty_gpte(gw->ptes[gw->level-1]),
- ptwrite, level,
+ dirty, ptwrite, level,
gw->gfn, pfn, false, true);
break;
}

if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
struct kvm_mmu_page *child;
- unsigned direct_access;

if (level != gw->level)
continue;
@@ -339,10 +343,6 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
* so we should update the spte at this point to get
* a new sp with the correct access.
*/
- direct_access = gw->pt_access & gw->pte_access;
- if (!is_dirty_gpte(gw->ptes[gw->level - 1]))
- direct_access &= ~ACC_WRITE_MASK;
-
child = page_header(*sptep & PT64_BASE_ADDR_MASK);
if (child->role.access == direct_access)
continue;
@@ -359,11 +359,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
}

if (level <= gw->level) {
- int delta = level - gw->level + 1;
direct = 1;
- if (!is_dirty_gpte(gw->ptes[level - delta]))
- access &= ~ACC_WRITE_MASK;
- access &= gw->pte_access;
+ access = direct_access;

/*
* It is a large guest pages backed by small host pages,
--
1.6.1.2

2010-06-30 08:09:33

by Xiao Guangrong

Subject: [PATCH v3 6/11] KVM: MMU: introduce gfn_to_pfn_atomic() function

Introduce gfn_to_pfn_atomic(). It is the fast path and can be used in
atomic context; a later patch will use it.
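
A minimal usage sketch (this mirrors how the pte prefetch patch later in
this series calls it); the caller must be prepared for bad_page when the
fast path cannot pin the page without sleeping:

  pfn_t pfn;

  pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
  if (is_error_pfn(pfn)) {
          /* could not be resolved atomically: drop the ref and bail out */
          kvm_release_pfn_clean(pfn);
          return;
  }
  /* ... use pfn while still in atomic context ... */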

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/mm/gup.c | 2 ++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++++-------
3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..0c9034b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -6,6 +6,7 @@
*/
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/module.h>
#include <linux/vmstat.h>
#include <linux/highmem.h>

@@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,

return nr;
}
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);

/**
* get_user_pages_fast() - pin user pages in memory
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9289d1a..515fefd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -295,6 +295,7 @@ void kvm_release_page_dirty(struct page *page);
void kvm_set_page_dirty(struct page *page);
void kvm_set_page_accessed(struct page *page);

+pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 885d3f5..60bb3d5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -924,19 +924,25 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_hva);

-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
+static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
{
struct page *page[1];
int npages;
pfn_t pfn;

- might_sleep();
-
- npages = get_user_pages_fast(addr, 1, 1, page);
+ if (atomic)
+ npages = __get_user_pages_fast(addr, 1, 1, page);
+ else {
+ might_sleep();
+ npages = get_user_pages_fast(addr, 1, 1, page);
+ }

if (unlikely(npages != 1)) {
struct vm_area_struct *vma;

+ if (atomic)
+ goto return_bad_page;
+
down_read(&current->mm->mmap_sem);
if (is_hwpoison_address(addr)) {
up_read(&current->mm->mmap_sem);
@@ -949,6 +955,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
if (vma == NULL || addr < vma->vm_start ||
!(vma->vm_flags & VM_PFNMAP)) {
up_read(&current->mm->mmap_sem);
+return_bad_page:
get_page(bad_page);
return page_to_pfn(bad_page);
}
@@ -962,7 +969,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
return pfn;
}

-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic)
{
unsigned long addr;

@@ -972,7 +979,18 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
return page_to_pfn(bad_page);
}

- return hva_to_pfn(kvm, addr);
+ return hva_to_pfn(kvm, addr, atomic);
+}
+
+pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
+{
+ return __gfn_to_pfn(kvm, gfn, true);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
+
+pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+{
+ return __gfn_to_pfn(kvm, gfn, false);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn);

@@ -980,7 +998,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn)
{
unsigned long addr = gfn_to_hva_memslot(slot, gfn);
- return hva_to_pfn(kvm, addr);
+ return hva_to_pfn(kvm, addr, false);
}

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
--
1.6.1.2

2010-06-30 08:10:18

by Xiao Guangrong

Subject: [PATCH v3 7/11] KVM: MMU: introduce gfn_to_hva_many() function

This function returns not only the gfn's hva but also the number of
pages remaining in the slot starting at @gfn.

It is used in a later patch.
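
A short usage sketch (taken from the way the pte prefetch patch later in
this series consumes it; start, end and pages are the prefetch loop's own
variables): @entry bounds how many consecutive pages after @gfn can be
pinned in one batch without crossing the end of the memslot:

  unsigned long addr;
  int entry, ret;

  addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
  if (kvm_is_error_hva(addr))
          return -1;

  entry = min(entry, (int)(end - start));         /* stay inside the slot */
  ret = __get_user_pages_fast(addr, entry, 1, pages);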

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 13 ++++++++++++-
2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 515fefd..8f7af32 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -289,6 +289,7 @@ void kvm_disable_largepages(void);
void kvm_arch_flush_shadow(struct kvm *kvm);

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
+unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn, int *entry);
unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 60bb3d5..a007889 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -913,15 +913,26 @@ static unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}

-unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn, int *entry)
{
struct kvm_memory_slot *slot;

slot = gfn_to_memslot(kvm, gfn);
+
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
return bad_hva();
+
+ if (entry)
+ *entry = slot->npages - (gfn - slot->base_gfn);
+
return gfn_to_hva_memslot(slot, gfn);
}
+EXPORT_SYMBOL_GPL(gfn_to_hva_many);
+
+unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+{
+ return gfn_to_hva_many(kvm, gfn, NULL);
+}
EXPORT_SYMBOL_GPL(gfn_to_hva);

static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
--
1.6.1.2

2010-06-30 08:11:13

by Xiao Guangrong

Subject: [PATCH v3 8/11] KVM: MMU: introduce pte_prefetch_topup_memory_cache()

Introduce this function to top up the prefetch cache.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 25 +++++++++++++++++++++----
1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a0c5c31..6673484 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -89,6 +89,8 @@ module_param(oos_shadow, bool, 0644);
}
#endif

+#define PTE_PREFETCH_NUM 16
+
#define PT_FIRST_AVAIL_BITS_SHIFT 9
#define PT64_SECOND_AVAIL_BITS_SHIFT 52

@@ -290,15 +292,16 @@ static void __set_spte(u64 *sptep, u64 spte)
#endif
}

-static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+static int __mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
+ struct kmem_cache *base_cache, int min,
+ int max, gfp_t flags)
{
void *obj;

if (cache->nobjs >= min)
return 0;
- while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
- obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+ while (cache->nobjs < max) {
+ obj = kmem_cache_zalloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -306,6 +309,20 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
return 0;
}

+static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
+ struct kmem_cache *base_cache, int min)
+{
+ return __mmu_topup_memory_cache(cache, base_cache, min,
+ ARRAY_SIZE(cache->objects), GFP_KERNEL);
+}
+
+static int pte_prefetch_topup_memory_cache(struct kvm_vcpu *vcpu)
+{
+ return __mmu_topup_memory_cache(&vcpu->arch.mmu_rmap_desc_cache,
+ rmap_desc_cache, PTE_PREFETCH_NUM,
+ PTE_PREFETCH_NUM, GFP_ATOMIC);
+}
+
static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc,
struct kmem_cache *cache)
{
--
1.6.1.2

2010-06-30 08:11:58

by Xiao Guangrong

Subject: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF

Support prefetching ptes when intercepting a guest #PF, to avoid
further #PFs on later accesses.

If we meet any failure in the prefetch path, we exit it and do not try
the other ptes, to avoid turning it into a heavy path.

Note: this speculation marks pages dirty even though they are not really
accessed; the same issue exists in other speculative paths like invlpg
and pte write. Fortunately, it only affects host memory management. Once
Avi's patchset "[PATCH v2 1/4] KVM: MMU: Introduce drop_spte()" is
merged, we can easily fix it; will do it in the future.
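
The prefetch window is the PTE_PREFETCH_NUM-aligned block of sptes that
contains the faulting spte; only the nonpresent sptes inside that block
are considered. The index arithmetic, as used in the code below, is:

  index = sptep - sp->spt;
  i = index & ~(PTE_PREFETCH_NUM - 1);    /* first spte of the 16-entry block */
  max = index | (PTE_PREFETCH_NUM - 1);   /* last spte of the block */

  /* e.g. with PTE_PREFETCH_NUM == 16 and index == 21: i == 16, max == 31 */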

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 83 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 76 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 159 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6673484..fadfafe 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2002,6 +2002,88 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
{
}

+static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp,
+ u64 *start, u64 *end)
+{
+ gfn_t gfn;
+ struct page *pages[PTE_PREFETCH_NUM];
+
+ gfn = sp->gfn + start - sp->spt;
+ while (start < end) {
+ unsigned long addr;
+ int entry, j, ret;
+
+ addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
+ if (kvm_is_error_hva(addr))
+ return -1;
+
+ entry = min(entry, (int)(end - start));
+ ret = __get_user_pages_fast(addr, entry, 1, pages);
+ if (ret <= 0)
+ return -1;
+
+ for (j = 0; j < ret; j++, gfn++, start++)
+ mmu_set_spte(vcpu, start, ACC_ALL,
+ sp->role.access, 0, 0, 1, NULL,
+ sp->role.level, gfn,
+ page_to_pfn(pages[j]), true, false);
+
+ if (ret < entry)
+ return -1;
+ }
+ return 0;
+}
+
+static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *sptep)
+{
+ u64 *start = NULL;
+ int index, i, max;
+
+ WARN_ON(!sp->role.direct);
+
+ if (pte_prefetch_topup_memory_cache(vcpu))
+ return;
+
+ index = sptep - sp->spt;
+ i = index & ~(PTE_PREFETCH_NUM - 1);
+ max = index | (PTE_PREFETCH_NUM - 1);
+
+ for (; i < max; i++) {
+ u64 *spte = sp->spt + i;
+
+ if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
+ if (!start)
+ continue;
+ if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
+ break;
+ start = NULL;
+ } else if (!start)
+ start = spte;
+ }
+}
+
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+
+ /*
+ * Since it's no accessed bit on EPT, it's no way to
+ * distinguish between actually accessed translations
+ * and prefetched, so disable pte prefetch if EPT is
+ * enabled.
+ */
+ if (!shadow_accessed_mask)
+ return;
+
+ sp = page_header(__pa(sptep));
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ return;
+
+ __direct_pte_prefetch(vcpu, sp, sptep);
+}
+
static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
int level, gfn_t gfn, pfn_t pfn)
{
@@ -2015,6 +2097,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
0, write, 1, &pt_write,
level, gfn, pfn, false, true);
+ direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu->stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3350c02..d8c3be8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -291,6 +291,81 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gpte_to_gfn(gpte), pfn, true, true);
}

+static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+ pt_element_t gptep[PTE_PREFETCH_NUM];
+ gpa_t first_pte_gpa;
+ int offset = 0, index, i, j, max;
+
+ sp = page_header(__pa(sptep));
+ index = sptep - sp->spt;
+
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ return;
+
+ if (sp->role.direct)
+ return __direct_pte_prefetch(vcpu, sp, sptep);
+
+ index = sptep - sp->spt;
+ i = index & ~(PTE_PREFETCH_NUM - 1);
+ max = index | (PTE_PREFETCH_NUM - 1);
+
+ if (PTTYPE == 32)
+ offset = sp->role.quadrant << PT64_LEVEL_BITS;
+
+ first_pte_gpa = gfn_to_gpa(sp->gfn) +
+ (offset + i) * sizeof(pt_element_t);
+
+ if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
+ sizeof(gptep)) < 0)
+ return;
+
+ if (pte_prefetch_topup_memory_cache(vcpu))
+ return;
+
+ for (j = 0; i < max; i++, j++) {
+ pt_element_t gpte;
+ unsigned pte_access;
+ u64 *spte = sp->spt + i;
+ gfn_t gfn;
+ pfn_t pfn;
+
+ if (spte == sptep)
+ continue;
+
+ if (*spte != shadow_trap_nonpresent_pte)
+ continue;
+
+ gpte = gptep[j];
+
+ if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
+ break;
+
+ if (!(gpte & PT_ACCESSED_MASK))
+ continue;
+
+ if (!is_present_gpte(gpte)) {
+ if (!sp->unsync)
+ __set_spte(spte, shadow_notrap_nonpresent_pte);
+ continue;
+ }
+
+ gfn = gpte_to_gfn(gpte);
+
+ pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
+ if (is_error_pfn(pfn)) {
+ kvm_release_pfn_clean(pfn);
+ break;
+ }
+
+ pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+ mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
+ is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
+ pfn, true, false);
+ }
+}
+
/*
* Fetch a shadow pte for a specific level in the paging hierarchy.
*/
@@ -327,6 +402,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
user_fault, write_fault,
dirty, ptwrite, level,
gw->gfn, pfn, false, true);
+ FNAME(pte_prefetch)(vcpu, sptep);
break;
}

--
1.6.1.2

2010-06-30 08:12:31

by Xiao Guangrong

Subject: [PATCH 10/11] KVM: MMU: combine guest pte read between walk and pte prefetch

Combine the guest pte reads done by the guest pte walk and by pte prefetch.
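
At the last level, the walker now reads the whole naturally aligned
PTE_PREFETCH_NUM-sized group of guest ptes that contains the needed pte
and caches it in walker->prefetch_ptes, so the prefetch path can reuse it
instead of issuing a second guest-memory read. The alignment arithmetic,
as used in the code below, is:

  mask = PTE_PREFETCH_NUM * sizeof(*pte) - 1;
  base_gpa = pte_gpa & ~mask;                     /* start of the pte group */
  index = (pte_gpa - base_gpa) / sizeof(*pte);    /* position of the wanted pte */

  /* read the whole group once, then take *pte = prefetch_ptes[index] */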

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 48 ++++++++++++++++++++++++++++++-------------
1 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d8c3be8..3e047b4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -67,6 +67,7 @@ struct guest_walker {
int level;
gfn_t table_gfn[PT_MAX_FULL_LEVELS];
pt_element_t ptes[PT_MAX_FULL_LEVELS];
+ pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
unsigned pt_access;
unsigned pte_access;
@@ -110,6 +111,31 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
return access;
}

+static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu,
+ struct guest_walker *walker,
+ gpa_t pte_gpa, pt_element_t *pte)
+{
+ int index, ret;
+ u64 mask;
+ gpa_t base_gpa;
+
+ if (walker->level > PT_PAGE_TABLE_LEVEL)
+ return kvm_read_guest(vcpu->kvm, pte_gpa, pte,
+ sizeof(*pte));
+
+ mask = PTE_PREFETCH_NUM * sizeof(*pte) - 1;
+ base_gpa = pte_gpa & ~mask;
+ index = (pte_gpa - base_gpa) / sizeof(*pte);
+
+ ret = kvm_read_guest(vcpu->kvm, base_gpa, walker->prefetch_ptes,
+ sizeof(walker->prefetch_ptes));
+ if (ret)
+ return ret;
+
+ *pte = walker->prefetch_ptes[index];
+ return 0;
+}
+
/*
* Fetch a guest pte for a guest virtual address
*/
@@ -151,7 +177,7 @@ walk:
walker->table_gfn[walker->level - 1] = table_gfn;
walker->pte_gpa[walker->level - 1] = pte_gpa;

- if (kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte)))
+ if (FNAME(read_guest_pte)(vcpu, walker, pte_gpa, &pte))
goto not_present;

trace_kvm_mmu_paging_element(pte, walker->level);
@@ -291,12 +317,12 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gpte_to_gfn(gpte), pfn, true, true);
}

-static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
+static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
+ struct guest_walker *gw, u64 *sptep)
{
struct kvm_mmu_page *sp;
- pt_element_t gptep[PTE_PREFETCH_NUM];
- gpa_t first_pte_gpa;
- int offset = 0, index, i, j, max;
+ pt_element_t *gptep;
+ int index, i, j, max;

sp = page_header(__pa(sptep));
index = sptep - sp->spt;
@@ -311,15 +337,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
i = index & ~(PTE_PREFETCH_NUM - 1);
max = index | (PTE_PREFETCH_NUM - 1);

- if (PTTYPE == 32)
- offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
- first_pte_gpa = gfn_to_gpa(sp->gfn) +
- (offset + i) * sizeof(pt_element_t);
-
- if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
- sizeof(gptep)) < 0)
- return;
+ gptep = gw->prefetch_ptes;

if (pte_prefetch_topup_memory_cache(vcpu))
return;
@@ -402,7 +420,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
user_fault, write_fault,
dirty, ptwrite, level,
gw->gfn, pfn, false, true);
- FNAME(pte_prefetch)(vcpu, sptep);
+ FNAME(pte_prefetch)(vcpu, gw, sptep);
break;
}

--
1.6.1.2

2010-06-30 08:13:17

by Xiao Guangrong

Subject: [PATCH v3 11/11] KVM: MMU: trace pte prefetch

Trace pte prefetch; it can help us improve the prefetch's performance.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 45 +++++++++++++++++++++++++++++++++----------
arch/x86/kvm/mmutrace.h | 33 ++++++++++++++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 23 ++++++++++++++++-----
3 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fadfafe..0894ed8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -91,6 +91,12 @@ module_param(oos_shadow, bool, 0644);

#define PTE_PREFETCH_NUM 16

+#define PREFETCH_SUCCESS 0
+#define PREFETCH_ERR_GFN2PFN 1
+#define PREFETCH_ERR_ALLOC_MEM 2
+#define PREFETCH_ERR_RSVD_BITS_SET 3
+#define PREFETCH_ERR_MMIO 4
+
#define PT_FIRST_AVAIL_BITS_SHIFT 9
#define PT64_SECOND_AVAIL_BITS_SHIFT 52

@@ -2004,7 +2010,7 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)

static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
- u64 *start, u64 *end)
+ u64 *start, u64 *end, u64 address)
{
gfn_t gfn;
struct page *pages[PTE_PREFETCH_NUM];
@@ -2015,36 +2021,52 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
int entry, j, ret;

addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
- if (kvm_is_error_hva(addr))
+ if (kvm_is_error_hva(addr)) {
+ trace_pte_prefetch(true, address, 0,
+ PREFETCH_ERR_MMIO);
return -1;
+ }

entry = min(entry, (int)(end - start));
ret = __get_user_pages_fast(addr, entry, 1, pages);
- if (ret <= 0)
+ if (ret <= 0) {
+ trace_pte_prefetch(true, address, 0,
+ PREFETCH_ERR_GFN2PFN);
return -1;
+ }

- for (j = 0; j < ret; j++, gfn++, start++)
+ for (j = 0; j < ret; j++, gfn++, start++) {
+ trace_pte_prefetch(true, address, 0,
+ PREFETCH_SUCCESS);
mmu_set_spte(vcpu, start, ACC_ALL,
sp->role.access, 0, 0, 1, NULL,
sp->role.level, gfn,
page_to_pfn(pages[j]), true, false);
+ }

- if (ret < entry)
+ if (ret < entry) {
+ trace_pte_prefetch(true, address, 0,
+ PREFETCH_ERR_GFN2PFN);
return -1;
+ }
}
return 0;
}

static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *sptep)
+ struct kvm_mmu_page *sp, u64 *sptep,
+ u64 addr)
{
u64 *start = NULL;
int index, i, max;

WARN_ON(!sp->role.direct);

- if (pte_prefetch_topup_memory_cache(vcpu))
+ if (pte_prefetch_topup_memory_cache(vcpu)) {
+ trace_pte_prefetch(true, addr, 0,
+ PREFETCH_ERR_ALLOC_MEM);
return;
+ }

index = sptep - sp->spt;
i = index & ~(PTE_PREFETCH_NUM - 1);
@@ -2056,7 +2078,8 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
if (!start)
continue;
- if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
+ if (direct_pte_prefetch_many(vcpu, sp, start,
+ spte, addr) < 0)
break;
start = NULL;
} else if (!start)
@@ -2064,7 +2087,7 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
}
}

-static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep, u64 addr)
{
struct kvm_mmu_page *sp;

@@ -2081,7 +2104,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
if (sp->role.level > PT_PAGE_TABLE_LEVEL)
return;

- __direct_pte_prefetch(vcpu, sp, sptep);
+ __direct_pte_prefetch(vcpu, sp, sptep, addr);
}

static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
@@ -2097,7 +2120,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
0, write, 1, &pt_write,
level, gfn, pfn, false, true);
- direct_pte_prefetch(vcpu, iterator.sptep);
+ direct_pte_prefetch(vcpu, iterator.sptep, gfn << PAGE_SHIFT);
++vcpu->stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3aab0f0..c07b6a6 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -195,6 +195,39 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page,

TP_ARGS(sp)
);
+
+#define pte_prefetch_err \
+ {PREFETCH_SUCCESS, "SUCCESS" }, \
+ {PREFETCH_ERR_GFN2PFN, "ERR_GFN2PFN" }, \
+ {PREFETCH_ERR_ALLOC_MEM, "ERR_ALLOC_MEM" }, \
+ {PREFETCH_ERR_RSVD_BITS_SET, "ERR_RSVD_BITS_SET"}, \
+ {PREFETCH_ERR_MMIO, "ERR_MMIO" }
+
+TRACE_EVENT(
+ pte_prefetch,
+ TP_PROTO(bool direct, u64 addr, u64 gpte, int err_code),
+
+ TP_ARGS(direct, addr, gpte, err_code),
+
+ TP_STRUCT__entry(
+ __field(bool, direct)
+ __field(u64, addr)
+ __field(u64, gpte)
+ __field(int, err_code)
+ ),
+
+ TP_fast_assign(
+ __entry->direct = direct;
+ __entry->addr = addr;
+ __entry->gpte = gpte;
+ __entry->err_code = err_code;
+ ),
+
+ TP_printk("%s address:%llx gpte:%llx %s",
+ __entry->direct ? "direct" : "indirect",
+ __entry->addr, __entry->gpte,
+ __print_symbolic(__entry->err_code, pte_prefetch_err))
+ );
#endif /* _TRACE_KVMMMU_H */

#undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3e047b4..ec7f148 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -318,7 +318,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
}

static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
- struct guest_walker *gw, u64 *sptep)
+ struct guest_walker *gw, u64 *sptep, u64 addr)
{
struct kvm_mmu_page *sp;
pt_element_t *gptep;
@@ -331,7 +331,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
return;

if (sp->role.direct)
- return __direct_pte_prefetch(vcpu, sp, sptep);
+ return __direct_pte_prefetch(vcpu, sp, sptep, addr);

index = sptep - sp->spt;
i = index & ~(PTE_PREFETCH_NUM - 1);
@@ -339,8 +339,10 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,

gptep = gw->prefetch_ptes;

- if (pte_prefetch_topup_memory_cache(vcpu))
+ if (pte_prefetch_topup_memory_cache(vcpu)) {
+ trace_pte_prefetch(false, addr, 0, PREFETCH_ERR_ALLOC_MEM);
return;
+ }

for (j = 0; i < max; i++, j++) {
pt_element_t gpte;
@@ -357,15 +359,21 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,

gpte = gptep[j];

- if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
+ if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL)) {
+ trace_pte_prefetch(false, addr, gpte,
+ PREFETCH_ERR_RSVD_BITS_SET);
break;
+ }

if (!(gpte & PT_ACCESSED_MASK))
continue;

if (!is_present_gpte(gpte)) {
- if (!sp->unsync)
+ if (!sp->unsync) {
+ trace_pte_prefetch(false, addr, gpte,
+ PREFETCH_SUCCESS);
__set_spte(spte, shadow_notrap_nonpresent_pte);
+ }
continue;
}

@@ -373,10 +381,13 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,

pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
if (is_error_pfn(pfn)) {
+ trace_pte_prefetch(false, addr, gpte,
+ PREFETCH_ERR_GFN2PFN);
kvm_release_pfn_clean(pfn);
break;
}

+ trace_pte_prefetch(false, addr, gpte, PREFETCH_SUCCESS);
pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
@@ -420,7 +431,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
user_fault, write_fault,
dirty, ptwrite, level,
gw->gfn, pfn, false, true);
- FNAME(pte_prefetch)(vcpu, gw, sptep);
+ FNAME(pte_prefetch)(vcpu, gw, sptep, addr);
break;
}

--
1.6.1.2

2010-06-30 20:45:05

by Marcelo Tosatti

Subject: Re: [PATCH v3 3/11] KVM: MMU: fix direct sp's access corrupted

On Wed, Jun 30, 2010 at 04:03:28PM +0800, Xiao Guangrong wrote:
> If the mapping is writable but the dirty flag is not set, we find the
> read-only direct sp and set up the mapping there. Then, if a write #PF
> occurs, we mark this mapping writable inside the read-only direct sp,
> and other genuinely read-only mappings will happily write through it
> without a #PF.
>
> This may hurt the guest's COW.
>
> Fix this by re-installing the mapping when the write #PF occurs.

Applied 1, 2 and 4, thanks.

> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++++--
> 1 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 28c8493..f28f09d 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -325,8 +325,32 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> break;
> }
>
> - if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> - continue;
> + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
> + struct kvm_mmu_page *child;
> + unsigned direct_access;
> +
> + if (level != gw->level)
> + continue;

This will skip the check for the sp at level 1 when emulating 1GB pages
with 4k host pages (where there are direct sp's at level 2 and 1).
Should be > instead of !=.

> +
> + /*
> + * For the direct sp, if the guest pte's dirty bit
> + * changed form clean to dirty, it will corrupt the
> + * sp's access: allow writable in the read-only sp,
> + * so we should update the spte at this point to get
> + * a new sp with the correct access.
> + */
> + direct_access = gw->pt_access & gw->pte_access;
> + if (!is_dirty_gpte(gw->ptes[gw->level - 1]))
> + direct_access &= ~ACC_WRITE_MASK;
> +
> + child = page_header(*sptep & PT64_BASE_ADDR_MASK);
> + if (child->role.access == direct_access)
> + continue;
> +
> + mmu_page_remove_parent_pte(child, sptep);
> + __set_spte(sptep, shadow_trap_nonpresent_pte);
> + kvm_flush_remote_tlbs(vcpu->kvm);
> + }
>
> if (is_large_pte(*sptep)) {
> rmap_remove(vcpu->kvm, sptep);
> --
> 1.6.1.2
>

2010-06-30 20:45:07

by Marcelo Tosatti

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF

On Wed, Jun 30, 2010 at 04:08:05PM +0800, Xiao Guangrong wrote:
> Support prefetching ptes when intercepting a guest #PF, to avoid
> further #PFs on later accesses.
>
> If we meet any failure in the prefetch path, we exit it and do not try
> the other ptes, to avoid turning it into a heavy path.
>
> Note: this speculation marks pages dirty even though they are not really
> accessed; the same issue exists in other speculative paths like invlpg
> and pte write. Fortunately, it only affects host memory management. Once
> Avi's patchset "[PATCH v2 1/4] KVM: MMU: Introduce drop_spte()" is
> merged, we can easily fix it; will do it in the future.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 83 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/paging_tmpl.h | 76 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 159 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 6673484..fadfafe 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2002,6 +2002,88 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
> {
> }
>
> +static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
> + struct kvm_mmu_page *sp,
> + u64 *start, u64 *end)
> +{
> + gfn_t gfn;
> + struct page *pages[PTE_PREFETCH_NUM];
> +
> + gfn = sp->gfn + start - sp->spt;
> + while (start < end) {
> + unsigned long addr;
> + int entry, j, ret;
> +
> + addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
> + if (kvm_is_error_hva(addr))
> + return -1;
> +
> + entry = min(entry, (int)(end - start));
> + ret = __get_user_pages_fast(addr, entry, 1, pages);
> + if (ret <= 0)
> + return -1;

Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
the overhead significant that this is worthwhile?

You're bypassing the centralized interface.

> +
> + for (j = 0; j < ret; j++, gfn++, start++)
> + mmu_set_spte(vcpu, start, ACC_ALL,
> + sp->role.access, 0, 0, 1, NULL,
> + sp->role.level, gfn,
> + page_to_pfn(pages[j]), true, false);
> +
> + if (ret < entry)
> + return -1;
> + }
> + return 0;
> +}
> +
> +static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
> + struct kvm_mmu_page *sp, u64 *sptep)
> +{
> + u64 *start = NULL;
> + int index, i, max;
> +
> + WARN_ON(!sp->role.direct);
> +
> + if (pte_prefetch_topup_memory_cache(vcpu))
> + return;
> +
> + index = sptep - sp->spt;
> + i = index & ~(PTE_PREFETCH_NUM - 1);
> + max = index | (PTE_PREFETCH_NUM - 1);
> +
> + for (; i < max; i++) {
> + u64 *spte = sp->spt + i;
> +
> + if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
> + if (!start)
> + continue;
> + if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
> + break;
> + start = NULL;
> + } else if (!start)
> + start = spte;
> + }
> +}
> +
> +static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
> +{
> + struct kvm_mmu_page *sp;
> +
> + /*
> + * Since it's no accessed bit on EPT, it's no way to
> + * distinguish between actually accessed translations
> + * and prefetched, so disable pte prefetch if EPT is
> + * enabled.
> + */
> + if (!shadow_accessed_mask)
> + return;
> +
> + sp = page_header(__pa(sptep));
> + if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> + return;
> +
> + __direct_pte_prefetch(vcpu, sp, sptep);
> +}
> +
> static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> int level, gfn_t gfn, pfn_t pfn)
> {
> @@ -2015,6 +2097,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
> mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
> 0, write, 1, &pt_write,
> level, gfn, pfn, false, true);
> + direct_pte_prefetch(vcpu, iterator.sptep);
> ++vcpu->stat.pf_fixed;
> break;
> }
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 3350c02..d8c3be8 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -291,6 +291,81 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> gpte_to_gfn(gpte), pfn, true, true);
> }
>
> +static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
> +{
> + struct kvm_mmu_page *sp;
> + pt_element_t gptep[PTE_PREFETCH_NUM];
> + gpa_t first_pte_gpa;
> + int offset = 0, index, i, j, max;
> +
> + sp = page_header(__pa(sptep));
> + index = sptep - sp->spt;
> +
> + if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> + return;
> +
> + if (sp->role.direct)
> + return __direct_pte_prefetch(vcpu, sp, sptep);
> +
> + index = sptep - sp->spt;
> + i = index & ~(PTE_PREFETCH_NUM - 1);
> + max = index | (PTE_PREFETCH_NUM - 1);
> +
> + if (PTTYPE == 32)
> + offset = sp->role.quadrant << PT64_LEVEL_BITS;
> +
> + first_pte_gpa = gfn_to_gpa(sp->gfn) +
> + (offset + i) * sizeof(pt_element_t);
> +
> + if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
> + sizeof(gptep)) < 0)
> + return;
> +
> + if (pte_prefetch_topup_memory_cache(vcpu))
> + return;
> +
> + for (j = 0; i < max; i++, j++) {
> + pt_element_t gpte;
> + unsigned pte_access;
> + u64 *spte = sp->spt + i;
> + gfn_t gfn;
> + pfn_t pfn;
> +
> + if (spte == sptep)
> + continue;
> +
> + if (*spte != shadow_trap_nonpresent_pte)
> + continue;
> +
> + gpte = gptep[j];
> +
> + if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
> + break;
> +
> + if (!(gpte & PT_ACCESSED_MASK))
> + continue;
> +
> + if (!is_present_gpte(gpte)) {
> + if (!sp->unsync)
> + __set_spte(spte, shadow_notrap_nonpresent_pte);
> + continue;
> + }
> +
> + gfn = gpte_to_gfn(gpte);
> +
> + pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
> + if (is_error_pfn(pfn)) {
> + kvm_release_pfn_clean(pfn);
> + break;
> + }
> +
> + pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
> + mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
> + is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
> + pfn, true, false);

reset_host_protection should be true, see commit 1403283acca (also for
direct case to be consistent).

2010-07-01 00:54:48

by Xiao Guangrong

Subject: Re: [PATCH v3 3/11] KVM: MMU: fix direct sp's access corrupted



Marcelo Tosatti wrote:

>> - if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>> - continue;
>> + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
>> + struct kvm_mmu_page *child;
>> + unsigned direct_access;
>> +
>> + if (level != gw->level)
>> + continue;
>
> This will skip the check for the sp at level 1 when emulating 1GB pages
> with 4k host pages (where there are direct sp's at level 2 and 1).
> Should be > instead of !=.
>

Marcelo,

I think the patch is right.

Every level's direct sp has the same access in the mapping, since while we set up
the mapping we look up the direct sp with that same access.
(Note: we have encoded the D bit into sp->role.access.)

Consider guest 1G writable clean pages and host 4K pages; the shadow page mapping
looks like this:

indirect L4 --> indirect L3 --> direct ReadOnly L2 --> direct ReadOnly L1

When the guest pte becomes dirty, we update L3's spte and find the direct writable
L2 sp, call it A; then we can be sure that A's child sps are also writable, so the
final mapping looks like this:

indirect L4 --> indirect L3 --> direct Writable L2 --> direct Writable L1.

So, I think we haven't broken anything in this patch :-)






2010-07-01 01:15:26

by Xiao Guangrong

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF



Marcelo Tosatti wrote:

>> +
>> + addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
>> + if (kvm_is_error_hva(addr))
>> + return -1;
>> +
>> + entry = min(entry, (int)(end - start));
>> + ret = __get_user_pages_fast(addr, entry, 1, pages);
>> + if (ret <= 0)
>> + return -1;
>
> Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
> the overhead significant that this is worthwhile?
>
> You're bypassing the centralized interface.

I think it's worthwhile to do since we can reduce gup overhead, no reason
to traverse process's page table again and again for the consecutive pages.

>
>> +
>> + pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
>> + mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
>> + is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
>> + pfn, true, false);
>
> reset_host_protection should be true, see commit 1403283acca (also for
> direct case to be consistent).
>

Ah, i forgot it, thanks, Marcelo!

I'll post a small fix patch based on this patchset to change 'false' to 'true' if there
are no other places to be improved. :-)

2010-07-01 12:08:22

by Marcelo Tosatti

Subject: Re: [PATCH v3 5/11] KVM: MMU: cleanup FNAME(fetch)() functions

On Wed, Jun 30, 2010 at 04:05:00PM +0800, Xiao Guangrong wrote:
> Clean up this function now that we already have the direct sp's access.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/paging_tmpl.h | 19 ++++++++-----------
> 1 files changed, 8 insertions(+), 11 deletions(-)

Applied, thanks.

2010-07-01 12:08:23

by Marcelo Tosatti

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF

On Thu, Jul 01, 2010 at 09:11:36AM +0800, Xiao Guangrong wrote:
>
>
> Marcelo Tosatti wrote:
>
> >> +
> >> + addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
> >> + if (kvm_is_error_hva(addr))
> >> + return -1;
> >> +
> >> + entry = min(entry, (int)(end - start));
> >> + ret = __get_user_pages_fast(addr, entry, 1, pages);
> >> + if (ret <= 0)
> >> + return -1;
> >
> > Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
> > the overhead significant that this is worthwhile?
> >
> > You're bypassing the centralized interface.
>
> I think it's worthwhile to do since we can reduce gup overhead, no reason
> to traverse process's page table again and again for the consecutive pages.

I'd prefer to use gfn_to_pfn_atomic. The big win is with shadow mode and
there it's already done one at a time.

> >
> >> +
> >> + pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
> >> + mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
> >> + is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
> >> + pfn, true, false);
> >
> > reset_host_protection should be true, see commit 1403283acca (also for
> > direct case to be consistent).
> >
>
> Ah, i forgot it, thanks, Marcelo!
>
> I'll post a small fix patch based on this patchset to change 'false' to 'true' if there
> are no other places to be improved. :-)

2010-07-01 12:08:21

by Marcelo Tosatti

Subject: Re: [PATCH v3 3/11] KVM: MMU: fix direct sp's access corrupted

On Thu, Jul 01, 2010 at 08:50:58AM +0800, Xiao Guangrong wrote:
>
>
> Marcelo Tosatti wrote:
>
> >> - if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> >> - continue;
> >> + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
> >> + struct kvm_mmu_page *child;
> >> + unsigned direct_access;
> >> +
> >> + if (level != gw->level)
> >> + continue;
> >
> > This will skip the check for the sp at level 1 when emulating 1GB pages
> > with 4k host pages (where there are direct sp's at level 2 and 1).
> > Should be > instead of !=.
> >
>
> Marcelo,
>
> I think the patch is right.
>
> Every level's direct sp has the same access in the mapping, since while we set up
> the mapping we look up the direct sp with that same access.
> (Note: we have encoded the D bit into sp->role.access.)
>
> Consider guest 1G writable clean pages and host 4K pages; the shadow page mapping
> looks like this:
>
> indirect L4 --> indirect L3 --> direct ReadOnly L2 --> direct ReadOnly L1
>
> When the guest pte becomes dirty, we update L3's spte and find the direct writable
> L2 sp, call it A; then we can be sure that A's child sps are also writable, so the
> final mapping looks like this:
>
> indirect L4 --> indirect L3 --> direct Writable L2 --> direct Writable L1.
>
> So, I think we haven't broken anything in this patch :-)

You're right. Applied.

2010-07-01 12:11:24

by Avi Kivity

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF

On 07/01/2010 04:11 AM, Xiao Guangrong wrote:
>
> Marcelo Tosatti wrote:
>
>
>>> +
>>> + addr = gfn_to_hva_many(vcpu->kvm, gfn,&entry);
>>> + if (kvm_is_error_hva(addr))
>>> + return -1;
>>> +
>>> + entry = min(entry, (int)(end - start));
>>> + ret = __get_user_pages_fast(addr, entry, 1, pages);
>>> + if (ret<= 0)
>>> + return -1;
>>>
>> Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
>> the overhead significant that this is worthwhile?
>>
>> You're bypassing the centralized interface.
>>
> I think it's worthwhile to do since we can reduce gup overhead, no reason
> to traverse process's page table again and again for the consecutive pages.
>

Then we should make the centralized interface work in terms of multiple
pages, and write the single-page interfaces in terms of the multipage
interfaces.

--
error compiling committee.c: too many arguments to function

2010-07-01 12:17:34

by Xiao Guangrong

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF



Avi Kivity wrote:
> On 07/01/2010 04:11 AM, Xiao Guangrong wrote:
>>
>> Marcelo Tosatti wrote:
>>
>>
>>>> +
>>>> + addr = gfn_to_hva_many(vcpu->kvm, gfn,&entry);
>>>> + if (kvm_is_error_hva(addr))
>>>> + return -1;
>>>> +
>>>> + entry = min(entry, (int)(end - start));
>>>> + ret = __get_user_pages_fast(addr, entry, 1, pages);
>>>> + if (ret<= 0)
>>>> + return -1;
>>>>
>>> Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
>>> the overhead significant that this is worthwhile?
>>>
>>> You're bypassing the centralized interface.
>>>
>> I think it's worthwhile to do since we can reduce gup overhead, no reason
>> to traverse process's page table again and again for the consecutive
>> pages.
>>
>
> Then we should make the centralized interface work in terms of multiple
> pages, and write the single-page interfaces in terms of the multipage
> interfaces.
>

Umm, I'll introduce a new function named gfn_to_pfn_many_atomic(... int *enough),
using 'enough' to indicate whether we have got all of the consecutive pages in the
slot. Marcelo, how about it? :-)

2010-07-01 12:27:58

by Marcelo Tosatti

Subject: Re: [PATCH v3 9/11] KVM: MMU: prefetch ptes when intercepted guest #PF

On Thu, Jul 01, 2010 at 08:13:40PM +0800, Xiao Guangrong wrote:
>
>
> Avi Kivity wrote:
> > On 07/01/2010 04:11 AM, Xiao Guangrong wrote:
> >>
> >> Marcelo Tosatti wrote:
> >>
> >>
> >>>> +
> >>>> + addr = gfn_to_hva_many(vcpu->kvm, gfn,&entry);
> >>>> + if (kvm_is_error_hva(addr))
> >>>> + return -1;
> >>>> +
> >>>> + entry = min(entry, (int)(end - start));
> >>>> + ret = __get_user_pages_fast(addr, entry, 1, pages);
> >>>> + if (ret<= 0)
> >>>> + return -1;
> >>>>
> >>> Why can't you use gfn_to_pfn_atomic() here, one page at a time? Is
> >>> the overhead significant that this is worthwhile?
> >>>
> >>> You're bypassing the centralized interface.
> >>>
> >> I think it's worthwhile to do since we can reduce gup overhead, no reason
> >> to traverse process's page table again and again for the consecutive
> >> pages.
> >>
> >
> > Then we should make the centralized interface work in terms of multiple
> > pages, and write the single-page interfaces in terms of the multipage
> > interfaces.
> >
>
> Umm, I'll introduce a new function named gfn_to_pfn_many_atomic(... int *enough),
> using 'enough' to indicate whether we have got all of the consecutive pages in the
> slot. Marcelo, how about it? :-)

Fine.