2015-11-12 11:35:51

by Takuya Yoshikawa

Subject: [PATCH 00/10 V2] KVM: x86: MMU: Clean up x86's mmu code for future work

v1->v2:
Patch 5 and 7 are added based on Paolo's suggestions.
Patch 8-10 are new.

Patch 1/2/3/4: no change.
Patch 5: Needed a bit more work than I had expected.
Patch 6: Removed extra comment of v1 (patch 5 made it inappropriate).
Patch 7: As expected, many places needed to be converted.
Patch 8: This is new, but only a small change.

Patch 9: Kind of an RFC (though I have checked it to some extent).
The following two places need to be carefully checked:
- in kvm_mmu_get_page: "if (!direct)" block after kvm_mmu_alloc_page()
- in FNAME(fetch): "if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))" case
Patch 10: Trivial cleanup, assuming that patch 9 is correct.


In summary: patches 1-7 are the result of updating v1 based on the suggestions.
Although patch 5 does not look as nice as expected, this is the most
conservative approach, and patches 8-10 try to alleviate the sadness.

Takuya


Takuya Yoshikawa (10):
01: KVM: x86: MMU: Remove unused parameter of __direct_map()
02: KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap
03: KVM: x86: MMU: Make mmu_set_spte() return emulate value
04: KVM: x86: MMU: Remove is_rmap_spte() and use is_shadow_present_pte()
05: KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()
06: KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes
07: KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct
08: KVM: x86: MMU: Move initialization of parent_ptes out from kvm_mmu_alloc_page()
09: KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()
10: KVM: x86: MMU: Remove unused parameter parent_pte from kvm_mmu_get_page()

Documentation/virtual/kvm/mmu.txt | 4 +-
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/kvm/mmu.c | 357 ++++++++++++++++++--------------------
arch/x86/kvm/mmu_audit.c | 15 +-
arch/x86/kvm/paging_tmpl.h | 20 +--
5 files changed, 196 insertions(+), 208 deletions(-)

--
2.1.0


2015-11-12 11:36:33

by Takuya Yoshikawa

Subject: [PATCH 01/10] KVM: x86: MMU: Remove unused parameter of __direct_map()

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e7c2c14..c3bbc82 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,9 +2708,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
__direct_pte_prefetch(vcpu, sp, sptep);
}

-static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
- int map_writable, int level, gfn_t gfn, pfn_t pfn,
- bool prefault)
+static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
+ int level, gfn_t gfn, pfn_t pfn, bool prefault)
{
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
@@ -3018,11 +3017,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
- r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
- prefault);
+ r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);

-
return r;

out_unlock:
@@ -3531,8 +3528,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
- r = __direct_map(vcpu, gpa, write, map_writable,
- level, gfn, pfn, prefault);
+ r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);

return r;
--
2.1.0

2015-11-12 11:37:20

by Takuya Yoshikawa

Subject: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap

Both __mmu_unsync_walk() and mmu_pages_clear_parents() contain the same
three-line sequence that clears a bit in the unsync child bitmap; the
former places it inside a loop block and uses a few goto statements to
jump to it.

A new helper function, clear_unsync_child_bit(), makes the code cleaner.

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c3bbc82..f3120aa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1806,6 +1806,13 @@ static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
return (pvec->nr == KVM_PAGE_ARRAY_NR);
}

+static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
+{
+ --sp->unsync_children;
+ WARN_ON((int)sp->unsync_children < 0);
+ __clear_bit(idx, sp->unsync_child_bitmap);
+}
+
static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
struct kvm_mmu_pages *pvec)
{
@@ -1815,8 +1822,10 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
struct kvm_mmu_page *child;
u64 ent = sp->spt[i];

- if (!is_shadow_present_pte(ent) || is_large_pte(ent))
- goto clear_child_bitmap;
+ if (!is_shadow_present_pte(ent) || is_large_pte(ent)) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ }

child = page_header(ent & PT64_BASE_ADDR_MASK);

@@ -1825,28 +1834,21 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
return -ENOSPC;

ret = __mmu_unsync_walk(child, pvec);
- if (!ret)
- goto clear_child_bitmap;
- else if (ret > 0)
+ if (!ret) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ } else if (ret > 0) {
nr_unsync_leaf += ret;
- else
+ } else
return ret;
} else if (child->unsync) {
nr_unsync_leaf++;
if (mmu_pages_add(pvec, child, i))
return -ENOSPC;
} else
- goto clear_child_bitmap;
-
- continue;
-
-clear_child_bitmap:
- __clear_bit(i, sp->unsync_child_bitmap);
- sp->unsync_children--;
- WARN_ON((int)sp->unsync_children < 0);
+ clear_unsync_child_bit(sp, i);
}

-
return nr_unsync_leaf;
}

@@ -2009,9 +2011,7 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
if (!sp)
return;

- --sp->unsync_children;
- WARN_ON((int)sp->unsync_children < 0);
- __clear_bit(idx, sp->unsync_child_bitmap);
+ clear_unsync_child_bit(sp, idx);
level++;
} while (level < PT64_ROOT_LEVEL-1 && !sp->unsync_children);
}
--
2.1.0

2015-11-12 11:38:05

by Takuya Yoshikawa

Subject: [PATCH 03/10] KVM: x86: MMU: Make mmu_set_spte() return emulate value

mmu_set_spte()'s code is based on the assumption that the emulate
parameter has a valid pointer value if set_spte() returns true and
write_fault is not zero. In other cases, emulate may be NULL, so a
NULL-check is needed.

Stop passing the emulate pointer and make mmu_set_spte() return the
emulate value instead, to clean up this complex interface. Prefetch
functions can just throw away the return value.
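
For illustration, the two caller shapes after the change, as they appear
in the hunks below (the prefetch path simply drops the result):

	/* fault path: the result decides whether to emulate */
	emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, write,
			       level, gfn, pfn, prefault, map_writable);

	/* prefetch path: no write fault, return value thrown away */
	mmu_set_spte(vcpu, start, access, 0, sp->role.level, gfn,
		     page_to_pfn(pages[i]), true, true);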

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 27 ++++++++++++++-------------
arch/x86/kvm/paging_tmpl.h | 10 +++++-----
2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3120aa..c229356 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2564,13 +2564,13 @@ done:
return ret;
}

-static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned pte_access, int write_fault, int *emulate,
- int level, gfn_t gfn, pfn_t pfn, bool speculative,
- bool host_writable)
+static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
+ int write_fault, int level, gfn_t gfn, pfn_t pfn,
+ bool speculative, bool host_writable)
{
int was_rmapped = 0;
int rmap_count;
+ bool emulate = false;

pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);
@@ -2600,12 +2600,12 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
true, host_writable)) {
if (write_fault)
- *emulate = 1;
+ emulate = true;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}

- if (unlikely(is_mmio_spte(*sptep) && emulate))
- *emulate = 1;
+ if (unlikely(is_mmio_spte(*sptep)))
+ emulate = true;

pgprintk("%s: setting spte %llx\n", __func__, *sptep);
pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
@@ -2624,6 +2624,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}

kvm_release_pfn_clean(pfn);
+
+ return emulate;
}

static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -2658,9 +2660,8 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
return -1;

for (i = 0; i < ret; i++, gfn++, start++)
- mmu_set_spte(vcpu, start, access, 0, NULL,
- sp->role.level, gfn, page_to_pfn(pages[i]),
- true, true);
+ mmu_set_spte(vcpu, start, access, 0, sp->role.level, gfn,
+ page_to_pfn(pages[i]), true, true);

return 0;
}
@@ -2721,9 +2722,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,

for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
if (iterator.level == level) {
- mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
- write, &emulate, level, gfn, pfn,
- prefault, map_writable);
+ emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
+ write, level, gfn, pfn, prefault,
+ map_writable);
direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu->stat.pf_fixed;
break;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3058a22..9d21b44 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -475,8 +475,8 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* we call mmu_set_spte() with host_writable = true because
* pte_prefetch_gfn_to_pfn always gets a writable pfn.
*/
- mmu_set_spte(vcpu, spte, pte_access, 0, NULL, PT_PAGE_TABLE_LEVEL,
- gfn, pfn, true, true);
+ mmu_set_spte(vcpu, spte, pte_access, 0, PT_PAGE_TABLE_LEVEL, gfn, pfn,
+ true, true);

return true;
}
@@ -556,7 +556,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
struct kvm_mmu_page *sp = NULL;
struct kvm_shadow_walk_iterator it;
unsigned direct_access, access = gw->pt_access;
- int top_level, emulate = 0;
+ int top_level, emulate;

direct_access = gw->pte_access;

@@ -622,8 +622,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
}

clear_sp_write_flooding_count(it.sptep);
- mmu_set_spte(vcpu, it.sptep, gw->pte_access, write_fault, &emulate,
- it.level, gw->gfn, pfn, prefault, map_writable);
+ emulate = mmu_set_spte(vcpu, it.sptep, gw->pte_access, write_fault,
+ it.level, gw->gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);

return emulate;
--
2.1.0

2015-11-12 11:38:57

by Takuya Yoshikawa

Subject: [PATCH 04/10] KVM: x86: MMU: Remove is_rmap_spte() and use is_shadow_present_pte()

is_rmap_spte(), originally named is_rmap_pte(), was introduced when the
simple reverse mapping was implemented by commit cd4a4e5374110444
("[PATCH] KVM: MMU: Implement simple reverse mapping"). At that point,
its role was clear and only rmap_add() and rmap_remove() were using it
to select sptes that need to be reverse-mapped.

Independently of that, is_shadow_present_pte() was first introduced by
commit c7addb902054195b ("KVM: Allow not-present guest page faults to
bypass kvm") for the bypass_guest_pf optimization, which no longer
exists.

These two seem to have changed their roles somewhat, and is_rmap_spte()
just calls is_shadow_present_pte() now.

Since using both of them with no clear distinction just makes the code
confusing, remove is_rmap_spte().
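
The wrapper being removed is a pure alias (from the hunk below):

	static int is_rmap_spte(u64 pte)
	{
		return is_shadow_present_pte(pte);
	}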

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 13 ++++---------
arch/x86/kvm/mmu_audit.c | 2 +-
2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c229356..e8cfdc4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -311,11 +311,6 @@ static int is_large_pte(u64 pte)
return pte & PT_PAGE_SIZE_MASK;
}

-static int is_rmap_spte(u64 pte)
-{
- return is_shadow_present_pte(pte);
-}
-
static int is_last_spte(u64 pte, int level)
{
if (level == PT_PAGE_TABLE_LEVEL)
@@ -540,7 +535,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
u64 old_spte = *sptep;
bool ret = false;

- WARN_ON(!is_rmap_spte(new_spte));
+ WARN_ON(!is_shadow_present_pte(new_spte));

if (!is_shadow_present_pte(old_spte)) {
mmu_spte_set(sptep, new_spte);
@@ -595,7 +590,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
else
old_spte = __update_clear_spte_slow(sptep, 0ull);

- if (!is_rmap_spte(old_spte))
+ if (!is_shadow_present_pte(old_spte))
return 0;

pfn = spte_to_pfn(old_spte);
@@ -2575,7 +2570,7 @@ static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);

- if (is_rmap_spte(*sptep)) {
+ if (is_shadow_present_pte(*sptep)) {
/*
* If we overwrite a PTE page pointer with a 2MB PMD, unlink
* the parent of the now unreachable PTE.
@@ -2919,7 +2914,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
* If the mapping has been changed, let the vcpu fault on the
* same address again.
*/
- if (!is_rmap_spte(spte)) {
+ if (!is_shadow_present_pte(spte)) {
ret = true;
goto exit;
}
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 03d518e..90ee420 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -183,7 +183,7 @@ static void check_mappings_rmap(struct kvm *kvm, struct kvm_mmu_page *sp)
return;

for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
- if (!is_rmap_spte(sp->spt[i]))
+ if (!is_shadow_present_pte(sp->spt[i]))
continue;

inspect_spte_has_rmap(kvm, sp->spt + i);
--
2.1.0

2015-11-12 11:39:46

by Takuya Yoshikawa

Subject: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
nearly the same thing as the for_each_rmap_spte macro. The only
difference is that an is_shadow_present_pte() check cannot be placed
there, because kvm_mmu_mark_parents_unsync() can be called with a new
parent pointer whose entry is not set yet.

By calling mark_unsync() separately for the parent and adding the parent
pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
works with no problem.
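
As a toy user-space model of the constraint above (illustrative names
only, not kernel code): a walk that insists on present entries would
skip a freshly added parent pointer whose spte is still zero, which is
why the new parent is marked separately:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define SPTE_PRESENT 1ull

	static bool is_shadow_present(uint64_t spte)
	{
		return spte & SPTE_PRESENT;
	}

	int main(void)
	{
		uint64_t linked_parent = SPTE_PRESENT; /* entry already set */
		uint64_t new_parent = 0;               /* entry not set yet */
		uint64_t *parent_ptes[2] = { &linked_parent, &new_parent };
		int i;

		for (i = 0; i < 2; i++) {
			if (!is_shadow_present(*parent_ptes[i])) {
				/* a present check skips the new parent */
				printf("parent %d skipped\n", i);
				continue;
			}
			printf("parent %d marked unsync\n", i);
		}
		return 0;
	}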

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e8cfdc4..1691171 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
}
}

-typedef void (*pte_list_walk_fn) (u64 *spte);
-static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
-{
- struct pte_list_desc *desc;
- int i;
-
- if (!*pte_list)
- return;
-
- if (!(*pte_list & 1))
- return fn((u64 *)*pte_list);
-
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
- while (desc) {
- for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
- fn(desc->sptes[i]);
- desc = desc->more;
- }
-}
-
static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
{
@@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
static void mark_unsync(u64 *spte);
static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
{
- pte_list_walk(&sp->parent_ptes, mark_unsync);
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
+ mark_unsync(sptep);
+ }
}

static void mark_unsync(u64 *spte)
@@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
break;

- mmu_page_add_parent_pte(vcpu, sp, parent_pte);
if (sp->unsync_children) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_mmu_mark_parents_unsync(sp);
- } else if (sp->unsync)
+ if (parent_pte)
+ mark_unsync(parent_pte);
+ } else if (sp->unsync) {
kvm_mmu_mark_parents_unsync(sp);
+ if (parent_pte)
+ mark_unsync(parent_pte);
+ }
+ mmu_page_add_parent_pte(vcpu, sp, parent_pte);

__clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
--
2.1.0

2015-11-12 11:40:44

by Takuya Yoshikawa

Subject: [PATCH 06/10] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes

At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
placed right after the call to detect unrelated sptes which must not be
found in the reverse-mapping list.

Move this check into rmap_get_first/next() so that all call sites, not
just the users of the for_each_rmap_spte() macro, are checked the same
way. In addition, change the BUG_ON to WARN_ON, since killing the whole
host is the last thing that KVM should try.

One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
rmap_get_first() to handle parent sptes. The change will not break it,
because parent sptes are present (at least until drop_parent_pte()
actually unlinks them) and are not mmio sptes.
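
With the check inside the getters, the macro itself becomes trivial
(from the hunk below):

	#define for_each_rmap_spte(_rmap_, _iter_, _spte_)		\
		for (_spte_ = rmap_get_first(*_rmap_, _iter_);		\
		     _spte_; _spte_ = rmap_get_next(_iter_))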

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 4 ++--
arch/x86/kvm/mmu.c | 26 +++++++++++++++++---------
2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 3a4d681..daf9c0f 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -203,10 +203,10 @@ Shadow pages contain the following information:
page cannot be destroyed. See role.invalid.
parent_ptes:
The reverse mapping for the pte/ptes pointing at this page's spt. If
- parent_ptes bit 0 is zero, only one spte points at this pages and
+ parent_ptes bit 0 is zero, only one spte points at this page and
parent_ptes points at this single spte, otherwise, there exists multiple
sptes pointing at this page and (parent_ptes & ~0x1) points at a data
- structure with a list of parent_ptes.
+ structure with a list of parent sptes.
unsync:
If true, then the translations in this page may not match the guest's
translation. This is equivalent to the state of the tlb when a pte is
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1691171..ee7b101 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1079,17 +1079,23 @@ struct rmap_iterator {
*/
static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
{
+ u64 *sptep;
+
if (!rmap)
return NULL;

if (!(rmap & 1)) {
iter->desc = NULL;
- return (u64 *)rmap;
+ sptep = (u64 *)rmap;
+ goto out;
}

iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
- return iter->desc->sptes[iter->pos];
+ sptep = iter->desc->sptes[iter->pos];
+out:
+ WARN_ON(!is_shadow_present_pte(*sptep));
+ return sptep;
}

/*
@@ -1099,14 +1105,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
*/
static u64 *rmap_get_next(struct rmap_iterator *iter)
{
+ u64 *sptep;
+
if (iter->desc) {
if (iter->pos < PTE_LIST_EXT - 1) {
- u64 *sptep;
-
++iter->pos;
sptep = iter->desc->sptes[iter->pos];
if (sptep)
- return sptep;
+ goto out;
}

iter->desc = iter->desc->more;
@@ -1114,17 +1120,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
if (iter->desc) {
iter->pos = 0;
/* desc->sptes[0] cannot be NULL */
- return iter->desc->sptes[iter->pos];
+ sptep = iter->desc->sptes[iter->pos];
+ goto out;
}
}

return NULL;
+out:
+ WARN_ON(!is_shadow_present_pte(*sptep));
+ return sptep;
}

#define for_each_rmap_spte(_rmap_, _iter_, _spte_) \
for (_spte_ = rmap_get_first(*_rmap_, _iter_); \
- _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;}); \
- _spte_ = rmap_get_next(_iter_))
+ _spte_; _spte_ = rmap_get_next(_iter_))

static void drop_spte(struct kvm *kvm, u64 *sptep)
{
@@ -1338,7 +1347,6 @@ static bool kvm_zap_rmapp(struct kvm *kvm, unsigned long *rmapp)
bool flush = false;

while ((sptep = rmap_get_first(*rmapp, &iter))) {
- BUG_ON(!(*sptep & PT_PRESENT_MASK));
rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);

drop_spte(kvm, sptep);
--
2.1.0

2015-11-12 11:42:07

by Takuya Yoshikawa

Subject: [PATCH 07/10] KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct

The new struct kvm_rmap_head makes the code type-safe to some extent.
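
The value being wrapped is a tagged word: bit 0 clear means it holds a
single spte pointer directly, bit 0 set means it points to a
pte_list_desc chain (see the pte_list_add() hunk below). A minimal
user-space sketch of that encoding, assuming PTE_LIST_EXT == 3 and with
illustrative helper names:

	#include <stdint.h>
	#include <stdio.h>

	struct pte_list_desc {
		uint64_t *sptes[3];		/* PTE_LIST_EXT == 3 here */
		struct pte_list_desc *more;
	};

	struct kvm_rmap_head {
		unsigned long val;
	};

	/* First mapping: store the spte pointer directly (bit 0 clear). */
	static void add_first(struct kvm_rmap_head *head, uint64_t *spte)
	{
		head->val = (unsigned long)spte;
	}

	/* Second mapping: switch to a descriptor and tag with bit 0. */
	static void add_second(struct kvm_rmap_head *head,
			       struct pte_list_desc *desc, uint64_t *spte)
	{
		desc->sptes[0] = (uint64_t *)head->val;
		desc->sptes[1] = spte;
		desc->more = NULL;
		head->val = (unsigned long)desc | 1;
	}

	int main(void)
	{
		static uint64_t spte_a, spte_b;
		static struct pte_list_desc desc;
		struct kvm_rmap_head head = { 0 };

		add_first(&head, &spte_a);
		printf("one spte:  tag bit = %lu\n", head.val & 1); /* 0 */
		add_second(&head, &desc, &spte_b);
		printf("desc list: tag bit = %lu\n", head.val & 1); /* 1 */
		return 0;
	}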

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/kvm/mmu.c | 169 +++++++++++++++++++++-------------------
arch/x86/kvm/mmu_audit.c | 13 ++--
3 files changed, 100 insertions(+), 90 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0535359..c5a0c4a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -214,6 +214,10 @@ union kvm_mmu_page_role {
};
};

+struct kvm_rmap_head {
+ unsigned long val;
+};
+
struct kvm_mmu_page {
struct list_head link;
struct hlist_node hash_link;
@@ -231,7 +235,7 @@ struct kvm_mmu_page {
bool unsync;
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
- unsigned long parent_ptes; /* Reverse mapping for parent_pte */
+ struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */

/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */
unsigned long mmu_valid_gen;
@@ -604,7 +608,7 @@ struct kvm_lpage_info {
};

struct kvm_arch_memory_slot {
- unsigned long *rmap[KVM_NR_PAGE_SIZES];
+ struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
};

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ee7b101..85f4bbd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -916,24 +916,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
*
*/
static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
- unsigned long *pte_list)
+ struct kvm_rmap_head *rmap_head)
{
struct pte_list_desc *desc;
int i, count = 0;

- if (!*pte_list) {
+ if (!rmap_head->val) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
- *pte_list = (unsigned long)spte;
- } else if (!(*pte_list & 1)) {
+ rmap_head->val = (unsigned long)spte;
+ } else if (!(rmap_head->val & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
- desc->sptes[0] = (u64 *)*pte_list;
+ desc->sptes[0] = (u64 *)rmap_head->val;
desc->sptes[1] = spte;
- *pte_list = (unsigned long)desc | 1;
+ rmap_head->val = (unsigned long)desc | 1;
++count;
} else {
rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
desc = desc->more;
count += PTE_LIST_EXT;
@@ -950,8 +950,9 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
}

static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
- int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
+ struct pte_list_desc *desc, int i,
+ struct pte_list_desc *prev_desc)
{
int j;

@@ -962,43 +963,43 @@ pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
if (j != 0)
return;
if (!prev_desc && !desc->more)
- *pte_list = (unsigned long)desc->sptes[0];
+ rmap_head->val = (unsigned long)desc->sptes[0];
else
if (prev_desc)
prev_desc->more = desc->more;
else
- *pte_list = (unsigned long)desc->more | 1;
+ rmap_head->val = (unsigned long)desc->more | 1;
mmu_free_pte_list_desc(desc);
}

-static void pte_list_remove(u64 *spte, unsigned long *pte_list)
+static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
{
struct pte_list_desc *desc;
struct pte_list_desc *prev_desc;
int i;

- if (!*pte_list) {
+ if (!rmap_head->val) {
printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
BUG();
- } else if (!(*pte_list & 1)) {
+ } else if (!(rmap_head->val & 1)) {
rmap_printk("pte_list_remove: %p 1->0\n", spte);
- if ((u64 *)*pte_list != spte) {
+ if ((u64 *)rmap_head->val != spte) {
printk(KERN_ERR "pte_list_remove: %p 1->BUG\n", spte);
BUG();
}
- *pte_list = 0;
+ rmap_head->val = 0;
} else {
rmap_printk("pte_list_remove: %p many->many\n", spte);
- desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+ desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
prev_desc = NULL;
while (desc) {
- for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+ for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
if (desc->sptes[i] == spte) {
- pte_list_desc_remove_entry(pte_list,
- desc, i,
- prev_desc);
+ pte_list_desc_remove_entry(rmap_head,
+ desc, i, prev_desc);
return;
}
+ }
prev_desc = desc;
desc = desc->more;
}
@@ -1007,8 +1008,8 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
}
}

-static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
- struct kvm_memory_slot *slot)
+static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
+ struct kvm_memory_slot *slot)
{
unsigned long idx;

@@ -1016,10 +1017,8 @@ static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
return &slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL][idx];
}

-/*
- * Take gfn and return the reverse mapping to it.
- */
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, struct kvm_mmu_page *sp)
+static struct kvm_rmap_head *gfn_to_rmap(struct kvm *kvm, gfn_t gfn,
+ struct kvm_mmu_page *sp)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -1040,24 +1039,24 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
{
struct kvm_mmu_page *sp;
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;

sp = page_header(__pa(spte));
kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp);
- return pte_list_add(vcpu, spte, rmapp);
+ rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp);
+ return pte_list_add(vcpu, spte, rmap_head);
}

static void rmap_remove(struct kvm *kvm, u64 *spte)
{
struct kvm_mmu_page *sp;
gfn_t gfn;
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;

sp = page_header(__pa(spte));
gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
- rmapp = gfn_to_rmap(kvm, gfn, sp);
- pte_list_remove(spte, rmapp);
+ rmap_head = gfn_to_rmap(kvm, gfn, sp);
+ pte_list_remove(spte, rmap_head);
}

/*
@@ -1077,20 +1076,21 @@ struct rmap_iterator {
*
* Returns sptep if found, NULL otherwise.
*/
-static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
+static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
+ struct rmap_iterator *iter)
{
u64 *sptep;

- if (!rmap)
+ if (!rmap_head->val)
return NULL;

- if (!(rmap & 1)) {
+ if (!(rmap_head->val & 1)) {
iter->desc = NULL;
- sptep = (u64 *)rmap;
+ sptep = (u64 *)rmap_head->val;
goto out;
}

- iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
+ iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
iter->pos = 0;
sptep = iter->desc->sptes[iter->pos];
out:
@@ -1131,9 +1131,9 @@ out:
return sptep;
}

-#define for_each_rmap_spte(_rmap_, _iter_, _spte_) \
- for (_spte_ = rmap_get_first(*_rmap_, _iter_); \
- _spte_; _spte_ = rmap_get_next(_iter_))
+#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \
+ for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \
+ _spte_; _spte_ = rmap_get_next(_iter_))

static void drop_spte(struct kvm *kvm, u64 *sptep)
{
@@ -1191,14 +1191,15 @@ static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
return mmu_spte_update(sptep, spte);
}

-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
+static bool __rmap_write_protect(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep)
flush |= spte_write_protect(kvm, sptep, pt_protect);

return flush;
@@ -1215,13 +1216,13 @@ static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
return mmu_spte_update(sptep, spte);
}

-static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep)
flush |= spte_clear_dirty(kvm, sptep);

return flush;
@@ -1238,13 +1239,13 @@ static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
return mmu_spte_update(sptep, spte);
}

-static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
+static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep)
flush |= spte_set_dirty(kvm, sptep);

return flush;
@@ -1264,12 +1265,12 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;

while (mask) {
- rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmapp, false);
+ rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PT_PAGE_TABLE_LEVEL, slot);
+ __rmap_write_protect(kvm, rmap_head, false);

/* clear the first set bit */
mask &= mask - 1;
@@ -1289,12 +1290,12 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;

while (mask) {
- rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PT_PAGE_TABLE_LEVEL, slot);
- __rmap_clear_dirty(kvm, rmapp);
+ rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PT_PAGE_TABLE_LEVEL, slot);
+ __rmap_clear_dirty(kvm, rmap_head);

/* clear the first set bit */
mask &= mask - 1;
@@ -1326,27 +1327,27 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
{
struct kvm_memory_slot *slot;
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;

slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
- rmapp = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(vcpu->kvm, rmapp, true);
+ rmap_head = __gfn_to_rmap(gfn, i, slot);
+ write_protected |= __rmap_write_protect(vcpu->kvm, rmap_head, true);
}

return write_protected;
}

-static bool kvm_zap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;

- while ((sptep = rmap_get_first(*rmapp, &iter))) {
+ while ((sptep = rmap_get_first(rmap_head, &iter))) {
rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);

drop_spte(kvm, sptep);
@@ -1356,14 +1357,14 @@ static bool kvm_zap_rmapp(struct kvm *kvm, unsigned long *rmapp)
return flush;
}

-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
unsigned long data)
{
- return kvm_zap_rmapp(kvm, rmapp);
+ return kvm_zap_rmapp(kvm, rmap_head);
}

-static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
unsigned long data)
{
@@ -1378,7 +1379,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
new_pfn = pte_pfn(*ptep);

restart:
- for_each_rmap_spte(rmapp, &iter, sptep) {
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
sptep, *sptep, gfn, level);

@@ -1416,11 +1417,11 @@ struct slot_rmap_walk_iterator {

/* output fields. */
gfn_t gfn;
- unsigned long *rmap;
+ struct kvm_rmap_head *rmap;
int level;

/* private field. */
- unsigned long *end_rmap;
+ struct kvm_rmap_head *end_rmap;
};

static void
@@ -1479,7 +1480,7 @@ static int kvm_handle_hva_range(struct kvm *kvm,
unsigned long end,
unsigned long data,
int (*handler)(struct kvm *kvm,
- unsigned long *rmapp,
+ struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot,
gfn_t gfn,
int level,
@@ -1523,7 +1524,8 @@ static int kvm_handle_hva_range(struct kvm *kvm,

static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
unsigned long data,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+ int (*handler)(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot,
gfn_t gfn, int level,
unsigned long data))
@@ -1546,7 +1548,7 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
}

-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
+static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
unsigned long data)
{
@@ -1556,18 +1558,19 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,

BUG_ON(!shadow_accessed_mask);

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
if (*sptep & shadow_accessed_mask) {
young = 1;
clear_bit((ffs(shadow_accessed_mask) - 1),
(unsigned long *)sptep);
}
+ }

trace_kvm_age_page(gfn, level, slot, young);
return young;
}

-static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
+static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn,
int level, unsigned long data)
{
@@ -1583,11 +1586,12 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
if (!shadow_accessed_mask)
goto out;

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
if (*sptep & shadow_accessed_mask) {
young = 1;
break;
}
+ }
out:
return young;
}
@@ -1596,14 +1600,14 @@ out:

static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
{
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;
struct kvm_mmu_page *sp;

sp = page_header(__pa(spte));

- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp);
+ rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp);

- kvm_unmap_rmapp(vcpu->kvm, rmapp, NULL, gfn, sp->role.level, 0);
+ kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, 0);
kvm_flush_remote_tlbs(vcpu->kvm);
}

@@ -1720,7 +1724,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
* this feature. See the comments in kvm_zap_obsolete_pages().
*/
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- sp->parent_ptes = 0;
+ sp->parent_ptes.val = 0;
mmu_page_add_parent_pte(vcpu, sp, parent_pte);
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
@@ -2273,7 +2277,7 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
u64 *sptep;
struct rmap_iterator iter;

- while ((sptep = rmap_get_first(sp->parent_ptes, &iter)))
+ while ((sptep = rmap_get_first(&sp->parent_ptes, &iter)))
drop_parent_pte(sp, sptep);
}

@@ -4485,7 +4489,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
}

/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_level_handler) (struct kvm *kvm, unsigned long *rmap);
+typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);

/* The caller should hold mmu-lock before calling this function. */
static bool
@@ -4579,9 +4583,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
spin_unlock(&kvm->mmu_lock);
}

-static bool slot_rmap_write_protect(struct kvm *kvm, unsigned long *rmapp)
+static bool slot_rmap_write_protect(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head)
{
- return __rmap_write_protect(kvm, rmapp, false);
+ return __rmap_write_protect(kvm, rmap_head, false);
}

void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
@@ -4617,7 +4622,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
}

static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
- unsigned long *rmapp)
+ struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -4626,7 +4631,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
struct kvm_mmu_page *sp;

restart:
- for_each_rmap_spte(rmapp, &iter, sptep) {
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
sp = page_header(__pa(sptep));
pfn = spte_to_pfn(*sptep);

diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 90ee420..1cee3ec 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -129,7 +129,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
{
static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;
struct kvm_mmu_page *rev_sp;
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -150,8 +150,8 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
return;
}

- rmapp = __gfn_to_rmap(gfn, rev_sp->role.level, slot);
- if (!*rmapp) {
+ rmap_head = __gfn_to_rmap(gfn, rev_sp->role.level, slot);
+ if (!rmap_head->val) {
if (!__ratelimit(&ratelimit_state))
return;
audit_printk(kvm, "no rmap for writable spte %llx\n",
@@ -192,7 +192,7 @@ static void check_mappings_rmap(struct kvm *kvm, struct kvm_mmu_page *sp)

static void audit_write_protection(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- unsigned long *rmapp;
+ struct kvm_rmap_head *rmap_head;
u64 *sptep;
struct rmap_iterator iter;
struct kvm_memslots *slots;
@@ -203,13 +203,14 @@ static void audit_write_protection(struct kvm *kvm, struct kvm_mmu_page *sp)

slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, sp->gfn);
- rmapp = __gfn_to_rmap(sp->gfn, PT_PAGE_TABLE_LEVEL, slot);
+ rmap_head = __gfn_to_rmap(sp->gfn, PT_PAGE_TABLE_LEVEL, slot);

- for_each_rmap_spte(rmapp, &iter, sptep)
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
if (is_writable_pte(*sptep))
audit_printk(kvm, "shadow page has writable "
"mappings: gfn %llx role %x\n",
sp->gfn, sp->role.word);
+ }
}

static void audit_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
--
2.1.0

2015-11-12 11:42:59

by Takuya Yoshikawa

Subject: [PATCH 08/10] KVM: x86: MMU: Move initialization of parent_ptes out from kvm_mmu_alloc_page()

Make kvm_mmu_alloc_page() do just what its name suggests, and remove
the extra error check at its call site since the allocation cannot fail.

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 85f4bbd..9273cd4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1707,8 +1707,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
mmu_spte_clear_no_track(parent_pte);
}

-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
- u64 *parent_pte, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
{
struct kvm_mmu_page *sp;

@@ -1724,8 +1723,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
* this feature. See the comments in kvm_zap_obsolete_pages().
*/
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- sp->parent_ptes.val = 0;
- mmu_page_add_parent_pte(vcpu, sp, parent_pte);
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
}
@@ -2124,10 +2121,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
trace_kvm_mmu_get_page(sp, false);
return sp;
}
+
++vcpu->kvm->stat.mmu_cache_miss;
- sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
- if (!sp)
- return sp;
+
+ sp = kvm_mmu_alloc_page(vcpu, direct);
+
+ sp->parent_ptes.val = 0;
+ mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+
sp->gfn = gfn;
sp->role = role;
hlist_add_head(&sp->hash_link,
--
2.1.0

2015-11-12 11:43:57

by Takuya Yoshikawa

Subject: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()

Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
argument, link_shadow_page() follows it to set the parent entry so
that the new mapping will point to the returned page table.

Moving the parent_pte handling there allows us to clean up the code,
because parent_pte is passed to kvm_mmu_get_page() just for
mark_unsync() and mmu_page_add_parent_pte().

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 21 ++++++++-------------
arch/x86/kvm/paging_tmpl.h | 6 ++----
2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9273cd4..33fe720 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2108,14 +2108,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
if (sp->unsync_children) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_mmu_mark_parents_unsync(sp);
- if (parent_pte)
- mark_unsync(parent_pte);
} else if (sp->unsync) {
kvm_mmu_mark_parents_unsync(sp);
- if (parent_pte)
- mark_unsync(parent_pte);
}
- mmu_page_add_parent_pte(vcpu, sp, parent_pte);

__clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
@@ -2127,7 +2122,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
sp = kvm_mmu_alloc_page(vcpu, direct);

sp->parent_ptes.val = 0;
- mmu_page_add_parent_pte(vcpu, sp, parent_pte);

sp->gfn = gfn;
sp->role = role;
@@ -2196,7 +2190,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
return __shadow_walk_next(iterator, *iterator->sptep);
}

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+ struct kvm_mmu_page *sp, bool accessed)
{
u64 spte;

@@ -2210,6 +2205,11 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
spte |= shadow_accessed_mask;

mmu_spte_set(sptep, spte);
+
+ if (sp->unsync_children || sp->unsync)
+ mark_unsync(sptep);
+
+ mmu_page_add_parent_pte(vcpu, sp, sptep);
}

static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
@@ -2268,11 +2268,6 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
mmu_page_zap_pte(kvm, sp, sp->spt + i);
}

-static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
-{
- mmu_page_remove_parent_pte(sp, parent_pte);
-}
-
static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
{
u64 *sptep;
@@ -2738,7 +2733,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
iterator.level - 1,
1, ACC_ALL, iterator.sptep);

- link_shadow_page(iterator.sptep, sp, true);
+ link_shadow_page(vcpu, iterator.sptep, sp, true);
}
}
return emulate;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 9d21b44..f414ca6 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -598,7 +598,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
goto out_gpte_changed;

if (sp)
- link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
+ link_shadow_page(vcpu, it.sptep, sp, PT_GUEST_ACCESSED_MASK);
}

for (;
@@ -618,7 +618,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,

sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
true, direct_access, it.sptep);
- link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
+ link_shadow_page(vcpu, it.sptep, sp, PT_GUEST_ACCESSED_MASK);
}

clear_sp_write_flooding_count(it.sptep);
@@ -629,8 +629,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
return emulate;

out_gpte_changed:
- if (sp)
- kvm_mmu_put_page(sp, it.sptep);
kvm_release_pfn_clean(pfn);
return 0;
}
--
2.1.0

2015-11-12 11:44:58

by Takuya Yoshikawa

Subject: [PATCH 10/10] KVM: x86: MMU: Remove unused parameter parent_pte from kvm_mmu_get_page()

Signed-off-by: Takuya Yoshikawa <[email protected]>
---
arch/x86/kvm/mmu.c | 20 +++++++-------------
arch/x86/kvm/paging_tmpl.h | 4 ++--
2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 33fe720..101e77d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2072,8 +2072,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
gva_t gaddr,
unsigned level,
int direct,
- unsigned access,
- u64 *parent_pte)
+ unsigned access)
{
union kvm_mmu_page_role role;
unsigned quadrant;
@@ -2730,8 +2729,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
pseudo_gfn = base_addr >> PAGE_SHIFT;
sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
- iterator.level - 1,
- 1, ACC_ALL, iterator.sptep);
+ iterator.level - 1, 1, ACC_ALL);

link_shadow_page(vcpu, iterator.sptep, sp, true);
}
@@ -3088,8 +3086,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
- sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL,
- 1, ACC_ALL, NULL);
+ sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
++sp->root_count;
spin_unlock(&vcpu->kvm->mmu_lock);
vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3101,9 +3098,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
- i << 30,
- PT32_ROOT_LEVEL, 1, ACC_ALL,
- NULL);
+ i << 30, PT32_ROOT_LEVEL, 1, ACC_ALL);
root = __pa(sp->spt);
++sp->root_count;
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3140,7 +3135,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
- 0, ACC_ALL, NULL);
+ 0, ACC_ALL);
root = __pa(sp->spt);
++sp->root_count;
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3173,9 +3168,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
spin_lock(&vcpu->kvm->mmu_lock);
make_mmu_pages_available(vcpu);
- sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
- PT32_ROOT_LEVEL, 0,
- ACC_ALL, NULL);
+ sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL,
+ 0, ACC_ALL);
root = __pa(sp->spt);
++sp->root_count;
spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f414ca6..ee9d211 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -587,7 +587,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
if (!is_shadow_present_pte(*it.sptep)) {
table_gfn = gw->table_gfn[it.level - 2];
sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
- false, access, it.sptep);
+ false, access);
}

/*
@@ -617,7 +617,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
direct_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);

sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
- true, direct_access, it.sptep);
+ true, direct_access);
link_shadow_page(vcpu, it.sptep, sp, PT_GUEST_ACCESSED_MASK);
}

--
2.1.0

2015-11-12 12:08:13

by Paolo Bonzini

Subject: Re: [PATCH 00/10 V2] KVM: x86: MMU: Clean up x86's mmu code for future work



On 12/11/2015 12:48, Takuya Yoshikawa wrote:
> v1->v2:
> Patch 5 and 7 are added based on Paolo's suggestions.
> Patch 8-10 are new.
>
> Patch 1/2/3/4: no change.
> Patch 5: Needed a bit more work than I had expected.
> Patch 6: Removed extra comment of v1 (patch 5 made it inappropriate).
> Patch 7: As expected, many places needed to be converted.
> Patch 8: This is new, but only a small change.
>
> Patch 9: Kind of an RFC (though I have checked it to some extent).
> The following two places need to be carefully checked:
> - in kvm_mmu_get_page: "if (!direct)" block after kvm_mmu_alloc_page()
> - in FNAME(fetch): "if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))" case
> Patch 10: Trivial cleanup, assuming that patch 9 is correct.
>
>
> In summary: patches 1-7 are the result of updating v1 based on the suggestions.
> Although patch 5 does not look as nice as expected, this is the most
> conservative approach, and patches 8-10 try to alleviate the sadness.

If it works, it's actually better than what we have now. I'll review it
in a few days.

Marcelo, can you look at this as well? You are still king of MMU. :)

Thanks,

Paolo

2015-11-12 14:27:32

by Paolo Bonzini

Subject: Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()



On 12/11/2015 12:56, Takuya Yoshikawa wrote:
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 9d21b44..f414ca6 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -598,7 +598,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> goto out_gpte_changed;
>
> if (sp)
> - link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
> + link_shadow_page(vcpu, it.sptep, sp, PT_GUEST_ACCESSED_MASK);
> }
>

Here I think you can remove completely the

	if (sp)
		kvm_mmu_put_page(sp, it.sptep);

later in FNAME(fetch). Apart from this nit, it's okay.

On to kvm_mmu_get_page...

	if (!direct) {
		if (rmap_write_protect(vcpu, gfn))
			kvm_flush_remote_tlbs(vcpu->kvm);
		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
			kvm_sync_pages(vcpu, gfn);

This seems fishy.

need_sync is set if sp->unsync, but then the parents have not been
unsynced yet.

On the other hand, all calls to kvm_mmu_get_page except for the
roots are followed by link_shadow_page... Perhaps if parent_pte != NULL
you can call link_shadow_page directly from kvm_mmu_get_page. The call
would go before the "if (!direct)" and it would subsume all the existing
calls.

We could probably also warn if

(parent_pte == NULL)
!= (level == vcpu->arch.mmu.root_level)

in kvm_mmu_get_page.
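
For concreteness, that check might look like this (just a sketch):

	WARN_ON((parent_pte == NULL) !=
		(level == vcpu->arch.mmu.root_level));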

Paolo

2015-11-12 17:03:34

by Paolo Bonzini

Subject: Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()



On 12/11/2015 15:27, Paolo Bonzini wrote:
> Here I think you can remove completely the
>
> 	if (sp)
> 		kvm_mmu_put_page(sp, it.sptep);
>
> later in FNAME(fetch). Apart from this nit, it's okay.

Removing this is of course not possible anymore if the other suggestion
works out.

Paolo

2015-11-13 02:14:11

by Takuya Yoshikawa

Subject: Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()

On 2015/11/12 23:27, Paolo Bonzini wrote:

> On 12/11/2015 12:56, Takuya Yoshikawa wrote:
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 9d21b44..f414ca6 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -598,7 +598,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>> goto out_gpte_changed;
>>
>> if (sp)
>> - link_shadow_page(it.sptep, sp, PT_GUEST_ACCESSED_MASK);
>> + link_shadow_page(vcpu, it.sptep, sp, PT_GUEST_ACCESSED_MASK);
>> }
>>
>
> Here I think you can remove completely the
>
> 	if (sp)
> 		kvm_mmu_put_page(sp, it.sptep);
>
> later in FNAME(fetch). Apart from this nit, it's okay.

Yes, that's what this patch does below:

> @@ -629,8 +629,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> return emulate;
>
> out_gpte_changed:
> - if (sp)
> - kvm_mmu_put_page(sp, it.sptep);
> kvm_release_pfn_clean(pfn);
> return 0;
> }

Since this is the only user of kvm_mmu_put_page(), it also removes
the definition:

> @@ -2268,11 +2268,6 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
> mmu_page_zap_pte(kvm, sp, sp->spt + i);
> }
>
> -static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
> -{
> - mmu_page_remove_parent_pte(sp, parent_pte);
> -}
> -
> static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> u64 *sptep;

Actually, I don't understand why this is named kvm_mmu_put_page() when
it just removes the parent_pte pointer from the sp->parent_ptes chain.


> On to kvm_mmu_get_page...
>
> 	if (!direct) {
> 		if (rmap_write_protect(vcpu, gfn))
> 			kvm_flush_remote_tlbs(vcpu->kvm);
> 		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
> 			kvm_sync_pages(vcpu, gfn);
>
> This seems fishy.
>
> need_sync is set if sp->unsync, but then the parents have not been
> unsynced yet.

Reaching here means that kvm_mmu_get_page() could not return sp
from inside the for_each_gfn_sp() loop above, so even without
this patch, mark_unsync() has not been called.

Here, sp holds the new page allocated by kvm_mmu_alloc_page().
One confusing thing is that hlist_add_head() right before this
"if (!direct)" line has already added the new sp to the hash
list, so it will be found by for_each_gfn_indirect_valid_sp()
in kvm_sync_pages().

Because this sp is new and sp->unsync is not set, kvm_sync_pages()
will just skip it and look for other sp's whose ->unsync were found
to be set in the for_each_gfn_sp() loop.

I'm not 100% sure if the existence of the parent_pte pointer in the
newly created sp->parent_ptes chain alone makes any difference:
> @@ -2127,7 +2122,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> sp = kvm_mmu_alloc_page(vcpu, direct);
>
> sp->parent_ptes.val = 0;
> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> sp->gfn = gfn;
> sp->role = role;


> On the other hand, all calls to kvm_mmu_get_page except for the
> roots are followed by link_shadow_page... Perhaps if parent_pte != NULL
> you can call link_shadow_page directly from kvm_mmu_get_page. The call
> would go before the "if (!direct)" and it would subsume all the existing
> calls.
>
> We could probably also warn if
>
> (parent_pte == NULL)
> != (level == vcpu->arch.mmu.root_level)
>
> in kvm_mmu_get_page.

I think we should set the spte after init_shadow_page_table(), and
to make this subsume all the existing calls, we need to change the
"return sp;" in the for_each_gfn_sp() loop to a goto statement so
that the end of this function will become something like this:

	init_shadow_page(sp);
out:
	if (parent_pte) {
		mmu_page_add_parent_pte(vcpu, sp, parent_pte);
		link_shadow_page(parent_pte, sp, accessed);
	}
	trace_kvm_mmu_get_page(sp, created);
	return sp;

So, "bool accessed" needs to be passed to kvm_mmu_get_page().
But anyway, we need to understand whether mmu_page_add_parent_pte()
really needs to be placed before the "if (!direct)" block.

Takuya

2015-11-13 10:51:16

by Paolo Bonzini

Subject: Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()



On 13/11/2015 03:15, Takuya Yoshikawa wrote:
> Actually, I don't understand why this is named kvm_mmu_put_page() when
> it just removes the parent_pte pointer from the sp->parent_ptes chain.

Because it undoes kvm_mmu_get_page, I guess. :)

>
>> On to kvm_mmu_get_page...
>>
>> 	if (!direct) {
>> 		if (rmap_write_protect(vcpu, gfn))
>> 			kvm_flush_remote_tlbs(vcpu->kvm);
>> 		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
>> 			kvm_sync_pages(vcpu, gfn);
>>
>> This seems fishy.
>>
>> need_sync is set if sp->unsync, but then the parents have not been
>> unsynced yet.
>
> Reaching here means that kvm_mmu_get_page() could not return sp
> from inside the for_each_gfn_sp() loop above, so even without
> this patch, mark_unsync() has not been called.

You're right.

> Here, sp holds the new page allocated by kvm_mmu_alloc_page().
> One confusing thing is that hlist_add_head() right before this
> "if (!direct)" line has already added the new sp to the hash
> list, so it will be found by for_each_gfn_indirect_valid_sp()
> in kvm_sync_pages().
>
> Because this sp is new and sp->unsync is not set, kvm_sync_pages()
> will just skip it and look for other sp's whose ->unsync were found
> to be set in the for_each_gfn_sp() loop.
>
> I'm not 100% sure if the existence of the parent_pte pointer in the
> newly created sp->parent_ptes chain alone makes any difference:

No, I don't think so. Nothing needs the parent_ptes at this point:

- kvm_mmu_mark_parents_unsync, even in the existing code, is called
before the new SPTE is created.

- as you said, kvm_mmu_prepare_zap_page can be called by kvm_sync_pages
but it will not operate on this page because its ->unsync is zero.

> So, "bool accessed" needs to be passed to kvm_mmu_get_page().

The "bool accessed" parameter is not necessary, I think. It is only
false in the nested EPT case, and there's no reason not to set the
accessed bit *in the shadow page* if the host supports EPT
accessed/dirty bits. I'll test and send a patch to remove the argument.

> But anyway, we need to understand whether mmu_page_add_parent_pte()
> really needs to be placed before the "if (!direct)" block.

No, I don't think so anymore.

I think these patches are fine as a starting point for further cleanups,
I'll push them to kvm/queue very soon.

Paolo

2015-11-13 22:06:32

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote:
> kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
> nearly the same as the for_each_rmap_spte macro. The only difference
> is that is_shadow_present_pte() checks cannot be placed there because
> kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
> whose entry is not set yet.
>
> By calling mark_unsync() separately for the parent and adding the parent
> pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
> works with no problem.
>
> Signed-off-by: Takuya Yoshikawa <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 36 +++++++++++++-----------------------
> 1 file changed, 13 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e8cfdc4..1691171 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> }
> }
>
> -typedef void (*pte_list_walk_fn) (u64 *spte);
> -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> -{
> - struct pte_list_desc *desc;
> - int i;
> -
> - if (!*pte_list)
> - return;
> -
> - if (!(*pte_list & 1))
> - return fn((u64 *)*pte_list);
> -
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> - fn(desc->sptes[i]);
> - desc = desc->more;
> - }
> -}
> -
> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> struct kvm_memory_slot *slot)
> {
> @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> static void mark_unsync(u64 *spte);
> static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
> {
> - pte_list_walk(&sp->parent_ptes, mark_unsync);
> + u64 *sptep;
> + struct rmap_iterator iter;
> +
> + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
> + mark_unsync(sptep);
> + }
> }
>
> static void mark_unsync(u64 *spte)
> @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,

Faulting in a spte: one of the levels of sptes, either

spte-1
spte-2
spte-3

has the present bit clear. So we're searching for a guest page to shadow, with
gfn "gfn".

> if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
> break;

If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm
sptes.

> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);

add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to
the parent.
> if (sp->unsync_children) {
> kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> kvm_mmu_mark_parents_unsync(sp);

kvm_mmu_mark_parents_unsync relied on the links from current level all
the way to top level to mark all levels unsync, so that on guest entry,
KVM_REQ_MMU_SYNC is processed and any level is brought from guest -->
kvm pages. This now fails, because you removed "mmu_page_add_parent_pte"
(the link is not formed all the way to root).

Unless i am missing something, this is not correct.

> - } else if (sp->unsync)
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + } else if (sp->unsync) {
> kvm_mmu_mark_parents_unsync(sp);
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + }
> + mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> __clear_sp_write_flooding_count(sp);
> trace_kvm_mmu_get_page(sp, false);

2015-11-13 22:08:44

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 06/10] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes

On Thu, Nov 12, 2015 at 08:53:43PM +0900, Takuya Yoshikawa wrote:
> At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
> placed right after the call to detect unrelated sptes which must not be
> found in the reverse-mapping list.
>
> Move this check into rmap_get_first/next() so that all call sites, not
> just the users of the for_each_rmap_spte() macro, will be checked in the
> same way. In addition, change the BUG_ON to WARN_ON since killing the
> whole host is the last thing that KVM should try.

It should be a BUG_ON: if KVM continues, it will corrupt (more) memory.

> One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
> rmap_get_first() to handle parent sptes. The change will not break it
> because parent sptes are present, at least until drop_parent_pte()
> actually unlinks them, and are not mmio-sptes.
>
> Signed-off-by: Takuya Yoshikawa <[email protected]>
> ---
> Documentation/virtual/kvm/mmu.txt | 4 ++--
> arch/x86/kvm/mmu.c | 26 +++++++++++++++++---------
> 2 files changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 3a4d681..daf9c0f 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -203,10 +203,10 @@ Shadow pages contain the following information:
> page cannot be destroyed. See role.invalid.
> parent_ptes:
> The reverse mapping for the pte/ptes pointing at this page's spt. If
> - parent_ptes bit 0 is zero, only one spte points at this pages and
> + parent_ptes bit 0 is zero, only one spte points at this page and
> parent_ptes points at this single spte, otherwise, there exists multiple
> sptes pointing at this page and (parent_ptes & ~0x1) points at a data
> - structure with a list of parent_ptes.
> + structure with a list of parent sptes.
> unsync:
> If true, then the translations in this page may not match the guest's
> translation. This is equivalent to the state of the tlb when a pte is
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1691171..ee7b101 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1079,17 +1079,23 @@ struct rmap_iterator {
> */
> static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> {
> + u64 *sptep;
> +
> if (!rmap)
> return NULL;
>
> if (!(rmap & 1)) {
> iter->desc = NULL;
> - return (u64 *)rmap;
> + sptep = (u64 *)rmap;
> + goto out;
> }
>
> iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
> iter->pos = 0;
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> +out:
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
> }
>
> /*
> @@ -1099,14 +1105,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> */
> static u64 *rmap_get_next(struct rmap_iterator *iter)
> {
> + u64 *sptep;
> +
> if (iter->desc) {
> if (iter->pos < PTE_LIST_EXT - 1) {
> - u64 *sptep;
> -
> ++iter->pos;
> sptep = iter->desc->sptes[iter->pos];
> if (sptep)
> - return sptep;
> + goto out;
> }
>
> iter->desc = iter->desc->more;
> @@ -1114,17 +1120,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
> if (iter->desc) {
> iter->pos = 0;
> /* desc->sptes[0] cannot be NULL */
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> + goto out;
> }
> }
>
> return NULL;
> +out:
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
> }
>
> #define for_each_rmap_spte(_rmap_, _iter_, _spte_) \
> for (_spte_ = rmap_get_first(*_rmap_, _iter_); \
> - _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;}); \
> - _spte_ = rmap_get_next(_iter_))
> + _spte_; _spte_ = rmap_get_next(_iter_))
>
> static void drop_spte(struct kvm *kvm, u64 *sptep)
> {
> @@ -1338,7 +1347,6 @@ static bool kvm_zap_rmapp(struct kvm *kvm, unsigned long *rmapp)
> bool flush = false;
>
> while ((sptep = rmap_get_first(*rmapp, &iter))) {
> - BUG_ON(!(*sptep & PT_PRESENT_MASK));
> rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
>
> drop_spte(kvm, sptep);
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
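
To make the encoding and the iterator semantics concrete, here is a
stand-alone user-space model (hypothetical and simplified: the names
mirror the kernel's, but PTE_LIST_EXT is picked here for illustration,
plain integers stand in for sptes, and the is_shadow_present_pte()
WARN_ON discussed above is omitted). Bit 0 of the head distinguishes a
single spte pointer from a pointer to a descriptor list, exactly as the
mmu.txt hunk above describes:

	#include <stdio.h>
	#include <stdint.h>

	#define PTE_LIST_EXT 3

	struct pte_list_desc {
		uint64_t *sptes[PTE_LIST_EXT];
		struct pte_list_desc *more;
	};

	struct rmap_iterator {
		struct pte_list_desc *desc;
		int pos;
	};

	static uint64_t *rmap_get_first(unsigned long rmap,
					struct rmap_iterator *iter)
	{
		if (!rmap)
			return NULL;		/* empty chain */
		if (!(rmap & 1)) {
			iter->desc = NULL;	/* bit 0 clear: head is the spte */
			return (uint64_t *)rmap;
		}
		/* bit 0 set: head points (minus the tag) to a descriptor list */
		iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
		iter->pos = 0;
		return iter->desc->sptes[0];
	}

	static uint64_t *rmap_get_next(struct rmap_iterator *iter)
	{
		if (iter->desc) {
			if (iter->pos < PTE_LIST_EXT - 1) {
				++iter->pos;
				if (iter->desc->sptes[iter->pos])
					return iter->desc->sptes[iter->pos];
			}
			iter->desc = iter->desc->more;
			if (iter->desc) {
				iter->pos = 0;
				return iter->desc->sptes[0]; /* never NULL */
			}
		}
		return NULL;
	}

	int main(void)
	{
		uint64_t a = 1, b = 2, c = 3, d = 4;
		struct pte_list_desc d2 = { { &d, NULL, NULL }, NULL };
		struct pte_list_desc d1 = { { &a, &b, &c }, &d2 };
		unsigned long head = (unsigned long)&d1 | 1; /* multi-entry */
		struct rmap_iterator iter;
		uint64_t *sptep;

		for (sptep = rmap_get_first(head, &iter); sptep;
		     sptep = rmap_get_next(&iter))
			printf("spte = %llu\n", (unsigned long long)*sptep);
		return 0;
	}

Compiled stand-alone, this walks the chain and prints the four values in
order; the kernel versions add the present check that the patch above
moves into the two functions.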

2015-11-15 21:38:22

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

On Fri, Nov 13, 2015 at 07:47:28PM -0200, Marcelo Tosatti wrote:
> On Thu, Nov 12, 2015 at 08:52:45PM +0900, Takuya Yoshikawa wrote:
> > kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
> > nearly the same as the for_each_rmap_spte macro. The only difference
> > is that is_shadow_present_pte() checks cannot be placed there because
> > kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
> > whose entry is not set yet.
> >
> > By calling mark_unsync() separately for the parent and adding the parent
> > pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
> > works with no problem.
> >
> > Signed-off-by: Takuya Yoshikawa <[email protected]>
> > ---
> > arch/x86/kvm/mmu.c | 36 +++++++++++++-----------------------
> > 1 file changed, 13 insertions(+), 23 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index e8cfdc4..1691171 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> > }
> > }
> >
> > -typedef void (*pte_list_walk_fn) (u64 *spte);
> > -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> > -{
> > - struct pte_list_desc *desc;
> > - int i;
> > -
> > - if (!*pte_list)
> > - return;
> > -
> > - if (!(*pte_list & 1))
> > - return fn((u64 *)*pte_list);
> > -
> > - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> > - while (desc) {
> > - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> > - fn(desc->sptes[i]);
> > - desc = desc->more;
> > - }
> > -}
> > -
> > static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> > struct kvm_memory_slot *slot)
> > {
> > @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> > static void mark_unsync(u64 *spte);
> > static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
> > {
> > - pte_list_walk(&sp->parent_ptes, mark_unsync);
> > + u64 *sptep;
> > + struct rmap_iterator iter;
> > +
> > + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
> > + mark_unsync(sptep);
> > + }
> > }
> >
> > static void mark_unsync(u64 *spte)
> > @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>
> Faulting in a spte: one of the levels of sptes, either
>
> spte-1
> spte-2
> spte-3
>
> has the present bit clear. So we're searching for a guest page to shadow, with
> gfn "gfn".
>
> > if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
> > break;
>
> If a shadow for gfn exists, but is unsync, sync guest-page ---to--> kvm
> sptes.
>
> > - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> add "gfn" (actually its "struct kvm_mmu_page *sp" pointer) to
> the parent.
> > if (sp->unsync_children) {
> > kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > kvm_mmu_mark_parents_unsync(sp);
>
> kvm_mmu_mark_parents_unsync relied on the links from current level all
> the way to top level to mark all levels unsync, so that on guest entry,
> KVM_REQ_MMU_SYNC is processed and any level is brought from guest -->
> kvm pages. This now fails, because you removed "mmu_page_add_parent_pte"
> (the link is not formed all the way to root).
>
> Unless i am missing something, this is not correct.

The actual issue is this: a higher level page that had, under its children,
no out-of-sync pages now has, due to your addition, a child that is unsync:

initial state:
level1

final state:

level1 -x-> level2 -x-> level3

Where -x-> are the links created by this pagefault fixing round.

If _any_ page under you is unsync (not necessarily the ones this
pagefault is accessing), you have to mark parents unsync.

2015-11-16 02:50:18

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

On 2015/11/14 18:20, Marcelo Tosatti wrote:

> The actual issue is this: a higher level page that had, under its children,
> no out-of-sync pages now has, due to your addition, a child that is unsync:
>
> initial state:
> level1
>
> final state:
>
> level1 -x-> level2 -x-> level3
>
> Where -x-> are the links created by this pagefault fixing round.
>
> If _any_ page under you is unsync (not necessarily the ones this
> pagefault is accessing), you have to mark parents unsync.

I understand this, but I don't think my patch will break this.

What kvm_mmu_mark_parents_unsync() does is:

for each p_i in sp->parent_ptes rmap chain
	mark_unsync(p_i);

Then, mark_unsync() finds the parent sp containing that p_i to
set ->unsync_child_bitmap and increment ->unsync_children if
necessary. It may also call kvm_mmu_mark_parents_unsync()
recursively.

I understand we need to tell the parents "you have an unsync
child/descendant" until this information reaches the top level
through those recursive calls.

But since these recursive calls cannot come back to the starting sp
(the child->parent graph has no loop), each mark_unsync(p_i) will not
be affected by the other parents in the sp->parent_ptes rmap chain
from which we started the recursive calls.
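
A stand-alone toy model of this propagation (user-space C, simplified:
each page here has a single parent pointer, whereas the kernel walks a
whole parent_ptes rmap chain at every level, and mark_unsync() really
takes a u64 *spte) shows both the climb to the root and the early stop
once an ancestor already knows:

	#include <stdio.h>

	struct sp {
		struct sp *parent;	/* kernel: an rmap chain of parent ptes */
		int idx;		/* our slot in the parent's page table */
		unsigned long unsync_child_bitmap;
		unsigned int unsync_children;
	};

	/* Record that the child at @idx under @sp is (or contains) unsync. */
	static void mark_unsync(struct sp *sp, int idx)
	{
		if (sp->unsync_child_bitmap & (1ul << idx))
			return;		/* this subtree was already reported */
		sp->unsync_child_bitmap |= 1ul << idx;
		if (sp->unsync_children++)	/* sp already had an unsync   */
			return;			/* child: ancestors know      */
		if (sp->parent)
			mark_unsync(sp->parent, sp->idx); /* first one: climb */
	}

	int main(void)
	{
		struct sp root = { NULL, 0, 0, 0 };
		struct sp mid  = { &root, 5, 0, 0 };
		struct sp leaf = { &mid,  2, 0, 0 };

		mark_unsync(&leaf, 0);	/* climbs: leaf -> mid -> root */
		mark_unsync(&leaf, 1);	/* stops at leaf: already known */
		printf("root: unsync_children=%u bitmap=%#lx\n",
		       root.unsync_children, root.unsync_child_bitmap);
		return 0;
	}

The second call returns at leaf because unsync_children was already
nonzero, which is exactly why each climb terminates and why the climbs
started from different p_i are independent of one another.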


As the following code shows, my patch does mark_unsync(parent_pte)
separately, and then mmu_page_add_parent_pte(vcpu, sp, parent_pte):

> - } else if (sp->unsync)
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + } else if (sp->unsync) {
> kvm_mmu_mark_parents_unsync(sp);
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + }
> + mmu_page_add_parent_pte(vcpu, sp, parent_pte);

So, as you worried, while each mark_unsync(p_i) is processed,
this parent_pte does not exist in that sp->parent_ptes rmap chain.

But as I explained above, this does not change anything about what
each mark_unsync(p_i) call does, so keeps the original behaviour.


By the way, I think the names "kvm_mmu_mark_parents_unsync" and
"mark_unsync" do not describe well what they actually do. When I first
saw the names, I thought they would just set the parents' sp->unsync.

To reflect the following meaning better, it should be
propagate_unsync(_to_parents) or something:

Tell the parents "you have an unsync child/descendant"
until this unsync information reaches the top level


Thanks,
Takuya

2015-11-16 03:33:49

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 06/10] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes

On 2015/11/14 7:08, Marcelo Tosatti wrote:
> On Thu, Nov 12, 2015 at 08:53:43PM +0900, Takuya Yoshikawa wrote:
>> At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
>> placed right after the call to detect unrelated sptes which must not be
>> found in the reverse-mapping list.
>>
>> Move this check into rmap_get_first/next() so that all call sites, not
>> just the users of the for_each_rmap_spte() macro, will be checked in the
>> same way. In addition, change the BUG_ON to WARN_ON since killing the
>> whole host is the last thing that KVM should try.
>
> It should be a BUG_ON: if KVM continues, it will corrupt (more) memory.

In the sense that we cannot predict what kind of corruption it will
cause, I agree with you.

But if it can only corrupt that guest's memory, it is a bit sad to
kill unrelated guests, and the host, too. Anyway, since we cannot say
for sure what a possible bug can cause, I agree with you now.

Thanks,
Takuya

2015-11-17 17:58:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()



On 16/11/2015 03:51, Takuya Yoshikawa wrote:
> What kvm_mmu_mark_parents_unsync() does is:
>
> for each p_i in sp->parent_ptes rmap chain
> 	mark_unsync(p_i);
>
> Then, mark_unsync() finds the parent sp containing that p_i to
> set ->unsync_child_bitmap and increment ->unsync_children if
> necessary. It may also call kvm_mmu_mark_parents_unsync()
> recursively.

I agree. sp->parent_ptes goes up one level from sp;
kvm_mmu_mark_parents_unsync(sp) visits the level above sp, while
mark_unsync(sp) visits sp and all the levels above it.

Calling mark_unsync(parent_pte) is enough to complete the visit, after
kvm_mmu_mark_parents_unsync has already processed the level above sp.

Paolo

2015-11-18 02:51:29

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap



On 11/12/2015 07:50 PM, Takuya Yoshikawa wrote:
> + if (!ret) {
> + clear_unsync_child_bit(sp, i);
> + continue;
> + } else if (ret > 0) {
> nr_unsync_leaf += ret;

Just a single line here, braces are unnecessary.

> - else
> + } else
> return ret;
> } else if (child->unsync) {
> nr_unsync_leaf++;
> if (mmu_pages_add(pvec, child, i))
> return -ENOSPC;
> } else
> - goto clear_child_bitmap;
> -
> - continue;
> -
> -clear_child_bitmap:
> - __clear_bit(i, sp->unsync_child_bitmap);
> - sp->unsync_children--;
> - WARN_ON((int)sp->unsync_children < 0);
> + clear_unsync_child_bit(sp, i);
> }
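
For context, the helper under review simply factors out the three lines
that the removed clear_child_bitmap label used to perform; judging from
the hunk above, it is presumably along these lines (a reconstruction,
not a quote from the patch):

	static void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
	{
		__clear_bit(idx, sp->unsync_child_bitmap);
		--sp->unsync_children;
		WARN_ON((int)sp->unsync_children < 0);
	}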

2015-11-18 03:14:31

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()



On 11/12/2015 07:52 PM, Takuya Yoshikawa wrote:
> kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
> nearly the same as the for_each_rmap_spte macro. The only difference
> is that is_shadow_present_pte() checks cannot be placed there because
> kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
> whose entry is not set yet.
>
> By calling mark_unsync() separately for the parent and adding the parent
> pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
> works with no problem.
>
> Signed-off-by: Takuya Yoshikawa <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 36 +++++++++++++-----------------------
> 1 file changed, 13 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e8cfdc4..1691171 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> }
> }
>
> -typedef void (*pte_list_walk_fn) (u64 *spte);
> -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> -{
> - struct pte_list_desc *desc;
> - int i;
> -
> - if (!*pte_list)
> - return;
> -
> - if (!(*pte_list & 1))
> - return fn((u64 *)*pte_list);
> -
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> - fn(desc->sptes[i]);
> - desc = desc->more;
> - }
> -}
> -
> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> struct kvm_memory_slot *slot)
> {
> @@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> static void mark_unsync(u64 *spte);
> static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
> {
> - pte_list_walk(&sp->parent_ptes, mark_unsync);
> + u64 *sptep;
> + struct rmap_iterator iter;
> +
> + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
> + mark_unsync(sptep);
> + }
> }
>
> static void mark_unsync(u64 *spte)
> @@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
> break;
>
> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
> if (sp->unsync_children) {
> kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> kvm_mmu_mark_parents_unsync(sp);

After your change above, @sp's parents have not been changed, so there is no need to call it now.

> - } else if (sp->unsync)
> + if (parent_pte)
> + mark_unsync(parent_pte);
> + } else if (sp->unsync) {
> kvm_mmu_mark_parents_unsync(sp);

Ditto.

> + if (parent_pte)
> + mark_unsync(parent_pte);
> + }
> + mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> __clear_sp_write_flooding_count(sp);
> trace_kvm_mmu_get_page(sp, false);
>

2015-11-18 03:28:19

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 07/10] KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct



On 11/12/2015 07:55 PM, Takuya Yoshikawa wrote:
> @@ -1720,7 +1724,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
> * this feature. See the comments in kvm_zap_obsolete_pages().
> */
> list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> - sp->parent_ptes = 0;
> + sp->parent_ptes.val = 0;

The sp is allocated from kmem_cache_zalloc(), so explicitly initializing it
to zero is not needed.
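
In other words, because the allocation path boils down to something like

	/* zero-filled memory: parent_ptes.val, like every other field,
	 * already starts out as 0 */
	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL);

the explicit assignment only restates what the allocator already
guarantees. (The real call chain goes through the per-vcpu
mmu_page_header_cache topup path, but the zeroing guarantee is the same.)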

2015-11-18 03:39:32

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()



On 11/12/2015 07:56 PM, Takuya Yoshikawa wrote:
> Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
> argument, link_shadow_page() follows that to set the parent entry so
> that the new mapping will point to the returned page table.
>
> Moving parent_pte handling there allows us to clean up the code because
> parent_pte is passed to kvm_mmu_get_page() just for mark_unsync() and
> mmu_page_add_parent_pte().
>
> Signed-off-by: Takuya Yoshikawa <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 21 ++++++++-------------
> arch/x86/kvm/paging_tmpl.h | 6 ++----
> 2 files changed, 10 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9273cd4..33fe720 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2108,14 +2108,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> if (sp->unsync_children) {
> kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> kvm_mmu_mark_parents_unsync(sp);
> - if (parent_pte)
> - mark_unsync(parent_pte);
> } else if (sp->unsync) {
> kvm_mmu_mark_parents_unsync(sp);
> - if (parent_pte)
> - mark_unsync(parent_pte);
> }
> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> __clear_sp_write_flooding_count(sp);
> trace_kvm_mmu_get_page(sp, false);
> @@ -2127,7 +2122,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> sp = kvm_mmu_alloc_page(vcpu, direct);
>
> sp->parent_ptes.val = 0;
> - mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>
> sp->gfn = gfn;
> sp->role = role;
> @@ -2196,7 +2190,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
> return __shadow_walk_next(iterator, *iterator->sptep);
> }
>
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
> +static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> + struct kvm_mmu_page *sp, bool accessed)
> {
> u64 spte;
>
> @@ -2210,6 +2205,11 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
> spte |= shadow_accessed_mask;
>
> mmu_spte_set(sptep, spte);
> +
> + if (sp->unsync_children || sp->unsync)
> + mark_unsync(sptep);

Why are these needed?

2015-11-18 09:09:22

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 07/10] KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct



On 18/11/2015 04:21, Xiao Guangrong wrote:
>
>
> On 11/12/2015 07:55 PM, Takuya Yoshikawa wrote:
>> @@ -1720,7 +1724,7 @@ static struct kvm_mmu_page
>> *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
>> * this feature. See the comments in kvm_zap_obsolete_pages().
>> */
>> list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
>> - sp->parent_ptes = 0;
>> + sp->parent_ptes.val = 0;
>
> The sp is allocated from kmem_cache_zalloc(), so explicitly initializing it
> to zero is not needed.

Right, but it should be a separate patch.

Takuya, since you are going to send another version of this series, can
you also:

1) move this patch either to the beginning or to the end

2) include "KVM: x86: always set accessed bit in shadow PTEs", also near
the beginning of the series?

Thanks!

Paolo

2015-11-19 00:58:38

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap

On 2015/11/18 11:44, Xiao Guangrong wrote:

> On 11/12/2015 07:50 PM, Takuya Yoshikawa wrote:
>> + if (!ret) {
>> + clear_unsync_child_bit(sp, i);
>> + continue;
>> + } else if (ret > 0) {
>> nr_unsync_leaf += ret;
>
> Just a single line here, braces are unnecessary.
>
>> - else
>> + } else
>> return ret;

I know we can eliminate the braces, but that does not mean
we should do so: there seems to be no consensus about this
style issue and checkpatch accepts both ways.

Actually, some people prefer to put braces when one of the
if/else-if/else cases has multiple lines. You can see
some examples in kernel/sched/core.c: see hrtick_start(),
sched_fork(), free_sched_domain().

In our case, I thought putting braces would align the else-if
and else and make the code look a bit nicer, but I know this
may be just a matter of personal feeling.

In short, unless the maintainer, Paolo for this file, has any
preference, both ways will be accepted.

Takuya

2015-11-19 02:23:01

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 07/10] KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct

On 2015/11/18 18:09, Paolo Bonzini wrote:

> On 18/11/2015 04:21, Xiao Guangrong wrote:

>> On 11/12/2015 07:55 PM, Takuya Yoshikawa wrote:
>>> @@ -1720,7 +1724,7 @@ static struct kvm_mmu_page
>>> *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
>>> * this feature. See the comments in kvm_zap_obsolete_pages().
>>> */
>>> list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
>>> - sp->parent_ptes = 0;
>>> + sp->parent_ptes.val = 0;
>>
>> The sp is allocated from kmem_cache_zalloc(), so explicitly initializing it
>> to zero is not needed.
>
> Right, but it should be a separate patch.
>
> Takuya, since you are going to send another version of this series, can
> you also:

Yes, I'm preparing to do so.

> 1) move this patch either to the beginning or to the end
>
> 2) include "KVM: x86: always set accessed bit in shadow PTEs", also near
> the beginning of the series?

Commit 1c9a5e19b1af8a2c ("KVM: x86: MMU: always set accessed bit
in shadow PTEs") will be the first.

Then, the ordering will become something like this:

02: Encapsulate the type of rmap-chain head in a new struct
03: Remove unused parameter of __direct_map()
04: Add helper function to clear a bit in unsync child bitmap
05: Make mmu_set_spte() return emulate value
06: Remove is_rmap_spte() and use is_shadow_present_pte()

These five seem to be easy ones for you to apply: since patch 02
touches many places, it should go first to become the base of the
following work.

07: Consolidate BUG_ON checks for reverse-mapped sptes

I will change the WARN_ON to BUG_ON. // Marcelo's comment

08: Move initialization of parent_ptes out from kvm_mmu_alloc_page()

In this patch, I will delete the "sp->parent_ptes.val = 0;" line since
this is really a problem in kvm_mmu_alloc_page(), though not a new one.
// Xiao's comment

09: Use for_each_rmap_spte macro instead of pte_list_walk()

There is some confusion between us: Paolo and I agreed that the
patch keeps the original behaviour and calls mark_unsync() the same
way as before, but there are still comments from Marcelo and Xiao,
and those comments seem to read the code differently.

I will check again, but I may not change this one and the following
two patches in the next version. If we can eliminate some of the
mark_unsync() calls, that will be kind of an optimization which this
series does not intend to achieve.

Anyway, by moving the two non-trivial patches (09 and 10) here,
reviewing will become easier and you can apply the other patches
separately.

10: Move parent_pte handling from kvm_mmu_get_page()
to link_shadow_page()
11: Remove unused parameter parent_pte from kvm_mmu_get_page()

Thanks,
Takuya

2015-11-19 02:53:22

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap



On 11/19/2015 08:59 AM, Takuya Yoshikawa wrote:
> On 2015/11/18 11:44, Xiao Guangrong wrote:
>
>> On 11/12/2015 07:50 PM, Takuya Yoshikawa wrote:
>>> + if (!ret) {
>>> + clear_unsync_child_bit(sp, i);
>>> + continue;
>>> + } else if (ret > 0) {
>>> nr_unsync_leaf += ret;
>>
>> Just a single line here, braces are unnecessary.
>>
>>> - else
>>> + } else
>>> return ret;
>
> I know we can eliminate the braces, but that does not mean
> we should do so: there seems to be no consensus about this
> style issue and checkpatch accepts both ways.
>
> Actually, some people prefer to put braces when one of the
> if/else-if/else cases has multiple lines. You can see
> some examples in kernel/sched/core.c: see hrtick_start(),
> sched_fork(), free_sched_domain().
>
> In our case, I thought putting braces would align the else-if
> and else and make the code look a bit nicer, but I know this
> may be just a matter of personal feeling.
>
> In short, unless the maintainer, Paolo for this file, has any
> preference, both ways will be accepted.

The reason why I pointed this out is that it is the style documented
in Documentation/CodingStyle:
| Do not unnecessarily use braces where a single statement will do.
|
| if (condition)
| action();
|

Actually, Ingo Molnar hates this brace style very much and has blamed
many developers who used it (including me, which is why I was
nervous to see this style :( ).

If this style is commonly accepted now, it is worth making a patch
to update Documentation/CodingStyle.

2015-11-19 04:02:00

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap

On 2015/11/19 11:46, Xiao Guangrong wrote:

>> Actually, some people prefer to put braces when one of the
>> if/else-if/else cases has multiple lines. You can see
>> some examples in kernel/sched/core.c: see hrtick_start(),
>> sched_fork(), free_sched_domain().
>>
>> In our case, I thought putting braces would align the else-if
>> and else and make the code look a bit nicer, but I know this
>> may be just a matter of personal feeling.
>>
>> In short, unless the maintainer, Paolo for this file, has any
>> preference, both ways will be accepted.
>
> The reason why i pointed this out is that it is the style documented
> in Documentation/CodingStyle:
> | Do not unnecessarily use braces where a single statement will do.
> |
> | if (condition)
> | action();
> |

Ah, this is a different thing. For this case, there is a consensus
and checkpatch will complain if we don't obey the rule.

What I explained was:

if (condition) {
	line1;
	line2;     // multiple lines
} else if (condition2) {
	single-line-statement;   -- (*1)
} else
	single-line-statement;   -- (*2)

For (*1) and (*2), especially for (*1), some people put braces.

> Actually, Ingo Molnar hates this brace style very much and has blamed
> many developers who used it (including me, which is why I was
> nervous to see this style :( ).

I think he likes the coding style of kernel/sched/core.c very much,
as you know. Actually that is one reason why I took it as an example.

Let's just choose whichever way Paolo prefers this time; I don't
know which is better.

Thank you,
Takuya