2014-04-17 09:05:45

by Xiao Guangrong

Subject: [PATCH v5 0/5] KVM: x86: flush tlb out of mmu-lock after write protection

Since Marcelo agreed to the comment improvements in off-list mail, I
consider this his Ack. :) Please let me know if I misunderstood.

This patchset is split out from my previous patchset:
[PATCH v3 00/15] KVM: MMU: locklessly write-protect
that can be found at:
https://lkml.org/lkml/2013/10/23/265

Changelog v5:
- improve some comments, addressing Marcelo's suggestions

Changelog v4:
- add more comments to patch 5; thanks to Marcelo for his review

Xiao Guangrong (5):
Revert "KVM: Simplify kvm->tlbs_dirty handling"
KVM: MMU: properly check last spte in fast_page_fault()
KVM: MMU: lazily drop large spte
KVM: MMU: flush tlb if the spte can be locklessly modified
KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

arch/x86/kvm/mmu.c | 72 ++++++++++++++++++++++++++++++----------------
arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 7 ++---
arch/x86/kvm/x86.c | 20 ++++++++++---
include/linux/kvm_host.h | 4 +--
virt/kvm/kvm_main.c | 5 +++-
6 files changed, 104 insertions(+), 37 deletions(-)

--
1.8.1.4


2014-04-17 09:05:58

by Xiao Guangrong

Subject: [PATCH v5 2/5] KVM: MMU: properly check last spte in fast_page_fault()

Use sp->role.level instead of @level, since @level is not derived from the
page table hierarchy.

There is no issue in the current code, because fast page fault currently only
fixes faults caused by dirty logging, which always happen on the last level
(level = 1).

This patch makes the code more readable and avoids potential issues in future
development.
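
For reference, the helper in question looks roughly like this in this era of
the tree (a sketch for illustration; see arch/x86/kvm/mmu.c for the
authoritative version):

static int is_last_spte(u64 pte, int level)
{
	/* A 4K mapping is always the last level. */
	if (level == PT_PAGE_TABLE_LEVEL)
		return 1;
	/* A large pte terminates the walk early, at a higher level. */
	if (is_large_pte(pte))
		return 1;
	return 0;
}

Passing sp->role.level keeps the check tied to the level the spte was actually
read from, rather than the level the caller happened to request.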

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 668ae59..6310704 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2802,9 +2802,9 @@ static bool page_fault_can_be_fast(u32 error_code)
}

static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *sptep, u64 spte)
{
- struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;

WARN_ON(!sp->role.direct);
@@ -2830,6 +2830,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
u32 error_code)
{
struct kvm_shadow_walk_iterator iterator;
+ struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;

@@ -2853,7 +2854,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;
}

- if (!is_last_spte(spte, level))
+ sp = page_header(__pa(iterator.sptep));
+ if (!is_last_spte(spte, sp->role.level))
goto exit;

/*
@@ -2879,7 +2881,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
*/
- ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+ ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
spte, ret);
--
1.8.1.4

2014-04-17 09:05:56

by Xiao Guangrong

Subject: [PATCH v5 4/5] KVM: MMU: flush tlb if the spte can be locklessly modified

Relax the TLB flush condition, since we will write-protect sptes outside of
the mmu lock. Note that lockless write protection only marks a writable spte
readonly, and a spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is exactly what
spte_is_locklessly_modifiable() tests).
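
As a reminder, that test is essentially the following (a sketch; the real
helper lives in arch/x86/kvm/mmu.c):

static bool spte_is_locklessly_modifiable(u64 spte)
{
	/* Both software bits must be set for the lockless path to clear W. */
	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}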

This patch avoids this kind of race:

    VCPU 0                          VCPU 1
lockless write protection:
    set spte.w = 0
                                    lock mmu-lock

                                    write-protect the spte to sync the shadow
                                    page, see spte.w = 0, so do not flush
                                    the TLB

                                    unlock mmu-lock

                                    !!! At this point, the shadow page can
                                        still be writable due to the stale
                                        TLB entry
    Flush all TLBs

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddf0696..388a2ef 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* we always atomicly update it, see the comments in
* spte_has_volatile_bits().
*/
- if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+ if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;

if (!shadow_accessed_mask)
--
1.8.1.4

2014-04-17 09:05:54

by Xiao Guangrong

Subject: [PATCH v5 5/5] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

Now we can flush all the TLBs outside of the mmu lock, without risking TLB
corruption, when write-protecting the sptes, because:

- we now mark large sptes readonly instead of dropping them, i.e. we only ever
  change a spte from writable to readonly, so the only case to worry about is
  changing a spte from present to present (changing a spte from present to
  non-present flushes all the TLBs immediately); in other words, the only case
  we need to care about is mmu_spte_update()

- mmu_spte_update() now checks SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
  instead of PT_WRITABLE_MASK, which means it no longer depends on
  PT_WRITABLE_MASK (a minimal sketch of the resulting pattern is below)
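
A minimal sketch of that pattern, with locking details simplified (the real
code is kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log()
in the diff below):

	/* both callers run under kvm->slots_lock, which serializes the flushers */
	spin_lock(&kvm->mmu_lock);
	/* mark sptes readonly; no TLB flush while holding the lock */
	spin_unlock(&kvm->mmu_lock);
	kvm_flush_remote_tlbs(kvm);	/* one flush, outside mmu_lock */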

Acked-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 25 +++++++++++++++++++++----
arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 12 ++++++++++--
3 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 388a2ef..65f2400 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4309,15 +4309,32 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);

- if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
- kvm_flush_remote_tlbs(kvm);
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
- }
}
}

- kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+ /*
+ * kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log()
+ * which do tlb flush out of mmu-lock should be serialized by
+ * kvm->slots_lock otherwise tlb flush would be missed.
+ */
+ lockdep_assert_held(&kvm->slots_lock);
+
+ /*
+ * We can flush all the TLBs out of the mmu lock without TLB
+ * corruption since we just change the spte from writable to
+ * readonly so that we only need to care about the case of changing
+ * a spte from present to present (changing a spte from present
+ * to nonpresent flushes all the TLBs immediately); in other
+ * words, the only case we care about is mmu_spte_update(), where we
+ * have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+ * instead of PT_WRITABLE_MASK, that means it does not depend
+ * on PT_WRITABLE_MASK anymore.
+ */
+ kvm_flush_remote_tlbs(kvm);
}

#define BATCH_ZAP_PAGES 10
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3842e70..b982112 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -104,6 +104,39 @@ static inline int is_present_gpte(unsigned long pte)
return pte & PT_PRESENT_MASK;
}

+/*
+ * Currently, we have two sorts of write-protection, a) the first one
+ * write-protects guest page to sync the guest modification, b) another one is
+ * used to sync dirty bitmap when we do KVM_GET_DIRTY_LOG. The differences
+ * between these two sorts are:
+ * 1) the first case clears SPTE_MMU_WRITEABLE bit.
+ * 2) the first case requires flushing tlb immediately avoiding corrupting
+ * shadow page table between all vcpus so it should be in the protection of
+ * mmu-lock. And the another case does not need to flush tlb until returning
+ * the dirty bitmap to userspace since it only write-protects the page
+ * logged in the bitmap, that means the page in the dirty bitmap is not
+ * missed, so it can flush tlb out of mmu-lock.
+ *
+ * So, there is the problem: the first case can meet the corrupted tlb caused
+ * by another case which write-protects pages but without flush tlb
+ * immediately. In order to making the first case be aware this problem we let
+ * it flush tlb if we try to write-protect a spte whose SPTE_MMU_WRITEABLE bit
+ * is set, it works since another case never touches SPTE_MMU_WRITEABLE bit.
+ *
+ * Anyway, whenever a spte is updated (only permission and status bits are
+ * changed) we need to check whether the spte with SPTE_MMU_WRITEABLE becomes
+ * readonly, if that happens, we need to flush tlb. Fortunately,
+ * mmu_spte_update() has already handled it perfectly.
+ *
+ * The rules to use SPTE_MMU_WRITEABLE and PT_WRITABLE_MASK:
+ * - if we want to see if it has writable tlb entry or if the spte can be
+ * writable on the mmu mapping, check SPTE_MMU_WRITEABLE, this is the most
+ * case, otherwise
+ * - if we fix page fault on the spte or do write-protection by dirty logging,
+ * check PT_WRITABLE_MASK.
+ *
+ * TODO: introduce APIs to split these two cases.
+ */
static inline int is_writable_pte(unsigned long pte)
{
return pte & PT_WRITABLE_MASK;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ebd8506..fe2590f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3646,11 +3646,19 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
- if (is_dirty)
- kvm_flush_remote_tlbs(kvm);

spin_unlock(&kvm->mmu_lock);

+ /* See the comments in kvm_mmu_slot_remove_write_access(). */
+ lockdep_assert_held(&kvm->slots_lock);
+
+ /*
+ * All the TLBs can be flushed out of mmu lock, see the comments in
+ * kvm_mmu_slot_remove_write_access().
+ */
+ if (is_dirty)
+ kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
--
1.8.1.4

2014-04-17 09:05:49

by Xiao Guangrong

Subject: [PATCH v5 1/5] Revert "KVM: Simplify kvm->tlbs_dirty handling"

This reverts commit 5befdc385ddb2d5ae8995ad89004529a3acf58fc.

We will allow the TLB to be flushed outside of mmu-lock in a later patch of
this series, which the simplified bool-based tlbs_dirty handling cannot
accommodate, so bring back the counter.
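
For context, the counter is consumed by the MMU notifier path roughly like
this (a sketch, not the exact tree contents):

	/* e.g. in kvm_mmu_notifier_invalidate_page() */
	spin_lock(&kvm->mmu_lock);
	need_tlb_flush = kvm_unmap_hva(kvm, address);
	/* the TLB must be flushed before the page can be freed */
	if (need_tlb_flush || kvm->tlbs_dirty)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);

With a plain bool, a flush performed outside mmu-lock could race with
sync_page() setting new dirtiness and then clear it unseen; with the counter,
the cmpxchg() in kvm_flush_remote_tlbs() only resets the value it actually
sampled, so a concurrent increment survives for the next flusher.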

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 7 +++----
include/linux/kvm_host.h | 4 +---
virt/kvm/kvm_main.c | 5 ++++-
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 123efd3..4107765 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -913,8 +913,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
* and kvm_mmu_notifier_invalidate_range_start detect the mapping page isn't
* used by guest then tlbs are not flushed, so guest is allowed to access the
* freed pages.
- * We set tlbs_dirty to let the notifier know this change and delay the flush
- * until such a case actually happens.
+ * And we increase kvm->tlbs_dirty to delay tlbs flush in this case.
*/
static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
@@ -943,7 +942,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
return -EINVAL;

if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
- vcpu->kvm->tlbs_dirty = true;
+ vcpu->kvm->tlbs_dirty++;
continue;
}

@@ -958,7 +957,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)

if (gfn != sp->gfns[i]) {
drop_spte(vcpu->kvm, &sp->spt[i]);
- vcpu->kvm->tlbs_dirty = true;
+ vcpu->kvm->tlbs_dirty++;
continue;
}

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7d21cf9..33cccd5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -410,9 +410,7 @@ struct kvm {
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
#endif
- /* Protected by mmu_lock */
- bool tlbs_dirty;
-
+ long tlbs_dirty;
struct list_head devices;
};

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 56baae8..628275a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -186,9 +186,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)

void kvm_flush_remote_tlbs(struct kvm *kvm)
{
+ long dirty_count = kvm->tlbs_dirty;
+
+ smp_mb();
if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.remote_tlb_flush;
- kvm->tlbs_dirty = false;
+ cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
}
EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);

--
1.8.1.4

2014-04-17 09:08:38

by Xiao Guangrong

Subject: [PATCH v5 3/5] KVM: MMU: lazily drop large spte

Currently, kvm zaps a large spte when write protection is needed, so a later
read can fault on that spte. Instead, we can make the large spte readonly
rather than non-present, and the page fault caused by the read access is then
avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

This version fixes the issue reported in commit 6b73a9606: fast_page_fault()
directly set a readonly large spte to writable but only logged the first page
in the dirty bitmap, so the other pages covered by that mapping were missed.
Fix it by allowing only normal sptes (at PT_PAGE_TABLE_LEVEL) to be
fast-fixed.
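
For reference, the fast-path fix logs exactly one gfn (a trimmed sketch of
the relevant lines in fast_pf_fix_direct_spte()):

	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* only this single gfn lands in the dirty bitmap */
		mark_page_dirty(vcpu->kvm, gfn);

so a large spte, which covers many guest pages, must not be made writable
here; restricting the fast path to PT_PAGE_TABLE_LEVEL sptes leaves such
faults to the slow path.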

Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 34 ++++++++++++++++++----------------
arch/x86/kvm/x86.c | 8 ++++++--
2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6310704..ddf0696 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1176,8 +1176,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

/*
* Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
*
* Note: write protection is difference between drity logging and spte
* protection:
@@ -1186,10 +1185,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
* - for spte protection, the spte can be writable only after unsync-ing
* shadow page.
*
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
*/
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
u64 spte = *sptep;

@@ -1199,17 +1197,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)

rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

- if (__drop_large_spte(kvm, sptep)) {
- *flush |= true;
- return true;
- }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;

- *flush |= mmu_spte_update(sptep, spte);
- return false;
+ return mmu_spte_update(sptep, spte);
}

static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1221,11 +1213,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,

for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
- if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
- sptep = rmap_get_first(*rmapp, &iter);
- continue;
- }

+ flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}

@@ -2877,6 +2866,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
goto exit;

/*
+ * Do not fix write-permission on the large spte since we only dirty
+ * the first page into the dirty-bitmap in fast_pf_fix_direct_spte(),
+ * which means the other pages are missed if its slot is dirty-logged.
+ *
+ * Instead, we let the slow page fault path create a normal spte to
+ * fix the access.
+ *
+ * See the comments in kvm_arch_commit_memory_region().
+ */
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ goto exit;
+
+ /*
* Currently, fast page fault only works for direct mapping since
* the gfn is not stable for indirect shadow page.
* See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b8fc0b..ebd8506 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7329,8 +7329,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
* Write protect all pages for dirty logging.
- * Existing largepage mappings are destroyed here and new ones will
- * not be created until the end of the logging.
+ *
+ * All the sptes including the large sptes which point to this
+ * slot are set to readonly. We can not create any new large
+ * spte on this slot until the end of the logging.
+ *
+ * See the comments in fast_page_fault().
*/
if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
kvm_mmu_slot_remove_write_access(kvm, mem->slot);
--
1.8.1.4

2014-04-24 14:57:38

by Marcelo Tosatti

Subject: Re: [PATCH v5 0/5] KVM: x86: flush tlb out of mmu-lock after write protection

On Thu, Apr 17, 2014 at 05:06:11PM +0800, Xiao Guangrong wrote:
> Since Marcelo agreed to the comment improvements in off-list mail, I
> consider this his Ack. :) Please let me know if I misunderstood.
>
> This patchset is split out from my previous patchset:
> [PATCH v3 00/15] KVM: MMU: locklessly write-protect
> that can be found at:
> https://lkml.org/lkml/2013/10/23/265

Applied, thanks.

2014-04-28 11:31:09

by Paolo Bonzini

Subject: Re: [PATCH v5 5/5] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

What about some editing of the big comment...

/*
* Currently, shadow PTEs are write protected in two cases, 1) write protecting
* guest page tables, 2) resetting dirty tracking after KVM_GET_DIRTY_LOG. The
* differences between these two sorts are:
*
* a) only the first case clears SPTE_MMU_WRITEABLE bit.
*
* b) the first case requires flushing the TLB immediately to avoid corruption
* of the shadow page table on other VCPUs. In order to synchronize with
* other VCPUs the flush is done under the MMU lock.
*
* The second case instead can delay flushing of the TLB until just before
* the dirty bitmap is returned to userspace; this is because it
* only write-protects pages that are set in the bitmap, and further writes
* to those pages can be safely ignored until userspace examines the bitmap.
* We rely on this to flush the TLB outside the MMU lock.
*
* A problem arises when these two cases occur concurrently. Userspace can
* call KVM_GET_DIRTY_LOG, which write-protects pages but does not immediately
* flush the TLB; in the meanwhile, KVM wants to write-protect a guest page
* table, sees it's already write-protected, and the result is a corrupted TLB.
*
* To avoid this problem, when write protecting guest page tables we *always*
* flush the TLB if the spte has the SPTE_MMU_WRITEABLE bit set, even if
* the spte was already write-protected. This works since case 2 never touches
* SPTE_MMU_WRITEABLE bit. In other words, whenever a spte is updated (only
* permission and status bits are changed) we need to check whether a spte with
* SPTE_MMU_WRITEABLE becomes readonly. If that happens, we flush the TLB.
* mmu_spte_update() handles this.
*
* The rules to use SPTE_MMU_WRITEABLE and PT_WRITABLE_MASK are as follows:
*
* a) if you want to see if it has a writable TLB entry, or if the spte can be
* writable on the mmu mapping, check SPTE_MMU_WRITEABLE. This is the most
* common case, otherwise
*
* b) when fixing a page fault on the spte or doing write-protection for
* dirty logging, check PT_WRITABLE_MASK.


Is the above accurate?

> * TODO: introduce APIs to split these two cases.

What do you mean exactly?

Paolo