Changelog in v2:
- fix an issue where the tracking memory of a memslot was freed if the
memslot was only moved or had its flags changed
- do not track gfns that are not mapped in the memslots
- introduce the nolock APIs at the beginning of the patchset
- use 'unsigned short' as the tracking counter to reduce memory usage;
it is large enough for the shadow page table and KVMGT
This patchset introduces a feature that allows us to track page
access in the guest. Currently, only write access tracking is
implemented.
Four APIs are introduced (a usage sketch follows the list):
- kvm_page_track_add_page(kvm, gfn, mode): the single guest page @gfn is
added into the tracking pool of the guest instance represented by @kvm;
@mode specifies which kind of access to @gfn is tracked
- kvm_page_track_remove_page(kvm, gfn, mode): the reverse operation of
kvm_page_track_add_page(), which removes @gfn from the tracking pool;
the gfn is no longer tracked after its last user is gone
- kvm_page_track_register_notifier(kvm, n): register a notifier so that
events triggered by page tracking are received; when an event fires,
the n->track_write() callback is invoked
- kvm_page_track_unregister_notifier(kvm, n): the reverse operation of
kvm_page_track_register_notifier(), which unlinks the notifier and
stops the delivery of tracked events
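As a rough usage sketch (the example_* names below are hypothetical and
not part of the patchset), a module that wants to watch writes to one
guest page would combine the four APIs like this:

#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

static void example_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				const u8 *new, int bytes)
{
	/* invoked after write emulation on a tracked page finishes */
	pr_info("tracked write: %d bytes at gpa 0x%llx\n",
		bytes, (unsigned long long)gpa);
}

static struct kvm_page_track_notifier_node example_node = {
	.track_write = example_track_write,
};

static void example_start(struct kvm *kvm, gfn_t gfn)
{
	kvm_page_track_register_notifier(kvm, &example_node);
	/* must run under kvm->srcu or kvm->slots_lock */
	kvm_page_track_add_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
}

static void example_stop(struct kvm *kvm, gfn_t gfn)
{
	kvm_page_track_remove_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
	kvm_page_track_unregister_notifier(kvm, &example_node);
}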
The first user of page tracking is non-leaf shadow page tables, as they
are always write protected. It also brings a performance improvement,
since page tracking speeds up the page fault handler for tracked pages.
The results of a kernel build benchmark are as follows:
before after
real 461.63 real 455.48
user 4529.55 user 4557.88
sys 1995.39 sys 1922.57
Furthermore, it provides the infrastructure for other kinds of shadow
page tables, such as the GPU shadow page table introduced in KVMGT (1)
and native nested IOMMU.
This patchset can be divided into two parts:
- patches 1 ~ 7 implement page tracking
- the remaining patches apply page tracking to non-leaf shadow page
tables
(1): http://lkml.iu.edu/hypermail/linux/kernel/1510.3/01562.html
Xiao Guangrong (11):
KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed
KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage
KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect
KVM: page track: add the framework of guest page tracking
KVM: page track: introduce kvm_page_track_{add,remove}_page
KVM: MMU: let page fault handler be aware tracked page
KVM: page track: add notifier support
KVM: MMU: use page track for non-leaf shadow pages
KVM: MMU: simplify mmu_need_write_protect
KVM: MMU: clear write-flooding on the fast path of tracked page
KVM: MMU: apply page track notifier
Documentation/virtual/kvm/mmu.txt | 6 +-
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/include/asm/kvm_page_track.h | 67 +++++++++
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu.c | 199 ++++++++++++++++++--------
arch/x86/kvm/mmu.h | 5 +
arch/x86/kvm/page_track.c | 257 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 5 +
arch/x86/kvm/x86.c | 27 ++--
9 files changed, 509 insertions(+), 72 deletions(-)
create mode 100644 arch/x86/include/asm/kvm_page_track.h
create mode 100644 arch/x86/kvm/page_track.c
--
1.8.3.1
kvm_lpage_info->write_count is used to detect whether a large page
mapping for the gfn on the specified level is allowed; rename it to
disallow_lpage to reflect its purpose. Also rename
has_wrprotected_page() to mmu_gfn_lpage_is_disallowed() to make the
code clearer.
Later we will extend this mechanism for page tracking: if the gfn is
tracked, then large mappings for that gfn on any level are not allowed.
The new name is more straightforward for that usage.
Signed-off-by: Xiao Guangrong <[email protected]>
---
Documentation/virtual/kvm/mmu.txt | 6 +++---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu.c | 25 +++++++++++++------------
arch/x86/kvm/x86.c | 14 ++++++++------
4 files changed, 25 insertions(+), 22 deletions(-)
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index daf9c0f..dda2e93 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -391,11 +391,11 @@ To instantiate a large spte, four constraints must be satisfied:
write-protected pages
- the guest page must be wholly contained by a single memory slot
-To check the last two conditions, the mmu maintains a ->write_count set of
+To check the last two conditions, the mmu maintains a ->disallow_lpage set of
arrays for each memory slot and large page size. Every write protected page
-causes its write_count to be incremented, thus preventing instantiation of
+causes its disallow_lpage to be incremented, thus preventing instantiation of
a large spte. The frames at the end of an unaligned memory slot have
-artificially inflated ->write_counts so they can never be instantiated.
+artificially inflated ->disallow_lpages so they can never be instantiated.
Zapping all pages (page generation count)
=========================================
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a7c8987..1d37c1b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -619,7 +619,7 @@ struct kvm_vcpu_arch {
};
struct kvm_lpage_info {
- int write_count;
+ int disallow_lpage;
};
struct kvm_arch_memory_slot {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a1a3d19..61259ff 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -789,7 +789,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
- linfo->write_count += 1;
+ linfo->disallow_lpage += 1;
}
kvm->arch.indirect_shadow_pages++;
}
@@ -807,31 +807,32 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
- linfo->write_count -= 1;
- WARN_ON(linfo->write_count < 0);
+ linfo->disallow_lpage -= 1;
+ WARN_ON(linfo->disallow_lpage < 0);
}
kvm->arch.indirect_shadow_pages--;
}
-static int __has_wrprotected_page(gfn_t gfn, int level,
- struct kvm_memory_slot *slot)
+static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
+ struct kvm_memory_slot *slot)
{
struct kvm_lpage_info *linfo;
if (slot) {
linfo = lpage_info_slot(gfn, slot, level);
- return linfo->write_count;
+ return !!linfo->disallow_lpage;
}
- return 1;
+ return true;
}
-static int has_wrprotected_page(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
+static bool mmu_gfn_lpage_is_disallowed(struct kvm_vcpu *vcpu, gfn_t gfn,
+ int level)
{
struct kvm_memory_slot *slot;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- return __has_wrprotected_page(gfn, level, slot);
+ return __mmu_gfn_lpage_is_disallowed(gfn, level, slot);
}
static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
@@ -897,7 +898,7 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
max_level = min(kvm_x86_ops->get_lpage_level(), host_level);
for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
- if (__has_wrprotected_page(large_gfn, level, slot))
+ if (__mmu_gfn_lpage_is_disallowed(large_gfn, level, slot))
break;
return level - 1;
@@ -2511,7 +2512,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
* be fixed if guest refault.
*/
if (level > PT_PAGE_TABLE_LEVEL &&
- has_wrprotected_page(vcpu, gfn, level))
+ mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))
goto done;
spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
@@ -2775,7 +2776,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
PageTransCompound(pfn_to_page(pfn)) &&
- !has_wrprotected_page(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
+ !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
* mmu_notifier_retry was successful and we hold the
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b6102c1..a8c88a8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7852,6 +7852,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
int i;
for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
+ struct kvm_lpage_info *linfo;
unsigned long ugfn;
int lpages;
int level = i + 1;
@@ -7866,15 +7867,16 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
if (i == 0)
continue;
- slot->arch.lpage_info[i - 1] = kvm_kvzalloc(lpages *
- sizeof(*slot->arch.lpage_info[i - 1]));
- if (!slot->arch.lpage_info[i - 1])
+ linfo = kvm_kvzalloc(lpages * sizeof(*linfo));
+ if (!linfo)
goto out_free;
+ slot->arch.lpage_info[i - 1] = linfo;
+
if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][0].write_count = 1;
+ linfo[0].disallow_lpage = 1;
if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
- slot->arch.lpage_info[i - 1][lpages - 1].write_count = 1;
+ linfo[lpages - 1].disallow_lpage = 1;
ugfn = slot->userspace_addr >> PAGE_SHIFT;
/*
* If the gfn and userspace address are not aligned wrt each
@@ -7886,7 +7888,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned long j;
for (j = 0; j < lpages; ++j)
- slot->arch.lpage_info[i - 1][j].write_count = 1;
+ linfo[j].disallow_lpage = 1;
}
}
--
1.8.3.1
Abstract the common operations out of account_shadowed() and
unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage()
and kvm_mmu_gfn_allow_lpage().
These two functions will be used by page tracking in a later patch.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 38 +++++++++++++++++++++++++-------------
arch/x86/kvm/mmu.h | 3 +++
2 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 61259ff..4b04d13 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -776,21 +776,39 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
return &slot->arch.lpage_info[level - 2][idx];
}
+static void update_gfn_disallow_lpage_count(struct kvm_memory_slot *slot,
+ gfn_t gfn, int count)
+{
+ struct kvm_lpage_info *linfo;
+ int i;
+
+ for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+ linfo = lpage_info_slot(gfn, slot, i);
+ linfo->disallow_lpage += count;
+ WARN_ON(linfo->disallow_lpage < 0);
+ }
+}
+
+void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ update_gfn_disallow_lpage_count(slot, gfn, 1);
+}
+
+void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ update_gfn_disallow_lpage_count(slot, gfn, -1);
+}
+
static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
- struct kvm_lpage_info *linfo;
gfn_t gfn;
- int i;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
- for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
- linfo = lpage_info_slot(gfn, slot, i);
- linfo->disallow_lpage += 1;
- }
+ kvm_mmu_gfn_disallow_lpage(slot, gfn);
kvm->arch.indirect_shadow_pages++;
}
@@ -798,18 +816,12 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
- struct kvm_lpage_info *linfo;
gfn_t gfn;
- int i;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
- for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
- linfo = lpage_info_slot(gfn, slot, i);
- linfo->disallow_lpage -= 1;
- WARN_ON(linfo->disallow_lpage < 0);
- }
+ kvm_mmu_gfn_allow_lpage(slot, gfn);
kvm->arch.indirect_shadow_pages--;
}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 55ffb7b..de92bed 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -174,4 +174,7 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
+void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
#endif
--
1.8.3.1
Split rmap_write_protect() and introduce a function that abstracts the
write protection based on the memslot.
This function will be used in a later patch.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 16 +++++++++++-----
arch/x86/kvm/mmu.h | 2 ++
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4b04d13..39809b8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1336,23 +1336,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
}
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+ struct kvm_memory_slot *slot, u64 gfn)
{
- struct kvm_memory_slot *slot;
struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;
- slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(vcpu->kvm, rmap_head, true);
+ write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}
return write_protected;
}
+static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+{
+ struct kvm_memory_slot *slot;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+}
+
static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
{
u64 *sptep;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index de92bed..58fe98a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -177,4 +177,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+ struct kvm_memory_slot *slot, u64 gfn);
#endif
--
1.8.3.1
The array gfn_track[mode][gfn] is introduced in the memory slot for
every guest page; it is the tracking count for the guest page in each
mode. The count is increased when the page is tracked, and the page is
no longer tracked once the count reaches zero.
We use 'unsigned short' as the tracking count, which should be enough:
the shadow page table can create at most 2^14 shadow pages per gfn
(2^3 for level, 2^1 for cr4_pae, 2^2 for quadrant, 2^3 for access,
2^1 for nxe, 2^1 for cr0_wp, 2^1 for smep_andnot_wp, 2^1 for
smap_andnot_wp, and 2^1 for smm; 3+1+2+3+1+1+1+1+1 = 14 bits), so a
16-bit counter still leaves room for other trackers.
Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot(), are implemented in this patch; they are
used internally to initialize and reclaim the memory of the array.
Currently, only the write track mode is supported.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/asm/kvm_page_track.h | 13 +++++++++
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/page_track.c | 52 +++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 5 ++++
5 files changed, 74 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/kvm_page_track.h
create mode 100644 arch/x86/kvm/page_track.c
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1d37c1b..085fde7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
#include <asm/mtrr.h>
#include <asm/msr-index.h>
#include <asm/asm.h>
+#include <asm/kvm_page_track.h>
#define KVM_MAX_VCPUS 255
#define KVM_SOFT_MAX_VCPUS 160
@@ -625,6 +626,7 @@ struct kvm_lpage_info {
struct kvm_arch_memory_slot {
struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
+ unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
};
/*
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
new file mode 100644
index 0000000..55200406
--- /dev/null
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_KVM_PAGE_TRACK_H
+#define _ASM_X86_KVM_PAGE_TRACK_H
+
+enum kvm_page_track_mode {
+ KVM_PAGE_TRACK_WRITE,
+ KVM_PAGE_TRACK_MAX,
+};
+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+ struct kvm_memory_slot *dont);
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages);
+#endif
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index a1ff508..464fa47 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -13,9 +13,10 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o
+ hyperv.o page_track.o
kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += assigned-dev.o iommu.o
+
kvm-intel-y += vmx.o pmu_intel.o
kvm-amd-y += svm.o pmu_amd.o
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
new file mode 100644
index 0000000..8c396d0
--- /dev/null
+++ b/arch/x86/kvm/page_track.c
@@ -0,0 +1,52 @@
+/*
+ * Support KVM guest page tracking
+ *
+ * This feature allows us to track page access in guest. Currently, only
+ * write access is tracked.
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ * Xiao Guangrong <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_page_track.h>
+
+#include "mmu.h"
+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+ struct kvm_memory_slot *dont)
+{
+ int i;
+
+ for (i = 0; i < KVM_PAGE_TRACK_MAX; i++)
+ if (!dont || free->arch.gfn_track[i] !=
+ dont->arch.gfn_track[i]) {
+ kvfree(free->arch.gfn_track[i]);
+ free->arch.gfn_track[i] = NULL;
+ }
+}
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+ int i;
+
+ for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+ slot->arch.gfn_track[i] = kvm_kvzalloc(npages *
+ sizeof(*slot->arch.gfn_track[i]));
+ if (!slot->arch.gfn_track[i])
+ goto track_free;
+ }
+
+ return 0;
+
+track_free:
+ kvm_page_track_free_memslot(slot, NULL);
+ return -ENOMEM;
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8c88a8..8ab1ad9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7844,6 +7844,8 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
free->arch.lpage_info[i - 1] = NULL;
}
}
+
+ kvm_page_track_free_memslot(free, dont);
}
int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
@@ -7892,6 +7894,9 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
}
}
+ if (kvm_page_track_create_memslot(slot, npages))
+ goto out_free;
+
return 0;
out_free:
--
1.8.3.1
These two functions are the user APIs:
- kvm_page_track_add_page(): add the page to the tracking pool;
afterwards, the specified access to that page will be tracked
- kvm_page_track_remove_page(): remove the page from the tracking pool;
the specified access to the page is no longer tracked after the last
user is gone
Both must be called under the protection of kvm->srcu or
kvm->slots_lock (a locking sketch follows).
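For illustration only (example_track_gfn() is a hypothetical helper,
not part of this patch), a caller that holds neither lock can enter an
SRCU read-side critical section around the call:

static void example_track_gfn(struct kvm *kvm, gfn_t gfn)
{
	int idx;

	/* take kvm->srcu so the memslots cannot go away under us */
	idx = srcu_read_lock(&kvm->srcu);
	kvm_page_track_add_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
	srcu_read_unlock(&kvm->srcu, idx);
}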
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 13 ++++
arch/x86/kvm/page_track.c | 124 ++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 55200406..c010124 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -10,4 +10,17 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
struct kvm_memory_slot *dont);
int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
unsigned long npages);
+
+void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ enum kvm_page_track_mode mode);
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode);
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn,
+ enum kvm_page_track_mode mode);
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode);
#endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 8c396d0..e17efe9 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -50,3 +50,127 @@ track_free:
kvm_page_track_free_memslot(slot, NULL);
return -ENOMEM;
}
+
+static bool check_mode(enum kvm_page_track_mode mode)
+{
+ if (mode < 0 || mode >= KVM_PAGE_TRACK_MAX)
+ return false;
+
+ return true;
+}
+
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+ enum kvm_page_track_mode mode, short count)
+{
+ int index;
+ unsigned short val;
+
+ index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+ val = slot->arch.gfn_track[mode][index];
+
+ /* does tracking count wrap? */
+ WARN_ON((count > 0) && (val + count < val));
+ /* the last tracker has already gone? */
+ WARN_ON((count < 0) && (val < -count));
+
+ slot->arch.gfn_track[mode][index] += count;
+}
+
+void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+
+ WARN_ON(!check_mode(mode));
+
+ update_gfn_track(slot, gfn, mode, 1);
+
+ /*
+ * new track stops large page mapping for the
+ * tracked page.
+ */
+ kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+ if (mode == KVM_PAGE_TRACK_WRITE)
+ if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+ kvm_flush_remote_tlbs(kvm);
+}
+
+/*
+ * add guest page to the tracking pool so that corresponding access on that
+ * page will be intercepted.
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int i;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ slot = __gfn_to_memslot(slots, gfn);
+ if (!slot)
+ continue;
+
+ spin_lock(&kvm->mmu_lock);
+ kvm_slot_page_track_add_page_nolock(kvm, slot, gfn, mode);
+ spin_unlock(&kvm->mmu_lock);
+ }
+}
+
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+ WARN_ON(!check_mode(mode));
+
+ update_gfn_track(slot, gfn, mode, -1);
+
+ /*
+ * allow large page mapping for the tracked page
+ * after the tracker is gone.
+ */
+ kvm_mmu_gfn_allow_lpage(slot, gfn);
+}
+
+/*
+ * remove the guest page from the tracking pool which stops the interception
+ * of corresponding access on that page. It is the opposed operation of
+ * kvm_page_track_add_page().
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int i;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ slot = __gfn_to_memslot(slots, gfn);
+ if (!slot)
+ continue;
+
+ spin_lock(&kvm->mmu_lock);
+ kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn, mode);
+ spin_unlock(&kvm->mmu_lock);
+ }
+}
--
1.8.3.1
A page fault caused by a write access to a write-tracked page cannot
be fixed; it always needs to be emulated. page_fault_handle_page_track()
is the fast path we introduce here to skip holding the mmu-lock and
walking the shadow page table.
However, if the page table is not present, it is worth making the page
table entry present and read-only, so that read accesses can still be
satisfied.
mmu_need_write_protect() needs to be adjusted so that the page does not
become writable when the page table is made present or when shadow page
table entries are synced/prefetched.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 2 ++
arch/x86/kvm/mmu.c | 44 +++++++++++++++++++++++++++++------
arch/x86/kvm/page_track.c | 14 +++++++++++
arch/x86/kvm/paging_tmpl.h | 3 +++
4 files changed, 56 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index c010124..97ac9c3 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -23,4 +23,6 @@ void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
enum kvm_page_track_mode mode);
void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
+bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
+ enum kvm_page_track_mode mode);
#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 39809b8..b23f9fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -41,6 +41,7 @@
#include <asm/cmpxchg.h>
#include <asm/io.h>
#include <asm/vmx.h>
+#include <asm/kvm_page_track.h>
/*
* When setting this variable to true it enables Two-Dimensional-Paging
@@ -2456,25 +2457,29 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
}
}
-static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool can_unsync)
+static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
+ bool can_unsync)
{
struct kvm_mmu_page *s;
bool need_unsync = false;
+ if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+ return true;
+
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
if (!can_unsync)
- return 1;
+ return true;
if (s->role.level != PT_PAGE_TABLE_LEVEL)
- return 1;
+ return true;
if (!s->unsync)
need_unsync = true;
}
if (need_unsync)
kvm_unsync_pages(vcpu, gfn);
- return 0;
+
+ return false;
}
static bool kvm_is_mmio_pfn(pfn_t pfn)
@@ -3388,10 +3393,30 @@ int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
}
EXPORT_SYMBOL_GPL(handle_mmio_page_fault);
+static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+ u32 error_code, gfn_t gfn)
+{
+ if (unlikely(error_code & PFERR_RSVD_MASK))
+ return false;
+
+ if (!(error_code & PFERR_PRESENT_MASK) ||
+ !(error_code & PFERR_WRITE_MASK))
+ return false;
+
+ /*
+ * The guest is writing a write-tracked page, which cannot be
+ * fixed by the page fault handler.
+ */
+ if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+ return true;
+
+ return false;
+}
+
static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code, bool prefault)
{
- gfn_t gfn;
+ gfn_t gfn = gva >> PAGE_SHIFT;
int r;
pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
@@ -3403,13 +3428,15 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
return r;
}
+ if (page_fault_handle_page_track(vcpu, error_code, gfn))
+ return 1;
+
r = mmu_topup_memory_caches(vcpu);
if (r)
return r;
MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
- gfn = gva >> PAGE_SHIFT;
return nonpaging_map(vcpu, gva & PAGE_MASK,
error_code, gfn, prefault);
@@ -3493,6 +3520,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
return r;
}
+ if (page_fault_handle_page_track(vcpu, error_code, gfn))
+ return 1;
+
r = mmu_topup_memory_caches(vcpu);
if (r)
return r;
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index e17efe9..de9b32f 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -174,3 +174,17 @@ void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
spin_unlock(&kvm->mmu_lock);
}
}
+
+/*
+ * check if the corresponding access on the specified guest page is tracked.
+ */
+bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
+ enum kvm_page_track_mode mode)
+{
+ struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ int index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+ WARN_ON(!check_mode(mode));
+
+ return !!ACCESS_ONCE(slot->arch.gfn_track[mode][index]);
+}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 91e939b..ac85682 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -735,6 +735,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
return 0;
}
+ if (page_fault_handle_page_track(vcpu, error_code, walker.gfn))
+ return 1;
+
vcpu->arch.write_fault_to_shadow_pgtable = false;
is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
--
1.8.3.1
A notifier list is introduced so that any node that wants to receive
track events can register on the list.
Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
track events
- kvm_page_track_unregister_notifier(): stop receiving track events by
unregistering the notifier
The callback, node->track_write(), is called when a write access to a
write-tracked page happens (see the sketch after this description).
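Note that every registered node sees every tracked write, so each node
must figure out by itself whether the written page is one it cares
about. A hypothetical node watching a single gfn (example_watched_gfn
and the example_* names are made up for this sketch) might look like:

static gfn_t example_watched_gfn;

static void example_watch_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				const u8 *new, int bytes)
{
	/* the callback fires for all tracked pages; filter for ours */
	if (gpa >> PAGE_SHIFT != example_watched_gfn)
		return;

	/* handle the write to the watched page here */
}

static struct kvm_page_track_notifier_node example_watcher_node = {
	.track_write = example_watch_write,
};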
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/kvm_page_track.h | 39 ++++++++++++++++++++
arch/x86/kvm/page_track.c | 67 +++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 4 +++
4 files changed, 111 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 085fde7..50ad7e8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -671,6 +671,7 @@ struct kvm_arch {
*/
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+ struct kvm_page_track_notifier_head track_notifier_head;
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 97ac9c3..1aae4ef 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -6,6 +6,36 @@ enum kvm_page_track_mode {
KVM_PAGE_TRACK_MAX,
};
+/*
+ * The notifier represented by @kvm_page_track_notifier_node is linked into
+ * the head which will be notified when guest is triggering the track event.
+ *
+ * Write access on the head is protected by kvm->mmu_lock, read access
+ * is protected by track_srcu.
+ */
+struct kvm_page_track_notifier_head {
+ struct srcu_struct track_srcu;
+ struct hlist_head track_notifier_list;
+};
+
+struct kvm_page_track_notifier_node {
+ struct hlist_node node;
+
+ /*
+ * It is called when guest is writing the write-tracked page
+ * and write emulation is finished at that time.
+ *
+ * @vcpu: the vcpu where the write access happened.
+ * @gpa: the physical address written by guest.
+ * @new: the data was written to the address.
+ * @bytes: the written length.
+ */
+ void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes);
+};
+
+void kvm_page_track_init(struct kvm *kvm);
+
void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
struct kvm_memory_slot *dont);
int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
@@ -25,4 +55,13 @@ void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
enum kvm_page_track_mode mode);
+
+void
+kvm_page_track_register_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n);
+void
+kvm_page_track_unregister_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n);
+void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes);
#endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index de9b32f..0692cc6 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -188,3 +188,70 @@ bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
return !!ACCESS_ONCE(slot->arch.gfn_track[mode][index]);
}
+
+void kvm_page_track_init(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+ init_srcu_struct(&head->track_srcu);
+ INIT_HLIST_HEAD(&head->track_notifier_list);
+}
+
+/*
+ * register the notifier so that event interception for the tracked guest
+ * pages can be received.
+ */
+void
+kvm_page_track_register_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+
+ spin_lock(&kvm->mmu_lock);
+ hlist_add_head_rcu(&n->node, &head->track_notifier_list);
+ spin_unlock(&kvm->mmu_lock);
+}
+
+/*
+ * stop receiving the event interception. It is the opposite operation of
+ * kvm_page_track_register_notifier().
+ */
+void
+kvm_page_track_unregister_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+
+ spin_lock(&kvm->mmu_lock);
+ hlist_del_rcu(&n->node);
+ spin_unlock(&kvm->mmu_lock);
+ synchronize_srcu(&head->track_srcu);
+}
+
+/*
+ * Notify the node that write access is intercepted and write emulation is
+ * finished at this time.
+ *
+ * The node should figure out if the written page is the one that node is
+ * interested in by itself.
+ */
+void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes)
+{
+ struct kvm_page_track_notifier_head *head;
+ struct kvm_page_track_notifier_node *n;
+ int idx;
+
+ head = &vcpu->kvm->arch.track_notifier_head;
+
+ idx = srcu_read_lock(&head->track_srcu);
+ hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+ if (n->track_write)
+ n->track_write(vcpu, gpa, new, bytes);
+ srcu_read_unlock(&head->track_srcu, idx);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ab1ad9..dc43d99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4331,6 +4331,7 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
if (ret < 0)
return 0;
kvm_mmu_pte_write(vcpu, gpa, val, bytes);
+ kvm_page_track_write(vcpu, gpa, val, bytes);
return 1;
}
@@ -4589,6 +4590,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
kvm_mmu_pte_write(vcpu, gpa, new, bytes);
+ kvm_page_track_write(vcpu, gpa, new, bytes);
return X86EMUL_CONTINUE;
@@ -7697,6 +7699,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
+ kvm_page_track_init(kvm);
+
return 0;
}
--
1.8.3.1
Non-leaf shadow pages are always write protected, so they can be users
of page tracking.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b23f9fc..5a2ca73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -806,11 +806,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
+ kvm->arch.indirect_shadow_pages++;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+
+ /* the non-leaf shadow pages are kept read-only. */
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ return kvm_slot_page_track_add_page_nolock(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_disallow_lpage(slot, gfn);
- kvm->arch.indirect_shadow_pages++;
}
static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -819,11 +825,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
+ kvm->arch.indirect_shadow_pages--;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+ if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+ return kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn,
+ KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_allow_lpage(slot, gfn);
- kvm->arch.indirect_shadow_pages--;
}
static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
if (!direct) {
- if (rmap_write_protect(vcpu, gfn))
+ /*
+ * we should do write protection before syncing pages
+ * otherwise the content of the synced shadow page may
+ * be inconsistent with guest page table.
+ */
+ account_shadowed(vcpu->kvm, sp);
+
+ if (level == PT_PAGE_TABLE_LEVEL &&
+ rmap_write_protect(vcpu, gfn))
kvm_flush_remote_tlbs(vcpu->kvm);
if (level > PT_PAGE_TABLE_LEVEL && need_sync)
kvm_sync_pages(vcpu, gfn);
-
- account_shadowed(vcpu->kvm, sp);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
--
1.8.3.1
Now all non-leaf shadow pages are page tracked; if a gfn is not
tracked, then no non-leaf shadow page of that gfn exists, and we can
directly make the shadow pages of the gfn unsync.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 26 ++++++++------------------
1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
}
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+ bool can_unsync)
{
struct kvm_mmu_page *s;
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+ if (!can_unsync)
+ return true;
+
if (s->unsync)
continue;
WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
__kvm_unsync_page(vcpu, s);
}
+
+ return false;
}
static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
bool can_unsync)
{
- struct kvm_mmu_page *s;
- bool need_unsync = false;
-
if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;
- for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
- if (!can_unsync)
- return true;
-
- if (s->role.level != PT_PAGE_TABLE_LEVEL)
- return true;
-
- if (!s->unsync)
- need_unsync = true;
- }
- if (need_unsync)
- kvm_unsync_pages(vcpu, gfn);
-
- return false;
+ return kvm_unsync_pages(vcpu, gfn, can_unsync);
}
static bool kvm_is_mmio_pfn(pfn_t pfn)
--
1.8.3.1
If the page fault is caused by a write access to a write-tracked page,
the real shadow page walk is skipped, so we lose the chance to clear
the write-flooding count for the page-table structure the current vcpu
is using.
Fix it by locklessly walking the shadow page table to clear the write
flooding on the shadow page structure outside of the mmu-lock. To make
this safe, the count is changed to atomic_t.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu.c | 25 +++++++++++++++++++++----
arch/x86/kvm/paging_tmpl.h | 4 +++-
3 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 50ad7e8..1d2968e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -252,7 +252,7 @@ struct kvm_mmu_page {
#endif
/* Number of writes since the last time traversal visited this page. */
- int write_flooding_count;
+ atomic_t write_flooding_count;
};
struct kvm_pio_request {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f89e77f..9f6a4ef 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2081,7 +2081,7 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp)
static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
{
- sp->write_flooding_count = 0;
+ atomic_set(&sp->write_flooding_count, 0);
}
static void clear_sp_write_flooding_count(u64 *spte)
@@ -2461,8 +2461,7 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
}
-static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool can_unsync)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
{
struct kvm_mmu_page *s;
@@ -3419,6 +3418,23 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
return false;
}
+static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ u64 spte;
+
+ if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+ return;
+
+ walk_shadow_page_lockless_begin(vcpu);
+ for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+ clear_sp_write_flooding_count(iterator.sptep);
+ if (!is_shadow_present_pte(spte))
+ break;
+ }
+ walk_shadow_page_lockless_end(vcpu);
+}
+
static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code, bool prefault)
{
@@ -4246,7 +4262,8 @@ static bool detect_write_flooding(struct kvm_mmu_page *sp)
if (sp->role.level == PT_PAGE_TABLE_LEVEL)
return false;
- return ++sp->write_flooding_count >= 3;
+ atomic_inc(&sp->write_flooding_count);
+ return atomic_read(&sp->write_flooding_count) >= 3;
}
/*
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ac85682..97fe5ac 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -735,8 +735,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
return 0;
}
- if (page_fault_handle_page_track(vcpu, error_code, walker.gfn))
+ if (page_fault_handle_page_track(vcpu, error_code, walker.gfn)) {
+ shadow_page_table_clear_flood(vcpu, addr);
return 1;
+ }
vcpu->arch.write_fault_to_shadow_pgtable = false;
--
1.8.3.1
Register the notifier to receive write track events so that we can
update our shadow page tables.
This makes kvm_mmu_pte_write() the callback of the notifier; no
functionality is changed.
Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++--
arch/x86/kvm/mmu.c | 19 +++++++++++++++++--
arch/x86/kvm/x86.c | 4 ++--
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1d2968e..28f5c7d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -671,6 +671,7 @@ struct kvm_arch {
*/
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+ struct kvm_page_track_notifier_node mmu_sp_tracker;
struct kvm_page_track_notifier_head track_notifier_head;
struct list_head assigned_dev_head;
@@ -966,6 +967,8 @@ void kvm_mmu_module_exit(void);
void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
int kvm_mmu_create(struct kvm_vcpu *vcpu);
void kvm_mmu_setup(struct kvm_vcpu *vcpu);
+void kvm_mmu_init_vm(struct kvm *kvm);
+void kvm_mmu_uninit_vm(struct kvm *kvm);
void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
@@ -1105,8 +1108,6 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
void kvm_inject_nmi(struct kvm_vcpu *vcpu);
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
- const u8 *new, int bytes);
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9f6a4ef..a420c43 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4325,8 +4325,8 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
return spte;
}
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
- const u8 *new, int bytes)
+static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+ const u8 *new, int bytes)
{
gfn_t gfn = gpa >> PAGE_SHIFT;
struct kvm_mmu_page *sp;
@@ -4540,6 +4540,21 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
}
+void kvm_mmu_init_vm(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+ node->track_write = kvm_mmu_pte_write;
+ kvm_page_track_register_notifier(kvm, node);
+}
+
+void kvm_mmu_uninit_vm(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+ kvm_page_track_unregister_notifier(kvm, node);
+}
+
/* The return value indicates if tlb flush on all vcpus is needed. */
typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc43d99..f06a4fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4330,7 +4330,6 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
if (ret < 0)
return 0;
- kvm_mmu_pte_write(vcpu, gpa, val, bytes);
kvm_page_track_write(vcpu, gpa, val, bytes);
return 1;
}
@@ -4589,7 +4588,6 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
return X86EMUL_CMPXCHG_FAILED;
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
- kvm_mmu_pte_write(vcpu, gpa, new, bytes);
kvm_page_track_write(vcpu, gpa, new, bytes);
return X86EMUL_CONTINUE;
@@ -7700,6 +7698,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
kvm_page_track_init(kvm);
+ kvm_mmu_init_vm(kvm);
return 0;
}
@@ -7827,6 +7826,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kfree(kvm->arch.vioapic);
kvm_free_vcpus(kvm);
kfree(rcu_dereference_check(kvm->arch.apic_map, 1));
+ kvm_mmu_uninit_vm(kvm);
}
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
--
1.8.3.1
On 12/23/2015 07:25 PM, Xiao Guangrong wrote:
> Now all non-leaf shadow pages are page tracked; if a gfn is not
> tracked, then no non-leaf shadow page of that gfn exists, and we can
> directly make the shadow pages of the gfn unsync.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 26 ++++++++------------------
> 1 file changed, 8 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 5a2ca73..f89e77f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> kvm_mmu_mark_parents_unsync(sp);
> }
>
> -static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
> +static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
> + bool can_unsync)
> {
> struct kvm_mmu_page *s;
>
> for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
> + if (!can_unsync)
> + return true;
> +
> if (s->unsync)
> continue;
> WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
> __kvm_unsync_page(vcpu, s);
> }
> +
> + return false;
> }
I hate to say it, but it looks odd to me that kvm_unsync_pages takes a
bool parameter called can_unsync and returns a bool (which looks like
it indicates whether the unsync succeeded or not). How about calling
__kvm_unsync_pages directly in mmu_need_write_protect and leaving
kvm_unsync_pages unchanged (or even removing it, as it looks like it is
used nowhere else)? But again, it's up to you and Paolo.
Thanks,
-Kai
>
> static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> bool can_unsync)
> {
> - struct kvm_mmu_page *s;
> - bool need_unsync = false;
> -
> if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
> return true;
>
> - for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
> - if (!can_unsync)
> - return true;
> -
> - if (s->role.level != PT_PAGE_TABLE_LEVEL)
> - return true;
> -
> - if (!s->unsync)
> - need_unsync = true;
> - }
> - if (need_unsync)
> - kvm_unsync_pages(vcpu, gfn);
> -
> - return false;
> + return kvm_unsync_pages(vcpu, gfn, can_unsync);
> }
>
> static bool kvm_is_mmio_pfn(pfn_t pfn)
On 12/24/2015 04:36 PM, Kai Huang wrote:
>
>
> On 12/23/2015 07:25 PM, Xiao Guangrong wrote:
>> Now all non-leaf shadow pages are page tracked; if a gfn is not
>> tracked, then no non-leaf shadow page of that gfn exists, and we can
>> directly make the shadow pages of the gfn unsync.
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> arch/x86/kvm/mmu.c | 26 ++++++++------------------
>> 1 file changed, 8 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 5a2ca73..f89e77f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>> kvm_mmu_mark_parents_unsync(sp);
>> }
>> -static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
>> + bool can_unsync)
>> {
>> struct kvm_mmu_page *s;
>> for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
>> + if (!can_unsync)
>> + return true;
>> +
>> if (s->unsync)
>> continue;
>> WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
>> __kvm_unsync_page(vcpu, s);
>> }
>> +
>> + return false;
>> }
> I hate to say it, but it looks odd to me that kvm_unsync_pages takes a bool parameter called can_unsync
> and returns a bool (which looks like it indicates whether the unsync succeeded or not). How about
> calling __kvm_unsync_pages directly in mmu_need_write_protect and leaving kvm_unsync_pages unchanged
> (or even removing it, as it looks like it is used nowhere else)? But again, it's up to you and Paolo.
>
Makes sense; the updated version is attached, could you review it?
On 12/24/2015 05:11 PM, Xiao Guangrong wrote:
>
>
> On 12/24/2015 04:36 PM, Kai Huang wrote:
>>
>>
>> On 12/23/2015 07:25 PM, Xiao Guangrong wrote:
>>> Now all non-leaf shadow pages are page tracked; if a gfn is not
>>> tracked, then no non-leaf shadow page of that gfn exists, and we can
>>> directly make the shadow pages of the gfn unsync.
>>>
>>> Signed-off-by: Xiao Guangrong <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu.c | 26 ++++++++------------------
>>> 1 file changed, 8 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 5a2ca73..f89e77f 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct
>>> kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>>> kvm_mmu_mark_parents_unsync(sp);
>>> }
>>> -static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
>>> +static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
>>> + bool can_unsync)
>>> {
>>> struct kvm_mmu_page *s;
>>> for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
>>> + if (!can_unsync)
>>> + return true;
>>> +
>>> if (s->unsync)
>>> continue;
>>> WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
>>> __kvm_unsync_page(vcpu, s);
>>> }
>>> +
>>> + return false;
>>> }
>> I hate to say it, but it looks odd to me that kvm_unsync_pages takes
>> a bool parameter called can_unsync and returns a bool (which looks
>> like it indicates whether the unsync succeeded or not). How about
>> calling __kvm_unsync_pages directly in mmu_need_write_protect and
>> leaving kvm_unsync_pages unchanged (or even removing it, as it looks
>> like it is used nowhere else)? But again, it's up to you and Paolo.
>>
>
> Makes sense; the updated version is attached, could you review it?
Sure and it looks good to me.
Thanks,
-Kai