Hi,
This series is on top of kvmarm/next as I needed to also modify Eager
page splitting logic in clear-dirty-log API. Eager page splitting is not
present in Linux 6.4-rc4.
Also, I had to change selftests patches (1 to 5) as some commits were
removed from kvm/queue remote. This caused issue due to different APIs
being present in dirty_log_perf_test when I was rebasing v2. Those
removed commits are now back in kvm-x86 branch of Sean [1] but not in
kvmarm/next or kvm/queue. I didn't want to wait for review of v2, so I
changed dirty_log_perf_test to work with kvmarm/next branch. When Sean's
kvm-x86 branch is merged, sleftests in this patch series need to be
modified to use new APIs or whoever merges last need to take care of
that.
This patch series modifies clear-dirty-log operation to run under MMU
read lock. It write protects SPTEs and split huge pages using MMU read
lock instead of MMU write lock.
Use of MMU read lock is made possible by using shared page table
walkers. Currently only page fault handlers use shared page table
walkers, with this series, clear-dirty-log operation will also use
shared page table walkers.
Patches 1 to 5:
These patches are modifying dirty_log_perf_test. Intent is to mimic
production scenarios where guest keeps on executing while userspace
thread collects and clears dirty logs independently.
Three new command line options are added:
1. j: Allows to run guest vCPUs and main thread collecting dirty logs
independently of each other after initialization is complete.
2. k: Allows to clear dirty logs in smaller chunks compared to existing
whole memslot clear in one call.
3. l: Allows to add customizable wait time between consecutive clear
dirty log calls to mimic sending dirty memory to destination.
Patch 7-16:
These patches refactor code to move MMU lock operations to arch specific
code, refactor Arm's page table walker APIs, and change MMU write lock
for clearing dirty logs to read lock. Patch 16 has results showing
improvements based on dirty_log_perf_test.
1. https://lore.kernel.org/lkml/[email protected]/
v2:
- Fix compile warning for mips and riscv.
- Added logic to continue or retry shared page walk which are not fault
handler.
- Huge page split also changed to run under MMU read lock.
- Added more explanations in commit logs.
- Selftests is modified because a commit series was reverted back in
dirty_log_perf_test on kvm/queue.
v1: https://lore.kernel.org/lkml/[email protected]/
Vipin Sharma (16):
KVM: selftests: Clear dirty logs in user defined chunks sizes in
dirty_log_perf_test
KVM: selftests: Add optional delay between consecutive clear-dirty-log
calls
KVM: selftests: Pass the count of read and write accesses from guest
to host
KVM: selftests: Print read-write progress by vCPUs in
dirty_log_perf_test
KVM: selftests: Allow independent execution of vCPUs in
dirty_log_perf_test
KVM: arm64: Correct the kvm_pgtable_stage2_flush() documentation
KVM: mmu: Move mmu lock/unlock to arch code for clear dirty log
KMV: arm64: Pass page table walker flags to stage2_apply_range_*()
KVM: arm64: Document the page table walker actions based on the
callback's return value
KVM: arm64: Return -ENOENT if PTE is not valid in stage2_attr_walker
KVM: arm64: Use KVM_PGTABLE_WALK_SHARED flag instead of
KVM_PGTABLE_WALK_HANDLE_FAULT
KVM: arm64: Retry shared page table walks outside of fault handler
KVM: arm64: Run clear-dirty-log under MMU read lock
KVM: arm64: Pass page walker flags from callers of stage 2 split
walker
KVM: arm64: Provide option to pass page walker flag for huge page
splits
KVM: arm64: Split huge pages during clear-dirty-log under MMU read
lock
arch/arm64/include/asm/kvm_pgtable.h | 42 +++--
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 4 +-
arch/arm64/kvm/hyp/pgtable.c | 68 ++++++--
arch/arm64/kvm/mmu.c | 65 +++++---
arch/mips/kvm/mmu.c | 2 +
arch/riscv/kvm/mmu.c | 2 +
arch/x86/kvm/mmu/mmu.c | 3 +
.../selftests/kvm/dirty_log_perf_test.c | 147 ++++++++++++++----
tools/testing/selftests/kvm/lib/memstress.c | 13 +-
virt/kvm/dirty_ring.c | 2 -
virt/kvm/kvm_main.c | 4 -
11 files changed, 265 insertions(+), 87 deletions(-)
base-commit: 532b2ecfa547f02b1825108711565eff026bce5a
--
2.41.0.rc0.172.g3f132b7071-goog
Pass enum kvm_pgtable_walk_flags{} to kvm_pgtable_stage2_split() walker
from its caller.
This allows split walker users to specify if they want to run split
logic via shared walker or non-shared walker.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 4 +++-
arch/arm64/kvm/hyp/pgtable.c | 5 +++--
arch/arm64/kvm/mmu.c | 2 +-
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 145be12a5fc2..fbf5c6c509fb 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -684,6 +684,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size,
* @size: Size of the range.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
+ * @flags: Page walker flags
*
* The function tries to split any level 1 or 2 entry that overlaps
* with the input range (given by @addr and @size).
@@ -693,7 +694,8 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size,
* blocks in the input range as allowed by @mc_capacity.
*/
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
- struct kvm_mmu_memory_cache *mc);
+ struct kvm_mmu_memory_cache *mc,
+ enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_walk() - Walk a page-table.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 23cda3de2dd4..7e84be13d76d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1408,11 +1408,12 @@ static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
}
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
- struct kvm_mmu_memory_cache *mc)
+ struct kvm_mmu_memory_cache *mc,
+ enum kvm_pgtable_walk_flags flags)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_split_walker,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = flags | KVM_PGTABLE_WALK_LEAF,
.arg = mc,
};
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7c966f6f1a41..34d2bd03cf5f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -153,7 +153,7 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
return -EINVAL;
next = __stage2_range_addr_end(addr, end, chunk_size);
- ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+ ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache, 0);
if (ret)
break;
} while (addr = next, addr != end);
--
2.41.0.rc0.172.g3f132b7071-goog
Allow stage2_apply_range_*() to accept enum kvm_pgtable_walk_flags{}
for stage 2 walkers. Pass 0 as the flag value from all of its caller
effectively making it a no-op.
Page table walker flags will be used in future commits to enable
clear-dirty-log operation under MMU read lock.
Current users of stage2_apply_range_*() API runs under assumption of
holding MMU write lock. Stage2 page table walkers then run under the
same assumption. In future commits, when clear-dirty-log is modified to
run under MMU read lock then this flag will be used to pass shared page
walk intent.
No functional changes intended.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 12 +++++++++---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 4 ++--
arch/arm64/kvm/hyp/pgtable.c | 16 ++++++++++------
arch/arm64/kvm/mmu.c | 26 ++++++++++++++++----------
4 files changed, 37 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d542a671c564..8ef7e8f3f054 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -560,6 +560,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to remove the mapping.
* @size: Size of the mapping.
+ * @flags: Page-table walker flags.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
@@ -572,7 +573,8 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
+int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
@@ -580,6 +582,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to write-protect,
* @size: Size of the range.
+ * @flags: Page-table walker flags.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
@@ -590,7 +593,8 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
+int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
@@ -662,13 +666,15 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to flush.
* @size: Size of the range.
+ * @flags: Page-table walker flags.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
+int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d35e75b13ffe..13f5cf5f87c3 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -333,11 +333,11 @@ static int host_stage2_unmap_dev_all(void)
/* Unmap all non-memory regions to recycle the pages */
for (i = 0; i < hyp_memblock_nr; i++, addr = reg->base + reg->size) {
reg = &hyp_memory[i];
- ret = kvm_pgtable_stage2_unmap(pgt, addr, reg->base - addr);
+ ret = kvm_pgtable_stage2_unmap(pgt, addr, reg->base - addr, 0);
if (ret)
return ret;
}
- return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr);
+ return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr, 0);
}
struct kvm_mem_range {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 364b68013038..a3a0812b2301 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1044,12 +1044,14 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
return 0;
}
-int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
+int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
.arg = pgt,
- .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
+ .flags = flags | KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_TABLE_POST,
};
return kvm_pgtable_walk(pgt, addr, size, &walker);
@@ -1128,11 +1130,12 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
return 0;
}
-int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
+int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags)
{
return stage2_update_leaf_attrs(pgt, addr, size, 0,
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
- NULL, NULL, 0);
+ NULL, NULL, flags);
}
kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
@@ -1213,11 +1216,12 @@ static int stage2_flush_walker(const struct kvm_pgtable_visit_ctx *ctx,
return 0;
}
-int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
+int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ enum kvm_pgtable_walk_flags flags)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_flush_walker,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = flags | KVM_PGTABLE_WALK_LEAF,
.arg = pgt,
};
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0c2c2c0846f1..1030921d89f8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -55,7 +55,9 @@ static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
*/
static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
phys_addr_t end,
- int (*fn)(struct kvm_pgtable *, u64, u64),
+ enum kvm_pgtable_walk_flags flags,
+ int (*fn)(struct kvm_pgtable *, u64, u64,
+ enum kvm_pgtable_walk_flags),
bool resched)
{
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
@@ -68,7 +70,7 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
return -EINVAL;
next = stage2_range_addr_end(addr, end);
- ret = fn(pgt, addr, next - addr);
+ ret = fn(pgt, addr, next - addr, flags);
if (ret)
break;
@@ -79,8 +81,8 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
return ret;
}
-#define stage2_apply_range_resched(mmu, addr, end, fn) \
- stage2_apply_range(mmu, addr, end, fn, true)
+#define stage2_apply_range_resched(mmu, addr, end, flags, fn) \
+ stage2_apply_range(mmu, addr, end, flags, fn, true)
/*
* Get the maximum number of page-tables pages needed to split a range
@@ -316,7 +318,7 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
lockdep_assert_held_write(&kvm->mmu_lock);
WARN_ON(size & ~PAGE_MASK);
- WARN_ON(stage2_apply_range(mmu, start, end, kvm_pgtable_stage2_unmap,
+ WARN_ON(stage2_apply_range(mmu, start, end, 0, kvm_pgtable_stage2_unmap,
may_block));
}
@@ -331,7 +333,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
- stage2_apply_range_resched(&kvm->arch.mmu, addr, end, kvm_pgtable_stage2_flush);
+ stage2_apply_range_resched(&kvm->arch.mmu, addr, end, 0,
+ kvm_pgtable_stage2_flush);
}
/**
@@ -1041,10 +1044,13 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
* @mmu: The KVM stage-2 MMU pointer
* @addr: Start address of range
* @end: End address of range
+ * @flags: Page-table walker flags.
*/
-static void stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end,
+ enum kvm_pgtable_walk_flags flags)
{
- stage2_apply_range_resched(mmu, addr, end, kvm_pgtable_stage2_wrprotect);
+ stage2_apply_range_resched(mmu, addr, end, flags,
+ kvm_pgtable_stage2_wrprotect);
}
/**
@@ -1073,7 +1079,7 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
write_lock(&kvm->mmu_lock);
- stage2_wp_range(&kvm->arch.mmu, start, end);
+ stage2_wp_range(&kvm->arch.mmu, start, end, 0);
write_unlock(&kvm->mmu_lock);
kvm_flush_remote_tlbs(kvm);
}
@@ -1128,7 +1134,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
write_lock(&kvm->mmu_lock);
lockdep_assert_held_write(&kvm->mmu_lock);
- stage2_wp_range(&kvm->arch.mmu, start, end);
+ stage2_wp_range(&kvm->arch.mmu, start, end, 0);
/*
* Eager-splitting is done when manual-protect is set. We
--
2.41.0.rc0.172.g3f132b7071-goog
Split huge pages under MMU read lock instead of write when clearing
dirty log.
Running huge page split under read lock will unblock vCPUs execution and
allow whole clear-dirty-log operation run parallelly to vCPUs.
Note that splitting huge pages involves two walkers. First walker calls
stage2_split_walker() callback on each huge page. This callback will call
another walker which creates an unlinked page table. This commit makes
first walker as shared page walker which means, -EAGAIN will be retried.
Before this patch, -EAGAIN would have been ignored and walker would go
to next huge page. In practice this would not happen as the first walker
was holding MMU write lock. Inner walker is unchanged as it is working
on unlinked page table so no other thread will have access to it.
To improve confidence in correctness tested via dirty_log_test.
To measure performance improvement tested via dirty_log_perf_test.
Set up:
-------
Host: ARM Ampere Altra host (64 CPUs, 256 GB memory and single NUMA
node)
Test VM: 48 vCPU, 192 GB total memory.
Ran dirty_log_perf_test for 400 iterations.
./dirty_log_perf_test -k 192G -v 48 -b 4G -m 2 -i 4000 -s anonymous_hugetlb_2mb -j
Observation:
------------
+==================+=============================+===================+
| Clear Chunk size | Clear dirty log time change | vCPUs improvement |
+==================+=============================+===================+
| 192GB | 56% | 152% |
+------------------+-----------------------------+-------------------+
| 1GB | -81% | 72% |
+------------------+-----------------------------+-------------------+
When larger chunks are used, clear dirty log time increases due to lots
of cmpxchg() but vCPUs are also able to execute parallelly causing
better performance of guest.
When chunk size is small, read lock is very fast in clearing dirty logs
as it is not waiting for MMU write lock and vCPUs are also able to run
parallelly.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/mmu.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6dd964e3682c..aa278f5d27a2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -126,7 +126,10 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
int ret, cache_capacity;
u64 next, chunk_size;
- lockdep_assert_held_write(&kvm->mmu_lock);
+ if (flags & KVM_PGTABLE_WALK_SHARED)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ else
+ lockdep_assert_held_write(&kvm->mmu_lock);
chunk_size = kvm->arch.mmu.split_page_chunk_size;
cache_capacity = kvm_mmu_split_nr_page_tables(chunk_size);
@@ -138,13 +141,19 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
do {
if (need_split_memcache_topup_or_resched(kvm)) {
- write_unlock(&kvm->mmu_lock);
+ if (flags & KVM_PGTABLE_WALK_SHARED)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
cond_resched();
/* Eager page splitting is best-effort. */
ret = __kvm_mmu_topup_memory_cache(cache,
cache_capacity,
cache_capacity);
- write_lock(&kvm->mmu_lock);
+ if (flags & KVM_PGTABLE_WALK_SHARED)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
if (ret)
break;
}
@@ -1139,9 +1148,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
read_lock(&kvm->mmu_lock);
stage2_wp_range(&kvm->arch.mmu, start, end, KVM_PGTABLE_WALK_SHARED);
- read_unlock(&kvm->mmu_lock);
- write_lock(&kvm->mmu_lock);
/*
* Eager-splitting is done when manual-protect is set. We
* also check for initially-all-set because we can avoid
@@ -1151,8 +1158,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
* again.
*/
if (kvm_dirty_log_manual_protect_and_init_set(kvm))
- kvm_mmu_split_huge_pages(kvm, start, end, 0);
- write_unlock(&kvm->mmu_lock);
+ kvm_mmu_split_huge_pages(kvm, start, end, KVM_PGTABLE_WALK_SHARED);
+ read_unlock(&kvm->mmu_lock);
}
static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
--
2.41.0.rc0.172.g3f132b7071-goog
Return -ENOENT from stage2_attr_walker for invalid PTE. Continue page
table walk if walker callback returns -ENOENT outside of the fault
handler path else terminate the walk. In fault handler path, similar to
-EAGAIN in user_mem_abort, retry guest execution.
stage2_attr_walker() is used from multiple places like, write
protection, MMU notifier callbacks, and relaxing permission during vCPU
faults. This function returns -EAGAIN for different cases:
1. When PTE is not valid.
2. When cmpxchg() fails while setting new SPTE.
For non-shared walkers, like write protection and MMU notifier, above 2
cases are just ignored by walker and it moves to the next SPTE. #2 will
never happen for non-shared walkers as they don't use cmpxchg() for
updating SPTEs.
For shared walkers, like vCPU fault handler, above 2 cases results in
walk termination.
In future commits, clear-dirty-log walker will write protect SPTEs under
MMU read lock and use shared page table walker. This will result in two
shared page table walkers type, vCPUs fault handler and clear-dirty-log,
competing with each other and sometime causing cmpxchg() failure. So,
-EAGAIN in clear-dirty-log walker due to cmpxchg() failure must be
retried. Whereas, -EAGAIN in the clear-dirty-log due to invalid SPTE
must be ignored instead of exiting as per the current logic of shared
page table walker. This is not needed for vCPU fault handler which also
runs via shared page table walker and terminates walk on getting -EAGAIN
due to invalid SPTE.
To handle all these scenarios, stage2_attr_walker must return different
error codes for invalid SPTEs and cmxchg() failure. -ENOENT for invalid
SPTE is chosen because it is not used by any other shared walker. When
clear-dirty-log will be changed to use shared page table walker, it will
be possible to differentiate cases of retrying, continuing or
terminating the walk for shared fault handler and shared
clear-dirty-log.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 1 +
arch/arm64/kvm/hyp/pgtable.c | 19 ++++++++++++-------
arch/arm64/kvm/mmu.c | 2 +-
3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 957bc20dab00..23e7e7851f1d 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -720,6 +720,7 @@ int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
* -------------|------------------|--------------
* Non-Shared | 0 | Continue
* Non-Shared | -EAGAIN | Continue
+ * Non-Shared | -ENOENT | Continue
* Non-Shared | Any other | Exit
* -------------|------------------|--------------
* Shared | 0 | Continue
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a3a0812b2301..bc8c5c4ac1cf 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -186,14 +186,19 @@ static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
/*
* Visitor callbacks return EAGAIN when the conditions that led to a
* fault are no longer reflected in the page tables due to a race to
- * update a PTE. In the context of a fault handler this is interpreted
- * as a signal to retry guest execution.
+ * update a PTE.
*
- * Ignore the return code altogether for walkers outside a fault handler
- * (e.g. write protecting a range of memory) and chug along with the
- * page table walk.
+ * Callbacks can also return ENOENT when PTE which is visited is not
+ * valid.
+ *
+ * In the context of a fault handler interpret these as a signal
+ * to retry guest execution.
+ *
+ * Ignore these return codes altogether for walkers outside a fault
+ * handler (e.g. write protecting a range of memory) and chug along
+ * with the page table walk.
*/
- if (r == -EAGAIN)
+ if (r == -EAGAIN || r == -ENOENT)
return !(walker->flags & KVM_PGTABLE_WALK_HANDLE_FAULT);
return !r;
@@ -1072,7 +1077,7 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
if (!kvm_pte_valid(ctx->old))
- return -EAGAIN;
+ return -ENOENT;
data->level = ctx->level;
data->pte = pte;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1030921d89f8..356dc4131023 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1551,7 +1551,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
read_unlock(&kvm->mmu_lock);
kvm_set_pfn_accessed(pfn);
kvm_release_pfn_clean(pfn);
- return ret != -EAGAIN ? ret : 0;
+ return (ret != -EAGAIN && ret != -ENOENT) ? ret : 0;
}
/* Resolve the access fault by making the page young again. */
--
2.41.0.rc0.172.g3f132b7071-goog
For a shared page walker which is not fault handler, retry the walk if
walker callback function returns -EAGAIN, or continue to the next SPTE
if callback function return -ENOENT. Update the kvm_pgtable_walk
documentation.
For fault handler logic remains same, i.e. exit the walk and resume the
guest when getting -EAGAIN and -ENOENT errors from walker callback
function.
Currently, there is no page walker which is shared and not a fault
handler, but this will change in future patches when clear-dirty-log
walker will use MMU read lock and run via shared walker.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 23 ++++++++++-------
arch/arm64/kvm/hyp/pgtable.c | 38 +++++++++++++++++++++++-----
2 files changed, 46 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 23e7e7851f1d..145be12a5fc2 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -716,15 +716,20 @@ int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
* type of the walker, i.e. shared walker (vCPU fault handlers) or non-shared
* walker.
*
- * Walker Type | Callback | Walker action
- * -------------|------------------|--------------
- * Non-Shared | 0 | Continue
- * Non-Shared | -EAGAIN | Continue
- * Non-Shared | -ENOENT | Continue
- * Non-Shared | Any other | Exit
- * -------------|------------------|--------------
- * Shared | 0 | Continue
- * Shared | Any other | Exit
+ * Walker Type | Callback | Walker action
+ * -----------------------|------------------|--------------
+ * Non-Shared | 0 | Continue
+ * Non-Shared | -EAGAIN | Continue
+ * Non-Shared | -ENOENT | Continue
+ * Non-Shared | Any other | Exit
+ * -----------------------|------------------|--------------
+ * Shared | 0 | Continue
+ * Shared | -EAGAIN | Retry
+ * Shared | -ENOENT | Continue
+ * Shared | Any other | Exit
+ * -----------------------|------------------|--------------
+ * Shared (Fault Handler) | 0 | Continue
+ * Shared (Fault Handler) | Any other | Exit
*
* Return: 0 on success, negative error code on failure.
*/
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 7f80e953b502..23cda3de2dd4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -191,15 +191,21 @@ static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
* Callbacks can also return ENOENT when PTE which is visited is not
* valid.
*
- * In the context of a shared walker interpret these as a signal
+ * In the context of a fault handler interpret these as a signal
* to retry guest execution.
*
- * Ignore these return codes altogether for walkers outside a fault
- * handler (e.g. write protecting a range of memory) and chug along
+ * In the context of a shared walker which is not fault handler
+ * interpret:
+ * 1. EAGAIN - A signal to retry walk again.
+ * 2. ENOENT - A signal to ignore and move on to next SPTE.
+ *
+ * Ignore these return codes altogether for other walkers and chug along
* with the page table walk.
*/
- if (r == -EAGAIN || r == -ENOENT)
+ if (r == -EAGAIN)
return !(walker->flags & KVM_PGTABLE_WALK_SHARED);
+ if (r == -ENOENT)
+ return !(walker->flags & KVM_PGTABLE_WALK_HANDLE_FAULT);
return !r;
}
@@ -260,24 +266,44 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
return ret;
}
+static bool kvm_pgtable_walk_retry(const struct kvm_pgtable_walker *walker,
+ int r)
+{
+ /*
+ * All shared page table walks where visitor callbacks return -EAGAIN
+ * should be retried with the exception of fault handler. In case of
+ * fault handler retry is achieved by resuming the guest.
+ */
+ if (r == -EAGAIN)
+ return (walker->flags & KVM_PGTABLE_WALK_SHARED) &&
+ !(walker->flags & KVM_PGTABLE_WALK_HANDLE_FAULT);
+
+ return !r;
+}
+
static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, u32 level)
{
u32 idx;
int ret = 0;
+ kvm_pteref_t pteref;
if (WARN_ON_ONCE(level >= KVM_PGTABLE_MAX_LEVELS))
return -EINVAL;
for (idx = kvm_pgtable_idx(data, level); idx < PTRS_PER_PTE; ++idx) {
- kvm_pteref_t pteref = &pgtable[idx];
+retry:
+ pteref = &pgtable[idx];
if (data->addr >= data->end)
break;
ret = __kvm_pgtable_visit(data, mm_ops, pteref, level);
- if (ret)
+ if (ret) {
+ if (kvm_pgtable_walk_retry(data->walker, ret))
+ goto retry;
break;
+ }
}
return ret;
--
2.41.0.rc0.172.g3f132b7071-goog
Check against shared page table walker flag instead of fault handler
flag when determining if walk should continue or not.
vCPU page fault handlers uses shared page walker and there are no
other shared page walkers in Arm. This will change in future commit when
clear-dirty-log will use shared page walker and continue, retry or
terminate logic for a walk will change between shared page walkers.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bc8c5c4ac1cf..7f80e953b502 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -191,7 +191,7 @@ static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
* Callbacks can also return ENOENT when PTE which is visited is not
* valid.
*
- * In the context of a fault handler interpret these as a signal
+ * In the context of a shared walker interpret these as a signal
* to retry guest execution.
*
* Ignore these return codes altogether for walkers outside a fault
@@ -199,7 +199,7 @@ static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
* with the page table walk.
*/
if (r == -EAGAIN || r == -ENOENT)
- return !(walker->flags & KVM_PGTABLE_WALK_HANDLE_FAULT);
+ return !(walker->flags & KVM_PGTABLE_WALK_SHARED);
return !r;
}
--
2.41.0.rc0.172.g3f132b7071-goog
Give users command line option (-j) to execute vCPUs independently of
dirty log iterations after initialization is complete.
This change makes dirty_log_perf_test behave like real world workflows
where guest vCPUs keep on executing while VMM collects and clear dirty
logs. Total pages touched during execution of test will give good
estimate of how vCPUs are performing while dirty logging is enabled.
Signed-off-by: Vipin Sharma <[email protected]>
---
.../selftests/kvm/dirty_log_perf_test.c | 64 +++++++++++++------
1 file changed, 44 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 14b012a0dcb1..fbf973d6cc66 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -69,6 +69,7 @@ static int iteration;
static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
static atomic_ullong total_reads;
static atomic_ullong total_writes;
+static bool lockstep_iterations;
static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
{
@@ -83,12 +84,16 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
struct timespec total = (struct timespec){0};
struct timespec avg;
struct ucall uc = {};
+ int current_iteration = -1;
int ret;
run = vcpu->run;
while (!READ_ONCE(host_quit)) {
- int current_iteration = READ_ONCE(iteration);
+ if (lockstep_iterations)
+ current_iteration = READ_ONCE(iteration);
+ else
+ current_iteration++;
clock_gettime(CLOCK_MONOTONIC, &start);
ret = _vcpu_run(vcpu);
@@ -118,13 +123,19 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
ts_diff.tv_nsec);
}
- /*
- * Keep running the guest while dirty logging is being disabled
- * (iteration is negative) so that vCPUs are accessing memory
- * for the entire duration of zapping collapsible SPTEs.
- */
- while (current_iteration == READ_ONCE(iteration) &&
- READ_ONCE(iteration) >= 0 && !READ_ONCE(host_quit)) {}
+ if (lockstep_iterations) {
+ /*
+ * Keep running the guest while dirty logging is being disabled
+ * (iteration is negative) so that vCPUs are accessing memory
+ * for the entire duration of zapping collapsible SPTEs.
+ */
+ while (current_iteration == READ_ONCE(iteration) &&
+ READ_ONCE(iteration) >= 0 && !READ_ONCE(host_quit))
+ ;
+ } else {
+ while (!READ_ONCE(iteration) && !READ_ONCE(host_quit))
+ ;
+ }
}
avg = timespec_div(total, vcpu_last_completed_iteration[vcpu_idx]);
@@ -332,18 +343,20 @@ static void run_test(enum vm_guest_mode mode, void *arg)
clock_gettime(CLOCK_MONOTONIC, &start);
iteration++;
- pr_debug("Starting iteration %d\n", iteration);
- for (i = 0; i < nr_vcpus; i++) {
- while (READ_ONCE(vcpu_last_completed_iteration[i])
- != iteration)
- ;
+ if (lockstep_iterations) {
+ pr_debug("Starting iteration %d\n", iteration);
+ for (i = 0; i < nr_vcpus; i++) {
+ while (READ_ONCE(vcpu_last_completed_iteration[i])
+ != iteration)
+ ;
+ }
+
+ ts_diff = timespec_elapsed(start);
+ vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
+ pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
+ iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
}
- ts_diff = timespec_elapsed(start);
- vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
- pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
- iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
-
clock_gettime(CLOCK_MONOTONIC, &start);
get_dirty_log(vm, bitmaps, p->slots);
ts_diff = timespec_elapsed(start);
@@ -365,6 +378,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
}
}
+ /* Block further vCPUs execution */
+ if (!lockstep_iterations)
+ WRITE_ONCE(iteration, 0);
+
/*
* Run vCPUs while dirty logging is being disabled to stress disabling
* in terms of both performance and correctness. Opt-in via command
@@ -458,6 +475,10 @@ static void help(char *name)
" To leave the application task unpinned, drop the final entry:\n\n"
" ./dirty_log_perf_test -v 3 -c 22,23,24\n\n"
" (default: no pinning)\n");
+ printf(" -j: Execute vCPUs independent of dirty log iterations\n"
+ " Independent vCPUs execution will allow them to continuously\n"
+ " dirty memory while main thread is collecting and clearing\n"
+ " dirty logs in each iteration.\n");
printf(" -k: Specify the chunk size in which dirty memory gets cleared\n"
" in memslots in each iteration. If the size is bigger than\n"
" the memslot size then whole memslot is cleared in one call.\n"
@@ -492,10 +513,10 @@ int main(int argc, char *argv[])
kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
KVM_DIRTY_LOG_INITIALLY_SET);
-
+ lockstep_iterations = true;
guest_modes_append_default();
- while ((opt = getopt(argc, argv, "ab:c:eghi:k:l:m:nop:r:s:v:x:w:")) != -1) {
+ while ((opt = getopt(argc, argv, "ab:c:eghi:jk:l:m:nop:r:s:v:x:w:")) != -1) {
switch (opt) {
case 'a':
p.random_access = true;
@@ -519,6 +540,9 @@ int main(int argc, char *argv[])
case 'i':
p.iterations = atoi_positive("Number of iterations", optarg);
break;
+ case 'j':
+ lockstep_iterations = false;
+ break;
case 'k':
p.clear_chunk_size = parse_size(optarg);
break;
--
2.41.0.rc0.172.g3f132b7071-goog
In dirty_log_perf_test, provide a new option 'k' to specify the size of
the chunks and clear dirty memory in chunks in each iteration. If option
is not provided then fallback to the old way of clearing whole memslot
in one call in each iteration.
In production environment whole memslot is rarely cleared in a single
call, instead clearing operation is split across multiple calls to
reduce time between clearing and sending memory to a remote host. This
change mimics the production usecases and allows to get performance
numbers based on that.
Signed-off-by: Vipin Sharma <[email protected]>
---
.../selftests/kvm/dirty_log_perf_test.c | 42 +++++++++++++++----
1 file changed, 34 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89..119ddfc7306e 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -134,6 +134,7 @@ struct test_params {
uint32_t write_percent;
uint32_t random_seed;
bool random_access;
+ uint64_t clear_chunk_size;
};
static void toggle_dirty_logging(struct kvm_vm *vm, int slots, bool enable)
@@ -169,16 +170,28 @@ static void get_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[], int slots
}
}
-static void clear_dirty_log(struct kvm_vm *vm, unsigned long *bitmaps[],
- int slots, uint64_t pages_per_slot)
+static void clear_dirty_log_in_chunks(struct kvm_vm *vm,
+ unsigned long *bitmaps[], int slots,
+ uint64_t pages_per_slot,
+ uint64_t pages_per_clear)
{
- int i;
+ uint64_t from, clear_pages_count;
+ int i, slot;
for (i = 0; i < slots; i++) {
- int slot = MEMSTRESS_MEM_SLOT_INDEX + i;
-
- kvm_vm_clear_dirty_log(vm, slot, bitmaps[i], 0, pages_per_slot);
+ slot = MEMSTRESS_MEM_SLOT_INDEX + i;
+ from = 0;
+ clear_pages_count = pages_per_clear;
+
+ while (from < pages_per_slot) {
+ if (from + clear_pages_count > pages_per_slot)
+ clear_pages_count = pages_per_slot - from;
+ kvm_vm_clear_dirty_log(vm, slot, bitmaps[i], from,
+ clear_pages_count);
+ from += clear_pages_count;
+ }
}
+
}
static unsigned long **alloc_bitmaps(int slots, uint64_t pages_per_slot)
@@ -215,6 +228,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
uint64_t guest_num_pages;
uint64_t host_num_pages;
uint64_t pages_per_slot;
+ uint64_t pages_per_clear;
struct timespec start;
struct timespec ts_diff;
struct timespec get_dirty_log_total = (struct timespec){0};
@@ -235,6 +249,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages);
host_num_pages = vm_num_host_pages(mode, guest_num_pages);
pages_per_slot = host_num_pages / p->slots;
+ pages_per_clear = p->clear_chunk_size / getpagesize();
bitmaps = alloc_bitmaps(p->slots, pages_per_slot);
@@ -315,7 +330,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
if (dirty_log_manual_caps) {
clock_gettime(CLOCK_MONOTONIC, &start);
- clear_dirty_log(vm, bitmaps, p->slots, pages_per_slot);
+ clear_dirty_log_in_chunks(vm, bitmaps, p->slots,
+ pages_per_slot,
+ pages_per_clear);
ts_diff = timespec_elapsed(start);
clear_dirty_log_total = timespec_add(clear_dirty_log_total,
ts_diff);
@@ -413,6 +430,11 @@ static void help(char *name)
" To leave the application task unpinned, drop the final entry:\n\n"
" ./dirty_log_perf_test -v 3 -c 22,23,24\n\n"
" (default: no pinning)\n");
+ printf(" -k: Specify the chunk size in which dirty memory gets cleared\n"
+ " in memslots in each iteration. If the size is bigger than\n"
+ " the memslot size then whole memslot is cleared in one call.\n"
+ " Size must be aligned to the host page size. e.g. 10M or 3G\n"
+ " (default: UINT64_MAX, clears whole memslot in one call)\n");
puts("");
exit(0);
}
@@ -428,6 +450,7 @@ int main(int argc, char *argv[])
.slots = 1,
.random_seed = 1,
.write_percent = 100,
+ .clear_chunk_size = UINT64_MAX,
};
int opt;
@@ -438,7 +461,7 @@ int main(int argc, char *argv[])
guest_modes_append_default();
- while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
+ while ((opt = getopt(argc, argv, "ab:c:eghi:k:m:nop:r:s:v:x:w:")) != -1) {
switch (opt) {
case 'a':
p.random_access = true;
@@ -462,6 +485,9 @@ int main(int argc, char *argv[])
case 'i':
p.iterations = atoi_positive("Number of iterations", optarg);
break;
+ case 'k':
+ p.clear_chunk_size = parse_size(optarg);
+ break;
case 'm':
guest_modes_cmdline(optarg);
break;
--
2.41.0.rc0.172.g3f132b7071-goog
Pass the number of read and write accesses done in the memstress guest
code to userspace.
These counts will provide a way to measure vCPUs performance during
memstress and dirty logging related tests. For example, in
dirty_log_perf_test this can be used to measure how much progress vCPUs
are able to do while VMM is getting and clearing dirty logs.
In dirty_log_perf_test, each vCPU runs once and then waits until
iteration value is incremented by main thread, therefore, these access
counts will not provide much useful information except for observing
read vs write counts.
However, in future commits, dirty_log_perf_test behavior will be changed
to allow vCPUs to execute independent of userspace iterations. This will
mimic real world workload where guest keeps on executing while VMM is
collecting and clearing dirty logs separately. With read and write
accesses known for each vCPU, impact of get and clear dirty log APIs can
be quantified.
Note that access counts will not be 100% reliable in knowing vCPUs
performances. Few things which can affect vCPU progress:
1. vCPUs are scheduled less by host
2. Userspace operations run for longer time which end up giving vCPUs
more time to execute.
Signed-off-by: Vipin Sharma <[email protected]>
---
tools/testing/selftests/kvm/lib/memstress.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238..ac53cc6e36d7 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -49,6 +49,8 @@ void memstress_guest_code(uint32_t vcpu_idx)
struct memstress_args *args = &memstress_args;
struct memstress_vcpu_args *vcpu_args = &args->vcpu_args[vcpu_idx];
struct guest_random_state rand_state;
+ uint64_t write_access;
+ uint64_t read_access;
uint64_t gva;
uint64_t pages;
uint64_t addr;
@@ -64,6 +66,8 @@ void memstress_guest_code(uint32_t vcpu_idx)
GUEST_ASSERT(vcpu_args->vcpu_idx == vcpu_idx);
while (true) {
+ write_access = 0;
+ read_access = 0;
for (i = 0; i < pages; i++) {
if (args->random_access)
page = guest_random_u32(&rand_state) % pages;
@@ -72,13 +76,16 @@ void memstress_guest_code(uint32_t vcpu_idx)
addr = gva + (page * args->guest_page_size);
- if (guest_random_u32(&rand_state) % 100 < args->write_percent)
+ if (guest_random_u32(&rand_state) % 100 < args->write_percent) {
*(uint64_t *)addr = 0x0123456789ABCDEF;
- else
+ write_access++;
+ } else {
READ_ONCE(*(uint64_t *)addr);
+ read_access++;
+ }
}
- GUEST_SYNC(1);
+ GUEST_SYNC_ARGS(1, read_access, write_access, 0, 0);
}
}
--
2.41.0.rc0.172.g3f132b7071-goog
Take MMU read lock for clearing dirty logs and use shared page table
walker.
Dirty logs are currently cleared using MMU write locks. This
means vCPUs page faults, which takes MMU read lock, will
be blocked while dirty logs are being cleared. This causes guest
degradation and especially noticeable on VMs with lot of vCPUs.
Taking MMU read lock will allow vCPUs to execute parallelly and reduces
the impact on vCPUs performance.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/mmu.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 356dc4131023..7c966f6f1a41 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -74,8 +74,12 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
if (ret)
break;
- if (resched && next != end)
- cond_resched_rwlock_write(&kvm->mmu_lock);
+ if (resched && next != end) {
+ if (flags & KVM_PGTABLE_WALK_SHARED)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+ }
} while (addr = next, addr != end);
return ret;
@@ -1131,11 +1135,11 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
- write_lock(&kvm->mmu_lock);
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- stage2_wp_range(&kvm->arch.mmu, start, end, 0);
+ read_lock(&kvm->mmu_lock);
+ stage2_wp_range(&kvm->arch.mmu, start, end, KVM_PGTABLE_WALK_SHARED);
+ read_unlock(&kvm->mmu_lock);
+ write_lock(&kvm->mmu_lock);
/*
* Eager-splitting is done when manual-protect is set. We
* also check for initially-all-set because we can avoid
--
2.41.0.rc0.172.g3f132b7071-goog
Pass enum kvm_pgtable_walk_flags{} to kvm_mmu_split_huge_pages().
Use 0 as the flag value to make it no-op.
In future commit kvm_mmu_split_huge_pages() will be used under both MMU
read lock and MMU write lock. Flag allows to pass intent to use shared
or non-shared page walkers to split the huge pages.
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 34d2bd03cf5f..6dd964e3682c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -118,7 +118,8 @@ static bool need_split_memcache_topup_or_resched(struct kvm *kvm)
}
static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
- phys_addr_t end)
+ phys_addr_t end,
+ enum kvm_pgtable_walk_flags flags)
{
struct kvm_mmu_memory_cache *cache;
struct kvm_pgtable *pgt;
@@ -153,7 +154,8 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
return -EINVAL;
next = __stage2_range_addr_end(addr, end, chunk_size);
- ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache, 0);
+ ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache,
+ flags);
if (ret)
break;
} while (addr = next, addr != end);
@@ -1112,7 +1114,7 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
write_lock(&kvm->mmu_lock);
- kvm_mmu_split_huge_pages(kvm, start, end);
+ kvm_mmu_split_huge_pages(kvm, start, end, 0);
write_unlock(&kvm->mmu_lock);
}
@@ -1149,7 +1151,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
* again.
*/
if (kvm_dirty_log_manual_protect_and_init_set(kvm))
- kvm_mmu_split_huge_pages(kvm, start, end);
+ kvm_mmu_split_huge_pages(kvm, start, end, 0);
write_unlock(&kvm->mmu_lock);
}
--
2.41.0.rc0.172.g3f132b7071-goog