This patch series adds support for stage2 hardware DBM, and for now it
is used only for dirty logging.

It works well under several migration test cases, including a VM with 4K
pages or 2M THP. I checked the SHA256 hash digest of all memory and it
stays the same on the source VM and the destination VM, which means no
dirty pages are missed under hardware DBM.
Some key points:
1. Only hardware updates of dirty status for PTEs are supported. PMDs and PUDs
are not involved for now.
2. About *performance*: In the RFC patch, I mentioned that for every 64GB of
memory, KVM consumes about 40ms to scan all PTEs to collect the dirty log.

Initially, I planned to solve this problem using parallel CPUs. However,
I faced two problems.

The first is the memory bandwidth bottleneck. A single thread occupies
about 500GB/s of bandwidth, and we can support about 4 parallel threads
at most, so the ideal speedup ratio is low.

The second is the huge impact on other CPUs. To scan PTs quickly, I use
smp_call_function_many, which is based on IPIs, to dispatch the workload
to other CPUs. Though it can complete the work in time, interrupts are
disabled while scanning the PTs, which has a huge impact on other CPUs.

Now, hardware dirty logging can be dynamically enabled and disabled.
Userspace can enable it before VM migration and disable it when few
dirty pages remain, so VM downtime is not affected (see the sketch after
this list).
3. About correctness: The DBM bit is only set when the PTE is already
writable, so we still have read-only PTEs, and the mechanisms which rely
on read-only PTs are not broken.
4. About PT modification races: There are two kinds of PT modification.

The first is adding or clearing a specific bit, such as AF or RW. All
these operations have been converted to be atomic, to avoid covering the
dirty status set by hardware.

The second is replacement, such as PTE unmapping or change. All these
operations eventually invoke kvm_set_pte. kvm_set_pte has been converted
to be atomic, and we save the dirty status to the underlying bitmap if
the dirty status would otherwise be covered.
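
As a rough illustration of the enable/disable flow described in point 2
(a hypothetical userspace sketch, not code from this series; it assumes
a kernel with the KVM_CAP_ARM_HW_DIRTY_LOG patch applied and its uapi
header installed):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: toggle hw dirty logging on a VM fd via KVM_ENABLE_CAP. */
static int set_hw_dirty_log(int vm_fd, int enable)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ARM_HW_DIRTY_LOG,
		.args[0] = enable ? 1 : 0,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * Typical migration flow (sketch):
 *   set_hw_dirty_log(vm_fd, 1);  - before starting dirty log iteration
 *   ... KVM_GET_DIRTY_LOG / KVM_CLEAR_DIRTY_LOG rounds ...
 *   set_hw_dirty_log(vm_fd, 0);  - when few dirty pages remain
 */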
Keqian Zhu (12):
KVM: arm64: Add some basic functions to support hw DBM
KVM: arm64: Modify stage2 young mechanism to support hw DBM
KVM: arm64: Report hardware dirty status of stage2 PTE if covered
KVM: arm64: Support clear DBM bit for PTEs
KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
KVM: arm64: Set DBM bit of PTEs during write protecting
KVM: arm64: Scan PTEs to sync dirty log
KVM: Omit dirty log sync in log clear if initially all set
KVM: arm64: Stepwise write protect page table by mask bit
KVM: arm64: Save stage2 PTE dirty status if it is covered
KVM: arm64: Support disable hw dirty log after enable
KVM: arm64: Enable stage2 hardware DBM
arch/arm64/include/asm/kvm_host.h | 11 +
arch/arm64/include/asm/kvm_mmu.h | 56 +++-
arch/arm64/include/asm/sysreg.h | 2 +
arch/arm64/kvm/arm.c | 22 +-
arch/arm64/kvm/mmu.c | 411 ++++++++++++++++++++++++++++--
arch/arm64/kvm/reset.c | 14 +-
include/uapi/linux/kvm.h | 1 +
tools/include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 7 +-
9 files changed, 499 insertions(+), 26 deletions(-)
--
2.19.1
Since using arm64 DBM to log dirty pages has the side effect of a long
dirty log sync time, we should give userspace the opportunity to enable
or disable this feature, so it can implement its own policy.
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 7 +++++++
arch/arm64/kvm/arm.c | 10 ++++++++++
arch/arm64/kvm/reset.c | 5 +++++
include/uapi/linux/kvm.h | 1 +
tools/include/uapi/linux/kvm.h | 1 +
5 files changed, 24 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9ea2dcfd609c..2bc3256759e3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -95,6 +95,13 @@ struct kvm_arch {
* supported.
*/
bool return_nisv_io_abort_to_user;
+
+ /*
+ * Use hardware management of dirty status (DBM) to log dirty pages.
+ * Userspace can enable this feature if KVM_CAP_ARM_HW_DIRTY_LOG is
+ * supported.
+ */
+ bool hw_dirty_log;
};
#define KVM_NR_MEM_OBJS 40
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 90cb90561446..850cc5cbc6f0 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -87,6 +87,16 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+ case KVM_CAP_ARM_HW_DIRTY_LOG:
+ if ((cap->args[0] & ~1) || !kvm_hw_dbm_enabled()) {
+ r = -EINVAL;
+ } else {
+ r = 0;
+ kvm->arch.hw_dirty_log = cap->args[0];
+ }
+ break;
+#endif
default:
r = -EINVAL;
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index d3b209023727..52bb801c9b2c 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -83,6 +83,11 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = has_vhe() && system_supports_address_auth() &&
system_supports_generic_auth();
break;
+#ifdef CONFIG_ARM64_HW_AFDBM
+ case KVM_CAP_ARM_HW_DIRTY_LOG:
+ r = kvm_hw_dbm_enabled();
+ break;
+#endif /* CONFIG_ARM64_HW_AFDBM */
default:
r = 0;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..e0b12c43397b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_SECURE_GUEST 181
#define KVM_CAP_HALT_POLL 182
#define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index fdd632c833b4..53908a8881a4 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_VCPU_RESETS 179
#define KVM_CAP_S390_PROTECTED 180
#define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_ARM_HW_DIRTY_LOG 184
#ifdef KVM_CAP_IRQ_ROUTING
--
2.19.1
Marking PTs young (setting the AF bit) should be atomic, to avoid
covering the dirty status set by hardware.
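
To make the race concrete, here is a minimal illustration (not code from
this series) of the lost update that a plain read-modify-write of the
PTE would allow once hardware can change the dirty state concurrently:

/*
 * CPU (KVM)                                  Hardware (DBM)
 *
 * old = READ_ONCE(pte_val(*ptep));
 *                                            guest writes the page; hw
 *                                            sets the S2AP write bit,
 *                                            i.e. marks the PTE dirty
 * WRITE_ONCE(pte_val(*ptep), old | PTE_AF);  <- hw's dirty update is lost
 *
 * The cmpxchg loop in kvm_set_s2pte_young() re-reads and retries when
 * the PTE has changed underneath it, so the hardware-set dirty state
 * is preserved.
 */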
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 32 ++++++++++++++++++++++----------
arch/arm64/kvm/mmu.c | 15 ++++++++-------
2 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e0ee6e23d626..51af71505fbc 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -215,6 +215,18 @@ static inline void kvm_set_s2pte_readonly(pte_t *ptep)
} while (pteval != old_pteval);
}
+static inline void kvm_set_s2pte_young(pte_t *ptep)
+{
+ pteval_t old_pteval, pteval;
+
+ pteval = READ_ONCE(pte_val(*ptep));
+ do {
+ old_pteval = pteval;
+ pteval |= PTE_AF;
+ pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+ } while (pteval != old_pteval);
+}
+
static inline bool kvm_s2pte_readonly(pte_t *ptep)
{
return (READ_ONCE(pte_val(*ptep)) & PTE_S2_RDWR) == PTE_S2_RDONLY;
@@ -230,6 +242,11 @@ static inline void kvm_set_s2pmd_readonly(pmd_t *pmdp)
kvm_set_s2pte_readonly((pte_t *)pmdp);
}
+static inline void kvm_set_s2pmd_young(pmd_t *pmdp)
+{
+ kvm_set_s2pte_young((pte_t *)pmdp);
+}
+
static inline bool kvm_s2pmd_readonly(pmd_t *pmdp)
{
return kvm_s2pte_readonly((pte_t *)pmdp);
@@ -245,6 +262,11 @@ static inline void kvm_set_s2pud_readonly(pud_t *pudp)
kvm_set_s2pte_readonly((pte_t *)pudp);
}
+static inline void kvm_set_s2pud_young(pud_t *pudp)
+{
+ kvm_set_s2pte_young((pte_t *)pudp);
+}
+
static inline bool kvm_s2pud_readonly(pud_t *pudp)
{
return kvm_s2pte_readonly((pte_t *)pudp);
@@ -255,16 +277,6 @@ static inline bool kvm_s2pud_exec(pud_t *pudp)
return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
}
-static inline pud_t kvm_s2pud_mkyoung(pud_t pud)
-{
- return pud_mkyoung(pud);
-}
-
-static inline bool kvm_s2pud_young(pud_t pud)
-{
- return pud_young(pud);
-}
-
#ifdef CONFIG_ARM64_HW_AFDBM
static inline bool kvm_hw_dbm_enabled(void)
{
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8c0035cab6b6..5ad87bce23c0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2008,8 +2008,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* Resolve the access fault by making the page young again.
* Note that because the faulting entry is guaranteed not to be
* cached in the TLB, we don't need to invalidate anything.
- * Only the HW Access Flag updates are supported for Stage 2 (no DBM),
- * so there is no need for atomic (pte|pmd)_mkyoung operations.
+ *
+ * Note: Both DBM and HW AF updates are supported for Stage2, so
+ * young operations should be atomic.
*/
static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
{
@@ -2027,15 +2028,15 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
goto out;
if (pud) { /* HugeTLB */
- *pud = kvm_s2pud_mkyoung(*pud);
+ kvm_set_s2pud_young(pud);
pfn = kvm_pud_pfn(*pud);
pfn_valid = true;
} else if (pmd) { /* THP, HugeTLB */
- *pmd = pmd_mkyoung(*pmd);
+ kvm_set_s2pmd_young(pmd);
pfn = pmd_pfn(*pmd);
pfn_valid = true;
- } else {
- *pte = pte_mkyoung(*pte); /* Just a page... */
+ } else { /* Just a page... */
+ kvm_set_s2pte_young(pte);
pfn = pte_pfn(*pte);
pfn_valid = true;
}
@@ -2280,7 +2281,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
return 0;
if (pud)
- return kvm_s2pud_young(*pud);
+ return pud_young(*pud);
else if (pmd)
return pmd_young(*pmd);
else
--
2.19.1
Synchronizing the dirty log during log clear is useful only when the
userspace dirty bitmap contains dirty bits that the memslot dirty bitmap
does not, because then we can sync the new dirty bits into the memslot
dirty bitmap, clear them along the way, and avoid reporting them to
userspace later.

With the dirty bitmap "initially all set" feature, this situation does
not arise if the userspace logic is correct, so we can omit the dirty
log sync during log clear. This is valuable when dirty log sync is a
high-cost operation, such as with arm64 DBM.
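
For reference, the check added below builds on the "manual protect" and
"initially all set" capability bits; as far as I understand the generic
code, the helper it uses boils down to something like this (a sketch for
explanation only, not code added by this patch):

/* roughly what kvm_dirty_log_manual_protect_and_init_set() tests */
static bool manual_protect_and_init_set(struct kvm *kvm)
{
	return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
}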
Signed-off-by: Keqian Zhu <[email protected]>
---
virt/kvm/kvm_main.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3722343fd460..6c147d6f9da6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1554,7 +1554,8 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
(log->num_pages < memslot->npages - log->first_page && (log->num_pages & 63)))
return -EINVAL;
- kvm_arch_sync_dirty_log(kvm, memslot);
+ if (!kvm_dirty_log_manual_protect_and_init_set(kvm))
+ kvm_arch_sync_dirty_log(kvm, memslot);
flush = false;
dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
--
2.19.1
With hardware management of dirty state, the dirty state is stored in
the PTEs. We have to scan all PTEs to sync the dirty log to the memslot
dirty bitmap.
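
For context, this walk is driven from the generic KVM_GET_DIRTY_LOG path
through kvm_arch_sync_dirty_log(). A minimal sketch of the userspace side
(hypothetical helper, assuming the caller has allocated a bitmap covering
the memslot):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Sketch: fetch the dirty bitmap of one memslot. With hw dirty log
 * enabled, KVM scans the stage2 PTEs to sync the hardware dirty state
 * into the memslot bitmap before copying it out.
 */
static int get_dirty_log(int vm_fd, __u32 slot, void *bitmap)
{
	struct kvm_dirty_log log = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};

	return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}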
Signed-off-by: Keqian Zhu <[email protected]>
Signed-off-by: Peng Liang <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 6 +-
arch/arm64/kvm/mmu.c | 162 ++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 4 +-
4 files changed, 172 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2bc3256759e3..910ec33afea8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -489,6 +489,8 @@ void force_vm_exit(const cpumask_t *mask);
void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
void kvm_mmu_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *memslot);
void kvm_mmu_clear_dbm_all(struct kvm *kvm);
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm);
int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
int exception_index);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 850cc5cbc6f0..92f0b40a30fa 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1209,7 +1209,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
-
+#ifdef CONFIG_ARM64_HW_AFDBM
+ if (kvm->arch.hw_dirty_log) {
+ kvm_mmu_sync_dirty_log(kvm, memslot);
+ }
+#endif
}
void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 742c7943176f..3aa0303d83f0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2600,6 +2600,168 @@ void kvm_mmu_clear_dbm_all(struct kvm *kvm)
kvm_mmu_clear_dbm(kvm, memslot);
}
}
+
+/**
+ * stage2_sync_dirty_log_ptes() - synchronize dirty log from PMD range
+ * @kvm: The KVM pointer
+ * @pmd: pointer to pmd entry
+ * @addr: range start address
+ * @end: range end address
+ */
+static void stage2_sync_dirty_log_ptes(struct kvm *kvm, pmd_t *pmd,
+ phys_addr_t addr, phys_addr_t end)
+{
+ pte_t *pte;
+
+ pte = pte_offset_kernel(pmd, addr);
+ do {
+ if (!pte_none(*pte) && !kvm_s2pte_readonly(pte))
+ mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_pmds() - synchronize dirty log from PUD range
+ * @kvm: The KVM pointer
+ * @pud: pointer to pud entry
+ * @addr: range start address
+ * @end: range end address
+ */
+static void stage2_sync_dirty_log_pmds(struct kvm *kvm, pud_t *pud,
+ phys_addr_t addr, phys_addr_t end)
+{
+ pmd_t *pmd;
+ phys_addr_t next;
+
+ pmd = stage2_pmd_offset(kvm, pud, addr);
+ do {
+ next = stage2_pmd_addr_end(kvm, addr, end);
+ if (!pmd_none(*pmd) && !pmd_thp_or_huge(*pmd))
+ stage2_sync_dirty_log_ptes(kvm, pmd, addr, next);
+ } while (pmd++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_puds() - synchronize dirty log from P4D range
+ * @kvm: The KVM pointer
+ * @p4d: pointer to p4d entry
+ * @addr: range start address
+ * @end: range end address
+ */
+static void stage2_sync_dirty_log_puds(struct kvm *kvm, p4d_t *p4d,
+ phys_addr_t addr, phys_addr_t end)
+{
+ pud_t *pud;
+ phys_addr_t next;
+
+ pud = stage2_pud_offset(kvm, p4d, addr);
+ do {
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud) && !stage2_pud_huge(kvm, *pud))
+ stage2_sync_dirty_log_pmds(kvm, pud, addr, next);
+ } while (pud++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_p4ds() - synchronize dirty log from PGD range
+ * @kvm: The KVM pointer
+ * @pgd: pointer to pgd entry
+ * @addr: range start address
+ * @end: range end address
+ */
+static void stage2_sync_dirty_log_p4ds(struct kvm *kvm, pgd_t *pgd,
+ phys_addr_t addr, phys_addr_t end)
+{
+ p4d_t *p4d;
+ phys_addr_t next;
+
+ p4d = stage2_p4d_offset(kvm, pgd, addr);
+ do {
+ next = stage2_p4d_addr_end(kvm, addr, end);
+ if (!stage2_p4d_none(kvm, *p4d))
+ stage2_sync_dirty_log_puds(kvm, p4d, addr, next);
+ } while (p4d++, addr = next, addr != end);
+}
+
+/**
+ * stage2_sync_dirty_log_range() - synchronize dirty log from stage2 memory
+ * region range
+ * @kvm: The KVM pointer
+ * @addr: Start address of range
+ * @end: End address of range
+ */
+static void stage2_sync_dirty_log_range(struct kvm *kvm, phys_addr_t addr,
+ phys_addr_t end)
+{
+ pgd_t *pgd;
+ phys_addr_t next;
+
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+ do {
+ cond_resched_lock(&kvm->mmu_lock);
+ if (!READ_ONCE(kvm->arch.pgd))
+ break;
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (stage2_pgd_present(kvm, *pgd))
+ stage2_sync_dirty_log_p4ds(kvm, pgd, addr, next);
+ } while (pgd++, addr = next, addr != end);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log() - synchronize dirty log from stage2 PTEs for
+ * memory slot
+ * @kvm: The KVM pointer
+ * @slot: The memory slot to synchronize dirty log
+ *
+ * Called to synchronize the dirty log (as marked by hw) when the
+ * KVM_GET_DIRTY_LOG operation is invoked on a memory region. After it
+ * returns, all dirty log information is collected into the memslot
+ * dirty_bitmap (hw may modify page tables while this runs, so this is
+ * strictly true only when the guest is stopped, but no dirty log is
+ * missed in the end). Afterwards dirty_bitmap can be copied to userspace.
+ *
+ * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+ phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
+ phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+ int idx;
+
+ if (WARN_ON_ONCE(!memslot->dirty_bitmap))
+ return;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ spin_lock(&kvm->mmu_lock);
+
+ stage2_sync_dirty_log_range(kvm, start, end);
+
+ spin_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+/**
+ * kvm_mmu_sync_dirty_log_all() - synchronize dirty log from PTEs for whole VM
+ * @kvm: The KVM pointer
+ *
+ * Called with kvm->slots_lock mutex acquired
+ */
+void kvm_mmu_sync_dirty_log_all(struct kvm *kvm)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslots = slots->memslots;
+ struct kvm_memory_slot *memslot;
+ int slot;
+
+ if (unlikely(!slots->used_slots))
+ return;
+
+ for (slot = 0; slot < slots->used_slots; slot++) {
+ memslot = &memslots[slot];
+ kvm_mmu_sync_dirty_log(kvm, memslot);
+ }
+}
#endif /* CONFIG_ARM64_HW_AFDBM */
void kvm_arch_commit_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a852af5c3214..3722343fd460 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2581,7 +2581,9 @@ static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
- set_bit_le(rel_gfn, memslot->dirty_bitmap);
+ /* Speed up if this bit has already been set */
+ if (!test_bit_le(rel_gfn, memslot->dirty_bitmap))
+ set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
--
2.19.1
When write protecting PTEs, if hardware dirty logging is enabled, set
the DBM bit only on PTEs that are *already writable*. This ensures that
mechanisms relying on write faults, such as CoW, are not broken.
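
The reason this is safe: with the DBM bit set, the CPU turns the first
write to such a write-protected stage2 entry into a hardware update of
the S2AP write permission bit instead of a permission fault, so for a
DBM-managed PTE "writable" effectively means "dirty" and "read-only"
means "clean". A hypothetical helper (not part of this series) expressing
that reading of the bits, using the accessors added earlier:

#ifdef CONFIG_ARM64_HW_AFDBM
/* hypothetical: a DBM-managed PTE is dirty iff hw has made it writable */
static inline bool stage2_pte_hw_dirty(pte_t *ptep)
{
	return kvm_s2pte_dbm(ptep) && !kvm_s2pte_readonly(ptep);
}
#endif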
Signed-off-by: Keqian Zhu <[email protected]>
Signed-off-by: Peng Liang <[email protected]>
---
arch/arm64/kvm/mmu.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f08b0fbca0a0..742c7943176f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1536,19 +1536,24 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
/**
* stage2_wp_ptes - write protect PMD range
+ * @kvm: kvm instance for the VM
* @pmd: pointer to pmd entry
* @addr: range start address
* @end: range end address
*/
-static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_ptes(struct kvm *kvm, pmd_t *pmd,
+ phys_addr_t addr, phys_addr_t end)
{
pte_t *pte;
pte = pte_offset_kernel(pmd, addr);
do {
- if (!pte_none(*pte)) {
- if (!kvm_s2pte_readonly(pte))
- kvm_set_s2pte_readonly(pte);
+ if (!pte_none(*pte) && !kvm_s2pte_readonly(pte)) {
+#ifdef CONFIG_ARM64_HW_AFDBM
+ if (kvm->arch.hw_dirty_log && !kvm_s2pte_dbm(pte))
+ kvm_set_s2pte_dbm(pte);
+#endif
+ kvm_set_s2pte_readonly(pte);
}
} while (pte++, addr += PAGE_SIZE, addr != end);
}
@@ -1575,7 +1580,7 @@ static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
if (!kvm_s2pmd_readonly(pmd))
kvm_set_s2pmd_readonly(pmd);
} else {
- stage2_wp_ptes(pmd, addr, next);
+ stage2_wp_ptes(kvm, pmd, addr, next);
}
}
} while (pmd++, addr = next, addr != end);
--
2.19.1
During dirty log clear, page table entries are write protected according
to a mask. Previously we write protected all entries covered by the mask,
from its ffs bit to its fls bit. Though there may be zero bits within
that range, we are holding the kvm mmu lock, so we will not write protect
entries that we do not want to.

We are about to add support for hardware management of dirty state on
arm64, where holding the kvm mmu lock will not be enough. We should write
protect entries bit by bit according to the mask: for example, for mask
0xA1 only the entries at bit offsets 0, 5 and 7 are write protected,
instead of the whole range [0, 7].
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/kvm/mmu.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3aa0303d83f0..898e272a2c07 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1710,10 +1710,16 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
gfn_t gfn_offset, unsigned long mask)
{
phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
- phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
- phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+ phys_addr_t start, end;
+ u32 i;
- stage2_wp_range(kvm, start, end);
+ for (i = __ffs(mask); i <= __fls(mask); i++) {
+ if (test_bit_le(i, &mask)) {
+ start = (base_gfn + i) << PAGE_SHIFT;
+ end = (base_gfn + i + 1) << PAGE_SHIFT;
+ stage2_wp_range(kvm, start, end);
+ }
+ }
}
/*
--
2.19.1
Prepare some basic functions to support hardware DBM for PTEs.
Signed-off-by: Keqian Zhu <[email protected]>
Signed-off-by: Peng Liang <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 36 ++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b12bfc1f051a..e0ee6e23d626 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -265,6 +265,42 @@ static inline bool kvm_s2pud_young(pud_t pud)
return pud_young(pud);
}
+#ifdef CONFIG_ARM64_HW_AFDBM
+static inline bool kvm_hw_dbm_enabled(void)
+{
+ return !!(read_sysreg(vtcr_el2) & VTCR_EL2_HD);
+}
+
+static inline void kvm_set_s2pte_dbm(pte_t *ptep)
+{
+ pteval_t old_pteval, pteval;
+
+ pteval = READ_ONCE(pte_val(*ptep));
+ do {
+ old_pteval = pteval;
+ pteval |= PTE_DBM;
+ pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+ } while (pteval != old_pteval);
+}
+
+static inline void kvm_clear_s2pte_dbm(pte_t *ptep)
+{
+ pteval_t old_pteval, pteval;
+
+ pteval = READ_ONCE(pte_val(*ptep));
+ do {
+ old_pteval = pteval;
+ pteval &= ~PTE_DBM;
+ pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
+ } while (pteval != old_pteval);
+}
+
+static inline bool kvm_s2pte_dbm(pte_t *ptep)
+{
+ return !!(READ_ONCE(pte_val(*ptep)) & PTE_DBM);
+}
+#endif /* CONFIG_ARM64_HW_AFDBM */
+
#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
#ifdef __PAGETABLE_PMD_FOLDED
--
2.19.1
Hi,
On 2020/6/16 17:35, Keqian Zhu wrote:
> This patch series adds support for stage2 hardware DBM, and for now it
> is used only for dirty logging.
>
> It works well under several migration test cases, including a VM with
> 4K pages or 2M THP. I checked the SHA256 hash digest of all memory and
> it stays the same on the source VM and the destination VM, which means
> no dirty pages are missed under hardware DBM.
>
> Some key points:
>
> 1. Only hardware updates of dirty status for PTEs are supported. PMDs and PUDs
> are not involved for now.
>
> 2. About *performance*: In the RFC patch, I mentioned that for every 64GB of
> memory, KVM consumes about 40ms to scan all PTEs to collect the dirty log.
>
> Initially, I planned to solve this problem using parallel CPUs. However,
> I faced two problems.
>
> The first is the memory bandwidth bottleneck. A single thread occupies
> about 500GB/s of bandwidth, and we can support about 4 parallel threads
> at most, so the ideal speedup ratio is low.
Aha, I got this wrong. I tested it again and found that the speedup ratio
can be about 23x when I use 32 CPUs to scan the PTs (it takes about 5ms
to scan the PTs of 200GB of RAM).
>
> The second is the huge impact on other CPUs. To scan PTs quickly, I use
> smp_call_function_many, which is based on IPIs, to dispatch the workload
> to other CPUs. Though it can complete the work in time, interrupts are
> disabled while scanning the PTs, which has a huge impact on other CPUs.
I think we can divide the scanning workload into smaller chunks, so that
interrupts can be re-enabled periodically (see the rough sketch below).
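
A rough sketch of that idea (hypothetical, reusing the
stage2_sync_dirty_log_range() walker and locking from this series; the
per-CPU dispatch, e.g. via smp_call_function_many, would then hand out
one chunk per call so interrupts get re-enabled between chunks):

/* hypothetical: scan the range in 1GB chunks instead of in one go */
static void sync_dirty_log_chunked(struct kvm *kvm, phys_addr_t start,
				   phys_addr_t end)
{
	phys_addr_t next;

	while (start < end) {
		next = min_t(phys_addr_t, start + SZ_1G, end);

		spin_lock(&kvm->mmu_lock);
		if (READ_ONCE(kvm->arch.pgd))
			stage2_sync_dirty_log_range(kvm, start, next);
		spin_unlock(&kvm->mmu_lock);

		cond_resched();
		start = next;
	}
}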
>
> Now, hardware dirty logging can be dynamically enabled and disabled.
> Userspace can enable it before VM migration and disable it when few
> dirty pages remain, so VM downtime is not affected.
BTW, we can still keep this interface for userspace in case CPU computing
resources are not sufficient.
Thanks,
Keqian
>
>
> 3. About correctness: The DBM bit is only set when the PTE is already
> writable, so we still have read-only PTEs, and the mechanisms which rely
> on read-only PTs are not broken.
>
> 4. About PT modification races: There are two kinds of PT modification.
>
> The first is adding or clearing a specific bit, such as AF or RW. All
> these operations have been converted to be atomic, to avoid covering the
> dirty status set by hardware.
>
> The second is replacement, such as PTE unmapping or change. All these
> operations eventually invoke kvm_set_pte. kvm_set_pte has been converted
> to be atomic, and we save the dirty status to the underlying bitmap if
> the dirty status would otherwise be covered.
>
>
> Keqian Zhu (12):
> KVM: arm64: Add some basic functions to support hw DBM
> KVM: arm64: Modify stage2 young mechanism to support hw DBM
> KVM: arm64: Report hardware dirty status of stage2 PTE if covered
> KVM: arm64: Support clear DBM bit for PTEs
> KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
> KVM: arm64: Set DBM bit of PTEs during write protecting
> KVM: arm64: Scan PTEs to sync dirty log
> KVM: Omit dirty log sync in log clear if initially all set
> KVM: arm64: Stepwise write protect page table by mask bit
> KVM: arm64: Save stage2 PTE dirty status if it is covered
> KVM: arm64: Support disable hw dirty log after enable
> KVM: arm64: Enable stage2 hardware DBM
>
> arch/arm64/include/asm/kvm_host.h | 11 +
> arch/arm64/include/asm/kvm_mmu.h | 56 +++-
> arch/arm64/include/asm/sysreg.h | 2 +
> arch/arm64/kvm/arm.c | 22 +-
> arch/arm64/kvm/mmu.c | 411 ++++++++++++++++++++++++++++--
> arch/arm64/kvm/reset.c | 14 +-
> include/uapi/linux/kvm.h | 1 +
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 7 +-
> 9 files changed, 499 insertions(+), 26 deletions(-)
>