The intention:
On the arm64 platform, we track the dirty log of vCPUs through guest memory aborts: KVM takes some vCPU time from the guest to change the stage-2 mapping and mark pages dirty. This has a heavy side effect on the VM, especially when multiple vCPUs race and some of them block on the kvm mmu_lock.

DBM is a hardware-assisted approach to dirty logging: the MMU changes a PTE to be writable if its DBM bit is set, so KVM does not need to take vCPU time to log dirty pages.
About this patch series:
The biggest problem with applying DBM to stage 2 is that software must scan the PTs to collect the dirty state, which can take a long time and hurt the downtime of migration.

This series realizes a SW/HW combined dirty log that effectively solves this problem (the SMMU side can also use this approach for DMA dirty log tracking).
The core idea is that we do not enable hardware dirty logging at the start (i.e. we do not set the DBM bit). When an arbitrary PT takes a fault, we apply software tracking to that PT and enable hardware tracking for its *nearby* PTs (e.g. set the DBM bit for the nearby 16 PTs). Then, when syncing the dirty log, we already know which PTs have hardware dirty logging enabled, so we do not need to scan all PTs (see the diagram, and the fault-path sketch after it).
 mem abort point                  mem abort point
      ↓                                ↓
 ---------------------------------------------------------------
 |********|          |          |********|          |          |
 ---------------------------------------------------------------
      ↑                                ↑
 set DBM bit of                   set DBM bit of
 this PT section (64PTEs)         this PT section (64PTEs)
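
A rough sketch of the fault-path logic, for illustration only (handle_dirty_log_fault() is a hypothetical wrapper; the real entry points are added in the later patches, and the section size follows the diagram above):

	static void handle_dirty_log_fault(struct kvm *kvm,
					   struct kvm_pgtable *pgt,
					   u64 fault_ipa)
	{
		/* One PT section covers 64 last-level PTEs. */
		u64 section = ALIGN_DOWN(fault_ipa, 64 * PAGE_SIZE);

		/* Software-track the faulting page itself... */
		mark_page_dirty(kvm, fault_ipa >> PAGE_SHIFT);

		/* ...and arm hardware tracking for the surrounding PT
		 * section, so nearby writes no longer fault. */
		kvm_pgtable_stage2_set_dbm(pgt, section, 64 * PAGE_SIZE);
	}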
One may worry that when the dirty rate is very high, we still need to scan too many PTs. Our main concern is the VM stop time. With QEMU's dirty rate throttling, the amount of dirty memory converges to the VM stop threshold, so there are only a few PTs left to scan after the VM stops. (For scale: a 16GB VM with 4KiB pages has about 4M last-level PTEs, so avoiding a full scan matters.)
This approach has the advantage of hardware tracking, which minimizes the side effect on vCPUs, and also the advantage of software tracking, which lets us control the vCPU dirty rate. Moreover, software tracking lets us scan the PTs at a few fixed points, which greatly reduces scanning time. And the biggest benefit is that we can apply this solution to DMA dirty tracking.
Test:
Host: Kunpeng 920 with 128 CPUs and 512GB RAM. Transparent Hugepage disabled (to ensure the result is not affected by the dissolving of block page tables at the early stage of migration).
VM: 16 CPUs, 16GB RAM, running 4 pairs of (redis_benchmark + redis_server).

Each configuration was run 5 times, for both the software dirty log and the SW/HW combined dirty log.
Test result:
Gain of 5%~7% in redis QPS during VM migration.
VM downtime is not fundamentally affected.
About 56.7% of the set DBM bits are effectively used.
Keqian Zhu (7):
arm64: cpufeature: Add API to report system support of HWDBM
kvm: arm64: Use atomic operation when update PTE
kvm: arm64: Add level_apply parameter for stage2_attr_walker
kvm: arm64: Add some HW_DBM related pgtable interfaces
kvm: arm64: Add some HW_DBM related mmu interfaces
kvm: arm64: Only write protect selected PTE
kvm: arm64: Start up SW/HW combined dirty log
arch/arm64/include/asm/cpufeature.h | 12 +++
arch/arm64/include/asm/kvm_host.h | 6 ++
arch/arm64/include/asm/kvm_mmu.h | 7 ++
arch/arm64/include/asm/kvm_pgtable.h | 45 ++++++++++
arch/arm64/kvm/arm.c | 125 ++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 130 ++++++++++++++++++++++-----
arch/arm64/kvm/mmu.c | 47 +++++++++-
arch/arm64/kvm/reset.c | 8 +-
8 files changed, 351 insertions(+), 29 deletions(-)
--
2.19.1
Though we already have a CPU capability named ARM64_HW_DBM, it is a
LOCAL_CPU cap and is conditionally compiled under CONFIG_ARM64_HW_AFDBM.
This adds an API that reports system-wide support of HW_DBM.
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 9a555809b89c..dfded86c7684 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -664,6 +664,18 @@ static inline bool system_supports_mixed_endian(void)
return val == 0x1;
}
+static inline bool system_supports_hw_dbm(void)
+{
+ u64 mmfr1;
+ u32 val;
+
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ val = cpuid_feature_extract_unsigned_field(mmfr1,
+ ID_AA64MMFR1_HADBS_SHIFT);
+
+ return val == 0x2;
+}
+
static __always_inline bool system_supports_fpsimd(void)
{
return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
--
2.19.1
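
For illustration, a caller could gate the combined dirty log on this helper roughly as follows (a sketch only; kvm_enable_hw_dirty_log() is a hypothetical function, not part of this series):

	static void kvm_setup_dirty_log(struct kvm *kvm)
	{
		/* Use the hardware path only if every CPU in the
		 * system implements HW DBM; otherwise stay on pure
		 * software dirty logging. */
		if (!system_supports_hw_dbm())
			return;

		kvm_enable_hw_dirty_log(kvm);	/* hypothetical helper */
	}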
We are about to add HW_DBM support for the stage-2 dirty log, so software
updates of a PTE may race with the MMU trying to set the access flag or
dirty state.

Use atomic operations to avoid reverting these bits after the MMU has set them.
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 41 ++++++++++++++++++++++++------------
1 file changed, 27 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..4915ba35f93b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -153,10 +153,34 @@ static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
return __va(kvm_pte_to_phys(pte));
}
+/*
+ * We may race with the MMU trying to set the access flag or dirty state;
+ * use atomic operations to avoid reverting these bits.
+ *
+ * Return the original PTE.
+ */
+static kvm_pte_t kvm_update_pte(kvm_pte_t *ptep, kvm_pte_t bit_set,
+ kvm_pte_t bit_clr)
+{
+ kvm_pte_t old_pte, pte = *ptep;
+
+ do {
+ old_pte = pte;
+ pte &= ~bit_clr;
+ pte |= bit_set;
+
+ if (old_pte == pte)
+ break;
+
+ pte = cmpxchg_relaxed(ptep, old_pte, pte);
+ } while (pte != old_pte);
+
+ return old_pte;
+}
+
static void kvm_set_invalid_pte(kvm_pte_t *ptep)
{
- kvm_pte_t pte = *ptep;
- WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
+ kvm_update_pte(ptep, 0, KVM_PTE_VALID);
}
static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
@@ -723,18 +747,7 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
return 0;
data->level = level;
- data->pte = pte;
- pte &= ~data->attr_clr;
- pte |= data->attr_set;
-
- /*
- * We may race with the CPU trying to set the access flag here,
- * but worst-case the access flag update gets lost and will be
- * set on the next access instead.
- */
- if (data->pte != pte)
- WRITE_ONCE(*ptep, pte);
-
+ data->pte = kvm_update_pte(ptep, data->attr_set, data->attr_clr);
return 0;
}
--
2.19.1
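
To make the race concrete, consider this interleaving under HW DBM (an illustrative trace, not code from the patch):

	/*
	 *   Software (old walker)                MMU (HW DBM)
	 *   ---------------------                ------------
	 *   pte = *ptep;
	 *                                        guest write: HW sets
	 *                                        S2AP_W (page now dirty)
	 *   WRITE_ONCE(*ptep, pte | attr_set);   <- dirty state reverted
	 *
	 * kvm_update_pte() retries with cmpxchg_relaxed() until the
	 * stored value matches its snapshot, so a concurrent hardware
	 * update is preserved rather than overwritten.
	 */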
In order to change PTEs of only some specific levels, the level_apply
parameter can be used as a level mask.
This is no functional change for the current code: all existing callers
pass -1, i.e. all levels.
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4915ba35f93b..0f8a319f16fe 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -734,6 +734,7 @@ struct stage2_attr_data {
kvm_pte_t attr_clr;
kvm_pte_t pte;
u32 level;
+ u32 level_apply;
};
static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
@@ -743,6 +744,9 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
kvm_pte_t pte = *ptep;
struct stage2_attr_data *data = arg;
+ if (!(data->level_apply & BIT(level)))
+ return 0;
+
if (!kvm_pte_valid(pte))
return 0;
@@ -753,14 +757,15 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
u64 size, kvm_pte_t attr_set,
- kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
- u32 *level)
+ kvm_pte_t attr_clr, u32 level_apply,
+ kvm_pte_t *orig_pte, u32 *level)
{
int ret;
kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
struct stage2_attr_data data = {
.attr_set = attr_set & attr_mask,
.attr_clr = attr_clr & attr_mask,
+ .level_apply = level_apply,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_attr_walker,
@@ -783,7 +788,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
return stage2_update_leaf_attrs(pgt, addr, size, 0,
- KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+ KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, -1,
NULL, NULL);
}
@@ -791,7 +796,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
{
kvm_pte_t pte = 0;
stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
- &pte, NULL);
+ -1, &pte, NULL);
dsb(ishst);
return pte;
}
@@ -800,7 +805,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
{
kvm_pte_t pte = 0;
stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF,
- &pte, NULL);
+ -1, &pte, NULL);
/*
* "But where's the TLBI?!", you scream.
* "Over in the core code", I sigh.
@@ -813,7 +818,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr)
bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr)
{
kvm_pte_t pte = 0;
- stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL);
+ stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, -1, &pte, NULL);
return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF;
}
@@ -833,7 +838,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
if (prot & KVM_PGTABLE_PROT_X)
clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
- ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level);
+ ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, -1, NULL, &level);
if (!ret)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
return ret;
--
2.19.1
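
With the mask in place, a later patch can restrict an attribute update to last-level entries only, e.g. (a sketch; KVM_PTE_LEAF_ATTR_HI_S2_DBM is only introduced by the next patch, and BIT(3) assumes the last level is level 3):

	/* Set the DBM bit on last-level (level 3) entries only,
	 * leaving higher-level block mappings untouched. */
	stage2_update_leaf_attrs(pgt, addr, size,
				 KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0,
				 BIT(3), NULL, NULL);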
This adds set_dbm, clear_dbm and sync_dirty interfaces at the pgtable
layer. (1) set_dbm: set the DBM bit for last-level PTEs of a specified
range; TLB invalidation is performed internally. (2) clear_dbm: clear
the DBM bit for last-level PTEs of a specified range; TLB invalidation
is left to the caller. (3) sync_dirty: scan last-level PTEs of a
specified range and log a page as dirty if its PTE is writable.
Besides, save the dirty state of a PTE if it is invalidated by map or
unmap.
Signed-off-by: Keqian Zhu <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 45 ++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 70 ++++++++++++++++++++++++++++
2 files changed, 115 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 52ab38db04c7..8984b5227cfc 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -204,6 +204,51 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
*/
int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
+/**
+ * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address range
+ * without TLB invalidation (only last level).
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to clear DBM.
+ * @size: Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Note that it is the caller's responsibility to invalidate the TLB after
+ * calling this function to ensure that the disabled HW dirty state is
+ * visible to the CPUs.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address range to
+ * enable HW dirty (only last level).
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr: Intermediate physical address from which to set DBM.
+ * @size: Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr: Intermediate physical address from which to sync.
+ * @size: Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
/**
* kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0f8a319f16fe..b6f0d2f3aee4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -43,6 +43,7 @@
#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBM BIT(51)
#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
struct kvm_pgtable_walk_data {
@@ -485,6 +486,11 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
return 0;
}
+static bool stage2_pte_writable(kvm_pte_t pte)
+{
+ return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+}
+
static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
struct stage2_map_data *data)
@@ -509,6 +515,11 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
/* There's an existing valid leaf entry, so perform break-before-make */
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+
+ /* Save the possible hardware dirty info */
+ if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(*ptep))
+ mark_page_dirty(data->mmu->kvm, addr >> PAGE_SHIFT);
+
kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
out:
data->phys += granule;
@@ -547,6 +558,10 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (kvm_pte_valid(pte))
put_page(page);
+ /*
+ * HW DBM is not working during page merging, so we don't
+ * need to save possible hardware dirty info here.
+ */
return 0;
}
@@ -707,6 +722,10 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
put_page(virt_to_page(ptep));
+ /* Save the possible hardware dirty info */
+ if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(*ptep))
+ mark_page_dirty(mmu->kvm, addr >> PAGE_SHIFT);
+
if (need_flush) {
stage2_flush_dcache(kvm_pte_follow(pte),
kvm_granule_size(level));
@@ -792,6 +811,30 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
NULL, NULL);
}
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+ int ret;
+ u64 offset;
+
+ ret = stage2_update_leaf_attrs(pgt, addr, size,
+ KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0, BIT(3),
+ NULL, NULL);
+	if (ret)
+		return ret;
+
+	for (offset = 0; offset < size; offset += PAGE_SIZE)
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr + offset, 3);
+
+ return 0;
+}
+
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+ return stage2_update_leaf_attrs(pgt, addr, size,
+ 0, KVM_PTE_LEAF_ATTR_HI_S2_DBM, BIT(3),
+ NULL, NULL);
+}
+
kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
{
kvm_pte_t pte = 0;
@@ -844,6 +887,33 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
return ret;
}
+static int stage2_sync_dirty_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+ enum kvm_pgtable_walk_flags flag,
+ void * const arg)
+{
+ kvm_pte_t pte = *ptep;
+ struct kvm *kvm = arg;
+
+ if (!kvm_pte_valid(pte))
+ return 0;
+
+ if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(pte))
+ mark_page_dirty(kvm, addr >> PAGE_SHIFT);
+
+ return 0;
+}
+
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_sync_dirty_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = pgt->mmu->kvm,
+ };
+
+ return kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
--
2.19.1
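
Putting the three interfaces together, a dirty log sync pass might look roughly like this (a sketch of the intended flow; sync_hw_dirty_log() is a hypothetical wrapper, not code from this series):

	static void sync_hw_dirty_log(struct kvm_pgtable *pgt,
				      struct kvm_s2_mmu *mmu,
				      u64 addr, u64 size)
	{
		/* Stop further hardware dirtying of this range... */
		kvm_pgtable_stage2_clear_dbm(pgt, addr, size);

		/* ...and, since clear_dbm leaves TLBI to the caller,
		 * flush so stale writable TLB entries disappear. */
		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);

		/* Harvest: sync_dirty logs any still-writable
		 * last-level PTE in the range as dirty. */
		kvm_pgtable_stage2_sync_dirty(pgt, addr, size);
	}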
Hi Marc,
Do you have time to have a look at this? Thanks ;-)
Keqian.
On 2021/1/26 20:44, Keqian Zhu wrote:
> [...]
On 2021-02-01 13:12, Keqian Zhu wrote:
> Hi Marc,
>
> Do you have time to have a look at this? Thanks ;-)
Not immediately. I'm busy with stuff that is planned to go
in 5.12, which isn't the case for this series. I'll get to
it eventually.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 2021/2/1 21:17, Marc Zyngier wrote:
> On 2021-02-01 13:12, Keqian Zhu wrote:
>> Hi Marc,
>>
>> Do you have time to have a look at this? Thanks ;-)
>
> Not immediately. I'm busy with stuff that is planned to go
> in 5.12, which isn't the case for this series. I'll get to
> it eventually.
>
> Thanks,
>
> M.
Sure, there is no rush. Please concentrate on your urgent work first. ;-) Thanks.
Keqian.
Hi everyone,
Any comments are welcome :).
Thanks,
Keqian
On 2021/1/26 20:44, Keqian Zhu wrote:
> [...]