2023-01-09 22:04:58

by Raghavendra Rao Ananta

Subject: [RFC PATCH 0/6] KVM: arm64: Add support for FEAT_TLBIRANGE

In certain code paths, KVM/ARM currently invalidates the TLB entries
for the entire VM instead of just the necessary range. For example,
when collapsing a table PTE to a block PTE, instead of iterating over
each PTE and flushing them, KVM uses 'vmalls12e1is' to flush all the
entries. This is inefficient since the guest would have to refill
the TLBs again, even for the addresses that aren't covered by the table
entry. The performance impact would scale poorly if many addresses in
the VM are going through this remapping.

For architectures that implement FEAT_TLBIRANGE, KVM can replace such
inefficient paths by performing the invalidations only on the range of
addresses that are in scope. This series aims to do exactly that.
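
As a rough illustration of the pattern the series moves towards (not
actual patch content; addr/size/mmu are placeholders, and the exact call
sites and tlb_level hints vary per patch):

    /* Today: invalidate every stage-2 entry for the VM. */
    kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);

    /*
     * With this series, when FEAT_TLBIRANGE is present, only the
     * affected IPA range is invalidated.
     */
    if (system_supports_tlb_range())
        kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, addr,
                     addr + size, 0);
    else
        kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);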

Patch-1 refactors the core arm64's __flush_tlb_range() to be used by
other entities.

Patch-2 adds support to flush a range of IPAs for KVM.

Patch-3 defines a generic KVM function, kvm_flush_remote_tlbs_range(),
to be used in upcoming patches. The patch uses this in the MMU notifier
handlers to perform the flush only for a certain range of addresses.

Patch-4 optimizes the dirty-logging path to flush the TLBs using
the range-based functions.

Patch-5 modifies stage2_try_break_pte() to use range-based TLBI
instructions when breaking a table entry. The map path is the
immediate consumer of this, when KVM remaps a table entry into a block.

Patch-6 introduces a fast stage-2 unmap path in which, under the right
conditions, instead of traversing each and every PTE and unmapping them,
KVM disconnects the PTE at a higher level (say, at level-1 for a 4K page
size) and unmaps the table entries using free_removed_table(). This
allows KVM to use a range-based TLBI to flush the entire range covered
at that level.

The series is based off of kvmarm-next.

The performance evaluation was done on hardware that supports
FEAT_TLBIRANGE, in a VHE configuration, using kvm_page_table_test.
Currently, the series captures the impact in the map and unmap paths as
described above.

kvm_page_table_test -m 2 -v 128 -s anonymous_hugetlb_2mb -b $i

+--------+------------------------------+------------------------------+
| mem_sz |       ADJUST_MAPPINGS        |           Unmap VM           |
|  (GB)  | next (s) | next + series (s) | next (s) | next + series (s) |
+--------+----------+-------------------+----------+-------------------+
|      1 |     0.70 |              0.73 |     0.50 |              0.50 |
|      2 |     0.91 |              0.97 |     0.50 |              0.50 |
|      4 |     1.47 |              1.48 |     0.51 |              0.51 |
|      8 |     2.25 |              2.43 |     0.52 |              0.51 |
|     16 |     4.09 |              4.60 |     0.54 |              0.54 |
|     32 |     7.77 |              8.99 |     0.58 |              0.61 |
|     64 |    16.73 |             17.50 |     0.66 |              0.70 |
|    128 |    30.45 |             35.55 |     0.80 |              0.77 |
+--------+----------+-------------------+----------+-------------------+

Unfortunately, the performance of ADJUST_MAPPINGS degrades as the
memory size increases. Upon closer profiling, __kvm_tlb_flush_vmid(),
which the baseline uses to flush, takes an average of 73.2 us, while
__kvm_tlb_flush_range_vmid_ipa(), which the series uses, costs 208.1 us.
That is a regression of ~2.8x per flush when breaking the PTE, and could
be the reason why ADJUST_MAPPINGS' performance degrades with size.

On the other hand, the unmap performance is almost on par with the
baseline for 2M hugepages. However, the fast unmap path's performance is
significantly better, by 3-4x, when the guest is backed by 4K pages. This
is expected, as the number of PTEs that we traverse for 4K mappings is
significantly larger than for 2M hugepages, and the fast path avoids that
traversal.
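
To put rough numbers on that (assuming a 4K granule, where the largest
block level covers 1GB): unmapping a single 1GB region page by page means
visiting 1GB / 4KB = 262,144 leaf PTEs plus the intermediate tables,
whereas the fast path breaks one higher-level entry and hands the whole
subtree to free_removed_table().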

kvm_page_table_test -m 2 -v 1 -b $i

+--------+------------------------------+
| mem_sz |           Unmap VM           |
|  (GB)  | next (s) | next + series (s) |
+--------+----------+-------------------+
|      1 |     1.03 |              1.05 |
|      2 |     1.57 |              1.19 |
|      4 |     2.61 |              1.45 |
|      8 |     4.69 |              1.96 |
|     16 |     8.84 |              3.03 |
|     32 |    18.07 |              4.80 |
|     64 |    36.62 |              8.56 |
|    128 |    66.81 |             17.18 |
+--------+----------+-------------------+

I'm looking for suggestions/comments from the reviewers on the
following:

1. Given the poor performance of the TLBI range instructions compared
to the global VM flush, is there room to improve the implementation of
__kvm_tlb_flush_range() (patch-2)?

2. When the series switches from a global flush to a range-based flush
in the map path (patch-5), we'd expect fewer TLB misses and improved
guest performance. This has not been measured yet. Is there any
upstream test that can measure this?

Thank you.
Raghavendra

Raghavendra Rao Ananta (6):
arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range
KVM: arm64: Add support for FEAT_TLBIRANGE
KVM: Define kvm_flush_remote_tlbs_range
KVM: arm64: Optimize TLBIs in the dirty logging path
KVM: arm64: Optimize the stage2 map path with TLBI range instructions
KVM: arm64: Create a fast stage-2 unmap path

arch/arm64/include/asm/kvm_asm.h | 21 ++++++
arch/arm64/include/asm/tlbflush.h | 107 +++++++++++++++--------------
arch/arm64/kvm/arm.c | 7 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++
arch/arm64/kvm/hyp/nvhe/tlb.c | 24 +++++++
arch/arm64/kvm/hyp/pgtable.c | 73 ++++++++++++++++++--
arch/arm64/kvm/hyp/vhe/tlb.c | 20 ++++++
arch/arm64/kvm/mmu.c | 12 +++-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 7 +-
10 files changed, 223 insertions(+), 60 deletions(-)

--
2.39.0.314.g84b9a713c41-goog


2023-01-09 22:11:46

by Raghavendra Rao Ananta

Subject: [RFC PATCH 6/6] KVM: arm64: Create a fast stage-2 unmap path

The current implementation of the stage-2 unmap walker
traverses the entire page-table to clear and flush the TLBs
for each entry. This could be very expensive if the VM
is not backed by hugepages. The unmap operation could be
made more efficient by disconnecting the table at the very
top (the level at which the largest block mapping can be hosted)
and doing the rest of the unmapping using free_removed_table().
If the system supports FEAT_TLBIRANGE, flush the entire range
that has been disconnected from the rest of the page-table.
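
With a 4K granule, for instance, this means the fast path only engages
when the unmap size is exactly the largest supported block size (1GB),
and the disconnected range is then invalidated with a range-based flush
rather than per-PTE invalidations.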

Suggested-by: Ricardo Koller <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 099032bb01bce..7bcd898de2805 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1021,6 +1021,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
return 0;
}

+/*
+ * The fast walker executes only if the unmap size is exactly equal to the
+ * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
+ * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
+ * be disconnected from the rest of the page-table without the need to
+ * traverse all the PTEs, at all the levels, and unmap each and every one
+ * of them. The disconnected table can be freed using free_removed_table().
+ */
+static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+ kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
+ struct kvm_s2_mmu *mmu = ctx->arg;
+
+ if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
+ return 0;
+
+ if (!stage2_try_break_pte(ctx, mmu, 0))
+ return -EAGAIN;
+
+ /*
+ * Gain back a reference for stage2_unmap_walker() to free
+ * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
+ */
+ mm_ops->get_page(ctx->ptep);
+
+ mm_ops->free_removed_table(childp, ctx->level);
+ return 0;
+}
+
+static void kvm_pgtable_try_fast_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+ struct kvm_pgtable_walker walker = {
+ .cb = fast_stage2_unmap_walker,
+ .arg = pgt->mmu,
+ .flags = KVM_PGTABLE_WALK_TABLE_PRE,
+ };
+
+ if (size == kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL))
+ kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
@@ -1029,6 +1072,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};

+ kvm_pgtable_try_fast_stage2_unmap(pgt, addr, size);
return kvm_pgtable_walk(pgt, addr, size, &walker);
}

--
2.39.0.314.g84b9a713c41-goog

2023-01-09 22:20:19

by Raghavendra Rao Ananta

Subject: [RFC PATCH 3/6] KVM: Define kvm_flush_remote_tlbs_range

Define kvm_flush_remote_tlbs_range() to limit the TLB flush only
to a certain range of addresses. Use it to replace the existing
call to kvm_flush_remote_tlbs() in the MMU notifier path.
Architectures such as arm64 can define this to flush only the
necessary addresses, instead of the entire address space.

Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/mmu.c | 10 ++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 7 ++++++-
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 39d9a334efb57..70f76bc909c5d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -91,6 +91,16 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
}

+void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)
+{
+ struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
+
+ if (system_supports_tlb_range())
+ kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, 0);
+ else
+ kvm_flush_remote_tlbs(kvm);
+}
+
static bool kvm_is_device_pfn(unsigned long pfn)
{
return !pfn_is_map_memory(pfn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f51eb9419bfc3..a76cede9dc3bb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1359,6 +1359,7 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);

void kvm_flush_remote_tlbs(struct kvm *kvm);
+void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end);

#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 03e6a38094c17..f538ecc984f5b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -376,6 +376,11 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
++kvm->stat.generic.remote_tlb_flush;
}
EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
+
+void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)
+{
+ kvm_flush_remote_tlbs(kvm);
+}
#endif

static void kvm_flush_shadow_all(struct kvm *kvm)
@@ -637,7 +642,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
}

if (range->flush_on_ret && ret)
- kvm_flush_remote_tlbs(kvm);
+ kvm_flush_remote_tlbs_range(kvm, range->start, range->end - 1);

if (locked) {
KVM_MMU_UNLOCK(kvm);
--
2.39.0.314.g84b9a713c41-goog

2023-01-09 22:22:11

by Raghavendra Rao Ananta

Subject: [RFC PATCH 5/6] KVM: arm64: Optimize the stage2 map path with TLBI range instructions

Currently, when the map path of the stage-2 page-table coalesces a
bunch of pages into a hugepage, KVM invalidates the entire
VM's TLB entries. This would cause a performance penalty for
the guest whose pages have already been coalesced earlier, as they
would have to refill their TLB entries unnecessarily.

Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
to flush only the range of pages that have been combined into
a hugepage, while leaving other TLB entries alone.

Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6c..099032bb01bce 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -686,6 +686,22 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
}

+static void kvm_table_pte_flush(struct kvm_s2_mmu *mmu, u64 addr, u32 level, u32 tlb_level)
+{
+ if (system_supports_tlb_range()) {
+ u64 end = addr + kvm_granule_size(level);
+
+ kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, addr, end, tlb_level);
+ } else {
+ /*
+ * Invalidate the whole stage-2, as we may have numerous leaf
+ * entries below us which would otherwise need invalidating
+ * individually.
+ */
+ kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+ }
+}
+
/**
* stage2_try_break_pte() - Invalidates a pte according to the
* 'break-before-make' requirements of the
@@ -693,6 +709,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
*
* @ctx: context of the visited pte.
* @mmu: stage-2 mmu
+ * @tlb_level: The level at which the leaf pages are expected (for FEAT_TTL hint)
*
* Returns: true if the pte was successfully broken.
*
@@ -701,7 +718,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
* on the containing table page.
*/
static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
- struct kvm_s2_mmu *mmu)
+ struct kvm_s2_mmu *mmu, u32 tlb_level)
{
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;

@@ -722,7 +739,7 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
* value (if any).
*/
if (kvm_pte_table(ctx->old, ctx->level))
- kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+ kvm_table_pte_flush(mmu, ctx->addr, ctx->level, tlb_level);
else if (kvm_pte_valid(ctx->old))
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);

@@ -804,7 +821,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
if (!stage2_pte_needs_update(ctx->old, new))
return -EAGAIN;

- if (!stage2_try_break_pte(ctx, data->mmu))
+ if (!stage2_try_break_pte(ctx, data->mmu, ctx->level))
return -EAGAIN;

/* Perform CMOs before installation of the guest stage-2 PTE */
@@ -861,7 +878,11 @@ static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
if (!childp)
return -ENOMEM;

- if (!stage2_try_break_pte(ctx, data->mmu)) {
+ /*
+ * As the table will be replaced with a block, one level down would
+ * be the current page entries held by the table.
+ */
+ if (!stage2_try_break_pte(ctx, data->mmu, ctx->level + 1)) {
mm_ops->put_page(childp);
return -EAGAIN;
}
--
2.39.0.314.g84b9a713c41-goog

2023-01-09 22:23:49

by Raghavendra Rao Ananta

Subject: [RFC PATCH 2/6] KVM: arm64: Add support for FEAT_TLBIRANGE

Define a generic function __kvm_tlb_flush_range() to
invalidate the TLBs over a range of addresses. Use
this to define __kvm_tlb_flush_range_vmid_ipa()
(for VHE and nVHE) to flush a range of stage-2
page-tables using IPA in one go.

If the system supports FEAT_TLBIRANGE, the following
patches will conveniently replace global TLBIs such
as vmalls12e1is in the map, unmap, and dirty-logging
paths with ripas2e1is.

Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 21 +++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++++++++++
arch/arm64/kvm/hyp/nvhe/tlb.c | 24 ++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/tlb.c | 20 ++++++++++++++++++++
4 files changed, 76 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544d..bdf94ae0333b0 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+ __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_range_vmid_ipa,
};

#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
@@ -221,10 +222,30 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
#define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)

+#define __kvm_tlb_flush_range(op, mmu, start, end, tlb_level) do { \
+ unsigned long pages, stride; \
+ \
+ stride = PAGE_SIZE; \
+ start = round_down(start, stride); \
+ end = round_up(end, stride); \
+ pages = (end - start) >> PAGE_SHIFT; \
+ \
+ if ((!system_supports_tlb_range() && \
+ (end - start) >= (MAX_TLBI_OPS * stride)) || \
+ pages >= MAX_TLBI_RANGE_PAGES) { \
+ __kvm_tlb_flush_vmid(mmu); \
+ break; \
+ } \
+ \
+ __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
+} while (0)
+
extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
int level);
+extern void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level);
extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);

extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b0..ac52d0fbb9719 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -116,6 +116,16 @@ static void handle___kvm_flush_vm_context(struct kvm_cpu_context *host_ctxt)
__kvm_flush_vm_context();
}

+static void handle___kvm_tlb_flush_range_vmid_ipa(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
+ DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
+ DECLARE_REG(int, level, host_ctxt, 4);
+
+ __kvm_tlb_flush_range_vmid_ipa(kern_hyp_va(mmu), start, end, level);
+}
+
static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -314,6 +324,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_adjust_pc),
HANDLE_FUNC(__kvm_vcpu_run),
HANDLE_FUNC(__kvm_flush_vm_context),
+ HANDLE_FUNC(__kvm_tlb_flush_range_vmid_ipa),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
HANDLE_FUNC(__kvm_tlb_flush_vmid),
HANDLE_FUNC(__kvm_flush_cpu_context),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f5896..292f5c4834d08 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -55,6 +55,30 @@ static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
}
}

+void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level);
+
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ /* See the comment below in __kvm_tlb_flush_vmid_ipa() */
+ if (icache_is_vpipt())
+ icache_inval_all_pou();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
phys_addr_t ipa, int level)
{
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e9..2631cc09e4184 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -79,6 +79,26 @@ static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
local_irq_restore(cxt->flags);
}

+void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level);
+
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
phys_addr_t ipa, int level)
{
--
2.39.0.314.g84b9a713c41-goog

2023-01-09 22:39:52

by Raghavendra Rao Ananta

Subject: [RFC PATCH 1/6] arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range

Currently, the core TLB flush functionality of __flush_tlb_range()
hardcodes vae1is (and variants) for the flush operation. In the
upcoming patches, the KVM code will reuse this core algorithm with
ipas2e1is for range-based TLB invalidations based on the IPA.
Hence, extract the core flush functionality of __flush_tlb_range()
into its own macro that accepts an 'op' argument to pass any
TLBI instruction.

No functional changes intended.

Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 107 +++++++++++++++---------------
1 file changed, 54 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25d..9a57eae14e576 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -278,14 +278,60 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
*/
#define MAX_TLBI_OPS PTRS_PER_PTE

+/* When the CPU does not support TLB range operations, flush the TLB
+ * entries one by one at the granularity of 'stride'. If the TLB
+ * range ops are supported, then:
+ *
+ * 1. If 'pages' is odd, flush the first page through non-range
+ * operations;
+ *
+ * 2. For remaining pages: the minimum range granularity is decided
+ * by 'scale', so multiple range TLBI operations may be required.
+ * Start from scale = 0, flush the corresponding number of pages
+ * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
+ * until no pages left.
+ *
+ * Note that certain ranges can be represented by either num = 31 and
+ * scale or num = 0 and scale + 1. The loop below favours the latter
+ * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
+ */
+#define __flush_tlb_range_op(op, start, pages, stride, asid, tlb_level, tlbi_user) do { \
+ int num = 0; \
+ int scale = 0; \
+ unsigned long addr; \
+ \
+ while (pages > 0) { \
+ if (!system_supports_tlb_range() || \
+ pages % 2 == 1) { \
+ addr = __TLBI_VADDR(start, asid); \
+ __tlbi_level(op, addr, tlb_level); \
+ if (tlbi_user) \
+ __tlbi_user_level(op, addr, tlb_level); \
+ start += stride; \
+ pages -= stride >> PAGE_SHIFT; \
+ continue; \
+ } \
+ \
+ num = __TLBI_RANGE_NUM(pages, scale); \
+ if (num >= 0) { \
+ addr = __TLBI_VADDR_RANGE(start, asid, scale, \
+ num, tlb_level); \
+ __tlbi(r##op, addr); \
+ if (tlbi_user) \
+ __tlbi_user(r##op, addr); \
+ start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
+ pages -= __TLBI_RANGE_PAGES(num, scale); \
+ } \
+ scale++; \
+ } \
+} while (0)
+
static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
{
- int num = 0;
- int scale = 0;
- unsigned long asid, addr, pages;
+ unsigned long asid, pages;

start = round_down(start, stride);
end = round_up(end, stride);
@@ -307,56 +353,11 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ishst);
asid = ASID(vma->vm_mm);

- /*
- * When the CPU does not support TLB range operations, flush the TLB
- * entries one by one at the granularity of 'stride'. If the TLB
- * range ops are supported, then:
- *
- * 1. If 'pages' is odd, flush the first page through non-range
- * operations;
- *
- * 2. For remaining pages: the minimum range granularity is decided
- * by 'scale', so multiple range TLBI operations may be required.
- * Start from scale = 0, flush the corresponding number of pages
- * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
- * until no pages left.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
- */
- while (pages > 0) {
- if (!system_supports_tlb_range() ||
- pages % 2 == 1) {
- addr = __TLBI_VADDR(start, asid);
- if (last_level) {
- __tlbi_level(vale1is, addr, tlb_level);
- __tlbi_user_level(vale1is, addr, tlb_level);
- } else {
- __tlbi_level(vae1is, addr, tlb_level);
- __tlbi_user_level(vae1is, addr, tlb_level);
- }
- start += stride;
- pages -= stride >> PAGE_SHIFT;
- continue;
- }
-
- num = __TLBI_RANGE_NUM(pages, scale);
- if (num >= 0) {
- addr = __TLBI_VADDR_RANGE(start, asid, scale,
- num, tlb_level);
- if (last_level) {
- __tlbi(rvale1is, addr);
- __tlbi_user(rvale1is, addr);
- } else {
- __tlbi(rvae1is, addr);
- __tlbi_user(rvae1is, addr);
- }
- start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
- pages -= __TLBI_RANGE_PAGES(num, scale);
- }
- scale++;
- }
+ if (last_level)
+ __flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true);
+ else
+ __flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);
+
dsb(ish);
}

--
2.39.0.314.g84b9a713c41-goog

2023-01-10 00:19:45

by David Matlack

Subject: Re: [RFC PATCH 3/6] KVM: Define kvm_flush_remote_tlbs_range

On Mon, Jan 9, 2023 at 3:41 PM David Matlack <[email protected]> wrote:
>
> On Mon, Jan 09, 2023 at 09:53:44PM +0000, Raghavendra Rao Ananta wrote:
> > +{
> > + kvm_flush_remote_tlbs(kvm);
> > +}
>
> FYI I also proposed a common kvm_flush_remote_tlbs() in my Common MMU
> series [1].
>
> Could I interest you in grabbing patches 29-33 from that series, which
> has the same end result (common kvm_flush_remote_tlbs_range()) but also
> hooks up the KVM/x86 range-based flushing, and folding them into this
> series?
>
> [1] https://lore.kernel.org/kvm/[email protected]/

(Also they make kvm_arch_flush_remote_tlbs_memslot() common so you
don't need the ARM-specific implementation in patch 4.)

>
> > #endif
> >
> > static void kvm_flush_shadow_all(struct kvm *kvm)
> > @@ -637,7 +642,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > }
> >
> > if (range->flush_on_ret && ret)
> > - kvm_flush_remote_tlbs(kvm);
> > + kvm_flush_remote_tlbs_range(kvm, range->start, range->end - 1);
> >
> > if (locked) {
> > KVM_MMU_UNLOCK(kvm);
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >

2023-01-10 00:49:58

by David Matlack

Subject: Re: [RFC PATCH 3/6] KVM: Define kvm_flush_remote_tlbs_range

On Mon, Jan 09, 2023 at 09:53:44PM +0000, Raghavendra Rao Ananta wrote:
> Define kvm_flush_remote_tlbs_range() to limit the TLB flush only
> to a certain range of addresses. Use it to replace the existing
> call to kvm_flush_remote_tlbs() in the MMU notifier path.
> Architectures such as arm64 can define this to flush only the
> necessary addresses, instead of the entire address space.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/mmu.c | 10 ++++++++++
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 7 ++++++-
> 3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 39d9a334efb57..70f76bc909c5d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -91,6 +91,16 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
> }
>
> +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +{
> + struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> +
> + if (system_supports_tlb_range())
> + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, 0);
> + else
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static bool kvm_is_device_pfn(unsigned long pfn)
> {
> return !pfn_is_map_memory(pfn);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f51eb9419bfc3..a76cede9dc3bb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1359,6 +1359,7 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
> void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
>
> void kvm_flush_remote_tlbs(struct kvm *kvm);
> +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end);
>
> #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 03e6a38094c17..f538ecc984f5b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -376,6 +376,11 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> ++kvm->stat.generic.remote_tlb_flush;
> }
> EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
> +
> +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)

It's ambiguous what start/end represent. Case in point,
__kvm_handle_hva_range() is passing in HVAs but then patch 4 passes in
GFNs.

Probably kvm_flush_remote_tlbs_range() should accept GFNs, and there can
be a helper wrapper that does the HVA-to-GFN conversion.

> +{
> + kvm_flush_remote_tlbs(kvm);
> +}

FYI I also proposed a common kvm_flush_remote_tlbs() in my Common MMU
series [1].

Could I interest you in grabbing patches 29-33 from that series, which
has the same end result (common kvm_flush_remote_tlbs_range()) but also
hooks up the KVM/x86 range-based flushing, and folding them into this
series?

[1] https://lore.kernel.org/kvm/[email protected]/

> #endif
>
> static void kvm_flush_shadow_all(struct kvm *kvm)
> @@ -637,7 +642,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> }
>
> if (range->flush_on_ret && ret)
> - kvm_flush_remote_tlbs(kvm);
> + kvm_flush_remote_tlbs_range(kvm, range->start, range->end - 1);
>
> if (locked) {
> KVM_MMU_UNLOCK(kvm);
> --
> 2.39.0.314.g84b9a713c41-goog
>

2023-01-10 17:55:41

by Raghavendra Rao Ananta

Subject: Re: [RFC PATCH 3/6] KVM: Define kvm_flush_remote_tlbs_range

On Mon, Jan 9, 2023 at 3:41 PM David Matlack <[email protected]> wrote:
>
> On Mon, Jan 09, 2023 at 09:53:44PM +0000, Raghavendra Rao Ananta wrote:
> > Define kvm_flush_remote_tlbs_range() to limit the TLB flush only
> > to a certain range of addresses. Use it to replace the existing
> > call to kvm_flush_remote_tlbs() in the MMU notifier path.
> > Architectures such as arm64 can define this to flush only the
> > necessary addresses, instead of the entire address space.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/mmu.c | 10 ++++++++++
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 7 ++++++-
> > 3 files changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 39d9a334efb57..70f76bc909c5d 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -91,6 +91,16 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> > kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
> > }
> >
> > +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)
> > +{
> > + struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> > +
> > + if (system_supports_tlb_range())
> > + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, 0);
> > + else
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > static bool kvm_is_device_pfn(unsigned long pfn)
> > {
> > return !pfn_is_map_memory(pfn);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f51eb9419bfc3..a76cede9dc3bb 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1359,6 +1359,7 @@ int kvm_vcpu_yield_to(struct kvm_vcpu *target);
> > void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
> >
> > void kvm_flush_remote_tlbs(struct kvm *kvm);
> > +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end);
> >
> > #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 03e6a38094c17..f538ecc984f5b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -376,6 +376,11 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> > ++kvm->stat.generic.remote_tlb_flush;
> > }
> > EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
> > +
> > +void kvm_flush_remote_tlbs_range(struct kvm *kvm, unsigned long start, unsigned long end)
>
> It's ambiguous what start/end represent. Case in point,
> __kvm_handle_hva_range() is passing in HVAs but then patch 4 passes in
> GFNs.
>
> Probably kvm_flush_remote_tlbs_range() should accept GFNs, and there can
> be a helper wrapper that does the HVA-to-GFN conversion.
>
> > +{
> > + kvm_flush_remote_tlbs(kvm);
> > +}
>
You are right, that should've been GFNs, and the function should
operate on GPAs. I can think about a wrapper.
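
Perhaps something along these lines on the arm64 side (untested sketch,
just to illustrate the shape; the HVA-to-GFN handling would live in a
separate wrapper):

    void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 nr_pages)
    {
        struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
        phys_addr_t start = start_gfn << PAGE_SHIFT;
        phys_addr_t end = (start_gfn + nr_pages) << PAGE_SHIFT;

        if (system_supports_tlb_range())
            kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, 0);
        else
            kvm_flush_remote_tlbs(kvm);
    }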

> FYI I also proposed a common kvm_flush_remote_tlbs() in my Common MMU
> series [1].
>
> Could I interest you in grabbing patches 29-33 from that series, which
> has the same end result (common kvm_flush_remote_tlbs_range()) but also
> hooks up the KVM/x86 range-based flushing, and folding them into this
> series?
>
Of course! I'll grab them in my next version.

Thank you.
Raghavendra

> [1] https://lore.kernel.org/kvm/[email protected]/
>
> > #endif
> >
> > static void kvm_flush_shadow_all(struct kvm *kvm)
> > @@ -637,7 +642,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > }
> >
> > if (range->flush_on_ret && ret)
> > - kvm_flush_remote_tlbs(kvm);
> > + kvm_flush_remote_tlbs_range(kvm, range->start, range->end - 1);
> >
> > if (locked) {
> > KVM_MMU_UNLOCK(kvm);
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >

2023-01-24 23:26:00

by Oliver Upton

Subject: Re: [RFC PATCH 5/6] KVM: arm64: Optimize the stage2 map path with TLBI range instructions

Hi Raghavendra,

My comment from the previous change also applies here:

KVM: arm64: Use range-based TLBIs when collapsing hugepages

On Mon, Jan 09, 2023 at 09:53:46PM +0000, Raghavendra Rao Ananta wrote:
> Currently, when the map path of the stage-2 page-table coalesces a
> bunch of pages into a hugepage, KVM invalidates the entire
> VM's TLB entries. This would cause a performance penalty for
> the guest whose pages have already been coalesced earlier, as they
> would have to refill their TLB entries unnecessarily.

It is also problematic that we do this on every single fault where we
collapse a hugepage.

> Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
> to flush only the range of pages that have been combined into
> a hugepage, while leaving other TLB entries alone.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 29 +++++++++++++++++++++++++----
> 1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b11cf2c618a6c..099032bb01bce 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -686,6 +686,22 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
> }
>
> +static void kvm_table_pte_flush(struct kvm_s2_mmu *mmu, u64 addr, u32 level, u32 tlb_level)

Could you call this something like kvm_pgtable_flush_range() and take an
address range as an argument? TLBIRANGE can be used outside the context
of a table (i.e. a subset of PTEs).

> +{
> + if (system_supports_tlb_range()) {
> + u64 end = addr + kvm_granule_size(level);
> +
> + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, addr, end, tlb_level);
> + } else {
> + /*
> + * Invalidate the whole stage-2, as we may have numerous leaf
> + * entries below us which would otherwise need invalidating
> + * individually.
> + */
> + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> + }
> +}
> +
> /**
> * stage2_try_break_pte() - Invalidates a pte according to the
> * 'break-before-make' requirements of the
> @@ -693,6 +709,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> *
> * @ctx: context of the visited pte.
> * @mmu: stage-2 mmu
> + * @tlb_level: The level at which the leaf pages are expected (for FEAT_TTL hint)

Do we need the caller to provide the TTL hint? We already have
ctx->level, and stage2_try_break_pte() also knows what the removed PTE
contained (i.e. a table or a block/page).

> * Returns: true if the pte was successfully broken.
> *
> @@ -701,7 +718,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> * on the containing table page.
> */
> static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> - struct kvm_s2_mmu *mmu)
> + struct kvm_s2_mmu *mmu, u32 tlb_level)
> {
> struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
>
> @@ -722,7 +739,7 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> * value (if any).
> */
> if (kvm_pte_table(ctx->old, ctx->level))
> - kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> + kvm_table_pte_flush(mmu, ctx->addr, ctx->level, tlb_level);

I don't think we should provide a TTL hint for a removed table. It is
entirely possible for the unlinked table to contain a mix of blocks and
pages, meaning there isn't a uniform table level for the whole range.

> else if (kvm_pte_valid(ctx->old))
> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
>
> @@ -804,7 +821,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> if (!stage2_pte_needs_update(ctx->old, new))
> return -EAGAIN;
>
> - if (!stage2_try_break_pte(ctx, data->mmu))
> + if (!stage2_try_break_pte(ctx, data->mmu, ctx->level))
> return -EAGAIN;
>
> /* Perform CMOs before installation of the guest stage-2 PTE */
> @@ -861,7 +878,11 @@ static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> if (!childp)
> return -ENOMEM;
>
> - if (!stage2_try_break_pte(ctx, data->mmu)) {
> + /*
> + * As the table will be replaced with a block, one level down would
> + * be the current page entries held by the table.
> + */

This isn't necessarily true. Ignoring mixed block/pages for a moment,
collapsing a PUD entry into a block after dirty logging (where we mapped
at PTE level) would imply a TTL of ctx->level + 2.

But again, I think it is best to provide no hint in this case.

--
Thanks,
Oliver

2023-01-25 22:20:55

by Raghavendra Rao Ananta

Subject: Re: [RFC PATCH 5/6] KVM: arm64: Optimize the stage2 map path with TLBI range instructions

Hi Oliver,

On Tue, Jan 24, 2023 at 3:25 PM Oliver Upton <[email protected]> wrote:
>
> Hi Raghavendra,
>
> My comment from the previous change also applies here:
>
> KVM: arm64: Use range-based TLBIs when collapsing hugepages
>
> On Mon, Jan 09, 2023 at 09:53:46PM +0000, Raghavendra Rao Ananta wrote:
> > Currently, when the map path of the stage-2 page-table coalesces a
> > bunch of pages into a hugepage, KVM invalidates the entire
> > VM's TLB entries. This would cause a performance penalty for
> > the guest whose pages have already been coalesced earlier, as they
> > would have to refill their TLB entries unnecessarily.
>
> It is also problematic that we do this on every single fault where we
> collapse a hugepage.
>
Yes! I'll also include this description in v2.

> > Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
> > to flush only the range of pages that have been combined into
> > a hugepage, while leaving other TLB entries alone.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 29 +++++++++++++++++++++++++----
> > 1 file changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index b11cf2c618a6c..099032bb01bce 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -686,6 +686,22 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> > return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
> > }
> >
> > +static void kvm_table_pte_flush(struct kvm_s2_mmu *mmu, u64 addr, u32 level, u32 tlb_level)
>
> Could you call this something like kvm_pgtable_flush_range() and take an
> address range as an argument? TLBIRANGE can be used outside the context
> of a table (i.e. a subset of PTEs).
>
Good idea. In that case, the function becomes very close to arm64's
implementation of kvm_flush_remote_tlbs_range() on top of David's
series [1].
Too bad we may not be able to invoke that directly. I'll rename the
function.

> > +{
> > + if (system_supports_tlb_range()) {
> > + u64 end = addr + kvm_granule_size(level);
> > +
> > + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, addr, end, tlb_level);
> > + } else {
> > + /*
> > + * Invalidate the whole stage-2, as we may have numerous leaf
> > + * entries below us which would otherwise need invalidating
> > + * individually.
> > + */
> > + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> > + }
> > +}
> > +
> > /**
> > * stage2_try_break_pte() - Invalidates a pte according to the
> > * 'break-before-make' requirements of the
> > @@ -693,6 +709,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> > *
> > * @ctx: context of the visited pte.
> > * @mmu: stage-2 mmu
> > + * @tlb_level: The level at which the leaf pages are expected (for FEAT_TTL hint)
>
> Do we need the caller to provide the TTL hint? We already have
> ctx->level, and stage2_try_break_pte() also knows what the removed PTE
> contained (i.e. a table or a block/page).
>
ctx->level may not always translate to the TTL hint. For example,
patch 6/6 of this series also calls stage2_try_break_pte(), but
from the very top level (level-1). In that case, the level can be
extracted from ctx, but we won't have any idea what the TTL hint is,
since we won't be traversing all of the page-table. As a result, we
pass 0. However, if we are remapping a table to a bunch of pages, the
TTL hint could be just one level down.
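
(With a 4K granule, for instance: collapsing a level-2 table whose
entries are level-3 page mappings gives a hint of ctx->level + 1 = 3,
while for the fast unmap path at level-1 the leaves underneath could be
level-2 blocks or level-3 pages, so the best we can do is pass 0.)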

> > * Returns: true if the pte was successfully broken.
> > *
> > @@ -701,7 +718,7 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> > * on the containing table page.
> > */
> > static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> > - struct kvm_s2_mmu *mmu)
> > + struct kvm_s2_mmu *mmu, u32 tlb_level)
> > {
> > struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> >
> > @@ -722,7 +739,7 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> > * value (if any).
> > */
> > if (kvm_pte_table(ctx->old, ctx->level))
> > - kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> > + kvm_table_pte_flush(mmu, ctx->addr, ctx->level, tlb_level);
>
> I don't think we should provide a TTL hint for a removed table. It is
> entirely possible for the unlinked table to contain a mix of blocks and
> pages, meaning there isn't a uniform table level for the whole range.
>
True, but it's particularly useful for the case where we know that a
table is going to be replaced by a block. In that case, can't we be
sure of the TTL level? Perhaps for other cases, we can supply 0. WDYT?

> > else if (kvm_pte_valid(ctx->old))
> > kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> >
> > @@ -804,7 +821,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> > if (!stage2_pte_needs_update(ctx->old, new))
> > return -EAGAIN;
> >
> > - if (!stage2_try_break_pte(ctx, data->mmu))
> > + if (!stage2_try_break_pte(ctx, data->mmu, ctx->level))
> > return -EAGAIN;
> >
> > /* Perform CMOs before installation of the guest stage-2 PTE */
> > @@ -861,7 +878,11 @@ static int stage2_map_walk_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> > if (!childp)
> > return -ENOMEM;
> >
> > - if (!stage2_try_break_pte(ctx, data->mmu)) {
> > + /*
> > + * As the table will be replaced with a block, one level down would
> > + * be the current page entries held by the table.
> > + */
>
> This isn't necessarily true. Ignoring mixed block/pages for a moment,
> collapsing a PUD entry into a block after dirty logging (where we mapped
> at PTE level) would imply a TTL of ctx->level + 2.
>
> But again, I think it is best to provide no hint in this case.
>
Ah, right. We could also collapse into a 1G block. No hint would be
better in that case.

Thanks,
Raghavendra

[1]: https://lore.kernel.org/linux-arm-kernel/[email protected]/T/#meca775fa8accc996900d5e9caeb004f7b9774627

> --
> Thanks,
> Oliver