In certain code paths, KVM/ARM currently invalidates TLB entries for
the entire VM instead of just the necessary range. For example,
when collapsing a table PTE into a block PTE, instead of iterating over
each PTE and flushing it, KVM uses the 'vmalls12e1is' TLBI operation to
flush all the entries. This is inefficient since the guest would have
to refill the TLBs again, even for the addresses that aren't covered
by the table entry. The performance impact scales poorly when many
addresses in the VM go through this remapping.
For architectures that implement FEAT_TLBIRANGE, KVM can replace such
inefficient paths by performing the invalidations only on the range of
addresses that are in scope. This series does exactly that for the
stage-2 map, unmap, and write-protect paths.
Patch-1 refactors the core flush algorithm of arm64's
__flush_tlb_range() so that it can be reused by other entities.
Patch-2 adds a range-based TLBI mechanism for KVM (VHE and nVHE).
Patch-3 implements the kvm_arch_flush_remote_tlbs_range() for arm64.
Patch-4 aims to flush only the memslot that undergoes a write-protect,
instead of the entire VM.
Patch-5 changes stage2_try_break_pte() to use the range-based
TLBI instructions when collapsing a table entry. The map path is the
immediate consumer of this when KVM remaps a table entry into a block.
Patch-6 adds a 'skip_flush' parameter to stage2_put_pte() for the next
patch to take advantage of during TLB invalidations in the unmap path.
Patch-7 modifies the stage-2 unmap path so that, if the system supports
FEAT_TLBIRANGE, the TLB invalidations are skipped during the page-table
walk and instead done in one go after the entire walk is finished (see
the sketch below).
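Conceptually, the unmap path then becomes (a condensed sketch of
patches 6-7, not the exact code):

	/* Clear the PTEs, but skip the per-PTE TLBIs during the walk... */
	ret = kvm_pgtable_walk(pgt, addr, size, &walker);

	/* ...then invalidate the whole range with one TLBI RANGE sequence. */
	kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu, addr, addr + size);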
The series is based on upstream v6.3-rc5, with David Matlack's
common API for TLB invalidations[1] applied on top.
The performance evaluation was done on hardware that supports
FEAT_TLBIRANGE, in a VHE configuration, using a modified
kvm_page_table_test.
The modified version updates the guest code in the ADJUST_MAPPINGS
case so that, for every new page it iterates through, it not only
accesses that page but also accesses up to 512 pages backwards.
This is done to measure the effect of TLB misses after KVM has
handled a fault.
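A rough sketch of the guest-side change (hypothetical code for
illustration, not the actual selftest diff):

static void guest_touch_pages(char *base, unsigned long nr_pages,
			      unsigned long page_size)
{
	unsigned long i, j;

	for (i = 0; i < nr_pages; i++) {
		/* Touch the new page (this is what triggers the fault). */
		*(volatile char *)(base + i * page_size) = 1;

		/*
		 * Re-read up to 512 of the previously touched pages, so
		 * that TLB refills after a KVM-side flush show up in the
		 * ADJUST_MAPPINGS numbers.
		 */
		for (j = 1; j <= 512 && j <= i; j++)
			(void)*(volatile char *)(base + (i - j) * page_size);
	}
}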
The series captures the impact in the map and unmap paths as described
above.
$ kvm_page_table_test -m 2 -v 128 -s anonymous_hugetlb_2mb -b $i
+--------+------------------------------+------------------------------+
| mem_sz | ADJUST_MAPPINGS (s) | Unmap VM (s) |
| (GB) | Baseline | Baseline + series | Baseline | Baseline + series |
+--------+----------+-------------------+----------+-------------------+
| 1 | 5.25 | 5.42 | 0.007 | 0.005 |
| 2 | 9.40 | 7.23 | 0.010 | 0.006 |
| 4 | 20.96 | 13.09 | 0.015 | 0.008 |
| 8 | 41.32 | 26.06 | 0.025 | 0.012 |
| 16 | 76.00 | 53.40 | 0.045 | 0.017 |
| 32 | 85.92 | 82.87 | 0.077 | 0.028 |
| 64 | 178.99 | 168.45 | 0.142 | 0.049 |
| 128 | 359.76 | 316.66 | 0.280 | 0.082 |
+--------+----------+-------------------+----------+-------------------+
$ kvm_page_table_test -m 2 -b 128G -s anonymous_hugetlb_2mb -v $i
+--------+------------------------------+
| vCPUs | ADJUST_MAPPINGS (s) |
| | Baseline | Baseline + series |
+--------+----------+-------------------+
| 1 | 130.84 | 139.65 |
| 2 | 91.42 | 105.49 |
| 4 | 98.83 | 97.68 |
| 8 | 108.28 | 104.90 |
| 16 | 110.69 | 108.28 |
| 32 | 146.47 | 120.02 |
| 64 | 204.04 | 185.88 |
| 128 | 359.76 | 316.66 |
+--------+----------+-------------------+
For the ADJUST_MAPPINGS cases, which map the 4K table entries back
into 2M hugepages, the series sees an average improvement of ~15%.
For unmapping 2M hugepages, we see at least a 3x improvement.
$ kvm_page_table_test -m 2 -b $i
+--------+------------------------------+
| mem_sz | Unmap VM (s) |
| (GB) | Baseline | Baseline + series |
+--------+----------+-------------------+
| 1 | 1.03 | 0.58 |
| 2 | 1.57 | 0.72 |
| 4 | 2.65 | 0.98 |
| 8 | 4.77 | 1.54 |
| 16 | 9.06 | 2.57 |
| 32 | 17.60 | 4.41 |
| 64 | 34.72 | 8.92 |
| 128 | 68.92 | 17.70 |
+--------+----------+-------------------+
The series sees an average gain of 4x when the guest is backed by
PAGE_SIZE (4K) pages.
v3:
Thanks, Oliver, for all the suggestions.
- The core flush API (__kvm_tlb_flush_vmid_range()) now checks whether
the system supports FEAT_TLBIRANGE, thus eliminating the
redundancy in the upper layers.
- If FEAT_TLBIRANGE is not supported, the implementation falls
back to invalidating all the TLB entries with the VMID, instead
of doing an iterative flush for the range.
- kvm_arch_flush_remote_tlbs_range() doesn't return -EOPNOTSUPP
if the system doesn't implement FEAT_TLBIRANGE. It depends on
__kvm_tlb_flush_vmid_range() to take care of the decisions
and returns 0 regardless of the underlying feature support.
- __kvm_tlb_flush_vmid_range() doesn't take 'level' as input to
calculate the 'stride'. Instead, it always assumes PAGE_SIZE.
- Fast unmap path is eliminated. Instead, the existing unmap walker
is modified to skip the TLBIs during the walk, and do it all at
once after the walk, using the range-based instructions.
v2:
https://lore.kernel.org/all/[email protected]/
- Rebased the series on top of David Matlack's series for common
TLB invalidation API[1].
- Implement kvm_arch_flush_remote_tlbs_range() for arm64, by extending
the support introduced by [1].
- Use kvm_flush_remote_tlbs_memslot() introduced by [1] to flush
only the current memslot after write-protect.
- Modified the __kvm_tlb_flush_range() macro to accept 'level' as an
argument to calculate the 'stride' instead of just using PAGE_SIZE.
- Split the patch that introduces the range-based TLBI to KVM and the
implementation of IPA-based invalidation into separate patches.
- Dropped the patch that tries to optimize the mmu notifiers paths.
- Rename the function kvm_table_pte_flush() to
kvm_pgtable_stage2_flush_range(), and accept the range of addresses to
flush. [Oliver]
- Drop the 'tlb_level' argument for stage2_try_break_pte() and directly
pass '0' as 'tlb_level' to kvm_pgtable_stage2_flush_range(). [Oliver]
v1:
https://lore.kernel.org/all/[email protected]/
Thank you.
Raghavendra
[1]:
https://lore.kernel.org/linux-arm-kernel/[email protected]/
Raghavendra Rao Ananta (7):
arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range
KVM: arm64: Implement __kvm_tlb_flush_vmid_range()
KVM: arm64: Implement kvm_arch_flush_remote_tlbs_range()
KVM: arm64: Flush only the memslot after write-protect
KVM: arm64: Invalidate the table entries upon a range
KVM: arm64: Add 'skip_flush' arg to stage2_put_pte()
KVM: arm64: Use TLBI range-based instructions for unmap
arch/arm64/include/asm/kvm_asm.h | 3 +
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/include/asm/tlbflush.h | 108 +++++++++++++++--------------
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++
arch/arm64/kvm/hyp/nvhe/tlb.c | 39 +++++++++++
arch/arm64/kvm/hyp/pgtable.c | 49 ++++++++++---
arch/arm64/kvm/hyp/vhe/tlb.c | 35 ++++++++++
arch/arm64/kvm/mmu.c | 13 +++-
8 files changed, 198 insertions(+), 63 deletions(-)
--
2.40.0.634.g4ca3ef3211-goog
Currently, the core TLB flush functionality of __flush_tlb_range()
hardcodes vae1is (and variants) for the flush operation. In the
upcoming patches, the KVM code reuses this core algorithm with
ipas2e1is for range based TLB invalidations based on the IPA.
Hence, extract the core flush functionality of __flush_tlb_range()
into its own macro that accepts an 'op' argument to pass any
TLBI operation, such that other callers (KVM) can benefit.
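For example, with this in place, the existing host path and the KVM
path added later in this series differ only in the TLBI operation (and
ASID/user handling) passed in:

	/* Host: stage-1, last-level invalidation by VA */
	__flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true);

	/* KVM (later patch): stage-2 invalidation by IPA, no ASID */
	__flush_tlb_range_op(ipas2e1is, start, pages, stride, 0, 0, false);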
No functional changes intended.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 108 +++++++++++++++---------------
1 file changed, 55 insertions(+), 53 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25d..4775378b6da1b 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -278,14 +278,61 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
*/
#define MAX_TLBI_OPS PTRS_PER_PTE
+/* When the CPU does not support TLB range operations, flush the TLB
+ * entries one by one at the granularity of 'stride'. If the TLB
+ * range ops are supported, then:
+ *
+ * 1. If 'pages' is odd, flush the first page through non-range
+ * operations;
+ *
+ * 2. For remaining pages: the minimum range granularity is decided
+ * by 'scale', so multiple range TLBI operations may be required.
+ * Start from scale = 0, flush the corresponding number of pages
+ * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
+ * until no pages left.
+ *
+ * Note that certain ranges can be represented by either num = 31 and
+ * scale or num = 0 and scale + 1. The loop below favours the latter
+ * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
+ */
+#define __flush_tlb_range_op(op, start, pages, stride, \
+ asid, tlb_level, tlbi_user) do { \
+ int num = 0; \
+ int scale = 0; \
+ unsigned long addr; \
+ \
+ while (pages > 0) { \
+ if (!system_supports_tlb_range() || \
+ pages % 2 == 1) { \
+ addr = __TLBI_VADDR(start, asid); \
+ __tlbi_level(op, addr, tlb_level); \
+ if (tlbi_user) \
+ __tlbi_user_level(op, addr, tlb_level); \
+ start += stride; \
+ pages -= stride >> PAGE_SHIFT; \
+ continue; \
+ } \
+ \
+ num = __TLBI_RANGE_NUM(pages, scale); \
+ if (num >= 0) { \
+ addr = __TLBI_VADDR_RANGE(start, asid, scale, \
+ num, tlb_level); \
+ __tlbi(r##op, addr); \
+ if (tlbi_user) \
+ __tlbi_user(r##op, addr); \
+ start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
+ pages -= __TLBI_RANGE_PAGES(num, scale); \
+ } \
+ scale++; \
+ } \
+} while (0)
+
static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
{
- int num = 0;
- int scale = 0;
- unsigned long asid, addr, pages;
+ unsigned long asid, pages;
start = round_down(start, stride);
end = round_up(end, stride);
@@ -307,56 +354,11 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ishst);
asid = ASID(vma->vm_mm);
- /*
- * When the CPU does not support TLB range operations, flush the TLB
- * entries one by one at the granularity of 'stride'. If the TLB
- * range ops are supported, then:
- *
- * 1. If 'pages' is odd, flush the first page through non-range
- * operations;
- *
- * 2. For remaining pages: the minimum range granularity is decided
- * by 'scale', so multiple range TLBI operations may be required.
- * Start from scale = 0, flush the corresponding number of pages
- * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
- * until no pages left.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
- */
- while (pages > 0) {
- if (!system_supports_tlb_range() ||
- pages % 2 == 1) {
- addr = __TLBI_VADDR(start, asid);
- if (last_level) {
- __tlbi_level(vale1is, addr, tlb_level);
- __tlbi_user_level(vale1is, addr, tlb_level);
- } else {
- __tlbi_level(vae1is, addr, tlb_level);
- __tlbi_user_level(vae1is, addr, tlb_level);
- }
- start += stride;
- pages -= stride >> PAGE_SHIFT;
- continue;
- }
-
- num = __TLBI_RANGE_NUM(pages, scale);
- if (num >= 0) {
- addr = __TLBI_VADDR_RANGE(start, asid, scale,
- num, tlb_level);
- if (last_level) {
- __tlbi(rvale1is, addr);
- __tlbi_user(rvale1is, addr);
- } else {
- __tlbi(rvae1is, addr);
- __tlbi_user(rvae1is, addr);
- }
- start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
- pages -= __TLBI_RANGE_PAGES(num, scale);
- }
- scale++;
- }
+ if (last_level)
+ __flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true);
+ else
+ __flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);
+
dsb(ish);
}
--
2.40.0.634.g4ca3ef3211-goog
Currently, during operations such as a hugepage collapse,
KVM flushes the entire VM's context using the 'vmalls12e1is'
TLBI operation. Specifically, if the VM is faulting on many
hugepages (say after dirty-logging), this creates a performance
penalty for the guest whose pages have already been faulted in
earlier, as they would have to refill their TLBs again.
Instead, call __kvm_tlb_flush_vmid_range() for table entries.
If the system supports it, only the required range will be
flushed. Otherwise, it falls back to the previous mechanism.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d2..b8f0dbd12f773 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -745,10 +745,13 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
* Perform the appropriate TLB invalidation based on the evicted pte
* value (if any).
*/
- if (kvm_pte_table(ctx->old, ctx->level))
- kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
- else if (kvm_pte_valid(ctx->old))
+ if (kvm_pte_table(ctx->old, ctx->level)) {
+ u64 end = ctx->addr + kvm_granule_size(ctx->level);
+
+ kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, ctx->addr, end);
+ } else if (kvm_pte_valid(ctx->old)) {
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+ }
if (stage2_pte_is_counted(ctx->old))
mm_ops->put_page(ctx->ptep);
--
2.40.0.634.g4ca3ef3211-goog
Define __kvm_tlb_flush_vmid_range() (for VHE and nVHE)
to flush a range of stage-2 page-tables using IPA in one go.
If the system supports FEAT_TLBIRANGE, the following patches
can then conveniently replace global TLBIs such as vmalls12e1is
in the map, unmap, and dirty-logging paths with ripas2e1is
instead.
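Kernel-side callers can then invalidate an entire IPA range with a
single hypercall, as the later patches in this series do, e.g.:

	/* Invalidate the stage-2 TLB entries covering [start, end) */
	kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, start, end);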
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 3 +++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++++++++
arch/arm64/kvm/hyp/nvhe/tlb.c | 39 ++++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/tlb.c | 35 +++++++++++++++++++++++++++
4 files changed, 88 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544d..33352d9399e32 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+ __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_range,
};
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
@@ -225,6 +226,8 @@ extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
int level);
+extern void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
+ phys_addr_t start, phys_addr_t end);
extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b0..81d30737dc7c9 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,16 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
}
+static void
+handle___kvm_tlb_flush_vmid_range(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
+ DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
+
+ __kvm_tlb_flush_vmid_range(kern_hyp_va(mmu), start, end);
+}
+
static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +325,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_vcpu_run),
HANDLE_FUNC(__kvm_flush_vm_context),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+ HANDLE_FUNC(__kvm_tlb_flush_vmid_range),
HANDLE_FUNC(__kvm_tlb_flush_vmid),
HANDLE_FUNC(__kvm_flush_cpu_context),
HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f5896..d2504df9d38b6 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,45 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
+ phys_addr_t start, phys_addr_t end)
+{
+ struct tlb_inv_context cxt;
+ unsigned long pages, stride;
+
+ /*
+ * Since the range of addresses may not be mapped at
+ * the same level, assume the worst case as PAGE_SIZE
+ */
+ stride = PAGE_SIZE;
+ start = round_down(start, stride);
+ end = round_up(end, stride);
+ pages = (end - start) >> PAGE_SHIFT;
+
+ if (!system_supports_tlb_range() || pages >= MAX_TLBI_RANGE_PAGES) {
+ __kvm_tlb_flush_vmid(mmu);
+ return;
+ }
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __flush_tlb_range_op(ipas2e1is, start, pages, stride, 0, 0, false);
+
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ /* See the comment below in __kvm_tlb_flush_vmid_ipa() */
+ if (icache_is_vpipt())
+ icache_inval_all_pou();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e9..f34d6dd9e4674 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,41 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
+ phys_addr_t start, phys_addr_t end)
+{
+ struct tlb_inv_context cxt;
+ unsigned long pages, stride;
+
+ /*
+ * Since the range of addresses may not be mapped at
+ * the same level, assume the worst case as PAGE_SIZE
+ */
+ stride = PAGE_SIZE;
+ start = round_down(start, stride);
+ end = round_up(end, stride);
+ pages = (end - start) >> PAGE_SHIFT;
+
+ if (!system_supports_tlb_range() || pages >= MAX_TLBI_RANGE_PAGES) {
+ __kvm_tlb_flush_vmid(mmu);
+ return;
+ }
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __flush_tlb_range_op(ipas2e1is, start, pages, stride, 0, 0, false);
+
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
--
2.40.0.634.g4ca3ef3211-goog
Implement kvm_arch_flush_remote_tlbs_range() for arm64
to invalidate the given range in the TLB.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/mmu.c | 11 +++++++++++
2 files changed, 14 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 17c215a2df7d7..075d3e6482e53 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1044,6 +1044,9 @@ struct kvm *kvm_arch_alloc_vm(void);
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
+#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
+
static inline bool kvm_vm_is_protected(struct kvm *kvm)
{
return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d0a0d3dca9316..e3673b4c10292 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -92,6 +92,17 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
return 0;
}
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
+{
+ phys_addr_t start, end;
+
+ start = start_gfn << PAGE_SHIFT;
+ end = (start_gfn + pages) << PAGE_SHIFT;
+
+ kvm_call_hyp(__kvm_tlb_flush_vmid_range, &kvm->arch.mmu, start, end);
+ return 0;
+}
+
static bool kvm_is_device_pfn(unsigned long pfn)
{
return !pfn_is_map_memory(pfn);
--
2.40.0.634.g4ca3ef3211-goog
After write-protecting the region, KVM currently invalidates
all of the TLB entries using kvm_flush_remote_tlbs(). Instead,
scope the invalidation to only the targeted memslot. If
supported, the architecture uses the range-based TLBI
instructions to flush the memslot; otherwise it falls back to
flushing all of the TLBs.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e3673b4c10292..2ea6eb4ea763e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -996,7 +996,7 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
write_lock(&kvm->mmu_lock);
stage2_wp_range(&kvm->arch.mmu, start, end);
write_unlock(&kvm->mmu_lock);
- kvm_flush_remote_tlbs(kvm);
+ kvm_flush_remote_tlbs_memslot(kvm, memslot);
}
/**
--
2.40.0.634.g4ca3ef3211-goog
Add a 'skip_flush' argument to stage2_put_pte() to
control the TLB invalidations. The upcoming patch will
use this to defer the individual PTE invalidations until
the entire walk is finished.
No functional change intended.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b8f0dbd12f773..3f136e35feb5e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -772,7 +772,7 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
}
static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
- struct kvm_pgtable_mm_ops *mm_ops)
+ struct kvm_pgtable_mm_ops *mm_ops, bool skip_flush)
{
/*
* Clear the existing PTE, and perform break-before-make with
@@ -780,7 +780,10 @@ static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s
*/
if (kvm_pte_valid(ctx->old)) {
kvm_clear_pte(ctx->ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+
+ if (!skip_flush)
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
+ ctx->addr, ctx->level);
}
mm_ops->put_page(ctx->ptep);
@@ -1015,7 +1018,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
- stage2_put_pte(ctx, mmu, mm_ops);
+ stage2_put_pte(ctx, mmu, mm_ops, false);
if (need_flush && mm_ops->dcache_clean_inval_poc)
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
--
2.40.0.634.g4ca3ef3211-goog
The current implementation of the stage-2 unmap walker traverses
the given range and, as part of break-before-make, performs a
TLB invalidation with a DSB for every PTE. For large ranges, this
combination can become a performance bottleneck.
Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
invalidations until the entire walk is finished, and then
use range-based instructions to invalidate the TLBs in one go.
Condition this upon S2FWB in order to avoid walking the page-table
again to perform the CMOs after issuing the TLBI.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Suggested-by: Oliver Upton <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3f136e35feb5e..bcb748e3566c7 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -987,10 +987,16 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
return ret;
}
+struct stage2_unmap_data {
+ struct kvm_pgtable *pgt;
+ bool skip_pte_tlbis;
+};
+
static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit)
{
- struct kvm_pgtable *pgt = ctx->arg;
+ struct stage2_unmap_data *unmap_data = ctx->arg;
+ struct kvm_pgtable *pgt = unmap_data->pgt;
struct kvm_s2_mmu *mmu = pgt->mmu;
struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
kvm_pte_t *childp = NULL;
@@ -1018,7 +1024,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
- stage2_put_pte(ctx, mmu, mm_ops, false);
+ stage2_put_pte(ctx, mmu, mm_ops, unmap_data->skip_pte_tlbis);
if (need_flush && mm_ops->dcache_clean_inval_poc)
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
@@ -1032,13 +1038,32 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
+ int ret;
+ struct stage2_unmap_data unmap_data = {
+ .pgt = pgt,
+ /*
+ * If FEAT_TLBIRANGE is implemented, defer the individial PTE
+ * TLB invalidations until the entire walk is finished, and
+ * then use the range-based TLBI instructions to do the
+ * invalidations. Condition this upon S2FWB in order to avoid
+ * a page-table walk again to perform the CMOs after TLBI.
+ */
+ .skip_pte_tlbis = system_supports_tlb_range() &&
+ stage2_has_fwb(pgt),
+ };
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
- .arg = pgt,
+ .arg = &unmap_data,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
- return kvm_pgtable_walk(pgt, addr, size, &walker);
+ ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+ if (unmap_data.skip_pte_tlbis)
+ /* Perform the deferred TLB invalidations */
+ kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
+ addr, addr + size);
+
+ return ret;
}
struct stage2_attr_data {
--
2.40.0.634.g4ca3ef3211-goog
On Fri, Apr 14, 2023 at 05:29:16PM +0000, Raghavendra Rao Ananta wrote:
> Currently, the core TLB flush functionality of __flush_tlb_range()
> hardcodes vae1is (and variants) for the flush operation. In the
> upcoming patches, the KVM code reuses this core algorithm with
> ipas2e1is for range based TLB invalidations based on the IPA.
> Hence, extract the core flush functionality of __flush_tlb_range()
> into its own macro that accepts an 'op' argument to pass any
> TLBI operation, such that other callers (KVM) can benefit.
>
> No functional changes intended.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Hi Raghavendra,
On Fri, Apr 14, 2023 at 05:29:17PM +0000, Raghavendra Rao Ananta wrote:
> Define __kvm_tlb_flush_vmid_range() (for VHE and nVHE)
> to flush a range of stage-2 page-tables using IPA in one go.
> If the system supports FEAT_TLBIRANGE, the following patches
> would conviniently replace global TLBI such as vmalls12e1is
> in the map, unmap, and dirty-logging paths with ripas2e1is
> instead.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 3 +++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++++++++
> arch/arm64/kvm/hyp/nvhe/tlb.c | 39 ++++++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/vhe/tlb.c | 35 +++++++++++++++++++++++++++
> 4 files changed, 88 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 43c3bc0f9544d..33352d9399e32 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
> __KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
> __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
> __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
> + __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_range,
> };
>
> #define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
> @@ -225,6 +226,8 @@ extern void __kvm_flush_vm_context(void);
> extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> int level);
> +extern void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> + phys_addr_t start, phys_addr_t end);
> extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
>
> extern void __kvm_timer_set_cntvoff(u64 cntvoff);
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 728e01d4536b0..81d30737dc7c9 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -125,6 +125,16 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
> }
>
> +static void
> +handle___kvm_tlb_flush_vmid_range(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> + DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
> + DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
> +
> + __kvm_tlb_flush_vmid_range(kern_hyp_va(mmu), start, end);
> +}
> +
> static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
> {
> DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> @@ -315,6 +325,7 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__kvm_vcpu_run),
> HANDLE_FUNC(__kvm_flush_vm_context),
> HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
> + HANDLE_FUNC(__kvm_tlb_flush_vmid_range),
> HANDLE_FUNC(__kvm_tlb_flush_vmid),
> HANDLE_FUNC(__kvm_flush_cpu_context),
> HANDLE_FUNC(__kvm_timer_set_cntvoff),
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index d296d617f5896..d2504df9d38b6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -109,6 +109,45 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
> __tlb_switch_to_host(&cxt);
> }
>
> +void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> + phys_addr_t start, phys_addr_t end)
> +{
> + struct tlb_inv_context cxt;
> + unsigned long pages, stride;
> +
> + /*
> + * Since the range of addresses may not be mapped at
> + * the same level, assume the worst case as PAGE_SIZE
> + */
> + stride = PAGE_SIZE;
> + start = round_down(start, stride);
> + end = round_up(end, stride);
> + pages = (end - start) >> PAGE_SHIFT;
> +
> + if (!system_supports_tlb_range() || pages >= MAX_TLBI_RANGE_PAGES) {
> + __kvm_tlb_flush_vmid(mmu);
> + return;
> + }
> +
> + dsb(ishst);
Due to some concerns w.r.t. the imprecision of the architecture for
synchronizing with the table walkers, these preamble barriers were
upgraded to dsb(ish) in commit 7e1b2329c205 ("KVM: arm64: nvhe:
Synchronise with page table walker on TLBI"). Good news is, the magic is
behind __tlb_switch_to_guest(), so just drop these barriers when you
rebase the series onto a 6.4 rc.
--
Thanks,
Oliver
Hi Raghavendra,
On Fri, Apr 14, 2023 at 05:29:22PM +0000, Raghavendra Rao Ananta wrote:
> The current implementation of the stage-2 unmap walker traverses
> the given range and, as a part of break-before-make, performs
> TLB invalidations with a DSB for every PTE. A multitude of this
> combination could cause a performance bottleneck.
>
> Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> invalidations until the entire walk is finished, and then
> use range-based instructions to invalidate the TLBs in one go.
> Condition this upon S2FWB in order to avoid walking the page-table
> again to perform the CMOs after issuing the TLBI.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> Suggested-by: Oliver Upton <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 33 +++++++++++++++++++++++++++++----
> 1 file changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3f136e35feb5e..bcb748e3566c7 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -987,10 +987,16 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> return ret;
> }
>
> +struct stage2_unmap_data {
> + struct kvm_pgtable *pgt;
> + bool skip_pte_tlbis;
> +};
> +
> static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> enum kvm_pgtable_walk_flags visit)
> {
> - struct kvm_pgtable *pgt = ctx->arg;
> + struct stage2_unmap_data *unmap_data = ctx->arg;
> + struct kvm_pgtable *pgt = unmap_data->pgt;
> struct kvm_s2_mmu *mmu = pgt->mmu;
> struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> kvm_pte_t *childp = NULL;
> @@ -1018,7 +1024,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> * block entry and rely on the remaining portions being faulted
> * back lazily.
> */
> - stage2_put_pte(ctx, mmu, mm_ops, false);
> + stage2_put_pte(ctx, mmu, mm_ops, unmap_data->skip_pte_tlbis);
>
> if (need_flush && mm_ops->dcache_clean_inval_poc)
> mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> @@ -1032,13 +1038,32 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>
> int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> {
> + int ret;
> + struct stage2_unmap_data unmap_data = {
> + .pgt = pgt,
> + /*
> + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> + * TLB invalidations until the entire walk is finished, and
> + * then use the range-based TLBI instructions to do the
> + * invalidations. Condition this upon S2FWB in order to avoid
> + * a page-table walk again to perform the CMOs after TLBI.
> + */
> + .skip_pte_tlbis = system_supports_tlb_range() &&
> + stage2_has_fwb(pgt),
Why can't the underlying walker just call these two helpers directly?
There are static keys behind these...
> + };
> struct kvm_pgtable_walker walker = {
> .cb = stage2_unmap_walker,
> - .arg = pgt,
> + .arg = &unmap_data,
> .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> };
>
> - return kvm_pgtable_walk(pgt, addr, size, &walker);
> + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> + if (unmap_data.skip_pte_tlbis)
> + /* Perform the deferred TLB invalidations */
> + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> + addr, addr + size);
> +
> + return ret;
> }
>
> struct stage2_attr_data {
> --
> 2.40.0.634.g4ca3ef3211-goog
>
--
Thanks,
Oliver
Hi Raghavendra,
On Fri, Apr 14, 2023 at 05:29:21PM +0000, Raghavendra Rao Ananta wrote:
> Add a 'skip_flush' argument in stage2_put_pte() to
> control the TLB invalidations. This will be leveraged
> by the upcoming patch to defer the individual PTE
> invalidations until the entire walk is finished.
>
> No functional change intended.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b8f0dbd12f773..3f136e35feb5e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -772,7 +772,7 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> }
>
> static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> - struct kvm_pgtable_mm_ops *mm_ops)
> + struct kvm_pgtable_mm_ops *mm_ops, bool skip_flush)
Assuming you are going to pull the cpufeature checks into this helper,
it might be helpful to narrow the scope of it. 'stage2_put_pte()' sounds
very generic, but it is about to have a very precise meaning in relation
to kvm_pgtable_stage2_unmap().
So maybe stage2_unmap_put_pte()? While at it, you'd want to have a
shared helper for the deferral check:
static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
{
/* your blurb for why FWB is required too */
return system_supports_tlb_range() && stage2_has_fwb(pgt);
}
The 'flush' part is annoying, because the exact term is an invalidation,
but we already have that pattern in all of our TLB invalidation helpers.
> {
> /*
> * Clear the existing PTE, and perform break-before-make with
> @@ -780,7 +780,10 @@ static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s
> */
> if (kvm_pte_valid(ctx->old)) {
> kvm_clear_pte(ctx->ptep);
> - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> +
> + if (!skip_flush)
> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> + ctx->addr, ctx->level);
> }
>
> mm_ops->put_page(ctx->ptep);
> @@ -1015,7 +1018,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> * block entry and rely on the remaining portions being faulted
> * back lazily.
> */
> - stage2_put_pte(ctx, mmu, mm_ops);
> + stage2_put_pte(ctx, mmu, mm_ops, false);
>
> if (need_flush && mm_ops->dcache_clean_inval_poc)
> mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> --
> 2.40.0.634.g4ca3ef3211-goog
>
--
Thanks,
Oliver
Hi Oliver,
On Fri, May 12, 2023 at 9:50 AM Oliver Upton <[email protected]> wrote:
>
> Hi Raghavendra,
>
> On Fri, Apr 14, 2023 at 05:29:17PM +0000, Raghavendra Rao Ananta wrote:
> > Define __kvm_tlb_flush_vmid_range() (for VHE and nVHE)
> > to flush a range of stage-2 page-tables using IPA in one go.
> > If the system supports FEAT_TLBIRANGE, the following patches
> > would conviniently replace global TLBI such as vmalls12e1is
> > in the map, unmap, and dirty-logging paths with ripas2e1is
> > instead.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_asm.h | 3 +++
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +++++++++
> > arch/arm64/kvm/hyp/nvhe/tlb.c | 39 ++++++++++++++++++++++++++++++
> > arch/arm64/kvm/hyp/vhe/tlb.c | 35 +++++++++++++++++++++++++++
> > 4 files changed, 88 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 43c3bc0f9544d..33352d9399e32 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
> > __KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
> > __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
> > __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
> > + __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_range,
> > };
> >
> > #define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
> > @@ -225,6 +226,8 @@ extern void __kvm_flush_vm_context(void);
> > extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> > extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> > int level);
> > +extern void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> > + phys_addr_t start, phys_addr_t end);
> > extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
> >
> > extern void __kvm_timer_set_cntvoff(u64 cntvoff);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index 728e01d4536b0..81d30737dc7c9 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -125,6 +125,16 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> > __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
> > }
> >
> > +static void
> > +handle___kvm_tlb_flush_vmid_range(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> > + DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
> > + DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
> > +
> > + __kvm_tlb_flush_vmid_range(kern_hyp_va(mmu), start, end);
> > +}
> > +
> > static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
> > {
> > DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> > @@ -315,6 +325,7 @@ static const hcall_t host_hcall[] = {
> > HANDLE_FUNC(__kvm_vcpu_run),
> > HANDLE_FUNC(__kvm_flush_vm_context),
> > HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
> > + HANDLE_FUNC(__kvm_tlb_flush_vmid_range),
> > HANDLE_FUNC(__kvm_tlb_flush_vmid),
> > HANDLE_FUNC(__kvm_flush_cpu_context),
> > HANDLE_FUNC(__kvm_timer_set_cntvoff),
> > diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> > index d296d617f5896..d2504df9d38b6 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> > @@ -109,6 +109,45 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
> > __tlb_switch_to_host(&cxt);
> > }
> >
> > +void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> > + phys_addr_t start, phys_addr_t end)
> > +{
> > + struct tlb_inv_context cxt;
> > + unsigned long pages, stride;
> > +
> > + /*
> > + * Since the range of addresses may not be mapped at
> > + * the same level, assume the worst case as PAGE_SIZE
> > + */
> > + stride = PAGE_SIZE;
> > + start = round_down(start, stride);
> > + end = round_up(end, stride);
> > + pages = (end - start) >> PAGE_SHIFT;
> > +
> > + if (!system_supports_tlb_range() || pages >= MAX_TLBI_RANGE_PAGES) {
> > + __kvm_tlb_flush_vmid(mmu);
> > + return;
> > + }
> > +
> > + dsb(ishst);
>
> Due to some concerns w.r.t. the imprecision of the architecture for
> synchronizing with the table walkers, these preamble barriers were
> upgraded to dsb(ish) in commit 7e1b2329c205 ("KVM: arm64: nvhe:
> Synchronise with page table walker on TLBI"). Good news is, the magic is
> behind __tlb_switch_to_guest(), so just drop these barriers when you
> rebase the series onto a 6.4 rc.
>
Thanks for the info! I'll rebase the series and take care of this.
- Raghavendra
> --
> Thanks,
> Oliver
On Fri, May 12, 2023 at 10:02 AM Oliver Upton <[email protected]> wrote:
>
> Hi Raghavendra,
>
> On Fri, Apr 14, 2023 at 05:29:22PM +0000, Raghavendra Rao Ananta wrote:
> > The current implementation of the stage-2 unmap walker traverses
> > the given range and, as a part of break-before-make, performs
> > TLB invalidations with a DSB for every PTE. A multitude of this
> > combination could cause a performance bottleneck.
> >
> > Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
> > invalidations until the entire walk is finished, and then
> > use range-based instructions to invalidate the TLBs in one go.
> > Condition this upon S2FWB in order to avoid walking the page-table
> > again to perform the CMOs after issuing the TLBI.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > Suggested-by: Oliver Upton <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 33 +++++++++++++++++++++++++++++----
> > 1 file changed, 29 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 3f136e35feb5e..bcb748e3566c7 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -987,10 +987,16 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > return ret;
> > }
> >
> > +struct stage2_unmap_data {
> > + struct kvm_pgtable *pgt;
> > + bool skip_pte_tlbis;
> > +};
> > +
> > static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > enum kvm_pgtable_walk_flags visit)
> > {
> > - struct kvm_pgtable *pgt = ctx->arg;
> > + struct stage2_unmap_data *unmap_data = ctx->arg;
> > + struct kvm_pgtable *pgt = unmap_data->pgt;
> > struct kvm_s2_mmu *mmu = pgt->mmu;
> > struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > kvm_pte_t *childp = NULL;
> > @@ -1018,7 +1024,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > * block entry and rely on the remaining portions being faulted
> > * back lazily.
> > */
> > - stage2_put_pte(ctx, mmu, mm_ops, false);
> > + stage2_put_pte(ctx, mmu, mm_ops, unmap_data->skip_pte_tlbis);
> >
> > if (need_flush && mm_ops->dcache_clean_inval_poc)
> > mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> > @@ -1032,13 +1038,32 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >
> > int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > {
> > + int ret;
> > + struct stage2_unmap_data unmap_data = {
> > + .pgt = pgt,
> > + /*
> > + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> > + * TLB invalidations until the entire walk is finished, and
> > + * then use the range-based TLBI instructions to do the
> > + * invalidations. Condition this upon S2FWB in order to avoid
> > + * a page-table walk again to perform the CMOs after TLBI.
> > + */
> > + .skip_pte_tlbis = system_supports_tlb_range() &&
> > + stage2_has_fwb(pgt),
>
> Why can't the underlying walker just call these two helpers directly?
> There are static keys behind these...
>
I wasn't aware of that. Although, I tried to look into the
definitions, but couldn't understand how static keys are at play here.
By any chance are you referring to the alternative_has_feature_*()
implementations when checking for cpu capabilities?
Thank you.
Raghavendra
> > + };
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_unmap_walker,
> > - .arg = pgt,
> > + .arg = &unmap_data,
> > .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
> > };
> >
> > - return kvm_pgtable_walk(pgt, addr, size, &walker);
> > + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> > + if (unmap_data.skip_pte_tlbis)
> > + /* Perform the deferred TLB invalidations */
> > + kvm_call_hyp(__kvm_tlb_flush_vmid_range, pgt->mmu,
> > + addr, addr + size);
> > +
> > + return ret;
> > }
> >
> > struct stage2_attr_data {
> > --
> > 2.40.0.634.g4ca3ef3211-goog
> >
>
> --
> Thanks,
> Oliver
Hi Oliver,
On Fri, May 12, 2023 at 10:21 AM Oliver Upton <[email protected]> wrote:
>
> Hi Raghavendra,
>
> On Fri, Apr 14, 2023 at 05:29:21PM +0000, Raghavendra Rao Ananta wrote:
> > Add a 'skip_flush' argument in stage2_put_pte() to
> > control the TLB invalidations. This will be leveraged
> > by the upcoming patch to defer the individual PTE
> > invalidations until the entire walk is finished.
> >
> > No functional change intended.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index b8f0dbd12f773..3f136e35feb5e 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -772,7 +772,7 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
> > }
> >
> > static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s2_mmu *mmu,
> > - struct kvm_pgtable_mm_ops *mm_ops)
> > + struct kvm_pgtable_mm_ops *mm_ops, bool skip_flush)
>
> Assuming you are going to pull the cpufeature checks into this helper,
> it might me helpful to narrow the scope of it. 'stage2_put_pte()' sounds
> very generic, but it is about to have a very precise meaning in relation
> to kvm_pgtable_stage2_unmap().
>
> So maybe stage2_unmap_put_pte()? While at it, you'd want to have a
> shared helper for the deferral check:
>
Yeah, stage2_unmap_put_pte() sounds better. I'll change that.
> static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
> {
> /* your blurb for why FWB is required too */
> return system_supports_tlb_range() && stage2_has_fwb(pgt);
> }
>
Good idea; I can introduce the helper, now that we'll get rid of
stage2_unmap_data.skip_pte_tlbis (patch 7/7) as per your comments.
Also, since we are now making stage2_put_pte() specific to unmap, I
can also get rid of the 'skip_flush' arg and call
stage2_unmap_defer_tlb_flush() directly, or do you have a preference
for the additional arg?
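For reference, roughly what I have in mind (just a sketch, assuming
ctx->arg goes back to being the kvm_pgtable as before this series):

static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
{
	/*
	 * Defer the TLBIs only if a single range-based invalidation can
	 * cover the walk and no CMOs are needed afterwards (i.e. FWB).
	 */
	return system_supports_tlb_range() && stage2_has_fwb(pgt);
}

static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
				 struct kvm_s2_mmu *mmu,
				 struct kvm_pgtable_mm_ops *mm_ops)
{
	struct kvm_pgtable *pgt = ctx->arg;

	/* Clear the PTE; invalidate per-entry only when not deferring. */
	if (kvm_pte_valid(ctx->old)) {
		kvm_clear_pte(ctx->ptep);

		if (!stage2_unmap_defer_tlb_flush(pgt))
			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
				     ctx->addr, ctx->level);
	}

	mm_ops->put_page(ctx->ptep);
}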
Thank you.
Raghavendra
> The 'flush' part is annoying, because the exact term is an invalidation,
> but we already have that pattern in all of our TLB invalidation helpers.
>
> > {
> > /*
> > * Clear the existing PTE, and perform break-before-make with
> > @@ -780,7 +780,10 @@ static void stage2_put_pte(const struct kvm_pgtable_visit_ctx *ctx, struct kvm_s
> > */
> > if (kvm_pte_valid(ctx->old)) {
> > kvm_clear_pte(ctx->ptep);
> > - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> > +
> > + if (!skip_flush)
> > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
> > + ctx->addr, ctx->level);
> > }
> >
> > mm_ops->put_page(ctx->ptep);
> > @@ -1015,7 +1018,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > * block entry and rely on the remaining portions being faulted
> > * back lazily.
> > */
> > - stage2_put_pte(ctx, mmu, mm_ops);
> > + stage2_put_pte(ctx, mmu, mm_ops, false);
> >
> > if (need_flush && mm_ops->dcache_clean_inval_poc)
> > mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
> > --
> > 2.40.0.634.g4ca3ef3211-goog
> >
>
> --
> Thanks,
> Oliver
On Tue, May 16, 2023 at 10:21:33AM -0700, Raghavendra Rao Ananta wrote:
> On Fri, May 12, 2023 at 10:02 AM Oliver Upton <[email protected]> wrote:
> > > int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > > {
> > > + int ret;
> > > + struct stage2_unmap_data unmap_data = {
> > > + .pgt = pgt,
> > > + /*
> > > + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> > > + * TLB invalidations until the entire walk is finished, and
> > > + * then use the range-based TLBI instructions to do the
> > > + * invalidations. Condition this upon S2FWB in order to avoid
> > > + * a page-table walk again to perform the CMOs after TLBI.
> > > + */
> > > + .skip_pte_tlbis = system_supports_tlb_range() &&
> > > + stage2_has_fwb(pgt),
> >
> > Why can't the underlying walker just call these two helpers directly?
> > There are static keys behind these...
> >
> I wasn't aware of that. Although, I tried to look into the
> definitions, but couldn't understand how static keys are at play here.
> By any chance are you referring to the alternative_has_feature_*()
> implementations when checking for cpu capabilities?
Ah, right, these were recently changed to rely on alternative patching
in commit 21fb26bfb01f ("arm64: alternatives: add alternative_has_feature_*()").
Even still, the significance remains as the alternative patching
completely eliminates a conditional branch on the presence of a
particular feature.
Initializing a local with the presence/absence of a feature defeats such
an optimization.
--
Thanks,
Oliver
On Tue, May 16, 2023 at 11:46 AM Oliver Upton <[email protected]> wrote:
>
> On Tue, May 16, 2023 at 10:21:33AM -0700, Raghavendra Rao Ananta wrote:
> > On Fri, May 12, 2023 at 10:02 AM Oliver Upton <[email protected]> wrote:
> > > > int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > > > {
> > > > + int ret;
> > > > + struct stage2_unmap_data unmap_data = {
> > > > + .pgt = pgt,
> > > > + /*
> > > > + * If FEAT_TLBIRANGE is implemented, defer the individial PTE
> > > > + * TLB invalidations until the entire walk is finished, and
> > > > + * then use the range-based TLBI instructions to do the
> > > > + * invalidations. Condition this upon S2FWB in order to avoid
> > > > + * a page-table walk again to perform the CMOs after TLBI.
> > > > + */
> > > > + .skip_pte_tlbis = system_supports_tlb_range() &&
> > > > + stage2_has_fwb(pgt),
> > >
> > > Why can't the underlying walker just call these two helpers directly?
> > > There are static keys behind these...
> > >
> > I wasn't aware of that. Although, I tried to look into the
> > definitions, but couldn't understand how static keys are at play here.
> > By any chance are you referring to the alternative_has_feature_*()
> > implementations when checking for cpu capabilities?
>
> Ah, right, these were recently changed to rely on alternative patching
> in commit 21fb26bfb01f ("arm64: alternatives: add alternative_has_feature_*()").
> Even still, the significance remains as the alternative patching
> completely eliminates a conditional branch on the presence of a
> particular feature.
>
> Initializing a local with the presence/absence of a feature defeats such
> an optimization.
>
Thanks for the info! The stage2_unmap_defer_tlb_flush() helper
(introduced in patch 7/7) will call these functions directly as
needed and get rid of 'skip_pte_tlbis'.
- Raghavendra
> --
> Thanks,
> Oliver