In certain code paths, KVM/ARM currently invalidates the entire VM's
page-tables instead of just the necessary range. For example, when
collapsing a table PTE into a block PTE, instead of iterating over
each PTE and flushing them, KVM uses the 'vmalls12e1is' TLBI operation
to flush all the entries. This is inefficient since the guest would
have to refill the TLBs again, even for the addresses that aren't
covered by the table entry. The performance impact scales poorly if
many addresses in the VM are going through this remapping.
For architectures that implement FEAT_TLBIRANGE, KVM can replace such
inefficient paths by performing the invalidations only on the range of
addresses that are in scope. This series tries to achieve the same in
the areas of stage-2 map, unmap and write-protecting the pages.
Patch-1 refactors arm64's core __flush_tlb_range() so that it can be
reused by other entities.
Patch-2 adds a generic range-based TLBI mechanism for KVM.
Patch-3 adds support to flush a range of IPAs for KVM.
Patch-4 implements the kvm_arch_flush_remote_tlbs_range() for arm64.
Patch-5 aims to flush only the memslot that undergoes a write-protect,
instead of the entire VM.
Patch-6 modifies stage2_try_break_pte() to use the range-based
TLBI instructions when breaking a table entry. The map path is the
immediate consumer of this when KVM remaps a table entry into a block.
Patch-7 introduces a fast stage-2 unmap path in which, under the right
conditions, instead of traversing and unmapping each and every PTE, KVM
disconnects the PTE at a higher level (say, at level 1 for a 4K page
size) and unmaps the table entries using free_removed_table(). This
allows KVM to use the range-based TLBI to flush the entire range
governed at that level.
The series is based on upstream v6.2-rc6, with David Matlack's common
API for TLB invalidations [1] applied on top.
The performance evaluation was done on hardware that supports
FEAT_TLBIRANGE, in a VHE configuration, using a modified kvm_page_table_test.
The modified version updates the guest code in the ADJUST_MAPPINGS case
to access not only the current page, but also up to 512 pages backwards
for every new page it iterates through. This is done to test the effect
of TLB misses after KVM has handled a fault.
The series captures the impact in the map and unmap paths as described above.
$ kvm_page_table_test -m 2 -v 128 -s anonymous_hugetlb_2mb -b $i
+--------+------------------------------+------------------------------+
| mem_sz | ADJUST_MAPPINGS (s) | Unmap VM (s) |
| (GB) | Baseline | Baseline + series | Baseline | Baseline + series |
+--------+----------+-------------------+----------+-------------------+
| 1 | 4.15 | 4.26 | 0.50 | 0.007 |
| 2 | 6.09 | 6.08 | 0.50 | 0.009 |
| 4 | 12.65 | 11.46 | 0.50 | 0.01 |
| 8 | 25.35 | 24.75 | 0.52 | 0.02 |
| 16 | 52.17 | 48.23 | 0.53 | 0.03 |
| 32 | 100.09 | 84.53 | 0.57 | 0.06 |
| 64 | 176.46 | 166.96 | 0.75 | 0.11 |
| 128 | 340.22 | 302.82 | 0.81 | 0.20 |
+--------+----------+-------------------+----------+-------------------+
$ kvm_page_table_test -m 2 -b 128G -s anonymous_hugetlb_2mb -v $i
+--------+------------------------------+
| vCPUs | ADJUST_MAPPINGS (s) |
| | Baseline | Baseline + series |
+--------+----------+-------------------+
| 1 | 153.91 | 148.75 |
| 2 | 188.17 | 176.11 |
| 4 | 193.15 | 175.77 |
| 8 | 195.60 | 184.92 |
| 16 | 183.49 | 170.22 |
| 32 | 159.37 | 152.70 |
| 64 | 190.15 | 180.45 |
| 128 | 340.22 | 302.82 |
+--------+----------+-------------------+
For the ADJUST_MAPPINGS cases, which map the 4K table entries back into
2M hugepages, the series sees an average improvement of ~7%. For unmapping
2M hugepages, we see at least a 4x improvement.
$ kvm_page_table_test -m 2 -b $i
+--------+------------------------------+
| mem_sz | Unmap VM (s) |
| (GB) | Baseline | Baseline + series |
+--------+----------+-------------------+
| 1 | 1.03 | 0.58 |
| 2 | 1.57 | 0.72 |
| 4 | 2.65 | 0.98 |
| 8 | 4.77 | 1.54 |
| 16 | 9.06 | 2.57 |
| 32 | 17.60 | 4.41 |
| 64 | 34.72 | 8.92 |
| 128 | 68.92 | 17.70 |
+--------+----------+-------------------+
The 4x improvement for unmapping also holds true when the guest is
backed by PAGE_SIZE (4K) pages.
v2:
- Rebased the series on top of David Matlack's series for common
TLB invalidation API[1].
- Implement kvm_arch_flush_remote_tlbs_range() for arm64, by extending
the support introduced by [1].
- Use kvm_flush_remote_tlbs_memslot() introduced by [1] to flush
only the current memslot after write-protect.
- Modified the __kvm_tlb_flush_range() macro to accept 'level' as an
  argument to calculate the 'stride' instead of just using PAGE_SIZE.
- Split the introduction of the range-based TLBI to KVM and the
  implementation of IPA-based invalidation into separate patches.
- Dropped the patch that tries to optimize the mmu notifiers paths.
- Rename the function kvm_table_pte_flush() to
kvm_pgtable_stage2_flush_range(), and accept the range of addresses to
flush. [Oliver]
- Drop the 'tlb_level' argument for stage2_try_break_pte() and directly
pass '0' as 'tlb_level' to kvm_pgtable_stage2_flush_range(). [Oliver]
v1: https://lore.kernel.org/all/[email protected]/
Thank you.
Raghavendra
[1]: https://lore.kernel.org/linux-arm-kernel/[email protected]/
Raghavendra Rao Ananta (7):
arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range
KVM: arm64: Add FEAT_TLBIRANGE support
KVM: arm64: Implement __kvm_tlb_flush_range_vmid_ipa()
KVM: arm64: Implement kvm_arch_flush_remote_tlbs_range()
KVM: arm64: Flush only the memslot after write-protect
KVM: arm64: Break the table entries using TLBI range instructions
KVM: arm64: Create a fast stage-2 unmap path
arch/arm64/include/asm/kvm_asm.h | 21 ++++++
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/include/asm/tlbflush.h | 107 +++++++++++++++--------------
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 ++++
arch/arm64/kvm/hyp/nvhe/tlb.c | 28 ++++++++
arch/arm64/kvm/hyp/pgtable.c | 67 +++++++++++++++++-
arch/arm64/kvm/hyp/vhe/tlb.c | 24 +++++++
arch/arm64/kvm/mmu.c | 17 ++++-
8 files changed, 222 insertions(+), 57 deletions(-)
--
2.39.1.519.gcb327c4b5f-goog
Currently, the core TLB flush functionality of __flush_tlb_range()
hardcodes vae1is (and variants) for the flush operation. In the
upcoming patches, the KVM code reuses this core algorithm with
ipas2e1is for range based TLB invalidations based on the IPA.
Hence, extract the core flush functionality of __flush_tlb_range()
into its own macro that accepts an 'op' argument to pass any
TLBI operation, such that other callers (KVM) can benefit.
No functional changes intended.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 107 +++++++++++++++---------------
1 file changed, 54 insertions(+), 53 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25d..9a57eae14e576 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -278,14 +278,60 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
*/
#define MAX_TLBI_OPS PTRS_PER_PTE
+/* When the CPU does not support TLB range operations, flush the TLB
+ * entries one by one at the granularity of 'stride'. If the TLB
+ * range ops are supported, then:
+ *
+ * 1. If 'pages' is odd, flush the first page through non-range
+ * operations;
+ *
+ * 2. For remaining pages: the minimum range granularity is decided
+ * by 'scale', so multiple range TLBI operations may be required.
+ * Start from scale = 0, flush the corresponding number of pages
+ * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
+ * until no pages left.
+ *
+ * Note that certain ranges can be represented by either num = 31 and
+ * scale or num = 0 and scale + 1. The loop below favours the latter
+ * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
+ */
+#define __flush_tlb_range_op(op, start, pages, stride, asid, tlb_level, tlbi_user) do { \
+ int num = 0; \
+ int scale = 0; \
+ unsigned long addr; \
+ \
+ while (pages > 0) { \
+ if (!system_supports_tlb_range() || \
+ pages % 2 == 1) { \
+ addr = __TLBI_VADDR(start, asid); \
+ __tlbi_level(op, addr, tlb_level); \
+ if (tlbi_user) \
+ __tlbi_user_level(op, addr, tlb_level); \
+ start += stride; \
+ pages -= stride >> PAGE_SHIFT; \
+ continue; \
+ } \
+ \
+ num = __TLBI_RANGE_NUM(pages, scale); \
+ if (num >= 0) { \
+ addr = __TLBI_VADDR_RANGE(start, asid, scale, \
+ num, tlb_level); \
+ __tlbi(r##op, addr); \
+ if (tlbi_user) \
+ __tlbi_user(r##op, addr); \
+ start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
+ pages -= __TLBI_RANGE_PAGES(num, scale); \
+ } \
+ scale++; \
+ } \
+} while (0)
+
static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
{
- int num = 0;
- int scale = 0;
- unsigned long asid, addr, pages;
+ unsigned long asid, pages;
start = round_down(start, stride);
end = round_up(end, stride);
@@ -307,56 +353,11 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ishst);
asid = ASID(vma->vm_mm);
- /*
- * When the CPU does not support TLB range operations, flush the TLB
- * entries one by one at the granularity of 'stride'. If the TLB
- * range ops are supported, then:
- *
- * 1. If 'pages' is odd, flush the first page through non-range
- * operations;
- *
- * 2. For remaining pages: the minimum range granularity is decided
- * by 'scale', so multiple range TLBI operations may be required.
- * Start from scale = 0, flush the corresponding number of pages
- * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
- * until no pages left.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
- */
- while (pages > 0) {
- if (!system_supports_tlb_range() ||
- pages % 2 == 1) {
- addr = __TLBI_VADDR(start, asid);
- if (last_level) {
- __tlbi_level(vale1is, addr, tlb_level);
- __tlbi_user_level(vale1is, addr, tlb_level);
- } else {
- __tlbi_level(vae1is, addr, tlb_level);
- __tlbi_user_level(vae1is, addr, tlb_level);
- }
- start += stride;
- pages -= stride >> PAGE_SHIFT;
- continue;
- }
-
- num = __TLBI_RANGE_NUM(pages, scale);
- if (num >= 0) {
- addr = __TLBI_VADDR_RANGE(start, asid, scale,
- num, tlb_level);
- if (last_level) {
- __tlbi(rvale1is, addr);
- __tlbi_user(rvale1is, addr);
- } else {
- __tlbi(rvae1is, addr);
- __tlbi_user(rvae1is, addr);
- }
- start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
- pages -= __TLBI_RANGE_PAGES(num, scale);
- }
- scale++;
- }
+ if (last_level)
+ __flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true);
+ else
+ __flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);
+
dsb(ish);
}
--
2.39.1.519.gcb327c4b5f-goog
Define a generic function __kvm_tlb_flush_range() to
invalidate the TLBs over a range of addresses. The
implementation accepts 'op' as a generic TLBI operation.
Upcoming patches will use this to implement IPA based
TLB invalidations (ipas2e1is).
If the system doesn't support FEAT_TLBIRANGE, the
implementation falls back to flushing the pages one by one
for the range supplied.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544d..995ff048e8851 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
#define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
+#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do { \
+ unsigned long pages, stride; \
+ \
+ stride = kvm_granule_size(level); \
+ start = round_down(start, stride); \
+ end = round_up(end, stride); \
+ pages = (end - start) >> PAGE_SHIFT; \
+ \
+ if ((!system_supports_tlb_range() && \
+ (end - start) >= (MAX_TLBI_OPS * stride)) || \
+ pages >= MAX_TLBI_RANGE_PAGES) { \
+ __kvm_tlb_flush_vmid(mmu); \
+ break; \
+ } \
+ \
+ __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
+} while (0)
+
extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
--
2.39.1.519.gcb327c4b5f-goog
Implement kvm_arch_flush_remote_tlbs_range() for arm64,
such that it can utilize the TLBI range based instructions
if supported.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/mmu.c | 15 +++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index dee530d75b957..211fab0c1de74 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1002,6 +1002,9 @@ struct kvm *kvm_arch_alloc_vm(void);
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
+#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
+
static inline bool kvm_vm_is_protected(struct kvm *kvm)
{
return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e98910a8d0af6..409cb187f4911 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -91,6 +91,21 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
return 0;
}
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
+{
+ phys_addr_t start, end;
+
+ if (!system_supports_tlb_range())
+ return -EOPNOTSUPP;
+
+ start = start_gfn << PAGE_SHIFT;
+ end = (start_gfn + pages) << PAGE_SHIFT;
+
+ kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, &kvm->arch.mmu,
+ start, end, KVM_PGTABLE_MAX_LEVELS - 1, 0);
+ return 0;
+}
+
static bool kvm_is_device_pfn(unsigned long pfn)
{
return !pfn_is_map_memory(pfn);
--
2.39.1.519.gcb327c4b5f-goog
Define __kvm_tlb_flush_range_vmid_ipa() (for VHE and nVHE)
to flush a range of stage-2 page-tables using IPA in one go.
If the system supports FEAT_TLBIRANGE, the following patches will use
this to conveniently replace global TLBI operations such as vmalls12e1is
in the map, unmap, and dirty-logging paths with ripas2e1is.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 3 +++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 ++++++++++++
arch/arm64/kvm/hyp/nvhe/tlb.c | 28 ++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/tlb.c | 24 ++++++++++++++++++++++++
4 files changed, 67 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 995ff048e8851..80a8ea85e84f8 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+ __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_range_vmid_ipa,
};
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
@@ -243,6 +244,8 @@ extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
int level);
+extern void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level, int tlb_level);
extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b0..5787eee4c9fe4 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,17 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
}
+static void handle___kvm_tlb_flush_range_vmid_ipa(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
+ DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
+ DECLARE_REG(int, level, host_ctxt, 4);
+ DECLARE_REG(int, tlb_level, host_ctxt, 5);
+
+ __kvm_tlb_flush_range_vmid_ipa(kern_hyp_va(mmu), start, end, level, tlb_level);
+}
+
static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +326,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_vcpu_run),
HANDLE_FUNC(__kvm_flush_vm_context),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+ HANDLE_FUNC(__kvm_tlb_flush_range_vmid_ipa),
HANDLE_FUNC(__kvm_tlb_flush_vmid),
HANDLE_FUNC(__kvm_flush_cpu_context),
HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f5896..7398dd00445e7 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,34 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level, int tlb_level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level, tlb_level);
+
+ /*
+ * Range-based ipas2e1is flushes only Stage-2 entries, and since the
+ * VA isn't available for Stage-1 entries, flush the entire stage-1.
+ */
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ /* See the comment below in __kvm_tlb_flush_vmid_ipa() */
+ if (icache_is_vpipt())
+ icache_inval_all_pou();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e9..e9c1d69f7ddf7 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,30 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
+ phys_addr_t end, int level, int tlb_level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(ishst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level, tlb_level);
+
+ /*
+ * Range-based ipas2e1is flushes only Stage-2 entries, and since the
+ * VA isn't available for Stage-1 entries, flush the entire stage-1.
+ */
+ dsb(ish);
+ __tlbi(vmalle1is);
+ dsb(ish);
+ isb();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
--
2.39.1.519.gcb327c4b5f-goog
After write-protecting the region, KVM currently invalidates all of
the TLB entries using kvm_flush_remote_tlbs(). Instead, scope the
invalidation to only the targeted memslot. If supported, the
architecture uses the range-based TLBI instructions to flush the
memslot, or else falls back to flushing all of the TLBs.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 409cb187f4911..3e33af0daf1d3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -981,7 +981,7 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
write_lock(&kvm->mmu_lock);
stage2_wp_range(&kvm->arch.mmu, start, end);
write_unlock(&kvm->mmu_lock);
- kvm_flush_remote_tlbs(kvm);
+ kvm_flush_remote_tlbs_memslot(kvm, memslot);
}
/**
--
2.39.1.519.gcb327c4b5f-goog
Currently, when breaking up stage-2 table entries, KVM flushes the
entire VM's context using the 'vmalls12e1is' TLBI operation. One
problematic situation is collapsing table entries into a hugepage,
specifically if the VM is faulting on many hugepages (say, after
dirty-logging). This creates a performance penalty for the guest,
whose already-faulted pages would have to be refilled into the
TLBs again.
Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
to flush only the range of pages governed by the table entry,
while leaving other TLB entries alone. An upcoming patch also
takes advantage of this when breaking up table entries during
the unmap operation.
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6c..0858d1fa85d6b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -686,6 +686,20 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
}
+static void kvm_pgtable_stage2_flush_range(struct kvm_s2_mmu *mmu, u64 start, u64 end,
+ u32 level, u32 tlb_level)
+{
+ if (system_supports_tlb_range())
+ kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, level, tlb_level);
+ else
+ /*
+ * Invalidate the whole stage-2, as we may have numerous leaf
+ * entries below us which would otherwise need invalidating
+ * individually.
+ */
+ kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+}
+
/**
* stage2_try_break_pte() - Invalidates a pte according to the
* 'break-before-make' requirements of the
@@ -721,10 +735,13 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
* Perform the appropriate TLB invalidation based on the evicted pte
* value (if any).
*/
- if (kvm_pte_table(ctx->old, ctx->level))
- kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
- else if (kvm_pte_valid(ctx->old))
+ if (kvm_pte_table(ctx->old, ctx->level)) {
+ u64 end = ctx->addr + kvm_granule_size(ctx->level);
+
+ kvm_pgtable_stage2_flush_range(mmu, ctx->addr, end, ctx->level, 0);
+ } else if (kvm_pte_valid(ctx->old)) {
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+ }
if (stage2_pte_is_counted(ctx->old))
mm_ops->put_page(ctx->ptep);
--
2.39.1.519.gcb327c4b5f-goog
The current implementation of the stage-2 unmap walker
traverses the entire page-table to clear and flush the TLBs
for each entry. This could be very expensive, especially if
the VM is not backed by hugepages. The unmap operation can be
made efficient by disconnecting the table at the very top (the
level at which the largest block mapping can be hosted) and doing
the rest of the unmapping using free_removed_table().
If the system supports FEAT_TLBIRANGE, flush the entire range
that has been disconnected from the rest of the page-table.
Suggested-by: Ricardo Koller <[email protected]>
Signed-off-by: Raghavendra Rao Ananta <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0858d1fa85d6b..af3729d0971f2 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
return 0;
}
+/*
+ * The fast walker executes only if the unmap size is exactly equal to the
+ * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
+ * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
+ * be disconnected from the rest of the page-table without the need to
+ * traverse all the PTEs, at all the levels, and unmap each and every one
+ * of them. The disconnected table is freed using free_removed_table().
+ */
+static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+ kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
+ struct kvm_s2_mmu *mmu = ctx->arg;
+
+ if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
+ return 0;
+
+ if (!stage2_try_break_pte(ctx, mmu))
+ return -EAGAIN;
+
+ /*
+ * Gain back a reference for stage2_unmap_walker() to free
+ * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
+ */
+ mm_ops->get_page(ctx->ptep);
+
+ mm_ops->free_removed_table(childp, ctx->level);
+ return 0;
+}
+
+static void kvm_pgtable_try_fast_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+ struct kvm_pgtable_walker walker = {
+ .cb = fast_stage2_unmap_walker,
+ .arg = pgt->mmu,
+ .flags = KVM_PGTABLE_WALK_TABLE_PRE,
+ };
+
+ if (size == kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL))
+ kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
@@ -1025,6 +1068,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};
+ kvm_pgtable_try_fast_stage2_unmap(pgt, addr, size);
return kvm_pgtable_walk(pgt, addr, size, &walker);
}
--
2.39.1.519.gcb327c4b5f-goog
nit: s/break/invalidate/g
There is a rather important degree of nuance there. 'Break' as it
relates to break-before-make implies that the PTE is made invalid and
visible to hardware _before_ a subsequent invalidation. There will be
systems that relax this requirement and also support TLBIRANGE.
On Mon, Feb 06, 2023 at 05:23:39PM +0000, Raghavendra Rao Ananta wrote:
Some nitpicking on the changelog:
> Currently, when breaking up the stage-2 table entries, KVM
'breaking up stage-2 table entries' is rather ambiguous. Instead
describe the operation taking place on the page tables (i.e. hugepage
collapse).
> would flush the entire VM's context using 'vmalls12e1is'
> TLBI operation. One of the problematic situation is collapsing
> table entries into a hugepage, specifically if the VM is
> faulting on many hugepages (say after dirty-logging). This
> creates a performance penality for the guest whose pages have
typo: penalty
> already been faulted earlier as they would have to refill their
> TLBs again.
>
> Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
> to flush only the range of pages governed by the table entry,
> while leaving other TLB entries alone. An upcoming patch also
> takes advantage of this when breaking up table entries during
> the unmap operation.
Language regarding an upcoming patch isn't necessary, as this one stands
on its own (implements and uses a range-based invalidation helper).
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b11cf2c618a6c..0858d1fa85d6b 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -686,6 +686,20 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
> }
>
> +static void kvm_pgtable_stage2_flush_range(struct kvm_s2_mmu *mmu, u64 start, u64 end,
> + u32 level, u32 tlb_level)
> +{
> + if (system_supports_tlb_range())
You also check this in __kvm_tlb_flush_range(), ideally this should be
done exactly once per call.
> + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, level, tlb_level);
> + else
> + /*
> + * Invalidate the whole stage-2, as we may have numerous leaf
> + * entries below us which would otherwise need invalidating
> + * individually.
> + */
> + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> +}
> +
> /**
> * stage2_try_break_pte() - Invalidates a pte according to the
> * 'break-before-make' requirements of the
> @@ -721,10 +735,13 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> * Perform the appropriate TLB invalidation based on the evicted pte
> * value (if any).
> */
> - if (kvm_pte_table(ctx->old, ctx->level))
> - kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> - else if (kvm_pte_valid(ctx->old))
> + if (kvm_pte_table(ctx->old, ctx->level)) {
> + u64 end = ctx->addr + kvm_granule_size(ctx->level);
> +
> + kvm_pgtable_stage2_flush_range(mmu, ctx->addr, end, ctx->level, 0);
> + } else if (kvm_pte_valid(ctx->old)) {
> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> + }
>
> if (stage2_pte_is_counted(ctx->old))
> mm_ops->put_page(ctx->ptep);
> --
> 2.39.1.519.gcb327c4b5f-goog
>
>
--
Thanks,
Oliver
On Mon, Feb 06, 2023 at 05:23:36PM +0000, Raghavendra Rao Ananta wrote:
> Define __kvm_tlb_flush_range_vmid_ipa() (for VHE and nVHE)
bikeshed: Personally, I find that range implies it takes an address as an
argument already. Maybe just call it __kvm_tlb_flush_vmid_range()
> to flush a range of stage-2 page-tables using IPA in one go.
> If the system supports FEAT_TLBIRANGE, the following patches
> would conviniently replace global TLBI such as vmalls12e1is
> in the map, unmap, and dirty-logging paths with ripas2e1is
> instead.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 3 +++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 ++++++++++++
> arch/arm64/kvm/hyp/nvhe/tlb.c | 28 ++++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/vhe/tlb.c | 24 ++++++++++++++++++++++++
> 4 files changed, 67 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 995ff048e8851..80a8ea85e84f8 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
> __KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
> __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
> __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
> + __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_range_vmid_ipa,
> };
>
> #define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
> @@ -243,6 +244,8 @@ extern void __kvm_flush_vm_context(void);
> extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> int level);
> +extern void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
> + phys_addr_t end, int level, int tlb_level);
> extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
>
> extern void __kvm_timer_set_cntvoff(u64 cntvoff);
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 728e01d4536b0..5787eee4c9fe4 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -125,6 +125,17 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
> }
>
> +static void handle___kvm_tlb_flush_range_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> + DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
> + DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
> + DECLARE_REG(int, level, host_ctxt, 4);
> + DECLARE_REG(int, tlb_level, host_ctxt, 5);
> +
> + __kvm_tlb_flush_range_vmid_ipa(kern_hyp_va(mmu), start, end, level, tlb_level);
> +}
> +
> static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
> {
> DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> @@ -315,6 +326,7 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__kvm_vcpu_run),
> HANDLE_FUNC(__kvm_flush_vm_context),
> HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
> + HANDLE_FUNC(__kvm_tlb_flush_range_vmid_ipa),
> HANDLE_FUNC(__kvm_tlb_flush_vmid),
> HANDLE_FUNC(__kvm_flush_cpu_context),
> HANDLE_FUNC(__kvm_timer_set_cntvoff),
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index d296d617f5896..7398dd00445e7 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -109,6 +109,34 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
> __tlb_switch_to_host(&cxt);
> }
>
> +void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
> + phys_addr_t end, int level, int tlb_level)
> +{
> + struct tlb_inv_context cxt;
> +
> + dsb(ishst);
> +
> + /* Switch to requested VMID */
> + __tlb_switch_to_guest(mmu, &cxt);
> +
> + __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level, tlb_level);
> +
> + /*
> + * Range-based ipas2e1is flushes only Stage-2 entries, and since the
> + * VA isn't available for Stage-1 entries, flush the entire stage-1.
> + */
nit: if we are going to preserve some of the commentary over in
__kvm_tlb_flush_vmid_ipa(), I would prefer just an exact copy/paste.
But, FWIW, I think you can just elide the clarifying comments altogether
since the relationship between stage-1 and stage-2 invalidations is
already documented.
> + dsb(ish);
> + __tlbi(vmalle1is);
> + dsb(ish);
> + isb();
> +
> + /* See the comment below in __kvm_tlb_flush_vmid_ipa() */
Same comment as above.
--
Thanks,
Oliver
On Mon, Feb 06, 2023 at 05:23:37PM +0000, Raghavendra Rao Ananta wrote:
> Implement kvm_arch_flush_remote_tlbs_range() for arm64,
> such that it can utilize the TLBI range based instructions
> if supported.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 3 +++
> arch/arm64/kvm/mmu.c | 15 +++++++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index dee530d75b957..211fab0c1de74 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1002,6 +1002,9 @@ struct kvm *kvm_arch_alloc_vm(void);
> #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
> int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
>
> +#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
> +
> static inline bool kvm_vm_is_protected(struct kvm *kvm)
> {
> return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index e98910a8d0af6..409cb187f4911 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -91,6 +91,21 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
> return 0;
> }
>
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
> +{
> + phys_addr_t start, end;
> +
> + if (!system_supports_tlb_range())
> + return -EOPNOTSUPP;
There's multiple layers of fallback throughout this series, as it would
appear that deep in __kvm_tlb_flush_range() you're blasting the whole
VMID if either the range is too large or the feature isn't supported.
Is it possible to just normalize on a single spot to gate the use of
range-based invalidations? I have a slight preference for doing it deep
in the handler, as it keeps the upper layers of code a bit more
readable.
> + start = start_gfn << PAGE_SHIFT;
> + end = (start_gfn + pages) << PAGE_SHIFT;
> +
> + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, &kvm->arch.mmu,
> + start, end, KVM_PGTABLE_MAX_LEVELS - 1, 0);
> + return 0;
> +}
> +
> static bool kvm_is_device_pfn(unsigned long pfn)
> {
> return !pfn_is_map_memory(pfn);
> --
> 2.39.1.519.gcb327c4b5f-goog
>
>
--
Thanks,
Oliver
On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> The current implementation of the stage-2 unmap walker
> traverses the entire page-table to clear and flush the TLBs
> for each entry. This could be very expensive, especially if
> the VM is not backed by hugepages. The unmap operation could be
> made efficient by disconnecting the table at the very
> top (level at which the largest block mapping can be hosted)
> and doing the rest of the unmapping using free_removed_table().
> If the system supports FEAT_TLBIRANGE, flush the entire range
> that has been disconnected from the rest of the page-table.
>
> Suggested-by: Ricardo Koller <[email protected]>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0858d1fa85d6b..af3729d0971f2 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> return 0;
> }
>
> +/*
> + * The fast walker executes only if the unmap size is exactly equal to the
> + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> + * be disconnected from the rest of the page-table without the need to
> + * traverse all the PTEs, at all the levels, and unmap each and every one
> + * of them. The disconnected table is freed using free_removed_table().
> + */
> +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags visit)
> +{
> + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> + struct kvm_s2_mmu *mmu = ctx->arg;
> +
> + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> + return 0;
> +
> + if (!stage2_try_break_pte(ctx, mmu))
> + return -EAGAIN;
> +
> + /*
> + * Gain back a reference for stage2_unmap_walker() to free
> + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> + */
> + mm_ops->get_page(ctx->ptep);
Doesn't this run the risk of a potential UAF if the refcount was 1 before
calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
the refcount to 0 on the page before this ever gets called.
Also, AFAICT this misses the CMOs that are required on systems w/o
FEAT_FWB. Without them it is possible that the host will read something
other than what was most recently written by the guest if it is using
noncacheable memory attributes at stage-1.
I imagine the actual bottleneck is the DSB required after every
CMO/TLBI. Theoretically, the unmap path could be updated to:
- Perform the appropriate CMOs for every valid leaf entry *without*
issuing a DSB.
- Elide TLBIs entirely that take place in the middle of the walk
- After the walk completes, dsb(ish) to guarantee that the CMOs have
completed and the invalid PTEs are made visible to the hardware
walkers. This should be done implicitly by the TLBI implementation
- Invalidate the [addr, addr + size) range of IPAs
This would also avoid over-invalidating stage-1 since we blast the
entire stage-1 context for every stage-2 invalidation. Thoughts?
> + mm_ops->free_removed_table(childp, ctx->level);
> + return 0;
> +}
> +
--
Thanks,
Oliver
On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> Define a generic function __kvm_tlb_flush_range() to
> invalidate the TLBs over a range of addresses. The
> implementation accepts 'op' as a generic TLBI operation.
> Upcoming patches will use this to implement IPA based
> TLB invalidations (ipas2e1is).
>
> If the system doesn't support FEAT_TLBIRANGE, the
> implementation falls back to flushing the pages one by one
> for the range supplied.
>
> Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 43c3bc0f9544d..995ff048e8851 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
>
> +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do { \
> + unsigned long pages, stride; \
> + \
> + stride = kvm_granule_size(level); \
Hmm... There's a rather subtle and annoying complication here that I
don't believe is handled.
Similar to what I said in the last spin of the series, there is no
guarantee that a range of IPAs is mapped at the exact same level
throughout. Dirty logging and memslots that aren't hugepage aligned
could lead to a mix of mapping levels being used within a range of the
IPA space.
> + start = round_down(start, stride); \
> + end = round_up(end, stride); \
> + pages = (end - start) >> PAGE_SHIFT; \
> + \
> + if ((!system_supports_tlb_range() && \
> + (end - start) >= (MAX_TLBI_OPS * stride)) || \
Doesn't checking for TLBIRANGE above eliminate the need to test against
MAX_TLBI_OPS?
> + pages >= MAX_TLBI_RANGE_PAGES) { \
> + __kvm_tlb_flush_vmid(mmu); \
> + break; \
> + } \
> + \
> + __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
> +} while (0)
> +
> extern void __kvm_flush_vm_context(void);
> extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> --
> 2.39.1.519.gcb327c4b5f-goog
>
>
--
Thanks,
Oliver
Hi Oliver,
On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > Define a generic function __kvm_tlb_flush_range() to
> > invalidate the TLBs over a range of addresses. The
> > implementation accepts 'op' as a generic TLBI operation.
> > Upcoming patches will use this to implement IPA based
> > TLB invalidations (ipas2e1is).
> >
> > If the system doesn't support FEAT_TLBIRANGE, the
> > implementation falls back to flushing the pages one by one
> > for the range supplied.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 43c3bc0f9544d..995ff048e8851 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> > DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> > #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> >
> > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do { \
> > + unsigned long pages, stride; \
> > + \
> > + stride = kvm_granule_size(level); \
>
> Hmm... There's a rather subtle and annoying complication here that I
> don't believe is handled.
>
> Similar to what I said in the last spin of the series, there is no
> guarantee that a range of IPAs is mapped at the exact same level
> throughout. Dirty logging and memslots that aren't hugepage aligned
> could lead to a mix of mapping levels being used within a range of the
> IPA space.
>
Unlike in v1, the level/stride here is only used to step through the
addresses when the system doesn't support TLBIRANGE. The TTL hint
is 0.
That being said, do you think we can always assume the least possible
stride (say, 4k) and hardcode it?
With respect to alignment, since the function is only called while
breaking the table PTE, do you think it'll still be a problem even if
we go with the least granularity stride?
> > + start = round_down(start, stride); \
> > + end = round_up(end, stride); \
> > + pages = (end - start) >> PAGE_SHIFT; \
> > + \
> > + if ((!system_supports_tlb_range() && \
> > + (end - start) >= (MAX_TLBI_OPS * stride)) || \
>
> Doesn't checking for TLBIRANGE above eliminate the need to test against
> MAX_TLBI_OPS?
>
Derived from __flush_tlb_range(), I think the condition is there to
just flush everything when the system doesn't support TLBIRANGE and
the range is too large to iterate over and flush, probably to prevent
soft-lockups?
Thank you.
Raghavendra
> > + pages >= MAX_TLBI_RANGE_PAGES) { \
> > + __kvm_tlb_flush_vmid(mmu); \
> > + break; \
> > + } \
> > + \
> > + __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
> > +} while (0)
> > +
> > extern void __kvm_flush_vm_context(void);
> > extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> > extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> > --
> > 2.39.1.519.gcb327c4b5f-goog
> >
> >
>
> --
> Thanks,
> Oliver
On Wed, Mar 29, 2023 at 5:59 PM Oliver Upton <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:36PM +0000, Raghavendra Rao Ananta wrote:
> > Define __kvm_tlb_flush_range_vmid_ipa() (for VHE and nVHE)
>
> bikeshed: Personally, I find that range implies it takes an address as an
> argument already. Maybe just call it __kvm_tlb_flush_vmid_range()
>
Hmm, since TLBI instructions take a variety of address types, VA or
IPA, I just thought of keeping the '_ipa' suffix to make things clear.
Moreover, it aligns with the existing __kvm_tlb_flush_vmid_ipa(). WDYT?
Thank you.
Raghavendra
> > to flush a range of stage-2 page-tables using IPA in one go.
> > If the system supports FEAT_TLBIRANGE, the following patches
> > would conveniently replace global TLBIs such as vmalls12e1is
> > in the map, unmap, and dirty-logging paths with ripas2e1is
> > instead.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_asm.h | 3 +++
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 ++++++++++++
> > arch/arm64/kvm/hyp/nvhe/tlb.c | 28 ++++++++++++++++++++++++++++
> > arch/arm64/kvm/hyp/vhe/tlb.c | 24 ++++++++++++++++++++++++
> > 4 files changed, 67 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 995ff048e8851..80a8ea85e84f8 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -79,6 +79,7 @@ enum __kvm_host_smccc_func {
> > __KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
> > __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
> > __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
> > + __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_range_vmid_ipa,
> > };
> >
> > #define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
> > @@ -243,6 +244,8 @@ extern void __kvm_flush_vm_context(void);
> > extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> > extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> > int level);
> > +extern void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
> > + phys_addr_t end, int level, int tlb_level);
> > extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
> >
> > extern void __kvm_timer_set_cntvoff(u64 cntvoff);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index 728e01d4536b0..5787eee4c9fe4 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -125,6 +125,17 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> > __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
> > }
> >
> > +static void handle___kvm_tlb_flush_range_vmid_ipa(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> > + DECLARE_REG(phys_addr_t, start, host_ctxt, 2);
> > + DECLARE_REG(phys_addr_t, end, host_ctxt, 3);
> > + DECLARE_REG(int, level, host_ctxt, 4);
> > + DECLARE_REG(int, tlb_level, host_ctxt, 5);
> > +
> > + __kvm_tlb_flush_range_vmid_ipa(kern_hyp_va(mmu), start, end, level, tlb_level);
> > +}
> > +
> > static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
> > {
> > DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
> > @@ -315,6 +326,7 @@ static const hcall_t host_hcall[] = {
> > HANDLE_FUNC(__kvm_vcpu_run),
> > HANDLE_FUNC(__kvm_flush_vm_context),
> > HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
> > + HANDLE_FUNC(__kvm_tlb_flush_range_vmid_ipa),
> > HANDLE_FUNC(__kvm_tlb_flush_vmid),
> > HANDLE_FUNC(__kvm_flush_cpu_context),
> > HANDLE_FUNC(__kvm_timer_set_cntvoff),
> > diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> > index d296d617f5896..7398dd00445e7 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> > @@ -109,6 +109,34 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
> > __tlb_switch_to_host(&cxt);
> > }
> >
> > +void __kvm_tlb_flush_range_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t start,
> > + phys_addr_t end, int level, int tlb_level)
> > +{
> > + struct tlb_inv_context cxt;
> > +
> > + dsb(ishst);
> > +
> > + /* Switch to requested VMID */
> > + __tlb_switch_to_guest(mmu, &cxt);
> > +
> > + __kvm_tlb_flush_range(ipas2e1is, mmu, start, end, level, tlb_level);
> > +
> > + /*
> > + * Range-based ipas2e1is flushes only Stage-2 entries, and since the
> > + * VA isn't available for Stage-1 entries, flush the entire stage-1.
> > + */
>
> nit: if we are going to preserve some of the commentary over in
> __kvm_tlb_flush_vmid_ipa(), I would prefer just an exact copy/paste.
> But, FWIW, I think you can just elide the clarifying comments altogether
> since the relationship between stage-1 and stage-2 invalidations is
> already documented.
>
> > + dsb(ish);
> > + __tlbi(vmalle1is);
> > + dsb(ish);
> > + isb();
> > +
> > + /* See the comment below in __kvm_tlb_flush_vmid_ipa() */
>
> Same comment as above.
>
> --
> Thanks,
> Oliver
On Wed, Mar 29, 2023 at 5:53 PM Oliver Upton <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:37PM +0000, Raghavendra Rao Ananta wrote:
> > Implement kvm_arch_flush_remote_tlbs_range() for arm64,
> > such that it can utilize the TLBI range based instructions
> > if supported.
> >
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 3 +++
> > arch/arm64/kvm/mmu.c | 15 +++++++++++++++
> > 2 files changed, 18 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index dee530d75b957..211fab0c1de74 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1002,6 +1002,9 @@ struct kvm *kvm_arch_alloc_vm(void);
> > #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
> > int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
> >
> > +#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
> > +
> > static inline bool kvm_vm_is_protected(struct kvm *kvm)
> > {
> > return false;
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index e98910a8d0af6..409cb187f4911 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -91,6 +91,21 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
> > return 0;
> > }
> >
> > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
> > +{
> > + phys_addr_t start, end;
> > +
> > + if (!system_supports_tlb_range())
> > + return -EOPNOTSUPP;
>
> There's multiple layers of fallback throughout this series, as it would
> appear that deep in __kvm_tlb_flush_range() you're blasting the whole
> VMID if either the range is too large or the feature isn't supported.
>
> Is it possible to just normalize on a single spot to gate the use of
> range-based invalidations? I have a slight preference for doing it deep
> in the handler, as it keeps the upper layers of code a bit more
> readable.
>
I was a little skeptical about this part, since
kvm_arch_flush_remote_tlbs_range() is expected to return -EOPNOTSUPP if
there's indeed no support.
But I see your point. The if-else in kvm_pgtable_stage2_flush_range()
seems redundant and I can simply manage this conditions inside
__kvm_tlb_flush_range_vmid_ipa() itself, but I'll leave the
kvm_arch_flush_remote_tlbs_range()'s implementation as is. Thoughts?
Thank you.
Raghavendra
> > + start = start_gfn << PAGE_SHIFT;
> > + end = (start_gfn + pages) << PAGE_SHIFT;
> > +
> > + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, &kvm->arch.mmu,
> > + start, end, KVM_PGTABLE_MAX_LEVELS - 1, 0);
> > + return 0;
> > +}
> > +
> > static bool kvm_is_device_pfn(unsigned long pfn)
> > {
> > return !pfn_is_map_memory(pfn);
> > --
> > 2.39.1.519.gcb327c4b5f-goog
> >
> >
>
> --
> Thanks,
> Oliver
On Wed, Mar 29, 2023 at 5:17 PM Oliver Upton <[email protected]> wrote:
>
> nit: s/break/invalidate/g
>
> There is a rather important degree of nuance there. 'Break' as it
> relates to break-before-make implies that the PTE is made invalid and
> visible to hardware _before_ a subsequent invalidation. There will be
> systems that relax this requirement and also support TLBIRANGE.
>
> On Mon, Feb 06, 2023 at 05:23:39PM +0000, Raghavendra Rao Ananta wrote:
>
> Some nitpicking on the changelog:
>
> > Currently, when breaking up the stage-2 table entries, KVM
>
> 'breaking up stage-2 table entries' is rather ambiguous. Instead
> describe the operation taking place on the page tables (i.e. hugepage
> collapse).
>
> > would flush the entire VM's context using 'vmalls12e1is'
> > TLBI operation. One of the problematic situation is collapsing
> > table entries into a hugepage, specifically if the VM is
> > faulting on many hugepages (say after dirty-logging). This
> > creates a performance penality for the guest whose pages have
>
> typo: penalty
>
> > already been faulted earlier as they would have to refill their
> > TLBs again.
> >
> > Hence, if the system supports it, use __kvm_tlb_flush_range_vmid_ipa()
>
> > to flush only the range of pages governed by the table entry,
> > while leaving other TLB entries alone. An upcoming patch also
> > takes advantage of this when breaking up table entries during
> > the unmap operation.
>
> Language regarding an upcoming patch isn't necessary, as this one stands
> on its own (implements and uses a range-based invalidation helper).
>
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++++++++++---
> > 1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index b11cf2c618a6c..0858d1fa85d6b 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -686,6 +686,20 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
> > return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
> > }
> >
> > +static void kvm_pgtable_stage2_flush_range(struct kvm_s2_mmu *mmu, u64 start, u64 end,
> > + u32 level, u32 tlb_level)
> > +{
> > + if (system_supports_tlb_range())
>
> You also check this in __kvm_tlb_flush_range(), ideally this should be
> done exactly once per call.
>
> > + kvm_call_hyp(__kvm_tlb_flush_range_vmid_ipa, mmu, start, end, level, tlb_level);
> > + else
> > + /*
> > + * Invalidate the whole stage-2, as we may have numerous leaf
> > + * entries below us which would otherwise need invalidating
> > + * individually.
> > + */
> > + kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> > +}
> > +
> > /**
> > * stage2_try_break_pte() - Invalidates a pte according to the
> > * 'break-before-make' requirements of the
> > @@ -721,10 +735,13 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
> > * Perform the appropriate TLB invalidation based on the evicted pte
> > * value (if any).
> > */
> > - if (kvm_pte_table(ctx->old, ctx->level))
> > - kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
> > - else if (kvm_pte_valid(ctx->old))
> > + if (kvm_pte_table(ctx->old, ctx->level)) {
> > + u64 end = ctx->addr + kvm_granule_size(ctx->level);
> > +
> > + kvm_pgtable_stage2_flush_range(mmu, ctx->addr, end, ctx->level, 0);
> > + } else if (kvm_pte_valid(ctx->old)) {
> > kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
> > + }
> >
> > if (stage2_pte_is_counted(ctx->old))
> > mm_ops->put_page(ctx->ptep);
> > --
> > 2.39.1.519.gcb327c4b5f-goog
> >
> >
ACK on all of the comments. I'll address them in the next revision.
Thank you.
Raghavendra
>
> --
> Thanks,
> Oliver
On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > The current implementation of the stage-2 unmap walker
> > traverses the entire page-table to clear and flush the TLBs
> > for each entry. This could be very expensive, especially if
> > the VM is not backed by hugepages. The unmap operation could be
> > made efficient by disconnecting the table at the very
> > top (level at which the largest block mapping can be hosted)
> > and doing the rest of the unmapping using free_removed_table().
> > If the system supports FEAT_TLBIRANGE, flush the entire range
> > that has been disconnected from the rest of the page-table.
> >
> > Suggested-by: Ricardo Koller <[email protected]>
> > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0858d1fa85d6b..af3729d0971f2 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > return 0;
> > }
> >
> > +/*
> > + * The fast walker executes only if the unmap size is exactly equal to the
> > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > + * be disconnected from the rest of the page-table without the need to
> > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > + * of them. The disconnected table is freed using free_removed_table().
> > + */
> > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags visit)
> > +{
> > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > + struct kvm_s2_mmu *mmu = ctx->arg;
> > +
> > + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > + return 0;
> > +
> > + if (!stage2_try_break_pte(ctx, mmu))
> > + return -EAGAIN;
> > +
> > + /*
> > + * Gain back a reference for stage2_unmap_walker() to free
> > + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > + */
> > + mm_ops->get_page(ctx->ptep);
>
> Doesn't this run the risk of a potential UAF if the refcount was 1 before
> calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> the refcount to 0 on the page before this ever gets called.
>
> Also, AFAICT this misses the CMOs that are required on systems w/o
> FEAT_FWB. Without them it is possible that the host will read something
> other than what was most recently written by the guest if it is using
> noncacheable memory attributes at stage-1.
>
> I imagine the actual bottleneck is the DSB required after every
> CMO/TLBI. Theoretically, the unmap path could be updated to:
>
> - Perform the appropriate CMOs for every valid leaf entry *without*
> issuing a DSB.
>
> - Elide TLBIs entirely that take place in the middle of the walk
>
> - After the walk completes, dsb(ish) to guarantee that the CMOs have
> completed and the invalid PTEs are made visible to the hardware
> walkers. This should be done implicitly by the TLBI implementation
>
> - Invalidate the [addr, addr + size) range of IPAs
>
> This would also avoid over-invalidating stage-1 since we blast the
> entire stage-1 context for every stage-2 invalidation. Thoughts?
>
Correct me if I'm wrong, but if we invalidate the TLB only after the
walk is complete, don't you think there's a risk of a race, where the
guest can still hit in the TLB even though the page was unmapped?
Thanks,
Raghavendra
> > + mm_ops->free_removed_table(childp, ctx->level);
> > + return 0;
> > +}
> > +
>
> --
> Thanks,
> Oliver
On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> Hi Oliver,
>
> On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <[email protected]> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > > Define a generic function __kvm_tlb_flush_range() to
> > > invalidate the TLBs over a range of addresses. The
> > > implementation accepts 'op' as a generic TLBI operation.
> > > Upcoming patches will use this to implement IPA based
> > > TLB invalidations (ipas2e1is).
> > >
> > > If the system doesn't support FEAT_TLBIRANGE, the
> > > implementation falls back to flushing the pages one by one
> > > for the range supplied.
> > >
> > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> > > 1 file changed, 18 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > > index 43c3bc0f9544d..995ff048e8851 100644
> > > --- a/arch/arm64/include/asm/kvm_asm.h
> > > +++ b/arch/arm64/include/asm/kvm_asm.h
> > > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> > > DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> > > #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> > >
> > > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do { \
> > > + unsigned long pages, stride; \
> > > + \
> > > + stride = kvm_granule_size(level); \
> >
> > Hmm... There's a rather subtle and annoying complication here that I
> > don't believe is handled.
> >
> > Similar to what I said in the last spin of the series, there is no
> > guarantee that a range of IPAs is mapped at the exact same level
> > throughout. Dirty logging and memslots that aren't hugepage aligned
> > could lead to a mix of mapping levels being used within a range of the
> > IPA space.
> >
> Unlike the comment on v1, the level/stride here is used to jump the
> addresses in case the system doesn't support TLBIRANGE. The TTL hint
> is 0.
Right. So we agree that the level is not uniform throughout the provided
range. The invalidation by IPA is also used if 'pages' is odd, even on
systems with TLBIRANGE. We must assume the worst case here, in that the
TLBI by IPA invalidated a single PTE-level entry. You could wind up
over-invalidating in that case, but you'd still be correct.
> That being said, do you think we can always assume the least possible
> stride (say, 4k) and hardcode it?
> With respect to alignment, since the function is only called while
> breaking the table PTE, do you think it'll still be a problem even if
> we go with the least granularity stride?
I believe so. If we want to apply the range-based invalidations generally
in KVM then we will not always be dealing with a block-aligned chunk of
address.
> > > + start = round_down(start, stride); \
> > > + end = round_up(end, stride); \
> > > + pages = (end - start) >> PAGE_SHIFT; \
> > > + \
> > > + if ((!system_supports_tlb_range() && \
> > > + (end - start) >= (MAX_TLBI_OPS * stride)) || \
> >
> > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > MAX_TLBI_OPS?
> >
> Derived from __flush_tlb_range(), I think the condition is used to
> just flush everything if the range is too large to iterate and flush
> when the system doesn't support TLBIRANGE. Probably to prevent
> soft-lockups?
Right, but you test above for system_supports_tlb_range(), meaning that
you'd unconditionally call __kvm_tlb_flush_vmid() below.
> > > + pages >= MAX_TLBI_RANGE_PAGES) { \
> > > + __kvm_tlb_flush_vmid(mmu); \
> > > + break; \
> > > + } \
> > > + \
> > > + __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
> > > +} while (0)
--
Thanks,
Oliver
On Mon, Apr 03, 2023 at 02:08:29PM -0700, Raghavendra Rao Ananta wrote:
> On Wed, Mar 29, 2023 at 5:59 PM Oliver Upton <[email protected]> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:36PM +0000, Raghavendra Rao Ananta wrote:
> > > Define __kvm_tlb_flush_range_vmid_ipa() (for VHE and nVHE)
> >
> > bikeshed: Personally, I find that range implies it takes an address as an
> > argument already. Maybe just call it __kvm_tlb_flush_vmid_range()
> >
> Hmm, since TLBI instructions takes-in a variety of ranges, VA or IPA,
> I just thought of extending the '_ipa' to make things clear. Moreover
> it aligns with the existing __kvm_tlb_flush_vmid_ipa(). WDYT?
Like I said, just a bikeshed and it seemed trivial to eliminate a token
in the function name. FWIW, you're dealing in terms of the IPA space by
definition, as a VMID identifies an IPA address space. Range-based
invalidations by VA would instead take an ASID as the address space
identifier.
--
Thanks,
Oliver
On Tue, Apr 04, 2023 at 06:41:34PM +0000, Oliver Upton wrote:
> On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <[email protected]> wrote:
> > > > + start = round_down(start, stride); \
> > > > + end = round_up(end, stride); \
> > > > + pages = (end - start) >> PAGE_SHIFT; \
> > > > + \
> > > > + if ((!system_supports_tlb_range() && \
> > > > + (end - start) >= (MAX_TLBI_OPS * stride)) || \
> > >
> > > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > > MAX_TLBI_OPS?
> > >
> > Derived from __flush_tlb_range(), I think the condition is used to
> > just flush everything if the range is too large to iterate and flush
> > when the system doesn't support TLBIRANGE. Probably to prevent
> > soft-lockups?
>
> Right, but you test above for system_supports_tlb_range(), meaning that
> you'd unconditionally call __kvm_tlb_flush_vmid() below.
Gah, I misread the parenthesis and managed to miss your statement in the
changelog about !TLBIRANGE systems. Apologies.
--
Thanks,
Oliver
On Mon, Apr 03, 2023 at 02:23:17PM -0700, Raghavendra Rao Ananta wrote:
> On Wed, Mar 29, 2023 at 5:53 PM Oliver Upton <[email protected]> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:37PM +0000, Raghavendra Rao Ananta wrote:
> > > Implement kvm_arch_flush_remote_tlbs_range() for arm64,
> > > such that it can utilize the TLBI range based instructions
> > > if supported.
> > >
> > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 3 +++
> > > arch/arm64/kvm/mmu.c | 15 +++++++++++++++
> > > 2 files changed, 18 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index dee530d75b957..211fab0c1de74 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -1002,6 +1002,9 @@ struct kvm *kvm_arch_alloc_vm(void);
> > > #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
> > > int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
> > >
> > > +#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> > > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
> > > +
> > > static inline bool kvm_vm_is_protected(struct kvm *kvm)
> > > {
> > > return false;
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index e98910a8d0af6..409cb187f4911 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -91,6 +91,21 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
> > > +{
> > > + phys_addr_t start, end;
> > > +
> > > + if (!system_supports_tlb_range())
> > > + return -EOPNOTSUPP;
> >
> > There's multiple layers of fallback throughout this series, as it would
> > appear that deep in __kvm_tlb_flush_range() you're blasting the whole
> > VMID if either the range is too large or the feature isn't supported.
> >
> > Is it possible to just normalize on a single spot to gate the use of
> > range-based invalidations? I have a slight preference for doing it deep
> > in the handler, as it keeps the upper layers of code a bit more
> > readable.
> >
> I was a little skeptical about this part, since
> kvm_arch_flush_remote_tlbs_range() is expected to return -EOPNOTSUPP if
> there's indeed no support.
Well, the arch-neutral code can expect whatever it wants :) The only
real contract we have with it is to return 0 iff the specified range has
been invalidated, even if that comes with over-invalidating.
> But I see your point. The if-else in kvm_pgtable_stage2_flush_range()
> seems redundant and I can simply manage this conditions inside
> __kvm_tlb_flush_range_vmid_ipa() itself, but I'll leave the
> kvm_arch_flush_remote_tlbs_range()'s implementation as is. Thoughts?
The largest concern I had is that the series is testing for FEAT_TLBIRANGE
all over the shop and I just raised that concern on this patch. AFAICT,
the iterative approach to invalidating a range of IPAs is effectively
dead code, as all flows into __kvm_tlb_flush_range_vmid_ipa() are gated
by system_supports_tlb_range() somewhere.
Personally, I prefer keeping the higher level software models making
aggressive use of range-based interfaces and letting the actual
implementation under the hood select the appropriate instruction. That
helps readability, as it directly communicates the expected outcome of
the invalidation.
So, if you want to make use of the iterative approach to TLB invalidations on
!TLBIRANGE systems, then this function should _not_ return EOPNOTSUPP.
--
Thanks,
Oliver
On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <[email protected]> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > The current implementation of the stage-2 unmap walker
> > > traverses the entire page-table to clear and flush the TLBs
> > > for each entry. This could be very expensive, especially if
> > > the VM is not backed by hugepages. The unmap operation could be
> > > made efficient by disconnecting the table at the very
> > > top (level at which the largest block mapping can be hosted)
> > > and do the rest of the unmapping using free_removed_table().
> > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > that has been disconnected from the rest of the page-table.
> > >
> > > Suggested-by: Ricardo Koller <[email protected]>
> > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > ---
> > > arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 44 insertions(+)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > + * be disconnected from the rest of the page-table without the need to
> > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > + * of them. The disconnected table is freed using free_removed_table().
> > > + */
> > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > + enum kvm_pgtable_walk_flags visit)
> > > +{
> > > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > + struct kvm_s2_mmu *mmu = ctx->arg;
> > > +
> > > + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > + return 0;
> > > +
> > > + if (!stage2_try_break_pte(ctx, mmu))
> > > + return -EAGAIN;
> > > +
> > > + /*
> > > + * Gain back a reference for stage2_unmap_walker() to free
> > > + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > + */
> > > + mm_ops->get_page(ctx->ptep);
> >
> > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > the refcount to 0 on the page before this ever gets called.
> >
> > Also, AFAICT this misses the CMOs that are required on systems w/o
> > FEAT_FWB. Without them it is possible that the host will read something
> > other than what was most recently written by the guest if it is using
> > noncacheable memory attributes at stage-1.
> >
> > I imagine the actual bottleneck is the DSB required after every
> > CMO/TLBI. Theoretically, the unmap path could be updated to:
> >
> > - Perform the appropriate CMOs for every valid leaf entry *without*
> > issuing a DSB.
> >
> > - Elide TLBIs entirely that take place in the middle of the walk
> >
> > - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > completed and the invalid PTEs are made visible to the hardware
> > walkers. This should be done implicitly by the TLBI implementation
> >
> > - Invalidate the [addr, addr + size) range of IPAs
> >
> > This would also avoid over-invalidating stage-1 since we blast the
> > entire stage-1 context for every stage-2 invalidation. Thoughts?
> >
> Correct me if I'm wrong, but if we invalidate the TLB after the walk
> is complete, don't you think there's a risk of race if the guest can
> hit in the TLB even though the page was unmapped?
Yeah, we'd need to do the CMOs _after_ making the translation invalid in
the page tables and completing the TLB invalidation. Apologies.
Otherwise, the only requirement we need to uphold w/ either the MMU
notifiers or userspace is that the translation has been invalidated at
the time of return.
--
Thanks,
Oliver
On Tue, Apr 4, 2023 at 11:46 AM Oliver Upton <[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 02:08:29PM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 5:59 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:36PM +0000, Raghavendra Rao Ananta wrote:
> > > > Define __kvm_tlb_flush_range_vmid_ipa() (for VHE and nVHE)
> > >
> > > bikeshed: Personally, I find that range implies it takes an address as an
> > > argument already. Maybe just call it __kvm_tlb_flush_vmid_range()
> > >
> > Hmm, since TLBI instructions takes-in a variety of ranges, VA or IPA,
> > I just thought of extending the '_ipa' to make things clear. Moreover
> > it aligns with the existing __kvm_tlb_flush_vmid_ipa(). WDYT?
>
> Like I said, just a bikeshed and it seemed trivial to eliminate a token
> in the function name. FWIW, you're dealing in terms of the IPA space by
> definition, as a VMID identifies an IPA address space. Range-based
> invalidations by VA would instead take an ASID as the address space
> identifier.
>
Okay, let's rename it to __kvm_tlb_flush_vmid_range().
Thanks,
Raghavendra
> --
> Thanks,
> Oliver
On Tue, Apr 4, 2023 at 12:09 PM Oliver Upton <[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 02:23:17PM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 5:53 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:37PM +0000, Raghavendra Rao Ananta wrote:
> > > > Implement kvm_arch_flush_remote_tlbs_range() for arm64,
> > > > such that it can utilize the TLBI range based instructions
> > > > if supported.
> > > >
> > > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > > ---
> > > > arch/arm64/include/asm/kvm_host.h | 3 +++
> > > > arch/arm64/kvm/mmu.c | 15 +++++++++++++++
> > > > 2 files changed, 18 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index dee530d75b957..211fab0c1de74 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -1002,6 +1002,9 @@ struct kvm *kvm_arch_alloc_vm(void);
> > > > #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS
> > > > int kvm_arch_flush_remote_tlbs(struct kvm *kvm);
> > > >
> > > > +#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> > > > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages);
> > > > +
> > > > static inline bool kvm_vm_is_protected(struct kvm *kvm)
> > > > {
> > > > return false;
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index e98910a8d0af6..409cb187f4911 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -91,6 +91,21 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
> > > > return 0;
> > > > }
> > > >
> > > > +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, u64 pages)
> > > > +{
> > > > + phys_addr_t start, end;
> > > > +
> > > > + if (!system_supports_tlb_range())
> > > > + return -EOPNOTSUPP;
> > >
> > > There's multiple layers of fallback throughout this series, as it would
> > > appear that deep in __kvm_tlb_flush_range() you're blasting the whole
> > > VMID if either the range is too large or the feature isn't supported.
> > >
> > > Is it possible to just normalize on a single spot to gate the use of
> > > range-based invalidations? I have a slight preference for doing it deep
> > > in the handler, as it keeps the upper layers of code a bit more
> > > readable.
> > >
> > I was a little skeptical on this part, since the
> > kvm_arch_flush_remote_tlbs_range() expects to return -EOPNOTSUPP if
> > indeed there's no support.
>
> Well, the arch-neutral code can expect whatever it wants :) The only
> real contract we have with it is to return 0 iff the specified range has
> been invalidated, even if that comes with over-invalidating.
>
> > But I see your point. The if-else in kvm_pgtable_stage2_flush_range()
> > seems redundant and I can simply manage this conditions inside
> > __kvm_tlb_flush_range_vmid_ipa() itself, but I'll leave the
> > kvm_arch_flush_remote_tlbs_range()'s implementation as is. Thoughts?
>
> The largest concern I had is that the series is testing for FEAT_TLBIRANGE
> all over the shop and I just raised that concern on this patch. AFAICT,
> the iterative approach to invalidating a range of IPAs is effectively
> dead code, as all flows into __kvm_tlb_flush_range_vmid_ipa() are gated
> by system_supports_tlb_range() somewhere.
>
> Personally, I prefer keeping the higher level software models making
> aggressive use of range-based interfaces and letting the actual
> implementation under the hood select the appropriate instruction. That
> helps readability, as it directly communicates the expected outcome of
> the invalidation.
>
> So, if you want to make use of the iterative approach to TLB invalidations on
> !TLBIRANGE systems, then this function should _not_ return EOPNOTSUPP.
>
Thanks for the explanation. I can make the calls consistent across the
code, and let the actual handler deal with TLBIRANGE support.
- Raghavendra
> --
> Thanks,
> Oliver
On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > The current implementation of the stage-2 unmap walker
> > > > traverses the entire page-table to clear and flush the TLBs
> > > > for each entry. This could be very expensive, especially if
> > > > the VM is not backed by hugepages. The unmap operation could be
> > > > made efficient by disconnecting the table at the very
> > > > top (level at which the largest block mapping can be hosted)
> > > > and do the rest of the unmapping using free_removed_table().
> > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > that has been disconnected from the rest of the page-table.
> > > >
> > > > Suggested-by: Ricardo Koller <[email protected]>
> > > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > > ---
> > > > arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 44 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > return 0;
> > > > }
> > > >
> > > > +/*
> > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > + * be disconnected from the rest of the page-table without the need to
> > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > + */
> > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > + enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > + struct kvm_s2_mmu *mmu = ctx->arg;
> > > > +
> > > > + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > + return 0;
> > > > +
> > > > + if (!stage2_try_break_pte(ctx, mmu))
> > > > + return -EAGAIN;
> > > > +
> > > > + /*
> > > > + * Gain back a reference for stage2_unmap_walker() to free
> > > > + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > + */
> > > > + mm_ops->get_page(ctx->ptep);
> > >
> > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > the refcount to 0 on the page before this ever gets called.
> > >
> > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > FEAT_FWB. Without them it is possible that the host will read something
> > > other than what was most recently written by the guest if it is using
> > > noncacheable memory attributes at stage-1.
> > >
> > > I imagine the actual bottleneck is the DSB required after every
> > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > >
> > > - Perform the appropriate CMOs for every valid leaf entry *without*
> > > issuing a DSB.
> > >
> > > - Elide TLBIs entirely that take place in the middle of the walk
> > >
> > > - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > completed and the invalid PTEs are made visible to the hardware
> > > walkers. This should be done implicitly by the TLBI implementation
> > >
> > > - Invalidate the [addr, addr + size) range of IPAs
> > >
> > > This would also avoid over-invalidating stage-1 since we blast the
> > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > >
> > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > is complete, don't you think there's a risk of race if the guest can
> > hit in the TLB even though the page was unmapped?
>
> Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> the page tables and completing the TLB invalidation. Apologies.
>
> Otherwise, the only requirement we need to uphold w/ either the MMU
> notifiers or userspace is that the translation has been invalidated at
> the time of return.
>
Actually, my concern about the race was against the hardware. If we
follow the above approach, let's say we invalidated a certain set of
PTEs, but the TLBs aren't yet invalidated. At this point, if
another vCPU accesses the range governed by the invalidated PTEs,
wouldn't it still hit in the TLB? Have I misunderstood you, or am I
missing something?
Thank you.
Raghavendra
> --
> Thanks,
> Oliver
On Tue, Apr 4, 2023 at 11:41 AM Oliver Upton <[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> > Hi Oliver,
> >
> > On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > > > Define a generic function __kvm_tlb_flush_range() to
> > > > invalidate the TLBs over a range of addresses. The
> > > > implementation accepts 'op' as a generic TLBI operation.
> > > > Upcoming patches will use this to implement IPA based
> > > > TLB invalidations (ipas2e1is).
> > > >
> > > > If the system doesn't support FEAT_TLBIRANGE, the
> > > > implementation falls back to flushing the pages one by one
> > > > for the range supplied.
> > > >
> > > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > > ---
> > > > arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> > > > 1 file changed, 18 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > > > index 43c3bc0f9544d..995ff048e8851 100644
> > > > --- a/arch/arm64/include/asm/kvm_asm.h
> > > > +++ b/arch/arm64/include/asm/kvm_asm.h
> > > > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> > > > DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> > > > #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> > > >
> > > > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do { \
> > > > + unsigned long pages, stride; \
> > > > + \
> > > > + stride = kvm_granule_size(level); \
> > >
> > > Hmm... There's a rather subtle and annoying complication here that I
> > > don't believe is handled.
> > >
> > > Similar to what I said in the last spin of the series, there is no
> > > guarantee that a range of IPAs is mapped at the exact same level
> > > throughout. Dirty logging and memslots that aren't hugepage aligned
> > > could lead to a mix of mapping levels being used within a range of the
> > > IPA space.
> > >
> > Unlike the comment on v1, the level/stride here is used to step through
> > the addresses in case the system doesn't support TLBIRANGE. The TTL hint
> > is 0.
>
> Right. So we agree that the level is not uniform throughout the provided
> range. The invalidation by IPA is also used if 'pages' is odd, even on
> systems with TLBIRANGE. We must assume the worst case here, in that the
> TLBI by IPA invalidated a single PTE-level entry. You could wind up
> over-invalidating in that case, but you'd still be correct.
>
Sure, let's always assume a stride of 4K. But with over-invalidation,
do you think the penalty is acceptable, especially when invalidating,
say, >2M blocks on systems without TLBIRANGE?
In __kvm_tlb_flush_vmid_range(), what if we just rely on the iterative
approach for invalidating an odd number of pages on systems with
TLBIRANGE, and for !TLBIRANGE systems simply invalidate the entire TLB
(as we do today)? Thoughts?
Thank you.
Raghavendra
> > That being said, do you think we can always assume the least possible
> > stride (say, 4k) and hardcode it?
> > With respect to alignment, since the function is only called while
> > breaking the table PTE, do you think it'll still be a problem even if
> > we go with the least granularity stride?
>
> I believe so. If we want to apply the range-based invalidations generally
> in KVM then we will not always be dealing with a block-aligned chunk of
> address.
>
> > > > + start = round_down(start, stride); \
> > > > + end = round_up(end, stride); \
> > > > + pages = (end - start) >> PAGE_SHIFT; \
> > > > + \
> > > > + if ((!system_supports_tlb_range() && \
> > > > + (end - start) >= (MAX_TLBI_OPS * stride)) || \
> > >
> > > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > > MAX_TLBI_OPS?
> > >
> > Derived from __flush_tlb_range(), I think the condition is used to
> > just flush everything if the range is too large to iterate and flush
> > when the system doesn't support TLBIRANGE. Probably to prevent
> > soft-lockups?
>
> Right, but you test above for system_supports_tlb_range(), meaning that
> you'd unconditionally call __kvm_tlb_flush_vmid() below.
>
> > > > + pages >= MAX_TLBI_RANGE_PAGES) { \
> > > > + __kvm_tlb_flush_vmid(mmu); \
> > > > + break; \
> > > > + } \
> > > > + \
> > > > + __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false); \
> > > > +} while (0)
>
> --
> Thanks,
> Oliver
On Tue, Apr 04, 2023 at 02:07:06PM -0700, Raghavendra Rao Ananta wrote:
> On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <[email protected]> wrote:
> >
> > On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <[email protected]> wrote:
> > > >
> > > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > > The current implementation of the stage-2 unmap walker
> > > > > traverses the entire page-table to clear and flush the TLBs
> > > > > for each entry. This could be very expensive, especially if
> > > > > the VM is not backed by hugepages. The unmap operation could be
> > > > > made efficient by disconnecting the table at the very
> > > > > top (level at which the largest block mapping can be hosted)
> > > > > and do the rest of the unmapping using free_removed_table().
> > > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > > that has been disconnected from the rest of the page-table.
> > > > >
> > > > > Suggested-by: Ricardo Koller <[email protected]>
> > > > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > > > ---
> > > > > arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 44 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +/*
> > > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > > + * be disconnected from the rest of the page-table without the need to
> > > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > > + */
> > > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > + enum kvm_pgtable_walk_flags visit)
> > > > > +{
> > > > > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > > + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > > + struct kvm_s2_mmu *mmu = ctx->arg;
> > > > > +
> > > > > + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > > + return 0;
> > > > > +
> > > > > + if (!stage2_try_break_pte(ctx, mmu))
> > > > > + return -EAGAIN;
> > > > > +
> > > > > + /*
> > > > > + * Gain back a reference for stage2_unmap_walker() to free
> > > > > + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > > + */
> > > > > + mm_ops->get_page(ctx->ptep);
> > > >
> > > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > > the refcount to 0 on the page before this ever gets called.
> > > >
> > > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > > FEAT_FWB. Without them it is possible that the host will read something
> > > > other than what was most recently written by the guest if it is using
> > > > noncacheable memory attributes at stage-1.
> > > >
> > > > I imagine the actual bottleneck is the DSB required after every
> > > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > > >
> > > > - Perform the appropriate CMOs for every valid leaf entry *without*
> > > > issuing a DSB.
> > > >
> > > > - Elide TLBIs entirely that take place in the middle of the walk
> > > >
> > > > - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > > completed and the invalid PTEs are made visible to the hardware
> > > > walkers. This should be done implicitly by the TLBI implementation
> > > >
> > > > - Invalidate the [addr, addr + size) range of IPAs
> > > >
> > > > This would also avoid over-invalidating stage-1 since we blast the
> > > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > > >
> > > Correct me if I'm wrong, but if we invalidate the TLB after the walk
> > > is complete, don't you think there's a risk of race if the guest can
> > > hit in the TLB even though the page was unmapped?
> >
> > Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> > the page tables and completing the TLB invalidation. Apologies.
> >
> > Otherwise, the only requirement we need to uphold w/ either the MMU
> > notifiers or userspace is that the translation has been invalidated at
> > the time of return.
> >
> Actually, my concern about the race was against the hardware. If we
> follow the above approach, let's say we invalidated a certain set of
> PTEs, but the TLBs aren't yet invalidated. During this point if
> another vCPU accesses the range governed by the invalidated PTEs,
> wouldn't it still hit in the TLB? Have I misunderstood you or am I
> missing something?
Yep, that's exactly what would happen. There is no way to eliminate the
race you mention, there will always be a window of time where the page
tables no longer contain a particular translation but the TLBs may still
be holding a valid entry.
This race is benign so long as we guarantee that all translations for
the affected address (i.e. in the page tables, cached in a TLB) have
been invalidated before returning to the caller. For example, MM cannot
start swapping out guest memory until it is guaranteed that the guest is
no longer writing to it.
--
Thanks,
Oliver
On Tue, Apr 4, 2023 at 2:31 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 02:07:06PM -0700, Raghavendra Rao Ananta wrote:
> > On Tue, Apr 4, 2023 at 12:19 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 10:52:01AM -0700, Raghavendra Rao Ananta wrote:
> > > > On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <[email protected]> wrote:
> > > > >
> > > > > On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > > > > > The current implementation of the stage-2 unmap walker
> > > > > > traverses the entire page-table to clear and flush the TLBs
> > > > > > for each entry. This could be very expensive, especially if
> > > > > > the VM is not backed by hugepages. The unmap operation could be
> > > > > > made efficient by disconnecting the table at the very
> > > > > > top (level at which the largest block mapping can be hosted)
> > > > > > and do the rest of the unmapping using free_removed_table().
> > > > > > If the system supports FEAT_TLBIRANGE, flush the entire range
> > > > > > that has been disconnected from the rest of the page-table.
> > > > > >
> > > > > > Suggested-by: Ricardo Koller <[email protected]>
> > > > > > Signed-off-by: Raghavendra Rao Ananta <[email protected]>
> > > > > > ---
> > > > > > arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> > > > > > 1 file changed, 44 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > index 0858d1fa85d6b..af3729d0971f2 100644
> > > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > +/*
> > > > > > + * The fast walker executes only if the unmap size is exactly equal to the
> > > > > > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > > > > > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > > > > > + * be disconnected from the rest of the page-table without the need to
> > > > > > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > > > > > + * of them. The disconnected table is freed using free_removed_table().
> > > > > > + */
> > > > > > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > > + enum kvm_pgtable_walk_flags visit)
> > > > > > +{
> > > > > > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > > > > > + kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > > > > > + struct kvm_s2_mmu *mmu = ctx->arg;
> > > > > > +
> > > > > > + if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > > > > > + return 0;
> > > > > > +
> > > > > > + if (!stage2_try_break_pte(ctx, mmu))
> > > > > > + return -EAGAIN;
> > > > > > +
> > > > > > + /*
> > > > > > + * Gain back a reference for stage2_unmap_walker() to free
> > > > > > + * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > > > > > + */
> > > > > > + mm_ops->get_page(ctx->ptep);
> > > > >
> > > > > Doesn't this run the risk of a potential UAF if the refcount was 1 before
> > > > > calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> > > > > the refcount to 0 on the page before this ever gets called.
> > > > >
> > > > > Also, AFAICT this misses the CMOs that are required on systems w/o
> > > > > FEAT_FWB. Without them it is possible that the host will read something
> > > > > other than what was most recently written by the guest if it is using
> > > > > noncacheable memory attributes at stage-1.
> > > > >
> > > > > I imagine the actual bottleneck is the DSB required after every
> > > > > CMO/TLBI. Theoretically, the unmap path could be updated to:
> > > > >
> > > > > - Perform the appropriate CMOs for every valid leaf entry *without*
> > > > > issuing a DSB.
> > > > >
> > > > > - Entirely elide the TLBIs that would take place in the middle of
> > > > >   the walk
> > > > >
> > > > > - After the walk completes, dsb(ish) to guarantee that the CMOs have
> > > > > completed and the invalid PTEs are made visible to the hardware
> > > > > walkers. This should be done implicitly by the TLBI implementation
> > > > >
> > > > > - Invalidate the [addr, addr + size) range of IPAs
> > > > >
> > > > > This would also avoid over-invalidating stage-1 since we blast the
> > > > > entire stage-1 context for every stage-2 invalidation. Thoughts?
> > > > >
> > > > Correct me if I'm wrong, but if we invalidate the TLB only after the
> > > > walk is complete, don't you think there's a risk of a race, where the
> > > > guest could still hit in the TLB even though the page was unmapped?
> > >
> > > Yeah, we'd need to do the CMOs _after_ making the translation invalid in
> > > the page tables and completing the TLB invalidation. Apologies.
> > >
> > > Otherwise, the only requirement we need to uphold w/ either the MMU
> > > notifiers or userspace is that the translation has been invalidated at
> > > the time of return.
> > >
> > Actually, my concern about the race was against the hardware. If we
> > follow the above approach, let's say we invalidated a certain set of
> > PTEs, but the TLBs aren't yet invalidated. During this window, if
> > another vCPU accesses the range governed by the invalidated PTEs,
> > wouldn't it still hit in the TLB? Have I misunderstood you, or am I
> > missing something?
>
> Yep, that's exactly what would happen. There is no way to eliminate the
> race you mention, there will always be a window of time where the page
> tables no longer contain a particular translation but the TLBs may still
> be holding a valid entry.
>
This is new to me :)
> This race is benign so long as we guarantee that all translations for
> the affected address (i.e. in the page tables, cached in a TLB) have
> been invalidated before returning to the caller. For example, MM cannot
> start swapping out guest memory until it is guaranteed that the guest is
> no longer writing to it.
>
Well, if you feel that the risk is acceptable, we can probably defer
the invalidations until the unmap walk is finished.
Thank you.
Raghavendra
> --
> Thanks,
> Oliver