2021-03-15 14:55:20

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 00/36] KVM: arm64: A stage 2 for the host

Hi all,

This is the v5 of the series previously posted here:

https://lore.kernel.org/kvmarm/[email protected]/

This basically allows us to wrap the host with a stage 2 when running in
nVHE, hence paving the way for protecting guest memory from the host in
the future (among other use-cases). For more details about the
motivation and the design angle taken here, I would recommend to have a
look at the cover letter of v1, and/or to watch these presentations at
LPC [1] and KVM forum 2020 [2].

Changes since v3:

- simplified the infrastructure allowing to copy feature registers for
use at EL2;

- reworked the page-ownership path in the pgtable code to use an
invalid PA instead of setting a valid bit upfront in the map() path;

- refactored hyp_map_set_prot_attr() to match its stage-2 counterpart;

- and a handful of small cleanups / comestic changes.

This series depends on Will's vCPU context fix ([3]) and Marc's PMU
fixes ([4]). And here's a branch with all the goodies applied:

https://android-kvm.googlesource.com/linux qperret/host-stage2-v5

Thanks,
Quentin

[1] https://youtu.be/54q6RzS9BpQ?t=10859
[2] https://youtu.be/wY-u6n75iXc
[3] https://lore.kernel.org/kvmarm/[email protected]/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pmu-undef-NV

Quentin Perret (33):
KVM: arm64: Initialize kvm_nvhe_init_params early
KVM: arm64: Avoid free_page() in page-table allocator
KVM: arm64: Factor memory allocation out of pgtable.c
KVM: arm64: Introduce a BSS section for use at Hyp
KVM: arm64: Make kvm_call_hyp() a function call at Hyp
KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
KVM: arm64: Introduce an early Hyp page allocator
KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
KVM: arm64: Introduce a Hyp buddy page allocator
KVM: arm64: Enable access to sanitized CPU features at EL2
KVM: arm64: Provide __flush_dcache_area at EL2
KVM: arm64: Factor out vector address calculation
arm64: asm: Provide set_sctlr_el2 macro
KVM: arm64: Prepare the creation of s1 mappings at EL2
KVM: arm64: Elevate hypervisor mappings creation at EL2
KVM: arm64: Use kvm_arch for stage 2 pgtable
KVM: arm64: Use kvm_arch in kvm_s2_mmu
KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
KVM: arm64: Refactor kvm_arm_setup_stage2()
KVM: arm64: Refactor __load_guest_stage2()
KVM: arm64: Refactor __populate_fault_info()
KVM: arm64: Make memcache anonymous in pgtable allocator
KVM: arm64: Reserve memory for host stage 2
KVM: arm64: Sort the hypervisor memblocks
KVM: arm64: Always zero invalid PTEs
KVM: arm64: Use page-table to track page ownership
KVM: arm64: Refactor the *_map_set_prot_attr() helpers
KVM: arm64: Add kvm_pgtable_stage2_find_range()
KVM: arm64: Provide sanitized mmfr* registers at EL2
KVM: arm64: Wrap the host with a stage 2
KVM: arm64: Page-align the .hyp sections
KVM: arm64: Disable PMU support in protected mode
KVM: arm64: Protect the .hyp sections from the host

Will Deacon (3):
arm64: lib: Annotate {clear,copy}_page() as position-independent
KVM: arm64: Link position-independent string routines into .hyp.text
arm64: kvm: Add standalone ticket spinlock implementation for use at
hyp

arch/arm64/include/asm/assembler.h | 14 +-
arch/arm64/include/asm/cpufeature.h | 1 +
arch/arm64/include/asm/hyp_image.h | 7 +
arch/arm64/include/asm/kvm_asm.h | 9 +
arch/arm64/include/asm/kvm_cpufeature.h | 19 +
arch/arm64/include/asm/kvm_host.h | 19 +-
arch/arm64/include/asm/kvm_hyp.h | 8 +
arch/arm64/include/asm/kvm_mmu.h | 23 +-
arch/arm64/include/asm/kvm_pgtable.h | 128 +++++-
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/asm-offsets.c | 3 +
arch/arm64/kernel/cpufeature.c | 13 +
arch/arm64/kernel/image-vars.h | 30 ++
arch/arm64/kernel/vmlinux.lds.S | 74 ++--
arch/arm64/kvm/arm.c | 199 +++++++--
arch/arm64/kvm/hyp/Makefile | 2 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 34 +-
arch/arm64/kvm/hyp/include/nvhe/early_alloc.h | 14 +
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 68 ++++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 36 ++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 52 +++
arch/arm64/kvm/hyp/include/nvhe/mm.h | 96 +++++
arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 +++++
arch/arm64/kvm/hyp/nvhe/Makefile | 9 +-
arch/arm64/kvm/hyp/nvhe/cache.S | 13 +
arch/arm64/kvm/hyp/nvhe/early_alloc.c | 54 +++
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 42 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 69 ++++
arch/arm64/kvm/hyp/nvhe/hyp-smp.c | 7 +
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 279 +++++++++++++
arch/arm64/kvm/hyp/nvhe/mm.c | 173 ++++++++
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 195 +++++++++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 4 +-
arch/arm64/kvm/hyp/nvhe/setup.c | 214 ++++++++++
arch/arm64/kvm/hyp/nvhe/stub.c | 22 +
arch/arm64/kvm/hyp/nvhe/switch.c | 12 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
arch/arm64/kvm/hyp/pgtable.c | 378 ++++++++++++++----
arch/arm64/kvm/hyp/reserved_mem.c | 113 ++++++
arch/arm64/kvm/mmu.c | 115 +++++-
arch/arm64/kvm/perf.c | 3 +-
arch/arm64/kvm/pmu.c | 8 +-
arch/arm64/kvm/reset.c | 42 +-
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/lib/clear_page.S | 4 +-
arch/arm64/lib/copy_page.S | 4 +-
arch/arm64/mm/init.c | 3 +
48 files changed, 2488 insertions(+), 244 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

--
2.31.0.rc2.261.g7f71774620-goog


2021-03-15 14:55:31

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

When KVM runs in protected nVHE mode, make use of a stage 2 page-table
to give the hypervisor some control over the host memory accesses. The
host stage 2 is created lazily using large block mappings if possible,
and will default to page mappings in absence of a better solution.

From this point on, memory accesses from the host to protected memory
regions (e.g. not 'owned' by the host) are fatal and lead to hyp_panic().

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/kernel/image-vars.h | 3 +
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 34 +++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 1 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 246 ++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 5 +
arch/arm64/kvm/hyp/nvhe/switch.c | 7 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
11 files changed, 317 insertions(+), 7 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 6dce860f8bca..b127af02bd45 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -61,6 +61,7 @@
#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
#define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
#define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
+#define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19

#ifndef __ASSEMBLY__

diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 940c378fa837..d5dc2b792651 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -131,6 +131,9 @@ KVM_NVHE_ALIAS(__hyp_bss_end);
KVM_NVHE_ALIAS(__hyp_rodata_start);
KVM_NVHE_ALIAS(__hyp_rodata_end);

+/* pKVM static key */
+KVM_NVHE_ALIAS(kvm_protected_mode_initialized);
+
#endif /* CONFIG_KVM */

#endif /* __ARM64_KERNEL_IMAGE_VARS_H */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d474eec606a3..7e6a81079652 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1889,12 +1889,22 @@ static int init_hyp_mode(void)
return err;
}

+void _kvm_host_prot_finalize(void *discard)
+{
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_prot_finalize));
+}
+
static int finalize_hyp_mode(void)
{
if (!is_protected_kvm_enabled())
return 0;

+ /*
+ * Flip the static key upfront as that may no longer be possible
+ * once the host stage 2 is installed.
+ */
static_branch_enable(&kvm_protected_mode_initialized);
+ on_each_cpu(_kvm_host_prot_finalize, NULL, 1);

return 0;
}
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
new file mode 100644
index 000000000000..d293cb328cc4
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#ifndef __KVM_NVHE_MEM_PROTECT__
+#define __KVM_NVHE_MEM_PROTECT__
+#include <linux/kvm_host.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/virt.h>
+#include <nvhe/spinlock.h>
+
+struct host_kvm {
+ struct kvm_arch arch;
+ struct kvm_pgtable pgt;
+ struct kvm_pgtable_mm_ops mm_ops;
+ hyp_spinlock_t lock;
+};
+extern struct host_kvm host_kvm;
+
+int __pkvm_prot_finalize(void);
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool);
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
+
+static __always_inline void __load_host_stage2(void)
+{
+ if (static_branch_likely(&kvm_protected_mode_initialized))
+ __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+ else
+ write_sysreg(0, vttbr_el2);
+}
+#endif /* __KVM_NVHE_MEM_PROTECT__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index b334354b8dd0..f55201a7ff33 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -14,7 +14,7 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
- cache.o setup.o mm.o
+ cache.o setup.o mm.o mem_protect.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
index a50ad9e9fc05..c164045af238 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
@@ -119,6 +119,7 @@ alternative_else_nop_endif

/* Invalidate the stale TLBs from Bootloader */
tlbi alle2
+ tlbi vmalls12e1
dsb sy

/*
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index ae6503c9be15..f47028d3fd0a 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -13,6 +13,7 @@
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>

+#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
#include <nvhe/trap_handler.h>

@@ -151,6 +152,10 @@ static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ct
cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
}

+static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt)
+{
+ cpu_reg(host_ctxt, 1) = __pkvm_prot_finalize();
+}
typedef void (*hcall_t)(struct kvm_cpu_context *);

#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -174,6 +179,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_cpu_set_vector),
HANDLE_FUNC(__pkvm_create_mappings),
HANDLE_FUNC(__pkvm_create_private_mapping),
+ HANDLE_FUNC(__pkvm_prot_finalize),
};

static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
@@ -226,6 +232,11 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
case ESR_ELx_EC_SMC64:
handle_host_smc(host_ctxt);
break;
+ case ESR_ELx_EC_IABT_LOW:
+ fallthrough;
+ case ESR_ELx_EC_DABT_LOW:
+ handle_host_mem_abort(host_ctxt);
+ break;
default:
hyp_panic();
}
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
new file mode 100644
index 000000000000..5c88a325e6fc
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_cpufeature.h>
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/stage2_pgtable.h>
+
+#include <hyp/switch.h>
+
+#include <nvhe/gfp.h>
+#include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+
+extern unsigned long hyp_nr_cpus;
+struct host_kvm host_kvm;
+
+struct hyp_pool host_s2_mem;
+struct hyp_pool host_s2_dev;
+
+static void *host_s2_zalloc_pages_exact(size_t size)
+{
+ return hyp_alloc_pages(&host_s2_mem, get_order(size));
+}
+
+static void *host_s2_zalloc_page(void *pool)
+{
+ return hyp_alloc_pages(pool, 0);
+}
+
+static int prepare_s2_pools(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+ unsigned long nr_pages, pfn;
+ int ret;
+
+ pfn = hyp_virt_to_pfn(mem_pgt_pool);
+ nr_pages = host_s2_mem_pgtable_pages();
+ ret = hyp_pool_init(&host_s2_mem, pfn, nr_pages, 0);
+ if (ret)
+ return ret;
+
+ pfn = hyp_virt_to_pfn(dev_pgt_pool);
+ nr_pages = host_s2_dev_pgtable_pages();
+ ret = hyp_pool_init(&host_s2_dev, pfn, nr_pages, 0);
+ if (ret)
+ return ret;
+
+ host_kvm.mm_ops = (struct kvm_pgtable_mm_ops) {
+ .zalloc_pages_exact = host_s2_zalloc_pages_exact,
+ .zalloc_page = host_s2_zalloc_page,
+ .phys_to_virt = hyp_phys_to_virt,
+ .virt_to_phys = hyp_virt_to_phys,
+ .page_count = hyp_page_count,
+ .get_page = hyp_get_page,
+ .put_page = hyp_put_page,
+ };
+
+ return 0;
+}
+
+static void prepare_host_vtcr(void)
+{
+ u32 parange, phys_shift;
+ u64 mmfr0, mmfr1;
+
+ mmfr0 = arm64_ftr_reg_id_aa64mmfr0_el1.sys_val;
+ mmfr1 = arm64_ftr_reg_id_aa64mmfr1_el1.sys_val;
+
+ /* The host stage 2 is id-mapped, so use parange for T0SZ */
+ parange = kvm_get_parange(mmfr0);
+ phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
+
+ host_kvm.arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+}
+
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+ struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
+ int ret;
+
+ prepare_host_vtcr();
+ hyp_spin_lock_init(&host_kvm.lock);
+
+ ret = prepare_s2_pools(mem_pgt_pool, dev_pgt_pool);
+ if (ret)
+ return ret;
+
+ ret = kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch,
+ &host_kvm.mm_ops);
+ if (ret)
+ return ret;
+
+ mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd);
+ mmu->arch = &host_kvm.arch;
+ mmu->pgt = &host_kvm.pgt;
+ mmu->vmid.vmid_gen = 0;
+ mmu->vmid.vmid = 0;
+
+ return 0;
+}
+
+int __pkvm_prot_finalize(void)
+{
+ struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
+ struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params);
+
+ params->vttbr = kvm_get_vttbr(mmu);
+ params->vtcr = host_kvm.arch.vtcr;
+ params->hcr_el2 |= HCR_VM;
+ if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+ params->hcr_el2 |= HCR_FWB;
+ kvm_flush_dcache_to_poc(params, sizeof(*params));
+
+ write_sysreg(params->hcr_el2, hcr_el2);
+ __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+
+ /*
+ * Make sure to have an ISB before the TLB maintenance below but only
+ * when __load_stage2() doesn't include one already.
+ */
+ asm(ALTERNATIVE("isb", "nop", ARM64_WORKAROUND_SPECULATIVE_AT));
+
+ /* Invalidate stale HCR bits that may be cached in TLBs */
+ __tlbi(vmalls12e1);
+ dsb(nsh);
+ isb();
+
+ return 0;
+}
+
+static int host_stage2_unmap_dev_all(void)
+{
+ struct kvm_pgtable *pgt = &host_kvm.pgt;
+ struct memblock_region *reg;
+ u64 addr = 0;
+ int i, ret;
+
+ /* Unmap all non-memory regions to recycle the pages */
+ for (i = 0; i < hyp_memblock_nr; i++, addr = reg->base + reg->size) {
+ reg = &hyp_memory[i];
+ ret = kvm_pgtable_stage2_unmap(pgt, addr, reg->base - addr);
+ if (ret)
+ return ret;
+ }
+ return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr);
+}
+
+static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range)
+{
+ int cur, left = 0, right = hyp_memblock_nr;
+ struct memblock_region *reg;
+ phys_addr_t end;
+
+ range->start = 0;
+ range->end = ULONG_MAX;
+
+ /* The list of memblock regions is sorted, binary search it */
+ while (left < right) {
+ cur = (left + right) >> 1;
+ reg = &hyp_memory[cur];
+ end = reg->base + reg->size;
+ if (addr < reg->base) {
+ right = cur;
+ range->end = reg->base;
+ } else if (addr >= end) {
+ left = cur + 1;
+ range->start = end;
+ } else {
+ range->start = reg->base;
+ range->end = end;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static inline int __host_stage2_idmap(u64 start, u64 end,
+ enum kvm_pgtable_prot prot,
+ struct hyp_pool *pool)
+{
+ return kvm_pgtable_stage2_map(&host_kvm.pgt, start, end - start, start,
+ prot, pool);
+}
+
+static int host_stage2_idmap(u64 addr)
+{
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+ struct kvm_mem_range range;
+ bool is_memory = find_mem_range(addr, &range);
+ struct hyp_pool *pool = is_memory ? &host_s2_mem : &host_s2_dev;
+ int ret;
+
+ if (is_memory)
+ prot |= KVM_PGTABLE_PROT_X;
+
+ hyp_spin_lock(&host_kvm.lock);
+ ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
+ if (ret)
+ goto unlock;
+
+ ret = __host_stage2_idmap(range.start, range.end, prot, pool);
+ if (is_memory || ret != -ENOMEM)
+ goto unlock;
+
+ /*
+ * host_s2_mem has been provided with enough pages to cover all of
+ * memory with page granularity, so we should never hit the ENOMEM case.
+ * However, it is difficult to know how much of the MMIO range we will
+ * need to cover upfront, so we may need to 'recycle' the pages if we
+ * run out.
+ */
+ ret = host_stage2_unmap_dev_all();
+ if (ret)
+ goto unlock;
+
+ ret = __host_stage2_idmap(range.start, range.end, prot, pool);
+
+unlock:
+ hyp_spin_unlock(&host_kvm.lock);
+
+ return ret;
+}
+
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
+{
+ struct kvm_vcpu_fault_info fault;
+ u64 esr, addr;
+ int ret = 0;
+
+ esr = read_sysreg_el2(SYS_ESR);
+ if (!__get_fault_info(esr, &fault))
+ hyp_panic();
+
+ addr = (fault.hpfar_el2 & HPFAR_MASK) << 8;
+ ret = host_stage2_idmap(addr);
+ if (ret && ret != -EAGAIN)
+ hyp_panic();
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index c1a3e7e0ebbc..7488f53b0aa2 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -12,6 +12,7 @@
#include <nvhe/early_alloc.h>
#include <nvhe/gfp.h>
#include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
#include <nvhe/trap_handler.h>

@@ -157,6 +158,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;

+ ret = kvm_host_prepare_stage2(host_s2_mem_pgt_base, host_s2_dev_pgt_base);
+ if (ret)
+ goto out;
+
pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
.zalloc_page = hyp_zalloc_hyp_page,
.phys_to_virt = hyp_phys_to_virt,
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 979a76cdf9fb..31bc1a843bf8 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -28,6 +28,8 @@
#include <asm/processor.h>
#include <asm/thread_info.h>

+#include <nvhe/mem_protect.h>
+
/* Non-VHE specific context */
DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
@@ -102,11 +104,6 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
write_sysreg(__kvm_hyp_host_vector, vbar_el2);
}

-static void __load_host_stage2(void)
-{
- write_sysreg(0, vttbr_el2);
-}
-
/* Save VGICv3 state on non-VHE systems */
static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
{
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index fbde89a2c6e8..255a23a1b2db 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -8,6 +8,8 @@
#include <asm/kvm_mmu.h>
#include <asm/tlbflush.h>

+#include <nvhe/mem_protect.h>
+
struct tlb_inv_context {
u64 tcr;
};
@@ -43,7 +45,7 @@ static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,

static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
{
- write_sysreg(0, vttbr_el2);
+ __load_host_stage2();

if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
/* Ensure write of the host VMID */
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:55:33

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 34/36] KVM: arm64: Page-align the .hyp sections

We will soon unmap the .hyp sections from the host stage 2 in Protected
nVHE mode, which obviously works with at least page granularity, so make
sure to align them correctly.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kernel/vmlinux.lds.S | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index e96173ce211b..709d2c433c5e 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -15,9 +15,11 @@

#define HYPERVISOR_DATA_SECTIONS \
HYP_SECTION_NAME(.rodata) : { \
+ . = ALIGN(PAGE_SIZE); \
__hyp_rodata_start = .; \
*(HYP_SECTION_NAME(.data..ro_after_init)) \
*(HYP_SECTION_NAME(.rodata)) \
+ . = ALIGN(PAGE_SIZE); \
__hyp_rodata_end = .; \
}

@@ -72,21 +74,14 @@ ENTRY(_text)
jiffies = jiffies_64;

#define HYPERVISOR_TEXT \
- /* \
- * Align to 4 KB so that \
- * a) the HYP vector table is at its minimum \
- * alignment of 2048 bytes \
- * b) the HYP init code will not cross a page \
- * boundary if its size does not exceed \
- * 4 KB (see related ASSERT() below) \
- */ \
- . = ALIGN(SZ_4K); \
+ . = ALIGN(PAGE_SIZE); \
__hyp_idmap_text_start = .; \
*(.hyp.idmap.text) \
__hyp_idmap_text_end = .; \
__hyp_text_start = .; \
*(.hyp.text) \
HYPERVISOR_EXTABLE \
+ . = ALIGN(PAGE_SIZE); \
__hyp_text_end = .;

#define IDMAP_TEXT \
@@ -322,11 +317,12 @@ SECTIONS
#include "image-vars.h"

/*
- * The HYP init code and ID map text can't be longer than a page each,
- * and should not cross a page boundary.
+ * The HYP init code and ID map text can't be longer than a page each. The
+ * former is page-aligned, but the latter may not be with 16K or 64K pages, so
+ * it should also not cross a page boundary.
*/
-ASSERT(__hyp_idmap_text_end - (__hyp_idmap_text_start & ~(SZ_4K - 1)) <= SZ_4K,
- "HYP init code too big or misaligned")
+ASSERT(__hyp_idmap_text_end - __hyp_idmap_text_start <= PAGE_SIZE,
+ "HYP init code too big")
ASSERT(__idmap_text_end - (__idmap_text_start & ~(SZ_4K - 1)) <= SZ_4K,
"ID map text too big or misaligned")
#ifdef CONFIG_HIBERNATION
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:55:35

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 14/36] KVM: arm64: Provide __flush_dcache_area at EL2

We will need to do cache maintenance at EL2 soon, so compile a copy of
__flush_dcache_area at EL2, and provide a copy of arm64_ftr_reg_ctrel0
as it is needed by the read_ctr macro.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_cpufeature.h | 2 ++
arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++++
arch/arm64/kvm/sys_regs.c | 1 +
4 files changed, 18 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S

diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
index 3fd9f60d2180..efba1b89b8a4 100644
--- a/arch/arm64/include/asm/kvm_cpufeature.h
+++ b/arch/arm64/include/asm/kvm_cpufeature.h
@@ -13,3 +13,5 @@
#define KVM_HYP_CPU_FTR_REG(name) extern struct arm64_ftr_reg kvm_nvhe_sym(name)
#endif
#endif
+
+KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_ctrel0);
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 6894a917f290..42dde4bb80b1 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -13,7 +13,8 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
+ cache.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/cache.S b/arch/arm64/kvm/hyp/nvhe/cache.S
new file mode 100644
index 000000000000..36cef6915428
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/cache.S
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Code copied from arch/arm64/mm/cache.S.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include <asm/alternative.h>
+
+SYM_FUNC_START_PI(__flush_dcache_area)
+ dcache_by_line_op civac, sy, x0, x1, x2, x3
+ ret
+SYM_FUNC_END_PI(__flush_dcache_area)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 6c5d133689ae..3ec34c25e877 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2783,6 +2783,7 @@ struct __ftr_reg_copy_entry {
u32 sys_id;
struct arm64_ftr_reg *dst;
} hyp_ftr_regs[] __initdata = {
+ CPU_FTR_REG_HYP_COPY(SYS_CTR_EL0, arm64_ftr_reg_ctrel0),
};

void __init setup_kvm_el2_caps(void)
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:56:26

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 16/36] arm64: asm: Provide set_sctlr_el2 macro

We will soon need to turn the EL2 stage 1 MMU on and off in nVHE
protected mode, so refactor the set_sctlr_el1 macro to make it usable
for that purpose.

Acked-by: Will Deacon <[email protected]>
Suggested-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/assembler.h | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ca31594d3d6c..fb651c1f26e9 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -676,11 +676,11 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
.endm

/*
- * Set SCTLR_EL1 to the passed value, and invalidate the local icache
+ * Set SCTLR_ELx to the @reg value, and invalidate the local icache
* in the process. This is called when setting the MMU on.
*/
-.macro set_sctlr_el1, reg
- msr sctlr_el1, \reg
+.macro set_sctlr, sreg, reg
+ msr \sreg, \reg
isb
/*
* Invalidate the local I-cache so that any instructions fetched
@@ -692,6 +692,14 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
isb
.endm

+.macro set_sctlr_el1, reg
+ set_sctlr sctlr_el1, \reg
+.endm
+
+.macro set_sctlr_el2, reg
+ set_sctlr sctlr_el2, \reg
+.endm
+
/*
* Check whether to yield to another runnable task from kernel mode NEON code
* (which runs with preemption disabled).
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:56:31

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 20/36] KVM: arm64: Use kvm_arch in kvm_s2_mmu

In order to make use of the stage 2 pgtable code for the host stage 2,
change kvm_s2_mmu to use a kvm_arch pointer in lieu of the kvm pointer,
as the host will have the former but not the latter.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 2 +-
arch/arm64/include/asm/kvm_mmu.h | 6 +++++-
arch/arm64/kvm/mmu.c | 8 ++++----
3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b9d45a1f8538..90565782ce3e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -94,7 +94,7 @@ struct kvm_s2_mmu {
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;

- struct kvm *kvm;
+ struct kvm_arch *arch;
};

struct kvm_arch_memory_slot {
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index ce02a4052dcf..6f743e20cb06 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -272,7 +272,7 @@ static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
*/
static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
{
- write_sysreg(kern_hyp_va(mmu->kvm)->arch.vtcr, vtcr_el2);
+ write_sysreg(kern_hyp_va(mmu->arch)->vtcr, vtcr_el2);
write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);

/*
@@ -283,5 +283,9 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
}

+static inline struct kvm *kvm_s2_mmu_to_kvm(struct kvm_s2_mmu *mmu)
+{
+ return container_of(mmu->arch, struct kvm, arch);
+}
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 41f9c03cbcc3..3257cadfab24 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -165,7 +165,7 @@ static void *kvm_host_va(phys_addr_t phys)
static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size,
bool may_block)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
phys_addr_t end = start + size;

assert_spin_locked(&kvm->mmu_lock);
@@ -470,7 +470,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
for_each_possible_cpu(cpu)
*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;

- mmu->kvm = kvm;
+ mmu->arch = &kvm->arch;
mmu->pgt = pgt;
mmu->pgd_phys = __pa(pgt->pgd);
mmu->vmid.vmid_gen = 0;
@@ -552,7 +552,7 @@ void stage2_unmap_vm(struct kvm *kvm)

void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;

spin_lock(&kvm->mmu_lock);
@@ -621,7 +621,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
*/
static void stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_wrprotect);
}

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:56:52

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 12/36] KVM: arm64: Introduce a Hyp buddy page allocator

When memory protection is enabled, the hyp code will require a basic
form of memory management in order to allocate and free memory pages at
EL2. This is needed for various use-cases, including the creation of hyp
mappings or the allocation of stage 2 page tables.

To address these use-case, introduce a simple memory allocator in the
hyp code. The allocator is designed as a conventional 'buddy allocator',
working with a page granularity. It allows to allocate and free
physically contiguous pages from memory 'pools', with a guaranteed order
alignment in the PA space. Each page in a memory pool is associated
with a struct hyp_page which holds the page's metadata, including its
refcount, as well as its current order, hence mimicking the kernel's
buddy system in the GFP infrastructure. The hyp_page metadata are made
accessible through a hyp_vmemmap, following the concept of
SPARSE_VMEMMAP in the kernel.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 68 ++++++++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 28 ++++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 195 +++++++++++++++++++++++
4 files changed, 292 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c

diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
new file mode 100644
index 000000000000..55b3f0ce5bc8
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_GFP_H
+#define __KVM_HYP_GFP_H
+
+#include <linux/list.h>
+
+#include <nvhe/memory.h>
+#include <nvhe/spinlock.h>
+
+#define HYP_NO_ORDER UINT_MAX
+
+struct hyp_pool {
+ /*
+ * Spinlock protecting concurrent changes to the memory pool as well as
+ * the struct hyp_page of the pool's pages until we have a proper atomic
+ * API at EL2.
+ */
+ hyp_spinlock_t lock;
+ struct list_head free_area[MAX_ORDER];
+ phys_addr_t range_start;
+ phys_addr_t range_end;
+ unsigned int max_order;
+};
+
+static inline void hyp_page_ref_inc(struct hyp_page *p)
+{
+ struct hyp_pool *pool = hyp_page_to_pool(p);
+
+ hyp_spin_lock(&pool->lock);
+ p->refcount++;
+ hyp_spin_unlock(&pool->lock);
+}
+
+static inline int hyp_page_ref_dec_and_test(struct hyp_page *p)
+{
+ struct hyp_pool *pool = hyp_page_to_pool(p);
+ int ret;
+
+ hyp_spin_lock(&pool->lock);
+ p->refcount--;
+ ret = (p->refcount == 0);
+ hyp_spin_unlock(&pool->lock);
+
+ return ret;
+}
+
+static inline void hyp_set_page_refcounted(struct hyp_page *p)
+{
+ struct hyp_pool *pool = hyp_page_to_pool(p);
+
+ hyp_spin_lock(&pool->lock);
+ if (p->refcount) {
+ hyp_spin_unlock(&pool->lock);
+ hyp_panic();
+ }
+ p->refcount = 1;
+ hyp_spin_unlock(&pool->lock);
+}
+
+/* Allocation */
+void *hyp_alloc_pages(struct hyp_pool *pool, unsigned int order);
+void hyp_get_page(void *addr);
+void hyp_put_page(void *addr);
+
+/* Used pages cannot be freed */
+int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
+ unsigned int reserved_pages);
+#endif /* __KVM_HYP_GFP_H */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
index 3e49eaa7e682..d2fb307c5952 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
@@ -6,7 +6,17 @@

#include <linux/types.h>

+struct hyp_pool;
+struct hyp_page {
+ unsigned int refcount;
+ unsigned int order;
+ struct hyp_pool *pool;
+ struct list_head node;
+};
+
extern s64 hyp_physvirt_offset;
+extern u64 __hyp_vmemmap;
+#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)

#define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
#define __hyp_va(phys) ((void *)((phys_addr_t)(phys) - hyp_physvirt_offset))
@@ -21,4 +31,22 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
return __hyp_pa(addr);
}

+#define hyp_phys_to_pfn(phys) ((phys) >> PAGE_SHIFT)
+#define hyp_pfn_to_phys(pfn) ((phys_addr_t)((pfn) << PAGE_SHIFT))
+#define hyp_phys_to_page(phys) (&hyp_vmemmap[hyp_phys_to_pfn(phys)])
+#define hyp_virt_to_page(virt) hyp_phys_to_page(__hyp_pa(virt))
+#define hyp_virt_to_pfn(virt) hyp_phys_to_pfn(__hyp_pa(virt))
+
+#define hyp_page_to_pfn(page) ((struct hyp_page *)(page) - hyp_vmemmap)
+#define hyp_page_to_phys(page) hyp_pfn_to_phys((hyp_page_to_pfn(page)))
+#define hyp_page_to_virt(page) __hyp_va(hyp_page_to_phys(page))
+#define hyp_page_to_pool(page) (((struct hyp_page *)page)->pool)
+
+static inline int hyp_page_count(void *addr)
+{
+ struct hyp_page *p = hyp_virt_to_page(addr);
+
+ return p->refcount;
+}
+
#endif /* __KVM_HYP_MEMORY_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 144da72ad510..6894a917f290 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -13,7 +13,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
new file mode 100644
index 000000000000..237e03bf0cb1
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <asm/kvm_hyp.h>
+#include <nvhe/gfp.h>
+
+u64 __hyp_vmemmap;
+
+/*
+ * Index the hyp_vmemmap to find a potential buddy page, but make no assumption
+ * about its current state.
+ *
+ * Example buddy-tree for a 4-pages physically contiguous pool:
+ *
+ * o : Page 3
+ * /
+ * o-o : Page 2
+ * /
+ * / o : Page 1
+ * / /
+ * o---o-o : Page 0
+ * Order 2 1 0
+ *
+ * Example of requests on this pool:
+ * __find_buddy_nocheck(pool, page 0, order 0) => page 1
+ * __find_buddy_nocheck(pool, page 0, order 1) => page 2
+ * __find_buddy_nocheck(pool, page 1, order 0) => page 0
+ * __find_buddy_nocheck(pool, page 2, order 0) => page 3
+ */
+static struct hyp_page *__find_buddy_nocheck(struct hyp_pool *pool,
+ struct hyp_page *p,
+ unsigned int order)
+{
+ phys_addr_t addr = hyp_page_to_phys(p);
+
+ addr ^= (PAGE_SIZE << order);
+
+ /*
+ * Don't return a page outside the pool range -- it belongs to
+ * something else and may not be mapped in hyp_vmemmap.
+ */
+ if (addr < pool->range_start || addr >= pool->range_end)
+ return NULL;
+
+ return hyp_phys_to_page(addr);
+}
+
+/* Find a buddy page currently available for allocation */
+static struct hyp_page *__find_buddy_avail(struct hyp_pool *pool,
+ struct hyp_page *p,
+ unsigned int order)
+{
+ struct hyp_page *buddy = __find_buddy_nocheck(pool, p, order);
+
+ if (!buddy || buddy->order != order || list_empty(&buddy->node))
+ return NULL;
+
+ return buddy;
+
+}
+
+static void __hyp_attach_page(struct hyp_pool *pool,
+ struct hyp_page *p)
+{
+ unsigned int order = p->order;
+ struct hyp_page *buddy;
+
+ memset(hyp_page_to_virt(p), 0, PAGE_SIZE << p->order);
+
+ /*
+ * Only the first struct hyp_page of a high-order page (otherwise known
+ * as the 'head') should have p->order set. The non-head pages should
+ * have p->order = HYP_NO_ORDER. Here @p may no longer be the head
+ * after coallescing, so make sure to mark it HYP_NO_ORDER proactively.
+ */
+ p->order = HYP_NO_ORDER;
+ for (; (order + 1) < pool->max_order; order++) {
+ buddy = __find_buddy_avail(pool, p, order);
+ if (!buddy)
+ break;
+
+ /* Take the buddy out of its list, and coallesce with @p */
+ list_del_init(&buddy->node);
+ buddy->order = HYP_NO_ORDER;
+ p = min(p, buddy);
+ }
+
+ /* Mark the new head, and insert it */
+ p->order = order;
+ list_add_tail(&p->node, &pool->free_area[order]);
+}
+
+static void hyp_attach_page(struct hyp_page *p)
+{
+ struct hyp_pool *pool = hyp_page_to_pool(p);
+
+ hyp_spin_lock(&pool->lock);
+ __hyp_attach_page(pool, p);
+ hyp_spin_unlock(&pool->lock);
+}
+
+static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
+ struct hyp_page *p,
+ unsigned int order)
+{
+ struct hyp_page *buddy;
+
+ list_del_init(&p->node);
+ while (p->order > order) {
+ /*
+ * The buddy of order n - 1 currently has HYP_NO_ORDER as it
+ * is covered by a higher-level page (whose head is @p). Use
+ * __find_buddy_nocheck() to find it and inject it in the
+ * free_list[n - 1], effectively splitting @p in half.
+ */
+ p->order--;
+ buddy = __find_buddy_nocheck(pool, p, p->order);
+ buddy->order = p->order;
+ list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
+ }
+
+ return p;
+}
+
+void hyp_put_page(void *addr)
+{
+ struct hyp_page *p = hyp_virt_to_page(addr);
+
+ if (hyp_page_ref_dec_and_test(p))
+ hyp_attach_page(p);
+}
+
+void hyp_get_page(void *addr)
+{
+ struct hyp_page *p = hyp_virt_to_page(addr);
+
+ hyp_page_ref_inc(p);
+}
+
+void *hyp_alloc_pages(struct hyp_pool *pool, unsigned int order)
+{
+ unsigned int i = order;
+ struct hyp_page *p;
+
+ hyp_spin_lock(&pool->lock);
+
+ /* Look for a high-enough-order page */
+ while (i < pool->max_order && list_empty(&pool->free_area[i]))
+ i++;
+ if (i >= pool->max_order) {
+ hyp_spin_unlock(&pool->lock);
+ return NULL;
+ }
+
+ /* Extract it from the tree at the right order */
+ p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
+ p = __hyp_extract_page(pool, p, order);
+
+ hyp_spin_unlock(&pool->lock);
+ hyp_set_page_refcounted(p);
+
+ return hyp_page_to_virt(p);
+}
+
+int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
+ unsigned int reserved_pages)
+{
+ phys_addr_t phys = hyp_pfn_to_phys(pfn);
+ struct hyp_page *p;
+ int i;
+
+ hyp_spin_lock_init(&pool->lock);
+ pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
+ for (i = 0; i < pool->max_order; i++)
+ INIT_LIST_HEAD(&pool->free_area[i]);
+ pool->range_start = phys;
+ pool->range_end = phys + (nr_pages << PAGE_SHIFT);
+
+ /* Init the vmemmap portion */
+ p = hyp_phys_to_page(phys);
+ memset(p, 0, sizeof(*p) * nr_pages);
+ for (i = 0; i < nr_pages; i++) {
+ p[i].pool = pool;
+ INIT_LIST_HEAD(&p[i].node);
+ }
+
+ /* Attach the unused pages to the buddy tree */
+ for (i = reserved_pages; i < nr_pages; i++)
+ __hyp_attach_page(pool, &p[i]);
+
+ return 0;
+}
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:57:22

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 35/36] KVM: arm64: Disable PMU support in protected mode

The host currently writes directly in EL2 per-CPU data sections from
the PMU code when running in nVHE. In preparation for unmapping the EL2
sections from the host stage 2, disable PMU support in protected mode as
we currently do not have a use-case for it.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/perf.c | 3 ++-
arch/arm64/kvm/pmu.c | 8 ++++----
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 739164324afe..8f860ae56bb7 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -55,7 +55,8 @@ int kvm_perf_init(void)
* hardware performance counters. This could ensure the presence of
* a physical PMU and CONFIG_PERF_EVENT is selected.
*/
- if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
+ if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0
+ && !is_protected_kvm_enabled())
static_branch_enable(&kvm_arm_pmu_available);

return perf_register_guest_info_callbacks(&kvm_guest_cbs);
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index faf32a44ba04..03a6c1f4a09a 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -33,7 +33,7 @@ void kvm_set_pmu_events(u32 set, struct perf_event_attr *attr)
{
struct kvm_host_data *ctx = this_cpu_ptr_hyp_sym(kvm_host_data);

- if (!ctx || !kvm_pmu_switch_needed(attr))
+ if (!kvm_arm_support_pmu_v3() || !ctx || !kvm_pmu_switch_needed(attr))
return;

if (!attr->exclude_host)
@@ -49,7 +49,7 @@ void kvm_clr_pmu_events(u32 clr)
{
struct kvm_host_data *ctx = this_cpu_ptr_hyp_sym(kvm_host_data);

- if (!ctx)
+ if (!kvm_arm_support_pmu_v3() || !ctx)
return;

ctx->pmu_events.events_host &= ~clr;
@@ -172,7 +172,7 @@ void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu)
struct kvm_host_data *host;
u32 events_guest, events_host;

- if (!has_vhe())
+ if (!kvm_arm_support_pmu_v3() || !has_vhe())
return;

preempt_disable();
@@ -193,7 +193,7 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
struct kvm_host_data *host;
u32 events_guest, events_host;

- if (!has_vhe())
+ if (!kvm_arm_support_pmu_v3() || !has_vhe())
return;

host = this_cpu_ptr_hyp_sym(kvm_host_data);
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:57:26

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 05/36] KVM: arm64: Avoid free_page() in page-table allocator

Currently, the KVM page-table allocator uses a mix of put_page() and
free_page() calls depending on the context even though page-allocation
is always achieved using variants of __get_free_page().

Make the code consistent by using put_page() throughout, and reduce the
memory management API surface used by the page-table code. This will
ease factoring out page-allocation from pgtable.c, which is a
pre-requisite to creating page-tables at EL2.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4d177ce1d536..81fe032f34d1 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -413,7 +413,7 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
- free_page((unsigned long)kvm_pte_follow(*ptep));
+ put_page(virt_to_page(kvm_pte_follow(*ptep)));
return 0;
}

@@ -425,7 +425,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
- free_page((unsigned long)pgt->pgd);
+ put_page(virt_to_page(pgt->pgd));
pgt->pgd = NULL;
}

@@ -577,7 +577,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
if (!data->anchor)
return 0;

- free_page((unsigned long)kvm_pte_follow(*ptep));
+ put_page(virt_to_page(kvm_pte_follow(*ptep)));
put_page(virt_to_page(ptep));

if (data->anchor == ptep) {
@@ -700,7 +700,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
}

if (childp)
- free_page((unsigned long)childp);
+ put_page(virt_to_page(childp));

return 0;
}
@@ -897,7 +897,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
put_page(virt_to_page(ptep));

if (kvm_pte_table(pte, level))
- free_page((unsigned long)kvm_pte_follow(pte));
+ put_page(virt_to_page(kvm_pte_follow(pte)));

return 0;
}
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:57:29

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 30/36] KVM: arm64: Refactor the *_map_set_prot_attr() helpers

In order to ease their re-use in other code paths, refactor the
*_map_set_prot_attr() helpers to not depend on a map_data struct.
No functional change intended.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bd44e84dedc4..a5347d78293f 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -324,8 +324,7 @@ struct hyp_map_data {
struct kvm_pgtable_mm_ops *mm_ops;
};

-static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
- struct hyp_map_data *data)
+static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
u32 mtype = device ? MT_DEVICE_nGnRE : MT_NORMAL;
@@ -350,7 +349,8 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
- data->attr = attr;
+ *ptep = attr;
+
return 0;
}

@@ -407,7 +407,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
.arg = &map_data,
};

- ret = hyp_map_set_prot_attr(prot, &map_data);
+ ret = hyp_set_prot_attr(prot, &map_data.attr);
if (ret)
return ret;

@@ -500,8 +500,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
return vtcr;
}

-static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
- struct stage2_map_data *data)
+static int stage2_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
kvm_pte_t attr = device ? PAGE_S2_MEMATTR(DEVICE_nGnRE) :
@@ -521,7 +520,8 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,

attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
- data->attr = attr;
+ *ptep = attr;
+
return 0;
}

@@ -741,7 +741,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.arg = &map_data,
};

- ret = stage2_map_set_prot_attr(prot, &map_data);
+ ret = stage2_set_prot_attr(prot, &map_data.attr);
if (ret)
return ret;

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:57:34

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 29/36] KVM: arm64: Use page-table to track page ownership

As the host stage 2 will be identity mapped, all the .hyp memory regions
and/or memory pages donated to protected guestis will have to marked
invalid in the host stage 2 page-table. At the same time, the hypervisor
will need a way to track the ownership of each physical page to ensure
memory sharing or donation between entities (host, guests, hypervisor) is
legal.

In order to enable this tracking at EL2, let's use the host stage 2
page-table itself. The idea is to use the top bits of invalid mappings
to store the unique identifier of the page owner. The page-table owner
(the host) gets identifier 0 such that, at boot time, it owns the entire
IPA space as the pgd starts zeroed.

Provide kvm_pgtable_stage2_set_owner() which allows to modify the
ownership of pages in the host stage 2. It re-uses most of the map()
logic, but ends up creating invalid mappings instead. This impacts
how we do refcount as we now need to count invalid mappings when they
are used for ownership tracking.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 21 +++++
arch/arm64/kvm/hyp/pgtable.c | 127 ++++++++++++++++++++++-----
2 files changed, 124 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 4ae19247837b..683e96abdc24 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -238,6 +238,27 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
void *mc);

+/**
+ * kvm_pgtable_stage2_set_owner() - Annotate invalid mappings with metadata
+ * encoding the ownership of a page in the
+ * IPA space.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr: Base intermediate physical address to annotate.
+ * @size: Size of the annotated range.
+ * @mc: Cache of pre-allocated and zeroed memory from which to allocate
+ * page-table pages.
+ * @owner_id: Unique identifier for the owner of the page.
+ *
+ * By default, all page-tables are owned by identifier 0. This function can be
+ * used to mark portions of the IPA space as owned by other entities. When a
+ * stage 2 is used with identity-mappings, these annotations allow to use the
+ * page-table data structure as a simple rmap.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ void *mc, u8 owner_id);
+
/**
* kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f37b4179b880..bd44e84dedc4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -48,6 +48,9 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)

+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
+#define KVM_MAX_OWNER_ID 1
+
struct kvm_pgtable_walk_data {
struct kvm_pgtable *pgt;
struct kvm_pgtable_walker *walker;
@@ -67,6 +70,13 @@ static u64 kvm_granule_size(u32 level)
return BIT(kvm_granule_shift(level));
}

+#define KVM_PHYS_INVALID (-1ULL)
+
+static bool kvm_phys_is_valid(u64 phys)
+{
+ return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_PARANGE_MAX));
+}
+
static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level)
{
u64 granule = kvm_granule_size(level);
@@ -81,7 +91,10 @@ static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level)
if (granule > (end - addr))
return false;

- return IS_ALIGNED(addr, granule) && IS_ALIGNED(phys, granule);
+ if (kvm_phys_is_valid(phys) && !IS_ALIGNED(phys, granule))
+ return false;
+
+ return IS_ALIGNED(addr, granule);
}

static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level)
@@ -186,6 +199,11 @@ static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
return pte;
}

+static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
+{
+ return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
+}
+
static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag)
@@ -440,6 +458,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
struct stage2_map_data {
u64 phys;
kvm_pte_t attr;
+ u8 owner_id;

kvm_pte_t *anchor;
kvm_pte_t *childp;
@@ -506,6 +525,39 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
return 0;
}

+static bool stage2_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
+{
+ if (!kvm_pte_valid(old) || !kvm_pte_valid(new))
+ return true;
+
+ return ((old ^ new) & (~KVM_PTE_LEAF_ATTR_S2_PERMS));
+}
+
+static bool stage2_pte_is_counted(kvm_pte_t pte)
+{
+ /*
+ * The refcount tracks valid entries as well as invalid entries if they
+ * encode ownership of a page to another entity than the page-table
+ * owner, whose id is 0.
+ */
+ return !!pte;
+}
+
+static void stage2_put_pte(kvm_pte_t *ptep, struct kvm_s2_mmu *mmu, u64 addr,
+ u32 level, struct kvm_pgtable_mm_ops *mm_ops)
+{
+ /*
+ * Clear the existing PTE, and perform break-before-make with
+ * TLB maintenance if it was valid.
+ */
+ if (kvm_pte_valid(*ptep)) {
+ kvm_clear_pte(ptep);
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
+ }
+
+ mm_ops->put_page(ptep);
+}
+
static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
struct stage2_map_data *data)
@@ -517,29 +569,29 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
if (!kvm_block_mapping_supported(addr, end, phys, level))
return -E2BIG;

- new = kvm_init_valid_leaf_pte(phys, data->attr, level);
- if (kvm_pte_valid(old)) {
+ if (kvm_phys_is_valid(phys))
+ new = kvm_init_valid_leaf_pte(phys, data->attr, level);
+ else
+ new = kvm_init_invalid_leaf_owner(data->owner_id);
+
+ if (stage2_pte_is_counted(old)) {
/*
* Skip updating the PTE if we are trying to recreate the exact
* same mapping or only change the access permissions. Instead,
* the vCPU will exit one more time from guest if still needed
* and then go through the path of relaxing permissions.
*/
- if (!((old ^ new) & (~KVM_PTE_LEAF_ATTR_S2_PERMS)))
+ if (!stage2_pte_needs_update(old, new))
return -EAGAIN;

- /*
- * There's an existing different valid leaf entry, so perform
- * break-before-make.
- */
- kvm_clear_pte(ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
- mm_ops->put_page(ptep);
+ stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
}

smp_store_release(ptep, new);
- mm_ops->get_page(ptep);
- data->phys += granule;
+ if (stage2_pte_is_counted(new))
+ mm_ops->get_page(ptep);
+ if (kvm_phys_is_valid(phys))
+ data->phys += granule;
return 0;
}

@@ -574,7 +626,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
int ret;

if (data->anchor) {
- if (kvm_pte_valid(pte))
+ if (stage2_pte_is_counted(pte))
mm_ops->put_page(ptep);

return 0;
@@ -599,11 +651,8 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
* a table. Accesses beyond 'end' that fall within the new table
* will be mapped lazily.
*/
- if (kvm_pte_valid(pte)) {
- kvm_clear_pte(ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
- mm_ops->put_page(ptep);
- }
+ if (stage2_pte_is_counted(pte))
+ stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);

kvm_set_table_pte(ptep, childp, mm_ops);
mm_ops->get_page(ptep);
@@ -701,6 +750,33 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
return ret;
}

+int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ void *mc, u8 owner_id)
+{
+ int ret;
+ struct stage2_map_data map_data = {
+ .phys = KVM_PHYS_INVALID,
+ .mmu = pgt->mmu,
+ .memcache = mc,
+ .mm_ops = pgt->mm_ops,
+ .owner_id = owner_id,
+ };
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_map_walker,
+ .flags = KVM_PGTABLE_WALK_TABLE_PRE |
+ KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_TABLE_POST,
+ .arg = &map_data,
+ };
+
+ if (owner_id > KVM_MAX_OWNER_ID)
+ return -EINVAL;
+
+ ret = kvm_pgtable_walk(pgt, addr, size, &walker);
+ dsb(ishst);
+ return ret;
+}
+
static void stage2_flush_dcache(void *addr, u64 size)
{
if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
@@ -725,8 +801,13 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
kvm_pte_t pte = *ptep, *childp = NULL;
bool need_flush = false;

- if (!kvm_pte_valid(pte))
+ if (!kvm_pte_valid(pte)) {
+ if (stage2_pte_is_counted(pte)) {
+ kvm_clear_pte(ptep);
+ mm_ops->put_page(ptep);
+ }
return 0;
+ }

if (kvm_pte_table(pte, level)) {
childp = kvm_pte_follow(pte, mm_ops);
@@ -742,9 +823,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
- kvm_clear_pte(ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
- mm_ops->put_page(ptep);
+ stage2_put_pte(ptep, mmu, addr, level, mm_ops);

if (need_flush) {
stage2_flush_dcache(kvm_pte_follow(pte, mm_ops),
@@ -948,7 +1027,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct kvm_pgtable_mm_ops *mm_ops = arg;
kvm_pte_t pte = *ptep;

- if (!kvm_pte_valid(pte))
+ if (!stage2_pte_is_counted(pte))
return 0;

mm_ops->put_page(ptep);
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 14:57:39

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 06/36] KVM: arm64: Factor memory allocation out of pgtable.c

In preparation for enabling the creation of page-tables at EL2, factor
all memory allocation out of the page-table code, hence making it
re-usable with any compatible memory allocator.

No functional changes intended.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 41 +++++++++++-
arch/arm64/kvm/hyp/pgtable.c | 98 +++++++++++++++++-----------
arch/arm64/kvm/mmu.c | 66 ++++++++++++++++++-
3 files changed, 163 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 8886d43cfb11..bbe840e430cb 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -13,17 +13,50 @@

typedef u64 kvm_pte_t;

+/**
+ * struct kvm_pgtable_mm_ops - Memory management callbacks.
+ * @zalloc_page: Allocate a single zeroed memory page. The @arg parameter
+ * can be used by the walker to pass a memcache. The
+ * initial refcount of the page is 1.
+ * @zalloc_pages_exact: Allocate an exact number of zeroed memory pages. The
+ * @size parameter is in bytes, and is rounded-up to the
+ * next page boundary. The resulting allocation is
+ * physically contiguous.
+ * @free_pages_exact: Free an exact number of memory pages previously
+ * allocated by zalloc_pages_exact.
+ * @get_page: Increment the refcount on a page.
+ * @put_page: Decrement the refcount on a page. When the refcount
+ * reaches 0 the page is automatically freed.
+ * @page_count: Return the refcount of a page.
+ * @phys_to_virt: Convert a physical address into a virtual address mapped
+ * in the current context.
+ * @virt_to_phys: Convert a virtual address mapped in the current context
+ * into a physical address.
+ */
+struct kvm_pgtable_mm_ops {
+ void* (*zalloc_page)(void *arg);
+ void* (*zalloc_pages_exact)(size_t size);
+ void (*free_pages_exact)(void *addr, size_t size);
+ void (*get_page)(void *addr);
+ void (*put_page)(void *addr);
+ int (*page_count)(void *addr);
+ void* (*phys_to_virt)(phys_addr_t phys);
+ phys_addr_t (*virt_to_phys)(void *addr);
+};
+
/**
* struct kvm_pgtable - KVM page-table.
* @ia_bits: Maximum input address size, in bits.
* @start_level: Level at which the page-table walk starts.
* @pgd: Pointer to the first top-level entry of the page-table.
+ * @mm_ops: Memory management callbacks.
* @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
*/
struct kvm_pgtable {
u32 ia_bits;
u32 start_level;
kvm_pte_t *pgd;
+ struct kvm_pgtable_mm_ops *mm_ops;

/* Stage-2 only */
struct kvm_s2_mmu *mmu;
@@ -86,10 +119,12 @@ struct kvm_pgtable_walker {
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @va_bits: Maximum virtual address bits.
+ * @mm_ops: Memory management callbacks.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits);
+int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
+ struct kvm_pgtable_mm_ops *mm_ops);

/**
* kvm_pgtable_hyp_destroy() - Destroy an unused hypervisor stage-1 page-table.
@@ -126,10 +161,12 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
* kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @kvm: KVM structure representing the guest virtual machine.
+ * @mm_ops: Memory management callbacks.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm);
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+ struct kvm_pgtable_mm_ops *mm_ops);

/**
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 81fe032f34d1..b975a67d1f85 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -152,9 +152,9 @@ static kvm_pte_t kvm_phys_to_pte(u64 pa)
return pte;
}

-static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
+static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
{
- return __va(kvm_pte_to_phys(pte));
+ return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
}

static void kvm_set_invalid_pte(kvm_pte_t *ptep)
@@ -163,9 +163,10 @@ static void kvm_set_invalid_pte(kvm_pte_t *ptep)
WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
}

-static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
+static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
- kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(__pa(childp));
+ kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp));

pte |= FIELD_PREP(KVM_PTE_TYPE, KVM_PTE_TYPE_TABLE);
pte |= KVM_PTE_VALID;
@@ -227,7 +228,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
goto out;
}

- childp = kvm_pte_follow(pte);
+ childp = kvm_pte_follow(pte, data->pgt->mm_ops);
ret = __kvm_pgtable_walk(data, childp, level + 1);
if (ret)
goto out;
@@ -302,8 +303,9 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
}

struct hyp_map_data {
- u64 phys;
- kvm_pte_t attr;
+ u64 phys;
+ kvm_pte_t attr;
+ struct kvm_pgtable_mm_ops *mm_ops;
};

static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
@@ -358,6 +360,8 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
kvm_pte_t *childp;
+ struct hyp_map_data *data = arg;
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;

if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
return 0;
@@ -365,11 +369,11 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
return -EINVAL;

- childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+ childp = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
if (!childp)
return -ENOMEM;

- kvm_set_table_pte(ptep, childp);
+ kvm_set_table_pte(ptep, childp, mm_ops);
return 0;
}

@@ -379,6 +383,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
int ret;
struct hyp_map_data map_data = {
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
+ .mm_ops = pgt->mm_ops,
};
struct kvm_pgtable_walker walker = {
.cb = hyp_map_walker,
@@ -396,16 +401,18 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
return ret;
}

-int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
+int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits);

- pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+ pgt->pgd = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
if (!pgt->pgd)
return -ENOMEM;

pgt->ia_bits = va_bits;
pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
+ pgt->mm_ops = mm_ops;
pgt->mmu = NULL;
return 0;
}
@@ -413,7 +420,9 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
- put_page(virt_to_page(kvm_pte_follow(*ptep)));
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
+
+ mm_ops->put_page((void *)kvm_pte_follow(*ptep, mm_ops));
return 0;
}

@@ -422,10 +431,11 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
struct kvm_pgtable_walker walker = {
.cb = hyp_free_walker,
.flags = KVM_PGTABLE_WALK_TABLE_POST,
+ .arg = pgt->mm_ops,
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
- put_page(virt_to_page(pgt->pgd));
+ pgt->mm_ops->put_page(pgt->pgd);
pgt->pgd = NULL;
}

@@ -437,6 +447,8 @@ struct stage2_map_data {

struct kvm_s2_mmu *mmu;
struct kvm_mmu_memory_cache *memcache;
+
+ struct kvm_pgtable_mm_ops *mm_ops;
};

static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
@@ -470,7 +482,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
{
kvm_pte_t new, old = *ptep;
u64 granule = kvm_granule_size(level), phys = data->phys;
- struct page *page = virt_to_page(ptep);
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;

if (!kvm_block_mapping_supported(addr, end, phys, level))
return -E2BIG;
@@ -492,11 +504,11 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
*/
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
- put_page(page);
+ mm_ops->put_page(ptep);
}

smp_store_release(ptep, new);
- get_page(page);
+ mm_ops->get_page(ptep);
data->phys += granule;
return 0;
}
@@ -526,13 +538,13 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct stage2_map_data *data)
{
- int ret;
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
kvm_pte_t *childp, pte = *ptep;
- struct page *page = virt_to_page(ptep);
+ int ret;

if (data->anchor) {
if (kvm_pte_valid(pte))
- put_page(page);
+ mm_ops->put_page(ptep);

return 0;
}
@@ -547,7 +559,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (!data->memcache)
return -ENOMEM;

- childp = kvm_mmu_memory_cache_alloc(data->memcache);
+ childp = mm_ops->zalloc_page(data->memcache);
if (!childp)
return -ENOMEM;

@@ -559,11 +571,11 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (kvm_pte_valid(pte)) {
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
- put_page(page);
+ mm_ops->put_page(ptep);
}

- kvm_set_table_pte(ptep, childp);
- get_page(page);
+ kvm_set_table_pte(ptep, childp, mm_ops);
+ mm_ops->get_page(ptep);

return 0;
}
@@ -572,13 +584,14 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
struct stage2_map_data *data)
{
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
int ret = 0;

if (!data->anchor)
return 0;

- put_page(virt_to_page(kvm_pte_follow(*ptep)));
- put_page(virt_to_page(ptep));
+ mm_ops->put_page(kvm_pte_follow(*ptep, mm_ops));
+ mm_ops->put_page(ptep);

if (data->anchor == ptep) {
data->anchor = NULL;
@@ -633,6 +646,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
.mmu = pgt->mmu,
.memcache = mc,
+ .mm_ops = pgt->mm_ops,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -669,7 +683,9 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
- struct kvm_s2_mmu *mmu = arg;
+ struct kvm_pgtable *pgt = arg;
+ struct kvm_s2_mmu *mmu = pgt->mmu;
+ struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
kvm_pte_t pte = *ptep, *childp = NULL;
bool need_flush = false;

@@ -677,9 +693,9 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
return 0;

if (kvm_pte_table(pte, level)) {
- childp = kvm_pte_follow(pte);
+ childp = kvm_pte_follow(pte, mm_ops);

- if (page_count(virt_to_page(childp)) != 1)
+ if (mm_ops->page_count(childp) != 1)
return 0;
} else if (stage2_pte_cacheable(pte)) {
need_flush = true;
@@ -692,15 +708,15 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
*/
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
- put_page(virt_to_page(ptep));
+ mm_ops->put_page(ptep);

if (need_flush) {
- stage2_flush_dcache(kvm_pte_follow(pte),
+ stage2_flush_dcache(kvm_pte_follow(pte, mm_ops),
kvm_granule_size(level));
}

if (childp)
- put_page(virt_to_page(childp));
+ mm_ops->put_page(childp);

return 0;
}
@@ -709,7 +725,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
- .arg = pgt->mmu,
+ .arg = pgt,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};

@@ -841,12 +857,13 @@ static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
kvm_pte_t pte = *ptep;

if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pte))
return 0;

- stage2_flush_dcache(kvm_pte_follow(pte), kvm_granule_size(level));
+ stage2_flush_dcache(kvm_pte_follow(pte, mm_ops), kvm_granule_size(level));
return 0;
}

@@ -855,6 +872,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
struct kvm_pgtable_walker walker = {
.cb = stage2_flush_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = pgt->mm_ops,
};

if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
@@ -863,7 +881,8 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
return kvm_pgtable_walk(pgt, addr, size, &walker);
}

-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm)
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
size_t pgd_sz;
u64 vtcr = kvm->arch.vtcr;
@@ -872,12 +891,13 @@ int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm)
u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;

pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
- pgt->pgd = alloc_pages_exact(pgd_sz, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ pgt->pgd = mm_ops->zalloc_pages_exact(pgd_sz);
if (!pgt->pgd)
return -ENOMEM;

pgt->ia_bits = ia_bits;
pgt->start_level = start_level;
+ pgt->mm_ops = mm_ops;
pgt->mmu = &kvm->arch.mmu;

/* Ensure zeroed PGD pages are visible to the hardware walker */
@@ -889,15 +909,16 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
kvm_pte_t pte = *ptep;

if (!kvm_pte_valid(pte))
return 0;

- put_page(virt_to_page(ptep));
+ mm_ops->put_page(ptep);

if (kvm_pte_table(pte, level))
- put_page(virt_to_page(kvm_pte_follow(pte)));
+ mm_ops->put_page(kvm_pte_follow(pte, mm_ops));

return 0;
}
@@ -909,10 +930,11 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
.cb = stage2_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_TABLE_POST,
+ .arg = pgt->mm_ops,
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
- free_pages_exact(pgt->pgd, pgd_sz);
+ pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
pgt->pgd = NULL;
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..4d41d7838d53 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -88,6 +88,44 @@ static bool kvm_is_device_pfn(unsigned long pfn)
return !pfn_valid(pfn);
}

+static void *stage2_memcache_zalloc_page(void *arg)
+{
+ struct kvm_mmu_memory_cache *mc = arg;
+
+ /* Allocated with __GFP_ZERO, so no need to zero */
+ return kvm_mmu_memory_cache_alloc(mc);
+}
+
+static void *kvm_host_zalloc_pages_exact(size_t size)
+{
+ return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+}
+
+static void kvm_host_get_page(void *addr)
+{
+ get_page(virt_to_page(addr));
+}
+
+static void kvm_host_put_page(void *addr)
+{
+ put_page(virt_to_page(addr));
+}
+
+static int kvm_host_page_count(void *addr)
+{
+ return page_count(virt_to_page(addr));
+}
+
+static phys_addr_t kvm_host_pa(void *addr)
+{
+ return __pa(addr);
+}
+
+static void *kvm_host_va(phys_addr_t phys)
+{
+ return __va(phys);
+}
+
/*
* Unmapping vs dcache management:
*
@@ -351,6 +389,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
return 0;
}

+static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
+ .zalloc_page = stage2_memcache_zalloc_page,
+ .zalloc_pages_exact = kvm_host_zalloc_pages_exact,
+ .free_pages_exact = free_pages_exact,
+ .get_page = kvm_host_get_page,
+ .put_page = kvm_host_put_page,
+ .page_count = kvm_host_page_count,
+ .phys_to_virt = kvm_host_va,
+ .virt_to_phys = kvm_host_pa,
+};
+
/**
* kvm_init_stage2_mmu - Initialise a S2 MMU strucrure
* @kvm: The pointer to the KVM structure
@@ -374,7 +423,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
if (!pgt)
return -ENOMEM;

- err = kvm_pgtable_stage2_init(pgt, kvm);
+ err = kvm_pgtable_stage2_init(pgt, kvm, &kvm_s2_mm_ops);
if (err)
goto out_free_pgtable;

@@ -1208,6 +1257,19 @@ static int kvm_map_idmap_text(void)
return err;
}

+static void *kvm_hyp_zalloc_page(void *arg)
+{
+ return (void *)get_zeroed_page(GFP_KERNEL);
+}
+
+static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
+ .zalloc_page = kvm_hyp_zalloc_page,
+ .get_page = kvm_host_get_page,
+ .put_page = kvm_host_put_page,
+ .phys_to_virt = kvm_host_va,
+ .virt_to_phys = kvm_host_pa,
+};
+
int kvm_mmu_init(void)
{
int err;
@@ -1251,7 +1313,7 @@ int kvm_mmu_init(void)
goto out;
}

- err = kvm_pgtable_hyp_init(hyp_pgtable, hyp_va_bits);
+ err = kvm_pgtable_hyp_init(hyp_pgtable, hyp_va_bits, &kvm_hyp_mm_ops);
if (err)
goto out_free_pgtable;

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 16:36:15

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 14/36] KVM: arm64: Provide __flush_dcache_area at EL2

On Mon, Mar 15, 2021 at 02:35:14PM +0000, Quentin Perret wrote:
> We will need to do cache maintenance at EL2 soon, so compile a copy of
> __flush_dcache_area at EL2, and provide a copy of arm64_ftr_reg_ctrel0
> as it is needed by the read_ctr macro.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_cpufeature.h | 2 ++
> arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
> arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++++
> arch/arm64/kvm/sys_regs.c | 1 +
> 4 files changed, 18 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
>
> diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> index 3fd9f60d2180..efba1b89b8a4 100644
> --- a/arch/arm64/include/asm/kvm_cpufeature.h
> +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> @@ -13,3 +13,5 @@
> #define KVM_HYP_CPU_FTR_REG(name) extern struct arm64_ftr_reg kvm_nvhe_sym(name)
> #endif
> #endif
> +
> +KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_ctrel0);

I still think this is a bit weird. If you really want to macro-ise stuff,
then why not follow the sort of thing we do for e.g. per-cpu variables and
have separate DECLARE_HYP_CPU_FTR_REG and DEFINE_HYP_CPU_FTR_REG macros.

That way kvm_cpufeature.h can have header guards like a normal header and
we can drop the '#ifndef KVM_HYP_CPU_FTR_REG' altogether. I don't think
the duplication of the symbol name really matters -- it should fail at
build time if something is missing.

Will

2021-03-15 16:39:46

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 29/36] KVM: arm64: Use page-table to track page ownership

On Mon, Mar 15, 2021 at 02:35:29PM +0000, Quentin Perret wrote:
> As the host stage 2 will be identity mapped, all the .hyp memory regions
> and/or memory pages donated to protected guestis will have to marked
> invalid in the host stage 2 page-table. At the same time, the hypervisor
> will need a way to track the ownership of each physical page to ensure
> memory sharing or donation between entities (host, guests, hypervisor) is
> legal.
>
> In order to enable this tracking at EL2, let's use the host stage 2
> page-table itself. The idea is to use the top bits of invalid mappings
> to store the unique identifier of the page owner. The page-table owner
> (the host) gets identifier 0 such that, at boot time, it owns the entire
> IPA space as the pgd starts zeroed.
>
> Provide kvm_pgtable_stage2_set_owner() which allows to modify the
> ownership of pages in the host stage 2. It re-uses most of the map()
> logic, but ends up creating invalid mappings instead. This impacts
> how we do refcount as we now need to count invalid mappings when they
> are used for ownership tracking.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 21 +++++
> arch/arm64/kvm/hyp/pgtable.c | 127 ++++++++++++++++++++++-----
> 2 files changed, 124 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 4ae19247837b..683e96abdc24 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -238,6 +238,27 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> u64 phys, enum kvm_pgtable_prot prot,
> void *mc);
>
> +/**
> + * kvm_pgtable_stage2_set_owner() - Annotate invalid mappings with metadata
> + * encoding the ownership of a page in the
> + * IPA space.

The function does more than this, though, as it will also go ahead and unmap
existing valid mappings which I think should be mentioned here, no?

> +int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> + void *mc, u8 owner_id)
> +{
> + int ret;
> + struct stage2_map_data map_data = {
> + .phys = KVM_PHYS_INVALID,
> + .mmu = pgt->mmu,
> + .memcache = mc,
> + .mm_ops = pgt->mm_ops,
> + .owner_id = owner_id,
> + };
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_map_walker,
> + .flags = KVM_PGTABLE_WALK_TABLE_PRE |
> + KVM_PGTABLE_WALK_LEAF |
> + KVM_PGTABLE_WALK_TABLE_POST,
> + .arg = &map_data,
> + };
> +
> + if (owner_id > KVM_MAX_OWNER_ID)
> + return -EINVAL;
> +
> + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> + dsb(ishst);

Why is the DSB needed here? afaict, we only ever unmap a valid entry (which
will have a DSB as part of the TLBI sequence) or we update the owner for an
existing invalid entry, in which case the walker doesn't care.

Will

2021-03-15 16:55:52

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 29/36] KVM: arm64: Use page-table to track page ownership

On Monday 15 Mar 2021 at 16:36:19 (+0000), Will Deacon wrote:
> On Mon, Mar 15, 2021 at 02:35:29PM +0000, Quentin Perret wrote:
> > As the host stage 2 will be identity mapped, all the .hyp memory regions
> > and/or memory pages donated to protected guestis will have to marked
> > invalid in the host stage 2 page-table. At the same time, the hypervisor
> > will need a way to track the ownership of each physical page to ensure
> > memory sharing or donation between entities (host, guests, hypervisor) is
> > legal.
> >
> > In order to enable this tracking at EL2, let's use the host stage 2
> > page-table itself. The idea is to use the top bits of invalid mappings
> > to store the unique identifier of the page owner. The page-table owner
> > (the host) gets identifier 0 such that, at boot time, it owns the entire
> > IPA space as the pgd starts zeroed.
> >
> > Provide kvm_pgtable_stage2_set_owner() which allows to modify the
> > ownership of pages in the host stage 2. It re-uses most of the map()
> > logic, but ends up creating invalid mappings instead. This impacts
> > how we do refcount as we now need to count invalid mappings when they
> > are used for ownership tracking.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 21 +++++
> > arch/arm64/kvm/hyp/pgtable.c | 127 ++++++++++++++++++++++-----
> > 2 files changed, 124 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 4ae19247837b..683e96abdc24 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -238,6 +238,27 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > u64 phys, enum kvm_pgtable_prot prot,
> > void *mc);
> >
> > +/**
> > + * kvm_pgtable_stage2_set_owner() - Annotate invalid mappings with metadata
> > + * encoding the ownership of a page in the
> > + * IPA space.
>
> The function does more than this, though, as it will also go ahead and unmap
> existing valid mappings which I think should be mentioned here, no?

Right, I see why you mean. How about:

'Unmap and annotate pages in the IPA space to track ownership'

> > +int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > + void *mc, u8 owner_id)
> > +{
> > + int ret;
> > + struct stage2_map_data map_data = {
> > + .phys = KVM_PHYS_INVALID,
> > + .mmu = pgt->mmu,
> > + .memcache = mc,
> > + .mm_ops = pgt->mm_ops,
> > + .owner_id = owner_id,
> > + };
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_map_walker,
> > + .flags = KVM_PGTABLE_WALK_TABLE_PRE |
> > + KVM_PGTABLE_WALK_LEAF |
> > + KVM_PGTABLE_WALK_TABLE_POST,
> > + .arg = &map_data,
> > + };
> > +
> > + if (owner_id > KVM_MAX_OWNER_ID)
> > + return -EINVAL;
> > +
> > + ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> > + dsb(ishst);
>
> Why is the DSB needed here? afaict, we only ever unmap a valid entry (which
> will have a DSB as part of the TLBI sequence) or we update the owner for an
> existing invalid entry, in which case the walker doesn't care.

Indeed, that is now unnecessary. I'll remove it.

Thanks,
Quentin

2021-03-15 16:58:25

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 14/36] KVM: arm64: Provide __flush_dcache_area at EL2

On Monday 15 Mar 2021 at 16:33:23 (+0000), Will Deacon wrote:
> On Mon, Mar 15, 2021 at 02:35:14PM +0000, Quentin Perret wrote:
> > We will need to do cache maintenance at EL2 soon, so compile a copy of
> > __flush_dcache_area at EL2, and provide a copy of arm64_ftr_reg_ctrel0
> > as it is needed by the read_ctr macro.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_cpufeature.h | 2 ++
> > arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
> > arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++++
> > arch/arm64/kvm/sys_regs.c | 1 +
> > 4 files changed, 18 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
> >
> > diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> > index 3fd9f60d2180..efba1b89b8a4 100644
> > --- a/arch/arm64/include/asm/kvm_cpufeature.h
> > +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> > @@ -13,3 +13,5 @@
> > #define KVM_HYP_CPU_FTR_REG(name) extern struct arm64_ftr_reg kvm_nvhe_sym(name)
> > #endif
> > #endif
> > +
> > +KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_ctrel0);
>
> I still think this is a bit weird. If you really want to macro-ise stuff,
> then why not follow the sort of thing we do for e.g. per-cpu variables and
> have separate DECLARE_HYP_CPU_FTR_REG and DEFINE_HYP_CPU_FTR_REG macros.
>
> That way kvm_cpufeature.h can have header guards like a normal header and
> we can drop the '#ifndef KVM_HYP_CPU_FTR_REG' altogether. I don't think
> the duplication of the symbol name really matters -- it should fail at
> build time if something is missing.

I just tend to hate unnecessary boilerplate, but if you feel strongly
about it, happy to change :)

Cheers,
Quentin

2021-03-15 17:03:49

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 29/36] KVM: arm64: Use page-table to track page ownership

On Mon, Mar 15, 2021 at 04:53:18PM +0000, Quentin Perret wrote:
> On Monday 15 Mar 2021 at 16:36:19 (+0000), Will Deacon wrote:
> > On Mon, Mar 15, 2021 at 02:35:29PM +0000, Quentin Perret wrote:
> > > As the host stage 2 will be identity mapped, all the .hyp memory regions
> > > and/or memory pages donated to protected guestis will have to marked
> > > invalid in the host stage 2 page-table. At the same time, the hypervisor
> > > will need a way to track the ownership of each physical page to ensure
> > > memory sharing or donation between entities (host, guests, hypervisor) is
> > > legal.
> > >
> > > In order to enable this tracking at EL2, let's use the host stage 2
> > > page-table itself. The idea is to use the top bits of invalid mappings
> > > to store the unique identifier of the page owner. The page-table owner
> > > (the host) gets identifier 0 such that, at boot time, it owns the entire
> > > IPA space as the pgd starts zeroed.
> > >
> > > Provide kvm_pgtable_stage2_set_owner() which allows to modify the
> > > ownership of pages in the host stage 2. It re-uses most of the map()
> > > logic, but ends up creating invalid mappings instead. This impacts
> > > how we do refcount as we now need to count invalid mappings when they
> > > are used for ownership tracking.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_pgtable.h | 21 +++++
> > > arch/arm64/kvm/hyp/pgtable.c | 127 ++++++++++++++++++++++-----
> > > 2 files changed, 124 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > index 4ae19247837b..683e96abdc24 100644
> > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > @@ -238,6 +238,27 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > u64 phys, enum kvm_pgtable_prot prot,
> > > void *mc);
> > >
> > > +/**
> > > + * kvm_pgtable_stage2_set_owner() - Annotate invalid mappings with metadata
> > > + * encoding the ownership of a page in the
> > > + * IPA space.
> >
> > The function does more than this, though, as it will also go ahead and unmap
> > existing valid mappings which I think should be mentioned here, no?
>
> Right, I see why you mean. How about:
>
> 'Unmap and annotate pages in the IPA space to track ownership'

I think I'd go with:

'Unmap pages and annotate the invalid mappings with ownership metadata for
the unmapped IPA range.'

as it's the page-table which is annotated, not the actual pages (which could
potentially be mapped by other page-tables).

Will

2021-03-15 17:05:13

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 14/36] KVM: arm64: Provide __flush_dcache_area at EL2

On Mon, Mar 15, 2021 at 04:56:21PM +0000, Quentin Perret wrote:
> On Monday 15 Mar 2021 at 16:33:23 (+0000), Will Deacon wrote:
> > On Mon, Mar 15, 2021 at 02:35:14PM +0000, Quentin Perret wrote:
> > > We will need to do cache maintenance at EL2 soon, so compile a copy of
> > > __flush_dcache_area at EL2, and provide a copy of arm64_ftr_reg_ctrel0
> > > as it is needed by the read_ctr macro.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_cpufeature.h | 2 ++
> > > arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
> > > arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++++
> > > arch/arm64/kvm/sys_regs.c | 1 +
> > > 4 files changed, 18 insertions(+), 1 deletion(-)
> > > create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> > > index 3fd9f60d2180..efba1b89b8a4 100644
> > > --- a/arch/arm64/include/asm/kvm_cpufeature.h
> > > +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> > > @@ -13,3 +13,5 @@
> > > #define KVM_HYP_CPU_FTR_REG(name) extern struct arm64_ftr_reg kvm_nvhe_sym(name)
> > > #endif
> > > #endif
> > > +
> > > +KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_ctrel0);
> >
> > I still think this is a bit weird. If you really want to macro-ise stuff,
> > then why not follow the sort of thing we do for e.g. per-cpu variables and
> > have separate DECLARE_HYP_CPU_FTR_REG and DEFINE_HYP_CPU_FTR_REG macros.
> >
> > That way kvm_cpufeature.h can have header guards like a normal header and
> > we can drop the '#ifndef KVM_HYP_CPU_FTR_REG' altogether. I don't think
> > the duplication of the symbol name really matters -- it should fail at
> > build time if something is missing.
>
> I just tend to hate unnecessary boilerplate, but if you feel strongly
> about it, happy to change :)

I don't like it either, but I prefer it to overriding macros like this! I
think having the "boilerplate" is a better starting point should we decide
to consolidate the definitions somehow.

Will

2021-03-15 18:35:44

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 08/36] KVM: arm64: Make kvm_call_hyp() a function call at Hyp

kvm_call_hyp() has some logic to issue a function call or a hypercall
depending on the EL at which the kernel is running. However, all the
code compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to only run
at EL2 which allows us to simplify.

Add ifdefery to kvm_host.h to simplify kvm_call_hyp() in .hyp.text.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..06ca4828005f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -591,6 +591,7 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
void kvm_arm_halt_guest(struct kvm *kvm);
void kvm_arm_resume_guest(struct kvm *kvm);

+#ifndef __KVM_NVHE_HYPERVISOR__
#define kvm_call_hyp_nvhe(f, ...) \
({ \
struct arm_smccc_res res; \
@@ -630,6 +631,11 @@ void kvm_arm_resume_guest(struct kvm *kvm);
\
ret; \
})
+#else /* __KVM_NVHE_HYPERVISOR__ */
+#define kvm_call_hyp(f, ...) f(__VA_ARGS__)
+#define kvm_call_hyp_ret(f, ...) f(__VA_ARGS__)
+#define kvm_call_hyp_nvhe(f, ...) f(__VA_ARGS__)
+#endif /* __KVM_NVHE_HYPERVISOR__ */

void force_vm_exit(const cpumask_t *mask);
void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:35:47

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 22/36] KVM: arm64: Refactor kvm_arm_setup_stage2()

In order to re-use some of the stage 2 setup code at EL2, factor parts
of kvm_arm_setup_stage2() out into separate functions.

No functional change intended.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 26 +++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 32 +++++++++++++++++++++
arch/arm64/kvm/reset.c | 42 +++-------------------------
3 files changed, 62 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 7945ec87eaec..9cdc198ea6b4 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -13,6 +13,16 @@

#define KVM_PGTABLE_MAX_LEVELS 4U

+static inline u64 kvm_get_parange(u64 mmfr0)
+{
+ u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
+ ID_AA64MMFR0_PARANGE_SHIFT);
+ if (parange > ID_AA64MMFR0_PARANGE_MAX)
+ parange = ID_AA64MMFR0_PARANGE_MAX;
+
+ return parange;
+}
+
typedef u64 kvm_pte_t;

/**
@@ -159,6 +169,22 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt);
int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
enum kvm_pgtable_prot prot);

+/**
+ * kvm_get_vtcr() - Helper to construct VTCR_EL2
+ * @mmfr0: Sanitized value of SYS_ID_AA64MMFR0_EL1 register.
+ * @mmfr1: Sanitized value of SYS_ID_AA64MMFR1_EL1 register.
+ * @phys_shfit: Value to set in VTCR_EL2.T0SZ.
+ *
+ * The VTCR value is common across all the physical CPUs on the system.
+ * We use system wide sanitised values to fill in different fields,
+ * except for Hardware Management of Access Flags. HA Flag is set
+ * unconditionally on all CPUs, as it is safe to run with or without
+ * the feature and the bit is RES0 on CPUs that don't support it.
+ *
+ * Return: VTCR_EL2 value
+ */
+u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
+
/**
* kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d79c8094cdd..296675e5600d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -9,6 +9,7 @@

#include <linux/bitfield.h>
#include <asm/kvm_pgtable.h>
+#include <asm/stage2_pgtable.h>

#define KVM_PTE_VALID BIT(0)

@@ -449,6 +450,37 @@ struct stage2_map_data {
struct kvm_pgtable_mm_ops *mm_ops;
};

+u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
+{
+ u64 vtcr = VTCR_EL2_FLAGS;
+ u8 lvls;
+
+ vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
+ vtcr |= VTCR_EL2_T0SZ(phys_shift);
+ /*
+ * Use a minimum 2 level page table to prevent splitting
+ * host PMD huge pages at stage2.
+ */
+ lvls = stage2_pgtable_levels(phys_shift);
+ if (lvls < 2)
+ lvls = 2;
+ vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
+
+ /*
+ * Enable the Hardware Access Flag management, unconditionally
+ * on all CPUs. The features is RES0 on CPUs without the support
+ * and must be ignored by the CPUs.
+ */
+ vtcr |= VTCR_EL2_HA;
+
+ /* Set the vmid bits */
+ vtcr |= (get_vmid_bits(mmfr1) == 16) ?
+ VTCR_EL2_VS_16BIT :
+ VTCR_EL2_VS_8BIT;
+
+ return vtcr;
+}
+
static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
struct stage2_map_data *data)
{
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 47f3f035f3ea..6aae118c960a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -332,19 +332,10 @@ int kvm_set_ipa_limit(void)
return 0;
}

-/*
- * Configure the VTCR_EL2 for this VM. The VTCR value is common
- * across all the physical CPUs on the system. We use system wide
- * sanitised values to fill in different fields, except for Hardware
- * Management of Access Flags. HA Flag is set unconditionally on
- * all CPUs, as it is safe to run with or without the feature and
- * the bit is RES0 on CPUs that don't support it.
- */
int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
{
- u64 vtcr = VTCR_EL2_FLAGS, mmfr0;
- u32 parange, phys_shift;
- u8 lvls;
+ u64 mmfr0, mmfr1;
+ u32 phys_shift;

if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
return -EINVAL;
@@ -359,33 +350,8 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
}

mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
- parange = cpuid_feature_extract_unsigned_field(mmfr0,
- ID_AA64MMFR0_PARANGE_SHIFT);
- if (parange > ID_AA64MMFR0_PARANGE_MAX)
- parange = ID_AA64MMFR0_PARANGE_MAX;
- vtcr |= parange << VTCR_EL2_PS_SHIFT;
-
- vtcr |= VTCR_EL2_T0SZ(phys_shift);
- /*
- * Use a minimum 2 level page table to prevent splitting
- * host PMD huge pages at stage2.
- */
- lvls = stage2_pgtable_levels(phys_shift);
- if (lvls < 2)
- lvls = 2;
- vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
-
- /*
- * Enable the Hardware Access Flag management, unconditionally
- * on all CPUs. The features is RES0 on CPUs without the support
- * and must be ignored by the CPUs.
- */
- vtcr |= VTCR_EL2_HA;
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);

- /* Set the vmid bits */
- vtcr |= (kvm_get_vmid_bits() == 16) ?
- VTCR_EL2_VS_16BIT :
- VTCR_EL2_VS_8BIT;
- kvm->arch.vtcr = vtcr;
return 0;
}
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:35:52

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 15/36] KVM: arm64: Factor out vector address calculation

In order to re-map the guest vectors at EL2 when pKVM is enabled,
refactor __kvm_vector_slot2idx() and kvm_init_vector_slot() to move all
the address calculation logic in a static inline function.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 8 ++++++++
arch/arm64/kvm/arm.c | 9 +--------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 90873851f677..5c42ec023cc7 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -168,6 +168,14 @@ phys_addr_t kvm_mmu_get_httbr(void);
phys_addr_t kvm_get_idmap_vector(void);
int kvm_mmu_init(void);

+static inline void *__kvm_vector_slot2addr(void *base,
+ enum arm64_hyp_spectre_vector slot)
+{
+ int idx = slot - (slot != HYP_VECTOR_DIRECT);
+
+ return base + (idx * SZ_2K);
+}
+
struct kvm;

#define kvm_flush_dcache_to_poc(a,l) __flush_dcache_area((a), (l))
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3f8bcf8db036..26e573cdede3 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,16 +1345,9 @@ static unsigned long nvhe_percpu_order(void)
/* A lookup table holding the hypervisor VA for each vector slot */
static void *hyp_spectre_vector_selector[BP_HARDEN_EL2_SLOTS];

-static int __kvm_vector_slot2idx(enum arm64_hyp_spectre_vector slot)
-{
- return slot - (slot != HYP_VECTOR_DIRECT);
-}
-
static void kvm_init_vector_slot(void *base, enum arm64_hyp_spectre_vector slot)
{
- int idx = __kvm_vector_slot2idx(slot);
-
- hyp_spectre_vector_selector[slot] = base + (idx * SZ_2K);
+ hyp_spectre_vector_selector[slot] = __kvm_vector_slot2addr(base, slot);
}

static int kvm_init_vector_slots(void)
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:35:57

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 23/36] KVM: arm64: Refactor __load_guest_stage2()

Refactor __load_guest_stage2() to introduce __load_stage2() which will
be re-used when loading the host stage 2.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 6f743e20cb06..9d64fa73ee67 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -270,9 +270,9 @@ static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
* Must be called from hyp code running at EL2 with an updated VTTBR
* and interrupts disabled.
*/
-static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
+static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu, unsigned long vtcr)
{
- write_sysreg(kern_hyp_va(mmu->arch)->vtcr, vtcr_el2);
+ write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);

/*
@@ -283,6 +283,11 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
}

+static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
+{
+ __load_stage2(mmu, kern_hyp_va(mmu->arch)->vtcr);
+}
+
static inline struct kvm *kvm_s2_mmu_to_kvm(struct kvm_s2_mmu *mmu)
{
return container_of(mmu->arch, struct kvm, arch);
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:06

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 03/36] arm64: kvm: Add standalone ticket spinlock implementation for use at hyp

From: Will Deacon <[email protected]>

We will soon need to synchronise multiple CPUs in the hyp text at EL2.
The qspinlock-based locking used by the host is overkill for this purpose
and relies on the kernel's "percpu" implementation for the MCS nodes.

Implement a simple ticket locking scheme based heavily on the code removed
by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
with qspinlock").

Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
1 file changed, 92 insertions(+)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h

diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
new file mode 100644
index 000000000000..76b537f8d1c6
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * A stand-alone ticket spinlock implementation for use by the non-VHE
+ * KVM hypervisor code running at EL2.
+ *
+ * Copyright (C) 2020 Google LLC
+ * Author: Will Deacon <[email protected]>
+ *
+ * Heavily based on the implementation removed by c11090474d70 which was:
+ * Copyright (C) 2012 ARM Ltd.
+ */
+
+#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
+#define __ARM64_KVM_NVHE_SPINLOCK_H__
+
+#include <asm/alternative.h>
+#include <asm/lse.h>
+
+typedef union hyp_spinlock {
+ u32 __val;
+ struct {
+#ifdef __AARCH64EB__
+ u16 next, owner;
+#else
+ u16 owner, next;
+#endif
+ };
+} hyp_spinlock_t;
+
+#define hyp_spin_lock_init(l) \
+do { \
+ *(l) = (hyp_spinlock_t){ .__val = 0 }; \
+} while (0)
+
+static inline void hyp_spin_lock(hyp_spinlock_t *lock)
+{
+ u32 tmp;
+ hyp_spinlock_t lockval, newval;
+
+ asm volatile(
+ /* Atomically increment the next ticket. */
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+" prfm pstl1strm, %3\n"
+"1: ldaxr %w0, %3\n"
+" add %w1, %w0, #(1 << 16)\n"
+" stxr %w2, %w1, %3\n"
+" cbnz %w2, 1b\n",
+ /* LSE atomics */
+" mov %w2, #(1 << 16)\n"
+" ldadda %w2, %w0, %3\n"
+ __nops(3))
+
+ /* Did we get the lock? */
+" eor %w1, %w0, %w0, ror #16\n"
+" cbz %w1, 3f\n"
+ /*
+ * No: spin on the owner. Send a local event to avoid missing an
+ * unlock before the exclusive load.
+ */
+" sevl\n"
+"2: wfe\n"
+" ldaxrh %w2, %4\n"
+" eor %w1, %w2, %w0, lsr #16\n"
+" cbnz %w1, 2b\n"
+ /* We got the lock. Critical section starts here. */
+"3:"
+ : "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
+ : "Q" (lock->owner)
+ : "memory");
+}
+
+static inline void hyp_spin_unlock(hyp_spinlock_t *lock)
+{
+ u64 tmp;
+
+ asm volatile(
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " ldrh %w1, %0\n"
+ " add %w1, %w1, #1\n"
+ " stlrh %w1, %0",
+ /* LSE atomics */
+ " mov %w1, #1\n"
+ " staddlh %w1, %0\n"
+ __nops(1))
+ : "=Q" (lock->owner), "=&r" (tmp)
+ :
+ : "memory");
+}
+
+#endif /* __ARM64_KVM_NVHE_SPINLOCK_H__ */
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:15

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 10/36] KVM: arm64: Introduce an early Hyp page allocator

With nVHE, the host currently creates all stage 1 hypervisor mappings at
EL1 during boot, installs them at EL2, and extends them as required
(e.g. when creating a new VM). But in a world where the host is no
longer trusted, it cannot have full control over the code mapped in the
hypervisor.

In preparation for enabling the hypervisor to create its own stage 1
mappings during boot, introduce an early page allocator, with minimal
functionality. This allocator is designed to be used only during early
bootstrap of the hyp code when memory protection is enabled, which will
then switch to using a full-fledged page allocator after init.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/early_alloc.h | 14 +++++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 24 +++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/early_alloc.c | 54 +++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 4 +-
5 files changed, 94 insertions(+), 4 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c

diff --git a/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h b/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
new file mode 100644
index 000000000000..dc61aaa56f31
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_EARLY_ALLOC_H
+#define __KVM_HYP_EARLY_ALLOC_H
+
+#include <asm/kvm_pgtable.h>
+
+void hyp_early_alloc_init(void *virt, unsigned long size);
+unsigned long hyp_early_alloc_nr_used_pages(void);
+void *hyp_early_alloc_page(void *arg);
+void *hyp_early_alloc_contig(unsigned int nr_pages);
+
+extern struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
+
+#endif /* __KVM_HYP_EARLY_ALLOC_H */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
new file mode 100644
index 000000000000..3e49eaa7e682
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_MEMORY_H
+#define __KVM_HYP_MEMORY_H
+
+#include <asm/page.h>
+
+#include <linux/types.h>
+
+extern s64 hyp_physvirt_offset;
+
+#define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
+#define __hyp_va(phys) ((void *)((phys_addr_t)(phys) - hyp_physvirt_offset))
+
+static inline void *hyp_phys_to_virt(phys_addr_t phys)
+{
+ return __hyp_va(phys);
+}
+
+static inline phys_addr_t hyp_virt_to_phys(void *addr)
+{
+ return __hyp_pa(addr);
+}
+
+#endif /* __KVM_HYP_MEMORY_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index bc98f8e3d1da..24ff99e2eac5 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -13,7 +13,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
new file mode 100644
index 000000000000..1306c430ab87
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <asm/kvm_pgtable.h>
+
+#include <nvhe/early_alloc.h>
+#include <nvhe/memory.h>
+
+struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
+s64 __ro_after_init hyp_physvirt_offset;
+
+static unsigned long base;
+static unsigned long end;
+static unsigned long cur;
+
+unsigned long hyp_early_alloc_nr_used_pages(void)
+{
+ return (cur - base) >> PAGE_SHIFT;
+}
+
+void *hyp_early_alloc_contig(unsigned int nr_pages)
+{
+ unsigned long size = (nr_pages << PAGE_SHIFT);
+ void *ret = (void *)cur;
+
+ if (!nr_pages)
+ return NULL;
+
+ if (end - cur < size)
+ return NULL;
+
+ cur += size;
+ memset(ret, 0, size);
+
+ return ret;
+}
+
+void *hyp_early_alloc_page(void *arg)
+{
+ return hyp_early_alloc_contig(1);
+}
+
+void hyp_early_alloc_init(void *virt, unsigned long size)
+{
+ base = cur = (unsigned long)virt;
+ end = base + size;
+
+ hyp_early_alloc_mm_ops.zalloc_page = hyp_early_alloc_page;
+ hyp_early_alloc_mm_ops.phys_to_virt = hyp_phys_to_virt;
+ hyp_early_alloc_mm_ops.virt_to_phys = hyp_virt_to_phys;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
index 63de71c0481e..08508783ec3d 100644
--- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
+++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
@@ -11,6 +11,7 @@
#include <linux/kvm_host.h>
#include <uapi/linux/psci.h>

+#include <nvhe/memory.h>
#include <nvhe/trap_handler.h>

void kvm_hyp_cpu_entry(unsigned long r0);
@@ -20,9 +21,6 @@ void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);

/* Config options set by the host. */
struct kvm_host_psci_config __ro_after_init kvm_host_psci_config;
-s64 __ro_after_init hyp_physvirt_offset;
-
-#define __hyp_pa(x) ((phys_addr_t)((x)) + hyp_physvirt_offset)

#define INVALID_CPU_ID UINT_MAX

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:16

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 09/36] KVM: arm64: Allow using kvm_nvhe_sym() in hyp code

In order to allow the usage of code shared by the host and the hyp in
static inline library functions, allow the usage of kvm_nvhe_sym() at
EL2 by defaulting to the raw symbol name.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/hyp_image.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/hyp_image.h b/arch/arm64/include/asm/hyp_image.h
index 78cd77990c9c..b4b3076a76fb 100644
--- a/arch/arm64/include/asm/hyp_image.h
+++ b/arch/arm64/include/asm/hyp_image.h
@@ -10,11 +10,15 @@
#define __HYP_CONCAT(a, b) a ## b
#define HYP_CONCAT(a, b) __HYP_CONCAT(a, b)

+#ifndef __KVM_NVHE_HYPERVISOR__
/*
* KVM nVHE code has its own symbol namespace prefixed with __kvm_nvhe_,
* to separate it from the kernel proper.
*/
#define kvm_nvhe_sym(sym) __kvm_nvhe_##sym
+#else
+#define kvm_nvhe_sym(sym) sym
+#endif

#ifdef LINKER_SCRIPT

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:17

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 11/36] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

In order to use the kernel list library at EL2, introduce stubs for the
CONFIG_DEBUG_LIST out-of-lines calls.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c

diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 24ff99e2eac5..144da72ad510 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -13,7 +13,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
new file mode 100644
index 000000000000..c0aa6bbfd79d
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/stub.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Stubs for out-of-line function calls caused by re-using kernel
+ * infrastructure at EL2.
+ *
+ * Copyright (C) 2020 - Google LLC
+ */
+
+#include <linux/list.h>
+
+#ifdef CONFIG_DEBUG_LIST
+bool __list_add_valid(struct list_head *new, struct list_head *prev,
+ struct list_head *next)
+{
+ return true;
+}
+
+bool __list_del_entry_valid(struct list_head *entry)
+{
+ return true;
+}
+#endif
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:24

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 28/36] KVM: arm64: Always zero invalid PTEs

kvm_set_invalid_pte() currently only clears bit 0 from a PTE because
stage2_map_walk_table_post() needs to be able to follow the anchor. In
preparation for re-using bits 63-01 from invalid PTEs, make sure to zero
it entirely by ensuring to cache the anchor's child upfront.

Acked-by: Will Deacon <[email protected]>
Suggested-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdd6e3d4eeb6..f37b4179b880 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -156,10 +156,9 @@ static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_op
return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
}

-static void kvm_set_invalid_pte(kvm_pte_t *ptep)
+static void kvm_clear_pte(kvm_pte_t *ptep)
{
- kvm_pte_t pte = *ptep;
- WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
+ WRITE_ONCE(*ptep, 0);
}

static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp,
@@ -443,6 +442,7 @@ struct stage2_map_data {
kvm_pte_t attr;

kvm_pte_t *anchor;
+ kvm_pte_t *childp;

struct kvm_s2_mmu *mmu;
void *memcache;
@@ -532,7 +532,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
* There's an existing different valid leaf entry, so perform
* break-before-make.
*/
- kvm_set_invalid_pte(ptep);
+ kvm_clear_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
mm_ops->put_page(ptep);
}
@@ -553,7 +553,8 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
if (!kvm_block_mapping_supported(addr, end, data->phys, level))
return 0;

- kvm_set_invalid_pte(ptep);
+ data->childp = kvm_pte_follow(*ptep, data->mm_ops);
+ kvm_clear_pte(ptep);

/*
* Invalidate the whole stage-2, as we may have numerous leaf
@@ -599,7 +600,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
* will be mapped lazily.
*/
if (kvm_pte_valid(pte)) {
- kvm_set_invalid_pte(ptep);
+ kvm_clear_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
mm_ops->put_page(ptep);
}
@@ -615,19 +616,24 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
struct stage2_map_data *data)
{
struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
+ kvm_pte_t *childp;
int ret = 0;

if (!data->anchor)
return 0;

- mm_ops->put_page(kvm_pte_follow(*ptep, mm_ops));
- mm_ops->put_page(ptep);
-
if (data->anchor == ptep) {
+ childp = data->childp;
data->anchor = NULL;
+ data->childp = NULL;
ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
+ } else {
+ childp = kvm_pte_follow(*ptep, mm_ops);
}

+ mm_ops->put_page(childp);
+ mm_ops->put_page(ptep);
+
return ret;
}

@@ -736,7 +742,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
* block entry and rely on the remaining portions being faulted
* back lazily.
*/
- kvm_set_invalid_pte(ptep);
+ kvm_clear_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
mm_ops->put_page(ptep);

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:30

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 04/36] KVM: arm64: Initialize kvm_nvhe_init_params early

Move the initialization of kvm_nvhe_init_params in a dedicated function
that is run early, and only once during KVM init, rather than every time
the KVM vectors are set and reset.

This also opens the opportunity for the hypervisor to change the init
structs during boot, hence simplifying the replacement of host-provided
page-table by the one the hypervisor will create for itself.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/arm.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fc4c95dd2d26..2d1e7ef69c04 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1383,22 +1383,18 @@ static int kvm_init_vector_slots(void)
return 0;
}

-static void cpu_init_hyp_mode(void)
+static void cpu_prepare_hyp_mode(int cpu)
{
- struct kvm_nvhe_init_params *params = this_cpu_ptr_nvhe_sym(kvm_init_params);
- struct arm_smccc_res res;
+ struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu);
unsigned long tcr;

- /* Switch from the HYP stub to our own HYP init vector */
- __hyp_set_vectors(kvm_get_idmap_vector());
-
/*
* Calculate the raw per-cpu offset without a translation from the
* kernel's mapping to the linear mapping, and store it in tpidr_el2
* so that we can use adr_l to access per-cpu variables in EL2.
* Also drop the KASAN tag which gets in the way...
*/
- params->tpidr_el2 = (unsigned long)kasan_reset_tag(this_cpu_ptr_nvhe_sym(__per_cpu_start)) -
+ params->tpidr_el2 = (unsigned long)kasan_reset_tag(per_cpu_ptr_nvhe_sym(__per_cpu_start, cpu)) -
(unsigned long)kvm_ksym_ref(CHOOSE_NVHE_SYM(__per_cpu_start));

params->mair_el2 = read_sysreg(mair_el1);
@@ -1422,7 +1418,7 @@ static void cpu_init_hyp_mode(void)
tcr |= (idmap_t0sz & GENMASK(TCR_TxSZ_WIDTH - 1, 0)) << TCR_T0SZ_OFFSET;
params->tcr_el2 = tcr;

- params->stack_hyp_va = kern_hyp_va(__this_cpu_read(kvm_arm_hyp_stack_page) + PAGE_SIZE);
+ params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
params->pgd_pa = kvm_mmu_get_httbr();

/*
@@ -1430,6 +1426,15 @@ static void cpu_init_hyp_mode(void)
* be read while the MMU is off.
*/
kvm_flush_dcache_to_poc(params, sizeof(*params));
+}
+
+static void cpu_init_hyp_mode(void)
+{
+ struct kvm_nvhe_init_params *params;
+ struct arm_smccc_res res;
+
+ /* Switch from the HYP stub to our own HYP init vector */
+ __hyp_set_vectors(kvm_get_idmap_vector());

/*
* Call initialization code, and switch to the full blown HYP code.
@@ -1438,6 +1443,7 @@ static void cpu_init_hyp_mode(void)
* cpus_have_const_cap() wrapper.
*/
BUG_ON(!system_capabilities_finalized());
+ params = this_cpu_ptr_nvhe_sym(kvm_init_params);
arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
WARN_ON(res.a0 != SMCCC_RET_SUCCESS);

@@ -1785,19 +1791,19 @@ static int init_hyp_mode(void)
}
}

- /*
- * Map Hyp percpu pages
- */
for_each_possible_cpu(cpu) {
char *percpu_begin = (char *)kvm_arm_hyp_percpu_base[cpu];
char *percpu_end = percpu_begin + nvhe_percpu_size();

+ /* Map Hyp percpu pages */
err = create_hyp_mappings(percpu_begin, percpu_end, PAGE_HYP);
-
if (err) {
kvm_err("Cannot map hyp percpu region\n");
goto out_err;
}
+
+ /* Prepare the CPU initialization parameters */
+ cpu_prepare_hyp_mode(cpu);
}

if (is_protected_kvm_enabled()) {
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:41

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 18/36] KVM: arm64: Elevate hypervisor mappings creation at EL2

Previous commits have introduced infrastructure to enable the EL2 code
to manage its own stage 1 mappings. However, this was preliminary work,
and none of it is currently in use.

Put all of this together by elevating the mapping creation at EL2 when
memory protection is enabled. In this case, the host kernel running
at EL1 still creates _temporary_ EL2 mappings, only used while
initializing the hypervisor, but frees them right after.

As such, all calls to create_hyp_mappings() after kvm init has finished
turn into hypercalls, as the host now has no 'legal' way to modify the
hypevisor page tables directly.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 2 +-
arch/arm64/kvm/arm.c | 87 +++++++++++++++++++++++++++++---
arch/arm64/kvm/mmu.c | 43 ++++++++++++++--
3 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 5c42ec023cc7..ce02a4052dcf 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -166,7 +166,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);

phys_addr_t kvm_mmu_get_httbr(void);
phys_addr_t kvm_get_idmap_vector(void);
-int kvm_mmu_init(void);
+int kvm_mmu_init(u32 *hyp_va_bits);

static inline void *__kvm_vector_slot2addr(void *base,
enum arm64_hyp_spectre_vector slot)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 26e573cdede3..7d62211109d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
kvm_flush_dcache_to_poc(params, sizeof(*params));
}

-static void cpu_init_hyp_mode(void)
+static void hyp_install_host_vector(void)
{
struct kvm_nvhe_init_params *params;
struct arm_smccc_res res;
@@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
params = this_cpu_ptr_nvhe_sym(kvm_init_params);
arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
+}
+
+static void cpu_init_hyp_mode(void)
+{
+ hyp_install_host_vector();

/*
* Disabling SSBD on a non-VHE system requires us to enable SSBS
@@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
void *vector = hyp_spectre_vector_selector[data->slot];

- *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
+ if (!is_protected_kvm_enabled())
+ *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
+ else
+ kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
}

static void cpu_hyp_reinit(void)
@@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);

cpu_hyp_reset();
- cpu_set_hyp_vector();

if (is_kernel_in_hyp_mode())
kvm_timer_init_vhe();
else
cpu_init_hyp_mode();

+ cpu_set_hyp_vector();
+
kvm_arm_init_debug();

if (vgic_present)
@@ -1691,18 +1700,59 @@ static void teardown_hyp_mode(void)
}
}

+static int do_pkvm_init(u32 hyp_va_bits)
+{
+ void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
+ int ret;
+
+ preempt_disable();
+ hyp_install_host_vector();
+ ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
+ num_possible_cpus(), kern_hyp_va(per_cpu_base),
+ hyp_va_bits);
+ preempt_enable();
+
+ return ret;
+}
+
+static int kvm_hyp_init_protection(u32 hyp_va_bits)
+{
+ void *addr = phys_to_virt(hyp_mem_base);
+ int ret;
+
+ ret = create_hyp_mappings(addr, addr + hyp_mem_size, PAGE_HYP);
+ if (ret)
+ return ret;
+
+ ret = do_pkvm_init(hyp_va_bits);
+ if (ret)
+ return ret;
+
+ free_hyp_pgds();
+
+ return 0;
+}
+
/**
* Inits Hyp-mode on all online CPUs
*/
static int init_hyp_mode(void)
{
+ u32 hyp_va_bits;
int cpu;
- int err = 0;
+ int err = -ENOMEM;
+
+ /*
+ * The protected Hyp-mode cannot be initialized if the memory pool
+ * allocation has failed.
+ */
+ if (is_protected_kvm_enabled() && !hyp_mem_base)
+ goto out_err;

/*
* Allocate Hyp PGD and setup Hyp identity mapping
*/
- err = kvm_mmu_init();
+ err = kvm_mmu_init(&hyp_va_bits);
if (err)
goto out_err;

@@ -1818,6 +1868,14 @@ static int init_hyp_mode(void)
goto out_err;
}

+ if (is_protected_kvm_enabled()) {
+ err = kvm_hyp_init_protection(hyp_va_bits);
+ if (err) {
+ kvm_err("Failed to init hyp memory protection\n");
+ goto out_err;
+ }
+ }
+
return 0;

out_err:
@@ -1826,6 +1884,16 @@ static int init_hyp_mode(void)
return err;
}

+static int finalize_hyp_mode(void)
+{
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ static_branch_enable(&kvm_protected_mode_initialized);
+
+ return 0;
+}
+
static void check_kvm_target_cpu(void *ret)
{
*(int *)ret = kvm_target_cpu();
@@ -1942,8 +2010,15 @@ int kvm_arch_init(void *opaque)
if (err)
goto out_hyp;

+ if (!in_hyp_mode) {
+ err = finalize_hyp_mode();
+ if (err) {
+ kvm_err("Failed to finalize Hyp protection\n");
+ goto out_hyp;
+ }
+ }
+
if (is_protected_kvm_enabled()) {
- static_branch_enable(&kvm_protected_mode_initialized);
kvm_info("Protected nVHE mode initialized successfully\n");
} else if (in_hyp_mode) {
kvm_info("VHE mode initialized successfully\n");
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4d41d7838d53..9d331bf262d2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -221,15 +221,39 @@ void free_hyp_pgds(void)
if (hyp_pgtable) {
kvm_pgtable_hyp_destroy(hyp_pgtable);
kfree(hyp_pgtable);
+ hyp_pgtable = NULL;
}
mutex_unlock(&kvm_hyp_pgd_mutex);
}

+static bool kvm_host_owns_hyp_mappings(void)
+{
+ if (static_branch_likely(&kvm_protected_mode_initialized))
+ return false;
+
+ /*
+ * This can happen at boot time when __create_hyp_mappings() is called
+ * after the hyp protection has been enabled, but the static key has
+ * not been flipped yet.
+ */
+ if (!hyp_pgtable && is_protected_kvm_enabled())
+ return false;
+
+ WARN_ON(!hyp_pgtable);
+
+ return true;
+}
+
static int __create_hyp_mappings(unsigned long start, unsigned long size,
unsigned long phys, enum kvm_pgtable_prot prot)
{
int err;

+ if (!kvm_host_owns_hyp_mappings()) {
+ return kvm_call_hyp_nvhe(__pkvm_create_mappings,
+ start, size, phys, prot);
+ }
+
mutex_lock(&kvm_hyp_pgd_mutex);
err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
mutex_unlock(&kvm_hyp_pgd_mutex);
@@ -291,6 +315,16 @@ static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
unsigned long base;
int ret = 0;

+ if (!kvm_host_owns_hyp_mappings()) {
+ base = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
+ phys_addr, size, prot);
+ if (IS_ERR_OR_NULL((void *)base))
+ return PTR_ERR((void *)base);
+ *haddr = base;
+
+ return 0;
+ }
+
mutex_lock(&kvm_hyp_pgd_mutex);

/*
@@ -1270,10 +1304,9 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
.virt_to_phys = kvm_host_pa,
};

-int kvm_mmu_init(void)
+int kvm_mmu_init(u32 *hyp_va_bits)
{
int err;
- u32 hyp_va_bits;

hyp_idmap_start = __pa_symbol(__hyp_idmap_text_start);
hyp_idmap_start = ALIGN_DOWN(hyp_idmap_start, PAGE_SIZE);
@@ -1287,8 +1320,8 @@ int kvm_mmu_init(void)
*/
BUG_ON((hyp_idmap_start ^ (hyp_idmap_end - 1)) & PAGE_MASK);

- hyp_va_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET);
- kvm_debug("Using %u-bit virtual addresses at EL2\n", hyp_va_bits);
+ *hyp_va_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET);
+ kvm_debug("Using %u-bit virtual addresses at EL2\n", *hyp_va_bits);
kvm_debug("IDMAP page: %lx\n", hyp_idmap_start);
kvm_debug("HYP VA range: %lx:%lx\n",
kern_hyp_va(PAGE_OFFSET),
@@ -1313,7 +1346,7 @@ int kvm_mmu_init(void)
goto out;
}

- err = kvm_pgtable_hyp_init(hyp_pgtable, hyp_va_bits, &kvm_hyp_mm_ops);
+ err = kvm_pgtable_hyp_init(hyp_pgtable, *hyp_va_bits, &kvm_hyp_mm_ops);
if (err)
goto out_free_pgtable;

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:36:43

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 24/36] KVM: arm64: Refactor __populate_fault_info()

Refactor __populate_fault_info() to introduce __get_fault_info() which
will be used once the host is wrapped in a stage 2.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/hyp/switch.h | 34 +++++++++++++------------
1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 6c1f51f25eb3..40c274da5a7c 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -160,19 +160,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
return true;
}

-static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
+static inline bool __get_fault_info(u64 esr, struct kvm_vcpu_fault_info *fault)
{
- u8 ec;
- u64 esr;
- u64 hpfar, far;
-
- esr = vcpu->arch.fault.esr_el2;
- ec = ESR_ELx_EC(esr);
-
- if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
- return true;
-
- far = read_sysreg_el2(SYS_FAR);
+ fault->far_el2 = read_sysreg_el2(SYS_FAR);

/*
* The HPFAR can be invalid if the stage 2 fault did not
@@ -188,17 +178,29 @@ static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
if (!(esr & ESR_ELx_S1PTW) &&
(cpus_have_final_cap(ARM64_WORKAROUND_834220) ||
(esr & ESR_ELx_FSC_TYPE) == FSC_PERM)) {
- if (!__translate_far_to_hpfar(far, &hpfar))
+ if (!__translate_far_to_hpfar(fault->far_el2, &fault->hpfar_el2))
return false;
} else {
- hpfar = read_sysreg(hpfar_el2);
+ fault->hpfar_el2 = read_sysreg(hpfar_el2);
}

- vcpu->arch.fault.far_el2 = far;
- vcpu->arch.fault.hpfar_el2 = hpfar;
return true;
}

+static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
+{
+ u8 ec;
+ u64 esr;
+
+ esr = vcpu->arch.fault.esr_el2;
+ ec = ESR_ELx_EC(esr);
+
+ if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
+ return true;
+
+ return __get_fault_info(esr, &vcpu->arch.fault);
+}
+
/* Check for an FPSIMD/SVE trap and handle as appropriate */
static inline bool __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
{
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:37:14

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 31/36] KVM: arm64: Add kvm_pgtable_stage2_find_range()

Since the host stage 2 will be identity mapped, and since it will own
most of memory, it would preferable for performance to try and use large
block mappings whenever that is possible. To ease this, introduce a new
helper in the KVM page-table code which allows to search for large
ranges of available IPA space. This will be used in the host memory
abort path to greedily idmap large portion of the PA space.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 29 +++++++++
arch/arm64/kvm/hyp/pgtable.c | 89 ++++++++++++++++++++++++++--
2 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 683e96abdc24..b93a2a3526ab 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -94,6 +94,16 @@ enum kvm_pgtable_prot {
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)

+/**
+ * struct kvm_mem_range - Range of Intermediate Physical Addresses
+ * @start: Start of the range.
+ * @end: End of the range.
+ */
+struct kvm_mem_range {
+ u64 start;
+ u64 end;
+};
+
/**
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
@@ -398,4 +408,23 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_pgtable_walker *walker);

+/**
+ * kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
+ * Addresses with compatible permission
+ * attributes.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr: Address that must be covered by the range.
+ * @prot: Protection attributes that the range must be compatible with.
+ * @range: Range structure used to limit the search space at call time and
+ * that will hold the result.
+ *
+ * The offset of @addr within a page is ignored. An IPA is compatible with @prot
+ * iff its corresponding stage-2 page-table entry has default ownership and, if
+ * valid, is mapped with protection attributes identical to @prot.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
+ enum kvm_pgtable_prot prot,
+ struct kvm_mem_range *range);
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a5347d78293f..3a971df278bd 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -48,6 +48,8 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)

+#define KVM_PTE_LEAF_ATTR_S2_IGNORED GENMASK(58, 55)
+
#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
#define KVM_MAX_OWNER_ID 1

@@ -77,15 +79,20 @@ static bool kvm_phys_is_valid(u64 phys)
return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_PARANGE_MAX));
}

-static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level)
+static bool kvm_level_supports_block_mapping(u32 level)
{
- u64 granule = kvm_granule_size(level);
-
/*
* Reject invalid block mappings and don't bother with 4TB mappings for
* 52-bit PAs.
*/
- if (level == 0 || (PAGE_SIZE != SZ_4K && level == 1))
+ return !(level == 0 || (PAGE_SIZE != SZ_4K && level == 1));
+}
+
+static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level)
+{
+ u64 granule = kvm_granule_size(level);
+
+ if (!kvm_level_supports_block_mapping(level))
return false;

if (granule > (end - addr))
@@ -1053,3 +1060,77 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
pgt->pgd = NULL;
}
+
+#define KVM_PTE_LEAF_S2_COMPAT_MASK (KVM_PTE_LEAF_ATTR_S2_PERMS | \
+ KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | \
+ KVM_PTE_LEAF_ATTR_S2_IGNORED)
+
+static int stage2_check_permission_walker(u64 addr, u64 end, u32 level,
+ kvm_pte_t *ptep,
+ enum kvm_pgtable_walk_flags flag,
+ void * const arg)
+{
+ kvm_pte_t old_attr, pte = *ptep, *new_attr = arg;
+
+ /*
+ * Compatible mappings are either invalid and owned by the page-table
+ * owner (whose id is 0), or valid with matching permission attributes.
+ */
+ if (kvm_pte_valid(pte)) {
+ old_attr = pte & KVM_PTE_LEAF_S2_COMPAT_MASK;
+ if (old_attr != *new_attr)
+ return -EEXIST;
+ } else if (pte) {
+ return -EEXIST;
+ }
+
+ return 0;
+}
+
+int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
+ enum kvm_pgtable_prot prot,
+ struct kvm_mem_range *range)
+{
+ kvm_pte_t attr;
+ struct kvm_pgtable_walker check_perm_walker = {
+ .cb = stage2_check_permission_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = &attr,
+ };
+ u64 granule, start, end;
+ u32 level;
+ int ret;
+
+ ret = stage2_set_prot_attr(prot, &attr);
+ if (ret)
+ return ret;
+ attr &= KVM_PTE_LEAF_S2_COMPAT_MASK;
+
+ for (level = pgt->start_level; level < KVM_PGTABLE_MAX_LEVELS; level++) {
+ granule = kvm_granule_size(level);
+ start = ALIGN_DOWN(addr, granule);
+ end = start + granule;
+
+ if (!kvm_level_supports_block_mapping(level))
+ continue;
+
+ if (start < range->start || range->end < end)
+ continue;
+
+ /*
+ * Check the presence of existing mappings with incompatible
+ * permissions within the current block range, and try one level
+ * deeper if one is found.
+ */
+ ret = kvm_pgtable_walk(pgt, start, granule, &check_perm_walker);
+ if (ret != -EEXIST)
+ break;
+ }
+
+ if (!ret) {
+ range->start = start;
+ range->end = end;
+ }
+
+ return ret;
+}
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:37:28

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 26/36] KVM: arm64: Reserve memory for host stage 2

Extend the memory pool allocated for the hypervisor to include enough
pages to map all of memory at page granularity for the host stage 2.
While at it, also reserve some memory for device mappings.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 27 ++++++++++++++++++++++++++-
arch/arm64/kvm/hyp/nvhe/setup.c | 12 ++++++++++++
arch/arm64/kvm/hyp/reserved_mem.c | 2 ++
3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index ac0f7fcffd08..0095f6289742 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -53,7 +53,7 @@ static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
return total;
}

-static inline unsigned long hyp_s1_pgtable_pages(void)
+static inline unsigned long __hyp_pgtable_total_pages(void)
{
unsigned long res = 0, i;

@@ -63,9 +63,34 @@ static inline unsigned long hyp_s1_pgtable_pages(void)
res += __hyp_pgtable_max_pages(reg->size >> PAGE_SHIFT);
}

+ return res;
+}
+
+static inline unsigned long hyp_s1_pgtable_pages(void)
+{
+ unsigned long res;
+
+ res = __hyp_pgtable_total_pages();
+
/* Allow 1 GiB for private mappings */
res += __hyp_pgtable_max_pages(SZ_1G >> PAGE_SHIFT);

return res;
}
+
+static inline unsigned long host_s2_mem_pgtable_pages(void)
+{
+ /*
+ * Include an extra 16 pages to safely upper-bound the worst case of
+ * concatenated pgds.
+ */
+ return __hyp_pgtable_total_pages() + 16;
+}
+
+static inline unsigned long host_s2_dev_pgtable_pages(void)
+{
+ /* Allow 1 GiB for MMIO mappings */
+ return __hyp_pgtable_max_pages(SZ_1G >> PAGE_SHIFT);
+}
+
#endif /* __KVM_HYP_MM_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 1e8bcd8b0299..c1a3e7e0ebbc 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -24,6 +24,8 @@ unsigned long hyp_nr_cpus;

static void *vmemmap_base;
static void *hyp_pgt_base;
+static void *host_s2_mem_pgt_base;
+static void *host_s2_dev_pgt_base;

static int divide_memory_pool(void *virt, unsigned long size)
{
@@ -42,6 +44,16 @@ static int divide_memory_pool(void *virt, unsigned long size)
if (!hyp_pgt_base)
return -ENOMEM;

+ nr_pages = host_s2_mem_pgtable_pages();
+ host_s2_mem_pgt_base = hyp_early_alloc_contig(nr_pages);
+ if (!host_s2_mem_pgt_base)
+ return -ENOMEM;
+
+ nr_pages = host_s2_dev_pgtable_pages();
+ host_s2_dev_pgt_base = hyp_early_alloc_contig(nr_pages);
+ if (!host_s2_dev_pgt_base)
+ return -ENOMEM;
+
return 0;
}

diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
index 9bc6a6d27904..fd42705a3c26 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -52,6 +52,8 @@ void __init kvm_hyp_reserve(void)
}

hyp_mem_pages += hyp_s1_pgtable_pages();
+ hyp_mem_pages += host_s2_mem_pgtable_pages();
+ hyp_mem_pages += host_s2_dev_pgtable_pages();

/*
* The hyp_vmemmap needs to be backed by pages, but these pages
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:37:34

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 36/36] KVM: arm64: Protect the .hyp sections from the host

When KVM runs in nVHE protected mode, use the host stage 2 to unmap the
hypervisor sections by marking them as owned by the hypervisor itself.
The long-term goal is to ensure the EL2 code can remain robust
regardless of the host's state, so this starts by making sure the host
cannot e.g. write to the .hyp sections directly.

Acked-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/kvm/arm.c | 46 +++++++++++++++++++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 9 ++++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 33 +++++++++++++
5 files changed, 91 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index b127af02bd45..d468c4b37190 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -62,6 +62,7 @@
#define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
#define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
#define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
+#define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20

#ifndef __ASSEMBLY__

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7e6a81079652..d6baf76d4747 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1894,11 +1894,57 @@ void _kvm_host_prot_finalize(void *discard)
WARN_ON(kvm_call_hyp_nvhe(__pkvm_prot_finalize));
}

+static inline int pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
+{
+ return kvm_call_hyp_nvhe(__pkvm_mark_hyp, start, end);
+}
+
+#define pkvm_mark_hyp_section(__section) \
+ pkvm_mark_hyp(__pa_symbol(__section##_start), \
+ __pa_symbol(__section##_end))
+
static int finalize_hyp_mode(void)
{
+ int cpu, ret;
+
if (!is_protected_kvm_enabled())
return 0;

+ ret = pkvm_mark_hyp_section(__hyp_idmap_text);
+ if (ret)
+ return ret;
+
+ ret = pkvm_mark_hyp_section(__hyp_text);
+ if (ret)
+ return ret;
+
+ ret = pkvm_mark_hyp_section(__hyp_rodata);
+ if (ret)
+ return ret;
+
+ ret = pkvm_mark_hyp_section(__hyp_bss);
+ if (ret)
+ return ret;
+
+ ret = pkvm_mark_hyp(hyp_mem_base, hyp_mem_base + hyp_mem_size);
+ if (ret)
+ return ret;
+
+ for_each_possible_cpu(cpu) {
+ phys_addr_t start = virt_to_phys((void *)kvm_arm_hyp_percpu_base[cpu]);
+ phys_addr_t end = start + (PAGE_SIZE << nvhe_percpu_order());
+
+ ret = pkvm_mark_hyp(start, end);
+ if (ret)
+ return ret;
+
+ start = virt_to_phys((void *)per_cpu(kvm_arm_hyp_stack_page, cpu));
+ end = start + PAGE_SIZE;
+ ret = pkvm_mark_hyp(start, end);
+ if (ret)
+ return ret;
+ }
+
/*
* Flip the static key upfront as that may no longer be possible
* once the host stage 2 is installed.
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index d293cb328cc4..42d81ec739fa 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -21,6 +21,8 @@ struct host_kvm {
extern struct host_kvm host_kvm;

int __pkvm_prot_finalize(void);
+int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
+
int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool);
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);

diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index f47028d3fd0a..3df33d4de4a1 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -156,6 +156,14 @@ static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt)
{
cpu_reg(host_ctxt, 1) = __pkvm_prot_finalize();
}
+
+static void handle___pkvm_mark_hyp(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(phys_addr_t, start, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, end, host_ctxt, 2);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_mark_hyp(start, end);
+}
typedef void (*hcall_t)(struct kvm_cpu_context *);

#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -180,6 +188,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_create_mappings),
HANDLE_FUNC(__pkvm_create_private_mapping),
HANDLE_FUNC(__pkvm_prot_finalize),
+ HANDLE_FUNC(__pkvm_mark_hyp),
};

static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 5c88a325e6fc..dd03252b9574 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -25,6 +25,8 @@ struct host_kvm host_kvm;
struct hyp_pool host_s2_mem;
struct hyp_pool host_s2_dev;

+static const u8 pkvm_hyp_id = 1;
+
static void *host_s2_zalloc_pages_exact(size_t size)
{
return hyp_alloc_pages(&host_s2_mem, get_order(size));
@@ -182,6 +184,18 @@ static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range)
return false;
}

+static bool range_is_memory(u64 start, u64 end)
+{
+ struct kvm_mem_range r1, r2;
+
+ if (!find_mem_range(start, &r1) || !find_mem_range(end, &r2))
+ return false;
+ if (r1.start != r2.start)
+ return false;
+
+ return true;
+}
+
static inline int __host_stage2_idmap(u64 start, u64 end,
enum kvm_pgtable_prot prot,
struct hyp_pool *pool)
@@ -229,6 +243,25 @@ static int host_stage2_idmap(u64 addr)
return ret;
}

+int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
+{
+ int ret;
+
+ /*
+ * host_stage2_unmap_dev_all() currently relies on MMIO mappings being
+ * non-persistent, so don't allow changing page ownership in MMIO range.
+ */
+ if (!range_is_memory(start, end))
+ return -EINVAL;
+
+ hyp_spin_lock(&host_kvm.lock);
+ ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start,
+ &host_s2_mem, pkvm_hyp_id);
+ hyp_spin_unlock(&host_kvm.lock);
+
+ return ret != -EAGAIN ? ret : 0;
+}
+
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
{
struct kvm_vcpu_fault_info fault;
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-15 18:37:45

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v5 32/36] KVM: arm64: Provide sanitized mmfr* registers at EL2

We will need to read sanitized values of mmfr{0,1}_el1 at EL2 soon, so
add them to the list of copied variables.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_cpufeature.h | 2 ++
arch/arm64/kvm/sys_regs.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
index efba1b89b8a4..48cba6cecd71 100644
--- a/arch/arm64/include/asm/kvm_cpufeature.h
+++ b/arch/arm64/include/asm/kvm_cpufeature.h
@@ -15,3 +15,5 @@
#endif

KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_ctrel0);
+KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_id_aa64mmfr0_el1);
+KVM_HYP_CPU_FTR_REG(arm64_ftr_reg_id_aa64mmfr1_el1);
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3ec34c25e877..dfb3b4f9ca84 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2784,6 +2784,8 @@ struct __ftr_reg_copy_entry {
struct arm64_ftr_reg *dst;
} hyp_ftr_regs[] __initdata = {
CPU_FTR_REG_HYP_COPY(SYS_CTR_EL0, arm64_ftr_reg_ctrel0),
+ CPU_FTR_REG_HYP_COPY(SYS_ID_AA64MMFR0_EL1, arm64_ftr_reg_id_aa64mmfr0_el1),
+ CPU_FTR_REG_HYP_COPY(SYS_ID_AA64MMFR1_EL1, arm64_ftr_reg_id_aa64mmfr1_el1),
};

void __init setup_kvm_el2_caps(void)
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-16 18:53:13

by Mate Toth-Pal

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On 2021-03-15 15:35, Quentin Perret wrote:
> +static int host_stage2_idmap(u64 addr)
> +{
> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> + struct kvm_mem_range range;
> + bool is_memory = find_mem_range(addr, &range);
> + struct hyp_pool *pool = is_memory ? &host_s2_mem : &host_s2_dev;
> + int ret;
> +
> + if (is_memory)
> + prot |= KVM_PGTABLE_PROT_X;
> +
> /* Ensure write of the host VMID */

Hi Quentin,

Testing the latest version of the patchset, we seem to have found
another thing related to FEAT_S2FWB.

This function always sets Normal memory in the stage 2 table, even if
the address in stage 1 was mapped as a device memory. However with the
current settings for normal memory (i.e. MT_S2_FWB_NORMAL being defined
to 6) according to the architecture (See Arm ARM, 'D5.5.5 Stage 2 memory
region type and cacheability attributes when FEAT_S2FWB is implemented')
the resulting attributes will be 'Normal Write-Back' even if the stage 1
mapping sets device memory. Accessing device memory mapped like this
causes an SError on some platforms with FEAT_S2FWB being implemented.

Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior,
and the resulting memory type would be device.

Another solution would be to add an else branch to the last 'if' above
like this:

diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index fffa432ce3eb..54e5d3b0b2e1 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)

        if (is_memory)
                prot |= KVM_PGTABLE_PROT_X;
+       else
+               prot |= KVM_PGTABLE_PROT_DEVICE;

        hyp_spin_lock(&host_kvm.lock);
        ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot,
&range);

Best regards,
Mate Toth-Pal

2021-03-16 20:08:05

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> Testing the latest version of the patchset, we seem to have found another
> thing related to FEAT_S2FWB.

Argh! I wish I could put my hands on hardware with FWB. Thanks again for
the report.

> This function always sets Normal memory in the stage 2 table, even if the
> address in stage 1 was mapped as a device memory. However with the current
> settings for normal memory (i.e. MT_S2_FWB_NORMAL being defined to 6)
> according to the architecture (See Arm ARM, 'D5.5.5 Stage 2 memory region
> type and cacheability attributes when FEAT_S2FWB is implemented') the
> resulting attributes will be 'Normal Write-Back' even if the stage 1 mapping
> sets device memory. Accessing device memory mapped like this causes an
> SError on some platforms with FEAT_S2FWB being implemented.

Right.

> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> the resulting memory type would be device.

Sounds like the correct fix here -- see below.

> Another solution would be to add an else branch to the last 'if' above like
> this:
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index fffa432ce3eb..54e5d3b0b2e1 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)
>
>         if (is_memory)
>                 prot |= KVM_PGTABLE_PROT_X;
> +       else
> +               prot |= KVM_PGTABLE_PROT_DEVICE;
>
>         hyp_spin_lock(&host_kvm.lock);
>         ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot,
> &range);

While this would work in this particular case, I don't think we should
force all non-RAM accesses as device as the host may have reasons not to
want this (e.g. accessing SRAM). Your first suggestions allows us to do
just that, so it is preferable I think.

Thanks,
Quentin

2021-03-16 20:31:48

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > the resulting memory type would be device.
>
> Sounds like the correct fix here -- see below.

Just to clarify this, I meant this should be the configuration for the
host stage-2. We'll want to keep the existing behaviour for guests I
believe.

Thanks,
Quentin

2021-03-16 20:41:28

by Mate Toth-Pal

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On 2021-03-16 15:29, Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>> the resulting memory type would be device.
>>
>> Sounds like the correct fix here -- see below.
>
> Just to clarify this, I meant this should be the configuration for the
> host stage-2. We'll want to keep the existing behaviour for guests I
> believe.

I Agree.

Thanks,
Mate

2021-03-16 21:12:03

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
> On 2021-03-16 15:29, Quentin Perret wrote:
> > On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> > > On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > > > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > > > the resulting memory type would be device.
> > >
> > > Sounds like the correct fix here -- see below.
> >
> > Just to clarify this, I meant this should be the configuration for the
> > host stage-2. We'll want to keep the existing behaviour for guests I
> > believe.
>
> I Agree.

OK, so the below seems to boot on my non-FWB-capable hardware and should
fix the issue. Could you by any chance give it a spin?

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index b93a2a3526ab..b2066bd03ca2 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -76,10 +76,11 @@ struct kvm_pgtable {

/**
* enum kvm_pgtable_prot - Page-table permissions and attributes.
- * @KVM_PGTABLE_PROT_X: Execute permission.
- * @KVM_PGTABLE_PROT_W: Write permission.
- * @KVM_PGTABLE_PROT_R: Read permission.
- * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_PROT_X: Execute permission.
+ * @KVM_PGTABLE_PROT_W: Write permission.
+ * @KVM_PGTABLE_PROT_R: Read permission.
+ * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_PROT_S2_NOFWB: Don't enforce Normal-WB with FWB.
*/
enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_X = BIT(0),
@@ -87,6 +88,8 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_R = BIT(2),

KVM_PGTABLE_PROT_DEVICE = BIT(3),
+
+ KVM_PGTABLE_PROT_S2_NOFWB = BIT(4),
};

#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c759faf7a1ff..e695d2e1839d 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -144,13 +144,16 @@
* Memory types for Stage-2 translation
*/
#define MT_S2_NORMAL 0xf
+#define MT_S2_WEAK MT_S2_NORMAL
#define MT_S2_DEVICE_nGnRE 0x1

/*
* Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
- * Stage-2 enforces Normal-WB and Device-nGnRE
+ * Stage-2 enforces Normal-WB and Device-nGnRE by default. The 'weak' mode
+ * honors Stage-1 attributes.
*/
#define MT_S2_FWB_NORMAL 6
+#define MT_S2_FWB_WEAK 7
#define MT_S2_FWB_DEVICE_nGnRE 1

#ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index dd03252b9574..1ff72babe565 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -214,6 +214,8 @@ static int host_stage2_idmap(u64 addr)

if (is_memory)
prot |= KVM_PGTABLE_PROT_X;
+ else
+ prot |= KVM_PGTABLE_PROT_S2_NOFWB;

hyp_spin_lock(&host_kvm.lock);
ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3a971df278bd..bd1b8464a537 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -343,6 +343,9 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
if (!(prot & KVM_PGTABLE_PROT_R))
return -EINVAL;

+ if (prot & KVM_PGTABLE_PROT_S2_NOFWB)
+ return -EINVAL;
+
if (prot & KVM_PGTABLE_PROT_X) {
if (prot & KVM_PGTABLE_PROT_W)
return -EINVAL;
@@ -510,9 +513,18 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
static int stage2_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
- kvm_pte_t attr = device ? PAGE_S2_MEMATTR(DEVICE_nGnRE) :
- PAGE_S2_MEMATTR(NORMAL);
+ bool nofwb = prot & KVM_PGTABLE_PROT_S2_NOFWB;
u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+ kvm_pte_t attr;
+
+ WARN_ON(nofwb && device);
+
+ if (device)
+ attr = PAGE_S2_MEMATTR(DEVICE_nGnRE);
+ else if (nofwb)
+ attr = PAGE_S2_MEMATTR(WEAK);
+ else
+ attr = PAGE_S2_MEMATTR(NORMAL);

if (!(prot & KVM_PGTABLE_PROT_X))
attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;

2021-03-17 08:44:45

by Mate Toth-Pal

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On 2021-03-16 18:46, Quentin Perret wrote:
> On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
>> On 2021-03-16 15:29, Quentin Perret wrote:
>>> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>>>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>>>> the resulting memory type would be device.
>>>>
>>>> Sounds like the correct fix here -- see below.
>>>
>>> Just to clarify this, I meant this should be the configuration for the
>>> host stage-2. We'll want to keep the existing behaviour for guests I
>>> believe.
>>
>> I Agree.
>
> OK, so the below seems to boot on my non-FWB-capable hardware and should
> fix the issue. Could you by any chance give it a spin?
>

Sure, I can give it a go. I was trying to apply the patch on top of
https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5
but it seems that your base is significantly different. Can you give
some hints what should I use as base?

2021-03-17 09:04:14

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On Wednesday 17 Mar 2021 at 09:41:09 (+0100), Mate Toth-Pal wrote:
> On 2021-03-16 18:46, Quentin Perret wrote:
> > On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
> > > On 2021-03-16 15:29, Quentin Perret wrote:
> > > > On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
> > > > > On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
> > > > > > Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
> > > > > > the resulting memory type would be device.
> > > > >
> > > > > Sounds like the correct fix here -- see below.
> > > >
> > > > Just to clarify this, I meant this should be the configuration for the
> > > > host stage-2. We'll want to keep the existing behaviour for guests I
> > > > believe.
> > >
> > > I Agree.
> >
> > OK, so the below seems to boot on my non-FWB-capable hardware and should
> > fix the issue. Could you by any chance give it a spin?
> >
>
> Sure, I can give it a go. I was trying to apply the patch on top of https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5
> but it seems that your base is significantly different. Can you give some
> hints what should I use as base?

Oh interesting, it _should_ apply on v5. I just pushed a branch with
everything applied if that helps:

https://android-kvm.googlesource.com/linux qperret/wip/fix-fwb-host-stage2

Thanks again!
Quentin

2021-03-17 17:18:15

by Quentin Perret

[permalink] [raw]
Subject: [PATCH 0/2] Fixes for FWB

Hi folks,

This is an alternative solution to the KVM_PGTABLE_PROT_S2_NOFWB patch I
shared earlier (and which is a bit of a hack).

With this series we basically force FWB off for the host stage-2, even
when the CPUs support it. This is done by passing flags to the pgtable
init function, and propagating it down where needed. It's a bit more
intrusive, but cleaner conceptually.

Thoughts?

Thanks,
Quentin

Quentin Perret (2):
KVM: arm64: Introduce KVM_PGTABLE_S2_NOFWB Stage-2 flag
KVM: arm64: Disable FWB in host stage-2

arch/arm64/include/asm/kvm_pgtable.h | 19 +++++++++--
arch/arm64/include/asm/pgtable-prot.h | 4 +--
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 6 ++--
arch/arm64/kvm/hyp/pgtable.c | 49 +++++++++++++++++----------
4 files changed, 52 insertions(+), 26 deletions(-)

--
2.31.0.rc2.261.g7f71774620-goog

2021-03-17 17:18:25

by Quentin Perret

[permalink] [raw]
Subject: [PATCH 2/2] KVM: arm64: Disable FWB in host stage-2

We need the host to be in control of cacheability of its own mappings,
so let's disable FWB altogether in its stage 2.

Signed-off-by: Quentin Perret <[email protected]>

---

Obviously this will have to be folded in the relevant patch for v6, but
I kept it separate for the sake of review.
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index dd03252b9574..c472c3becf40 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -94,8 +94,8 @@ int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
if (ret)
return ret;

- ret = kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch,
- &host_kvm.mm_ops);
+ ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch,
+ &host_kvm.mm_ops, KVM_PGTABLE_S2_NOFWB);
if (ret)
return ret;

@@ -116,8 +116,6 @@ int __pkvm_prot_finalize(void)
params->vttbr = kvm_get_vttbr(mmu);
params->vtcr = host_kvm.arch.vtcr;
params->hcr_el2 |= HCR_VM;
- if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
- params->hcr_el2 |= HCR_FWB;
kvm_flush_dcache_to_poc(params, sizeof(*params));

write_sysreg(params->hcr_el2, hcr_el2);
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-17 17:20:38

by Mate Toth-Pal

[permalink] [raw]
Subject: Re: [PATCH v5 33/36] KVM: arm64: Wrap the host with a stage 2

On 2021-03-17 10:02, Quentin Perret wrote:
> On Wednesday 17 Mar 2021 at 09:41:09 (+0100), Mate Toth-Pal wrote:
>> On 2021-03-16 18:46, Quentin Perret wrote:
>>> On Tuesday 16 Mar 2021 at 16:16:18 (+0100), Mate Toth-Pal wrote:
>>>> On 2021-03-16 15:29, Quentin Perret wrote:
>>>>> On Tuesday 16 Mar 2021 at 12:53:53 (+0000), Quentin Perret wrote:
>>>>>> On Tuesday 16 Mar 2021 at 13:28:42 (+0100), Mate Toth-Pal wrote:
>>>>>>> Changing the value of MT_S2_FWB_NORMAL to 7 would change this behavior, and
>>>>>>> the resulting memory type would be device.
>>>>>>
>>>>>> Sounds like the correct fix here -- see below.
>>>>>
>>>>> Just to clarify this, I meant this should be the configuration for the
>>>>> host stage-2. We'll want to keep the existing behaviour for guests I
>>>>> believe.
>>>>
>>>> I Agree.
>>>
>>> OK, so the below seems to boot on my non-FWB-capable hardware and should
>>> fix the issue. Could you by any chance give it a spin?
>>>
>>
>> Sure, I can give it a go. I was trying to apply the patch on top of https://android-kvm.googlesource.com/linux/+/refs/heads/qperret/host-stage2-v5
>> but it seems that your base is significantly different. Can you give some
>> hints what should I use as base?
>
> Oh interesting, it _should_ apply on v5. I just pushed a branch with
> everything applied if that helps:
>
> https://android-kvm.googlesource.com/linux qperret/wip/fix-fwb-host-stage2
>
> Thanks again!
> Quentin
>

Sorry, it was my mistake. I must have messed up my source, because the
patch should really have applied on the v5 patchset.

So I was able to boot the kernel from the branch
qperret/wip/fix-fwb-host-stage2 on the machine I was testing, so the fix
looks to be OK.

Thanks,
Mate