2021-01-08 12:17:25

by Quentin Perret

Subject: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host

Hi all,

This is the v2 of the series previously posted here:

https://lore.kernel.org/kvmarm/[email protected]/

This basically allows us to wrap the host with a stage 2 when running in
nVHE, hence paving the way for protecting guest memory from the host in
the future (among other use-cases). For more details about the
motivation and the design angle taken here, I would recommend having a
look at the cover letter of v1, and/or watching these presentations at
LPC [1] and KVM Forum 2020 [2].

In short, the changes since v1 include:

- Renamed most pkvm-specific pgtable functions as pkvm_* to avoid
confusion with the host's (Fuad)

- Added an IC flush when switching pgtables (Fuad, Mark)

- Cleaned up the PI aliasing in image-vars.h (David)

- Added a TLB flush when enabling the host stage 2 to avoid stale TLB
entries left by the bootloader

- Fixed the early memory reservation by using NR_CPUS instead of
num_possible_cpus() (which is always 1 that early)

- Added missing preempt_{dis,en}able() guards in
kvm_hyp_enable_protection()

- Rebased on latest kvmarm/next

And if you'd like a branch that has all the goodies, here it is:

https://android-kvm.googlesource.com/linux qperret/host-stage2-v2

Thanks!
Quentin

[1] https://youtu.be/54q6RzS9BpQ?t=10859
[2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google

Quentin Perret (23):
KVM: arm64: Initialize kvm_nvhe_init_params early
KVM: arm64: Avoid free_page() in page-table allocator
KVM: arm64: Factor memory allocation out of pgtable.c
KVM: arm64: Introduce a BSS section for use at Hyp
KVM: arm64: Make kvm_call_hyp() a function call at Hyp
KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
KVM: arm64: Introduce an early Hyp page allocator
KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
KVM: arm64: Introduce a Hyp buddy page allocator
KVM: arm64: Enable access to sanitized CPU features at EL2
KVM: arm64: Factor out vector address calculation
of/fdt: Introduce early_init_dt_add_memory_hyp()
KVM: arm64: Prepare Hyp memory protection
KVM: arm64: Elevate Hyp mappings creation at EL2
KVM: arm64: Use kvm_arch for stage 2 pgtable
KVM: arm64: Use kvm_arch in kvm_s2_mmu
KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
KVM: arm64: Refactor kvm_arm_setup_stage2()
KVM: arm64: Refactor __load_guest_stage2()
KVM: arm64: Refactor __populate_fault_info()
KVM: arm64: Make memcache anonymous in pgtable allocator
KVM: arm64: Reserve memory for host stage 2
KVM: arm64: Wrap the host with a stage 2

Will Deacon (3):
arm64: lib: Annotate {clear,copy}_page() as position-independent
KVM: arm64: Link position-independent string routines into .hyp.text
arm64: kvm: Add standalone ticket spinlock implementation for use at
hyp

arch/arm64/include/asm/cpufeature.h | 1 +
arch/arm64/include/asm/hyp_image.h | 7 +
arch/arm64/include/asm/kvm_asm.h | 7 +
arch/arm64/include/asm/kvm_cpufeature.h | 19 ++
arch/arm64/include/asm/kvm_host.h | 16 +-
arch/arm64/include/asm/kvm_hyp.h | 8 +
arch/arm64/include/asm/kvm_mmu.h | 69 +++++-
arch/arm64/include/asm/kvm_pgtable.h | 41 +++-
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/asm-offsets.c | 3 +
arch/arm64/kernel/cpufeature.c | 12 +
arch/arm64/kernel/image-vars.h | 33 +++
arch/arm64/kernel/vmlinux.lds.S | 7 +
arch/arm64/kvm/arm.c | 144 ++++++++++--
arch/arm64/kvm/hyp/Makefile | 2 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +--
arch/arm64/kvm/hyp/include/nvhe/early_alloc.h | 14 ++
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 32 +++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 33 +++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 55 +++++
arch/arm64/kvm/hyp/include/nvhe/mm.h | 107 +++++++++
arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 9 +-
arch/arm64/kvm/hyp/nvhe/cache.S | 13 ++
arch/arm64/kvm/hyp/nvhe/cpufeature.c | 8 +
arch/arm64/kvm/hyp/nvhe/early_alloc.c | 60 +++++
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 41 ++++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 48 ++++
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 191 ++++++++++++++++
arch/arm64/kvm/hyp/nvhe/mm.c | 174 ++++++++++++++
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 185 +++++++++++++++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 4 +-
arch/arm64/kvm/hyp/nvhe/setup.c | 214 ++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++
arch/arm64/kvm/hyp/nvhe/switch.c | 12 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
arch/arm64/kvm/hyp/pgtable.c | 98 ++++----
arch/arm64/kvm/hyp/reserved_mem.c | 104 +++++++++
arch/arm64/kvm/mmu.c | 114 +++++++++-
arch/arm64/kvm/reset.c | 42 +---
arch/arm64/lib/clear_page.S | 4 +-
arch/arm64/lib/copy_page.S | 4 +-
arch/arm64/mm/init.c | 3 +
drivers/of/fdt.c | 5 +
45 files changed, 1954 insertions(+), 145 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

--
2.30.0.284.gd98b1dd5eaa7-goog


2021-01-08 12:18:04

by Quentin Perret

Subject: [RFC PATCH v2 05/26] KVM: arm64: Avoid free_page() in page-table allocator

Currently, the KVM page-table allocator uses a mix of put_page() and
free_page() calls depending on the context even though page-allocation
is always achieved using variants of __get_free_page().

Make the code consistent by using put_page() throughout, and reduce the
memory management API surface used by the page-table code. This will
ease factoring out page-allocation from pgtable.c, which is a
pre-requisite to creating page-tables at EL2.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..d7122c5eac24 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -410,7 +410,7 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
- free_page((unsigned long)kvm_pte_follow(*ptep));
+ put_page(virt_to_page(kvm_pte_follow(*ptep)));
return 0;
}

@@ -422,7 +422,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
- free_page((unsigned long)pgt->pgd);
+ put_page(virt_to_page(pgt->pgd));
pgt->pgd = NULL;
}

@@ -551,7 +551,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
if (!data->anchor)
return 0;

- free_page((unsigned long)kvm_pte_follow(*ptep));
+ put_page(virt_to_page(kvm_pte_follow(*ptep)));
put_page(virt_to_page(ptep));

if (data->anchor == ptep) {
@@ -674,7 +674,7 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
}

if (childp)
- free_page((unsigned long)childp);
+ put_page(virt_to_page(childp));

return 0;
}
@@ -871,7 +871,7 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
put_page(virt_to_page(ptep));

if (kvm_pte_table(pte, level))
- free_page((unsigned long)kvm_pte_follow(pte));
+ put_page(virt_to_page(kvm_pte_follow(pte)));

return 0;
}
--
2.30.0.284.gd98b1dd5eaa7-goog
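
For readers unfamiliar with the refcounting involved, here is a toy userspace model of why put_page() is enough to release a page obtained from __get_free_page(): the allocation leaves the page with a single reference, so dropping the last reference frees it. All toy_* names are made up for illustration; this is a sketch of the idea, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the reference count is modelled. */
struct toy_page {
	int refcount;
	bool freed;
};

/* Models __get_free_page(): a fresh page starts with one reference. */
static void toy_alloc_page(struct toy_page *p)
{
	p->refcount = 1;
	p->freed = false;
}

/* Models get_page(): take an extra reference. */
static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

/* Models put_page(): drop one reference, release on the last one. */
static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		p->freed = true;
}
```

In this model a page that was never passed to toy_get_page() is released by a single toy_put_page(), while a page with extra references stays alive until the last one is dropped.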

2021-01-08 12:18:09

by Quentin Perret

Subject: [RFC PATCH v2 03/26] arm64: kvm: Add standalone ticket spinlock implementation for use at hyp

From: Will Deacon <[email protected]>

We will soon need to synchronise multiple CPUs in the hyp text at EL2.
The qspinlock-based locking used by the host is overkill for this purpose
and relies on the kernel's "percpu" implementation for the MCS nodes.

Implement a simple ticket locking scheme based heavily on the code removed
by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
with qspinlock").

Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
1 file changed, 92 insertions(+)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h

diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
new file mode 100644
index 000000000000..7584c397bbac
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * A stand-alone ticket spinlock implementation for use by the non-VHE
+ * KVM hypervisor code running at EL2.
+ *
+ * Copyright (C) 2020 Google LLC
+ * Author: Will Deacon <[email protected]>
+ *
+ * Heavily based on the implementation removed by c11090474d70 which was:
+ * Copyright (C) 2012 ARM Ltd.
+ */
+
+#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
+#define __ARM64_KVM_NVHE_SPINLOCK_H__
+
+#include <asm/alternative.h>
+#include <asm/lse.h>
+
+typedef union hyp_spinlock {
+ u32 __val;
+ struct {
+#ifdef __AARCH64EB__
+ u16 next, owner;
+#else
+ u16 owner, next;
+#endif
+ };
+} hyp_spinlock_t;
+
+#define hyp_spin_lock_init(l) \
+do { \
+ *(l) = (hyp_spinlock_t){ .__val = 0 }; \
+} while (0)
+
+static inline void hyp_spin_lock(hyp_spinlock_t *lock)
+{
+ u32 tmp;
+ hyp_spinlock_t lockval, newval;
+
+ asm volatile(
+ /* Atomically increment the next ticket. */
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+" prfm pstl1strm, %3\n"
+"1: ldaxr %w0, %3\n"
+" add %w1, %w0, #(1 << 16)\n"
+" stxr %w2, %w1, %3\n"
+" cbnz %w2, 1b\n",
+ /* LSE atomics */
+" mov %w2, #(1 << 16)\n"
+" ldadda %w2, %w0, %3\n"
+ __nops(3))
+
+ /* Did we get the lock? */
+" eor %w1, %w0, %w0, ror #16\n"
+" cbz %w1, 3f\n"
+ /*
+ * No: spin on the owner. Send a local event to avoid missing an
+ * unlock before the exclusive load.
+ */
+" sevl\n"
+"2: wfe\n"
+" ldaxrh %w2, %4\n"
+" eor %w1, %w2, %w0, lsr #16\n"
+" cbnz %w1, 2b\n"
+ /* We got the lock. Critical section starts here. */
+"3:"
+ : "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
+ : "Q" (lock->owner)
+ : "memory");
+}
+
+static inline void hyp_spin_unlock(hyp_spinlock_t *lock)
+{
+ u64 tmp;
+
+ asm volatile(
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " ldrh %w1, %0\n"
+ " add %w1, %w1, #1\n"
+ " stlrh %w1, %0",
+ /* LSE atomics */
+ " mov %w1, #1\n"
+ " staddlh %w1, %0\n"
+ __nops(1))
+ : "=Q" (lock->owner), "=&r" (tmp)
+ :
+ : "memory");
+}
+
+#endif /* __ARM64_KVM_NVHE_SPINLOCK_H__ */
--
2.30.0.284.gd98b1dd5eaa7-goog
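
The ticket scheme itself can be sketched in portable C11, which may help when reading the assembly above. This is a simplified model under stated assumptions: it uses two separate 16-bit counters rather than the packed 32-bit word, busy-waits instead of using wfe/sevl, and the ticket_lock_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Portable model: two 16-bit counters instead of one packed 32-bit word. */
struct ticket_lock {
	atomic_ushort next;	/* next ticket to hand out */
	atomic_ushort owner;	/* ticket currently being served */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket; ordering comes from the acquire load below. */
	unsigned short ticket =
		atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

	/* Spin until our ticket is being served. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != ticket)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Only the holder writes owner, so a plain load + store-release works. */
	unsigned short cur =
		atomic_load_explicit(&l->owner, memory_order_relaxed);

	/* The lock must be held: at least one ticket is outstanding. */
	assert(cur != atomic_load_explicit(&l->next, memory_order_relaxed));
	atomic_store_explicit(&l->owner, (unsigned short)(cur + 1),
			      memory_order_release);
}
```

Waiters are served strictly in ticket order, which is the fairness property that makes this scheme attractive despite its simplicity.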

2021-01-08 12:18:38

by Quentin Perret

Subject: [RFC PATCH v2 02/26] KVM: arm64: Link position-independent string routines into .hyp.text

From: Will Deacon <[email protected]>

Pull clear_page(), copy_page(), memcpy() and memset() into the nVHE hyp
code and ensure that we always execute the '__pi_' entry point on the
off chance that it changes in the future.

[ qperret: Commit title nits ]

Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/hyp_image.h | 3 +++
arch/arm64/kernel/image-vars.h | 11 +++++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 4 ++++
3 files changed, 18 insertions(+)

diff --git a/arch/arm64/include/asm/hyp_image.h b/arch/arm64/include/asm/hyp_image.h
index daa1a1da539e..e06842756051 100644
--- a/arch/arm64/include/asm/hyp_image.h
+++ b/arch/arm64/include/asm/hyp_image.h
@@ -31,6 +31,9 @@
*/
#define KVM_NVHE_ALIAS(sym) kvm_nvhe_sym(sym) = sym;

+/* Defines a linker script alias for KVM nVHE hyp symbols */
+#define KVM_NVHE_ALIAS_HYP(first, sec) kvm_nvhe_sym(first) = kvm_nvhe_sym(sec);
+
#endif /* LINKER_SCRIPT */

#endif /* __ARM64_HYP_IMAGE_H__ */
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 39289d75118d..43f3a1d6e92d 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -102,6 +102,17 @@ KVM_NVHE_ALIAS(__stop___kvm_ex_table);
/* Array containing bases of nVHE per-CPU memory regions. */
KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);

+/* Position-independent library routines */
+KVM_NVHE_ALIAS_HYP(clear_page, __pi_clear_page);
+KVM_NVHE_ALIAS_HYP(copy_page, __pi_copy_page);
+KVM_NVHE_ALIAS_HYP(memcpy, __pi_memcpy);
+KVM_NVHE_ALIAS_HYP(memset, __pi_memset);
+
+#ifdef CONFIG_KASAN
+KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
+KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
+#endif
+
#endif /* CONFIG_KVM */

#endif /* __ARM64_KERNEL_IMAGE_VARS_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 1f1e351c5fe2..590fdefb42dd 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -6,10 +6,14 @@
asflags-y := -D__KVM_NVHE_HYPERVISOR__
ccflags-y := -D__KVM_NVHE_HYPERVISOR__

+lib-objs := clear_page.o copy_page.o memcpy.o memset.o
+lib-objs := $(addprefix ../../../lib/, $(lib-objs))
+
obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
+obj-y += $(lib-objs)

##
## Build rules for compiling nVHE hyp code
--
2.30.0.284.gd98b1dd5eaa7-goog
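
To see what the new KVM_NVHE_ALIAS_HYP() macro hands to the linker, the expansion can be turned into a string in a small userspace program. The sketch below assumes kvm_nvhe_sym() simply prepends __kvm_nvhe_ to its argument; the STR helpers and the alias_* functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed shape of the kernel macros: nVHE symbols get a common prefix. */
#define kvm_nvhe_sym(sym)	__kvm_nvhe_##sym
#define KVM_NVHE_ALIAS_HYP(first, sec)	kvm_nvhe_sym(first) = kvm_nvhe_sym(sec);

/* Two-level stringification so the argument is macro-expanded first. */
#define STR_(...)	#__VA_ARGS__
#define STR(...)	STR_(__VA_ARGS__)

/* Text of the linker-script assignment generated for clear_page. */
static const char *alias_text(void)
{
	return STR(KVM_NVHE_ALIAS_HYP(clear_page, __pi_clear_page));
}

/* Returns non-zero if the generated assignment mentions the given symbol. */
static int alias_mentions(const char *sym)
{
	assert(sym != NULL);
	return strstr(alias_text(), sym) != NULL;
}
```

Both sides of the assignment carry the nVHE prefix, so the alias makes the hyp-side clear_page resolve to the hyp-side copy of the '__pi_' entry point rather than to anything in the host kernel.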

2021-01-08 12:19:00

by Quentin Perret

Subject: [RFC PATCH v2 19/26] KVM: arm64: Use kvm_arch in kvm_s2_mmu

In order to make use of the stage 2 pgtable code for the host stage 2,
change kvm_s2_mmu to use a kvm_arch pointer in lieu of the kvm pointer,
as the host will have the former but not the latter.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 2 +-
arch/arm64/include/asm/kvm_mmu.h | 7 ++++++-
arch/arm64/kvm/mmu.c | 8 ++++----
3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9a2feb83eea0..9d59bebcc5ef 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -95,7 +95,7 @@ struct kvm_s2_mmu {
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;

- struct kvm *kvm;
+ struct kvm_arch *arch;
};

struct kvm_arch_memory_slot {
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 6c8466a042a9..662f0415344e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -299,7 +299,7 @@ static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
*/
static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
{
- write_sysreg(kern_hyp_va(mmu->kvm)->arch.vtcr, vtcr_el2);
+ write_sysreg(kern_hyp_va(mmu->arch)->vtcr, vtcr_el2);
write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);

/*
@@ -309,5 +309,10 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
*/
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
}
+
+static inline struct kvm *kvm_s2_mmu_to_kvm(struct kvm_s2_mmu *mmu)
+{
+ return container_of(mmu->arch, struct kvm, arch);
+}
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7e6263103943..6f9bf71722bd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -169,7 +169,7 @@ static void *kvm_host_va(phys_addr_t phys)
static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size,
bool may_block)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
phys_addr_t end = start + size;

assert_spin_locked(&kvm->mmu_lock);
@@ -474,7 +474,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
for_each_possible_cpu(cpu)
*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;

- mmu->kvm = kvm;
+ mmu->arch = &kvm->arch;
mmu->pgt = pgt;
mmu->pgd_phys = __pa(pgt->pgd);
mmu->vmid.vmid_gen = 0;
@@ -556,7 +556,7 @@ void stage2_unmap_vm(struct kvm *kvm)

void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;

spin_lock(&kvm->mmu_lock);
@@ -625,7 +625,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
*/
static void stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
{
- struct kvm *kvm = mmu->kvm;
+ struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_wrprotect);
}

--
2.30.0.284.gd98b1dd5eaa7-goog
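
The kvm_s2_mmu_to_kvm() helper relies on container_of() to get from the embedded arch member back to the enclosing structure. A minimal userspace sketch of that pointer arithmetic follows; the toy_* structures are stand-ins with made-up fields, and the container_of() here is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): no type checking, unlike the kernel's. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy stand-ins; the real structures have many more fields. */
struct toy_kvm_arch {
	unsigned long vtcr;
};

struct toy_kvm {
	int mmu_lock;
	struct toy_kvm_arch arch;	/* embedded, not a pointer */
};

struct toy_s2_mmu {
	struct toy_kvm_arch *arch;
};

/* Recover the enclosing toy_kvm from the arch pointer held by the mmu. */
static struct toy_kvm *toy_s2_mmu_to_kvm(struct toy_s2_mmu *mmu)
{
	assert(mmu->arch != NULL);
	return container_of(mmu->arch, struct toy_kvm, arch);
}
```

The conversion is only meaningful when the arch structure really is embedded in the outer structure. That holds for guests, but not for the host, which has an arch structure without an enclosing kvm, so such a helper must only be used on guest paths.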

2021-01-08 12:19:23

by Quentin Perret

Subject: [RFC PATCH v2 11/26] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

In order to use the kernel list library at EL2, introduce stubs for the
CONFIG_DEBUG_LIST out-of-line calls.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c

diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 1fc0684a7678..33bd381d8f73 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
new file mode 100644
index 000000000000..c0aa6bbfd79d
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/stub.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Stubs for out-of-line function calls caused by re-using kernel
+ * infrastructure at EL2.
+ *
+ * Copyright (C) 2020 - Google LLC
+ */
+
+#include <linux/list.h>
+
+#ifdef CONFIG_DEBUG_LIST
+bool __list_add_valid(struct list_head *new, struct list_head *prev,
+ struct list_head *next)
+{
+ return true;
+}
+
+bool __list_del_entry_valid(struct list_head *entry)
+{
+ return true;
+}
+#endif
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-08 12:19:35

by Quentin Perret

Subject: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

Introduce the infrastructure in KVM to copy CPU feature registers into
EL2-owned data structures, allowing sanitised values to be read
directly at EL2 in nVHE.

Given that only a subset of these features is read by the hypervisor,
the ones that need to be copied are to be listed under
<asm/kvm_cpufeature.h> together with the name of the nVHE variable that
will hold the copy.

While at it, introduce the first user of this infrastructure by
implementing __flush_dcache_area at EL2, which needs
arm64_ftr_reg_ctrel0.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 1 +
arch/arm64/include/asm/kvm_cpufeature.h | 17 ++++++++++++++
arch/arm64/kernel/cpufeature.c | 12 ++++++++++
arch/arm64/kvm/arm.c | 31 +++++++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++
arch/arm64/kvm/hyp/nvhe/cpufeature.c | 8 +++++++
7 files changed, 84 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 16063c813dcd..742e9bcc051b 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -600,6 +600,7 @@ void __init setup_cpu_features(void);
void check_local_cpu_capabilities(void);

u64 read_sanitised_ftr_reg(u32 id);
+int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst);

static inline bool cpu_supports_mixed_endian_el0(void)
{
diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
new file mode 100644
index 000000000000..d34f85cba358
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_cpufeature.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 - Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <asm/cpufeature.h>
+
+#ifndef KVM_HYP_CPU_FTR_REG
+#if defined(__KVM_NVHE_HYPERVISOR__)
+#define KVM_HYP_CPU_FTR_REG(id, name) extern struct arm64_ftr_reg name;
+#else
+#define KVM_HYP_CPU_FTR_REG(id, name) DECLARE_KVM_NVHE_SYM(name);
+#endif
+#endif
+
+KVM_HYP_CPU_FTR_REG(SYS_CTR_EL0, arm64_ftr_reg_ctrel0)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index bc3549663957..c2019aaaadc3 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1113,6 +1113,18 @@ u64 read_sanitised_ftr_reg(u32 id)
}
EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);

+int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst)
+{
+ struct arm64_ftr_reg *regp = get_arm64_ftr_reg(id);
+
+ if (!regp)
+ return -EINVAL;
+
+ memcpy(dst, regp, sizeof(*regp));
+
+ return 0;
+}
+
#define read_sysreg_case(r) \
case r: return read_sysreg_s(r)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 51b53ca36dc5..9fd769349e9e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -34,6 +34,7 @@
#include <asm/virt.h>
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
+#include <asm/kvm_cpufeature.h>
#include <asm/kvm_mmu.h>
#include <asm/kvm_emulate.h>
#include <asm/sections.h>
@@ -1697,6 +1698,29 @@ static void teardown_hyp_mode(void)
}
}

+#undef KVM_HYP_CPU_FTR_REG
+#define KVM_HYP_CPU_FTR_REG(id, name) \
+ { .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
+static const struct __ftr_reg_copy_entry {
+ u32 sys_id;
+ struct arm64_ftr_reg *dst;
+} hyp_ftr_regs[] = {
+ #include <asm/kvm_cpufeature.h>
+};
+
+static int copy_cpu_ftr_regs(void)
+{
+ int i, ret;
+
+ for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
+ ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
/**
* Inits Hyp-mode on all online CPUs
*/
@@ -1705,6 +1729,13 @@ static int init_hyp_mode(void)
int cpu;
int err = 0;

+ /*
+ * Copy the required CPU feature registers into their EL2 counterparts
+ */
+ err = copy_cpu_ftr_regs();
+ if (err)
+ return err;
+
/*
* Allocate Hyp PGD and setup Hyp identity mapping
*/
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 9e5eacfec6ec..72cfe53f106f 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -10,7 +10,8 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
+ cache.o cpufeature.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/cache.S b/arch/arm64/kvm/hyp/nvhe/cache.S
new file mode 100644
index 000000000000..36cef6915428
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/cache.S
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Code copied from arch/arm64/mm/cache.S.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include <asm/alternative.h>
+
+SYM_FUNC_START_PI(__flush_dcache_area)
+ dcache_by_line_op civac, sy, x0, x1, x2, x3
+ ret
+SYM_FUNC_END_PI(__flush_dcache_area)
diff --git a/arch/arm64/kvm/hyp/nvhe/cpufeature.c b/arch/arm64/kvm/hyp/nvhe/cpufeature.c
new file mode 100644
index 000000000000..a887508f996f
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/cpufeature.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#define KVM_HYP_CPU_FTR_REG(id, name) struct arm64_ftr_reg name;
+#include <asm/kvm_cpufeature.h>
--
2.30.0.284.gd98b1dd5eaa7-goog
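
The trick of including <asm/kvm_cpufeature.h> several times with different definitions of KVM_HYP_CPU_FTR_REG is a form of the X-macro pattern. A single-file sketch of the same idea is shown below; it uses a list macro instead of a re-included header, and the register IDs and toy_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register list; the IDs are made up, not real encodings. */
#define TOY_FTR_REGS(X) \
	X(0x10, toy_ftr_reg_a) \
	X(0x20, toy_ftr_reg_b)

struct toy_ftr_reg {
	unsigned long sys_val;
};

/* First expansion: define one destination variable per entry. */
#define DEFINE_REG(id, name) static struct toy_ftr_reg name;
TOY_FTR_REGS(DEFINE_REG)

/* Second expansion: build the (id, destination) copy table. */
#define TABLE_ENTRY(id, name) { id, &name },
static const struct toy_copy_entry {
	unsigned int sys_id;
	struct toy_ftr_reg *dst;
} toy_copy_table[] = {
	TOY_FTR_REGS(TABLE_ENTRY)
};

/* Write val into every listed register; returns the number of entries. */
static size_t toy_copy_all(unsigned long val)
{
	size_t i, n = sizeof(toy_copy_table) / sizeof(toy_copy_table[0]);

	for (i = 0; i < n; i++) {
		assert(toy_copy_table[i].dst != NULL);
		toy_copy_table[i].dst->sys_val = val;
	}
	return n;
}
```

Adding a register then only requires one new line in the list, and both the variable definition and the copy table pick it up automatically.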

2021-01-08 12:19:40

by Quentin Perret

Subject: [RFC PATCH v2 07/26] KVM: arm64: Introduce a BSS section for use at Hyp

Currently, the hyp code cannot make full use of a bss, as the kernel
section is mapped read-only.

While this mapping could simply be changed to read-write, it would
intermingle the hyp and kernel state even more than they currently are.
Instead, introduce a __hyp_bss section that uses reserved pages, and
create the appropriate RW hyp mappings during KVM init.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/vmlinux.lds.S | 7 +++++++
arch/arm64/kvm/arm.c | 11 +++++++++++
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 1 +
4 files changed, 20 insertions(+)

diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
index 8ff579361731..f58cf493de16 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -12,6 +12,7 @@ extern char __hibernate_exit_text_start[], __hibernate_exit_text_end[];
extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
extern char __hyp_text_start[], __hyp_text_end[];
extern char __hyp_data_ro_after_init_start[], __hyp_data_ro_after_init_end[];
+extern char __hyp_bss_start[], __hyp_bss_end[];
extern char __idmap_text_start[], __idmap_text_end[];
extern char __initdata_begin[], __initdata_end[];
extern char __inittext_begin[], __inittext_end[];
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 43af13968dfd..3eca35d5a7cf 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -8,6 +8,13 @@
#define RO_EXCEPTION_TABLE_ALIGN 8
#define RUNTIME_DISCARD_EXIT

+#define BSS_FIRST_SECTIONS \
+ . = ALIGN(PAGE_SIZE); \
+ __hyp_bss_start = .; \
+ *(.hyp.bss) \
+ . = ALIGN(PAGE_SIZE); \
+ __hyp_bss_end = .;
+
#include <asm-generic/vmlinux.lds.h>
#include <asm/cache.h>
#include <asm/hyp_image.h>
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3ac0f3425833..51b53ca36dc5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1770,7 +1770,18 @@ static int init_hyp_mode(void)
goto out_err;
}

+ /*
+ * .hyp.bss is placed at the beginning of the .bss section, so map that
+ * part RW, and the rest RO as the hyp shouldn't be touching it.
+ */
err = create_hyp_mappings(kvm_ksym_ref(__bss_start),
+ kvm_ksym_ref(__hyp_bss_end), PAGE_HYP);
+ if (err) {
+ kvm_err("Cannot map hyp bss section: %d\n", err);
+ goto out_err;
+ }
+
+ err = create_hyp_mappings(kvm_ksym_ref(__hyp_bss_end),
kvm_ksym_ref(__bss_stop), PAGE_HYP_RO);
if (err) {
kvm_err("Cannot map bss section\n");
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
index 5d76ff2ba63e..dc281d90063e 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
@@ -17,4 +17,5 @@ SECTIONS {
PERCPU_INPUT(L1_CACHE_BYTES)
}
HYP_SECTION(.data..ro_after_init)
+ HYP_SECTION(.bss)
}
--
2.30.0.284.gd98b1dd5eaa7-goog
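
The page alignment in the linker script is what lets .bss be split into an RW part and an RO part, since mappings have page granularity. The arithmetic can be sketched as follows, assuming 4 KiB pages; the toy_range type and the function names are hypothetical and the addresses used with them are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed 4 KiB pages */

/* Round addr up to the next multiple of a power-of-two alignment. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

struct toy_range {
	uintptr_t start, end;	/* half-open interval [start, end) */
};

/* Split .bss at the page-aligned end of the hyp part: RW first, RO after. */
static void split_bss(uintptr_t bss_start, uintptr_t hyp_bss_raw_end,
		      uintptr_t bss_stop,
		      struct toy_range *rw, struct toy_range *ro)
{
	uintptr_t hyp_bss_end = align_up(hyp_bss_raw_end, TOY_PAGE_SIZE);

	assert(hyp_bss_end <= bss_stop);
	rw->start = bss_start;
	rw->end = hyp_bss_end;
	ro->start = hyp_bss_end;
	ro->end = bss_stop;
}
```

Because the boundary is rounded up to a page, no page ends up shared between the RW hyp bss and the RO remainder of the kernel bss.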

2021-01-08 12:19:55

by Quentin Perret

Subject: [RFC PATCH v2 06/26] KVM: arm64: Factor memory allocation out of pgtable.c

In preparation for enabling the creation of page-tables at EL2, factor
all memory allocation out of the page-table code, hence making it
re-usable with any compatible memory allocator.

No functional changes intended.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 32 +++++++++-
arch/arm64/kvm/hyp/pgtable.c | 90 +++++++++++++++++-----------
arch/arm64/kvm/mmu.c | 70 +++++++++++++++++++++-
3 files changed, 154 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 52ab38db04c7..45acc9dc6c45 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -13,17 +13,41 @@

typedef u64 kvm_pte_t;

+/**
+ * struct kvm_pgtable_mm_ops - Memory management callbacks.
+ * @zalloc_page: Allocate a zeroed memory page.
+ * @zalloc_pages_exact: Allocate an exact number of zeroed memory pages.
+ * @free_pages_exact: Free an exact number of memory pages.
+ * @get_page: Increment the refcount on a page.
+ * @put_page: Decrement the refcount on a page.
+ * @page_count: Returns the refcount of a page.
+ * @phys_to_virt: Convert a physical address into a virtual address.
+ * @virt_to_phys: Convert a virtual address into a physical address.
+ */
+struct kvm_pgtable_mm_ops {
+ void* (*zalloc_page)(void *arg);
+ void* (*zalloc_pages_exact)(size_t size);
+ void (*free_pages_exact)(void *addr, size_t size);
+ void (*get_page)(void *addr);
+ void (*put_page)(void *addr);
+ int (*page_count)(void *addr);
+ void* (*phys_to_virt)(phys_addr_t phys);
+ phys_addr_t (*virt_to_phys)(void *addr);
+};
+
/**
* struct kvm_pgtable - KVM page-table.
* @ia_bits: Maximum input address size, in bits.
* @start_level: Level at which the page-table walk starts.
* @pgd: Pointer to the first top-level entry of the page-table.
+ * @mm_ops: Memory management callbacks.
* @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
*/
struct kvm_pgtable {
u32 ia_bits;
u32 start_level;
kvm_pte_t *pgd;
+ struct kvm_pgtable_mm_ops *mm_ops;

/* Stage-2 only */
struct kvm_s2_mmu *mmu;
@@ -86,10 +110,12 @@ struct kvm_pgtable_walker {
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @va_bits: Maximum virtual address bits.
+ * @mm_ops: Memory management callbacks.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits);
+int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
+ struct kvm_pgtable_mm_ops *mm_ops);

/**
* kvm_pgtable_hyp_destroy() - Destroy an unused hypervisor stage-1 page-table.
@@ -126,10 +152,12 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
* kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @kvm: KVM structure representing the guest virtual machine.
+ * @mm_ops: Memory management callbacks.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm);
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+ struct kvm_pgtable_mm_ops *mm_ops);

/**
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d7122c5eac24..61a8a34ddfdb 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -148,9 +148,9 @@ static kvm_pte_t kvm_phys_to_pte(u64 pa)
return pte;
}

-static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
+static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte, struct kvm_pgtable_mm_ops *mm_ops)
{
- return __va(kvm_pte_to_phys(pte));
+ return mm_ops->phys_to_virt(kvm_pte_to_phys(pte));
}

static void kvm_set_invalid_pte(kvm_pte_t *ptep)
@@ -159,9 +159,10 @@ static void kvm_set_invalid_pte(kvm_pte_t *ptep)
WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
}

-static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
+static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
- kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(__pa(childp));
+ kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(mm_ops->virt_to_phys(childp));

pte |= FIELD_PREP(KVM_PTE_TYPE, KVM_PTE_TYPE_TABLE);
pte |= KVM_PTE_VALID;
@@ -229,7 +230,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
goto out;
}

- childp = kvm_pte_follow(pte);
+ childp = kvm_pte_follow(pte, data->pgt->mm_ops);
ret = __kvm_pgtable_walk(data, childp, level + 1);
if (ret)
goto out;
@@ -304,8 +305,9 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
}

struct hyp_map_data {
- u64 phys;
- kvm_pte_t attr;
+ u64 phys;
+ kvm_pte_t attr;
+ struct kvm_pgtable_mm_ops *mm_ops;
};

static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
@@ -355,6 +357,8 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
kvm_pte_t *childp;
+ struct hyp_map_data *data = arg;
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;

if (hyp_map_walker_try_leaf(addr, end, level, ptep, arg))
return 0;
@@ -362,11 +366,11 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
return -EINVAL;

- childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+ childp = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
if (!childp)
return -ENOMEM;

- kvm_set_table_pte(ptep, childp);
+ kvm_set_table_pte(ptep, childp, mm_ops);
return 0;
}

@@ -376,6 +380,7 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
int ret;
struct hyp_map_data map_data = {
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
+ .mm_ops = pgt->mm_ops,
};
struct kvm_pgtable_walker walker = {
.cb = hyp_map_walker,
@@ -393,16 +398,18 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
return ret;
}

-int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
+int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits);

- pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+ pgt->pgd = (kvm_pte_t *)mm_ops->zalloc_page(NULL);
if (!pgt->pgd)
return -ENOMEM;

pgt->ia_bits = va_bits;
pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
+ pgt->mm_ops = mm_ops;
pgt->mmu = NULL;
return 0;
}
@@ -410,7 +417,9 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
static int hyp_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag, void * const arg)
{
- put_page(virt_to_page(kvm_pte_follow(*ptep)));
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
+
+ mm_ops->put_page((void *)kvm_pte_follow(*ptep, mm_ops));
return 0;
}

@@ -419,10 +428,11 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
struct kvm_pgtable_walker walker = {
.cb = hyp_free_walker,
.flags = KVM_PGTABLE_WALK_TABLE_POST,
+ .arg = pgt->mm_ops,
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
- put_page(virt_to_page(pgt->pgd));
+ pgt->mm_ops->put_page(pgt->pgd);
pgt->pgd = NULL;
}

@@ -434,6 +444,8 @@ struct stage2_map_data {

struct kvm_s2_mmu *mmu;
struct kvm_mmu_memory_cache *memcache;
+
+ struct kvm_pgtable_mm_ops *mm_ops;
};

static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
@@ -501,12 +513,12 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct stage2_map_data *data)
{
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
kvm_pte_t *childp, pte = *ptep;
- struct page *page = virt_to_page(ptep);

if (data->anchor) {
if (kvm_pte_valid(pte))
- put_page(page);
+ mm_ops->put_page(ptep);

return 0;
}
@@ -520,7 +532,7 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (!data->memcache)
return -ENOMEM;

- childp = kvm_mmu_memory_cache_alloc(data->memcache);
+ childp = mm_ops->zalloc_page(data->memcache);
if (!childp)
return -ENOMEM;

@@ -532,13 +544,13 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
if (kvm_pte_valid(pte)) {
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
- put_page(page);
+ mm_ops->put_page(ptep);
}

- kvm_set_table_pte(ptep, childp);
+ kvm_set_table_pte(ptep, childp, mm_ops);

out_get_page:
- get_page(page);
+ mm_ops->get_page(ptep);
return 0;
}

@@ -546,13 +558,14 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
struct stage2_map_data *data)
{
+ struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
int ret = 0;

if (!data->anchor)
return 0;

- put_page(virt_to_page(kvm_pte_follow(*ptep)));
- put_page(virt_to_page(ptep));
+ mm_ops->put_page(kvm_pte_follow(*ptep, mm_ops));
+ mm_ops->put_page(ptep);

if (data->anchor == ptep) {
data->anchor = NULL;
@@ -607,6 +620,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.phys = ALIGN_DOWN(phys, PAGE_SIZE),
.mmu = pgt->mmu,
.memcache = mc,
+ .mm_ops = pgt->mm_ops,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -643,7 +657,9 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
- struct kvm_s2_mmu *mmu = arg;
+ struct kvm_pgtable *pgt = arg;
+ struct kvm_s2_mmu *mmu = pgt->mmu;
+ struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
kvm_pte_t pte = *ptep, *childp = NULL;
bool need_flush = false;

@@ -651,9 +667,9 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
return 0;

if (kvm_pte_table(pte, level)) {
- childp = kvm_pte_follow(pte);
+ childp = kvm_pte_follow(pte, mm_ops);

- if (page_count(virt_to_page(childp)) != 1)
+ if (mm_ops->page_count(childp) != 1)
return 0;
} else if (stage2_pte_cacheable(pte)) {
need_flush = true;
@@ -666,15 +682,15 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
*/
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
- put_page(virt_to_page(ptep));
+ mm_ops->put_page(ptep);

if (need_flush) {
- stage2_flush_dcache(kvm_pte_follow(pte),
+ stage2_flush_dcache(kvm_pte_follow(pte, mm_ops),
kvm_granule_size(level));
}

if (childp)
- put_page(virt_to_page(childp));
+ mm_ops->put_page(childp);

return 0;
}
@@ -683,7 +699,7 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_unmap_walker,
- .arg = pgt->mmu,
+ .arg = pgt,
.flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_TABLE_POST,
};

@@ -815,12 +831,13 @@ static int stage2_flush_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
kvm_pte_t pte = *ptep;

if (!kvm_pte_valid(pte) || !stage2_pte_cacheable(pte))
return 0;

- stage2_flush_dcache(kvm_pte_follow(pte), kvm_granule_size(level));
+ stage2_flush_dcache(kvm_pte_follow(pte, mm_ops), kvm_granule_size(level));
return 0;
}

@@ -829,6 +846,7 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
struct kvm_pgtable_walker walker = {
.cb = stage2_flush_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = pgt->mm_ops,
};

if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
@@ -837,7 +855,8 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
return kvm_pgtable_walk(pgt, addr, size, &walker);
}

-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm)
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+ struct kvm_pgtable_mm_ops *mm_ops)
{
size_t pgd_sz;
u64 vtcr = kvm->arch.vtcr;
@@ -846,12 +865,13 @@ int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm)
u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;

pgd_sz = kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
- pgt->pgd = alloc_pages_exact(pgd_sz, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ pgt->pgd = mm_ops->zalloc_pages_exact(pgd_sz);
if (!pgt->pgd)
return -ENOMEM;

pgt->ia_bits = ia_bits;
pgt->start_level = start_level;
+ pgt->mm_ops = mm_ops;
pgt->mmu = &kvm->arch.mmu;

/* Ensure zeroed PGD pages are visible to the hardware walker */
@@ -863,15 +883,16 @@ static int stage2_free_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
+ struct kvm_pgtable_mm_ops *mm_ops = arg;
kvm_pte_t pte = *ptep;

if (!kvm_pte_valid(pte))
return 0;

- put_page(virt_to_page(ptep));
+ mm_ops->put_page(ptep);

if (kvm_pte_table(pte, level))
- put_page(virt_to_page(kvm_pte_follow(pte)));
+ mm_ops->put_page(kvm_pte_follow(pte, mm_ops));

return 0;
}
@@ -883,10 +904,11 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
.cb = stage2_free_walker,
.flags = KVM_PGTABLE_WALK_LEAF |
KVM_PGTABLE_WALK_TABLE_POST,
+ .arg = pgt->mm_ops,
};

WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
- free_pages_exact(pgt->pgd, pgd_sz);
+ pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
pgt->pgd = NULL;
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1f41173e6149..278e163beda4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -88,6 +88,48 @@ static bool kvm_is_device_pfn(unsigned long pfn)
return !pfn_valid(pfn);
}

+static void *stage2_memcache_alloc_page(void *arg)
+{
+ struct kvm_mmu_memory_cache *mc = arg;
+ kvm_pte_t *ptep = NULL;
+
+ /* Allocated with __GFP_ZERO, so no need to zero */
+ if (mc && mc->nobjs)
+ ptep = mc->objects[--mc->nobjs];
+
+ return ptep;
+}
+
+static void *kvm_host_zalloc_pages_exact(size_t size)
+{
+ return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+}
+
+static void kvm_host_get_page(void *addr)
+{
+ get_page(virt_to_page(addr));
+}
+
+static void kvm_host_put_page(void *addr)
+{
+ put_page(virt_to_page(addr));
+}
+
+static int kvm_host_page_count(void *addr)
+{
+ return page_count(virt_to_page(addr));
+}
+
+static phys_addr_t kvm_host_pa(void *addr)
+{
+ return __pa(addr);
+}
+
+static void *kvm_host_va(phys_addr_t phys)
+{
+ return __va(phys);
+}
+
/*
* Unmapping vs dcache management:
*
@@ -351,6 +393,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
return 0;
}

+static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
+ .zalloc_page = stage2_memcache_alloc_page,
+ .zalloc_pages_exact = kvm_host_zalloc_pages_exact,
+ .free_pages_exact = free_pages_exact,
+ .get_page = kvm_host_get_page,
+ .put_page = kvm_host_put_page,
+ .page_count = kvm_host_page_count,
+ .phys_to_virt = kvm_host_va,
+ .virt_to_phys = kvm_host_pa,
+};
+
/**
* kvm_init_stage2_mmu - Initialise a S2 MMU strucrure
* @kvm: The pointer to the KVM structure
@@ -374,7 +427,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
if (!pgt)
return -ENOMEM;

- err = kvm_pgtable_stage2_init(pgt, kvm);
+ err = kvm_pgtable_stage2_init(pgt, kvm, &kvm_s2_mm_ops);
if (err)
goto out_free_pgtable;

@@ -1198,6 +1251,19 @@ static int kvm_map_idmap_text(void)
return err;
}

+static void *kvm_hyp_zalloc_page(void *arg)
+{
+ return (void *)get_zeroed_page(GFP_KERNEL);
+}
+
+static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
+ .zalloc_page = kvm_hyp_zalloc_page,
+ .get_page = kvm_host_get_page,
+ .put_page = kvm_host_put_page,
+ .phys_to_virt = kvm_host_va,
+ .virt_to_phys = kvm_host_pa,
+};
+
int kvm_mmu_init(void)
{
int err;
@@ -1241,7 +1307,7 @@ int kvm_mmu_init(void)
goto out;
}

- err = kvm_pgtable_hyp_init(hyp_pgtable, hyp_va_bits);
+ err = kvm_pgtable_hyp_init(hyp_pgtable, hyp_va_bits, &kvm_hyp_mm_ops);
if (err)
goto out_free_pgtable;

--
2.30.0.284.gd98b1dd5eaa7-goog
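
For readers skimming the diff above, the core idea is that the page-table code no longer calls the kernel allocator directly and instead goes through a table of callbacks stored in the page-table structure. The following is only a minimal user-space sketch of that pattern. All `demo_*` names are hypothetical, the address conversions are identity functions, and `put_page` frees immediately instead of dropping a reference as the kernel helper does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical, cut-down stand-in for struct kvm_pgtable_mm_ops. */
struct demo_mm_ops {
	void *(*zalloc_page)(void *arg);
	void (*put_page)(void *addr);
	uintptr_t (*virt_to_phys)(void *addr);
	void *(*phys_to_virt)(uintptr_t phys);
};

/* Hypothetical stand-in for struct kvm_pgtable. */
struct demo_pgtable {
	uint64_t *pgd;
	struct demo_mm_ops *mm_ops;
};

/* Backend: plain libc allocation, zeroed like get_zeroed_page(). */
static void *demo_zalloc_page(void *arg)
{
	(void)arg;
	return calloc(1, DEMO_PAGE_SIZE);
}

/* Simplification: no refcount, the page is released right away. */
static void demo_put_page(void *addr)
{
	free(addr);
}

/* Simplification: pretend virtual and physical addresses are identical. */
static uintptr_t demo_virt_to_phys(void *addr)
{
	return (uintptr_t)addr;
}

static void *demo_phys_to_virt(uintptr_t phys)
{
	return (void *)phys;
}

static struct demo_mm_ops demo_ops = {
	.zalloc_page	= demo_zalloc_page,
	.put_page	= demo_put_page,
	.virt_to_phys	= demo_virt_to_phys,
	.phys_to_virt	= demo_phys_to_virt,
};

/* The generic code only ever touches memory through mm_ops. */
static int demo_pgtable_init(struct demo_pgtable *pgt,
			     struct demo_mm_ops *mm_ops)
{
	pgt->pgd = mm_ops->zalloc_page(NULL);
	if (!pgt->pgd)
		return -1;

	pgt->mm_ops = mm_ops;
	return 0;
}

static void demo_pgtable_destroy(struct demo_pgtable *pgt)
{
	pgt->mm_ops->put_page(pgt->pgd);
	pgt->pgd = NULL;
}
```

Swapping in a different `struct demo_mm_ops` changes where pages come from without touching `demo_pgtable_init()`, which is the property the patch relies on to reuse pgtable.c at EL2.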

2021-01-08 12:20:09

by Quentin Perret

Subject: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

Introduce early_init_dt_add_memory_hyp() to allow KVM to keep a copy
of the memory regions parsed from DT. This will be needed in the context
of the protected nVHE feature of KVM/arm64 where the code running at EL2
will be cleanly separated from the host kernel during boot, and will
need its own representation of memory.
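
The mechanism is a weak default that does nothing and can be overridden at link time. A minimal sketch of that pattern, assuming a GCC- or Clang-compatible compiler for `__attribute__((weak))`; the `demo_*` names are hypothetical and a simple counter stands in for the memblock state:

```c
#include <stdint.h>

/* Stands in for the memblock bookkeeping done by memblock_add(). */
static uint64_t demo_total_added;

/*
 * Default hook: a no-op. Another object file may provide a strong
 * definition with the same name, which the linker then prefers.
 */
__attribute__((weak)) void demo_add_memory_hyp(uint64_t base, uint64_t size)
{
	(void)base;
	(void)size;
}

/* Every region registered with the kernel is also passed to the hook. */
void demo_add_memory_arch(uint64_t base, uint64_t size)
{
	demo_total_added += size;
	demo_add_memory_hyp(base, size);
}
```

With no override linked in, calling `demo_add_memory_arch()` only updates the counter, so configurations that do not need the hook pay nothing beyond an empty call.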

Signed-off-by: Quentin Perret <[email protected]>
---
drivers/of/fdt.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 4602e467ca8b..af2b5a09c5b4 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1099,6 +1099,10 @@ int __init early_init_dt_scan_chosen(unsigned long node, const char *uname,
#define MAX_MEMBLOCK_ADDR ((phys_addr_t)~0)
#endif

+void __init __weak early_init_dt_add_memory_hyp(u64 base, u64 size)
+{
+}
+
void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
{
const u64 phys_offset = MIN_MEMBLOCK_ADDR;
@@ -1139,6 +1143,7 @@ void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
base = phys_offset;
}
memblock_add(base, size);
+ early_init_dt_add_memory_hyp(base, size);
}

int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-08 12:20:10

by Quentin Perret

Subject: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

When memory protection is enabled, the Hyp code needs the ability to
create and manage its own page-table. To do so, introduce a new set of
hypercalls to initialize Hyp memory protection.

During the init hcall, the hypervisor runs with the host-provided
page-table and uses the trivial early page allocator to create its own
set of page-tables, using a memory pool that was donated by the host.
Specifically, the hypervisor creates its own mappings for __hyp_text,
the Hyp memory pool, the __hyp_bss, the portion of hyp_vmemmap
corresponding to the Hyp pool, among other things. It then jumps back into
the idmap page, switches to the newly-created pgd (instead of the
temporary one provided by the host) and installs the full-fledged
buddy allocator, which is the only one in use from then on.
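
To illustrate what a "trivial early page allocator" can look like, here is a user-space sketch of a bump allocator over a donated pool. It is not the implementation from this series: the `demo_*` names are hypothetical, and the only assumptions are that pages are handed out contiguously, zeroed, and never freed.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL

static unsigned char *demo_base, *demo_cur, *demo_end;

/* Take ownership of a donated pool of 'size' bytes starting at 'virt'. */
static void demo_early_alloc_init(void *virt, unsigned long size)
{
	demo_base = demo_cur = virt;
	demo_end = demo_base + size;
}

/* Hand out nr_pages contiguous zeroed pages, or NULL if the pool is short. */
static void *demo_early_alloc_contig(unsigned long nr_pages)
{
	unsigned long avail = (unsigned long)(demo_end - demo_cur) / DEMO_PAGE_SIZE;
	void *ret = demo_cur;

	if (!nr_pages || nr_pages > avail)
		return NULL;

	demo_cur += nr_pages * DEMO_PAGE_SIZE;
	memset(ret, 0, nr_pages * DEMO_PAGE_SIZE);
	return ret;
}

/* Number of pages consumed so far; there is no way to give pages back. */
static unsigned long demo_early_alloc_nr_pages(void)
{
	return (unsigned long)(demo_cur - demo_base) / DEMO_PAGE_SIZE;
}
```

The page count matters for the hand-over: an allocator that takes over the same pool later needs to know how many pages at the start are already in use.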

Note that for the sake of simplifying the review, this patch only
introduces the code doing this operation; nothing calls it yet. The
callers are added in a subsequent patch, which introduces the
necessary host kernel changes.
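
One piece of the code introduced here is the sizing of the page-table part of the pool, done by __hyp_pgtable_max_pages() in nvhe/mm.h. The sketch below restates that worst-case computation in standalone form so the arithmetic can be checked by hand. It assumes 4 KiB pages with 512 entries per table; the `demo_*` names are hypothetical.

```c
/* Assumption: 4 KiB pages and 8-byte descriptors, so 512 entries per table. */
#define DEMO_PTRS_PER_PTE 512UL

static unsigned long demo_div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Worst case number of page-table pages needed to map nr_pages pages
 * with 4 levels: each level needs one table per 512 entries of the
 * level below it.
 */
static unsigned long demo_pgtable_max_pages(unsigned long nr_pages)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < 4; i++) {
		nr_pages = demo_div_round_up(nr_pages, DEMO_PTRS_PER_PTE);
		total += nr_pages;
	}

	return total;
}
```

For example, 1 GiB is 262144 pages of 4 KiB, which gives 512 + 1 + 1 + 1 = 515 page-table pages, a little over 2 MiB.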

Credits to Will for __pkvm_init_switch_pgd.

Co-authored-by: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 4 +
arch/arm64/include/asm/kvm_host.h | 8 +
arch/arm64/include/asm/kvm_hyp.h | 8 +
arch/arm64/kernel/image-vars.h | 19 +++
arch/arm64/kvm/hyp/Makefile | 2 +-
arch/arm64/kvm/hyp/include/nvhe/memory.h | 6 +
arch/arm64/kvm/hyp/include/nvhe/mm.h | 79 +++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 4 +-
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 31 ++++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 42 +++++
arch/arm64/kvm/hyp/nvhe/mm.c | 174 ++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 196 +++++++++++++++++++++++
arch/arm64/kvm/hyp/reserved_mem.c | 102 ++++++++++++
arch/arm64/kvm/mmu.c | 2 +-
arch/arm64/mm/init.c | 3 +
15 files changed, 676 insertions(+), 4 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 7ccf770c53d9..4fc27ac08836 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -57,6 +57,10 @@
#define __KVM_HOST_SMCCC_FUNC___kvm_get_mdcr_el2 12
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
+#define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
+#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
+#define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
+#define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18

#ifndef __ASSEMBLY__

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 81212958ef55..9a2feb83eea0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -777,4 +777,12 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))

+#ifdef CONFIG_KVM
+extern phys_addr_t hyp_mem_base;
+extern phys_addr_t hyp_mem_size;
+void __init kvm_hyp_reserve(void);
+#else
+static inline void kvm_hyp_reserve(void) { }
+#endif
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index c0450828378b..a0e113734b20 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -100,4 +100,12 @@ void __noreturn hyp_panic(void);
void __noreturn __hyp_do_panic(bool restore_host, u64 spsr, u64 elr, u64 par);
#endif

+#ifdef __KVM_NVHE_HYPERVISOR__
+void __pkvm_init_switch_pgd(phys_addr_t phys, unsigned long size,
+ phys_addr_t pgd, void *sp, void *cont_fn);
+int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
+ unsigned long *per_cpu_base);
+void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);
+#endif
+
#endif /* __ARM64_KVM_HYP_H__ */
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 43f3a1d6e92d..366d837f0d39 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -113,6 +113,25 @@ KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
#endif

+/* Hypervisor VA size */
+KVM_NVHE_ALIAS(hyp_va_bits);
+
+/* Kernel memory sections */
+KVM_NVHE_ALIAS(__start_rodata);
+KVM_NVHE_ALIAS(__end_rodata);
+KVM_NVHE_ALIAS(__bss_start);
+KVM_NVHE_ALIAS(__bss_stop);
+
+/* Hyp memory sections */
+KVM_NVHE_ALIAS(__hyp_idmap_text_start);
+KVM_NVHE_ALIAS(__hyp_idmap_text_end);
+KVM_NVHE_ALIAS(__hyp_text_start);
+KVM_NVHE_ALIAS(__hyp_text_end);
+KVM_NVHE_ALIAS(__hyp_data_ro_after_init_start);
+KVM_NVHE_ALIAS(__hyp_data_ro_after_init_end);
+KVM_NVHE_ALIAS(__hyp_bss_start);
+KVM_NVHE_ALIAS(__hyp_bss_end);
+
#endif /* CONFIG_KVM */

#endif /* __ARM64_KERNEL_IMAGE_VARS_H */
diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
index 687598e41b21..b726332eec49 100644
--- a/arch/arm64/kvm/hyp/Makefile
+++ b/arch/arm64/kvm/hyp/Makefile
@@ -10,4 +10,4 @@ subdir-ccflags-y := -I$(incdir) \
-DDISABLE_BRANCH_PROFILING \
$(DISABLE_STACKLEAK_PLUGIN)

-obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o
+obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o reserved_mem.o
diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
index ed47674bc988..c8af6fe87bfb 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
@@ -6,6 +6,12 @@

#include <linux/types.h>

+#define HYP_MEMBLOCK_REGIONS 128
+struct hyp_memblock_region {
+ phys_addr_t start;
+ phys_addr_t end;
+};
+
struct hyp_pool;
struct hyp_page {
unsigned int refcount;
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
new file mode 100644
index 000000000000..f0cc09b127a5
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_MM_H
+#define __KVM_HYP_MM_H
+
+#include <asm/kvm_pgtable.h>
+#include <asm/spectre.h>
+#include <linux/types.h>
+
+#include <nvhe/memory.h>
+#include <nvhe/spinlock.h>
+
+extern struct hyp_memblock_region kvm_nvhe_sym(hyp_memory)[];
+extern int kvm_nvhe_sym(hyp_memblock_nr);
+extern struct kvm_pgtable pkvm_pgtable;
+extern hyp_spinlock_t pkvm_pgd_lock;
+extern struct hyp_pool hpool;
+extern u64 __io_map_base;
+extern u32 hyp_va_bits;
+
+int hyp_create_idmap(void);
+int hyp_map_vectors(void);
+int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
+int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
+int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
+int __pkvm_create_mappings(unsigned long start, unsigned long size,
+ unsigned long phys, unsigned long prot);
+unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
+ unsigned long prot);
+
+static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
+ unsigned long *start, unsigned long *end)
+{
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct hyp_page *p = hyp_phys_to_page(phys);
+
+ *start = (unsigned long)p;
+ *end = *start + nr_pages * sizeof(struct hyp_page);
+ *start = ALIGN_DOWN(*start, PAGE_SIZE);
+ *end = ALIGN(*end, PAGE_SIZE);
+}
+
+static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
+{
+ unsigned long total = 0, i;
+
+ /* Provision the worst case scenario with 4 levels of page-table */
+ for (i = 0; i < 4; i++) {
+ nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
+ total += nr_pages;
+ }
+
+ return total;
+}
+
+static inline unsigned long hyp_s1_pgtable_size(void)
+{
+ struct hyp_memblock_region *reg;
+ unsigned long nr_pages, res = 0;
+ int i;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
+ return 0;
+
+ for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
+ reg = &kvm_nvhe_sym(hyp_memory)[i];
+ nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
+ nr_pages = __hyp_pgtable_max_pages(nr_pages);
+ res += nr_pages << PAGE_SHIFT;
+ }
+
+ /* Allow 1 GiB for private mappings */
+ nr_pages = (1 << 30) >> PAGE_SHIFT;
+ nr_pages = __hyp_pgtable_max_pages(nr_pages);
+ res += nr_pages << PAGE_SHIFT;
+
+ return res;
+}
+
+#endif /* __KVM_HYP_MM_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 72cfe53f106f..d7381a503182 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
- cache.o cpufeature.o
+ cache.o cpufeature.o setup.o mm.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
- ../fpsimd.o ../hyp-entry.o ../exception.o
+ ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
obj-y += $(lib-objs)

##
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
index 31b060a44045..ad943966c39f 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
@@ -251,4 +251,35 @@ alternative_else_nop_endif

SYM_CODE_END(__kvm_handle_stub_hvc)

+SYM_FUNC_START(__pkvm_init_switch_pgd)
+ /* Turn the MMU off */
+ pre_disable_mmu_workaround
+ mrs x2, sctlr_el2
+ bic x3, x2, #SCTLR_ELx_M
+ msr sctlr_el2, x3
+ isb
+
+ tlbi alle2
+
+ /* Install the new pgtables */
+ ldr x3, [x0, #NVHE_INIT_PGD_PA]
+ phys_to_ttbr x4, x3
+alternative_if ARM64_HAS_CNP
+ orr x4, x4, #TTBR_CNP_BIT
+alternative_else_nop_endif
+ msr ttbr0_el2, x4
+
+ /* Set the new stack pointer */
+ ldr x0, [x0, #NVHE_INIT_STACK_HYP_VA]
+ mov sp, x0
+
+ /* And turn the MMU back on! */
+ dsb nsh
+ isb
+ msr sctlr_el2, x2
+ ic iallu
+ isb
+ ret x1
+SYM_FUNC_END(__pkvm_init_switch_pgd)
+
.popsection
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index a906f9e2ff34..3075f117651c 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -6,12 +6,14 @@

#include <hyp/switch.h>

+#include <asm/pgtable-types.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_host.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>

+#include <nvhe/mm.h>
#include <nvhe/trap_handler.h>

DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
@@ -106,6 +108,42 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
__vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
}

+static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
+ DECLARE_REG(unsigned long, size, host_ctxt, 2);
+ DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
+ DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
+}
+
+static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
+
+ cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
+}
+
+static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(unsigned long, start, host_ctxt, 1);
+ DECLARE_REG(unsigned long, size, host_ctxt, 2);
+ DECLARE_REG(unsigned long, phys, host_ctxt, 3);
+ DECLARE_REG(unsigned long, prot, host_ctxt, 4);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
+}
+
+static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
+ DECLARE_REG(size_t, size, host_ctxt, 2);
+ DECLARE_REG(unsigned long, prot, host_ctxt, 3);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);

#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
@@ -125,6 +163,10 @@ static const hcall_t *host_hcall[] = {
HANDLE_FUNC(__kvm_get_mdcr_el2),
HANDLE_FUNC(__vgic_v3_save_aprs),
HANDLE_FUNC(__vgic_v3_restore_aprs),
+ HANDLE_FUNC(__pkvm_init),
+ HANDLE_FUNC(__pkvm_cpu_set_vector),
+ HANDLE_FUNC(__pkvm_create_mappings),
+ HANDLE_FUNC(__pkvm_create_private_mapping),
};

static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
new file mode 100644
index 000000000000..f3481646a94e
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/spectre.h>
+
+#include <nvhe/early_alloc.h>
+#include <nvhe/gfp.h>
+#include <nvhe/memory.h>
+#include <nvhe/mm.h>
+#include <nvhe/spinlock.h>
+
+struct kvm_pgtable pkvm_pgtable;
+hyp_spinlock_t pkvm_pgd_lock;
+u64 __io_map_base;
+
+struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
+int hyp_memblock_nr;
+
+int __pkvm_create_mappings(unsigned long start, unsigned long size,
+ unsigned long phys, unsigned long prot)
+{
+ int err;
+
+ hyp_spin_lock(&pkvm_pgd_lock);
+ err = kvm_pgtable_hyp_map(&pkvm_pgtable, start, size, phys, prot);
+ hyp_spin_unlock(&pkvm_pgd_lock);
+
+ return err;
+}
+
+unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
+ unsigned long prot)
+{
+ unsigned long addr;
+ int ret;
+
+ hyp_spin_lock(&pkvm_pgd_lock);
+
+ size = PAGE_ALIGN(size + offset_in_page(phys));
+ addr = __io_map_base;
+ __io_map_base += size;
+
+ /* Are we overflowing into the vmemmap? */
+ if (__io_map_base > __hyp_vmemmap) {
+ __io_map_base -= size;
+ addr = 0;
+ goto out;
+ }
+
+ ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
+ if (ret) {
+ addr = 0;
+ goto out;
+ }
+
+ addr = addr + offset_in_page(phys);
+out:
+ hyp_spin_unlock(&pkvm_pgd_lock);
+
+ return addr;
+}
+
+int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
+{
+ unsigned long start = (unsigned long)from;
+ unsigned long end = (unsigned long)to;
+ unsigned long virt_addr;
+ phys_addr_t phys;
+
+ start = start & PAGE_MASK;
+ end = PAGE_ALIGN(end);
+
+ for (virt_addr = start; virt_addr < end; virt_addr += PAGE_SIZE) {
+ int err;
+
+ phys = hyp_virt_to_phys((void *)virt_addr);
+ err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, prot);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
+{
+ unsigned long start, end;
+
+ hyp_vmemmap_range(phys, size, &start, &end);
+
+ return __pkvm_create_mappings(start, end - start, back, PAGE_HYP);
+}
+
+static void *__hyp_bp_vect_base;
+int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot)
+{
+ void *vector;
+
+ switch (slot) {
+ case HYP_VECTOR_DIRECT: {
+ vector = hyp_symbol_addr(__kvm_hyp_vector);
+ break;
+ }
+ case HYP_VECTOR_SPECTRE_DIRECT: {
+ vector = hyp_symbol_addr(__bp_harden_hyp_vecs);
+ break;
+ }
+ case HYP_VECTOR_INDIRECT:
+ case HYP_VECTOR_SPECTRE_INDIRECT: {
+ vector = (void *)__hyp_bp_vect_base;
+ break;
+ }
+ default:
+ return -EINVAL;
+ }
+
+ vector = __kvm_vector_slot2addr(vector, slot);
+ *this_cpu_ptr(&kvm_hyp_vector) = (unsigned long)vector;
+
+ return 0;
+}
+
+int hyp_map_vectors(void)
+{
+ unsigned long bp_base;
+
+ if (!cpus_have_const_cap(ARM64_SPECTRE_V3A))
+ return 0;
+
+ bp_base = (unsigned long)hyp_symbol_addr(__bp_harden_hyp_vecs);
+ bp_base = __hyp_pa(bp_base);
+ bp_base = __pkvm_create_private_mapping(bp_base, __BP_HARDEN_HYP_VECS_SZ,
+ PAGE_HYP_EXEC);
+ if (!bp_base)
+ return -1;
+
+ __hyp_bp_vect_base = (void *)bp_base;
+
+ return 0;
+}
+
+int hyp_create_idmap(void)
+{
+ unsigned long start, end;
+
+ start = (unsigned long)hyp_symbol_addr(__hyp_idmap_text_start);
+ start = hyp_virt_to_phys((void *)start);
+ start = ALIGN_DOWN(start, PAGE_SIZE);
+
+ end = (unsigned long)hyp_symbol_addr(__hyp_idmap_text_end);
+ end = hyp_virt_to_phys((void *)end);
+ end = ALIGN(end, PAGE_SIZE);
+
+ /*
+ * One half of the VA space is reserved to linearly map portions of
+ * memory -- see va_layout.c for more details. The other half of the VA
+ * space contains the trampoline page, and needs some care. Split that
+ * second half in two and find the quarter of VA space not conflicting
+ * with the idmap to place the IOs and the vmemmap. IOs use the lower
+ * half of the quarter and the vmemmap the upper half.
+ */
+ __io_map_base = start & BIT(hyp_va_bits - 2);
+ __io_map_base ^= BIT(hyp_va_bits - 2);
+ __hyp_vmemmap = __io_map_base | BIT(hyp_va_bits - 3);
+
+ return __pkvm_create_mappings(start, end - start, start, PAGE_HYP_EXEC);
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
new file mode 100644
index 000000000000..6d1faede86ae
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pgtable.h>
+
+#include <nvhe/early_alloc.h>
+#include <nvhe/gfp.h>
+#include <nvhe/memory.h>
+#include <nvhe/mm.h>
+
+struct hyp_pool hpool;
+struct kvm_pgtable_mm_ops pkvm_pgtable_mm_ops;
+unsigned long hyp_nr_cpus;
+
+#define hyp_percpu_size ((unsigned long)__per_cpu_end - \
+ (unsigned long)__per_cpu_start)
+
+static void *stacks_base;
+static void *vmemmap_base;
+static void *hyp_pgt_base;
+
+static int divide_memory_pool(void *virt, unsigned long size)
+{
+ unsigned long vstart, vend, nr_pages;
+
+ hyp_early_alloc_init(virt, size);
+
+ stacks_base = hyp_early_alloc_contig(hyp_nr_cpus);
+ if (!stacks_base)
+ return -ENOMEM;
+
+ hyp_vmemmap_range(__hyp_pa(virt), size, &vstart, &vend);
+ nr_pages = (vend - vstart) >> PAGE_SHIFT;
+ vmemmap_base = hyp_early_alloc_contig(nr_pages);
+ if (!vmemmap_base)
+ return -ENOMEM;
+
+ nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
+ hyp_pgt_base = hyp_early_alloc_contig(nr_pages);
+ if (!hyp_pgt_base)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
+ unsigned long *per_cpu_base)
+{
+ void *start, *end, *virt = hyp_phys_to_virt(phys);
+ int ret, i;
+
+ /* Recreate the hyp page-table using the early page allocator */
+ hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
+ ret = kvm_pgtable_hyp_init(&pkvm_pgtable, hyp_va_bits,
+ &hyp_early_alloc_mm_ops);
+ if (ret)
+ return ret;
+
+ ret = hyp_create_idmap();
+ if (ret)
+ return ret;
+
+ ret = hyp_map_vectors();
+ if (ret)
+ return ret;
+
+ ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_text_start),
+ hyp_symbol_addr(__hyp_text_end),
+ PAGE_HYP_EXEC);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(hyp_symbol_addr(__start_rodata),
+ hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
+ hyp_symbol_addr(__hyp_data_ro_after_init_end),
+ PAGE_HYP_RO);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(hyp_symbol_addr(__bss_start),
+ hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_bss_end),
+ hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(virt, virt + size - 1, PAGE_HYP);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < hyp_nr_cpus; i++) {
+ start = (void *)kern_hyp_va(per_cpu_base[i]);
+ end = start + PAGE_ALIGN(hyp_percpu_size);
+ ret = pkvm_create_mappings(start, end, PAGE_HYP);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static void update_nvhe_init_params(void)
+{
+ struct kvm_nvhe_init_params *params;
+ unsigned long i, stack;
+
+ for (i = 0; i < hyp_nr_cpus; i++) {
+ stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
+ params = per_cpu_ptr(&kvm_init_params, i);
+ params->stack_hyp_va = stack + PAGE_SIZE;
+ params->pgd_pa = __hyp_pa(pkvm_pgtable.pgd);
+ __flush_dcache_area(params, sizeof(*params));
+ }
+}
+
+static void *hyp_zalloc_hyp_page(void *arg)
+{
+ return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
+}
+
+void __noreturn __pkvm_init_finalise(void)
+{
+ struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
+ struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
+ unsigned long nr_pages, used_pages;
+ int ret;
+
+ /* Now that the vmemmap is backed, install the full-fledged allocator */
+ nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
+ used_pages = hyp_early_alloc_nr_pages();
+ ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
+ if (ret)
+ goto out;
+
+ pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
+ pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
+ pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
+ pkvm_pgtable_mm_ops.get_page = hyp_get_page;
+ pkvm_pgtable_mm_ops.put_page = hyp_put_page;
+ pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
+
+out:
+ host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
+ host_ctxt->regs.regs[1] = ret;
+
+ __host_enter(host_ctxt);
+}
+
+int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
+ unsigned long *per_cpu_base)
+{
+ struct kvm_nvhe_init_params *params;
+ void *virt = hyp_phys_to_virt(phys);
+ void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
+ int ret;
+
+ if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
+ return -EINVAL;
+
+ hyp_spin_lock_init(&pkvm_pgd_lock);
+ hyp_nr_cpus = nr_cpus;
+
+ ret = divide_memory_pool(virt, size);
+ if (ret)
+ return ret;
+
+ ret = recreate_hyp_mappings(phys, size, per_cpu_base);
+ if (ret)
+ return ret;
+
+ update_nvhe_init_params();
+
+ /* Jump in the idmap page to switch to the new page-tables */
+ params = this_cpu_ptr(&kvm_init_params);
+ fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__pkvm_init_switch_pgd));
+ fn(__hyp_pa(params), hyp_symbol_addr(__pkvm_init_finalise));
+
+ unreachable();
+}
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
new file mode 100644
index 000000000000..32f648992835
--- /dev/null
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 - Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/memblock.h>
+#include <linux/sort.h>
+
+#include <asm/kvm_host.h>
+
+#include <nvhe/memory.h>
+#include <nvhe/mm.h>
+
+phys_addr_t hyp_mem_base;
+phys_addr_t hyp_mem_size;
+
+int __init early_init_dt_add_memory_hyp(u64 base, u64 size)
+{
+ struct hyp_memblock_region *reg;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
+ kvm_nvhe_sym(hyp_memblock_nr) = -1;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
+ return -ENOMEM;
+
+ reg = kvm_nvhe_sym(hyp_memory);
+ reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
+ reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
+ kvm_nvhe_sym(hyp_memblock_nr)++;
+
+ return 0;
+}
+
+static int cmp_hyp_memblock(const void *p1, const void *p2)
+{
+ const struct hyp_memblock_region *r1 = p1;
+ const struct hyp_memblock_region *r2 = p2;
+
+ return r1->start < r2->start ? -1 : (r1->start > r2->start);
+}
+
+static void __init sort_memblock_regions(void)
+{
+ sort(kvm_nvhe_sym(hyp_memory),
+ kvm_nvhe_sym(hyp_memblock_nr),
+ sizeof(struct hyp_memblock_region),
+ cmp_hyp_memblock,
+ NULL);
+}
+
+void __init kvm_hyp_reserve(void)
+{
+ u64 nr_pages, prev;
+
+ if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
+ return;
+
+ if (kvm_get_mode() != KVM_MODE_PROTECTED)
+ return;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
+ kvm_err("Failed to register hyp memblocks\n");
+ return;
+ }
+
+ sort_memblock_regions();
+
+ /*
+ * We don't know the number of possible CPUs yet, so allocate for the
+ * worst case.
+ */
+ hyp_mem_size += NR_CPUS << PAGE_SHIFT;
+ hyp_mem_size += hyp_s1_pgtable_size();
+
+ /*
+ * The hyp_vmemmap needs to be backed by pages, but these pages
+ * themselves need to be present in the vmemmap, so compute the number
+ * of pages needed by looking for a fixed point.
+ */
+ nr_pages = 0;
+ do {
+ prev = nr_pages;
+ nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
+ nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
+ nr_pages += __hyp_pgtable_max_pages(nr_pages);
+ } while (nr_pages != prev);
+ hyp_mem_size += nr_pages << PAGE_SHIFT;
+
+ hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
+ hyp_mem_size, SZ_2M);
+ if (!hyp_mem_base) {
+ kvm_err("Failed to reserve hyp memory\n");
+ return;
+ }
+ memblock_reserve(hyp_mem_base, hyp_mem_size);
+
+ kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
+ hyp_mem_base);
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 278e163beda4..3cf9397dabdb 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
.virt_to_phys = kvm_host_pa,
};

+u32 hyp_va_bits;
int kvm_mmu_init(void)
{
int err;
- u32 hyp_va_bits;

hyp_idmap_start = __pa_symbol(__hyp_idmap_text_start);
hyp_idmap_start = ALIGN_DOWN(hyp_idmap_start, PAGE_SIZE);
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..903ad0b0476c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -34,6 +34,7 @@
#include <asm/fixmap.h>
#include <asm/kasan.h>
#include <asm/kernel-pgtable.h>
+#include <asm/kvm_host.h>
#include <asm/memory.h>
#include <asm/numa.h>
#include <asm/sections.h>
@@ -420,6 +421,8 @@ void __init bootmem_init(void)

dma_pernuma_cma_reserve();

+ kvm_hyp_reserve();
+
/*
* sparse_init() tries to allocate memory from memblock, so must be
* done after the fixed reservations
--
2.30.0.284.gd98b1dd5eaa7-goog
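
The fixed-point loop in kvm_hyp_reserve() above can be reproduced in userspace to see how quickly it settles. The following is only a sketch: the 4 KiB page size, the 4-level walk, the 32-byte stand-in for sizeof(struct hyp_page), and every sketch_* name are assumptions made for illustration, not values taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* All constants below are assumptions for illustration only. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_PAGE_SIZE      (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_PTRS_PER_PTE   512ULL
#define SKETCH_PGTABLE_LEVELS 4
#define SKETCH_HYP_PAGE_BYTES 32ULL /* stand-in for sizeof(struct hyp_page) */

static uint64_t sketch_div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Worst-case number of page-table pages needed to map nr_pages pages. */
static uint64_t sketch_pgtable_max_pages(uint64_t nr_pages)
{
	uint64_t total = 0;

	for (int i = 0; i < SKETCH_PGTABLE_LEVELS; i++) {
		nr_pages = sketch_div_round_up(nr_pages, SKETCH_PTRS_PER_PTE);
		total += nr_pages;
	}
	return total;
}

/*
 * One iteration: pages needed to back a vmemmap that describes the pool
 * plus the 'prev' pages already set aside for the vmemmap itself.
 */
static uint64_t sketch_vmemmap_step(uint64_t pool_pages, uint64_t prev)
{
	uint64_t nr_pages = pool_pages + prev;

	nr_pages = sketch_div_round_up(nr_pages * SKETCH_HYP_PAGE_BYTES,
				       SKETCH_PAGE_SIZE);
	return nr_pages + sketch_pgtable_max_pages(nr_pages);
}

/* Iterate until the estimate stops changing, as the patch does. */
static uint64_t sketch_vmemmap_pages(uint64_t pool_bytes)
{
	uint64_t pool_pages = pool_bytes >> SKETCH_PAGE_SHIFT;
	uint64_t nr_pages = 0, prev;

	do {
		prev = nr_pages;
		nr_pages = sketch_vmemmap_step(pool_pages, prev);
	} while (nr_pages != prev);

	return nr_pages;
}
```

Under these assumptions a 16 MiB pool settles after three iterations. The estimate never decreases from one iteration to the next, and each extra vmemmap page only adds a small fraction of a page to the next estimate, so the loop terminates.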

2021-01-08 12:20:07

by Quentin Perret

Subject: [RFC PATCH v2 23/26] KVM: arm64: Refactor __populate_fault_info()

Refactor __populate_fault_info() to introduce __get_fault_info() which
will be used once the host is wrapped in a stage 2.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +++++++++++++++----------
1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 84473574c2e7..e9005255d639 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -157,19 +157,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
return true;
}

-static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
+static inline bool __get_fault_info(u64 esr, u64 *far, u64 *hpfar)
{
- u8 ec;
- u64 esr;
- u64 hpfar, far;
-
- esr = vcpu->arch.fault.esr_el2;
- ec = ESR_ELx_EC(esr);
-
- if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
- return true;
-
- far = read_sysreg_el2(SYS_FAR);
+ *far = read_sysreg_el2(SYS_FAR);

/*
* The HPFAR can be invalid if the stage 2 fault did not
@@ -185,12 +175,30 @@ static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
if (!(esr & ESR_ELx_S1PTW) &&
(cpus_have_final_cap(ARM64_WORKAROUND_834220) ||
(esr & ESR_ELx_FSC_TYPE) == FSC_PERM)) {
- if (!__translate_far_to_hpfar(far, &hpfar))
+ if (!__translate_far_to_hpfar(*far, hpfar))
return false;
} else {
- hpfar = read_sysreg(hpfar_el2);
+ *hpfar = read_sysreg(hpfar_el2);
}

+ return true;
+}
+
+static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
+{
+ u8 ec;
+ u64 esr;
+ u64 hpfar, far;
+
+ esr = vcpu->arch.fault.esr_el2;
+ ec = ESR_ELx_EC(esr);
+
+ if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
+ return true;
+
+ if (!__get_fault_info(esr, &far, &hpfar))
+ return false;
+
vcpu->arch.fault.far_el2 = far;
vcpu->arch.fault.hpfar_el2 = hpfar;
return true;
--
2.30.0.284.gd98b1dd5eaa7-goog
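
__get_fault_info() hands back the raw FAR and HPFAR values, and a later patch in this series turns the latter into an IPA with "(hpfar & HPFAR_MASK) << 8". A rough userspace illustration of that arithmetic follows. The mask value (low four bits cleared), the 4 KiB page size, and the sketch_* names are assumptions; combining the result with the page offset from FAR is an extra shown for illustration and is not something this patch does.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_HPFAR_MASK (~0xfULL)	/* drop the low four bits */
#define SKETCH_PAGE_MASK  0xfffULL	/* 4 KiB pages */

/*
 * The faulting IPA's page number sits in HPFAR starting at bit 4, so
 * clearing the low nibble and shifting left by 8 yields a page-aligned IPA.
 */
static uint64_t sketch_hpfar_to_ipa(uint64_t hpfar)
{
	return (hpfar & SKETCH_HPFAR_MASK) << 8;
}

/* Optionally recover the in-page offset from the faulting address. */
static uint64_t sketch_fault_address(uint64_t hpfar, uint64_t far)
{
	return sketch_hpfar_to_ipa(hpfar) | (far & SKETCH_PAGE_MASK);
}
```

The host abort handler only needs the page-aligned value, since it maps a whole page at a time.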

2021-01-08 12:20:15

by Quentin Perret

Subject: [RFC PATCH v2 21/26] KVM: arm64: Refactor kvm_arm_setup_stage2()

In order to re-use some of the stage 2 setup at EL2, factor parts of
kvm_arm_setup_stage2() out into static inline functions.

No functional change intended.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 48 ++++++++++++++++++++++++++++++++
arch/arm64/kvm/reset.c | 42 +++-------------------------
2 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 662f0415344e..83b4c5cf4768 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -280,6 +280,54 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
return ret;
}

+static inline u64 kvm_get_parange(u64 mmfr0)
+{
+ u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
+ ID_AA64MMFR0_PARANGE_SHIFT);
+ if (parange > ID_AA64MMFR0_PARANGE_MAX)
+ parange = ID_AA64MMFR0_PARANGE_MAX;
+
+ return parange;
+}
+
+/*
+ * The VTCR value is common across all the physical CPUs on the system.
+ * We use system wide sanitised values to fill in different fields,
+ * except for Hardware Management of Access Flags. HA Flag is set
+ * unconditionally on all CPUs, as it is safe to run with or without
+ * the feature and the bit is RES0 on CPUs that don't support it.
+ */
+static inline u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
+{
+ u64 vtcr = VTCR_EL2_FLAGS;
+ u8 lvls;
+
+ vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
+ vtcr |= VTCR_EL2_T0SZ(phys_shift);
+ /*
+ * Use a minimum 2 level page table to prevent splitting
+ * host PMD huge pages at stage2.
+ */
+ lvls = stage2_pgtable_levels(phys_shift);
+ if (lvls < 2)
+ lvls = 2;
+ vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
+
+ /*
+ * Enable the Hardware Access Flag management, unconditionally
+ * on all CPUs. The feature is RES0 on CPUs without the support
+ * and must be ignored by the CPUs.
+ */
+ vtcr |= VTCR_EL2_HA;
+
+ /* Set the vmid bits */
+ vtcr |= (get_vmid_bits(mmfr1) == 16) ?
+ VTCR_EL2_VS_16BIT :
+ VTCR_EL2_VS_8BIT;
+
+ return vtcr;
+}
+
#define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)

static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 47f3f035f3ea..6aae118c960a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -332,19 +332,10 @@ int kvm_set_ipa_limit(void)
return 0;
}

-/*
- * Configure the VTCR_EL2 for this VM. The VTCR value is common
- * across all the physical CPUs on the system. We use system wide
- * sanitised values to fill in different fields, except for Hardware
- * Management of Access Flags. HA Flag is set unconditionally on
- * all CPUs, as it is safe to run with or without the feature and
- * the bit is RES0 on CPUs that don't support it.
- */
int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
{
- u64 vtcr = VTCR_EL2_FLAGS, mmfr0;
- u32 parange, phys_shift;
- u8 lvls;
+ u64 mmfr0, mmfr1;
+ u32 phys_shift;

if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
return -EINVAL;
@@ -359,33 +350,8 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
}

mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
- parange = cpuid_feature_extract_unsigned_field(mmfr0,
- ID_AA64MMFR0_PARANGE_SHIFT);
- if (parange > ID_AA64MMFR0_PARANGE_MAX)
- parange = ID_AA64MMFR0_PARANGE_MAX;
- vtcr |= parange << VTCR_EL2_PS_SHIFT;
-
- vtcr |= VTCR_EL2_T0SZ(phys_shift);
- /*
- * Use a minimum 2 level page table to prevent splitting
- * host PMD huge pages at stage2.
- */
- lvls = stage2_pgtable_levels(phys_shift);
- if (lvls < 2)
- lvls = 2;
- vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
-
- /*
- * Enable the Hardware Access Flag management, unconditionally
- * on all CPUs. The features is RES0 on CPUs without the support
- * and must be ignored by the CPUs.
- */
- vtcr |= VTCR_EL2_HA;
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);

- /* Set the vmid bits */
- vtcr |= (kvm_get_vmid_bits() == 16) ?
- VTCR_EL2_VS_16BIT :
- VTCR_EL2_VS_8BIT;
- kvm->arch.vtcr = vtcr;
return 0;
}
--
2.30.0.284.gd98b1dd5eaa7-goog
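
To make the parange handling in kvm_get_parange() and its use for T0SZ more concrete, here is a small userspace approximation. The field position (bits [3:0]), the cap of 5 (a 48-bit physical address configuration), the encoding-to-width table, the fallback of 48 bits, and all sketch_* names are assumptions for illustration; the real helpers take these from kernel headers and the build configuration.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_PARANGE_SHIFT 0
#define SKETCH_PARANGE_MAX   0x5ULL	/* assumed 48-bit PA configuration */

/* Extract the 4-bit PARange field and clamp it to what is supported. */
static uint64_t sketch_get_parange(uint64_t mmfr0)
{
	uint64_t parange = (mmfr0 >> SKETCH_PARANGE_SHIFT) & 0xfULL;

	if (parange > SKETCH_PARANGE_MAX)
		parange = SKETCH_PARANGE_MAX;
	return parange;
}

/* Map a PARange encoding to a physical address width in bits. */
static uint32_t sketch_parange_to_phys_shift(uint64_t parange)
{
	switch (parange) {
	case 0: return 32;
	case 1: return 36;
	case 2: return 40;
	case 3: return 42;
	case 4: return 44;
	case 5: return 48;
	case 6: return 52;
	default: return 48;	/* fallback chosen for this sketch */
	}
}

/* T0SZ encodes the input address size as 64 minus the number of bits. */
static uint32_t sketch_t0sz(uint32_t phys_shift)
{
	return 64 - phys_shift;
}
```

For an identity-mapped stage 2, the input address size equals the physical address size, which is why the width derived from parange can feed T0SZ directly.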

2021-01-08 12:20:19

by Quentin Perret

Subject: [RFC PATCH v2 25/26] KVM: arm64: Reserve memory for host stage 2

Extend the memory pool allocated for the hypervisor to include enough
pages to map all of memory at page granularity for the host stage 2.
While at it, also reserve some memory for device mappings.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 36 ++++++++++++++++++++++++----
arch/arm64/kvm/hyp/nvhe/setup.c | 12 ++++++++++
arch/arm64/kvm/hyp/reserved_mem.c | 2 ++
3 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index f0cc09b127a5..cdf2e3447b2a 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -52,15 +52,12 @@ static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
return total;
}

-static inline unsigned long hyp_s1_pgtable_size(void)
+static inline unsigned long __hyp_pgtable_total_size(void)
{
struct hyp_memblock_region *reg;
unsigned long nr_pages, res = 0;
int i;

- if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
- return 0;
-
for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
reg = &kvm_nvhe_sym(hyp_memory)[i];
nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
@@ -68,6 +65,18 @@ static inline unsigned long hyp_s1_pgtable_size(void)
res += nr_pages << PAGE_SHIFT;
}

+ return res;
+}
+
+static inline unsigned long hyp_s1_pgtable_size(void)
+{
+ unsigned long res, nr_pages;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
+ return 0;
+
+ res = __hyp_pgtable_total_size();
+
/* Allow 1 GiB for private mappings */
nr_pages = (1 << 30) >> PAGE_SHIFT;
nr_pages = __hyp_pgtable_max_pages(nr_pages);
@@ -76,4 +85,23 @@ static inline unsigned long hyp_s1_pgtable_size(void)
return res;
}

+static inline unsigned long host_s2_mem_pgtable_size(void)
+{
+ unsigned long max_pgd_sz = 16 << PAGE_SHIFT;
+
+ if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
+ return 0;
+
+ return __hyp_pgtable_total_size() + max_pgd_sz;
+}
+
+static inline unsigned long host_s2_dev_pgtable_size(void)
+{
+ if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
+ return 0;
+
+ /* Allow 1 GiB for device mappings */
+ return __hyp_pgtable_max_pages((1 << 30) >> PAGE_SHIFT) << PAGE_SHIFT;
+}
+
#endif /* __KVM_HYP_MM_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 6d1faede86ae..79b697df01e2 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -24,6 +24,8 @@ unsigned long hyp_nr_cpus;
static void *stacks_base;
static void *vmemmap_base;
static void *hyp_pgt_base;
+static void *host_s2_mem_pgt_base;
+static void *host_s2_dev_pgt_base;

static int divide_memory_pool(void *virt, unsigned long size)
{
@@ -46,6 +48,16 @@ static int divide_memory_pool(void *virt, unsigned long size)
if (!hyp_pgt_base)
return -ENOMEM;

+ nr_pages = host_s2_mem_pgtable_size() >> PAGE_SHIFT;
+ host_s2_mem_pgt_base = hyp_early_alloc_contig(nr_pages);
+ if (!host_s2_mem_pgt_base)
+ return -ENOMEM;
+
+ nr_pages = host_s2_dev_pgtable_size() >> PAGE_SHIFT;
+ host_s2_dev_pgt_base = hyp_early_alloc_contig(nr_pages);
+ if (!host_s2_dev_pgt_base)
+ return -ENOMEM;
+
return 0;
}

diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
index 32f648992835..ee97e55e3c59 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -74,6 +74,8 @@ void __init kvm_hyp_reserve(void)
*/
hyp_mem_size += NR_CPUS << PAGE_SHIFT;
hyp_mem_size += hyp_s1_pgtable_size();
+ hyp_mem_size += host_s2_mem_pgtable_size();
+ hyp_mem_size += host_s2_dev_pgtable_size();

/*
* The hyp_vmemmap needs to be backed by pages, but these pages
--
2.30.0.284.gd98b1dd5eaa7-goog
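
The sizing helpers in this patch can be approximated in userspace to get a feel for the numbers. This is a sketch under stated assumptions: 4 KiB pages, 512 entries per table, a 4-level worst case, and a per-region call to a max-pages helper whose body is not visible in the hunks above. The sketch_* names and struct sketch_region are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_PTRS_PER_PTE   512ULL
#define SKETCH_PGTABLE_LEVELS 4

struct sketch_region {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* Worst-case number of table pages to map nr_pages at page granularity. */
static uint64_t sketch_pgtable_max_pages(uint64_t nr_pages)
{
	uint64_t total = 0;

	for (int i = 0; i < SKETCH_PGTABLE_LEVELS; i++) {
		nr_pages = (nr_pages + SKETCH_PTRS_PER_PTE - 1) / SKETCH_PTRS_PER_PTE;
		total += nr_pages;
	}
	return total;
}

/* Bytes of page-table memory needed to map every region. */
static uint64_t sketch_pgtable_total_size(const struct sketch_region *regs,
					  size_t nr)
{
	uint64_t res = 0;

	for (size_t i = 0; i < nr; i++) {
		uint64_t nr_pages = (regs[i].end - regs[i].start) >> SKETCH_PAGE_SHIFT;

		res += sketch_pgtable_max_pages(nr_pages) << SKETCH_PAGE_SHIFT;
	}
	return res;
}

/* Memory pool: tables for all regions plus up to 16 pages of PGD. */
static uint64_t sketch_host_s2_mem_size(const struct sketch_region *regs,
					size_t nr)
{
	if (nr == 0)
		return 0;
	return sketch_pgtable_total_size(regs, nr) + (16ULL << SKETCH_PAGE_SHIFT);
}

/* Device pool: enough tables to map 1 GiB at page granularity. */
static uint64_t sketch_host_s2_dev_size(size_t nr)
{
	if (nr == 0)
		return 0;
	return sketch_pgtable_max_pages((1ULL << 30) >> SKETCH_PAGE_SHIFT)
		<< SKETCH_PAGE_SHIFT;
}
```

With these assumptions, a single 1 GiB region costs 515 table pages (512 + 1 + 1 + 1), so the overhead stays close to 0.2% of the memory being mapped.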

2021-01-08 12:20:25

by Quentin Perret

Subject: [RFC PATCH v2 26/26] KVM: arm64: Wrap the host with a stage 2

When KVM runs in protected nVHE mode, make use of a stage 2 page-table
to give the hypervisor some control over the host memory accesses. For
now, every memory abort taken from the host is handled lazily by
identity-mapping the faulting page RWX at stage 2. Later patches will
make use of that infrastructure to implement access control
restrictions, e.g. to protect guest memory from the host.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_cpufeature.h | 2 +
arch/arm64/kernel/image-vars.h | 3 +
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 33 +++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 1 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 6 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 191 ++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 6 +
arch/arm64/kvm/hyp/nvhe/switch.c | 7 +-
arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
10 files changed, 248 insertions(+), 7 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c

diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
index d34f85cba358..74043a149322 100644
--- a/arch/arm64/include/asm/kvm_cpufeature.h
+++ b/arch/arm64/include/asm/kvm_cpufeature.h
@@ -15,3 +15,5 @@
#endif

KVM_HYP_CPU_FTR_REG(SYS_CTR_EL0, arm64_ftr_reg_ctrel0)
+KVM_HYP_CPU_FTR_REG(SYS_ID_AA64MMFR0_EL1, arm64_ftr_reg_id_aa64mmfr0_el1)
+KVM_HYP_CPU_FTR_REG(SYS_ID_AA64MMFR1_EL1, arm64_ftr_reg_id_aa64mmfr1_el1)
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 366d837f0d39..e4e4f30ac251 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -132,6 +132,9 @@ KVM_NVHE_ALIAS(__hyp_data_ro_after_init_end);
KVM_NVHE_ALIAS(__hyp_bss_start);
KVM_NVHE_ALIAS(__hyp_bss_end);

+/* pKVM static key */
+KVM_NVHE_ALIAS(kvm_protected_mode_initialized);
+
#endif /* CONFIG_KVM */

#endif /* __ARM64_KERNEL_IMAGE_VARS_H */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
new file mode 100644
index 000000000000..a22ef118a610
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#ifndef __KVM_NVHE_MEM_PROTECT__
+#define __KVM_NVHE_MEM_PROTECT__
+#include <linux/kvm_host.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/virt.h>
+#include <nvhe/spinlock.h>
+
+struct host_kvm {
+ struct kvm_arch arch;
+ struct kvm_pgtable pgt;
+ struct kvm_pgtable_mm_ops mm_ops;
+ hyp_spinlock_t lock;
+};
+extern struct host_kvm host_kvm;
+
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool);
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
+
+static __always_inline void __load_host_stage2(void)
+{
+ if (static_branch_likely(&kvm_protected_mode_initialized))
+ __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+ else
+ write_sysreg(0, vttbr_el2);
+}
+#endif /* __KVM_NVHE_MEM_PROTECT__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index d7381a503182..c3e2f98555c4 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -11,7 +11,7 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
- cache.o cpufeature.o setup.o mm.o
+ cache.o cpufeature.o setup.o mm.o mem_protect.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
index b1341bb4b453..32591db76c75 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
@@ -129,6 +129,7 @@ alternative_else_nop_endif

/* Invalidate the stale TLBs from Bootloader */
tlbi alle2
+ tlbi vmalls12e1
dsb sy

/*
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 3075f117651c..93699600bc22 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -13,6 +13,7 @@
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>

+#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
#include <nvhe/trap_handler.h>

@@ -222,6 +223,11 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
case ESR_ELx_EC_SMC64:
handle_host_smc(host_ctxt);
break;
+ case ESR_ELx_EC_IABT_LOW:
+ fallthrough;
+ case ESR_ELx_EC_DABT_LOW:
+ handle_host_mem_abort(host_ctxt);
+ break;
default:
hyp_panic();
}
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
new file mode 100644
index 000000000000..0cd3eb178f3b
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_cpufeature.h>
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/stage2_pgtable.h>
+
+#include <hyp/switch.h>
+
+#include <nvhe/gfp.h>
+#include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+
+extern unsigned long hyp_nr_cpus;
+struct host_kvm host_kvm;
+
+struct hyp_pool host_s2_mem;
+struct hyp_pool host_s2_dev;
+
+static void *host_s2_zalloc_pages_exact(size_t size)
+{
+ return hyp_alloc_pages(&host_s2_mem, HYP_GFP_ZERO, get_order(size));
+}
+
+static void *host_s2_zalloc_page(void *pool)
+{
+ return hyp_alloc_pages(pool, HYP_GFP_ZERO, 0);
+}
+
+static int prepare_s2_pools(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+ unsigned long nr_pages;
+ int ret;
+
+ nr_pages = host_s2_mem_pgtable_size() >> PAGE_SHIFT;
+ ret = hyp_pool_init(&host_s2_mem, __hyp_pa(mem_pgt_pool), nr_pages, 0);
+ if (ret)
+ return ret;
+
+ nr_pages = host_s2_dev_pgtable_size() >> PAGE_SHIFT;
+ ret = hyp_pool_init(&host_s2_dev, __hyp_pa(dev_pgt_pool), nr_pages, 0);
+ if (ret)
+ return ret;
+
+ host_kvm.mm_ops.zalloc_pages_exact = host_s2_zalloc_pages_exact;
+ host_kvm.mm_ops.zalloc_page = host_s2_zalloc_page;
+ host_kvm.mm_ops.phys_to_virt = hyp_phys_to_virt;
+ host_kvm.mm_ops.virt_to_phys = hyp_virt_to_phys;
+ host_kvm.mm_ops.page_count = hyp_page_count;
+ host_kvm.mm_ops.get_page = hyp_get_page;
+ host_kvm.mm_ops.put_page = hyp_put_page;
+
+ return 0;
+}
+
+static void prepare_host_vtcr(void)
+{
+ u32 parange, phys_shift;
+ u64 mmfr0, mmfr1;
+
+ mmfr0 = arm64_ftr_reg_id_aa64mmfr0_el1.sys_val;
+ mmfr1 = arm64_ftr_reg_id_aa64mmfr1_el1.sys_val;
+
+ /* The host stage 2 is id-mapped, so use parange for T0SZ */
+ parange = kvm_get_parange(mmfr0);
+ phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
+
+ host_kvm.arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+}
+
+int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
+{
+ struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
+ struct kvm_nvhe_init_params *params;
+ int ret, i;
+
+ prepare_host_vtcr();
+ hyp_spin_lock_init(&host_kvm.lock);
+
+ ret = prepare_s2_pools(mem_pgt_pool, dev_pgt_pool);
+ if (ret)
+ return ret;
+
+ ret = kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch,
+ &host_kvm.mm_ops);
+ if (ret)
+ return ret;
+
+ mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd);
+ mmu->arch = &host_kvm.arch;
+ mmu->pgt = &host_kvm.pgt;
+ mmu->vmid.vmid_gen = 0;
+ mmu->vmid.vmid = 0;
+
+ for (i = 0; i < hyp_nr_cpus; i++) {
+ params = per_cpu_ptr(&kvm_init_params, i);
+ params->vttbr = kvm_get_vttbr(mmu);
+ params->vtcr = host_kvm.arch.vtcr;
+ params->hcr_el2 |= HCR_VM;
+ __flush_dcache_area(params, sizeof(*params));
+ }
+
+ write_sysreg(this_cpu_ptr(&kvm_init_params)->hcr_el2, hcr_el2);
+ __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr);
+
+ return 0;
+}
+
+static void host_stage2_unmap_dev_all(void)
+{
+ struct kvm_pgtable *pgt = &host_kvm.pgt;
+ struct hyp_memblock_region *reg;
+ u64 addr = 0;
+ int i;
+
+ /* Unmap all non-memory regions to recycle the pages */
+ for (i = 0; i < hyp_memblock_nr; i++, addr = reg->end) {
+ reg = &hyp_memory[i];
+ kvm_pgtable_stage2_unmap(pgt, addr, reg->start - addr);
+ }
+ kvm_pgtable_stage2_unmap(pgt, addr, ULONG_MAX);
+}
+
+static bool ipa_is_memory(u64 ipa)
+{
+ int cur, left = 0, right = hyp_memblock_nr;
+ struct hyp_memblock_region *reg;
+
+ /* The list of memblock regions is sorted, binary search it */
+ while (left < right) {
+ cur = (left + right) >> 1;
+ reg = &hyp_memory[cur];
+ if (ipa < reg->start)
+ right = cur;
+ else if (ipa >= reg->end)
+ left = cur + 1;
+ else
+ return true;
+ }
+
+ return false;
+}
+
+static int __host_stage2_map(u64 ipa, u64 size, enum kvm_pgtable_prot prot,
+ struct hyp_pool *p)
+{
+ return kvm_pgtable_stage2_map(&host_kvm.pgt, ipa, size, ipa, prot, p);
+}
+
+static int host_stage2_map(u64 ipa, u64 size, enum kvm_pgtable_prot prot)
+{
+ int ret, is_memory = ipa_is_memory(ipa);
+ struct hyp_pool *pool;
+
+ pool = is_memory ? &host_s2_mem : &host_s2_dev;
+
+ hyp_spin_lock(&host_kvm.lock);
+ ret = __host_stage2_map(ipa, size, prot, pool);
+ if (ret == -ENOMEM && !is_memory) {
+ host_stage2_unmap_dev_all();
+ ret = __host_stage2_map(ipa, size, prot, pool);
+ }
+ hyp_spin_unlock(&host_kvm.lock);
+
+ return ret;
+}
+
+void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
+{
+ enum kvm_pgtable_prot prot;
+ u64 far, hpfar, esr, ipa;
+ int ret;
+
+ esr = read_sysreg_el2(SYS_ESR);
+ if (!__get_fault_info(esr, &far, &hpfar))
+ hyp_panic();
+
+ prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
+ ipa = (hpfar & HPFAR_MASK) << 8;
+ ret = host_stage2_map(ipa, PAGE_SIZE, prot);
+ if (ret)
+ hyp_panic();
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 79b697df01e2..f6d3318e92fa 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -12,6 +12,7 @@
#include <nvhe/early_alloc.h>
#include <nvhe/gfp.h>
#include <nvhe/memory.h>
+#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>

struct hyp_pool hpool;
@@ -161,6 +162,11 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;

+ /* Wrap the host with a stage 2 */
+ ret = kvm_host_prepare_stage2(host_s2_mem_pgt_base, host_s2_dev_pgt_base);
+ if (ret)
+ goto out;
+
pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 979a76cdf9fb..31bc1a843bf8 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -28,6 +28,8 @@
#include <asm/processor.h>
#include <asm/thread_info.h>

+#include <nvhe/mem_protect.h>
+
/* Non-VHE specific context */
DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
@@ -102,11 +104,6 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
write_sysreg(__kvm_hyp_host_vector, vbar_el2);
}

-static void __load_host_stage2(void)
-{
- write_sysreg(0, vttbr_el2);
-}
-
/* Save VGICv3 state on non-VHE systems */
static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
{
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index fbde89a2c6e8..255a23a1b2db 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -8,6 +8,8 @@
#include <asm/kvm_mmu.h>
#include <asm/tlbflush.h>

+#include <nvhe/mem_protect.h>
+
struct tlb_inv_context {
u64 tcr;
};
@@ -43,7 +45,7 @@ static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,

static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
{
- write_sysreg(0, vttbr_el2);
+ __load_host_stage2();

if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
/* Ensure write of the host VMID */
--
2.30.0.284.gd98b1dd5eaa7-goog
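
ipa_is_memory() above relies on the memblock list being sorted and non-overlapping, with each region covering the half-open range [start, end). A standalone version of the same search may help when checking the boundary cases. The struct and function names here are hypothetical, and the region table is passed in explicitly rather than read from the hyp globals.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_region {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/*
 * Binary search over regions sorted by start address. Returns true when
 * ipa falls inside one of them, false when it lands in a hole.
 */
static bool sketch_ipa_is_memory(const struct sketch_region *regs, int nr,
				 uint64_t ipa)
{
	int left = 0, right = nr;

	while (left < right) {
		int cur = (left + right) >> 1;
		const struct sketch_region *reg = &regs[cur];

		if (ipa < reg->start)
			right = cur;		/* look in the lower half */
		else if (ipa >= reg->end)
			left = cur + 1;		/* look in the upper half */
		else
			return true;
	}

	return false;
}
```

The result of this lookup is what selects between the memory pool and the device pool in host_stage2_map(), so an address exactly at a region's end is treated as a device address.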

2021-01-08 12:20:26

by Quentin Perret

Subject: [RFC PATCH v2 24/26] KVM: arm64: Make memcache anonymous in pgtable allocator

The current stage2 page-table allocator uses a memcache to get
pre-allocated pages when it needs any. To allow re-using this code at
EL2, which uses a concept of memory pools, make the memcache argument to
kvm_pgtable_stage2_map() anonymous, and let the mm_ops zalloc_page()
callbacks use it the way they need to.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
arch/arm64/kvm/hyp/pgtable.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 8e8f1d2c5e0e..d846bc3d3b77 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
* @size: Size of the mapping.
* @phys: Physical address of the memory to map.
* @prot: Permissions and attributes for the mapping.
- * @mc: Cache of pre-allocated GFP_PGTABLE_USER memory from which to
- * allocate page-table pages.
+ * @mc: Cache of pre-allocated memory from which to allocate page-table
+ * pages.
*
* The offset of @addr within a page is ignored, @size is rounded-up to
* the next page boundary and @phys is rounded-down to the previous page
@@ -194,7 +194,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
*/
int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
- struct kvm_mmu_memory_cache *mc);
+ void *mc);

/**
* kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 96a25d0b7b6e..5dd1b4978fe8 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -443,7 +443,7 @@ struct stage2_map_data {
kvm_pte_t *anchor;

struct kvm_s2_mmu *mmu;
- struct kvm_mmu_memory_cache *memcache;
+ void *memcache;

struct kvm_pgtable_mm_ops *mm_ops;
};
@@ -613,7 +613,7 @@ static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,

int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
- struct kvm_mmu_memory_cache *mc)
+ void *mc)
{
int ret;
struct stage2_map_data map_data = {
--
2.30.0.284.gd98b1dd5eaa7-goog
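
The effect of making the memcache argument a void pointer can be shown with a toy example in which the same caller works with two unrelated allocator backends. Everything here is hypothetical (the sketch_* types, the fixed-size cache, the bump-style pool) and is only meant to show the opaque-argument pattern, not the kernel's actual memcache or hyp_pool implementations.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* The walker only knows about this callback and an opaque argument. */
struct sketch_mm_ops {
	void *(*zalloc_page)(void *arg);
};

/* Backend 1: a small stack of pre-allocated, already-zeroed pages. */
struct sketch_memcache {
	int nobjs;
	void *objects[4];
};

static void *sketch_memcache_zalloc(void *arg)
{
	struct sketch_memcache *mc = arg;

	return mc->nobjs ? mc->objects[--mc->nobjs] : NULL;
}

/* Backend 2: a contiguous pool handed out one page at a time. */
struct sketch_pool {
	unsigned char *base;
	size_t nr_pages;
	size_t next;
};

static void *sketch_pool_zalloc(void *arg)
{
	struct sketch_pool *pool = arg;
	void *page;

	if (pool->next >= pool->nr_pages)
		return NULL;

	page = pool->base + pool->next * SKETCH_PAGE_SIZE;
	pool->next++;
	memset(page, 0, SKETCH_PAGE_SIZE);
	return page;
}

/* Stand-in for the map path: it never interprets 'mc' itself. */
static void *sketch_alloc_table(const struct sketch_mm_ops *ops, void *mc)
{
	return ops->zalloc_page(mc);
}
```

Only the callback needs to know the concrete type behind the pointer, at the cost of losing compile-time type checking on that argument.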

2021-01-08 12:20:28

by Quentin Perret

Subject: [RFC PATCH v2 14/26] KVM: arm64: Factor out vector address calculation

In order to re-map the guest vectors at EL2 when pKVM is enabled,
refactor __kvm_vector_slot2idx() and kvm_init_vector_slot() to move all
the address calculation logic into a static inline function.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 8 ++++++++
arch/arm64/kvm/arm.c | 9 +--------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e52d82aeadca..d7ebd73ec86f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -195,6 +195,14 @@ phys_addr_t kvm_mmu_get_httbr(void);
phys_addr_t kvm_get_idmap_vector(void);
int kvm_mmu_init(void);

+static inline void *__kvm_vector_slot2addr(void *base,
+ enum arm64_hyp_spectre_vector slot)
+{
+ int idx = slot - (slot != HYP_VECTOR_DIRECT);
+
+ return base + (idx * SZ_2K);
+}
+
struct kvm;

#define kvm_flush_dcache_to_poc(a,l) __flush_dcache_area((a), (l))
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9fd769349e9e..6af9204bcd5b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1346,16 +1346,9 @@ static unsigned long nvhe_percpu_order(void)
/* A lookup table holding the hypervisor VA for each vector slot */
static void *hyp_spectre_vector_selector[BP_HARDEN_EL2_SLOTS];

-static int __kvm_vector_slot2idx(enum arm64_hyp_spectre_vector slot)
-{
- return slot - (slot != HYP_VECTOR_DIRECT);
-}
-
static void kvm_init_vector_slot(void *base, enum arm64_hyp_spectre_vector slot)
{
- int idx = __kvm_vector_slot2idx(slot);
-
- hyp_spectre_vector_selector[slot] = base + (idx * SZ_2K);
+ hyp_spectre_vector_selector[slot] = __kvm_vector_slot2addr(base, slot);
}

static int kvm_init_vector_slots(void)
--
2.30.0.284.gd98b1dd5eaa7-goog
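
For readers following along, the slot-to-address arithmetic factored out in the patch above can be tried in userspace. This is only a sketch: the enum ordering is an assumption based on the kernel's arm64_hyp_spectre_vector at the time of the series, and a char pointer is used for the base so that the pointer arithmetic is standard C rather than the GNU void-pointer extension the kernel relies on.

```c
#include <assert.h>

/* Assumed slot ordering; the real enum lives in the arm64 spectre header. */
enum hyp_vector_slot {
	HYP_VECTOR_DIRECT,
	HYP_VECTOR_SPECTRE_DIRECT,
	HYP_VECTOR_INDIRECT,
	HYP_VECTOR_SPECTRE_INDIRECT,
};

#define SZ_2K 0x800

/*
 * DIRECT and SPECTRE_DIRECT both resolve to index 0; every later slot
 * is shifted down by one, so the vectors are packed 2K apart.
 */
static inline char *vector_slot2addr(char *base, enum hyp_vector_slot slot)
{
	int idx;

	assert(slot <= HYP_VECTOR_SPECTRE_INDIRECT);
	idx = slot - (slot != HYP_VECTOR_DIRECT);

	return base + (idx * SZ_2K);
}
```

With this ordering, the two direct slots share the first 2K vector and the two indirect slots land at offsets 2K and 4K respectively.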

2021-01-08 12:20:57

by Quentin Perret

Subject: [RFC PATCH v2 10/26] KVM: arm64: Introduce an early Hyp page allocator

With nVHE, the host currently creates all s1 hypervisor mappings at EL1
during boot, installs them at EL2, and extends them as required (e.g.
when creating a new VM). But in a world where the host is no longer
trusted, it cannot have full control over the code mapped in the
hypervisor.

In preparation for enabling the hypervisor to create its own s1 mappings
during boot, introduce an early page allocator, with minimal
functionality. This allocator is designed to be used only during early
bootstrap of the hyp code when memory protection is enabled, which will
then switch to using a full-fledged page allocator after init.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/early_alloc.h | 14 +++++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 24 ++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/early_alloc.c | 60 +++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 4 +-
5 files changed, 100 insertions(+), 4 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c

diff --git a/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h b/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
new file mode 100644
index 000000000000..68ce2bf9a718
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_EARLY_ALLOC_H
+#define __KVM_HYP_EARLY_ALLOC_H
+
+#include <asm/kvm_pgtable.h>
+
+void hyp_early_alloc_init(void *virt, unsigned long size);
+unsigned long hyp_early_alloc_nr_pages(void);
+void *hyp_early_alloc_page(void *arg);
+void *hyp_early_alloc_contig(unsigned int nr_pages);
+
+extern struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
+
+#endif /* __KVM_HYP_EARLY_ALLOC_H */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
new file mode 100644
index 000000000000..64c44c142c95
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_MEMORY_H
+#define __KVM_HYP_MEMORY_H
+
+#include <asm/page.h>
+
+#include <linux/types.h>
+
+extern s64 hyp_physvirt_offset;
+
+#define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
+#define __hyp_va(virt) ((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
+
+static inline void *hyp_phys_to_virt(phys_addr_t phys)
+{
+ return __hyp_va(phys);
+}
+
+static inline phys_addr_t hyp_virt_to_phys(void *addr)
+{
+ return __hyp_pa(addr);
+}
+
+#endif /* __KVM_HYP_MEMORY_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 590fdefb42dd..1fc0684a7678 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
new file mode 100644
index 000000000000..de4c45662970
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <asm/kvm_pgtable.h>
+
+#include <nvhe/memory.h>
+
+struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
+s64 __ro_after_init hyp_physvirt_offset;
+
+static unsigned long base;
+static unsigned long end;
+static unsigned long cur;
+
+unsigned long hyp_early_alloc_nr_pages(void)
+{
+ return (cur - base) >> PAGE_SHIFT;
+}
+
+extern void clear_page(void *to);
+
+void *hyp_early_alloc_contig(unsigned int nr_pages)
+{
+ unsigned long ret = cur, i, p;
+
+ if (!nr_pages)
+ return NULL;
+
+ cur += nr_pages << PAGE_SHIFT;
+ if (cur > end) {
+ cur = ret;
+ return NULL;
+ }
+
+ for (i = 0; i < nr_pages; i++) {
+ p = ret + (i << PAGE_SHIFT);
+ clear_page((void *)(p));
+ }
+
+ return (void *)ret;
+}
+
+void *hyp_early_alloc_page(void *arg)
+{
+ return hyp_early_alloc_contig(1);
+}
+
+void hyp_early_alloc_init(void *virt, unsigned long size)
+{
+ base = (unsigned long)virt;
+ end = base + size;
+ cur = base;
+
+ hyp_early_alloc_mm_ops.zalloc_page = hyp_early_alloc_page;
+ hyp_early_alloc_mm_ops.phys_to_virt = hyp_phys_to_virt;
+ hyp_early_alloc_mm_ops.virt_to_phys = hyp_virt_to_phys;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
index e3947846ffcb..bdd8054bce4c 100644
--- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
+++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
@@ -11,6 +11,7 @@
#include <linux/kvm_host.h>
#include <uapi/linux/psci.h>

+#include <nvhe/memory.h>
#include <nvhe/trap_handler.h>

void kvm_hyp_cpu_entry(unsigned long r0);
@@ -20,9 +21,6 @@ void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);

/* Config options set by the host. */
struct kvm_host_psci_config __ro_after_init kvm_host_psci_config;
-s64 __ro_after_init hyp_physvirt_offset;
-
-#define __hyp_pa(x) ((phys_addr_t)((x)) + hyp_physvirt_offset)

#define INVALID_CPU_ID UINT_MAX

--
2.30.0.284.gd98b1dd5eaa7-goog
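
The early allocator in the patch above is essentially a bump allocator: it hands out zeroed pages from a fixed range and never frees them. A minimal userspace sketch of the same idea follows. The sketch_ prefixed names and the 4K page size are assumptions made for illustration, and memset() stands in for the kernel's clear_page().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Allocator state: [sketch_base, sketch_end) is the pool, sketch_cur the watermark. */
static uintptr_t sketch_base, sketch_end, sketch_cur;

static inline void sketch_early_alloc_init(void *virt, size_t size)
{
	/* The pool is carved up in whole pages only. */
	assert(size % SKETCH_PAGE_SIZE == 0);

	sketch_base = sketch_cur = (uintptr_t)virt;
	sketch_end = sketch_base + size;
}

static inline size_t sketch_early_alloc_nr_pages(void)
{
	return (sketch_cur - sketch_base) >> SKETCH_PAGE_SHIFT;
}

static inline void *sketch_early_alloc_contig(unsigned int nr_pages)
{
	uintptr_t ret = sketch_cur;
	size_t bytes = (size_t)nr_pages << SKETCH_PAGE_SHIFT;

	/* Reject empty requests and requests that do not fit in what is left. */
	if (!nr_pages || bytes > sketch_end - sketch_cur)
		return NULL;

	sketch_cur += bytes;
	memset((void *)ret, 0, bytes);	/* pages are always handed out zeroed */

	return (void *)ret;
}
```

Checking the remaining space before moving the watermark avoids having to roll it back on failure, which is a small departure from the patch but gives the same observable behaviour.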

2021-01-08 12:21:02

by Quentin Perret

Subject: [RFC PATCH v2 08/26] KVM: arm64: Make kvm_call_hyp() a function call at Hyp

kvm_call_hyp() has some logic to issue a function call or a hypercall
depending on the EL at which the kernel is running. However, all the code
compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to run only at EL2,
and in this case a simple function call is all that is needed.

Add ifdefery to kvm_host.h to simplify kvm_call_hyp() in .hyp.text.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8fcfab0c2567..81212958ef55 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -592,6 +592,7 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
void kvm_arm_halt_guest(struct kvm *kvm);
void kvm_arm_resume_guest(struct kvm *kvm);

+#ifndef __KVM_NVHE_HYPERVISOR__
#define kvm_call_hyp_nvhe(f, ...) \
({ \
struct arm_smccc_res res; \
@@ -631,6 +632,11 @@ void kvm_arm_resume_guest(struct kvm *kvm);
\
ret; \
})
+#else /* __KVM_NVHE_HYPERVISOR__ */
+#define kvm_call_hyp(f, ...) f(__VA_ARGS__)
+#define kvm_call_hyp_ret(f, ...) f(__VA_ARGS__)
+#define kvm_call_hyp_nvhe(f, ...) f(__VA_ARGS__)
+#endif /* __KVM_NVHE_HYPERVISOR__ */

void force_vm_exit(const cpumask_t *mask);
void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
--
2.30.0.284.gd98b1dd5eaa7-goog
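
The effect of the ifdefery in the patch above can be illustrated with a small userspace sketch. All names here are hypothetical: SKETCH_HYP_BUILD stands in for __KVM_NVHE_HYPERVISOR__, and sketch_hvc() stands in for the host-side path that would trap to EL2. The point is only that, in a hyp build, the macro collapses to a plain call and the trapping path is never taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch standing in for __KVM_NVHE_HYPERVISOR__. */
#define SKETCH_HYP_BUILD 1

/* Counts trips through the stand-in hypercall path. */
static int sketch_hypercalls;

/* Stand-in for the host-side path, which would issue an HVC. */
static inline int sketch_hvc(int (*fn)(int), int arg)
{
	assert(fn != NULL);
	sketch_hypercalls++;

	return fn(arg);
}

#if SKETCH_HYP_BUILD
/* Already at EL2: a direct function call is enough. */
#define sketch_call_hyp(f, ...) f(__VA_ARGS__)
#else
/* Host build: go through the hypercall path. */
#define sketch_call_hyp(f, ...) sketch_hvc(f, __VA_ARGS__)
#endif

static inline int sketch_double(int x)
{
	return 2 * x;
}
```

Flipping SKETCH_HYP_BUILD to 0 routes the same call site through sketch_hvc() without touching the caller, which is the property shared code depends on.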

2021-01-08 12:21:08

by Quentin Perret

Subject: [RFC PATCH v2 22/26] KVM: arm64: Refactor __load_guest_stage2()

Refactor __load_guest_stage2() to introduce __load_stage2() which will
be re-used when loading the host stage 2.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 83b4c5cf4768..8d37d6d1ed29 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -345,9 +345,9 @@ static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
* Must be called from hyp code running at EL2 with an updated VTTBR
* and interrupts disabled.
*/
-static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
+static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu, unsigned long vtcr)
{
- write_sysreg(kern_hyp_va(mmu->arch)->vtcr, vtcr_el2);
+ write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);

/*
@@ -358,6 +358,11 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
}

+static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
+{
+ __load_stage2(mmu, kern_hyp_va(mmu->arch)->vtcr);
+}
+
static inline struct kvm *kvm_s2_mmu_to_kvm(struct kvm_s2_mmu *mmu)
{
return container_of(mmu->arch, struct kvm, arch);
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-08 12:21:18

by Quentin Perret

Subject: [RFC PATCH v2 20/26] KVM: arm64: Set host stage 2 using kvm_nvhe_init_params

Move the registers relevant to host stage 2 enablement to
kvm_nvhe_init_params to prepare the ground for enabling it in later
patches.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 3 +++
arch/arm64/kernel/asm-offsets.c | 3 +++
arch/arm64/kvm/arm.c | 5 +++++
arch/arm64/kvm/hyp/nvhe/hyp-init.S | 9 +++++++++
arch/arm64/kvm/hyp/nvhe/switch.c | 5 +----
5 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 4fc27ac08836..5354b05eb9e2 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -158,6 +158,9 @@ struct kvm_nvhe_init_params {
unsigned long tpidr_el2;
unsigned long stack_hyp_va;
phys_addr_t pgd_pa;
+ unsigned long hcr_el2;
+ unsigned long vttbr;
+ unsigned long vtcr;
};

/* Translate a kernel address @ptr into its equivalent linear mapping */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 5e82488f1b82..9cf7736e31db 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -114,6 +114,9 @@ int main(void)
DEFINE(NVHE_INIT_TPIDR_EL2, offsetof(struct kvm_nvhe_init_params, tpidr_el2));
DEFINE(NVHE_INIT_STACK_HYP_VA, offsetof(struct kvm_nvhe_init_params, stack_hyp_va));
DEFINE(NVHE_INIT_PGD_PA, offsetof(struct kvm_nvhe_init_params, pgd_pa));
+ DEFINE(NVHE_INIT_HCR_EL2, offsetof(struct kvm_nvhe_init_params, hcr_el2));
+ DEFINE(NVHE_INIT_VTTBR, offsetof(struct kvm_nvhe_init_params, vttbr));
+ DEFINE(NVHE_INIT_VTCR, offsetof(struct kvm_nvhe_init_params, vtcr));
#endif
#ifdef CONFIG_CPU_PM
DEFINE(CPU_CTX_SP, offsetof(struct cpu_suspend_ctx, sp));
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e524682c2ccf..00cee4489cd7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1413,6 +1413,11 @@ static void cpu_prepare_hyp_mode(int cpu)

params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
params->pgd_pa = kvm_mmu_get_httbr();
+ if (is_protected_kvm_enabled())
+ params->hcr_el2 = HCR_HOST_NVHE_PROTECTED_FLAGS;
+ else
+ params->hcr_el2 = HCR_HOST_NVHE_FLAGS;
+ params->vttbr = params->vtcr = 0;

/*
* Flush the init params from the data cache because the struct will
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
index ad943966c39f..b1341bb4b453 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
@@ -102,6 +102,15 @@ alternative_else_nop_endif
ldr x1, [x0, #NVHE_INIT_MAIR_EL2]
msr mair_el2, x1

+ ldr x1, [x0, #NVHE_INIT_HCR_EL2]
+ msr hcr_el2, x1
+
+ ldr x1, [x0, #NVHE_INIT_VTTBR]
+ msr vttbr_el2, x1
+
+ ldr x1, [x0, #NVHE_INIT_VTCR]
+ msr vtcr_el2, x1
+
ldr x1, [x0, #NVHE_INIT_PGD_PA]
phys_to_ttbr x2, x1
alternative_if ARM64_HAS_CNP
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index f3d0e9eca56c..979a76cdf9fb 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -97,10 +97,7 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
mdcr_el2 |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;

write_sysreg(mdcr_el2, mdcr_el2);
- if (is_protected_kvm_enabled())
- write_sysreg(HCR_HOST_NVHE_PROTECTED_FLAGS, hcr_el2);
- else
- write_sysreg(HCR_HOST_NVHE_FLAGS, hcr_el2);
+ write_sysreg(this_cpu_ptr(&kvm_init_params)->hcr_el2, hcr_el2);
write_sysreg(CPTR_EL2_DEFAULT, cptr_el2);
write_sysreg(__kvm_hyp_host_vector, vbar_el2);
}
--
2.30.0.284.gd98b1dd5eaa7-goog
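
The patch above relies on asm-offsets to let hyp-init.S read the new fields by byte offset from the params pointer. The following userspace sketch shows that mechanism in isolation. The struct is a cut-down, hypothetical stand-in for kvm_nvhe_init_params, and sketch_ldr() merely mimics what "ldr x1, [x0, #OFFSET]" does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down, hypothetical stand-in for struct kvm_nvhe_init_params. */
struct sketch_init_params {
	unsigned long pgd_pa;
	unsigned long hcr_el2;
	unsigned long vttbr;
	unsigned long vtcr;
};

/* What asm-offsets would generate for the assembly side. */
#define SKETCH_INIT_HCR_EL2	offsetof(struct sketch_init_params, hcr_el2)
#define SKETCH_INIT_VTTBR	offsetof(struct sketch_init_params, vttbr)
#define SKETCH_INIT_VTCR	offsetof(struct sketch_init_params, vtcr)

/* Mimics "ldr x1, [x0, #offset]": load one word at base + offset. */
static inline unsigned long sketch_ldr(const void *base, size_t offset)
{
	unsigned long val;

	assert(offset % sizeof(val) == 0);
	memcpy(&val, (const char *)base + offset, sizeof(val));

	return val;
}
```

Because the offsets are derived from the struct definition at build time, adding a field only requires a matching DEFINE() entry and the assembly keeps working.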

2021-01-08 12:21:27

by Quentin Perret

Subject: [RFC PATCH v2 17/26] KVM: arm64: Elevate Hyp mappings creation at EL2

Previous commits have introduced infrastructure at EL2 to enable the Hyp
code to manage its own memory, and more specifically its stage 1 page
tables. However, this was preliminary work, and none of it is currently
in use.

Put all of this together by elevating the creation of hyp mappings to EL2
when memory protection is enabled. In this case, the host kernel running
at EL1 still creates _temporary_ Hyp mappings, only used while
initializing the hypervisor, but frees them right after.

As such, all calls to create_hyp_mappings() after kvm init has finished
turn into hypercalls, as the host now has no 'legal' way to modify the
hypervisor page tables directly.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 1 -
arch/arm64/kvm/arm.c | 62 +++++++++++++++++++++++++++++---
arch/arm64/kvm/mmu.c | 34 ++++++++++++++++++
3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index d7ebd73ec86f..6c8466a042a9 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -309,6 +309,5 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
*/
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
}
-
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 6af9204bcd5b..e524682c2ccf 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
kvm_flush_dcache_to_poc(params, sizeof(*params));
}

-static void cpu_init_hyp_mode(void)
+static void kvm_set_hyp_vector(void)
{
struct kvm_nvhe_init_params *params;
struct arm_smccc_res res;
@@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
params = this_cpu_ptr_nvhe_sym(kvm_init_params);
arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
+}
+
+static void cpu_init_hyp_mode(void)
+{
+ kvm_set_hyp_vector();

/*
* Disabling SSBD on a non-VHE system requires us to enable SSBS
@@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
void *vector = hyp_spectre_vector_selector[data->slot];

- *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
+ if (!is_protected_kvm_enabled())
+ *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
+ else
+ kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
}

static void cpu_hyp_reinit(void)
@@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);

cpu_hyp_reset();
- cpu_set_hyp_vector();

if (is_kernel_in_hyp_mode())
kvm_timer_init_vhe();
else
cpu_init_hyp_mode();

+ cpu_set_hyp_vector();
+
kvm_arm_init_debug();

if (vgic_present)
@@ -1714,13 +1723,52 @@ static int copy_cpu_ftr_regs(void)
return 0;
}

+static int kvm_hyp_enable_protection(void)
+{
+ void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
+ int ret, cpu;
+ void *addr;
+
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ if (!hyp_mem_base)
+ return -ENOMEM;
+
+ addr = phys_to_virt(hyp_mem_base);
+ ret = create_hyp_mappings(addr, addr + hyp_mem_size - 1, PAGE_HYP);
+ if (ret)
+ return ret;
+
+ preempt_disable();
+ kvm_set_hyp_vector();
+ ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
+ num_possible_cpus(), kern_hyp_va(per_cpu_base));
+ preempt_enable();
+ if (ret)
+ return ret;
+
+ free_hyp_pgds();
+ for_each_possible_cpu(cpu)
+ free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
+
+ return 0;
+}
+
/**
* Inits Hyp-mode on all online CPUs
*/
static int init_hyp_mode(void)
{
int cpu;
- int err = 0;
+ int err = -ENOMEM;
+
+ /*
+ * The protected Hyp-mode cannot be initialized if the memory pool
+ * allocation has failed.
+ */
+ if (is_protected_kvm_enabled() && !hyp_mem_base)
+ return err;

/*
* Copy the required CPU feature register in their EL2 counterpart
@@ -1854,6 +1902,12 @@ static int init_hyp_mode(void)
for_each_possible_cpu(cpu)
cpu_prepare_hyp_mode(cpu);

+ err = kvm_hyp_enable_protection();
+ if (err) {
+ kvm_err("Failed to enable hyp memory protection: %d\n", err);
+ goto out_err;
+ }
+
return 0;

out_err:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3cf9397dabdb..9d4c9251208e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -225,15 +225,39 @@ void free_hyp_pgds(void)
if (hyp_pgtable) {
kvm_pgtable_hyp_destroy(hyp_pgtable);
kfree(hyp_pgtable);
+ hyp_pgtable = NULL;
}
mutex_unlock(&kvm_hyp_pgd_mutex);
}

+static bool kvm_host_owns_hyp_mappings(void)
+{
+ if (static_branch_likely(&kvm_protected_mode_initialized))
+ return false;
+
+ /*
+ * This can happen at boot time when __create_hyp_mappings() is called
+ * after the hyp protection has been enabled, but the static key has
+ * not been flipped yet.
+ */
+ if (!hyp_pgtable && is_protected_kvm_enabled())
+ return false;
+
+ BUG_ON(!hyp_pgtable);
+
+ return true;
+}
+
static int __create_hyp_mappings(unsigned long start, unsigned long size,
unsigned long phys, enum kvm_pgtable_prot prot)
{
int err;

+ if (!kvm_host_owns_hyp_mappings()) {
+ return kvm_call_hyp_nvhe(__pkvm_create_mappings,
+ start, size, phys, prot);
+ }
+
mutex_lock(&kvm_hyp_pgd_mutex);
err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
mutex_unlock(&kvm_hyp_pgd_mutex);
@@ -295,6 +319,16 @@ static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
unsigned long base;
int ret = 0;

+ if (!kvm_host_owns_hyp_mappings()) {
+ base = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
+ phys_addr, size, prot);
+ if (!base)
+ return -ENOMEM;
+ *haddr = base;
+
+ return 0;
+ }
+
mutex_lock(&kvm_hyp_pgd_mutex);

/*
--
2.30.0.284.gd98b1dd5eaa7-goog
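
The ownership check that decides between mapping locally and issuing a hypercall in the patch above can be modelled as a small pure function. This is a sketch under stated assumptions: struct sketch_hyp_state is hypothetical and simply bundles the three pieces of state that kvm_host_owns_hyp_mappings() consults, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bundle of the state the helper looks at. */
struct sketch_hyp_state {
	bool protected_mode_initialized;	/* models the static key */
	bool protected_kvm_enabled;		/* models is_protected_kvm_enabled() */
	const void *hyp_pgtable;		/* NULL once the host freed its tables */
};

static inline bool sketch_host_owns_hyp_mappings(const struct sketch_hyp_state *s)
{
	/* Once protected mode is fully up, everything goes through EL2. */
	if (s->protected_mode_initialized)
		return false;

	/*
	 * Boot-time window: protection is already enabled and the host
	 * tables are gone, but the static key has not been flipped yet.
	 */
	if (!s->hyp_pgtable && s->protected_kvm_enabled)
		return false;

	/* Models BUG_ON(!hyp_pgtable). */
	assert(s->hyp_pgtable);

	return true;
}
```

The middle case is the subtle one: it covers the short interval between __pkvm_init succeeding and the static key being enabled.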

2021-01-08 12:21:48

by Quentin Perret

Subject: [RFC PATCH v2 09/26] KVM: arm64: Allow using kvm_nvhe_sym() in hyp code

In order to allow the usage of code shared by the host and the hyp in
static inline library functions, allow the usage of kvm_nvhe_sym() at EL2
by defaulting to the raw symbol name.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/hyp_image.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/hyp_image.h b/arch/arm64/include/asm/hyp_image.h
index e06842756051..fb16e1018ea9 100644
--- a/arch/arm64/include/asm/hyp_image.h
+++ b/arch/arm64/include/asm/hyp_image.h
@@ -7,11 +7,15 @@
#ifndef __ARM64_HYP_IMAGE_H__
#define __ARM64_HYP_IMAGE_H__

+#ifndef __KVM_NVHE_HYPERVISOR__
/*
* KVM nVHE code has its own symbol namespace prefixed with __kvm_nvhe_,
* to separate it from the kernel proper.
*/
#define kvm_nvhe_sym(sym) __kvm_nvhe_##sym
+#else
+#define kvm_nvhe_sym(sym) sym
+#endif

#ifdef LINKER_SCRIPT

--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-08 12:21:56

by Quentin Perret

Subject: [RFC PATCH v2 04/26] KVM: arm64: Initialize kvm_nvhe_init_params early

Move the initialization of kvm_nvhe_init_params into a dedicated function
that is run early, and only once during KVM init, rather than every time
the KVM vectors are set and reset.

This also opens the opportunity for the hypervisor to change the init
structs during boot, hence simplifying the replacement of host-provided
page-tables and stacks by the ones the hypervisor will create for
itself.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/arm.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 04c44853b103..3ac0f3425833 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1383,21 +1383,17 @@ static int kvm_init_vector_slots(void)
return 0;
}

-static void cpu_init_hyp_mode(void)
+static void cpu_prepare_hyp_mode(int cpu)
{
- struct kvm_nvhe_init_params *params = this_cpu_ptr_nvhe_sym(kvm_init_params);
- struct arm_smccc_res res;
+ struct kvm_nvhe_init_params *params = per_cpu_ptr_nvhe_sym(kvm_init_params, cpu);
unsigned long tcr;

- /* Switch from the HYP stub to our own HYP init vector */
- __hyp_set_vectors(kvm_get_idmap_vector());
-
/*
* Calculate the raw per-cpu offset without a translation from the
* kernel's mapping to the linear mapping, and store it in tpidr_el2
* so that we can use adr_l to access per-cpu variables in EL2.
*/
- params->tpidr_el2 = (unsigned long)this_cpu_ptr_nvhe_sym(__per_cpu_start) -
+ params->tpidr_el2 = (unsigned long)per_cpu_ptr_nvhe_sym(__per_cpu_start, cpu) -
(unsigned long)kvm_ksym_ref(CHOOSE_NVHE_SYM(__per_cpu_start));

params->mair_el2 = read_sysreg(mair_el1);
@@ -1421,7 +1417,7 @@ static void cpu_init_hyp_mode(void)
tcr |= (idmap_t0sz & GENMASK(TCR_TxSZ_WIDTH - 1, 0)) << TCR_T0SZ_OFFSET;
params->tcr_el2 = tcr;

- params->stack_hyp_va = kern_hyp_va(__this_cpu_read(kvm_arm_hyp_stack_page) + PAGE_SIZE);
+ params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
params->pgd_pa = kvm_mmu_get_httbr();

/*
@@ -1429,6 +1425,15 @@ static void cpu_init_hyp_mode(void)
* be read while the MMU is off.
*/
kvm_flush_dcache_to_poc(params, sizeof(*params));
+}
+
+static void cpu_init_hyp_mode(void)
+{
+ struct kvm_nvhe_init_params *params;
+ struct arm_smccc_res res;
+
+ /* Switch from the HYP stub to our own HYP init vector */
+ __hyp_set_vectors(kvm_get_idmap_vector());

/*
* Call initialization code, and switch to the full blown HYP code.
@@ -1437,6 +1442,7 @@ static void cpu_init_hyp_mode(void)
* cpus_have_const_cap() wrapper.
*/
BUG_ON(!system_capabilities_finalized());
+ params = this_cpu_ptr_nvhe_sym(kvm_init_params);
arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
WARN_ON(res.a0 != SMCCC_RET_SUCCESS);

@@ -1807,6 +1813,12 @@ static int init_hyp_mode(void)
goto out_err;
}

+ /*
+ * Prepare the CPU initialization parameters
+ */
+ for_each_possible_cpu(cpu)
+ cpu_prepare_hyp_mode(cpu);
+
return 0;

out_err:
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-08 12:21:59

by Quentin Perret

Subject: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

When memory protection is enabled, the hyp code will require a basic
form of memory management in order to allocate and free memory pages at
EL2. This is needed for various use-cases, including the creation of hyp
mappings or the allocation of stage 2 page tables.

To address these use-cases, introduce a simple memory allocator in the
hyp code. The allocator is designed as a conventional 'buddy allocator',
working with a page granularity. It allows allocating and freeing
physically contiguous pages from memory 'pools', with a guaranteed order
alignment in the PA space. Each page in a memory pool is associated
with a struct hyp_page which holds the page's metadata, including its
refcount, as well as its current order, hence mimicking the kernel's
buddy system in the GFP infrastructure. The hyp_page metadata are made
accessible through a hyp_vmemmap, following the concept of
SPARSE_VMEMMAP in the kernel.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 32 ++++
arch/arm64/kvm/hyp/include/nvhe/memory.h | 25 +++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 185 +++++++++++++++++++++++
4 files changed, 243 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c

diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
new file mode 100644
index 000000000000..95587faee171
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_HYP_GFP_H
+#define __KVM_HYP_GFP_H
+
+#include <linux/list.h>
+
+#include <nvhe/memory.h>
+#include <nvhe/spinlock.h>
+
+#define HYP_MAX_ORDER 11U
+#define HYP_NO_ORDER UINT_MAX
+
+struct hyp_pool {
+ hyp_spinlock_t lock;
+ struct list_head free_area[HYP_MAX_ORDER + 1];
+ phys_addr_t range_start;
+ phys_addr_t range_end;
+};
+
+/* GFP flags */
+#define HYP_GFP_NONE 0
+#define HYP_GFP_ZERO 1
+
+/* Allocation */
+void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);
+void hyp_get_page(void *addr);
+void hyp_put_page(void *addr);
+
+/* Used pages cannot be freed */
+int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
+ unsigned int nr_pages, unsigned int used_pages);
+#endif /* __KVM_HYP_GFP_H */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
index 64c44c142c95..ed47674bc988 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
@@ -6,7 +6,17 @@

#include <linux/types.h>

+struct hyp_pool;
+struct hyp_page {
+ unsigned int refcount;
+ unsigned int order;
+ struct hyp_pool *pool;
+ struct list_head node;
+};
+
extern s64 hyp_physvirt_offset;
+extern u64 __hyp_vmemmap;
+#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)

#define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
#define __hyp_va(virt) ((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
@@ -21,4 +31,19 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
return __hyp_pa(addr);
}

+#define hyp_phys_to_pfn(phys) ((phys) >> PAGE_SHIFT)
+#define hyp_phys_to_page(phys) (&hyp_vmemmap[hyp_phys_to_pfn(phys)])
+#define hyp_virt_to_page(virt) hyp_phys_to_page(__hyp_pa(virt))
+
+#define hyp_page_to_phys(page) ((phys_addr_t)((page) - hyp_vmemmap) << PAGE_SHIFT)
+#define hyp_page_to_virt(page) __hyp_va(hyp_page_to_phys(page))
+#define hyp_page_to_pool(page) (((struct hyp_page *)(page))->pool)
+
+static inline int hyp_page_count(void *addr)
+{
+ struct hyp_page *p = hyp_virt_to_page(addr);
+
+ return p->refcount;
+}
+
#endif /* __KVM_HYP_MEMORY_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 33bd381d8f73..9e5eacfec6ec 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
- hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
+ hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
new file mode 100644
index 000000000000..6de6515f0432
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Google LLC
+ * Author: Quentin Perret <[email protected]>
+ */
+
+#include <asm/kvm_hyp.h>
+#include <nvhe/gfp.h>
+
+u64 __hyp_vmemmap;
+
+/*
+ * Example buddy-tree for a 4-pages physically contiguous pool:
+ *
+ * o : Page 3
+ * /
+ * o-o : Page 2
+ * /
+ * / o : Page 1
+ * / /
+ * o---o-o : Page 0
+ * Order 2 1 0
+ *
+ * Example of requests on this pool:
+ * __find_buddy(pool, page 0, order 0) => page 1
+ * __find_buddy(pool, page 0, order 1) => page 2
+ * __find_buddy(pool, page 1, order 0) => page 0
+ * __find_buddy(pool, page 2, order 0) => page 3
+ */
+static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
+ unsigned int order)
+{
+ phys_addr_t addr = hyp_page_to_phys(p);
+
+ addr ^= (PAGE_SIZE << order);
+ if (addr < pool->range_start || addr >= pool->range_end)
+ return NULL;
+
+ return hyp_phys_to_page(addr);
+}
+
+static void __hyp_attach_page(struct hyp_pool *pool,
+ struct hyp_page *p)
+{
+ unsigned int order = p->order;
+ struct hyp_page *buddy;
+
+ p->order = HYP_NO_ORDER;
+ for (; order < HYP_MAX_ORDER; order++) {
+ /* Nothing to do if the buddy isn't in a free-list */
+ buddy = __find_buddy(pool, p, order);
+		if (!buddy || list_empty(&buddy->node) || buddy->order != order)
+			break;
+
+		/* Otherwise, coalesce the buddies and go one level up */
+		list_del_init(&buddy->node);
+		buddy->order = HYP_NO_ORDER;
+		p = (p < buddy) ? p : buddy;
+	}
+
+	p->order = order;
+	list_add_tail(&p->node, &pool->free_area[order]);
+}
+
+void hyp_put_page(void *addr)
+{
+	struct hyp_page *p = hyp_virt_to_page(addr);
+	struct hyp_pool *pool = hyp_page_to_pool(p);
+
+	hyp_spin_lock(&pool->lock);
+	if (!p->refcount)
+		hyp_panic();
+	p->refcount--;
+	if (!p->refcount)
+		__hyp_attach_page(pool, p);
+	hyp_spin_unlock(&pool->lock);
+}
+
+void hyp_get_page(void *addr)
+{
+	struct hyp_page *p = hyp_virt_to_page(addr);
+	struct hyp_pool *pool = hyp_page_to_pool(p);
+
+	hyp_spin_lock(&pool->lock);
+	p->refcount++;
+	hyp_spin_unlock(&pool->lock);
+}
+
+/* Extract a page from the buddy tree, at a specific order */
+static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
+					   struct hyp_page *p,
+					   unsigned int order)
+{
+	struct hyp_page *buddy;
+
+	if (p->order == HYP_NO_ORDER || p->order < order)
+		return NULL;
+
+	list_del_init(&p->node);
+
+	/* Split the page in two until reaching the requested order */
+	while (p->order > order) {
+		p->order--;
+		buddy = __find_buddy(pool, p, p->order);
+		buddy->order = p->order;
+		list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
+	}
+
+	p->refcount = 1;
+
+	return p;
+}
+
+static void clear_hyp_page(struct hyp_page *p)
+{
+	unsigned long i;
+
+	for (i = 0; i < (1 << p->order); i++)
+		clear_page(hyp_page_to_virt(p) + (i << PAGE_SHIFT));
+}
+
+static void *__hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask,
+			       unsigned int order)
+{
+	unsigned int i = order;
+	struct hyp_page *p;
+
+	/* Look for a high-enough-order page */
+	while (i <= HYP_MAX_ORDER && list_empty(&pool->free_area[i]))
+		i++;
+	if (i > HYP_MAX_ORDER)
+		return NULL;
+
+	/* Extract it from the tree at the right order */
+	p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
+	p = __hyp_extract_page(pool, p, order);
+
+	if (mask & HYP_GFP_ZERO)
+		clear_hyp_page(p);
+
+	return p;
+}
+
+void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order)
+{
+	struct hyp_page *p;
+
+	hyp_spin_lock(&pool->lock);
+	p = __hyp_alloc_pages(pool, mask, order);
+	hyp_spin_unlock(&pool->lock);
+
+	return p ? hyp_page_to_virt(p) : NULL;
+}
+
+/* hyp_vmemmap must be backed beforehand */
+int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
+		  unsigned int nr_pages, unsigned int used_pages)
+{
+	struct hyp_page *p;
+	int i;
+
+	if (phys % PAGE_SIZE)
+		return -EINVAL;
+
+	hyp_spin_lock_init(&pool->lock);
+	for (i = 0; i <= HYP_MAX_ORDER; i++)
+		INIT_LIST_HEAD(&pool->free_area[i]);
+	pool->range_start = phys;
+	pool->range_end = phys + (nr_pages << PAGE_SHIFT);
+
+	/* Init the vmemmap portion */
+	p = hyp_phys_to_page(phys);
+	memset(p, 0, sizeof(*p) * nr_pages);
+	for (i = 0; i < nr_pages; i++, p++) {
+		p->pool = pool;
+		INIT_LIST_HEAD(&p->node);
+	}
+
+	/* Attach the unused pages to the buddy tree */
+	p = hyp_phys_to_page(phys + (used_pages << PAGE_SHIFT));
+	for (i = used_pages; i < nr_pages; i++, p++)
+		__hyp_attach_page(pool, p);
+
+	return 0;
+}
--
2.30.0.284.gd98b1dd5eaa7-goog
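
For readers following the coalescing loop in __hyp_attach_page(), here is a minimal user-space sketch of the same idea. It is not the patch's code: it assumes the classic buddy rule that the buddy of the block starting at page index idx, at a given order, sits at idx ^ (1 << order) (the patch's __find_buddy() is not shown in this part of the thread), and it replaces the per-order free lists with a small array. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER 3                      /* hypothetical: a pool of 8 pages */
#define TOY_NR_PAGES  (1u << TOY_MAX_ORDER)
#define TOY_NO_ORDER  0xffu                  /* plays the role of HYP_NO_ORDER */

struct toy_pool {
	/* order[i]: order of the free block headed by page i, or TOY_NO_ORDER */
	unsigned char order[TOY_NR_PAGES];
};

static void toy_pool_init(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		pool->order[i] = TOY_NO_ORDER;
}

/* Assumed buddy rule: flip the bit that corresponds to the block's order. */
static unsigned int toy_find_buddy(unsigned int idx, unsigned int order)
{
	return idx ^ (1u << order);
}

/* Free one order-0 page, merging upwards; returns the final block order. */
static unsigned int toy_attach_page(struct toy_pool *pool, unsigned int idx)
{
	unsigned int order = 0;

	assert(idx < TOY_NR_PAGES);
	for (; order < TOY_MAX_ORDER; order++) {
		unsigned int buddy = toy_find_buddy(idx, order);

		/* Nothing to merge if the buddy is not free at this exact order */
		if (pool->order[buddy] != order)
			break;

		/* Coalesce: the lower of the two indices heads the merged block */
		pool->order[buddy] = TOY_NO_ORDER;
		if (buddy < idx)
			idx = buddy;
	}

	pool->order[idx] = order;
	return order;
}
```

Freeing pages 0 and 1 in turn yields one order-1 block headed by page 0; freeing 3 and then 2 afterwards merges all four pages into a single order-2 block, which is the "go one level up" step in the loop above.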

2021-01-08 12:23:36

by Quentin Perret

Subject: [RFC PATCH v2 18/26] KVM: arm64: Use kvm_arch for stage 2 pgtable

In order to make use of the stage 2 pgtable code for the host stage 2,
use struct kvm_arch in lieu of struct kvm as the host will have the
former but not the latter.

Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 5 +++--
arch/arm64/kvm/hyp/pgtable.c | 6 +++---
arch/arm64/kvm/mmu.c | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 45acc9dc6c45..8e8f1d2c5e0e 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -151,12 +151,13 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
 /**
  * kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
  * @pgt: Uninitialised page-table structure to initialise.
- * @kvm: KVM structure representing the guest virtual machine.
+ * @arch: Arch-specific KVM structure representing the guest virtual
+ *        machine.
  * @mm_ops: Memory management callbacks.
  *
  * Return: 0 on success, negative error code on failure.
  */
-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch,
 			    struct kvm_pgtable_mm_ops *mm_ops);

/**
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 61a8a34ddfdb..96a25d0b7b6e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -855,11 +855,11 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
 	return kvm_pgtable_walk(pgt, addr, size, &walker);
 }

-int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
+int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch,
 			    struct kvm_pgtable_mm_ops *mm_ops)
 {
 	size_t pgd_sz;
-	u64 vtcr = kvm->arch.vtcr;
+	u64 vtcr = arch->vtcr;
 	u32 ia_bits = VTCR_EL2_IPA(vtcr);
 	u32 sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
 	u32 start_level = VTCR_EL2_TGRAN_SL0_BASE - sl0;
@@ -872,7 +872,7 @@ int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm *kvm,
 	pgt->ia_bits = ia_bits;
 	pgt->start_level = start_level;
 	pgt->mm_ops = mm_ops;
-	pgt->mmu = &kvm->arch.mmu;
+	pgt->mmu = &arch->mmu;

 	/* Ensure zeroed PGD pages are visible to the hardware walker */
 	dsb(ishst);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9d4c9251208e..7e6263103943 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -461,7 +461,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
 	if (!pgt)
 		return -ENOMEM;

-	err = kvm_pgtable_stage2_init(pgt, kvm, &kvm_s2_mm_ops);
+	err = kvm_pgtable_stage2_init(pgt, &kvm->arch, &kvm_s2_mm_ops);
 	if (err)
 		goto out_free_pgtable;

--
2.30.0.284.gd98b1dd5eaa7-goog
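
The reasoning in the commit message (the host has a struct kvm_arch but no struct kvm) can be illustrated with a small stand-alone sketch. The toy_* types below are hypothetical stand-ins, not the kernel's definitions; the only point is that an init function taking the embedded structure serves both a guest, which owns the full container, and a caller that only has the inner structure.

```c
#include <assert.h>
#include <stdint.h>

struct toy_s2_mmu {
	int vmid;
};

/* Stand-in for the arch-specific part: all the init function needs */
struct toy_arch {
	uint64_t vtcr;
	struct toy_s2_mmu mmu;
};

/* Stand-in for the full VM container; only guests have one */
struct toy_vm {
	int nr_vcpus;
	struct toy_arch arch;
};

struct toy_pgtable {
	uint64_t vtcr;
	struct toy_s2_mmu *mmu;
};

/* Takes the embedded structure, so no container is required */
static void toy_stage2_init(struct toy_pgtable *pgt, struct toy_arch *arch)
{
	assert(pgt && arch);
	pgt->vtcr = arch->vtcr;
	pgt->mmu = &arch->mmu;
}
```

A guest would call toy_stage2_init(&pgt, &vm.arch), mirroring the mmu.c hunk above, while a host-like caller passes its stand-alone struct toy_arch directly.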

2021-01-11 14:48:36

by Rob Herring

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
>
> Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> of the memory regions parsed from DT. This will be needed in the context
> of the protected nVHE feature of KVM/arm64 where the code running at EL2
> will be cleanly separated from the host kernel during boot, and will
> need its own representation of memory.

What happened to doing this with memblock?

> Signed-off-by: Quentin Perret <[email protected]>
> ---
> drivers/of/fdt.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index 4602e467ca8b..af2b5a09c5b4 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1099,6 +1099,10 @@ int __init early_init_dt_scan_chosen(unsigned long node, const char *uname,
> #define MAX_MEMBLOCK_ADDR ((phys_addr_t)~0)
> #endif
>
> +void __init __weak early_init_dt_add_memory_hyp(u64 base, u64 size)
> +{
> +}
> +
> void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
> {
> const u64 phys_offset = MIN_MEMBLOCK_ADDR;
> @@ -1139,6 +1143,7 @@ void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
> base = phys_offset;
> }
> memblock_add(base, size);
> + early_init_dt_add_memory_hyp(base, size);
> }
>
> int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
> --
> 2.30.0.284.gd98b1dd5eaa7-goog
>

2021-01-12 12:34:15

by Quentin Perret

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> >
> > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > of the memory regions parsed from DT. This will be needed in the context
> > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > will be cleanly separated from the host kernel during boot, and will
> > need its own representation of memory.
>
> What happened to doing this with memblock?

I gave it a go, but as mentioned in v1, I ran into issues for nomap
regions. I want the hypervisor to know about these memory regions (it's
possible some of those will be given to protected guests for instance)
but these seem to be entirely removed from the memblocks when using DT:

https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153

EFI appears to do things differently, though, as it 'just' uses
memblock_mark_nomap() instead of actively removing the memblock. And that
means I could actually use the memblock API for EFI, but I'd rather
have a common solution. I tried to understand why things are done
differently but couldn't find an answer and kept things simple and
working for now.

Is there a good reason for not using memblock_mark_nomap() with DT? If
not, I'm happy to try that.

Thanks,
Quentin
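
The difference described above, removing a nomap region versus merely flagging it, can be modelled in a few lines. This is a toy user-space model under stated assumptions, not the memblock implementation: the toy_* names, the fixed-size array and the exact-match lookup are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 8
#define TOY_FLAG_NOMAP  0x1u

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

struct toy_memmap {
	struct toy_region reg[TOY_MAX_REGIONS];
	size_t nr;
};

static void toy_add(struct toy_memmap *m, uint64_t base, uint64_t size)
{
	assert(m->nr < TOY_MAX_REGIONS);
	m->reg[m->nr++] = (struct toy_region){ .base = base, .size = size };
}

/* Simplification: only whole regions are looked up, no splitting */
static struct toy_region *toy_find(struct toy_memmap *m, uint64_t base,
				   uint64_t size)
{
	for (size_t i = 0; i < m->nr; i++) {
		if (m->reg[i].base == base && m->reg[i].size == size)
			return &m->reg[i];
	}
	return NULL;
}

/* Removal: the region is forgotten entirely */
static void toy_remove(struct toy_memmap *m, uint64_t base, uint64_t size)
{
	struct toy_region *r = toy_find(m, base, size);

	if (r)
		*r = m->reg[--m->nr];
}

/* Marking: the region stays listed but is flagged as not to be mapped */
static void toy_mark_nomap(struct toy_memmap *m, uint64_t base, uint64_t size)
{
	struct toy_region *r = toy_find(m, base, size);

	if (r)
		r->flags |= TOY_FLAG_NOMAP;
}

/* Sum region sizes; nomap regions count only if include_nomap is set */
static uint64_t toy_total(const struct toy_memmap *m, int include_nomap)
{
	uint64_t total = 0;

	for (size_t i = 0; i < m->nr; i++) {
		if (include_nomap || !(m->reg[i].flags & TOY_FLAG_NOMAP))
			total += m->reg[i].size;
	}
	return total;
}
```

With either approach the mappable total is the same, but only the marked variant lets a later consumer (here, the hypervisor) still discover the nomap region by walking the list.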

2021-01-12 14:16:32

by Rob Herring

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tue, Jan 12, 2021 at 3:51 AM Quentin Perret <[email protected]> wrote:
>
> On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> > On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> > >
> > > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > > of the memory regions parsed from DT. This will be needed in the context
> > > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > > will be cleanly separated from the host kernel during boot, and will
> > > need its own representation of memory.
> >
> > What happened to doing this with memblock?
>
> I gave it a go, but as mentioned in v1, I ran into issues for nomap
> regions. I want the hypervisor to know about these memory regions (it's
> possible some of those will be given to protected guests for instance)
> but these seem to be entirely removed from the memblocks when using DT:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153
>
> EFI appears to do things differently, though, as it 'just' uses
> memblock_mark_nomap() instead of actively removing the memblock. And that
> means I could actually use the memblock API for EFI, but I'd rather
> have a common solution. I tried to understand why things are done
> differently but couldn't find an answer and kept things simple and
> working for now.
>
> Is there a good reason for not using memblock_mark_nomap() with DT? If
> not, I'm happy to try that.

There were 2 patches to do that, but it never got resolved. See here[1].

Rob

[1] https://lore.kernel.org/linux-devicetree/?q=s%3Ano-map

2021-01-12 14:29:21

by Quentin Perret

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tuesday 12 Jan 2021 at 08:10:47 (-0600), Rob Herring wrote:
> On Tue, Jan 12, 2021 at 3:51 AM Quentin Perret <[email protected]> wrote:
> >
> > On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> > > On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> > > >
> > > > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > > > of the memory regions parsed from DT. This will be needed in the context
> > > > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > > > will be cleanly separated from the host kernel during boot, and will
> > > > need its own representation of memory.
> > >
> > > What happened to doing this with memblock?
> >
> > I gave it a go, but as mentioned in v1, I ran into issues for nomap
> > regions. I want the hypervisor to know about these memory regions (it's
> > possible some of those will be given to protected guests for instance)
> > but these seem to be entirely removed from the memblocks when using DT:
> >
> > https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153
> >
> > EFI appears to do things differently, though, as it 'just' uses
> > memblock_mark_nomap() instead of actively removing the memblock. And that
> > means I could actually use the memblock API for EFI, but I'd rather
> > have a common solution. I tried to understand why things are done
> > differently but couldn't find an answer and kept things simple and
> > working for now.
> >
> > Is there a good reason for not using memblock_mark_nomap() with DT? If
> > not, I'm happy to try that.
>
> There were 2 patches to do that, but it never got resolved. See here[1].

Thanks. So the DT stuff predates the introduction of memblock_mark_nomap,
that's why...

By reading the discussions, [1] still looks a sensible patch on its own,
independently from the issue Nicolas tried to solve. Any reason for not
applying it?

I'll try to rework my series on top and see how that goes.

Thanks,
Quentin

[1] https://lore.kernel.org/linux-devicetree/[email protected]/

2021-01-12 16:18:28

by Quentin Perret

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tuesday 12 Jan 2021 at 09:53:36 (-0600), Rob Herring wrote:
> On Tue, Jan 12, 2021 at 8:26 AM Quentin Perret <[email protected]> wrote:
> >
> > On Tuesday 12 Jan 2021 at 08:10:47 (-0600), Rob Herring wrote:
> > > On Tue, Jan 12, 2021 at 3:51 AM Quentin Perret <[email protected]> wrote:
> > > >
> > > > On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> > > > > On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> > > > > >
> > > > > > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > > > > > of the memory regions parsed from DT. This will be needed in the context
> > > > > > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > > > > > will be cleanly separated from the host kernel during boot, and will
> > > > > > need its own representation of memory.
> > > > >
> > > > > What happened to doing this with memblock?
> > > >
> > > > I gave it a go, but as mentioned in v1, I ran into issues for nomap
> > > > regions. I want the hypervisor to know about these memory regions (it's
> > > > possible some of those will be given to protected guests for instance)
> > > > but these seem to be entirely removed from the memblocks when using DT:
> > > >
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153
> > > >
> > > > EFI appears to do things differently, though, as it 'just' uses
> > > > memblock_mark_nomap() instead of actively removing the memblock. And that
> > > > means I could actually use the memblock API for EFI, but I'd rather
> > > > have a common solution. I tried to understand why things are done
> > > > differently but couldn't find an answer and kept things simple and
> > > > working for now.
> > > >
> > > > Is there a good reason for not using memblock_mark_nomap() with DT? If
> > > > not, I'm happy to try that.
> > >
> > > There were 2 patches to do that, but it never got resolved. See here[1].
> >
> > Thanks. So the DT stuff predates the introduction of memblock_mark_nomap,
> > that's why...
> >
> > By reading the discussions, [1] still looks a sensible patch on its own,
> > independently from the issue Nicolas tried to solve. Any reason for not
> > applying it?
>
> As I mentioned in the thread, same patch with 2 different reasons. So
> I just wanted a better commit message covering both.

Sorry if I'm being thick, but I'm not seeing it. How are they the same?
IIUC, as per Nicolas' last reply, using memblock_mark_nomap() does not
solve his issue with a broken DT. These two patches address two completely
separate issues, no?

Thanks,
Quentin

2021-01-12 16:49:25

by Rob Herring

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tue, Jan 12, 2021 at 10:15 AM Quentin Perret <[email protected]> wrote:
>
> On Tuesday 12 Jan 2021 at 09:53:36 (-0600), Rob Herring wrote:
> > On Tue, Jan 12, 2021 at 8:26 AM Quentin Perret <[email protected]> wrote:
> > >
> > > On Tuesday 12 Jan 2021 at 08:10:47 (-0600), Rob Herring wrote:
> > > > On Tue, Jan 12, 2021 at 3:51 AM Quentin Perret <[email protected]> wrote:
> > > > >
> > > > > On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> > > > > > On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> > > > > > >
> > > > > > > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > > > > > > of the memory regions parsed from DT. This will be needed in the context
> > > > > > > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > > > > > > will be cleanly separated from the host kernel during boot, and will
> > > > > > > need its own representation of memory.
> > > > > >
> > > > > > What happened to doing this with memblock?
> > > > >
> > > > > I gave it a go, but as mentioned in v1, I ran into issues for nomap
> > > > > regions. I want the hypervisor to know about these memory regions (it's
> > > > > possible some of those will be given to protected guests for instance)
> > > > > but these seem to be entirely removed from the memblocks when using DT:
> > > > >
> > > > > https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153
> > > > >
> > > > > EFI appears to do things differently, though, as it 'just' uses
> > > > > memblock_mark_nomap() instead of actively removing the memblock. And that
> > > > > means I could actually use the memblock API for EFI, but I'd rather
> > > > > have a common solution. I tried to understand why things are done
> > > > > differently but couldn't find an answer and kept things simple and
> > > > > working for now.
> > > > >
> > > > > Is there a good reason for not using memblock_mark_nomap() with DT? If
> > > > > not, I'm happy to try that.
> > > >
> > > > There were 2 patches to do that, but it never got resolved. See here[1].
> > >
> > > Thanks. So the DT stuff predates the introduction of memblock_mark_nomap,
> > > that's why...
> > >
> > > By reading the discussions, [1] still looks a sensible patch on its own,
> > > independently from the issue Nicolas tried to solve. Any reason for not
> > > applying it?
> >
> > As I mentioned in the thread, same patch with 2 different reasons. So
> > I just wanted a better commit message covering both.
>
> Sorry if I'm being thick, but I'm not seeing it. How are they the same?
> IIUC, as per Nicolas' last reply, using memblock_mark_nomap() does not
> solve his issue with a broken DT. These 2 patches address two completely
> separate issues no?

Umm, yes you are right. But both are dealing with nomap. So someone
needs to sort out what the right thing to do here is. No one cared
enough to follow up in a year and a half.

Rob

2021-01-12 16:54:59

by Quentin Perret

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tuesday 12 Jan 2021 at 10:45:56 (-0600), Rob Herring wrote:
> Umm, yes you are right. But both are dealing with nomap. So someone
> needs to sort out what the right thing to do here is. No one cared
> enough to follow up in a year and a half.

Fair enough, happy to do that. I'll send a small series with these two
patches independently from this series which may take a while to land.

Thanks,
Quentin

2021-01-13 02:17:30

by Rob Herring

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tue, Jan 12, 2021 at 8:26 AM Quentin Perret <[email protected]> wrote:
>
> On Tuesday 12 Jan 2021 at 08:10:47 (-0600), Rob Herring wrote:
> > On Tue, Jan 12, 2021 at 3:51 AM Quentin Perret <[email protected]> wrote:
> > >
> > > On Monday 11 Jan 2021 at 08:45:10 (-0600), Rob Herring wrote:
> > > > On Fri, Jan 8, 2021 at 6:16 AM Quentin Perret <[email protected]> wrote:
> > > > >
> > > > > Introduce early_init_dt_add_memory_hyp() to allow KVM to conserve a copy
> > > > > of the memory regions parsed from DT. This will be needed in the context
> > > > > of the protected nVHE feature of KVM/arm64 where the code running at EL2
> > > > > will be cleanly separated from the host kernel during boot, and will
> > > > > need its own representation of memory.
> > > >
> > > > What happened to doing this with memblock?
> > >
> > > I gave it a go, but as mentioned in v1, I ran into issues for nomap
> > > regions. I want the hypervisor to know about these memory regions (it's
> > > possible some of those will be given to protected guests for instance)
> > > but these seem to be entirely removed from the memblocks when using DT:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/of/fdt.c#L1153
> > >
> > > EFI appears to do things differently, though, as it 'just' uses
> > > memblock_mark_nomap() instead of actively removing the memblock. And that
> > > means I could actually use the memblock API for EFI, but I'd rather
> > > have a common solution. I tried to understand why things are done
> > > differently but couldn't find an answer and kept things simple and
> > > working for now.
> > >
> > > Is there a good reason for not using memblock_mark_nomap() with DT? If
> > > not, I'm happy to try that.
> >
> > There were 2 patches to do that, but it never got resolved. See here[1].
>
> Thanks. So the DT stuff predates the introduction of memblock_mark_nomap,
> that's why...
>
> By reading the discussions, [1] still looks a sensible patch on its own,
> independently from the issue Nicolas tried to solve. Any reason for not
> applying it?

As I mentioned in the thread, same patch with 2 different reasons. So
I just wanted a better commit message covering both.

Rob

2021-01-13 11:35:32

by Marc Zyngier

Subject: Re: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

Hi Quentin,

On 2021-01-08 12:15, Quentin Perret wrote:
> Introduce the infrastructure in KVM enabling to copy CPU feature
> registers into EL2-owned data-structures, to allow reading sanitised
> values directly at EL2 in nVHE.
>
> Given that only a subset of these features are being read by the
> hypervisor, the ones that need to be copied are to be listed under
> <asm/kvm_cpufeature.h> together with the name of the nVHE variable that
> will hold the copy.
>
> While at it, introduce the first user of this infrastructure by
> implementing __flush_dcache_area at EL2, which needs
> arm64_ftr_reg_ctrel0.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/cpufeature.h | 1 +
> arch/arm64/include/asm/kvm_cpufeature.h | 17 ++++++++++++++
> arch/arm64/kernel/cpufeature.c | 12 ++++++++++
> arch/arm64/kvm/arm.c | 31 +++++++++++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
> arch/arm64/kvm/hyp/nvhe/cache.S | 13 +++++++++++
> arch/arm64/kvm/hyp/nvhe/cpufeature.c | 8 +++++++
> 7 files changed, 84 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
> create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 16063c813dcd..742e9bcc051b 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -600,6 +600,7 @@ void __init setup_cpu_features(void);
> void check_local_cpu_capabilities(void);
>
> u64 read_sanitised_ftr_reg(u32 id);
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst);
>
> static inline bool cpu_supports_mixed_endian_el0(void)
> {
> diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> new file mode 100644
> index 000000000000..d34f85cba358
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <[email protected]>
> + */
> +
> +#include <asm/cpufeature.h>
> +
> +#ifndef KVM_HYP_CPU_FTR_REG
> +#if defined(__KVM_NVHE_HYPERVISOR__)
> +#define KVM_HYP_CPU_FTR_REG(id, name) extern struct arm64_ftr_reg name;
> +#else
> +#define KVM_HYP_CPU_FTR_REG(id, name) DECLARE_KVM_NVHE_SYM(name);
> +#endif
> +#endif
> +
> +KVM_HYP_CPU_FTR_REG(SYS_CTR_EL0, arm64_ftr_reg_ctrel0)
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index bc3549663957..c2019aaaadc3 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1113,6 +1113,18 @@ u64 read_sanitised_ftr_reg(u32 id)
> }
> EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
>
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst)
> +{
> + struct arm64_ftr_reg *regp = get_arm64_ftr_reg(id);
> +
> + if (!regp)
> + return -EINVAL;
> +
> + memcpy(dst, regp, sizeof(*regp));
> +
> + return 0;
> +}
> +
> #define read_sysreg_case(r) \
> case r: return read_sysreg_s(r)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 51b53ca36dc5..9fd769349e9e 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -34,6 +34,7 @@
> #include <asm/virt.h>
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> +#include <asm/kvm_cpufeature.h>
> #include <asm/kvm_mmu.h>
> #include <asm/kvm_emulate.h>
> #include <asm/sections.h>
> @@ -1697,6 +1698,29 @@ static void teardown_hyp_mode(void)
> }
> }
>
> +#undef KVM_HYP_CPU_FTR_REG
> +#define KVM_HYP_CPU_FTR_REG(id, name) \
> + { .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
> +static const struct __ftr_reg_copy_entry {
> + u32 sys_id;
> + struct arm64_ftr_reg *dst;

Why do we need the whole data structure? Can't we just live with
sys_val?

> +} hyp_ftr_regs[] = {
> + #include <asm/kvm_cpufeature.h>
> +};

Can't this be made __initdata?

> +
> +static int copy_cpu_ftr_regs(void)
> +{
> + int i, ret;
> +
> + for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
> + ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> /**
> * Inits Hyp-mode on all online CPUs
> */
> @@ -1705,6 +1729,13 @@ static int init_hyp_mode(void)
> int cpu;
> int err = 0;
>
> + /*
> + * Copy the required CPU feature register in their EL2 counterpart
> + */
> + err = copy_cpu_ftr_regs();
> + if (err)
> + return err;
> +

Just to keep things together, please move any sysreg manipulation into
sys_regs.c, most probably into kvm_sys_reg_table_init().

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2021-01-13 14:26:05

by Quentin Perret

Subject: Re: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

Hey Marc,

On Wednesday 13 Jan 2021 at 11:33:13 (+0000), Marc Zyngier wrote:
> > +#undef KVM_HYP_CPU_FTR_REG
> > +#define KVM_HYP_CPU_FTR_REG(id, name) \
> > + { .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
> > +static const struct __ftr_reg_copy_entry {
> > + u32 sys_id;
> > + struct arm64_ftr_reg *dst;
>
> Why do we need the whole data structure? Can't we just live with sys_val?

I don't have a use-case for anything else than sys_val, so yes I think I
should be able to simplify. I'll try that for v3.

>
> > +} hyp_ftr_regs[] = {
> > + #include <asm/kvm_cpufeature.h>
> > +};
>
> Can't this be made __initdata?

Good point, that would be nice indeed. Can I use that from outside an
__init function? If not, I'll need to rework the code a bit more, but
that should be simple enough either way.

> > +
> > +static int copy_cpu_ftr_regs(void)
> > +{
> > + int i, ret;
> > +
> > + for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
> > + ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * Inits Hyp-mode on all online CPUs
> > */
> > @@ -1705,6 +1729,13 @@ static int init_hyp_mode(void)
> > int cpu;
> > int err = 0;
> >
> > + /*
> > + * Copy the required CPU feature register in their EL2 counterpart
> > + */
> > + err = copy_cpu_ftr_regs();
> > + if (err)
> > + return err;
> > +
>
> Just to keep things together, please move any sysreg manipulation into
> sys_regs.c, most probably into kvm_sys_reg_table_init().

Will do.

Thanks,
Quentin
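
As a rough illustration of the simplification discussed above (copying only the sanitised value rather than the whole struct arm64_ftr_reg), here is a stand-alone sketch of the same list-macro pattern. It is an assumption about what such a rework could look like, not the v3 code: the register IDs, values and every toy_* name are made up, and the list lives in a macro instead of a re-included header so the example fits in one file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register list: (id, destination variable) pairs */
#define TOY_FTR_REGS(X)			\
	X(0x10, toy_ftr_ctr)		\
	X(0x20, toy_ftr_mmfr0)

/* Expansion 1: one 64-bit destination per listed register */
#define TOY_DECLARE(id, name) static uint64_t name;
TOY_FTR_REGS(TOY_DECLARE)

/* Expansion 2: the copy table, built from the same list */
struct toy_copy_entry {
	uint32_t sys_id;
	uint64_t *dst;
};

#define TOY_ENTRY(id, name) { .sys_id = (id), .dst = &name },
static const struct toy_copy_entry toy_ftr_table[] = {
	TOY_FTR_REGS(TOY_ENTRY)
};

/* Toy stand-in for a sanitised register lookup; values are arbitrary */
static int toy_read_sanitised(uint32_t id, uint64_t *val)
{
	switch (id) {
	case 0x10:
		*val = 0x84448004ULL;
		return 0;
	case 0x20:
		*val = 0x1122ULL;
		return 0;
	default:
		return -1;
	}
}

/* Walk the table and fill each destination, stopping at the first error */
static int toy_copy_ftr_regs(void)
{
	size_t n = sizeof(toy_ftr_table) / sizeof(toy_ftr_table[0]);

	for (size_t i = 0; i < n; i++) {
		int ret = toy_read_sanitised(toy_ftr_table[i].sys_id,
					     toy_ftr_table[i].dst);
		if (ret)
			return ret;
	}
	return 0;
}
```

Adding a register then means adding one line to the list; both the destination variable and the table entry follow from it, which is the property the KVM_HYP_CPU_FTR_REG() scheme in the patch relies on.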

2021-01-13 14:39:42

by Quentin Perret

Subject: Re: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
> Good point, that would be nice indeed. Can I use that from outside an
> __init function?

Just gave it a go, and the answer to this appears to be yes,
surprisingly -- I was expecting a compile-time warning similar to what
we get when non-__init code calls into __init, but that doesn't seem to
trigger here. Anyways, I'll add the annotation in v3.

Thanks,
Quentin

2021-01-13 17:31:12

by Marc Zyngier

Subject: Re: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

On 2021-01-13 14:35, Quentin Perret wrote:
> On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
>> Good point, that would be nice indeed. Can I use that from outside an
>> __init function?
>
> Just gave it a go, and the answer to this appears to be yes,
> surprisingly -- I was expecting a compile-time warning similar to what
> we get when non-__init code calls into __init, but that doesn't seem to
> trigger here. Anyways, I'll add the annotation in v3.

That's surprising. I'd definitely expect something to explode...
Do you have CONFIG_DEBUG_SECTION_MISMATCH=y?

M.
--
Jazz is not dead. It just smells funny...

2021-01-13 18:31:10

by Quentin Perret

Subject: Re: [RFC PATCH v2 13/26] KVM: arm64: Enable access to sanitized CPU features at EL2

On Wednesday 13 Jan 2021 at 17:27:49 (+0000), Marc Zyngier wrote:
> On 2021-01-13 14:35, Quentin Perret wrote:
> > On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
> > > Good point, that would be nice indeed. Can I use that from outside an
> > > __init function?
> >
> > Just gave it a go, and the answer to this appears to be yes,
> > surprisingly -- I was expecting a compile-time warning similar to what
> > we get when non-__init code calls into __init, but that doesn't seem to
> > trigger here. Anyways, I'll add the annotation in v3.
>
> That's surprising. I'd definitely expect something to explode...
> Do you have CONFIG_DEBUG_SECTION_MISMATCH=y?

Yes I do, so, that doesn't seem to be it. Now, the plot thickens: I
_do_ get a warning if I remove the 'const' qualifier. But interestingly,
in both cases hyp_ftr_regs is placed in .init.data:

$ objdump -t vmlinux | grep hyp_ftr_regs
ffff8000116c17b0 g O .init.data 0000000000000030 hyp_ftr_regs

The warning is silenced only if I mark hyp_ftr_regs as const. modpost
bug? I'll double check my findings and follow up in a separate series.

Thanks,
Quentin

2021-01-15 11:51:54

by Quentin Perret

Subject: Re: [RFC PATCH v2 15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

On Tuesday 12 Jan 2021 at 16:50:12 (+0000), Quentin Perret wrote:
> On Tuesday 12 Jan 2021 at 10:45:56 (-0600), Rob Herring wrote:
> > Umm, yes you are right. But both are dealing with nomap. So someone
> > needs to sort out what the right thing to do here is. No one cared
> > enough to follow up in a year and a half.
>
> Fair enough, happy to do that. I'll send a small series with these two
> patches independently from this series which may take a while to land.

Now sent:

https://lore.kernel.org/lkml/[email protected]/

Thanks,
Quentin

2021-02-01 17:33:05

by Will Deacon

Subject: Re: [RFC PATCH v2 03/26] arm64: kvm: Add standalone ticket spinlock implementation for use at hyp

On Fri, Jan 08, 2021 at 12:15:01PM +0000, Quentin Perret wrote:
> From: Will Deacon <[email protected]>
>
> We will soon need to synchronise multiple CPUs in the hyp text at EL2.
> The qspinlock-based locking used by the host is overkill for this purpose
> and relies on the kernel's "percpu" implementation for the MCS nodes.
>
> Implement a simple ticket locking scheme based heavily on the code removed
> by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
> with qspinlock").
>
> Signed-off-by: Will Deacon <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
> 1 file changed, 92 insertions(+)
> create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> new file mode 100644
> index 000000000000..7584c397bbac
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> @@ -0,0 +1,92 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * A stand-alone ticket spinlock implementation for use by the non-VHE
> + * KVM hypervisor code running at EL2.
> + *
> + * Copyright (C) 2020 Google LLC
> + * Author: Will Deacon <[email protected]>
> + *
> + * Heavily based on the implementation removed by c11090474d70 which was:
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +
> +#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
> +#define __ARM64_KVM_NVHE_SPINLOCK_H__
> +
> +#include <asm/alternative.h>
> +#include <asm/lse.h>
> +
> +typedef union hyp_spinlock {
> + u32 __val;
> + struct {
> +#ifdef __AARCH64EB__
> + u16 next, owner;
> +#else
> + u16 owner, next;
> + };
> +#endif

Looks like I put this #endif in the wrong place; probably needs to be a line
higher.

Will
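
To spell out the problem: as posted, the closing "};" of the anonymous struct sits inside the #else branch, so a big-endian build (__AARCH64EB__ defined) would never see it and the struct would be left unclosed. A user-space sketch of the layout with the #endif moved up one line, as suggested, might look like this. It uses <stdint.h> types and toy_* names in place of the kernel's, and toy_is_locked() is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef union toy_hyp_spinlock {
	uint32_t val;
	struct {
#ifdef __AARCH64EB__
		uint16_t next, owner;
#else
		uint16_t owner, next;
#endif
	};	/* now closed on both endiannesses */
} toy_hyp_spinlock_t;

/* Hypothetical helper: the lock is held while a ticket is outstanding */
static inline int toy_is_locked(toy_hyp_spinlock_t lock)
{
	return lock.owner != lock.next;
}
```

Both halves still overlay the single 32-bit word, so the union stays four bytes wide whichever branch is compiled.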

2021-02-01 17:45:14

by Quentin Perret

Subject: Re: [RFC PATCH v2 03/26] arm64: kvm: Add standalone ticket spinlock implementation for use at hyp

On Monday 01 Feb 2021 at 17:28:34 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:01PM +0000, Quentin Perret wrote:
> > From: Will Deacon <[email protected]>
> >
> > We will soon need to synchronise multiple CPUs in the hyp text at EL2.
> > The qspinlock-based locking used by the host is overkill for this purpose
> > and relies on the kernel's "percpu" implementation for the MCS nodes.
> >
> > Implement a simple ticket locking scheme based heavily on the code removed
> > by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
> > with qspinlock").
> >
> > Signed-off-by: Will Deacon <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
> > 1 file changed, 92 insertions(+)
> > create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> > new file mode 100644
> > index 000000000000..7584c397bbac
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> > @@ -0,0 +1,92 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * A stand-alone ticket spinlock implementation for use by the non-VHE
> > + * KVM hypervisor code running at EL2.
> > + *
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Will Deacon <[email protected]>
> > + *
> > + * Heavily based on the implementation removed by c11090474d70 which was:
> > + * Copyright (C) 2012 ARM Ltd.
> > + */
> > +
> > +#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
> > +#define __ARM64_KVM_NVHE_SPINLOCK_H__
> > +
> > +#include <asm/alternative.h>
> > +#include <asm/lse.h>
> > +
> > +typedef union hyp_spinlock {
> > + u32 __val;
> > + struct {
> > +#ifdef __AARCH64EB__
> > + u16 next, owner;
> > +#else
> > + u16 owner, next;
> > + };
> > +#endif
>
> Looks like I put this #endif in the wrong place; probably needs to be a line
> higher.

Uh oh, missed that too. Fix now merged locally, thanks.

Quentin

2021-02-01 17:46:46

by Will Deacon

Subject: Re: [RFC PATCH v2 04/26] KVM: arm64: Initialize kvm_nvhe_init_params early

On Fri, Jan 08, 2021 at 12:15:02PM +0000, Quentin Perret wrote:
> Move the initialization of kvm_nvhe_init_params in a dedicated function
> that is run early, and only once during KVM init, rather than every time
> the KVM vectors are set and reset.
>
> This also opens the opportunity for the hypervisor to change the init
> structs during boot, hence simplifying the replacement of host-provided
> page-tables and stacks by the ones the hypervisor will create for
> itself.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 04c44853b103..3ac0f3425833 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c

[...]

> @@ -1807,6 +1813,12 @@ static int init_hyp_mode(void)
> goto out_err;
> }
>
> + /*
> + * Prepare the CPU initialization parameters
> + */
> + for_each_possible_cpu(cpu)
> + cpu_prepare_hyp_mode(cpu);
> +

This is the fifth for_each_possible_cpu() loop in this function; can any of
them be merged together?

Will

2021-02-01 17:49:14

by Will Deacon

Subject: Re: [RFC PATCH v2 05/26] KVM: arm64: Avoid free_page() in page-table allocator

On Fri, Jan 08, 2021 at 12:15:03PM +0000, Quentin Perret wrote:
> Currently, the KVM page-table allocator uses a mix of put_page() and
> free_page() calls depending on the context even though page-allocation
> is always achieved using variants of __get_free_page().
>
> Make the code consitent by using put_page() throughout, and reduce the

typo: consistent

> memory management API surface used by the page-table code. This will
> ease factoring out page-allocation from pgtable.c, which is a
> pre-requisite to creating page-tables at EL2.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-01 18:30:59

by Will Deacon

Subject: Re: [RFC PATCH v2 06/26] KVM: arm64: Factor memory allocation out of pgtable.c

On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> In preparation for enabling the creation of page-tables at EL2, factor
> all memory allocation out of the page-table code, hence making it
> re-usable with any compatible memory allocator.
>
> No functional changes intended.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 32 +++++++++-
> arch/arm64/kvm/hyp/pgtable.c | 90 +++++++++++++++++-----------
> arch/arm64/kvm/mmu.c | 70 +++++++++++++++++++++-
> 3 files changed, 154 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 52ab38db04c7..45acc9dc6c45 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -13,17 +13,41 @@
>
> typedef u64 kvm_pte_t;
>
> +/**
> + * struct kvm_pgtable_mm_ops - Memory management callbacks.
> + * @zalloc_page: Allocate a zeroed memory page.

Please describe the 'arg' parameter.

> + * @zalloc_pages_exact: Allocate an exact number of zeroed memory pages.

I think this comment could be expanded somewhat to make it clear (a) that
the 'size' parameter is in bytes rather than pages, (b) what rounding
behaviour is applied if 'size' is not page-aligned and (c) that the
resulting allocation is physically contiguous.
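
To make point (b) concrete, here is a small userspace sketch of the arithmetic, assuming the callback rounds a byte count up to a whole number of pages in the way the host's alloc_pages_exact() does. The SKETCH_* constants and the helper name are hypothetical, and the 4KiB page size is only an assumption:

```c
#include <stddef.h>

/* Hypothetical 4KiB page geometry; the kernel's PAGE_SHIFT may differ. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Number of whole pages needed to back 'size' bytes, rounding up.
 * Note: 'size' values within a page of SIZE_MAX would wrap here; a real
 * implementation would need to reject those first.
 */
static size_t sketch_pages_for_size(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

So a request of PAGE_SIZE + 1 bytes is backed by two pages under this assumption, which is the kind of detail the kernel-doc comment could spell out.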

> + * @free_pages_exact: Free an exact number of memory pages.
> + * @get_page: Increment the refcount on a page.
> + * @put_page: Decrement the refcount on a page.
> + * @page_count: Returns the refcount of a page.
> + * @phys_to_virt: Convert a physical address into a virtual address.
> + * @virt_to_phys: Convert a virtual address into a physical address.

I think it would be good to be explicit about the nature of the virtual
address here. We're dealing with virtual addresses that are mapped in the
current context rather than e.g. guest virtual addresses.

> + */
> +struct kvm_pgtable_mm_ops {
> + void* (*zalloc_page)(void *arg);
> + void* (*zalloc_pages_exact)(size_t size);
> + void (*free_pages_exact)(void *addr, size_t size);
> + void (*get_page)(void *addr);
> + void (*put_page)(void *addr);
> + int (*page_count)(void *addr);
> + void* (*phys_to_virt)(phys_addr_t phys);
> + phys_addr_t (*virt_to_phys)(void *addr);
> +};

[...]

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1f41173e6149..278e163beda4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -88,6 +88,48 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> return !pfn_valid(pfn);
> }
>
> +static void *stage2_memcache_alloc_page(void *arg)
> +{
> + struct kvm_mmu_memory_cache *mc = arg;
> + kvm_pte_t *ptep = NULL;
> +
> + /* Allocated with GFP_KERNEL_ACCOUNT, so no need to zero */

I couldn't spot where GFP_KERNEL_ACCOUNT implies __GFP_ZERO. Please can you
elaborate?

> + if (mc && mc->nobjs)
> + ptep = mc->objects[--mc->nobjs];
> +
> + return ptep;
> +}

Why can't we use kvm_mmu_memory_cache_alloc() directly instead of opening up
the memory_cache?

> +static void *kvm_host_zalloc_pages_exact(size_t size)
> +{
> + return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);

Hmm, so now we're passing __GFP_ZERO? ;)

> +static void kvm_host_get_page(void *addr)
> +{
> + get_page(virt_to_page(addr));
> +}
> +
> +static void kvm_host_put_page(void *addr)
> +{
> + put_page(virt_to_page(addr));
> +}
> +
> +static int kvm_host_page_count(void *addr)
> +{
> + return page_count(virt_to_page(addr));
> +}
> +
> +static phys_addr_t kvm_host_pa(void *addr)
> +{
> + return __pa(addr);
> +}
> +
> +static void *kvm_host_va(phys_addr_t phys)
> +{
> + return __va(phys);
> +}
> +
> /*
> * Unmapping vs dcache management:
> *
> @@ -351,6 +393,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> return 0;
> }
>
> +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> + .zalloc_page = stage2_memcache_alloc_page,
> + .zalloc_pages_exact = kvm_host_zalloc_pages_exact,
> + .free_pages_exact = free_pages_exact,
> + .get_page = kvm_host_get_page,
> + .put_page = kvm_host_put_page,
> + .page_count = kvm_host_page_count,
> + .phys_to_virt = kvm_host_va,
> + .virt_to_phys = kvm_host_pa,
> +};

Idle thought, but I wonder whether it would be better to have these
implementations as the default and make the mm_ops structure parameter
to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
lot though, so feel free to ignore me.
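
One possible shape for the "optional parameter" idea, as a userspace sketch only. All sketch_* names are hypothetical, the callback table is cut down to a single member, and calloc() stands in for the host allocator:

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

/* Cut-down, hypothetical version of the callback table. */
struct sketch_mm_ops {
	void *(*zalloc_page)(void *arg);
};

struct sketch_pgtable {
	const struct sketch_mm_ops *mm_ops;
};

/* Default implementation: calloc() stands in for the host allocator. */
static void *sketch_default_zalloc_page(void *arg)
{
	(void)arg;
	return calloc(1, SKETCH_PAGE_SIZE);
}

static const struct sketch_mm_ops sketch_default_mm_ops = {
	.zalloc_page = sketch_default_zalloc_page,
};

/* Passing NULL for 'ops' selects the default callbacks. */
static void sketch_pgtable_init(struct sketch_pgtable *pgt,
				const struct sketch_mm_ops *ops)
{
	pgt->mm_ops = ops ? ops : &sketch_default_mm_ops;
}
```

The trade-off is that existing callers could stay unchanged, at the cost of a NULL check and a second way of reaching the same callbacks.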

Will

2021-02-01 18:41:31

by Quentin Perret

Subject: Re: [RFC PATCH v2 06/26] KVM: arm64: Factor memory allocation out of pgtable.c

On Monday 01 Feb 2021 at 18:16:08 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> > In preparation for enabling the creation of page-tables at EL2, factor
> > all memory allocation out of the page-table code, hence making it
> > re-usable with any compatible memory allocator.
> >
> > No functional changes intended.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 32 +++++++++-
> > arch/arm64/kvm/hyp/pgtable.c | 90 +++++++++++++++++-----------
> > arch/arm64/kvm/mmu.c | 70 +++++++++++++++++++++-
> > 3 files changed, 154 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 52ab38db04c7..45acc9dc6c45 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -13,17 +13,41 @@
> >
> > typedef u64 kvm_pte_t;
> >
> > +/**
> > + * struct kvm_pgtable_mm_ops - Memory management callbacks.
> > + * @zalloc_page: Allocate a zeroed memory page.
>
> Please describe the 'arg' parameter.
>
> > + * @zalloc_pages_exact: Allocate an exact number of zeroed memory pages.
>
> I think this comment could be expanded somewhat to make it clear (a) that
> the 'size' parameter is in bytes rather than pages, (b) what rounding
> behaviour is applied if 'size' is not page-aligned and (c) that the
> resulting allocation is physically contiguous.
>
> > + * @free_pages_exact: Free an exact number of memory pages.
> > + * @get_page: Increment the refcount on a page.
> > + * @put_page: Decrement the refcount on a page.
> > + * @page_count: Returns the refcount of a page.
> > + * @phys_to_virt: Convert a physical address into a virtual address.
> > + * @virt_to_phys: Convert a virtual address into a physical address.
>
> I think it would be good to be explicit about the nature of the virtual
> address here. We're dealing with virtual addresses that are mapped in the
> current context rather than e.g. guest virtual addresses.

Ack to all the above.

> > + */
> > +struct kvm_pgtable_mm_ops {
> > + void* (*zalloc_page)(void *arg);
> > + void* (*zalloc_pages_exact)(size_t size);
> > + void (*free_pages_exact)(void *addr, size_t size);
> > + void (*get_page)(void *addr);
> > + void (*put_page)(void *addr);
> > + int (*page_count)(void *addr);
> > + void* (*phys_to_virt)(phys_addr_t phys);
> > + phys_addr_t (*virt_to_phys)(void *addr);
> > +};
>
> [...]
>
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 1f41173e6149..278e163beda4 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -88,6 +88,48 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> > return !pfn_valid(pfn);
> > }
> >
> > +static void *stage2_memcache_alloc_page(void *arg)
> > +{
> > + struct kvm_mmu_memory_cache *mc = arg;
> > + kvm_pte_t *ptep = NULL;
> > +
> > + /* Allocated with GFP_KERNEL_ACCOUNT, so no need to zero */
>
> I couldn't spot where GFP_KERNEL_ACCOUNT implies __GFP_ZERO.

I'm not surprised, it doesn't. Broken comment clearly, I'll fix with
s/GFP_KERNEL_ACCOUNT/__GFP_ZERO/

> Please can you elaborate?
>
> > + if (mc && mc->nobjs)
> > + ptep = mc->objects[--mc->nobjs];
> > +
> > + return ptep;
> > +}
>
> Why can't we use kvm_mmu_memory_cache_alloc() directly instead of opening up
> the memory_cache?

I think we can -- that function didn't exist when I first wrote this,
but no good reason not to use it now.

> > +static void *kvm_host_zalloc_pages_exact(size_t size)
> > +{
> > + return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>
> Hmm, so now we're passing __GFP_ZERO? ;)

:-)

> > +static void kvm_host_get_page(void *addr)
> > +{
> > + get_page(virt_to_page(addr));
> > +}
> > +
> > +static void kvm_host_put_page(void *addr)
> > +{
> > + put_page(virt_to_page(addr));
> > +}
> > +
> > +static int kvm_host_page_count(void *addr)
> > +{
> > + return page_count(virt_to_page(addr));
> > +}
> > +
> > +static phys_addr_t kvm_host_pa(void *addr)
> > +{
> > + return __pa(addr);
> > +}
> > +
> > +static void *kvm_host_va(phys_addr_t phys)
> > +{
> > + return __va(phys);
> > +}
> > +
> > /*
> > * Unmapping vs dcache management:
> > *
> > @@ -351,6 +393,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> > return 0;
> > }
> >
> > +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > + .zalloc_page = stage2_memcache_alloc_page,
> > + .zalloc_pages_exact = kvm_host_zalloc_pages_exact,
> > + .free_pages_exact = free_pages_exact,
> > + .get_page = kvm_host_get_page,
> > + .put_page = kvm_host_put_page,
> > + .page_count = kvm_host_page_count,
> > + .phys_to_virt = kvm_host_va,
> > + .virt_to_phys = kvm_host_pa,
> > +};
>
> Idle thought, but I wonder whether it would be better to have these
> implementations as the default and make the mm_ops structure parameter
> to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
> lot though, so feel free to ignore me.

No strong opinion really, but I suppose I could do something as simple
as having static inline wrappers which provide kvm_s2_mm_ops to the
pgtable API for me. I'll probably want to make sure these are not
defined when compiling EL2 code, though, to avoid confusion.

Or maybe you had something else in mind?

Cheers,
Quentin

2021-02-01 18:43:33

by Will Deacon

Subject: Re: [RFC PATCH v2 07/26] KVM: arm64: Introduce a BSS section for use at Hyp

On Fri, Jan 08, 2021 at 12:15:05PM +0000, Quentin Perret wrote:
> Currently, the hyp code cannot make full use of a bss, as the kernel
> section is mapped read-only.
>
> While this mapping could simply be changed to read-write, it would
> intermingle even more the hyp and kernel state than they currently are.
> Instead, introduce a __hyp_bss section, that uses reserved pages, and
> create the appropriate RW hyp mappings during KVM init.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/sections.h | 1 +
> arch/arm64/kernel/vmlinux.lds.S | 7 +++++++
> arch/arm64/kvm/arm.c | 11 +++++++++++
> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 1 +
> 4 files changed, 20 insertions(+)
>
> diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
> index 8ff579361731..f58cf493de16 100644
> --- a/arch/arm64/include/asm/sections.h
> +++ b/arch/arm64/include/asm/sections.h
> @@ -12,6 +12,7 @@ extern char __hibernate_exit_text_start[], __hibernate_exit_text_end[];
> extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
> extern char __hyp_text_start[], __hyp_text_end[];
> extern char __hyp_data_ro_after_init_start[], __hyp_data_ro_after_init_end[];
> +extern char __hyp_bss_start[], __hyp_bss_end[];
> extern char __idmap_text_start[], __idmap_text_end[];
> extern char __initdata_begin[], __initdata_end[];
> extern char __inittext_begin[], __inittext_end[];
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 43af13968dfd..3eca35d5a7cf 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -8,6 +8,13 @@
> #define RO_EXCEPTION_TABLE_ALIGN 8
> #define RUNTIME_DISCARD_EXIT
>
> +#define BSS_FIRST_SECTIONS \
> + . = ALIGN(PAGE_SIZE); \
> + __hyp_bss_start = .; \
> + *(.hyp.bss) \

Use HYP_SECTION_NAME() here?

> + . = ALIGN(PAGE_SIZE); \
> + __hyp_bss_end = .;

Should this be gated on CONFIG_KVM like the other hyp sections are? In fact,
it might be nice to define all of those together. Yeah, it means moving
things higher up in the file, but I think it will be easier to read.

> #include <asm-generic/vmlinux.lds.h>
> #include <asm/cache.h>
> #include <asm/hyp_image.h>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 3ac0f3425833..51b53ca36dc5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1770,7 +1770,18 @@ static int init_hyp_mode(void)
> goto out_err;
> }
>
> + /*
> + * .hyp.bss is placed at the beginning of the .bss section, so map that
> + * part RW, and the rest RO as the hyp shouldn't be touching it.
> + */
> err = create_hyp_mappings(kvm_ksym_ref(__bss_start),

I think it would be clearer to refer to __hyp_bss_start here ^^.
You could always add an ASSERT in the linker script if you want to catch
anybody adding something before the hyp bss in future.

Will

2021-02-01 18:45:04

by Will Deacon

Subject: Re: [RFC PATCH v2 06/26] KVM: arm64: Factor memory allocation out of pgtable.c

On Mon, Feb 01, 2021 at 06:32:52PM +0000, Quentin Perret wrote:
> On Monday 01 Feb 2021 at 18:16:08 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> > > +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > > + .zalloc_page = stage2_memcache_alloc_page,
> > > + .zalloc_pages_exact = kvm_host_zalloc_pages_exact,
> > > + .free_pages_exact = free_pages_exact,
> > > + .get_page = kvm_host_get_page,
> > > + .put_page = kvm_host_put_page,
> > > + .page_count = kvm_host_page_count,
> > > + .phys_to_virt = kvm_host_va,
> > > + .virt_to_phys = kvm_host_pa,
> > > +};
> >
> > Idle thought, but I wonder whether it would be better to have these
> > implementations as the default and make the mm_ops structure parameter
> > to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
> > lot though, so feel free to ignore me.
>
> No strong opinion really, but I suppose I could do something as simple
> as having static inline wrappers which provide kvm_s2_mm_ops to the
> pgtable API for me. I'll probably want to make sure these are not
> defined when compiling EL2 code, though, to avoid confusion.
>
> Or maybe you had something else in mind?

No, just food for thought. If we can reduce the changes for normal KVM then
it's probably worth considering if it doesn't add divergent code paths. But
I'm also fine with the proposal you have here, so if it doesn't work then
don't get hung up on it.

Will

2021-02-01 18:45:55

by Will Deacon

Subject: Re: [RFC PATCH v2 08/26] KVM: arm64: Make kvm_call_hyp() a function call at Hyp

On Fri, Jan 08, 2021 at 12:15:06PM +0000, Quentin Perret wrote:
> kvm_call_hyp() has some logic to issue a function call or a hypercall
> depending on the EL at which the kernel is running. However, all the code
> compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to run only at EL2,
> and in this case a simple function call is needed.
>
> Add ifdefery to kvm_host.h to simplify kvm_call_hyp() in .hyp.text.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 6 ++++++
> 1 file changed, 6 insertions(+)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-01 18:47:32

by Will Deacon

Subject: Re: [RFC PATCH v2 09/26] KVM: arm64: Allow using kvm_nvhe_sym() in hyp code

On Fri, Jan 08, 2021 at 12:15:07PM +0000, Quentin Perret wrote:
> In order to allow the usage of code shared by the host and the hyp in
> static inline library function, allow the usage of kvm_nvhe_sym() at el2

typo: functions

> by defaulting to the raw symbol name.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/hyp_image.h | 4 ++++
> 1 file changed, 4 insertions(+)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-01 19:02:12

by Will Deacon

Subject: Re: [RFC PATCH v2 10/26] KVM: arm64: Introduce an early Hyp page allocator

On Fri, Jan 08, 2021 at 12:15:08PM +0000, Quentin Perret wrote:
> diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> new file mode 100644
> index 000000000000..de4c45662970
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <[email protected]>
> + */
> +
> +#include <asm/kvm_pgtable.h>
> +
> +#include <nvhe/memory.h>
> +
> +struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
> +s64 __ro_after_init hyp_physvirt_offset;
> +
> +static unsigned long base;
> +static unsigned long end;
> +static unsigned long cur;
> +
> +unsigned long hyp_early_alloc_nr_pages(void)
> +{
> + return (cur - base) >> PAGE_SHIFT;
> +}

nit: but I find this function name confusing (it's returning the number of
_allocated_ pages, not the number of _free_ pages!). How about something
like hyp_early_alloc_size() to match hyp_s1_pgtable_size() which you add
later? [and move the shift out to the caller]?

> +
> +extern void clear_page(void *to);

Stick this in a header?

> +
> +void *hyp_early_alloc_contig(unsigned int nr_pages)

I think order might make more sense, or do you need to allocate
non-power-of-2 batches of pages?

> +{
> + unsigned long ret = cur, i, p;
> +
> + if (!nr_pages)
> + return NULL;
> +
> + cur += nr_pages << PAGE_SHIFT;
> + if (cur > end) {

This would mean that concurrent hyp_early_alloc_nr_pages() would transiently
give the wrong answer. Might be worth sticking the locking expectations with
the function prototypes.

That said, maybe it would be better to write this check as:

if (end - cur < (nr_pages << PAGE_SHIFT))

as that also removes the need to worry about overflow if nr_pages is huge
(which would be a bug in the hypervisor, which we would then catch here).
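
A userspace sketch of the allocator with the check written that way, under a few assumptions: memset() replaces clear_page(), the sketch_* names are hypothetical, nr_pages is widened to unsigned long, and the extra guard on the shift is an addition of this sketch rather than something proposed in the thread. Like the original, it assumes a single caller with no concurrency:

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT	12	/* assumed 4KiB pages */

static unsigned long base, end, cur;

void sketch_early_alloc_init(unsigned long virt, unsigned long size)
{
	base = cur = virt;
	end = virt + size;
}

/* Bytes handed out so far; callers can shift this down to get pages. */
unsigned long sketch_early_alloc_size(void)
{
	return cur - base;
}

void *sketch_early_alloc_contig(unsigned long nr_pages)
{
	unsigned long ret = cur;

	/* Sketch-only guard: the shift below must not wrap either. */
	if (!nr_pages || nr_pages > (~0UL >> SKETCH_PAGE_SHIFT))
		return NULL;

	/* Compare against the space left, so 'cur' is never overshot. */
	if (end - cur < (nr_pages << SKETCH_PAGE_SHIFT))
		return NULL;

	cur += nr_pages << SKETCH_PAGE_SHIFT;
	memset((void *)ret, 0, nr_pages << SKETCH_PAGE_SHIFT);

	return (void *)ret;
}
```

Because 'cur' is only advanced once the request is known to fit, the failure path no longer needs to restore it, and the allocated size never shows a transient value.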

> + cur = ret;
> + return NULL;
> + }
> +
> + for (i = 0; i < nr_pages; i++) {
> + p = ret + (i << PAGE_SHIFT);
> + clear_page((void *)(p));
> + }
> +
> + return (void *)ret;
> +}
> +
> +void *hyp_early_alloc_page(void *arg)
> +{
> + return hyp_early_alloc_contig(1);
> +}
> +
> +void hyp_early_alloc_init(unsigned long virt, unsigned long size)
> +{
> + base = virt;
> + end = virt + size;
> + cur = virt;

nit: base = cur = virt;

Will

2021-02-01 19:09:14

by Will Deacon

Subject: Re: [RFC PATCH v2 11/26] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> In order to use the kernel list library at EL2, introduce stubs for the
> CONFIG_DEBUG_LIST out-of-lines calls.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
> 2 files changed, 23 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 1fc0684a7678..33bd381d8f73 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>
> obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> ../fpsimd.o ../hyp-entry.o ../exception.o
> obj-y += $(lib-objs)
> diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> new file mode 100644
> index 000000000000..c0aa6bbfd79d
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Stubs for out-of-line function calls caused by re-using kernel
> + * infrastructure at EL2.
> + *
> + * Copyright (C) 2020 - Google LLC
> + */
> +
> +#include <linux/list.h>
> +
> +#ifdef CONFIG_DEBUG_LIST
> +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> + struct list_head *next)
> +{
> + return true;
> +}
> +
> +bool __list_del_entry_valid(struct list_head *entry)
> +{
> + return true;
> +}
> +#endif

Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?

Will

2021-02-02 09:50:16

by Quentin Perret

Subject: Re: [RFC PATCH v2 10/26] KVM: arm64: Introduce an early Hyp page allocator

On Monday 01 Feb 2021 at 19:00:08 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:08PM +0000, Quentin Perret wrote:
> > diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> > new file mode 100644
> > index 000000000000..de4c45662970
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> > @@ -0,0 +1,60 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <[email protected]>
> > + */
> > +
> > +#include <asm/kvm_pgtable.h>
> > +
> > +#include <nvhe/memory.h>
> > +
> > +struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
> > +s64 __ro_after_init hyp_physvirt_offset;
> > +
> > +static unsigned long base;
> > +static unsigned long end;
> > +static unsigned long cur;
> > +
> > +unsigned long hyp_early_alloc_nr_pages(void)
> > +{
> > + return (cur - base) >> PAGE_SHIFT;
> > +}
>
> nit: but I find this function name confusing (it's returning the number of
> _allocated_ pages, not the number of _free_ pages!). How about something
> like hyp_early_alloc_size() to match hyp_s1_pgtable_size() which you add
> later? [and move the shift out to the caller]?

Works for me.

> > +extern void clear_page(void *to);
>
> Stick this in a header?

Right, that, or perhaps just use asm/page.h directly -- I _think_ that
should work fine assuming we have the correct symbol aliasing in
place.

> > +
> > +void *hyp_early_alloc_contig(unsigned int nr_pages)
>
> I think order might make more sense, or do you need to allocate
> non-power-of-2 batches of pages?

Indeed, I allocate page-aligned blobs of arbitrary size (e.g.
divide_memory_pool() in patch 16), so I prefer it that way.

> > +{
> > + unsigned long ret = cur, i, p;
> > +
> > + if (!nr_pages)
> > + return NULL;
> > +
> > + cur += nr_pages << PAGE_SHIFT;
> > + if (cur > end) {
>
> This would mean that concurrent hyp_early_alloc_nr_pages() would transiently
> give the wrong answer. Might be worth sticking the locking expectations with
> the function prototypes.

This is only called from a single CPU from a non-preemptible section, so
that is not a problem. But yes, I'll stick a comment.

> That said, maybe it would be better to write this check as:
>
> if (end - cur < (nr_pages << PAGE_SHIFT))
>
> as that also removes the need to worry about overflow if nr_pages is huge
> (which would be a bug in the hypervisor, which we would then catch here).

Sounds good.

> > + cur = ret;
> > + return NULL;
> > + }
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + p = ret + (i << PAGE_SHIFT);
> > + clear_page((void *)(p));
> > + }
> > +
> > + return (void *)ret;
> > +}
> > +
> > +void *hyp_early_alloc_page(void *arg)
> > +{
> > + return hyp_early_alloc_contig(1);
> > +}
> > +
> > +void hyp_early_alloc_init(unsigned long virt, unsigned long size)
> > +{
> > + base = virt;
> > + end = virt + size;
> > + cur = virt;
>
> nit: base = cur = virt;

Ack.

Thanks for the review,
Quentin

2021-02-02 09:59:46

by Quentin Perret

Subject: Re: [RFC PATCH v2 11/26] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > In order to use the kernel list library at EL2, introduce stubs for the
> > CONFIG_DEBUG_LIST out-of-lines calls.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
> > 2 files changed, 23 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> >
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 1fc0684a7678..33bd381d8f73 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >
> > obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > ../fpsimd.o ../hyp-entry.o ../exception.o
> > obj-y += $(lib-objs)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > new file mode 100644
> > index 000000000000..c0aa6bbfd79d
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > @@ -0,0 +1,22 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Stubs for out-of-line function calls caused by re-using kernel
> > + * infrastructure at EL2.
> > + *
> > + * Copyright (C) 2020 - Google LLC
> > + */
> > +
> > +#include <linux/list.h>
> > +
> > +#ifdef CONFIG_DEBUG_LIST
> > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > + struct list_head *next)
> > +{
> > + return true;
> > +}
> > +
> > +bool __list_del_entry_valid(struct list_head *entry)
> > +{
> > + return true;
> > +}
> > +#endif
>
> Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?

Yes I think eventually it'd be nice to get there, but that has other
implications (e.g. how do you report something in dmesg from EL2?) so
perhaps we can keep that a separate series?

Cheers,
Quentin

2021-02-02 10:05:30

by Will Deacon

Subject: Re: [RFC PATCH v2 11/26] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

On Tue, Feb 02, 2021 at 09:57:36AM +0000, Quentin Perret wrote:
> On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > > In order to use the kernel list library at EL2, introduce stubs for the
> > > CONFIG_DEBUG_LIST out-of-lines calls.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > > arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
> > > 2 files changed, 23 insertions(+), 1 deletion(-)
> > > create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> > >
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > index 1fc0684a7678..33bd381d8f73 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > > lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> > >
> > > obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > > - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > > + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > > obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > > ../fpsimd.o ../hyp-entry.o ../exception.o
> > > obj-y += $(lib-objs)
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > new file mode 100644
> > > index 000000000000..c0aa6bbfd79d
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > @@ -0,0 +1,22 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Stubs for out-of-line function calls caused by re-using kernel
> > > + * infrastructure at EL2.
> > > + *
> > > + * Copyright (C) 2020 - Google LLC
> > > + */
> > > +
> > > +#include <linux/list.h>
> > > +
> > > +#ifdef CONFIG_DEBUG_LIST
> > > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > > + struct list_head *next)
> > > +{
> > > + return true;
> > > +}
> > > +
> > > +bool __list_del_entry_valid(struct list_head *entry)
> > > +{
> > > + return true;
> > > +}
> > > +#endif
> >
> > Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?
>
> Yes I think eventually it'd be nice to get there, but that has other
> implications (e.g. how do you report something in dmesg from EL2?) so
> perhaps we can keep that a separate series?

We wouldn't necessarily have to report anything, but having the return value
of these functions be based off the generic checks would be great if we can
do it (i.e. we'd avoid corrupting the list).
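
As a rough illustration of that idea (not code from the series), a userland sketch of stubs that keep the checks but report nothing could look like the following. The hyp_* names are made up for the example, and the conditions are written from memory of lib/list_debug.c, so both should be treated as assumptions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userland stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical silent check: report nothing, just refuse the operation. */
static bool hyp_list_add_valid(struct list_head *new, struct list_head *prev,
			       struct list_head *next)
{
	if (next->prev != prev || prev->next != next)
		return false;
	if (new == prev || new == next)
		return false;
	return true;
}

static bool hyp_list_del_entry_valid(struct list_head *entry)
{
	struct list_head *prev = entry->prev, *next = entry->next;

	if (!prev || !next)
		return false;
	return prev->next == entry && next->prev == entry;
}

/* Same shape as list_add(): the insertion is skipped if the check fails. */
static void hyp_list_add(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head, *next = head->next;

	if (!hyp_list_add_valid(new, prev, next))
		return;
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}
```

With this shape, adding the same entry twice leaves the list untouched instead of corrupting it, and nothing has to be printed from EL2.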

Will

2021-02-02 10:19:29

by Quentin Perret

Subject: Re: [RFC PATCH v2 11/26] KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp

On Tuesday 02 Feb 2021 at 10:00:29 (+0000), Will Deacon wrote:
> On Tue, Feb 02, 2021 at 09:57:36AM +0000, Quentin Perret wrote:
> > On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> > > On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > > > In order to use the kernel list library at EL2, introduce stubs for the
> > > > CONFIG_DEBUG_LIST out-of-lines calls.
> > > >
> > > > Signed-off-by: Quentin Perret <[email protected]>
> > > > ---
> > > > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > > > arch/arm64/kvm/hyp/nvhe/stub.c | 22 ++++++++++++++++++++++
> > > > 2 files changed, 23 insertions(+), 1 deletion(-)
> > > > create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> > > >
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > index 1fc0684a7678..33bd381d8f73 100644
> > > > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > > > lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> > > >
> > > > obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > > > - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > > > + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > > > obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > > > ../fpsimd.o ../hyp-entry.o ../exception.o
> > > > obj-y += $(lib-objs)
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > > new file mode 100644
> > > > index 000000000000..c0aa6bbfd79d
> > > > --- /dev/null
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > > @@ -0,0 +1,22 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Stubs for out-of-line function calls caused by re-using kernel
> > > > + * infrastructure at EL2.
> > > > + *
> > > > + * Copyright (C) 2020 - Google LLC
> > > > + */
> > > > +
> > > > +#include <linux/list.h>
> > > > +
> > > > +#ifdef CONFIG_DEBUG_LIST
> > > > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > > > + struct list_head *next)
> > > > +{
> > > > + return true;
> > > > +}
> > > > +
> > > > +bool __list_del_entry_valid(struct list_head *entry)
> > > > +{
> > > > + return true;
> > > > +}
> > > > +#endif
> > >
> > > Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?
> >
> > Yes I think eventually it'd be nice to get there, but that has other
> > implications (e.g. how do you report something in dmesg from EL2?) so
> > perhaps we can keep that a separate series?
>
> We wouldn't necessarily have to report anything, but having the return value
> of these functions be based off the generic checks would be great if we can
> do it (i.e. we'd avoid corrupting the list).

Ah, I see what you mean. Happy to have a go at it. There are a few other
small things that make it a bit annoying, e.g. CHECK_DATA_CORRUPTION
is unconditionally defined in bug.h, and I'll need to stub EXPORT_SYMBOL
as well, which may both require changing core files, but maybe that's
fine. And if that is too painful I think it would make sense to keep
this as a separate and self-contained series, which would be a nice
incremental improvement over the simple approach I have here :)

Cheers,
Quentin

2021-02-03 00:35:02

by Will Deacon

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

Hi Quentin,

Sorry for the delay, this one took me a while to grok.

On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the hyp code will require a basic
> form of memory management in order to allocate and free memory pages at
> EL2. This is needed for various use-cases, including the creation of hyp
> mappings or the allocation of stage 2 page tables.
>
> To address these use-case, introduce a simple memory allocator in the
> hyp code. The allocator is designed as a conventional 'buddy allocator',
> working with a page granularity. It allows to allocate and free
> physically contiguous pages from memory 'pools', with a guaranteed order
> alignment in the PA space. Each page in a memory pool is associated
> with a struct hyp_page which holds the page's metadata, including its
> refcount, as well as its current order, hence mimicking the kernel's
> buddy system in the GFP infrastructure. The hyp_page metadata are made
> accessible through a hyp_vmemmap, following the concept of
> SPARSE_VMEMMAP in the kernel.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/include/nvhe/gfp.h | 32 ++++
> arch/arm64/kvm/hyp/include/nvhe/memory.h | 25 +++
> arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> arch/arm64/kvm/hyp/nvhe/page_alloc.c | 185 +++++++++++++++++++++++
> 4 files changed, 243 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> new file mode 100644
> index 000000000000..95587faee171
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_HYP_GFP_H
> +#define __KVM_HYP_GFP_H
> +
> +#include <linux/list.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/spinlock.h>
> +
> +#define HYP_MAX_ORDER 11U

Could we just use MAX_ORDER here?

> +#define HYP_NO_ORDER UINT_MAX
> +
> +struct hyp_pool {
> + hyp_spinlock_t lock;

A comment about what this lock protects would be handy, especially as the
'refcount' field of 'struct hyp_page' isn't updated atomically. I think it
also means that we don't have a safe way to move a page from one pool to
another; it's fixed forever once the page has been made available for
allocation.

> + struct list_head free_area[HYP_MAX_ORDER + 1];
> + phys_addr_t range_start;
> + phys_addr_t range_end;
> +};
> +
> +/* GFP flags */
> +#define HYP_GFP_NONE 0
> +#define HYP_GFP_ZERO 1
> +
> +/* Allocation */
> +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);
> +void hyp_get_page(void *addr);
> +void hyp_put_page(void *addr);
> +
> +/* Used pages cannot be freed */
> +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> + unsigned int nr_pages, unsigned int used_pages);

Maybe "reserved_pages" would be a better name than "used_pages"?

> +#endif /* __KVM_HYP_GFP_H */
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> index 64c44c142c95..ed47674bc988 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> @@ -6,7 +6,17 @@
>
> #include <linux/types.h>
>
> +struct hyp_pool;
> +struct hyp_page {
> + unsigned int refcount;
> + unsigned int order;
> + struct hyp_pool *pool;
> + struct list_head node;
> +};
> +
> extern s64 hyp_physvirt_offset;
> +extern u64 __hyp_vmemmap;
> +#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)
>
> #define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
> #define __hyp_va(virt) ((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
> @@ -21,4 +31,19 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
> return __hyp_pa(addr);
> }
>
> +#define hyp_phys_to_pfn(phys) ((phys) >> PAGE_SHIFT)
> +#define hyp_phys_to_page(phys) (&hyp_vmemmap[hyp_phys_to_pfn(phys)])
> +#define hyp_virt_to_page(virt) hyp_phys_to_page(__hyp_pa(virt))
> +
> +#define hyp_page_to_phys(page) ((phys_addr_t)((page) - hyp_vmemmap) << PAGE_SHIFT)

Maybe implement this in terms of a new hyp_page_to_pfn() macro?
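
A minimal userland sketch of what that could look like, assuming 4 KiB pages and using a small static array as a stand-in for the real hyp_vmemmap (hyp_page_to_pfn() is the hypothetical new macro):

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */

typedef uint64_t phys_addr_t;

struct hyp_page {
	unsigned int refcount;
	unsigned int order;
};

/* Stand-in for the hyp vmemmap: one struct hyp_page per pfn. */
static struct hyp_page vmemmap_storage[16];
#define hyp_vmemmap	(vmemmap_storage)

#define hyp_phys_to_pfn(phys)	((phys) >> PAGE_SHIFT)
#define hyp_phys_to_page(phys)	(&hyp_vmemmap[hyp_phys_to_pfn(phys)])

/* Hypothetical helper, with hyp_page_to_phys() rebuilt on top of it. */
#define hyp_page_to_pfn(page)	((uint64_t)((struct hyp_page *)(page) - hyp_vmemmap))
#define hyp_page_to_phys(page)	((phys_addr_t)hyp_page_to_pfn(page) << PAGE_SHIFT)
```

The phys/pfn/page conversions then mirror each other in both directions.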

> +#define hyp_page_to_virt(page) __hyp_va(hyp_page_to_phys(page))
> +#define hyp_page_to_pool(page) (((struct hyp_page *)page)->pool)
> +
> +static inline int hyp_page_count(void *addr)
> +{
> + struct hyp_page *p = hyp_virt_to_page(addr);
> +
> + return p->refcount;
> +}
> +
> #endif /* __KVM_HYP_MEMORY_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 33bd381d8f73..9e5eacfec6ec 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>
> obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
> obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> ../fpsimd.o ../hyp-entry.o ../exception.o
> obj-y += $(lib-objs)
> diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> new file mode 100644
> index 000000000000..6de6515f0432
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> @@ -0,0 +1,185 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <[email protected]>
> + */
> +
> +#include <asm/kvm_hyp.h>
> +#include <nvhe/gfp.h>
> +
> +u64 __hyp_vmemmap;
> +
> +/*
> + * Example buddy-tree for a 4-pages physically contiguous pool:
> + *
> + * o : Page 3
> + * /
> + * o-o : Page 2
> + * /
> + * / o : Page 1
> + * / /
> + * o---o-o : Page 0
> + * Order 2 1 0
> + *
> + * Example of requests on this zon:

typo: zone

> + * __find_buddy(pool, page 0, order 0) => page 1
> + * __find_buddy(pool, page 0, order 1) => page 2
> + * __find_buddy(pool, page 1, order 0) => page 0
> + * __find_buddy(pool, page 2, order 0) => page 3
> + */
> +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> + unsigned int order)
> +{
> + phys_addr_t addr = hyp_page_to_phys(p);
> +
> + addr ^= (PAGE_SIZE << order);
> + if (addr < pool->range_start || addr >= pool->range_end)
> + return NULL;

Are these range checks only needed because the pool isn't required to be
an exact power-of-2 pages in size? If so, maybe it would be more
straightforward to limit the max order on a per-pool basis depending upon
its size?
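
A sketch of how such a per-pool cap might be computed, assuming 4 KiB pages; hyp_pool_max_order() and buddy_addr() are hypothetical names. One caveat worth checking: for a 6-page pool starting at address 0, the order-1 buddy of page 4 is page 6, which is outside the pool, so a cap alone only seems to make the range check redundant when the pool is a power of two in size and aligned accordingly.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define HYP_MAX_ORDER	11U

typedef uint64_t phys_addr_t;

/* Hypothetical helper: largest order whose block still fits in nr_pages. */
static unsigned int hyp_pool_max_order(unsigned int nr_pages)
{
	unsigned int order = 0;

	while (order < HYP_MAX_ORDER && (2U << order) <= nr_pages)
		order++;
	return order;
}

/* Buddy address at a given order, as in __find_buddy() minus the range checks. */
static phys_addr_t buddy_addr(phys_addr_t addr, unsigned int order)
{
	return addr ^ (PAGE_SIZE << order);
}
```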

> +
> + return hyp_phys_to_page(addr);
> +}
> +
> +static void __hyp_attach_page(struct hyp_pool *pool,
> + struct hyp_page *p)
> +{
> + unsigned int order = p->order;
> + struct hyp_page *buddy;
> +
> + p->order = HYP_NO_ORDER;

Why is this needed?

> + for (; order < HYP_MAX_ORDER; order++) {
> + /* Nothing to do if the buddy isn't in a free-list */
> + buddy = __find_buddy(pool, p, order);
> + if (!buddy || list_empty(&buddy->node) || buddy->order != order)

Could we move the "buddy->order" check into __find_buddy()?

> + break;
> +
> + /* Otherwise, coalesce the buddies and go one level up */
> + list_del_init(&buddy->node);
> + buddy->order = HYP_NO_ORDER;
> + p = (p < buddy) ? p : buddy;
> + }
> +
> + p->order = order;
> + list_add_tail(&p->node, &pool->free_area[order]);
> +}
> +
> +void hyp_put_page(void *addr)
> +{
> + struct hyp_page *p = hyp_virt_to_page(addr);
> + struct hyp_pool *pool = hyp_page_to_pool(p);
> +
> + hyp_spin_lock(&pool->lock);
> + if (!p->refcount)
> + hyp_panic();
> + p->refcount--;
> + if (!p->refcount)
> + __hyp_attach_page(pool, p);
> + hyp_spin_unlock(&pool->lock);
> +}
> +
> +void hyp_get_page(void *addr)
> +{
> + struct hyp_page *p = hyp_virt_to_page(addr);
> + struct hyp_pool *pool = hyp_page_to_pool(p);
> +
> + hyp_spin_lock(&pool->lock);
> + p->refcount++;
> + hyp_spin_unlock(&pool->lock);

We should probably have a proper atomic refcount type for this along the
lines of refcount_t. Even if initially that is implemented with a lock, it
would be good to hide that behind a refcount API.
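
For illustration only, a small API of that kind could be sketched in userland with C11 atomics as below. The hyp_refcount_* names are hypothetical, and unlike the kernel's refcount_t this sketch does not saturate or detect underflow:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type; the name and API are illustrative, not from the series. */
typedef struct {
	atomic_uint refs;
} hyp_refcount_t;

static void hyp_refcount_init(hyp_refcount_t *r, unsigned int n)
{
	atomic_init(&r->refs, n);
}

static unsigned int hyp_refcount_read(hyp_refcount_t *r)
{
	return atomic_load(&r->refs);
}

static void hyp_refcount_inc(hyp_refcount_t *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Returns true when the count dropped to zero, i.e. the caller should free. */
static bool hyp_refcount_dec_and_test(hyp_refcount_t *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

hyp_put_page() could then hand the page back only when hyp_refcount_dec_and_test() returns true; the pool lock would presumably still be needed around the free lists themselves.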

> +}
> +
> +/* Extract a page from the buddy tree, at a specific order */
> +static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
> + struct hyp_page *p,
> + unsigned int order)
> +{
> + struct hyp_page *buddy;
> +
> + if (p->order == HYP_NO_ORDER || p->order < order)
> + return NULL;

Can you drop the explicit HYP_NO_ORDER check here?

> +
> + list_del_init(&p->node);
> +
> + /* Split the page in two until reaching the requested order */
> + while (p->order > order) {
> + p->order--;
> + buddy = __find_buddy(pool, p, p->order);
> + buddy->order = p->order;
> + list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
> + }
> +
> + p->refcount = 1;
> +
> + return p;
> +}
> +
> +static void clear_hyp_page(struct hyp_page *p)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < (1 << p->order); i++)
> + clear_page(hyp_page_to_virt(p) + (i << PAGE_SHIFT));

I wonder if this is actually any better than a memset(0)? That should use
DC ZVA as appropriate afaict.

> +static void *__hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask,
> + unsigned int order)
> +{
> + unsigned int i = order;
> + struct hyp_page *p;
> +
> + /* Look for a high-enough-order page */
> + while (i <= HYP_MAX_ORDER && list_empty(&pool->free_area[i]))
> + i++;
> + if (i > HYP_MAX_ORDER)
> + return NULL;
> +
> + /* Extract it from the tree at the right order */
> + p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
> + p = __hyp_extract_page(pool, p, order);
> +
> + if (mask & HYP_GFP_ZERO)
> + clear_hyp_page(p);

Do we have a use-case where skipping the zeroing is worthwhile? If not,
it might make some sense to zero on the freeing path instead.

> +
> + return p;
> +}
> +
> +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order)
> +{
> + struct hyp_page *p;
> +
> + hyp_spin_lock(&pool->lock);
> + p = __hyp_alloc_pages(pool, mask, order);
> + hyp_spin_unlock(&pool->lock);
> +
> + return p ? hyp_page_to_virt(p) : NULL;

It looks weird not having __hyp_alloc_pages return the VA, but I guess later
patches will use __hyp_alloc_pages() for something else.

> +}
> +
> +/* hyp_vmemmap must be backed beforehand */
> +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> + unsigned int nr_pages, unsigned int used_pages)
> +{
> + struct hyp_page *p;
> + int i;
> +
> + if (phys % PAGE_SIZE)
> + return -EINVAL;

Maybe just take a pfn instead?

> + hyp_spin_lock_init(&pool->lock);
> + for (i = 0; i <= HYP_MAX_ORDER; i++)
> + INIT_LIST_HEAD(&pool->free_area[i]);
> + pool->range_start = phys;
> + pool->range_end = phys + (nr_pages << PAGE_SHIFT);
> +
> + /* Init the vmemmap portion */
> + p = hyp_phys_to_page(phys);
> + memset(p, 0, sizeof(*p) * nr_pages);
> + for (i = 0; i < nr_pages; i++, p++) {
> + p->pool = pool;
> + INIT_LIST_HEAD(&p->node);
> + }

Maybe index p like an array (e.g. p[i]) instead of maintaining two loop
increments?
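
Something along these lines, as a standalone sketch with cut-down stand-in structs. The function names are made up, and the second helper only mimics the shape of the attach loop further down:

```c
#include <string.h>

/* Cut-down stand-ins for the structures in the patch. */
struct hyp_pool {
	int unused;
};

struct hyp_page {
	unsigned int refcount;
	unsigned int order;
	struct hyp_pool *pool;
};

/* Initialise nr_pages entries starting at p, indexing instead of bumping p. */
static void init_vmemmap_range(struct hyp_page *p, struct hyp_pool *pool,
			       unsigned int nr_pages)
{
	unsigned int i;

	memset(p, 0, sizeof(*p) * nr_pages);
	for (i = 0; i < nr_pages; i++)
		p[i].pool = pool;
}

/* Second loop: visit only the pages past the reserved prefix. */
static unsigned int attach_unused(struct hyp_page *p, unsigned int used_pages,
				  unsigned int nr_pages)
{
	unsigned int i, attached = 0;

	for (i = used_pages; i < nr_pages; i++) {
		p[i].order = 0;	/* stand-in for __hyp_attach_page(pool, &p[i]) */
		attached++;
	}
	return attached;
}
```

With indexing, p keeps pointing at the first page of the pool, so it does not need to be recomputed before the second loop.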

> +
> + /* Attach the unused pages to the buddy tree */
> + p = hyp_phys_to_page(phys + (used_pages << PAGE_SHIFT));
> + for (i = used_pages; i < nr_pages; i++, p++)
> + __hyp_attach_page(pool, p);

Likewise.

Will

2021-02-03 00:38:23

by Will Deacon

Subject: Re: [RFC PATCH v2 14/26] KVM: arm64: Factor out vector address calculation

On Fri, Jan 08, 2021 at 12:15:12PM +0000, Quentin Perret wrote:
> In order to re-map the guest vectors at EL2 when pKVM is enabled,
> refactor __kvm_vector_slot2idx() and kvm_init_vector_slot() to move all
> the address calculation logic in a static inline function.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 8 ++++++++
> arch/arm64/kvm/arm.c | 9 +--------
> 2 files changed, 9 insertions(+), 8 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-03 14:41:21

by Will Deacon

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the Hyp code needs the ability to
> create and manage its own page-table. To do so, introduce a new set of
> hypercalls to initialize Hyp memory protection.
>
> During the init hcall, the hypervisor runs with the host-provided
> page-table and uses the trivial early page allocator to create its own
> set of page-tables, using a memory pool that was donated by the host.
> Specifically, the hypervisor creates its own mappings for __hyp_text,
> the Hyp memory pool, the __hyp_bss, the portion of hyp_vmemmap
> corresponding to the Hyp pool, among other things. It then jumps back in
> the idmap page, switches to use the newly-created pgd (instead of the
> temporary one provided by the host) and then installs the full-fledged
> buddy allocator which will then be the only one in used from then on.
>
> Note that for the sake of symplifying the review, this only introduces
> the code doing this operation, without actually being called by anyhing
> yet. This will be done in a subsequent patch, which will introduce the
> necessary host kernel changes.
>
> Credits to Will for __pkvm_init_switch_pgd.
>
> Co-authored-by: Will Deacon <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>

[...]

> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index c0450828378b..a0e113734b20 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -100,4 +100,12 @@ void __noreturn hyp_panic(void);
> void __noreturn __hyp_do_panic(bool restore_host, u64 spsr, u64 elr, u64 par);
> #endif
>
> +#ifdef __KVM_NVHE_HYPERVISOR__
> +void __pkvm_init_switch_pgd(phys_addr_t phys, unsigned long size,
> + phys_addr_t pgd, void *sp, void *cont_fn);
> +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> + unsigned long *per_cpu_base);
> +void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);
> +#endif
> +
> #endif /* __ARM64_KVM_HYP_H__ */
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index 43f3a1d6e92d..366d837f0d39 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -113,6 +113,25 @@ KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
> KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
> #endif
>
> +/* Hypevisor VA size */

typo: Hypervisor

> +KVM_NVHE_ALIAS(hyp_va_bits);
> +
> +/* Kernel memory sections */
> +KVM_NVHE_ALIAS(__start_rodata);
> +KVM_NVHE_ALIAS(__end_rodata);
> +KVM_NVHE_ALIAS(__bss_start);
> +KVM_NVHE_ALIAS(__bss_stop);
> +
> +/* Hyp memory sections */
> +KVM_NVHE_ALIAS(__hyp_idmap_text_start);
> +KVM_NVHE_ALIAS(__hyp_idmap_text_end);
> +KVM_NVHE_ALIAS(__hyp_text_start);
> +KVM_NVHE_ALIAS(__hyp_text_end);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_start);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_end);
> +KVM_NVHE_ALIAS(__hyp_bss_start);
> +KVM_NVHE_ALIAS(__hyp_bss_end);
> +
> #endif /* CONFIG_KVM */
>
> #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 687598e41b21..b726332eec49 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -10,4 +10,4 @@ subdir-ccflags-y := -I$(incdir) \
> -DDISABLE_BRANCH_PROFILING \
> $(DISABLE_STACKLEAK_PLUGIN)
>
> -obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o
> +obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o reserved_mem.o
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> index ed47674bc988..c8af6fe87bfb 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> @@ -6,6 +6,12 @@
>
> #include <linux/types.h>
>
> +#define HYP_MEMBLOCK_REGIONS 128
> +struct hyp_memblock_region {
> + phys_addr_t start;
> + phys_addr_t end;
> +};
> +
> struct hyp_pool;
> struct hyp_page {
> unsigned int refcount;
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> new file mode 100644
> index 000000000000..f0cc09b127a5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_HYP_MM_H
> +#define __KVM_HYP_MM_H
> +
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +#include <linux/types.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/spinlock.h>
> +
> +extern struct hyp_memblock_region kvm_nvhe_sym(hyp_memory)[];
> +extern int kvm_nvhe_sym(hyp_memblock_nr);
> +extern struct kvm_pgtable pkvm_pgtable;
> +extern hyp_spinlock_t pkvm_pgd_lock;
> +extern struct hyp_pool hpool;
> +extern u64 __io_map_base;
> +extern u32 hyp_va_bits;
> +
> +int hyp_create_idmap(void);
> +int hyp_map_vectors(void);
> +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> +int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> +int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> + unsigned long phys, unsigned long prot);
> +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> + unsigned long prot);
> +
> +static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
> + unsigned long *start, unsigned long *end)
> +{
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> + struct hyp_page *p = hyp_phys_to_page(phys);
> +
> + *start = (unsigned long)p;
> + *end = *start + nr_pages * sizeof(struct hyp_page);
> + *start = ALIGN_DOWN(*start, PAGE_SIZE);
> + *end = ALIGN(*end, PAGE_SIZE);
> +}
> +
> +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> +{
> + unsigned long total = 0, i;
> +
> + /* Provision the worst case scenario with 4 levels of page-table */
> + for (i = 0; i < 4; i++) {

Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
header?

> + nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> + total += nr_pages;
> + }

... that said, I'm not sure this needs to iterate at all. What exactly are
you trying to compute?
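
For reference, the arithmetic of the loop as posted can be reproduced in userland as below, assuming a 4 KiB granule (so 512 entries per table). Each pass counts the tables needed at one level to cover the level below it. For 1 GiB, i.e. 262144 pages, that gives 512 + 1 + 1 + 1 = 515 pages of page-table. PGTABLE_LEVELS is a local stand-in for KVM_PGTABLE_MAX_LEVELS:

```c
#define PAGE_SHIFT		12	/* assumption: 4 KiB granule */
#define PTRS_PER_PTE		(1UL << (PAGE_SHIFT - 3))
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define PGTABLE_LEVELS		4	/* stand-in for KVM_PGTABLE_MAX_LEVELS */

/* Same loop as in the patch: pages of page-table needed to map nr_pages. */
static unsigned long pgtable_max_pages(unsigned long nr_pages)
{
	unsigned long total = 0, i;

	for (i = 0; i < PGTABLE_LEVELS; i++) {
		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
		total += nr_pages;
	}
	return total;
}
```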

> +
> + return total;
> +}
> +
> +static inline unsigned long hyp_s1_pgtable_size(void)
> +{
> + struct hyp_memblock_region *reg;
> + unsigned long nr_pages, res = 0;
> + int i;
> +
> + if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> + return 0;

It's a bit grotty having this be signed. Why do we need to encode the error
case differently from the 0 case?

> +
> + for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
> + reg = &kvm_nvhe_sym(hyp_memory)[i];

You could declare reg in the loop body.

> + nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
> + nr_pages = __hyp_pgtable_max_pages(nr_pages);

Maybe it would make more sense for __hyp_pgtable_max_pages to take the
size in bytes rather than pages, since most callers seem to have to do the
conversion?

> + res += nr_pages << PAGE_SHIFT;
> + }
> +
> + /* Allow 1 GiB for private mappings */
> + nr_pages = (1 << 30) >> PAGE_SHIFT;

SZ_1G >> PAGE_SHIFT

> + nr_pages = __hyp_pgtable_max_pages(nr_pages);
> + res += nr_pages << PAGE_SHIFT;
> +
> + return res;

Might make more sense to keep res in pages until here, then just shift when
returning.
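
Putting the suggestions on this function together, a reworked version might look roughly like the userland sketch below. It assumes a 4 KiB granule, takes the region array as a parameter instead of using kvm_nvhe_sym(), and every name in it is a stand-in rather than the final API:

```c
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumption: 4 KiB granule */
#define PTRS_PER_PTE		(1UL << (PAGE_SHIFT - 3))
#define SZ_1G			0x40000000UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

typedef uint64_t phys_addr_t;

struct hyp_memblock_region {
	phys_addr_t start;
	phys_addr_t end;
};

/* Variant taking a size in bytes rather than a number of pages. */
static unsigned long pgtable_max_pages(unsigned long size)
{
	unsigned long nr_pages = size >> PAGE_SHIFT, total = 0;
	int i;

	for (i = 0; i < 4; i++) {
		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
		total += nr_pages;
	}
	return total;
}

/* Unsigned count, 'reg' scoped to the loop, result kept in pages until the end. */
static unsigned long s1_pgtable_size(const struct hyp_memblock_region *regs,
				     unsigned int nr_regs)
{
	unsigned long res = 0;
	unsigned int i;

	if (!nr_regs)
		return 0;

	for (i = 0; i < nr_regs; i++) {
		const struct hyp_memblock_region *reg = &regs[i];

		res += pgtable_max_pages(reg->end - reg->start);
	}

	/* Allow 1 GiB for private mappings */
	res += pgtable_max_pages(SZ_1G);

	return res << PAGE_SHIFT;
}
```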

> +}
> +
> +#endif /* __KVM_HYP_MM_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 72cfe53f106f..d7381a503182 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>
> obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> - cache.o cpufeature.o
> + cache.o cpufeature.o setup.o mm.o
> obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> - ../fpsimd.o ../hyp-entry.o ../exception.o
> + ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
> obj-y += $(lib-objs)
>
> ##
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> index 31b060a44045..ad943966c39f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> @@ -251,4 +251,35 @@ alternative_else_nop_endif
>
> SYM_CODE_END(__kvm_handle_stub_hvc)
>
> +SYM_FUNC_START(__pkvm_init_switch_pgd)
> + /* Turn the MMU off */
> + pre_disable_mmu_workaround
> + mrs x2, sctlr_el2
> + bic x3, x2, #SCTLR_ELx_M
> + msr sctlr_el2, x3
> + isb
> +
> + tlbi alle2
> +
> + /* Install the new pgtables */
> + ldr x3, [x0, #NVHE_INIT_PGD_PA]
> + phys_to_ttbr x4, x3
> +alternative_if ARM64_HAS_CNP
> + orr x4, x4, #TTBR_CNP_BIT
> +alternative_else_nop_endif
> + msr ttbr0_el2, x4
> +
> + /* Set the new stack pointer */
> + ldr x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> + mov sp, x0
> +
> + /* And turn the MMU back on! */
> + dsb nsh
> + isb
> + msr sctlr_el2, x2
> + ic iallu
> + isb
> + ret x1
> +SYM_FUNC_END(__pkvm_init_switch_pgd)
> +
> .popsection
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index a906f9e2ff34..3075f117651c 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -6,12 +6,14 @@
>
> #include <hyp/switch.h>
>
> +#include <asm/pgtable-types.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_emulate.h>
> #include <asm/kvm_host.h>
> #include <asm/kvm_hyp.h>
> #include <asm/kvm_mmu.h>
>
> +#include <nvhe/mm.h>
> #include <nvhe/trap_handler.h>
>
> DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
> @@ -106,6 +108,42 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
> __vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
> }
>
> +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> + DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> +
> + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);

__pkvm_init() doesn't return, so I think this assignment back into host_ctxt
is confusing.

Also, I wonder if these bare numbers would be better hidden behind, e.g.

#define DECLARE_ARG0(...) DECLARE_REG(__VA_ARGS__, 1)
...
#define DECLARE_RET(...) DECLARE_REG(__VA_ARGS__, 1)

but it's cosmetic, so no need to change your patch. Just worried about
off-by-1s causing interesting behaviour!
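
To make the idea concrete, here is a userland sketch with mock versions of the context struct, cpu_reg() and DECLARE_REG(). The DECLARE_ARGn() wrappers follow the suggestion above; hcall_ret() and the add3 handler are hypothetical, and hcall_ret() is written as an lvalue (rather than a DECLARE_RET() declaration) so that the result actually lands in the context:

```c
#include <stdint.h>

/* Userland stand-ins for the hyp context and the trap handler helpers. */
struct kvm_cpu_context {
	struct { uint64_t regs[31]; } regs;
};

#define cpu_reg(ctxt, r)	((ctxt)->regs.regs[r])
#define DECLARE_REG(type, name, ctxt, reg) \
	type name = (type)cpu_reg(ctxt, (reg))

/* Hypothetical wrappers: the register numbers live in exactly one place. */
#define DECLARE_ARG0(...)	DECLARE_REG(__VA_ARGS__, 1)
#define DECLARE_ARG1(...)	DECLARE_REG(__VA_ARGS__, 2)
#define DECLARE_ARG2(...)	DECLARE_REG(__VA_ARGS__, 3)
#define hcall_ret(ctxt)		cpu_reg(ctxt, 1)

static unsigned long add3(unsigned long a, unsigned long b, unsigned long c)
{
	return a + b + c;
}

/* Example handler written with the wrappers instead of bare numbers. */
static void handle_add3(struct kvm_cpu_context *host_ctxt)
{
	DECLARE_ARG0(unsigned long, a, host_ctxt);
	DECLARE_ARG1(unsigned long, b, host_ctxt);
	DECLARE_ARG2(unsigned long, c, host_ctxt);

	hcall_ret(host_ctxt) = add3(a, b, c);
}
```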

> +
> +static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
> +
> + cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> +}
> +
> +static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(unsigned long, start, host_ctxt, 1);
> + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> + DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> + DECLARE_REG(unsigned long, prot, host_ctxt, 4);
> +
> + cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> +}
> +
> +static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> +{
> + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> + DECLARE_REG(size_t, size, host_ctxt, 2);

Why the size_t vs unsigned long discrepancy with pkvm_create_mappings?
Same with phys_addr_t, although that one probably doesn't matter.

Also, the pgtable API uses an enum type for the prot bits.

> + DECLARE_REG(unsigned long, prot, host_ctxt, 3);
> +
> + cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
> +}
> +
> typedef void (*hcall_t)(struct kvm_cpu_context *);
>
> #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
> @@ -125,6 +163,10 @@ static const hcall_t *host_hcall[] = {
> HANDLE_FUNC(__kvm_get_mdcr_el2),
> HANDLE_FUNC(__vgic_v3_save_aprs),
> HANDLE_FUNC(__vgic_v3_restore_aprs),
> + HANDLE_FUNC(__pkvm_init),
> + HANDLE_FUNC(__pkvm_cpu_set_vector),
> + HANDLE_FUNC(__pkvm_create_mappings),
> + HANDLE_FUNC(__pkvm_create_private_mapping),
> };
>
> static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> new file mode 100644
> index 000000000000..f3481646a94e
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -0,0 +1,174 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <[email protected]>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_hyp.h>
> +#include <asm/kvm_mmu.h>
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +
> +#include <nvhe/early_alloc.h>
> +#include <nvhe/gfp.h>
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +#include <nvhe/spinlock.h>
> +
> +struct kvm_pgtable pkvm_pgtable;
> +hyp_spinlock_t pkvm_pgd_lock;
> +u64 __io_map_base;
> +
> +struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> +int hyp_memblock_nr;
> +
> +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> + unsigned long phys, unsigned long prot)
> +{
> + int err;
> +
> + hyp_spin_lock(&pkvm_pgd_lock);
> + err = kvm_pgtable_hyp_map(&pkvm_pgtable, start, size, phys, prot);
> + hyp_spin_unlock(&pkvm_pgd_lock);
> +
> + return err;
> +}
> +
> +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> + unsigned long prot)
> +{
> + unsigned long addr;
> + int ret;
> +
> + hyp_spin_lock(&pkvm_pgd_lock);
> +
> + size = PAGE_ALIGN(size + offset_in_page(phys));

It might just be simpler to require page-aligned size and phys in the
caller. At least, for the vectors that should be straightforward because
I think they're guaranteed not to span a page boundary.

> + addr = __io_map_base;
> + __io_map_base += size;
> +
> + /* Are we overflowing on the vmemmap ? */
> + if (__io_map_base > __hyp_vmemmap) {
> + __io_map_base -= size;
> + addr = 0;

Can we use ERR_PTR(), or does that fail miserably at EL2?
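
For context, the ERR_PTR() family is plain pointer arithmetic, so a userland copy is enough to show the pattern. In the sketch below the helpers are re-implemented locally along the lines of include/linux/err.h, and alloc_private_va() is a hypothetical simplification of the bump allocation in the patch. Whether the real helpers build cleanly at EL2 is a separate question this does not answer:

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO	4095

/* Userland copies of the err.h helpers (pure pointer arithmetic). */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical bump allocator over a private VA window [*base, limit). */
static void *alloc_private_va(unsigned long *base, unsigned long limit,
			      unsigned long size)
{
	unsigned long addr = *base;

	if (size > limit - addr)
		return ERR_PTR(-ENOMEM);

	*base = addr + size;
	return (void *)addr;
}
```

The caller can then tell a failure apart from a valid address with IS_ERR() instead of relying on 0 as the error value.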

> + goto out;
> + }
> +
> + ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
> + if (ret) {
> + addr = 0;
> + goto out;
> + }
> +
> + addr = addr + offset_in_page(phys);
> +out:
> + hyp_spin_unlock(&pkvm_pgd_lock);
> +
> + return addr;
> +}

[...]

> +static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> + unsigned long *per_cpu_base)
> +{
> + void *start, *end, *virt = hyp_phys_to_virt(phys);
> + int ret, i;
> +
> + /* Recreate the hyp page-table using the early page allocator */
> + hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
> + ret = kvm_pgtable_hyp_init(&pkvm_pgtable, hyp_va_bits,
> + &hyp_early_alloc_mm_ops);
> + if (ret)
> + return ret;
> +
> + ret = hyp_create_idmap();
> + if (ret)
> + return ret;
> +
> + ret = hyp_map_vectors();
> + if (ret)
> + return ret;
> +
> + ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_text_start),
> + hyp_symbol_addr(__hyp_text_end),
> + PAGE_HYP_EXEC);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(hyp_symbol_addr(__start_rodata),
> + hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
> + hyp_symbol_addr(__hyp_data_ro_after_init_end),
> + PAGE_HYP_RO);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(hyp_symbol_addr(__bss_start),

__hyp_bss_start

> + hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_bss_end),
> + hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(virt, virt + size - 1, PAGE_HYP);

Why is the range inclusive here?
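
To illustrate: if the callee rounds the start down and the end up to page
boundaries, as create_hyp_mappings() does on the host side, then for a
page-aligned range 'virt + size - 1' and 'virt + size' cover the same
pages, so the '- 1' buys nothing. A small sketch of that arithmetic,
assuming 4 KiB pages; pages_covered() is a hypothetical helper:

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)

/* Number of pages mapped for [start, end) after the callee's rounding. */
static inline uint64_t pages_covered(uint64_t start, uint64_t end)
{
	start &= PAGE_MASK;
	end = PAGE_ALIGN(end);
	return (end - start) >> PAGE_SHIFT;
}
```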

> + if (ret)
> + return ret;
> +
> + for (i = 0; i < hyp_nr_cpus; i++) {
> + start = (void *)kern_hyp_va(per_cpu_base[i]);
> + end = start + PAGE_ALIGN(hyp_percpu_size);
> + ret = pkvm_create_mappings(start, end, PAGE_HYP);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static void update_nvhe_init_params(void)
> +{
> + struct kvm_nvhe_init_params *params;
> + unsigned long i, stack;
> +
> + for (i = 0; i < hyp_nr_cpus; i++) {
> + stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
> + params = per_cpu_ptr(&kvm_init_params, i);
> + params->stack_hyp_va = stack + PAGE_SIZE;
> + params->pgd_pa = __hyp_pa(pkvm_pgtable.pgd);
> + __flush_dcache_area(params, sizeof(*params));
> + }
> +}
> +
> +static void *hyp_zalloc_hyp_page(void *arg)
> +{
> + return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
> +}
> +
> +void __noreturn __pkvm_init_finalise(void)
> +{
> + struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> + struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> + unsigned long nr_pages, used_pages;
> + int ret;
> +
> + /* Now that the vmemmap is backed, install the full-fledged allocator */
> + nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> + used_pages = hyp_early_alloc_nr_pages();
> + ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
> + if (ret)
> + goto out;
> +
> + pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> + pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> + pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> + pkvm_pgtable_mm_ops.get_page = hyp_get_page;
> + pkvm_pgtable_mm_ops.put_page = hyp_put_page;
> + pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
> +
> +out:
> + host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
> + host_ctxt->regs.regs[1] = ret;

Use the cpu_reg() helper for these?

> +
> + __host_enter(host_ctxt);
> +}
> +
> +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> + unsigned long *per_cpu_base)
> +{
> + struct kvm_nvhe_init_params *params;
> + void *virt = hyp_phys_to_virt(phys);
> + void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> + int ret;
> +
> + if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
> + return -EINVAL;
> +
> + hyp_spin_lock_init(&pkvm_pgd_lock);
> + hyp_nr_cpus = nr_cpus;
> +
> + ret = divide_memory_pool(virt, size);
> + if (ret)
> + return ret;
> +
> + ret = recreate_hyp_mappings(phys, size, per_cpu_base);
> + if (ret)
> + return ret;
> +
> + update_nvhe_init_params();
> +
> + /* Jump in the idmap page to switch to the new page-tables */
> + params = this_cpu_ptr(&kvm_init_params);
> + fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__pkvm_init_switch_pgd));
> + fn(__hyp_pa(params), hyp_symbol_addr(__pkvm_init_finalise));
> +
> + unreachable();
> +}
> diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> new file mode 100644
> index 000000000000..32f648992835
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/reserved_mem.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <[email protected]>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/memblock.h>
> +#include <linux/sort.h>
> +
> +#include <asm/kvm_host.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +
> +phys_addr_t hyp_mem_base;
> +phys_addr_t hyp_mem_size;
> +
> +int __init early_init_dt_add_memory_hyp(u64 base, u64 size)
> +{
> + struct hyp_memblock_region *reg;
> +
> + if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
> + kvm_nvhe_sym(hyp_memblock_nr) = -1;
> +
> + if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
> + return -ENOMEM;
> +
> + reg = kvm_nvhe_sym(hyp_memory);
> + reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
> + reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
> + kvm_nvhe_sym(hyp_memblock_nr)++;
> +
> + return 0;
> +}

This isn't called by anything in this patch afaict, so it's a bit tricky to
review, especially as I was trying to see how it interacts with
kvm_hyp_reserve(), which reads hyp_memblock_nr.

> +
> +static int cmp_hyp_memblock(const void *p1, const void *p2)
> +{
> + const struct hyp_memblock_region *r1 = p1;
> + const struct hyp_memblock_region *r2 = p2;
> +
> + return r1->start < r2->start ? -1 : (r1->start > r2->start);
> +}
> +
> +static void __init sort_memblock_regions(void)
> +{
> + sort(kvm_nvhe_sym(hyp_memory),
> + kvm_nvhe_sym(hyp_memblock_nr),
> + sizeof(struct hyp_memblock_region),
> + cmp_hyp_memblock,
> + NULL);
> +}
> +
> +void __init kvm_hyp_reserve(void)
> +{
> + u64 nr_pages, prev;
> +
> + if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> + return;
> +
> + if (kvm_get_mode() != KVM_MODE_PROTECTED)
> + return;
> +
> + if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> + kvm_err("Failed to register hyp memblocks\n");
> + return;
> + }
> +
> + sort_memblock_regions();
> +
> + /*
> + * We don't know the number of possible CPUs yet, so allocate for the
> + * worst case.
> + */
> + hyp_mem_size += NR_CPUS << PAGE_SHIFT;

There was a recent patch bumping NR_CPUS to 512, so this would be 32MB
with 64k pages. Is it possible to return memory to the host later on once
we have a better handle on the number of CPUs in the system?
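
The 32MB figure follows directly from 'NR_CPUS << PAGE_SHIFT'. As a quick
check of the arithmetic (percpu_reservation() is a hypothetical name, and
the page shifts are the usual 12 for 4k and 16 for 64k pages):

```c
#include <stdint.h>

/* Bytes reserved when one page is set aside per possible CPU. */
static inline uint64_t percpu_reservation(uint64_t nr_cpus, unsigned int page_shift)
{
	return nr_cpus << page_shift;
}
```

So 512 CPUs cost 2MB with 4k pages and 32MB with 64k pages.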

> + hyp_mem_size += hyp_s1_pgtable_size();
> +
> + /*
> + * The hyp_vmemmap needs to be backed by pages, but these pages
> + * themselves need to be present in the vmemmap, so compute the number
> + * of pages needed by looking for a fixed point.
> + */
> + nr_pages = 0;
> + do {
> + prev = nr_pages;
> + nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> + nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> + nr_pages += __hyp_pgtable_max_pages(nr_pages);
> + } while (nr_pages != prev);
> + hyp_mem_size += nr_pages << PAGE_SHIFT;
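
The fixed-point search above can be sketched in isolation. The version
below is a simplification: it ignores the page-table pages added by
__hyp_pgtable_max_pages(), takes the metadata entry size and page size as
parameters, and assumes the entry size is smaller than the page size so
that the loop converges. vmemmap_pages() is a hypothetical name:

```c
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Pages needed to hold one metadata entry per pool page, where the
 * metadata pages themselves also need one entry each.
 */
static inline uint64_t vmemmap_pages(uint64_t pool_pages, uint64_t entry_size,
				     uint64_t page_size)
{
	uint64_t nr_pages = 0, prev;

	do {
		prev = nr_pages;
		nr_pages = DIV_ROUND_UP((pool_pages + prev) * entry_size, page_size);
	} while (nr_pages != prev);

	return nr_pages;
}
```

The estimate only ever grows from one iteration to the next, so the loop
stops as soon as adding the metadata pages no longer changes the result.
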
> +
> + hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> + hyp_mem_size, SZ_2M);

Why SZ_2M? Guessing you might mean PMD_SIZE, although then we will probably
want to retry with smaller alignment if the allocation fails as this can
again be large with e.g. 64k pages.

> + if (!hyp_mem_base) {
> + kvm_err("Failed to reserve hyp memory\n");
> + return;
> + }
> + memblock_reserve(hyp_mem_base, hyp_mem_size);
> +
> + kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
> + hyp_mem_base);
> +}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 278e163beda4..3cf9397dabdb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
> .virt_to_phys = kvm_host_pa,
> };
>
> +u32 hyp_va_bits;

Perhaps it would be better to pass this to __pkvm_init() instead of making
it global?

Will

2021-02-03 15:38:09

by Will Deacon

Subject: Re: [RFC PATCH v2 17/26] KVM: arm64: Elevate Hyp mappings creation at EL2

On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> Previous commits have introduced infrastructure at EL2 to enable the Hyp
> code to manage its own memory, and more specifically its stage 1 page
> tables. However, this was preliminary work, and none of it is currently
> in use.
>
> Put all of this together by elevating the hyp mappings creation at EL2
> when memory protection is enabled. In this case, the host kernel running
> at EL1 still creates _temporary_ Hyp mappings, only used while
> initializing the hypervisor, but frees them right after.
>
> As such, all calls to create_hyp_mappings() after kvm init has finished
> turn into hypercalls, as the host now has no 'legal' way to modify the
> hypervisor page tables directly.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 1 -
> arch/arm64/kvm/arm.c | 62 +++++++++++++++++++++++++++++---
> arch/arm64/kvm/mmu.c | 34 ++++++++++++++++++
> 3 files changed, 92 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index d7ebd73ec86f..6c8466a042a9 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -309,6 +309,5 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
> */
> asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
> }
> -
> #endif /* __ASSEMBLY__ */
> #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 6af9204bcd5b..e524682c2ccf 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
> kvm_flush_dcache_to_poc(params, sizeof(*params));
> }
>
> -static void cpu_init_hyp_mode(void)
> +static void kvm_set_hyp_vector(void)

Please do something about the naming: now we have both cpu_set_hyp_vector()
and kvm_set_hyp_vector()!

> {
> struct kvm_nvhe_init_params *params;
> struct arm_smccc_res res;
> @@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
> params = this_cpu_ptr_nvhe_sym(kvm_init_params);
> arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
> WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
> +}
> +
> +static void cpu_init_hyp_mode(void)
> +{
> + kvm_set_hyp_vector();
>
> /*
> * Disabling SSBD on a non-VHE system requires us to enable SSBS
> @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
> struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
> void *vector = hyp_spectre_vector_selector[data->slot];
>
> - *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> + if (!is_protected_kvm_enabled())
> + *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> + else
> + kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);

*Very* minor nit, but it might be cleaner to have static inline functions
with the same prototypes as the hypercalls, just to make the code even
easier to read, e.g.:

if (!is_protected_kvm_enabled())
_cpu_set_vector(data->slot);
else
kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
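
To make the shape concrete, here is a userspace sketch with same-prototype
helpers on both paths and one possible dispatch macro. Everything in it is
hypothetical (protected_mode, the host_/hyp_ prefixes and pkvm_dispatch()
are made up for the illustration, and the "hypercall" is a plain function):

```c
#include <stdbool.h>

static bool protected_mode;	/* stand-in for is_protected_kvm_enabled() */
static int host_slot = -1;	/* last slot set through the direct host path */
static int hyp_slot = -1;	/* last slot set through the hypercall path */

static inline bool is_protected(void)
{
	return protected_mode;
}

/* Both helpers share one prototype, so call sites look identical. */
static inline int host_cpu_set_vector(int slot)
{
	host_slot = slot;
	return 0;
}

static inline int hyp_cpu_set_vector(int slot)
{
	hyp_slot = slot;
	return 0;
}

/* Use the hypercall when protected, the host helper otherwise. */
#define pkvm_dispatch(fn, ...) \
	(is_protected() ? hyp_##fn(__VA_ARGS__) : host_##fn(__VA_ARGS__))
```

A call site then reduces to 'pkvm_dispatch(cpu_set_vector, data->slot);'.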

you could then conceivably wrap that in a macro and avoid having the
"is_protected_kvm_enabled()" checks explicit every time.

> }
>
> static void cpu_hyp_reinit(void)
> @@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
> kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);
>
> cpu_hyp_reset();
> - cpu_set_hyp_vector();
>
> if (is_kernel_in_hyp_mode())
> kvm_timer_init_vhe();
> else
> cpu_init_hyp_mode();
>
> + cpu_set_hyp_vector();
> +
> kvm_arm_init_debug();
>
> if (vgic_present)
> @@ -1714,13 +1723,52 @@ static int copy_cpu_ftr_regs(void)
> return 0;
> }
>
> +static int kvm_hyp_enable_protection(void)
> +{
> + void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
> + int ret, cpu;
> + void *addr;
> +
> + if (!is_protected_kvm_enabled())
> + return 0;

Maybe I'm hung up on my previous suggestion, but I feel like we shouldn't
get here if protected kvm isn't enabled.

> + if (!hyp_mem_base)
> + return -ENOMEM;
> +
> + addr = phys_to_virt(hyp_mem_base);
> + ret = create_hyp_mappings(addr, addr + hyp_mem_size - 1, PAGE_HYP);
> + if (ret)
> + return ret;
> +
> + preempt_disable();
> + kvm_set_hyp_vector();
> + ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
> + num_possible_cpus(), kern_hyp_va(per_cpu_base));

Would it make sense for the __pkvm_init() hypercall to set the vector as
well, so that we wouldn't need to disable preemption over two hypercalls?

Failing that, maybe move the whole preempt_disable/enable sequence into
another function.

> + preempt_enable();
> + if (ret)
> + return ret;
> +
> + free_hyp_pgds();
> + for_each_possible_cpu(cpu)
> + free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
> +
> + return 0;
> +}
> +
> /**
> * Inits Hyp-mode on all online CPUs
> */
> static int init_hyp_mode(void)
> {
> int cpu;
> - int err = 0;
> + int err = -ENOMEM;
> +
> + /*
> + * The protected Hyp-mode cannot be initialized if the memory pool
> + * allocation has failed.
> + */
> + if (is_protected_kvm_enabled() && !hyp_mem_base)
> + return err;
>
> /*
> * Copy the required CPU feature register in their EL2 counterpart
> @@ -1854,6 +1902,12 @@ static int init_hyp_mode(void)
> for_each_possible_cpu(cpu)
> cpu_prepare_hyp_mode(cpu);
>
> + err = kvm_hyp_enable_protection();
> + if (err) {
> + kvm_err("Failed to enable hyp memory protection: %d\n", err);
> + goto out_err;
> + }
> +
> return 0;
>
> out_err:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3cf9397dabdb..9d4c9251208e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
> if (hyp_pgtable) {
> kvm_pgtable_hyp_destroy(hyp_pgtable);
> kfree(hyp_pgtable);
> + hyp_pgtable = NULL;
> }
> mutex_unlock(&kvm_hyp_pgd_mutex);
> }
>
> +static bool kvm_host_owns_hyp_mappings(void)
> +{
> + if (static_branch_likely(&kvm_protected_mode_initialized))
> + return false;
> +
> + /*
> + * This can happen at boot time when __create_hyp_mappings() is called
> + * after the hyp protection has been enabled, but the static key has
> + * not been flipped yet.
> + */
> + if (!hyp_pgtable && is_protected_kvm_enabled())
> + return false;
> +
> + BUG_ON(!hyp_pgtable);

Can we fail more gracefully, e.g. by continuing without KVM?

Will

2021-02-03 15:38:33

by Will Deacon

Subject: Re: [RFC PATCH v2 18/26] KVM: arm64: Use kvm_arch for stage 2 pgtable

On Fri, Jan 08, 2021 at 12:15:16PM +0000, Quentin Perret wrote:
> In order to make use of the stage 2 pgtable code for the host stage 2,
> use struct kvm_arch in lieu of struct kvm as the host will have the
> former but not the latter.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 5 +++--
> arch/arm64/kvm/hyp/pgtable.c | 6 +++---
> arch/arm64/kvm/mmu.c | 2 +-
> 3 files changed, 7 insertions(+), 6 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-03 15:44:05

by Will Deacon

Subject: Re: [RFC PATCH v2 19/26] KVM: arm64: Use kvm_arch in kvm_s2_mmu

On Fri, Jan 08, 2021 at 12:15:17PM +0000, Quentin Perret wrote:
> In order to make use of the stage 2 pgtable code for the host stage 2,
> change kvm_s2_mmu to use a kvm_arch pointer in lieu of the kvm pointer,
> as the host will have the former but not the latter.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 2 +-
> arch/arm64/include/asm/kvm_mmu.h | 7 ++++++-
> arch/arm64/kvm/mmu.c | 8 ++++----
> 3 files changed, 11 insertions(+), 6 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-03 15:58:02

by Will Deacon

Subject: Re: [RFC PATCH v2 21/26] KVM: arm64: Refactor kvm_arm_setup_stage2()

On Fri, Jan 08, 2021 at 12:15:19PM +0000, Quentin Perret wrote:
> In order to re-use some of the stage 2 setup at EL2, factor parts of
> kvm_arm_setup_stage2() out into static inline functions.
>
> No functional change intended.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 48 ++++++++++++++++++++++++++++++++
> arch/arm64/kvm/reset.c | 42 +++-------------------------
> 2 files changed, 52 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 662f0415344e..83b4c5cf4768 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -280,6 +280,54 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
> return ret;
> }
>
> +static inline u64 kvm_get_parange(u64 mmfr0)
> +{
> + u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
> + ID_AA64MMFR0_PARANGE_SHIFT);
> + if (parange > ID_AA64MMFR0_PARANGE_MAX)
> + parange = ID_AA64MMFR0_PARANGE_MAX;
> +
> + return parange;
> +}
> +
> +/*
> + * The VTCR value is common across all the physical CPUs on the system.
> + * We use system wide sanitised values to fill in different fields,
> + * except for Hardware Management of Access Flags. HA Flag is set
> + * unconditionally on all CPUs, as it is safe to run with or without
> + * the feature and the bit is RES0 on CPUs that don't support it.
> + */
> +static inline u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> +{
> + u64 vtcr = VTCR_EL2_FLAGS;
> + u8 lvls;
> +
> + vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
> + vtcr |= VTCR_EL2_T0SZ(phys_shift);
> + /*
> + * Use a minimum 2 level page table to prevent splitting
> + * host PMD huge pages at stage2.
> + */
> + lvls = stage2_pgtable_levels(phys_shift);
> + if (lvls < 2)
> + lvls = 2;
> + vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
> +
> + /*
> + * Enable the Hardware Access Flag management, unconditionally
> + * on all CPUs. The features is RES0 on CPUs without the support
> + * and must be ignored by the CPUs.
> + */
> + vtcr |= VTCR_EL2_HA;
> +
> + /* Set the vmid bits */
> + vtcr |= (get_vmid_bits(mmfr1) == 16) ?
> + VTCR_EL2_VS_16BIT :
> + VTCR_EL2_VS_8BIT;
> +
> + return vtcr;
> +}

Although I think this is functionally fine, I think it's unusual to see
large "static inline" functions like this in shared header files. One
alternative approach would be to follow the example of
kernel/locking/qspinlock_paravirt.h, where the header is guarded in such a
way that it is only ever included by kernel/locking/qspinlock.c and therefore
doesn't need the "inline" at all. That separation really helps, I think.

Will

2021-02-03 15:59:17

by Will Deacon

Subject: Re: [RFC PATCH v2 22/26] KVM: arm64: Refactor __load_guest_stage2()

On Fri, Jan 08, 2021 at 12:15:20PM +0000, Quentin Perret wrote:
> Refactor __load_guest_stage2() to introduce __load_stage2() which will
> be re-used when loading the host stage 2.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-03 16:01:02

by Will Deacon

Subject: Re: [RFC PATCH v2 23/26] KVM: arm64: Refactor __populate_fault_info()

On Fri, Jan 08, 2021 at 12:15:21PM +0000, Quentin Perret wrote:
> Refactor __populate_fault_info() to introduce __get_fault_info() which
> will be used once the host is wrapped in a stage 2.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +++++++++++++++----------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
> index 84473574c2e7..e9005255d639 100644
> --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> @@ -157,19 +157,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
> return true;
> }
>
> -static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
> +static inline bool __get_fault_info(u64 esr, u64 *far, u64 *hpfar)

Could this take a pointer to a struct kvm_vcpu_fault_info instead?

Will

2021-02-03 16:05:52

by Will Deacon

Subject: Re: [RFC PATCH v2 24/26] KVM: arm64: Make memcache anonymous in pgtable allocator

On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> The current stage2 page-table allocator uses a memcache to get
> pre-allocated pages when it needs any. To allow re-using this code at
> EL2 which uses a concept of memory pools, make the memcache argument to
> kvm_pgtable_stage2_map() anonymous, and let the mm_ops zalloc_page()
> callbacks use it the way they need to.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
> arch/arm64/kvm/hyp/pgtable.c | 4 ++--
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 8e8f1d2c5e0e..d846bc3d3b77 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> * @size: Size of the mapping.
> * @phys: Physical address of the memory to map.
> * @prot: Permissions and attributes for the mapping.
> - * @mc: Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> - * allocate page-table pages.
> + * @mc: Cache of pre-allocated memory from which to allocate page-table
> + * pages.

We should probably mention that this memory must be zeroed, since I don't
think the page-table code takes care of that.

Will

2021-02-03 16:15:27

by Will Deacon

Subject: Re: [RFC PATCH v2 26/26] KVM: arm64: Wrap the host with a stage 2

On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> to give the hypervisor some control over the host memory accesses. At
> the moment all memory aborts from the host will be instantly idmapped
> RWX at stage 2 in a lazy fashion. Later patches will make use of that
> infrastructure to implement access control restrictions to e.g. protect
> guest memory from the host.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_cpufeature.h | 2 +
> arch/arm64/kernel/image-vars.h | 3 +
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 33 +++
> arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> arch/arm64/kvm/hyp/nvhe/hyp-init.S | 1 +
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 6 +
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 191 ++++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/setup.c | 6 +
> arch/arm64/kvm/hyp/nvhe/switch.c | 7 +-
> arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
> 10 files changed, 248 insertions(+), 7 deletions(-)
> create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c

[...]

> +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> +{
> + enum kvm_pgtable_prot prot;
> + u64 far, hpfar, esr, ipa;
> + int ret;
> +
> + esr = read_sysreg_el2(SYS_ESR);
> + if (!__get_fault_info(esr, &far, &hpfar))
> + hyp_panic();
> +
> + prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> + ipa = (hpfar & HPFAR_MASK) << 8;
> + ret = host_stage2_map(ipa, PAGE_SIZE, prot);

Can we try to put down a block mapping if the whole thing falls within
memory?
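
The check I have in mind could look roughly like the sketch below: find
the naturally aligned block containing the faulting IPA and see whether
all of it sits inside a known memory region. block_fits() and its
parameters are hypothetical, and the region is assumed to be a single
half-open [start, end) range:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the naturally aligned block of block_size bytes that
 * contains ipa lies entirely within [reg_start, reg_end).
 * block_size must be a power of two.
 */
static inline bool block_fits(uint64_t reg_start, uint64_t reg_end,
			      uint64_t ipa, uint64_t block_size)
{
	uint64_t base = ipa & ~(block_size - 1);

	return base >= reg_start && base + block_size <= reg_end;
}
```

If the check fails, the handler would simply fall back to the PAGE_SIZE
mapping it installs today.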

Will

2021-02-03 16:15:41

by Will Deacon

Subject: Re: [RFC PATCH v2 20/26] KVM: arm64: Set host stage 2 using kvm_nvhe_init_params

On Fri, Jan 08, 2021 at 12:15:18PM +0000, Quentin Perret wrote:
> Move the registers relevant to host stage 2 enablement to
> kvm_nvhe_init_params to prepare the ground for enabling it in later
> patches.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 3 +++
> arch/arm64/kernel/asm-offsets.c | 3 +++
> arch/arm64/kvm/arm.c | 5 +++++
> arch/arm64/kvm/hyp/nvhe/hyp-init.S | 9 +++++++++
> arch/arm64/kvm/hyp/nvhe/switch.c | 5 +----
> 5 files changed, 21 insertions(+), 4 deletions(-)

Acked-by: Will Deacon <[email protected]>

Will

2021-02-03 18:37:32

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

Hey Will,

On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> Hi Quentin,
>
> Sorry for the delay, this one took me a while to grok.

No need to be sorry, thanks for having a look!

> On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > When memory protection is enabled, the hyp code will require a basic
> > form of memory management in order to allocate and free memory pages at
> > EL2. This is needed for various use-cases, including the creation of hyp
> > mappings or the allocation of stage 2 page tables.
> >
> > To address these use-cases, introduce a simple memory allocator in the
> > hyp code. The allocator is designed as a conventional 'buddy allocator',
> > working with a page granularity. It allows to allocate and free
> > physically contiguous pages from memory 'pools', with a guaranteed order
> > alignment in the PA space. Each page in a memory pool is associated
> > with a struct hyp_page which holds the page's metadata, including its
> > refcount, as well as its current order, hence mimicking the kernel's
> > buddy system in the GFP infrastructure. The hyp_page metadata are made
> > accessible through a hyp_vmemmap, following the concept of
> > SPARSE_VMEMMAP in the kernel.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/gfp.h | 32 ++++
> > arch/arm64/kvm/hyp/include/nvhe/memory.h | 25 +++
> > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > arch/arm64/kvm/hyp/nvhe/page_alloc.c | 185 +++++++++++++++++++++++
> > 4 files changed, 243 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
> > create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> > new file mode 100644
> > index 000000000000..95587faee171
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +#ifndef __KVM_HYP_GFP_H
> > +#define __KVM_HYP_GFP_H
> > +
> > +#include <linux/list.h>
> > +
> > +#include <nvhe/memory.h>
> > +#include <nvhe/spinlock.h>
> > +
> > +#define HYP_MAX_ORDER 11U
>
> Could we just use MAX_ORDER here?

Sure, that would work too. I just figured we might decide to set this to
a lower value in the future -- this effectively limits the size of the
largest portion of memory we can allocate, so maybe it would make sense
to set that to match the size of the largest concatenated pgd for example,
hence minimizing the overhead of struct hyp_pool. But I suppose we
can also do that later, so...

> > +#define HYP_NO_ORDER UINT_MAX
> > +
> > +struct hyp_pool {
> > + hyp_spinlock_t lock;
>
> A comment about what this lock protects would be handy, especially as the
> 'refcount' field of 'struct hyp_page' isn't updated atomically. I think it
> also means that we don't have a safe way to move a page from one pool to
> another; it's fixed forever once the page has been made available for
> allocation.

Indeed, there is currently no good way to do this. I'll add a comment.

> > + struct list_head free_area[HYP_MAX_ORDER + 1];
> > + phys_addr_t range_start;
> > + phys_addr_t range_end;
> > +};
> > +
> > +/* GFP flags */
> > +#define HYP_GFP_NONE 0
> > +#define HYP_GFP_ZERO 1
> > +
> > +/* Allocation */
> > +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);
> > +void hyp_get_page(void *addr);
> > +void hyp_put_page(void *addr);
> > +
> > +/* Used pages cannot be freed */
> > +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> > + unsigned int nr_pages, unsigned int used_pages);
>
> Maybe "reserved_pages" would be a better name than "used_pages"?

That works too. These pages could maybe use a bit of love as well,
they're the pages that have been allocated by the early allocator before
we hand over the memory pool to this allocator. So we might want to do
something about them (such as fixup their refcount).

> > +#endif /* __KVM_HYP_GFP_H */
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > index 64c44c142c95..ed47674bc988 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > @@ -6,7 +6,17 @@
> >
> > #include <linux/types.h>
> >
> > +struct hyp_pool;
> > +struct hyp_page {
> > + unsigned int refcount;
> > + unsigned int order;
> > + struct hyp_pool *pool;
> > + struct list_head node;
> > +};
> > +
> > extern s64 hyp_physvirt_offset;
> > +extern u64 __hyp_vmemmap;
> > +#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)
> >
> > #define __hyp_pa(virt) ((phys_addr_t)(virt) + hyp_physvirt_offset)
> > #define __hyp_va(virt) ((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
> > @@ -21,4 +31,19 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
> > return __hyp_pa(addr);
> > }
> >
> > +#define hyp_phys_to_pfn(phys) ((phys) >> PAGE_SHIFT)
> > +#define hyp_phys_to_page(phys) (&hyp_vmemmap[hyp_phys_to_pfn(phys)])
> > +#define hyp_virt_to_page(virt) hyp_phys_to_page(__hyp_pa(virt))
> > +
> > +#define hyp_page_to_phys(page) ((phys_addr_t)((page) - hyp_vmemmap) << PAGE_SHIFT)
>
> Maybe implement this in terms of a new hyp_page_to_pfn() macro?

Sure, should be easy enough.
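
Something along these lines, as a rough userspace sketch: the struct is
trimmed down, vmemmap[] stands in for hyp_vmemmap, 4 KiB pages are
assumed, and hyp_pfn_to_phys() is an extra helper that is not in the patch:

```c
#include <stdint.h>

#define PAGE_SHIFT 12

struct hyp_page {
	unsigned int refcount;
	unsigned int order;
};

/* Stand-in for the hyp_vmemmap array of per-page metadata. */
static struct hyp_page vmemmap[16];

/* A page's pfn is its index in the vmemmap. */
#define hyp_page_to_pfn(page)	((uint64_t)((page) - vmemmap))
#define hyp_pfn_to_phys(pfn)	((uint64_t)(pfn) << PAGE_SHIFT)
#define hyp_page_to_phys(page)	hyp_pfn_to_phys(hyp_page_to_pfn(page))
```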

> > +#define hyp_page_to_virt(page) __hyp_va(hyp_page_to_phys(page))
> > +#define hyp_page_to_pool(page) (((struct hyp_page *)page)->pool)
> > +
> > +static inline int hyp_page_count(void *addr)
> > +{
> > + struct hyp_page *p = hyp_virt_to_page(addr);
> > +
> > + return p->refcount;
> > +}
> > +
> > #endif /* __KVM_HYP_MEMORY_H */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 33bd381d8f73..9e5eacfec6ec 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >
> > obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > - hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > + hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
> > obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > ../fpsimd.o ../hyp-entry.o ../exception.o
> > obj-y += $(lib-objs)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> > new file mode 100644
> > index 000000000000..6de6515f0432
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> > @@ -0,0 +1,185 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <[email protected]>
> > + */
> > +
> > +#include <asm/kvm_hyp.h>
> > +#include <nvhe/gfp.h>
> > +
> > +u64 __hyp_vmemmap;
> > +
> > +/*
> > + * Example buddy-tree for a 4-pages physically contiguous pool:
> > + *
> > + * o : Page 3
> > + * /
> > + * o-o : Page 2
> > + * /
> > + * / o : Page 1
> > + * / /
> > + * o---o-o : Page 0
> > + * Order 2 1 0
> > + *
> > + * Example of requests on this zon:
>
> typo: zone

s/zon/pool, even :)

> > + * __find_buddy(pool, page 0, order 0) => page 1
> > + * __find_buddy(pool, page 0, order 1) => page 2
> > + * __find_buddy(pool, page 1, order 0) => page 0
> > + * __find_buddy(pool, page 2, order 0) => page 3
> > + */
> > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > + unsigned int order)
> > +{
> > + phys_addr_t addr = hyp_page_to_phys(p);
> > +
> > + addr ^= (PAGE_SIZE << order);
> > + if (addr < pool->range_start || addr >= pool->range_end)
> > + return NULL;
>
> Are these range checks only needed because the pool isn't required to be
> an exact power-of-2 pages in size? If so, maybe it would be more
> straightforward to limit the max order on a per-pool basis depending upon
> its size?

More importantly, it is because pages outside of the pool are not
guaranteed to be covered by the hyp_vmemmap, so I really need to make
sure I don't dereference them.
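
To spell out the lookup: the buddy address is obtained by flipping the
bit that corresponds to the block size, and anything landing outside the
pool is rejected before it can be turned into a hyp_page pointer. A
userspace sketch working on raw addresses, assuming 4 KiB pages;
find_buddy() and NO_BUDDY are hypothetical names:

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define NO_BUDDY	UINT64_MAX

/*
 * Physical address of the buddy of the block of the given order at addr,
 * or NO_BUDDY if it lies outside [range_start, range_end).
 */
static inline uint64_t find_buddy(uint64_t range_start, uint64_t range_end,
				  uint64_t addr, unsigned int order)
{
	addr ^= PAGE_SIZE << order;
	if (addr < range_start || addr >= range_end)
		return NO_BUDDY;

	return addr;
}
```

For a 4-page pool starting at 0 this reproduces the examples from the
comment in the patch, and for a 3-page pool the buddy of page 2 is
correctly reported as missing.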

> > +
> > + return hyp_phys_to_page(addr);
> > +}
> > +
> > +static void __hyp_attach_page(struct hyp_pool *pool,
> > + struct hyp_page *p)
> > +{
> > + unsigned int order = p->order;
> > + struct hyp_page *buddy;
> > +
> > + p->order = HYP_NO_ORDER;
>
> Why is this needed?

If p->order is say 3, I may be able to coalesce with the buddy of order
3 to form a higher order page of order 4. And that higher order page
will be represented by the 'first' of the two order-3 pages (let's call
it the head), and the other order 3 page (let's say the tail) will be
assigned 'HYP_NO_ORDER'.

And basically at this point I don't know if 'p' is going to be the head or
the tail, so I set it to HYP_NO_ORDER a priori so I don't have to think
about this in the loop below. Is that helping?

I suppose this could use more comments as well ...
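
As a rough illustration of the head/tail point above, here is a toy model of the coalescing loop over an 8-page pool. It uses page indices instead of struct hyp_page pointers and a boolean instead of free-list membership; all names (toy_page, toy_attach, TOY_*) are made up for the sketch:

```c
#include <stdbool.h>

#define TOY_NR_PAGES  8
#define TOY_MAX_ORDER 3
#define TOY_NO_ORDER  0xff

struct toy_page {
	unsigned char order;	/* TOY_NO_ORDER unless the page heads a block */
	bool free;		/* stands in for "is on a free list" */
};

/* Zero-initialised: every page starts as an allocated order-0 page. */
static struct toy_page pages[TOY_NR_PAGES];

/* Free page 'idx', merging with free buddies; returns the head index. */
static unsigned int toy_attach(unsigned int idx)
{
	unsigned int order = pages[idx].order;

	/* We don't know yet whether this page ends up as head or tail. */
	pages[idx].order = TOY_NO_ORDER;

	for (; order < TOY_MAX_ORDER; order++) {
		unsigned int buddy = idx ^ (1u << order);

		/* Stop if the buddy is not a free block of the same order. */
		if (buddy >= TOY_NR_PAGES || !pages[buddy].free ||
		    pages[buddy].order != order)
			break;

		/* Merge: the lower of the two indices becomes the head. */
		pages[buddy].free = false;
		pages[buddy].order = TOY_NO_ORDER;
		if (buddy < idx)
			idx = buddy;
	}

	/* Only the final head carries a valid order. */
	pages[idx].order = order;
	pages[idx].free = true;
	return idx;
}
```

Freeing pages 0, 1, 3 and then 2 in this model ends with a single order-2 block headed by page 0, and pages 1 to 3 all marked TOY_NO_ORDER without any special-casing inside the loop.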

>
> > + for (; order < HYP_MAX_ORDER; order++) {
> > + /* Nothing to do if the buddy isn't in a free-list */
> > + buddy = __find_buddy(pool, p, order);
> > + if (!buddy || list_empty(&buddy->node) || buddy->order != order)
>
> Could we move the "buddy->order" check into __find_buddy()?

I think that might break __hyp_extract_page() below. The way I think about
__find_buddy() is as a low level function which gives you the buddy page
blindly if it exists in the hyp_vmemmap, and it's up to the callers to
decide whether the buddy is in the right state for their use or not.

Again, a comment would help I guess.

>
> > + break;
> > +
> > + /* Otherwise, coalesce the buddies and go one level up */
> > + list_del_init(&buddy->node);
> > + buddy->order = HYP_NO_ORDER;
> > + p = (p < buddy) ? p : buddy;
> > + }
> > +
> > + p->order = order;
> > + list_add_tail(&p->node, &pool->free_area[order]);
> > +}
> > +
> > +void hyp_put_page(void *addr)
> > +{
> > + struct hyp_page *p = hyp_virt_to_page(addr);
> > + struct hyp_pool *pool = hyp_page_to_pool(p);
> > +
> > + hyp_spin_lock(&pool->lock);
> > + if (!p->refcount)
> > + hyp_panic();
> > + p->refcount--;
> > + if (!p->refcount)
> > + __hyp_attach_page(pool, p);
> > + hyp_spin_unlock(&pool->lock);
> > +}
> > +
> > +void hyp_get_page(void *addr)
> > +{
> > + struct hyp_page *p = hyp_virt_to_page(addr);
> > + struct hyp_pool *pool = hyp_page_to_pool(p);
> > +
> > + hyp_spin_lock(&pool->lock);
> > + p->refcount++;
> > + hyp_spin_unlock(&pool->lock);
>
> We should probably have a proper atomic refcount type for this along the
> lines of refcount_t. Even if initially that is implemented with a lock, it
> would be good to hide that behind a refcount API.

Makes sense, I'll introduce wrappers around these.
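
For the sake of discussion, such wrappers could look roughly like the sketch below. The names (sketch_page, sketch_page_ref_inc, sketch_page_ref_dec_and_test) are hypothetical, and this version is not atomic: it assumes the caller still holds the pool lock, which is the "initially implemented with a lock" variant mentioned above.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the refcount field of struct hyp_page. */
struct sketch_page {
	unsigned int refcount;
};

/*
 * Take a reference. Returns false and leaves the count untouched if it
 * would overflow. Caller is assumed to hold the pool lock.
 */
static bool sketch_page_ref_inc(struct sketch_page *p)
{
	if (p->refcount == UINT_MAX)
		return false;

	p->refcount++;
	return true;
}

/*
 * Drop a reference. Returns 1 if the count reached zero (the page can go
 * back to the allocator), 0 if references remain, and -1 on underflow.
 * Caller is assumed to hold the pool lock.
 */
static int sketch_page_ref_dec_and_test(struct sketch_page *p)
{
	if (!p->refcount)
		return -1;

	p->refcount--;
	return p->refcount == 0;
}
```

Hiding the counter behind helpers like these would let the implementation move to real atomics later without touching hyp_get_page() and hyp_put_page() callers.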
>
> > +}
> > +
> > +/* Extract a page from the buddy tree, at a specific order */
> > +static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
> > + struct hyp_page *p,
> > + unsigned int order)
> > +{
> > + struct hyp_page *buddy;
> > +
> > + if (p->order == HYP_NO_ORDER || p->order < order)
> > + return NULL;
>
> Can you drop the explicit HYP_NO_ORDER check here?

I think so, yes.

> > +
> > + list_del_init(&p->node);
> > +
> > + /* Split the page in two until reaching the requested order */
> > + while (p->order > order) {
> > + p->order--;
> > + buddy = __find_buddy(pool, p, p->order);
> > + buddy->order = p->order;
> > + list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
> > + }
> > +
> > + p->refcount = 1;
> > +
> > + return p;
> > +}
> > +
> > +static void clear_hyp_page(struct hyp_page *p)
> > +{
> > + unsigned long i;
> > +
> > + for (i = 0; i < (1 << p->order); i++)
> > + clear_page(hyp_page_to_virt(p) + (i << PAGE_SHIFT));
>
> I wonder if this is actually any better than a memset(0)? That should use
> DC ZVA as appropriate afaict.

I think that makes sense, and would allow us to drop the EL2 dependency
on clear_page(), so I'll do the change for v3.

> > +static void *__hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask,
> > + unsigned int order)
> > +{
> > + unsigned int i = order;
> > + struct hyp_page *p;
> > +
> > + /* Look for a high-enough-order page */
> > + while (i <= HYP_MAX_ORDER && list_empty(&pool->free_area[i]))
> > + i++;
> > + if (i > HYP_MAX_ORDER)
> > + return NULL;
> > +
> > + /* Extract it from the tree at the right order */
> > + p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
> > + p = __hyp_extract_page(pool, p, order);
> > +
> > + if (mask & HYP_GFP_ZERO)
> > + clear_hyp_page(p);
>
> Do we have a use-case where skipping the zeroing is worthwhile? If not,
> it might make some sense to zero on the freeing path instead.

And during hyp_pool_init(), but this should be infrequent, so yes I
think this is preferable. I'll get rid of HYP_GFP_ZERO altogether.

>
> > +
> > + return p;
> > +}
> > +
> > +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order)
> > +{
> > + struct hyp_page *p;
> > +
> > + hyp_spin_lock(&pool->lock);
> > + p = __hyp_alloc_pages(pool, mask, order);
> > + hyp_spin_unlock(&pool->lock);
> > +
> > + return p ? hyp_page_to_virt(p) : NULL;
>
> It looks weird not having __hyp_alloc_pages return the VA, but I guess later
> patches will use __hyp_alloc_pages() for something else.

Actually no, this can be simplified.

> > +}
> > +
> > +/* hyp_vmemmap must be backed beforehand */
> > +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> > + unsigned int nr_pages, unsigned int used_pages)
> > +{
> > + struct hyp_page *p;
> > + int i;
> > +
> > + if (phys % PAGE_SIZE)
> > + return -EINVAL;
>
> Maybe just take a pfn instead?
>
> > + hyp_spin_lock_init(&pool->lock);
> > + for (i = 0; i <= HYP_MAX_ORDER; i++)
> > + INIT_LIST_HEAD(&pool->free_area[i]);
> > + pool->range_start = phys;
> > + pool->range_end = phys + (nr_pages << PAGE_SHIFT);
> > +
> > + /* Init the vmemmap portion */
> > + p = hyp_phys_to_page(phys);
> > + memset(p, 0, sizeof(*p) * nr_pages);
> > + for (i = 0; i < nr_pages; i++, p++) {
> > + p->pool = pool;
> > + INIT_LIST_HEAD(&p->node);
> > + }
>
> Maybe index p like an array (e.g. p[i]) instead of maintaining two loop
> increments?
>
> > +
> > + /* Attach the unused pages to the buddy tree */
> > + p = hyp_phys_to_page(phys + (used_pages << PAGE_SHIFT));
> > + for (i = used_pages; i < nr_pages; i++, p++)
> > + __hyp_attach_page(pool, p);
>
> Likewise.

And ack for these 3 comments.

Cheers,
Quentin

2021-02-04 10:52:26

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> > +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> > +{
> > + unsigned long total = 0, i;
> > +
> > + /* Provision the worst case scenario with 4 levels of page-table */
> > + for (i = 0; i < 4; i++) {
>
> Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
> header?

Will do.

>
> > + nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> > + total += nr_pages;
> > + }
>
> ... that said, I'm not sure this needs to iterate at all. What exactly are
> you trying to compute?

I'm trying to figure out how many pages I will need to construct a
page-table covering nr_pages contiguous pages. The first iteration tells
me how many level 0 pages I need to cover nr_pages, the second iteration
how many level 1 pages I need to cover the level 0 pages, and so on...

I might be doing this naively though. Got a better idea?
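
To put numbers on it, here is the same computation as a stand-alone sketch. The constants are assumptions (512 entries per table as with 4 KiB pages, 4 levels) and the SKETCH_ names are not from the patch:

```c
#define SKETCH_PTRS_PER_PTE	512UL
#define SKETCH_MAX_LEVELS	4
#define SKETCH_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Worst-case number of page-table pages needed to map 'nr_pages'
 * contiguous pages: each iteration counts the tables of one level, which
 * in turn become the entries the next level up has to cover.
 */
static unsigned long sketch_pgtable_max_pages(unsigned long nr_pages)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < SKETCH_MAX_LEVELS; i++) {
		nr_pages = SKETCH_DIV_ROUND_UP(nr_pages, SKETCH_PTRS_PER_PTE);
		total += nr_pages;
	}

	return total;
}
```

For 1 GiB of 4 KiB pages (262144 pages) this gives 512 last-level tables plus one table at each of the three levels above, so 515 pages. A single page still costs 4 table pages, one per level.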

> > +
> > + return total;
> > +}
> > +
> > +static inline unsigned long hyp_s1_pgtable_size(void)
> > +{
> > + struct hyp_memblock_region *reg;
> > + unsigned long nr_pages, res = 0;
> > + int i;
> > +
> > + if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> > + return 0;
>
> It's a bit grotty having this be signed. Why do we need to encode the error
> case differently from the 0 case?

Here specifically we don't, but it is needed in early_init_dt_add_memory_hyp()
to distinguish the overflow case from the first memblock being added.
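
A reduced model of that logic, with a hypothetical 2-entry table and made-up names (sketch_region, sketch_add_region), shows why the counter is signed: once the table overflows, the count is latched at -1 so that later readers can tell "registration failed" apart from "nothing registered yet".

```c
#include <errno.h>

#define SKETCH_REGIONS 2

struct sketch_region {
	unsigned long long start;
	unsigned long long end;
};

static struct sketch_region sketch_regions[SKETCH_REGIONS];
static int sketch_nr_regions;	/* -1 means the table overflowed */

static int sketch_add_region(unsigned long long base, unsigned long long size)
{
	/* Latch the error state on the first overflowing call. */
	if (sketch_nr_regions >= SKETCH_REGIONS)
		sketch_nr_regions = -1;

	/* Every call after an overflow fails as well. */
	if (sketch_nr_regions < 0)
		return -ENOMEM;

	sketch_regions[sketch_nr_regions].start = base;
	sketch_regions[sketch_nr_regions].end = base + size;
	sketch_nr_regions++;

	return 0;
}
```

An unsigned counter plus a separate "overflowed" flag would carry the same information; the signed sentinel just folds both into one variable.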

>
> > +
> > + for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
> > + reg = &kvm_nvhe_sym(hyp_memory)[i];
>
> You could declare reg in the loop body.

I found it prettier like that and assumed the compiler would optimize it
anyway, but ok.

> > + nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
> > + nr_pages = __hyp_pgtable_max_pages(nr_pages);
>
> Maybe it would make more sense for __hyp_pgtable_max_pages to take the
> size in bytes rather than pages, since most callers seem to have to do the
> conversion?

Yes, and it seems I can apply small cleanups in other places, so I'll
fix this throughout the patch.

> > + res += nr_pages << PAGE_SHIFT;
> > + }
> > +
> > + /* Allow 1 GiB for private mappings */
> > + nr_pages = (1 << 30) >> PAGE_SHIFT;
>
> SZ_1G >> PAGE_SHIFT

Much nicer, thanks. I was also considering adding a Kconfig option for
that, because 1GiB is totally arbitrary. Thoughts?

> > + nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > + res += nr_pages << PAGE_SHIFT;
> > +
> > + return res;
>
> Might make more sense to keep res in pages until here, then just shift when
> returning.
>
> > +}
> > +
> > +#endif /* __KVM_HYP_MM_H */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 72cfe53f106f..d7381a503182 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >
> > obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> > - cache.o cpufeature.o
> > + cache.o cpufeature.o setup.o mm.o
> > obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > - ../fpsimd.o ../hyp-entry.o ../exception.o
> > + ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
> > obj-y += $(lib-objs)
> >
> > ##
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > index 31b060a44045..ad943966c39f 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > @@ -251,4 +251,35 @@ alternative_else_nop_endif
> >
> > SYM_CODE_END(__kvm_handle_stub_hvc)
> >
> > +SYM_FUNC_START(__pkvm_init_switch_pgd)
> > + /* Turn the MMU off */
> > + pre_disable_mmu_workaround
> > + mrs x2, sctlr_el2
> > + bic x3, x2, #SCTLR_ELx_M
> > + msr sctlr_el2, x3
> > + isb
> > +
> > + tlbi alle2
> > +
> > + /* Install the new pgtables */
> > + ldr x3, [x0, #NVHE_INIT_PGD_PA]
> > + phys_to_ttbr x4, x3
> > +alternative_if ARM64_HAS_CNP
> > + orr x4, x4, #TTBR_CNP_BIT
> > +alternative_else_nop_endif
> > + msr ttbr0_el2, x4
> > +
> > + /* Set the new stack pointer */
> > + ldr x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > + mov sp, x0
> > +
> > + /* And turn the MMU back on! */
> > + dsb nsh
> > + isb
> > + msr sctlr_el2, x2
> > + ic iallu
> > + isb
> > + ret x1
> > +SYM_FUNC_END(__pkvm_init_switch_pgd)
> > +
> > .popsection
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index a906f9e2ff34..3075f117651c 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -6,12 +6,14 @@
> >
> > #include <hyp/switch.h>
> >
> > +#include <asm/pgtable-types.h>
> > #include <asm/kvm_asm.h>
> > #include <asm/kvm_emulate.h>
> > #include <asm/kvm_host.h>
> > #include <asm/kvm_hyp.h>
> > #include <asm/kvm_mmu.h>
> >
> > +#include <nvhe/mm.h>
> > #include <nvhe/trap_handler.h>
> >
> > DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
> > @@ -106,6 +108,42 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
> > __vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
> > }
> >
> > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > + DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > +
> > + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
>
> __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> is confusing.

Very good point, I'll get rid of this.

>
> Also, I wonder if these bare numbers would be better hidden behind, e.g.
>
> #define DECLARE_ARG0(...) DECLARE_REG(__VA_ARGS__, 1)
> ...
> #define DECLARE_RET(...) DECLARE_REG(__VA_ARGS__, 1)
>
> but it's cosmetic, so no need to change your patch. Just worried about
> off-by-1s causing interesting behaviour!

Works for me, but I'll leave this with Marc.
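
To illustrate what the suggestion would buy, here is a toy version with a fake register file. Everything prefixed SKETCH_/sketch_ is hypothetical; the only thing taken from the patch is the convention that register 0 carries the function ID or status, so the first argument and the return value both live in register 1:

```c
/* Toy register file standing in for struct kvm_cpu_context. */
struct sketch_ctxt {
	unsigned long regs[8];
};

#define SKETCH_DECLARE_REG(type, name, ctxt, r) \
	type name = (type)(ctxt)->regs[(r)]

/* Argument N lives in register N + 1; the offset is written only here. */
#define SKETCH_DECLARE_ARG0(type, name, ctxt) \
	SKETCH_DECLARE_REG(type, name, ctxt, 1)
#define SKETCH_DECLARE_ARG1(type, name, ctxt) \
	SKETCH_DECLARE_REG(type, name, ctxt, 2)

/* The return value goes back through register 1. */
#define SKETCH_SET_RET(ctxt, val) \
	((ctxt)->regs[1] = (unsigned long)(val))

/* Example handler: no bare register numbers left to get wrong. */
static void sketch_handle_add(struct sketch_ctxt *ctxt)
{
	SKETCH_DECLARE_ARG0(unsigned long, a, ctxt);
	SKETCH_DECLARE_ARG1(unsigned long, b, ctxt);

	SKETCH_SET_RET(ctxt, a + b);
}
```

The point is purely cosmetic, as noted above: the "+ 1" appears once in the macro definitions rather than in every handler.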

> > +
> > +static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
> > +
> > + cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> > +}
> > +
> > +static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(unsigned long, start, host_ctxt, 1);
> > + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > + DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> > + DECLARE_REG(unsigned long, prot, host_ctxt, 4);
> > +
> > + cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> > +}
> > +
> > +static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> > +{
> > + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > + DECLARE_REG(size_t, size, host_ctxt, 2);
>
> Why the size_t vs unsigned long discrepancy with pkvm_create_mappings?
> Same with phys_addr_t, although that one probably doesn't matter.
>
> Also, the pgtable API uses an enum type for the prot bits.

Yes this needs cleaning up.

> > + DECLARE_REG(unsigned long, prot, host_ctxt, 3);
> > +
> > + cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
> > +}
> > +
> > typedef void (*hcall_t)(struct kvm_cpu_context *);
> >
> > #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
> > @@ -125,6 +163,10 @@ static const hcall_t *host_hcall[] = {
> > HANDLE_FUNC(__kvm_get_mdcr_el2),
> > HANDLE_FUNC(__vgic_v3_save_aprs),
> > HANDLE_FUNC(__vgic_v3_restore_aprs),
> > + HANDLE_FUNC(__pkvm_init),
> > + HANDLE_FUNC(__pkvm_cpu_set_vector),
> > + HANDLE_FUNC(__pkvm_create_mappings),
> > + HANDLE_FUNC(__pkvm_create_private_mapping),
> > };
> >
> > static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> > new file mode 100644
> > index 000000000000..f3481646a94e
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> > @@ -0,0 +1,174 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <[email protected]>
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <asm/kvm_hyp.h>
> > +#include <asm/kvm_mmu.h>
> > +#include <asm/kvm_pgtable.h>
> > +#include <asm/spectre.h>
> > +
> > +#include <nvhe/early_alloc.h>
> > +#include <nvhe/gfp.h>
> > +#include <nvhe/memory.h>
> > +#include <nvhe/mm.h>
> > +#include <nvhe/spinlock.h>
> > +
> > +struct kvm_pgtable pkvm_pgtable;
> > +hyp_spinlock_t pkvm_pgd_lock;
> > +u64 __io_map_base;
> > +
> > +struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> > +int hyp_memblock_nr;
> > +
> > +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> > + unsigned long phys, unsigned long prot)
> > +{
> > + int err;
> > +
> > + hyp_spin_lock(&pkvm_pgd_lock);
> > + err = kvm_pgtable_hyp_map(&pkvm_pgtable, start, size, phys, prot);
> > + hyp_spin_unlock(&pkvm_pgd_lock);
> > +
> > + return err;
> > +}
> > +
> > +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> > + unsigned long prot)
> > +{
> > + unsigned long addr;
> > + int ret;
> > +
> > + hyp_spin_lock(&pkvm_pgd_lock);
> > +
> > + size = PAGE_ALIGN(size + offset_in_page(phys));
>
> It might just be simpler to require page-aligned size and phys in the
> caller. At least, for the vectors that should be straightforward because
> I think they're guaranteed not to span a page boundary.

That is done this way for consistency with the host's equivalent
(__create_hyp_private_mapping()), but I can probably factorize the
size-alignment stuff for both.

> > + addr = __io_map_base;
> > + __io_map_base += size;
> > +
> > + /* Are we overflowing on the vmemmap ? */
> > + if (__io_map_base > __hyp_vmemmap) {
> > + __io_map_base -= size;
> > + addr = 0;
>
> Can we use ERR_PTR(), or does that fail miserably at EL2?

Good question, I'll give it a go.
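
For reference, the encoding itself is just arithmetic on the value, so a user-space model of it is easy to write. The sketch below is modelled on the kernel's ERR_PTR()/IS_ERR() convention (the top 4095 values of the address space are reserved for negated errno codes) but applied to an unsigned long address; the sketch_ names are made up, and whether this is usable at EL2 is exactly the open question above:

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_ERRNO 4095UL

/* Encode a negative errno value as an "impossible" address. */
static inline unsigned long sketch_err_addr(long err)
{
	return (unsigned long)err;
}

/* True if 'addr' falls in the top SKETCH_MAX_ERRNO values, i.e. is an error. */
static inline bool sketch_is_err_addr(unsigned long addr)
{
	return addr >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Recover the negative errno value from an encoded address. */
static inline long sketch_addr_err(unsigned long addr)
{
	return (long)addr;
}
```

Compared with returning 0 on failure, this lets the caller see which error occurred (for instance the vmemmap overflow versus a mapping failure) through the same return value.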

> > + goto out;
> > + }
> > +
> > + ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
> > + if (ret) {
> > + addr = 0;
> > + goto out;
> > + }
> > +
> > + addr = addr + offset_in_page(phys);
> > +out:
> > + hyp_spin_unlock(&pkvm_pgd_lock);
> > +
> > + return addr;
> > +}
>
> [...]
>
> > +static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> > + unsigned long *per_cpu_base)
> > +{
> > + void *start, *end, *virt = hyp_phys_to_virt(phys);
> > + int ret, i;
> > +
> > + /* Recreate the hyp page-table using the early page allocator */
> > + hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
> > + ret = kvm_pgtable_hyp_init(&pkvm_pgtable, hyp_va_bits,
> > + &hyp_early_alloc_mm_ops);
> > + if (ret)
> > + return ret;
> > +
> > + ret = hyp_create_idmap();
> > + if (ret)
> > + return ret;
> > +
> > + ret = hyp_map_vectors();
> > + if (ret)
> > + return ret;
> > +
> > + ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_text_start),
> > + hyp_symbol_addr(__hyp_text_end),
> > + PAGE_HYP_EXEC);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(hyp_symbol_addr(__start_rodata),
> > + hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
> > + hyp_symbol_addr(__hyp_data_ro_after_init_end),
> > + PAGE_HYP_RO);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(hyp_symbol_addr(__bss_start),
>
> __hyp_bss_start
>
> > + hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_bss_end),
> > + hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
> > + if (ret)
> > + return ret;
> > +
> > + ret = pkvm_create_mappings(virt, virt + size - 1, PAGE_HYP);
>
> Why is the range inclusive here?

It shouldn't be really, I'll fix it.

> > + if (ret)
> > + return ret;
> > +
> > + for (i = 0; i < hyp_nr_cpus; i++) {
> > + start = (void *)kern_hyp_va(per_cpu_base[i]);
> > + end = start + PAGE_ALIGN(hyp_percpu_size);
> > + ret = pkvm_create_mappings(start, end, PAGE_HYP);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static void update_nvhe_init_params(void)
> > +{
> > + struct kvm_nvhe_init_params *params;
> > + unsigned long i, stack;
> > +
> > + for (i = 0; i < hyp_nr_cpus; i++) {
> > + stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
> > + params = per_cpu_ptr(&kvm_init_params, i);
> > + params->stack_hyp_va = stack + PAGE_SIZE;
> > + params->pgd_pa = __hyp_pa(pkvm_pgtable.pgd);
> > + __flush_dcache_area(params, sizeof(*params));
> > + }
> > +}
> > +
> > +static void *hyp_zalloc_hyp_page(void *arg)
> > +{
> > + return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
> > +}
> > +
> > +void __noreturn __pkvm_init_finalise(void)
> > +{
> > + struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> > + struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> > + unsigned long nr_pages, used_pages;
> > + int ret;
> > +
> > + /* Now that the vmemmap is backed, install the full-fledged allocator */
> > + nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> > + used_pages = hyp_early_alloc_nr_pages();
> > + ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
> > + if (ret)
> > + goto out;
> > +
> > + pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> > + pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> > + pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> > + pkvm_pgtable_mm_ops.get_page = hyp_get_page;
> > + pkvm_pgtable_mm_ops.put_page = hyp_put_page;
> > + pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
> > +
> > +out:
> > + host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
> > + host_ctxt->regs.regs[1] = ret;
>
> Use the cpu_reg() helper for these?

+1

> > +
> > + __host_enter(host_ctxt);
> > +}
> > +
> > +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> > + unsigned long *per_cpu_base)
> > +{
> > + struct kvm_nvhe_init_params *params;
> > + void *virt = hyp_phys_to_virt(phys);
> > + void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> > + int ret;
> > +
> > + if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
> > + return -EINVAL;
> > +
> > + hyp_spin_lock_init(&pkvm_pgd_lock);
> > + hyp_nr_cpus = nr_cpus;
> > +
> > + ret = divide_memory_pool(virt, size);
> > + if (ret)
> > + return ret;
> > +
> > + ret = recreate_hyp_mappings(phys, size, per_cpu_base);
> > + if (ret)
> > + return ret;
> > +
> > + update_nvhe_init_params();
> > +
> > + /* Jump in the idmap page to switch to the new page-tables */
> > + params = this_cpu_ptr(&kvm_init_params);
> > + fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__pkvm_init_switch_pgd));
> > + fn(__hyp_pa(params), hyp_symbol_addr(__pkvm_init_finalise));
> > +
> > + unreachable();
> > +}
> > diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> > new file mode 100644
> > index 000000000000..32f648992835
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/reserved_mem.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2020 - Google LLC
> > + * Author: Quentin Perret <[email protected]>
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <linux/memblock.h>
> > +#include <linux/sort.h>
> > +
> > +#include <asm/kvm_host.h>
> > +
> > +#include <nvhe/memory.h>
> > +#include <nvhe/mm.h>
> > +
> > +phys_addr_t hyp_mem_base;
> > +phys_addr_t hyp_mem_size;
> > +
> > +int __init early_init_dt_add_memory_hyp(u64 base, u64 size)
> > +{
> > + struct hyp_memblock_region *reg;
> > +
> > + if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
> > + kvm_nvhe_sym(hyp_memblock_nr) = -1;
> > +
> > + if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
> > + return -ENOMEM;
> > +
> > + reg = kvm_nvhe_sym(hyp_memory);
> > + reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
> > + reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
> > + kvm_nvhe_sym(hyp_memblock_nr)++;
> > +
> > + return 0;
> > +}
>
> This isn't called by anything in this patch afaict, so it's a bit tricky to
> review, especially as I was trying to see how it interacts with
> kvm_hyp_reserve(), which reads hyp_memblock_nr.

It's not obvious by the look of it, but this _is_ called -- see the
previous patch. But note that given the outcome of the discussion
with Rob, this is changing in v3 as I'll be using the memblock API
instead.

> > +
> > +static int cmp_hyp_memblock(const void *p1, const void *p2)
> > +{
> > + const struct hyp_memblock_region *r1 = p1;
> > + const struct hyp_memblock_region *r2 = p2;
> > +
> > + return r1->start < r2->start ? -1 : (r1->start > r2->start);
> > +}
> > +
> > +static void __init sort_memblock_regions(void)
> > +{
> > + sort(kvm_nvhe_sym(hyp_memory),
> > + kvm_nvhe_sym(hyp_memblock_nr),
> > + sizeof(struct hyp_memblock_region),
> > + cmp_hyp_memblock,
> > + NULL);
> > +}
> > +
> > +void __init kvm_hyp_reserve(void)
> > +{
> > + u64 nr_pages, prev;
> > +
> > + if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > + return;
> > +
> > + if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > + return;
> > +
> > + if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > + kvm_err("Failed to register hyp memblocks\n");
> > + return;
> > + }
> > +
> > + sort_memblock_regions();
> > +
> > + /*
> > + * We don't know the number of possible CPUs yet, so allocate for the
> > + * worst case.
> > + */
> > + hyp_mem_size += NR_CPUS << PAGE_SHIFT;
>
> There was a recent patch bumping NR_CPUS to 512, so this would be 32MB
> with 64k pages. Is it possible to return memory to the host later on once
> we have a better handle on the number of CPUs in the system?

That's not possible yet, no :/

>
> > + hyp_mem_size += hyp_s1_pgtable_size();
> > +
> > + /*
> > + * The hyp_vmemmap needs to be backed by pages, but these pages
> > + * themselves need to be present in the vmemmap, so compute the number
> > + * of pages needed by looking for a fixed point.
> > + */
> > + nr_pages = 0;
> > + do {
> > + prev = nr_pages;
> > + nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> > + nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> > + nr_pages += __hyp_pgtable_max_pages(nr_pages);
> > + } while (nr_pages != prev);
> > + hyp_mem_size += nr_pages << PAGE_SHIFT;
> > +
> > + hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> > + hyp_mem_size, SZ_2M);
>
> Why SZ_2M? Guessing you might mean PMD_SIZE,

Indeed.

> although then we will probably
> want to retry with smaller alignment if the allocation fails as this can
> again be large with e.g. 64k pages.

That can't hurt I guess.

> > + if (!hyp_mem_base) {
> > + kvm_err("Failed to reserve hyp memory\n");
> > + return;
> > + }
> > + memblock_reserve(hyp_mem_base, hyp_mem_size);
> > +
> > + kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
> > + hyp_mem_base);
> > +}
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 278e163beda4..3cf9397dabdb 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
> > .virt_to_phys = kvm_host_pa,
> > };
> >
> > +u32 hyp_va_bits;
>
> Perhaps it would be better to pass this to __pkvm_init() instead of making
> it global?

Sure, I'll have init_hyp_mode() pass a pointer to kvm_mmu_init() to
propagate it back.

Cheers,
Quentin

2021-02-04 11:12:12

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/26] KVM: arm64: Elevate Hyp mappings creation at EL2

On Wednesday 03 Feb 2021 at 15:31:39 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> > Previous commits have introduced infrastructure at EL2 to enable the Hyp
> > code to manage its own memory, and more specifically its stage 1 page
> > tables. However, this was preliminary work, and none of it is currently
> > in use.
> >
> > Put all of this together by elevating the hyp mappings creation at EL2
> > when memory protection is enabled. In this case, the host kernel running
> > at EL1 still creates _temporary_ Hyp mappings, only used while
> > initializing the hypervisor, but frees them right after.
> >
> > As such, all calls to create_hyp_mappings() after kvm init has finished
> > turn into hypercalls, as the host now has no 'legal' way to modify the
> > hypervisor page tables directly.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_mmu.h | 1 -
> > arch/arm64/kvm/arm.c | 62 +++++++++++++++++++++++++++++---
> > arch/arm64/kvm/mmu.c | 34 ++++++++++++++++++
> > 3 files changed, 92 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index d7ebd73ec86f..6c8466a042a9 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -309,6 +309,5 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
> > */
> > asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
> > }
> > -
> > #endif /* __ASSEMBLY__ */
> > #endif /* __ARM64_KVM_MMU_H__ */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 6af9204bcd5b..e524682c2ccf 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
> > kvm_flush_dcache_to_poc(params, sizeof(*params));
> > }
> >
> > -static void cpu_init_hyp_mode(void)
> > +static void kvm_set_hyp_vector(void)
>
> Please do something about the naming: now we have both cpu_set_hyp_vector()
> and kvm_set_hyp_vector()!

I'll try to find something different, but no guarantees it'll be much
better :) Suggestions welcome.

> > {
> > struct kvm_nvhe_init_params *params;
> > struct arm_smccc_res res;
> > @@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
> > params = this_cpu_ptr_nvhe_sym(kvm_init_params);
> > arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
> > WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
> > +}
> > +
> > +static void cpu_init_hyp_mode(void)
> > +{
> > + kvm_set_hyp_vector();
> >
> > /*
> > * Disabling SSBD on a non-VHE system requires us to enable SSBS
> > @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
> > struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
> > void *vector = hyp_spectre_vector_selector[data->slot];
> >
> > - *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > + if (!is_protected_kvm_enabled())
> > + *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > + else
> > + kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
>
> *Very* minor nit, but it might be cleaner to have static inline functions
> with the same prototypes as the hypercalls, just to make the code even
> easier to read. e.g
>
> if (!is_protected_kvm_enabled())
> _cpu_set_vector(data->slot);
> else
> kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
>
> you could then conceivably wrap that in a macro and avoid having the
> "is_protected_kvm_enabled()" checks explicit every time.

Happy to do this here, but are you suggesting to generalize this pattern
to other places as well?

> > }
> >
> > static void cpu_hyp_reinit(void)
> > @@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
> > kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);
> >
> > cpu_hyp_reset();
> > - cpu_set_hyp_vector();
> >
> > if (is_kernel_in_hyp_mode())
> > kvm_timer_init_vhe();
> > else
> > cpu_init_hyp_mode();
> >
> > + cpu_set_hyp_vector();
> > +
> > kvm_arm_init_debug();
> >
> > if (vgic_present)
> > @@ -1714,13 +1723,52 @@ static int copy_cpu_ftr_regs(void)
> > return 0;
> > }
> >
> > +static int kvm_hyp_enable_protection(void)
> > +{
> > + void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
> > + int ret, cpu;
> > + void *addr;
> > +
> > + if (!is_protected_kvm_enabled())
> > + return 0;
>
> Maybe I'm hung up on my previous suggestion, but I feel like we shouldn't
> get here if protected kvm isn't enabled.

The alternative is to move this check next to the call site, but it
won't help much IMO.

>
> > + if (!hyp_mem_base)
> > + return -ENOMEM;
> > +
> > + addr = phys_to_virt(hyp_mem_base);
> > + ret = create_hyp_mappings(addr, addr + hyp_mem_size - 1, PAGE_HYP);
> > + if (ret)
> > + return ret;
> > +
> > + preempt_disable();
> > + kvm_set_hyp_vector();
> > + ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
> > + num_possible_cpus(), kern_hyp_va(per_cpu_base));
>
> Would it make sense for the __pkvm_init() hypercall to set the vector as
> well, so that we wouldn't need to disable preemption over two hypercalls?

Not sure, kvm_set_hyp_vector() itself already does multiple hypercalls,
and I need it separate from __pkvm_init for secondary CPUs.

> Failing that, maybe move the whole preempt_disable/enable sequence into
> another function.

But that I can do.

> > + preempt_enable();
> > + if (ret)
> > + return ret;
> > +
> > + free_hyp_pgds();
> > + for_each_possible_cpu(cpu)
> > + free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * Inits Hyp-mode on all online CPUs
> > */
> > static int init_hyp_mode(void)
> > {
> > int cpu;
> > - int err = 0;
> > + int err = -ENOMEM;
> > +
> > + /*
> > + * The protected Hyp-mode cannot be initialized if the memory pool
> > + * allocation has failed.
> > + */
> > + if (is_protected_kvm_enabled() && !hyp_mem_base)
> > + return err;
> >
> > /*
> > * Copy the required CPU feature register in their EL2 counterpart
> > @@ -1854,6 +1902,12 @@ static int init_hyp_mode(void)
> > for_each_possible_cpu(cpu)
> > cpu_prepare_hyp_mode(cpu);
> >
> > + err = kvm_hyp_enable_protection();
> > + if (err) {
> > + kvm_err("Failed to enable hyp memory protection: %d\n", err);
> > + goto out_err;
> > + }
> > +
> > return 0;
> >
> > out_err:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 3cf9397dabdb..9d4c9251208e 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
> > if (hyp_pgtable) {
> > kvm_pgtable_hyp_destroy(hyp_pgtable);
> > kfree(hyp_pgtable);
> > + hyp_pgtable = NULL;
> > }
> > mutex_unlock(&kvm_hyp_pgd_mutex);
> > }
> >
> > +static bool kvm_host_owns_hyp_mappings(void)
> > +{
> > + if (static_branch_likely(&kvm_protected_mode_initialized))
> > + return false;
> > +
> > + /*
> > + * This can happen at boot time when __create_hyp_mappings() is called
> > + * after the hyp protection has been enabled, but the static key has
> > + * not been flipped yet.
> > + */
> > + if (!hyp_pgtable && is_protected_kvm_enabled())
> > + return false;
> > +
> > + BUG_ON(!hyp_pgtable);
>
> Can we fail more gracefully, e.g. by continuing without KVM?

Got any suggestion as to how that can be done? We could also just remove
that line -- that really should not happen.

Thanks!
Quentin

2021-02-04 14:13:48

by Quentin Perret

Subject: Re: [RFC PATCH v2 21/26] KVM: arm64: Refactor kvm_arm_setup_stage2()

On Wednesday 03 Feb 2021 at 15:53:54 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:19PM +0000, Quentin Perret wrote:
> > In order to re-use some of the stage 2 setup at EL2, factor parts of
> > kvm_arm_setup_stage2() out into static inline functions.
> >
> > No functional change intended.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_mmu.h | 48 ++++++++++++++++++++++++++++++++
> > arch/arm64/kvm/reset.c | 42 +++-------------------------
> > 2 files changed, 52 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index 662f0415344e..83b4c5cf4768 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -280,6 +280,54 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
> > return ret;
> > }
> >
> > +static inline u64 kvm_get_parange(u64 mmfr0)
> > +{
> > + u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
> > + ID_AA64MMFR0_PARANGE_SHIFT);
> > + if (parange > ID_AA64MMFR0_PARANGE_MAX)
> > + parange = ID_AA64MMFR0_PARANGE_MAX;
> > +
> > + return parange;
> > +}
> > +
> > +/*
> > + * The VTCR value is common across all the physical CPUs on the system.
> > + * We use system wide sanitised values to fill in different fields,
> > + * except for Hardware Management of Access Flags. HA Flag is set
> > + * unconditionally on all CPUs, as it is safe to run with or without
> > + * the feature and the bit is RES0 on CPUs that don't support it.
> > + */
> > +static inline u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> > +{
> > + u64 vtcr = VTCR_EL2_FLAGS;
> > + u8 lvls;
> > +
> > + vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
> > + vtcr |= VTCR_EL2_T0SZ(phys_shift);
> > + /*
> > + * Use a minimum 2 level page table to prevent splitting
> > + * host PMD huge pages at stage2.
> > + */
> > + lvls = stage2_pgtable_levels(phys_shift);
> > + if (lvls < 2)
> > + lvls = 2;
> > + vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
> > +
> > + /*
> > + * Enable the Hardware Access Flag management, unconditionally
> > + * on all CPUs. The feature is RES0 on CPUs without the support
> > + * and must be ignored by the CPUs.
> > + */
> > + vtcr |= VTCR_EL2_HA;
> > +
> > + /* Set the vmid bits */
> > + vtcr |= (get_vmid_bits(mmfr1) == 16) ?
> > + VTCR_EL2_VS_16BIT :
> > + VTCR_EL2_VS_8BIT;
> > +
> > + return vtcr;
> > +}
>
> Although I think this is functionally fine, I think it's unusual to see
> large "static inline" functions like this in shared header files. One
> alternative approach would be to follow the example of
> kernel/locking/qspinlock_paravirt.h, where the header is guarded in such a
> way that is only ever included by kernel/locking/qspinlock.c and therefore
> doesn't need the "inline" at all. That separation really helps, I think.

Alternatively, I might be able to have an mmu.c file in the hyp/ folder,
and to compile it for both the host kernel and the EL2 obj as we do for
a few things already. Or maybe I'll just stick it in pgtable.c. Either
way, it'll add a function call, but I can't really see that having any
measurable impact, so we should be fine.

Cheers,
Quentin

2021-02-04 14:36:13

by Quentin Perret

Subject: Re: [RFC PATCH v2 23/26] KVM: arm64: Refactor __populate_fault_info()

On Wednesday 03 Feb 2021 at 15:58:32 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:21PM +0000, Quentin Perret wrote:
> > Refactor __populate_fault_info() to introduce __get_fault_info() which
> > will be used once the host is wrapped in a stage 2.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +++++++++++++++----------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > index 84473574c2e7..e9005255d639 100644
> > --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> > +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > @@ -157,19 +157,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
> > return true;
> > }
> >
> > -static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
> > +static inline bool __get_fault_info(u64 esr, u64 *far, u64 *hpfar)
>
> Could this take a pointer to a struct kvm_vcpu_fault_info instead?

The disr_el1 field will be unused in this case, but yes, that should
work.

Cheers,
Quentin

2021-02-04 14:44:44

by Will Deacon

Subject: Re: [RFC PATCH v2 26/26] KVM: arm64: Wrap the host with a stage 2

On Thu, Feb 04, 2021 at 02:26:35PM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 16:11:47 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> > > When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> > > to give the hypervisor some control over the host memory accesses. At
> > > the moment all memory aborts from the host will be instantly idmapped
> > > RWX at stage 2 in a lazy fashion. Later patches will make use of that
> > > infrastructure to implement access control restrictions to e.g. protect
> > > guest memory from the host.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_cpufeature.h | 2 +
> > > arch/arm64/kernel/image-vars.h | 3 +
> > > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 33 +++
> > > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > > arch/arm64/kvm/hyp/nvhe/hyp-init.S | 1 +
> > > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 6 +
> > > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 191 ++++++++++++++++++
> > > arch/arm64/kvm/hyp/nvhe/setup.c | 6 +
> > > arch/arm64/kvm/hyp/nvhe/switch.c | 7 +-
> > > arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
> > > 10 files changed, 248 insertions(+), 7 deletions(-)
> > > create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > > create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
> >
> > [...]
> >
> > > +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> > > +{
> > > + enum kvm_pgtable_prot prot;
> > > + u64 far, hpfar, esr, ipa;
> > > + int ret;
> > > +
> > > + esr = read_sysreg_el2(SYS_ESR);
> > > + if (!__get_fault_info(esr, &far, &hpfar))
> > > + hyp_panic();
> > > +
> > > + prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> > > + ipa = (hpfar & HPFAR_MASK) << 8;
> > > + ret = host_stage2_map(ipa, PAGE_SIZE, prot);
> >
> > Can we try to put down a block mapping if the whole thing falls within
> > memory?
>
> Yes we can! And in fact we can do that outside of memory too. It's
> queued for v3 already, so stay tuned ... :)

Awesome! The Stage-2 TLB thanks you.

Will

2021-02-04 15:00:23

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > + * __find_buddy(pool, page 0, order 0) => page 1
> > > > + * __find_buddy(pool, page 0, order 1) => page 2
> > > > + * __find_buddy(pool, page 1, order 0) => page 0
> > > > + * __find_buddy(pool, page 2, order 0) => page 3
> > > > + */
> > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > + unsigned int order)
> > > > +{
> > > > + phys_addr_t addr = hyp_page_to_phys(p);
> > > > +
> > > > + addr ^= (PAGE_SIZE << order);
> > > > + if (addr < pool->range_start || addr >= pool->range_end)
> > > > + return NULL;
> > >
> > > Are these range checks only needed because the pool isn't required to be
> > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > straightforward to limit the max order on a per-pool basis depending upon
> > > its size?
> >
> > More importantly, it is because pages outside of the pool are not
> > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > sure I don't dereference them.
>
> Wouldn't having a per-pool max order help with that?

The issue is, I have no alignment guarantees for the pools, so I may end
up with max_order = 0 ...

2021-02-04 18:09:44

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thursday 04 Feb 2021 at 17:48:49 (+0000), Will Deacon wrote:
> On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > > + * __find_buddy(pool, page 0, order 0) => page 1
> > > > > > + * __find_buddy(pool, page 0, order 1) => page 2
> > > > > > + * __find_buddy(pool, page 1, order 0) => page 0
> > > > > > + * __find_buddy(pool, page 2, order 0) => page 3
> > > > > > + */
> > > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > > + unsigned int order)
> > > > > > +{
> > > > > > + phys_addr_t addr = hyp_page_to_phys(p);
> > > > > > +
> > > > > > + addr ^= (PAGE_SIZE << order);
> > > > > > + if (addr < pool->range_start || addr >= pool->range_end)
> > > > > > + return NULL;
> > > > >
> > > > > Are these range checks only needed because the pool isn't required to be
> > > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > > its size?
> > > >
> > > > More importantly, it is because pages outside of the pool are not
> > > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > > sure I don't dereference them.
> > >
> > > Wouldn't having a per-pool max order help with that?
> >
> > The issue is, I have no alignment guarantees for the pools, so I may end
> > up with max_order = 0 ...
>
> Yeah, so you would still need the range tracking,

Hmm actually I don't think I would, but that would essentially mean the
'buddy' allocator is now turned into a free list of single pages
(because we cannot create pages of order 1).

> but it would at least help
> to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
> that later.

Sorry but I am not following. In which case do we have HYP_MAX_ORDER
failed searches?

Thanks,
Quentin

2021-02-04 18:27:25

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> Just feels a bit backwards having __find_buddy() take an order parameter,
> yet then return a page of the wrong order! __hyp_extract_page() always
> passes the p->order as the order,

Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
a helper to lookup/index the vmemmap, but it's perfectly possible that
the 'destination' page that is being indexed has already been allocated,
and split up multiple times (and so at a different order), etc ... And
that is the caller's job to decide.

How about __lookup_potential_buddy() ? Any suggestion?

2021-02-05 00:26:12

by Quentin Perret

Subject: Re: [RFC PATCH v2 24/26] KVM: arm64: Make memcache anonymous in pgtable allocator

On Wednesday 03 Feb 2021 at 15:59:44 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> > The current stage2 page-table allocator uses a memcache to get
> > pre-allocated pages when it needs any. To allow re-using this code at
> > EL2 which uses a concept of memory pools, make the memcache argument to
> > kvm_pgtable_stage2_map() anonymous, and let the mm_ops zalloc_page()
> > callbacks use it the way they need to.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
> > arch/arm64/kvm/hyp/pgtable.c | 4 ++--
> > 2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 8e8f1d2c5e0e..d846bc3d3b77 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> > * @size: Size of the mapping.
> > * @phys: Physical address of the memory to map.
> > * @prot: Permissions and attributes for the mapping.
> > - * @mc: Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> > - * allocate page-table pages.
> > + * @mc: Cache of pre-allocated memory from which to allocate page-table
> > + * pages.
>
> We should probably mention that this memory must be zeroed, since I don't
> think the page-table code takes care of that.

OK, though I think this is unrelated to this change -- this is already
true today I believe. Anyhow, I'll pile a change on top.

Cheers,
Quentin

2021-02-05 00:26:30

by Quentin Perret

Subject: Re: [RFC PATCH v2 26/26] KVM: arm64: Wrap the host with a stage 2

On Wednesday 03 Feb 2021 at 16:11:47 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> > When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> > to give the hypervisor some control over the host memory accesses. At
> > the moment all memory aborts from the host will be instantly idmapped
> > RWX at stage 2 in a lazy fashion. Later patches will make use of that
> > infrastructure to implement access control restrictions to e.g. protect
> > guest memory from the host.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_cpufeature.h | 2 +
> > arch/arm64/kernel/image-vars.h | 3 +
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 33 +++
> > arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> > arch/arm64/kvm/hyp/nvhe/hyp-init.S | 1 +
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 6 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 191 ++++++++++++++++++
> > arch/arm64/kvm/hyp/nvhe/setup.c | 6 +
> > arch/arm64/kvm/hyp/nvhe/switch.c | 7 +-
> > arch/arm64/kvm/hyp/nvhe/tlb.c | 4 +-
> > 10 files changed, 248 insertions(+), 7 deletions(-)
> > create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
>
> [...]
>
> > +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> > +{
> > + enum kvm_pgtable_prot prot;
> > + u64 far, hpfar, esr, ipa;
> > + int ret;
> > +
> > + esr = read_sysreg_el2(SYS_ESR);
> > + if (!__get_fault_info(esr, &far, &hpfar))
> > + hyp_panic();
> > +
> > + prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> > + ipa = (hpfar & HPFAR_MASK) << 8;
> > + ret = host_stage2_map(ipa, PAGE_SIZE, prot);
>
> Can we try to put down a block mapping if the whole thing falls within
> memory?

Yes we can! And in fact we can do that outside of memory too. It's
queued for v3 already, so stay tuned ... :)

Thanks,
Quentin

2021-02-05 00:27:02

by Will Deacon

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > + * __find_buddy(pool, page 0, order 0) => page 1
> > > + * __find_buddy(pool, page 0, order 1) => page 2
> > > + * __find_buddy(pool, page 1, order 0) => page 0
> > > + * __find_buddy(pool, page 2, order 0) => page 3
> > > + */
> > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > + unsigned int order)
> > > +{
> > > + phys_addr_t addr = hyp_page_to_phys(p);
> > > +
> > > + addr ^= (PAGE_SIZE << order);
> > > + if (addr < pool->range_start || addr >= pool->range_end)
> > > + return NULL;
> >
> > Are these range checks only needed because the pool isn't required to be
> > an exact power-of-2 pages in size? If so, maybe it would be more
> > straightforward to limit the max order on a per-pool basis depending upon
> > its size?
>
> More importantly, it is because pages outside of the pool are not
> guaranteed to be covered by the hyp_vmemmap, so I really need to make
> sure I don't dereference them.

Wouldn't having a per-pool max order help with that?

> > > + return hyp_phys_to_page(addr);
> > > +}
> > > +
> > > +static void __hyp_attach_page(struct hyp_pool *pool,
> > > + struct hyp_page *p)
> > > +{
> > > + unsigned int order = p->order;
> > > + struct hyp_page *buddy;
> > > +
> > > + p->order = HYP_NO_ORDER;
> >
> > Why is this needed?
>
> If p->order is say 3, I may be able to coalesce with the buddy of order
> 3 to form a higher order page of order 4. And that higher order page
> will be represented by the 'first' of the two order-3 pages (let's call
> it the head), and the other order 3 page (let's say the tail) will be
> assigned 'HYP_NO_ORDER'.
>
> And basically at this point I don't know if 'p' is going to be the head or
> the tail, so I set it to HYP_NO_ORDER a priori so I don't have to think
> about this in the loop below. Is that helping?
>
> I suppose this could use more comments as well ...

Comments would definitely help, but perhaps even having a simple function to
do the coalescing, which you could call from the loop body and which would
deal with marking the tail pages as HYP_NO_ORDER?

> > > + for (; order < HYP_MAX_ORDER; order++) {
> > > + /* Nothing to do if the buddy isn't in a free-list */
> > > + buddy = __find_buddy(pool, p, order);
> > > + if (!buddy || list_empty(&buddy->node) || buddy->order != order)
> >
> > Could we move the "buddy->order" check into __find_buddy()?
>
> I think it might break __hyp_extract_page() below. The way I think about
> __find_buddy() is as a low level function which gives you the buddy page
> blindly if it exists in the hyp_vmemmap, and it's up to the callers to
> decide whether the buddy is in the right state for their use or not.

Just feels a bit backwards having __find_buddy() take an order parameter,
yet then return a page of the wrong order! __hyp_extract_page() always
passes the p->order as the order, so I think it would be worth having a
separate function that just takes the pool and the page for that.

Will
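
For reference, the coalescing helper suggested above could look roughly
like the following. This is only a minimal sketch in plain C, not code
from the series: struct sk_page, sk_coalesce() and SK_NO_ORDER are
hypothetical stand-ins for struct hyp_page, the proposed helper and
HYP_NO_ORDER, and the free-list handling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NO_ORDER	(~0U)	/* hypothetical stand-in for HYP_NO_ORDER */

/* Hypothetical, stripped-down stand-in for struct hyp_page. */
struct sk_page {
	unsigned int order;
};

/*
 * Merge two free buddies of the same order into one page of order + 1.
 * Both pointers must come from the same array so that comparing them is
 * meaningful: the lower one becomes the head, the other one the tail.
 */
static struct sk_page *sk_coalesce(struct sk_page *p, struct sk_page *buddy,
				   unsigned int order)
{
	struct sk_page *head = p < buddy ? p : buddy;
	struct sk_page *tail = p < buddy ? buddy : p;

	assert(p != NULL && buddy != NULL && p != buddy);

	tail->order = SK_NO_ORDER;	/* the tail is no longer a page of its own */
	head->order = order + 1;	/* the head represents the merged page */

	return head;
}
```

With something like this, the loop body would reduce to taking the buddy
off its free list and then doing p = sk_coalesce(p, buddy, order), so the
a priori p->order = HYP_NO_ORDER assignment would no longer be needed.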

2021-02-05 00:28:24

by Will Deacon

Subject: Re: [RFC PATCH v2 24/26] KVM: arm64: Make memcache anonymous in pgtable allocator

On Thu, Feb 04, 2021 at 02:24:44PM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 15:59:44 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> > > The current stage2 page-table allocator uses a memcache to get
> > > pre-allocated pages when it needs any. To allow re-using this code at
> > > EL2 which uses a concept of memory pools, make the memcache argument to
> > > kvm_pgtable_stage2_map() anonymous, and let the mm_ops zalloc_page()
> > > callbacks use it the way they need to.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
> > > arch/arm64/kvm/hyp/pgtable.c | 4 ++--
> > > 2 files changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > index 8e8f1d2c5e0e..d846bc3d3b77 100644
> > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> > > * @size: Size of the mapping.
> > > * @phys: Physical address of the memory to map.
> > > * @prot: Permissions and attributes for the mapping.
> > > - * @mc: Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> > > - * allocate page-table pages.
> > > + * @mc: Cache of pre-allocated memory from which to allocate page-table
> > > + * pages.
> >
> > We should probably mention that this memory must be zeroed, since I don't
> > think the page-table code takes care of that.
>
> OK, though I think this is unrelated to this change -- this is already
> true today I believe. Anyhow, I'll pile a change on top.

It is, but GFP_PGTABLE_USER implies __GFP_ZERO, so the existing comment
captures that.

Will

2021-02-05 01:09:51

by Will Deacon

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > + * __find_buddy(pool, page 0, order 0) => page 1
> > > > > + * __find_buddy(pool, page 0, order 1) => page 2
> > > > > + * __find_buddy(pool, page 1, order 0) => page 0
> > > > > + * __find_buddy(pool, page 2, order 0) => page 3
> > > > > + */
> > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > + unsigned int order)
> > > > > +{
> > > > > + phys_addr_t addr = hyp_page_to_phys(p);
> > > > > +
> > > > > + addr ^= (PAGE_SIZE << order);
> > > > > + if (addr < pool->range_start || addr >= pool->range_end)
> > > > > + return NULL;
> > > >
> > > > Are these range checks only needed because the pool isn't required to be
> > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > its size?
> > >
> > > More importantly, it is because pages outside of the pool are not
> > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > sure I don't dereference them.
> >
> > Wouldn't having a per-pool max order help with that?
>
> The issue is, I have no alignment guarantees for the pools, so I may end
> up with max_order = 0 ...

Yeah, so you would still need the range tracking, but it would at least help
to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
that later.

Will

2021-02-05 01:14:11

by Will Deacon

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thu, Feb 04, 2021 at 06:01:12PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 17:48:49 (+0000), Will Deacon wrote:
> > On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> > > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > > > + * __find_buddy(pool, page 0, order 0) => page 1
> > > > > > > + * __find_buddy(pool, page 0, order 1) => page 2
> > > > > > > + * __find_buddy(pool, page 1, order 0) => page 0
> > > > > > > + * __find_buddy(pool, page 2, order 0) => page 3
> > > > > > > + */
> > > > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > > > + unsigned int order)
> > > > > > > +{
> > > > > > > + phys_addr_t addr = hyp_page_to_phys(p);
> > > > > > > +
> > > > > > > + addr ^= (PAGE_SIZE << order);
> > > > > > > + if (addr < pool->range_start || addr >= pool->range_end)
> > > > > > > + return NULL;
> > > > > >
> > > > > > Are these range checks only needed because the pool isn't required to be
> > > > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > > > its size?
> > > > >
> > > > > More importantly, it is because pages outside of the pool are not
> > > > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > > > sure I don't dereference them.
> > > >
> > > > Wouldn't having a per-pool max order help with that?
> > >
> > > The issue is, I have no alignment guarantees for the pools, so I may end
> > > up with max_order = 0 ...
> >
> > Yeah, so you would still need the range tracking,
>
> Hmm actually I don't think I would, but that would essentially mean the
> 'buddy' allocator is now turned into a free list of single pages
> (because we cannot create pages of order 1).

Right, I'm not suggesting we do that.

> > but it would at least help
> > to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
> > that later.
>
> Sorry but I am not following. In which case do we have HYP_MAX_ORDER
> failed searches?

I was going from memory, but the loop in __hyp_alloc_pages() searches up to
HYP_MAX_ORDER, whereas this is _never_ going to succeed beyond some per-pool
order determined by the size of the pool. But I doubt it matters -- I
thought we did more than just check a list.

Will

2021-02-05 01:26:09

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thursday 04 Feb 2021 at 18:13:18 (+0000), Will Deacon wrote:
> I was going from memory, but the loop in __hyp_alloc_pages() searches up to
> HYP_MAX_ORDER, whereas this is _never_ going to succeed beyond some per-pool
> order determined by the size of the pool. But I doubt it matters -- I
> thought we did more than just check a list.

Ah, I see -- I was looking at the __hyp_attach_page() loop.

I think it's a good point, I should be able to figure out a max order
based on the size and alignment of the pool, and cache that in struct
hyp_pool to optimize cases where this is < HYP_MAX_ORDER.
Should be easy enough, I'll see what I can do in v3.

Thanks!
Quentin
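
One possible way to derive such a per-pool max order from the size and
alignment of the pool is sketched below. This is an assumption about how
it might be done, not what v3 actually does: sk_pool_max_order() and the
SK_* constants are hypothetical, and the value chosen for SK_MAX_ORDER is
arbitrary. The idea is to find the largest order for which at least one
naturally aligned block fits entirely inside [start, end).

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_MAX_ORDER	11U	/* stand-in for HYP_MAX_ORDER, value assumed */

/*
 * Return the largest order (inclusive) for which a naturally aligned
 * block of PAGE_SIZE << order bytes lies entirely within [start, end).
 * Both bounds are assumed to be page-aligned physical addresses.
 */
static unsigned int sk_pool_max_order(uint64_t start, uint64_t end)
{
	unsigned int order;

	assert(start < end);
	assert(!(start & (SK_PAGE_SIZE - 1)) && !(end & (SK_PAGE_SIZE - 1)));

	for (order = SK_MAX_ORDER - 1; order > 0; order--) {
		uint64_t size = SK_PAGE_SIZE << order;
		/* First address at or above start that is aligned to size. */
		uint64_t first = (start + size - 1) & ~(size - 1);

		if (first < end && end - first >= size)
			return order;
	}

	/* Single pages always fit in a non-empty, page-aligned pool. */
	return 0;
}
```

A pool made of one page, or of two pages that do not start on a two-page
boundary, yields 0 here, which is the max_order = 0 case mentioned
earlier in the thread.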

2021-02-05 01:26:11

by Will Deacon

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thu, Feb 04, 2021 at 06:19:36PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > Just feels a bit backwards having __find_buddy() take an order parameter,
> > yet then return a page of the wrong order! __hyp_extract_page() always
> > passes the p->order as the order,
>
> Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
> a helper to lookup/index the vmemmap, but it's perfectly possible that
> the 'destination' page that is being indexed has already been allocated,
> > and split up multiple times (and so at a different order), etc ... And
> that is the caller's job to decide.
>
> How about __lookup_potential_buddy() ? Any suggestion?

Hey, my job here is to waffle incoherently and hope that you find bugs in
your own code. Now you want me to _name_ something! Jeez...

Ok, how about __find_buddy() does what it does today but doesn't take an
order argument, whereas __find_buddy_of_order() takes the order argument
and checks the page order before returning?

Will

2021-02-05 01:30:55

by Quentin Perret

Subject: Re: [RFC PATCH v2 12/26] KVM: arm64: Introduce a Hyp buddy page allocator

On Thursday 04 Feb 2021 at 18:24:05 (+0000), Will Deacon wrote:
> On Thu, Feb 04, 2021 at 06:19:36PM +0000, Quentin Perret wrote:
> > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > Just feels a bit backwards having __find_buddy() take an order parameter,
> > > yet then return a page of the wrong order! __hyp_extract_page() always
> > > passes the p->order as the order,
> >
> > Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
> > a helper to lookup/index the vmemmap, but it's perfectly possible that
> > the 'destination' page that is being indexed has already been allocated,
> > > and split up multiple times (and so at a different order), etc ... And
> > that is the caller's job to decide.
> >
> > How about __lookup_potential_buddy() ? Any suggestion?
>
> Hey, my job here is to waffle incoherently and hope that you find bugs in
> your own code. Now you want me to _name_ something! Jeez...

Hey, that's my special -- I already got Marc to make a suggestion on v1
and it's been my favorite function name so far, so why not try again?

https://lore.kernel.org/kvmarm/[email protected]/

> Ok, how about __find_buddy() does what it does today but doesn't take an
> order argument, whereas __find_buddy_of_order() takes the order argument
> and checks the page order before returning?

Sounds like a plan!

Cheers,
Quentin
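
To make the agreed split concrete, here is a small user-space sketch of
the two helpers. It is one reading of the proposal rather than code from
the series: every sk_* name is hypothetical, the vmemmap is modelled as a
flat array indexed by page frame number, and it is assumed that the
order-less variant uses the order stored in the page itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_NR_PAGES	16
#define SK_MAX_ORDER	11U	/* stand-in for HYP_MAX_ORDER, value assumed */
#define SK_NO_ORDER	(~0U)	/* stand-in for HYP_NO_ORDER */

struct sk_page { unsigned int order; };
struct sk_pool { uint64_t range_start, range_end; };

/* Toy vmemmap: entry i describes the page at physical address i << 12. */
static struct sk_page sk_vmemmap[SK_NR_PAGES];

static uint64_t sk_page_to_phys(const struct sk_page *p)
{
	return (uint64_t)(p - sk_vmemmap) << SK_PAGE_SHIFT;
}

static struct sk_page *sk_phys_to_page(uint64_t addr)
{
	assert((addr >> SK_PAGE_SHIFT) < SK_NR_PAGES);
	return &sk_vmemmap[addr >> SK_PAGE_SHIFT];
}

/* Shared lookup: flip the order bit and reject addresses outside the pool. */
static struct sk_page *sk_buddy_at(const struct sk_pool *pool,
				   const struct sk_page *p, unsigned int order)
{
	uint64_t addr;

	assert(order < SK_MAX_ORDER);
	addr = sk_page_to_phys(p) ^ (SK_PAGE_SIZE << order);
	if (addr < pool->range_start || addr >= pool->range_end)
		return NULL;

	return sk_phys_to_page(addr);
}

/* Blind lookup: no order argument, no check on the state of the buddy. */
static struct sk_page *sk_find_buddy(const struct sk_pool *pool,
				     const struct sk_page *p)
{
	return sk_buddy_at(pool, p, p->order);
}

/* Checked lookup: only return the buddy if it has the requested order. */
static struct sk_page *sk_find_buddy_of_order(const struct sk_pool *pool,
					      const struct sk_page *p,
					      unsigned int order)
{
	struct sk_page *buddy = sk_buddy_at(pool, p, order);

	if (!buddy || buddy->order != order)
		return NULL;

	return buddy;
}
```

The checked variant folds the "buddy->order != order" test from the
__hyp_attach_page() loop into the lookup, while the blind variant keeps
the behaviour __hyp_extract_page() relies on. Whether the buddy also sits
on a free list is still left to the caller in this sketch.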

2021-02-05 18:01:35

by Will Deacon

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Thu, Feb 04, 2021 at 10:47:08AM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> > > +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> > > +{
> > > + unsigned long total = 0, i;
> > > +
> > > + /* Provision the worst case scenario with 4 levels of page-table */
> > > + for (i = 0; i < 4; i++) {
> >
> > Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
> > header?
>
> Will do.
>
> >
> > > + nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> > > + total += nr_pages;
> > > + }
> >
> > ... that said, I'm not sure this needs to iterate at all. What exactly are
> > you trying to compute?
>
> I'm trying to figure out how many pages I will need to construct a
> page-table covering nr_pages contiguous pages. The first iteration tells
> me how many level 0 pages I need to cover nr_pages, the second iteration
> how many level 1 pages I need to cover the level 0 pages, and so on...

Ah, you iterate from leaves back to the root. Got it, thanks.

> I might be doing this naively though. Got a better idea?

I thought I did, but I ended up with something based on a geometric series
and it looks terrible to code-up in C without, err, iterating like you do.

So yeah, ignore me :)

> > > +
> > > + return total;
> > > +}
> > > +
> > > +static inline unsigned long hyp_s1_pgtable_size(void)
> > > +{
> > > + struct hyp_memblock_region *reg;
> > > + unsigned long nr_pages, res = 0;
> > > + int i;
> > > +
> > > + if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> > > + return 0;
> >
> > It's a bit grotty having this be signed. Why do we need to encode the error
> > case differently from the 0 case?
>
> Here specifically we don't, but it is needed in early_init_dt_add_memory_hyp()
> to distinguish the overflow case from the first memblock being added.

Fair enough, but if you figure out a way for hyp_memblock_nr to be unsigned,
I think that would be preferable.

Will
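
The leaves-to-root iteration discussed above can be tried out in user
space with a standalone version of the loop. This is a sketch under
stated assumptions: SK_PTRS_PER_PTE is fixed at 512 (4KiB pages with
8-byte descriptors), SK_PGTABLE_MAX_LEVELS stands in for
KVM_PGTABLE_MAX_LEVELS, and sk_pgtable_max_pages() is a hypothetical name
mirroring __hyp_pgtable_max_pages().

```c
#include <assert.h>
#include <limits.h>

#define SK_PTRS_PER_PTE		512UL	/* assumes 4KiB pages, 8-byte entries */
#define SK_PGTABLE_MAX_LEVELS	4	/* stand-in for KVM_PGTABLE_MAX_LEVELS */

#define SK_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Worst-case number of page-table pages needed to map nr_pages contiguous
 * pages. Each pass computes how many table pages are needed to point at
 * everything counted by the previous pass, walking from the leaf tables
 * up to the root.
 */
static unsigned long sk_pgtable_max_pages(unsigned long nr_pages)
{
	unsigned long total = 0;
	int i;

	/* Keep the rounding in SK_DIV_ROUND_UP() from overflowing. */
	assert(nr_pages <= ULONG_MAX - (SK_PTRS_PER_PTE - 1));

	for (i = 0; i < SK_PGTABLE_MAX_LEVELS; i++) {
		nr_pages = SK_DIV_ROUND_UP(nr_pages, SK_PTRS_PER_PTE);
		total += nr_pages;
	}

	return total;
}
```

Under these assumptions, mapping 1 << 18 pages (1GiB) needs 512 leaf
tables plus one table at each of the three levels above, 515 pages in
total, and any non-empty range needs at least one page per level.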

2021-02-05 18:09:36

by Will Deacon

Subject: Re: [RFC PATCH v2 17/26] KVM: arm64: Elevate Hyp mappings creation at EL2

On Thu, Feb 04, 2021 at 11:08:33AM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 15:31:39 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> > > @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
> > > struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
> > > void *vector = hyp_spectre_vector_selector[data->slot];
> > >
> > > - *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > > + if (!is_protected_kvm_enabled())
> > > + *this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > > + else
> > > + kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> >
> > *Very* minor nit, but it might be cleaner to have static inline functions
> > with the same prototypes as the hypercalls, just to make the code even
> > easier to read. e.g
> >
> > if (!is_protected_kvm_enabled())
> > _cpu_set_vector(data->slot);
> > else
> > kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> >
> > you could then conceivably wrap that in a macro and avoid having the
> > "is_protected_kvm_enabled()" checks explicit every time.
>
> Happy to do this here, but are you suggesting to generalize this pattern
> to other places as well?

I think it's probably a good pattern to follow, but no need to generalise it
prematurely.
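
To make the wrapper idea concrete, here is a rough userspace sketch. The flag, the two setters and the macro name are all stand-ins invented for this example (the real code would use is_protected_kvm_enabled() and kvm_call_hyp_nvhe()); it only shows how giving both paths the same prototype lets a single macro hide the check.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS_SKETCH	4

static bool protected_mode_sketch;		/* stand-in for is_protected_kvm_enabled() */
static int host_vector_slot_sketch = -1;	/* written directly by the host */
static int hyp_vector_slot_sketch = -1;		/* written through the "hypercall" */

static bool is_protected_sketch(void)
{
	return protected_mode_sketch;
}

/* Host-side helper with the same prototype as the hypercall. */
static void host_cpu_set_vector_sketch(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS_SKETCH);
	host_vector_slot_sketch = slot;
}

/* Stand-in for kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, slot). */
static void pkvm_cpu_set_vector_sketch(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS_SKETCH);
	hyp_vector_slot_sketch = slot;
}

/* Pick the host or hyp path once, so call sites don't repeat the check. */
#define host_or_hyp_call_sketch(host_fn, hyp_fn, ...)	\
	do {						\
		if (!is_protected_sketch())		\
			host_fn(__VA_ARGS__);		\
		else					\
			hyp_fn(__VA_ARGS__);		\
	} while (0)
```

A call site then reads host_or_hyp_call_sketch(host_cpu_set_vector_sketch, pkvm_cpu_set_vector_sketch, slot) with no explicit mode check.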

> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 3cf9397dabdb..9d4c9251208e 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
> > > if (hyp_pgtable) {
> > > kvm_pgtable_hyp_destroy(hyp_pgtable);
> > > kfree(hyp_pgtable);
> > > + hyp_pgtable = NULL;
> > > }
> > > mutex_unlock(&kvm_hyp_pgd_mutex);
> > > }
> > >
> > > +static bool kvm_host_owns_hyp_mappings(void)
> > > +{
> > > + if (static_branch_likely(&kvm_protected_mode_initialized))
> > > + return false;
> > > +
> > > + /*
> > > + * This can happen at boot time when __create_hyp_mappings() is called
> > > + * after the hyp protection has been enabled, but the static key has
> > > + * not been flipped yet.
> > > + */
> > > + if (!hyp_pgtable && is_protected_kvm_enabled())
> > > + return false;
> > > +
> > > + BUG_ON(!hyp_pgtable);
> >
> > Can we fail more gracefully, e.g. by continuing without KVM?
>
> Got any suggestion as to how that can be done? We could also just remove
> that line -- that really should not happen.

Or downgrade to WARN_ON.

Will

2021-02-09 10:12:30

by Quentin Perret

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Thursday 04 Feb 2021 at 10:47:08 (+0000), Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > > +{
> > > + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > > + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > > + DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > > + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > > +
> > > + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
> >
> > __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> > is confusing.
>
> Very good point, I'll get rid of this.

Actually not, I think I'll leave it like that. __pkvm_init can return an
error, which is why I did this in the first place. And it is useful for
debugging to have it propagated back to the host.

Thanks,
Quentin

2021-02-09 12:26:38

by Will Deacon

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Tue, Feb 09, 2021 at 10:00:29AM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 10:47:08 (+0000), Quentin Perret wrote:
> > On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > > > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > > > +{
> > > > + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > > > + DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > > > + DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > > > + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > > > +
> > > > + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
> > >
> > > __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> > > is confusing.
> >
> > Very good point, I'll get rid of this.
>
> Actually not, I think I'll leave it like that. __pkvm_init can return an
> error, which is why I did this in the first place. And it is useful for
> debugging to have it propagated back to the host.

Good point, but please add a comment!

Will

2021-02-17 16:30:48

by Mate Toth-Pal

Subject: Re: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host

Hi Quentin,


On 2021-01-08 13:14, Quentin Perret wrote:
> Hi all,
>
> This is the v2 of the series previously posted here:
>
> https://lore.kernel.org/kvmarm/[email protected]/
>
> This basically allows us to wrap the host with a stage 2 when running in
> nVHE, hence paving the way for protecting guest memory from the host in
> the future (among other use-cases). For more details about the
> motivation and the design angle taken here, I would recommend to have a
> look at the cover letter of v1, and/or to watch these presentations at
> LPC [1] and KVM forum 2020 [2].


We tested the pKVM changes pulled from here:


> https://android-kvm.googlesource.com/linux qperret/host-stage2-v2


We were using a target with Arm architecture with FEAT_S2FWB, and found
that there is a bug in the patch.


It turned out that the Kernel checks for the extension, and sets up the
stage 2 translation so that it forces the host memory type to
write-through. However it seems that the code doesn't turn on the
feature in the HCR_EL2 register.


We were able to fix the issue by applying the following patch:


diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 0cd3eb178f3b..e8521a072ea6 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -105,6 +105,8 @@ int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
                params->vttbr = kvm_get_vttbr(mmu);
                params->vtcr = host_kvm.arch.vtcr;
                params->hcr_el2 |= HCR_VM;
+               if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+                       params->hcr_el2 |= HCR_FWB;
                __flush_dcache_area(params, sizeof(*params));
        }


Best regards,
Mate Toth-Pal

2021-02-17 22:16:37

by Quentin Perret

Subject: Re: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host

Hi Mate,

On Wednesday 17 Feb 2021 at 17:27:07 (+0100), Mate Toth-Pal wrote:
> We tested the pKVM changes pulled from here:
>
>
> > https://android-kvm.googlesource.com/linux qperret/host-stage2-v2
>
>
> We were using a target with Arm architecture with FEAT_S2FWB, and found that
> there is a bug in the patch.
>
>
> It turned out that the Kernel checks for the extension, and sets up the
> stage 2 translation so that it forces the host memory type to write-through.
> However it seems that the code doesn't turn on the feature in the HCR_EL2
> register.
>
>
> We were able to fix the issue by applying the following patch:
>
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 0cd3eb178f3b..e8521a072ea6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -105,6 +105,8 @@ int kvm_host_prepare_stage2(void *mem_pgt_pool, void *dev_pgt_pool)
>                 params->vttbr = kvm_get_vttbr(mmu);
>                 params->vtcr = host_kvm.arch.vtcr;
>                 params->hcr_el2 |= HCR_VM;
> +               if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> +                       params->hcr_el2 |= HCR_FWB;
>                 __flush_dcache_area(params, sizeof(*params));
>         }

Aha, indeed, this looks right. I'll double check HCR_EL2 to see if I'm
missing any other, and I'll add this to v3.
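
As a standalone illustration of the fix (not the hyp code itself), the HCR_EL2 value for the host stage 2 could be composed and sanity-checked as below. The bit positions follow the architectural layout of HCR_EL2 (VM is bit 0, FWB is bit 46); the _SKETCH names and both helpers are made up for this example, and has_s2fwb stands in for the ARM64_HAS_STAGE2_FWB capability check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HCR_VM_SKETCH	(UINT64_C(1) << 0)	/* HCR_EL2.VM: enable stage 2 */
#define HCR_FWB_SKETCH	(UINT64_C(1) << 46)	/* HCR_EL2.FWB: FEAT_S2FWB */

/* Compose the HCR_EL2 value used once the host runs behind a stage 2. */
static uint64_t host_stage2_hcr_sketch(uint64_t hcr, bool has_s2fwb)
{
	hcr |= HCR_VM_SKETCH;

	/*
	 * If the page-table code encodes stage-2 attributes in the FWB
	 * format, the FWB bit has to be set as well, otherwise the
	 * hardware decodes those attributes with the non-FWB rules.
	 */
	if (has_s2fwb)
		hcr |= HCR_FWB_SKETCH;

	return hcr;
}

/* Debug check: the FWB bit must track the capability the tables assume. */
static void check_host_stage2_hcr_sketch(uint64_t hcr, bool has_s2fwb)
{
	assert((hcr & HCR_VM_SKETCH) != 0);
	assert(!has_s2fwb || (hcr & HCR_FWB_SKETCH) != 0);
}
```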

Thanks for testing, and for the report.
Quentin

2021-02-19 17:58:29

by Sean Christopherson

Subject: Re: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host

On Fri, Jan 08, 2021, Quentin Perret wrote:
> [2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google

I couldn't find any slides on the official KVM forum site linked above. I was
able to track down a mirror[1] and the recorded presentation[2].

[1] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/slides/kvmforum-2020-edited.pdf
[2] https://youtu.be/wY-u6n75iXc

2021-02-19 17:59:23

by Quentin Perret

Subject: Re: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host

On Friday 19 Feb 2021 at 09:54:38 (-0800), Sean Christopherson wrote:
> On Fri, Jan 08, 2021, Quentin Perret wrote:
> > [2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
>
> I couldn't find any slides on the official KVM forum site linked above. I was
> able to track down a mirror[1] and the recorded presentation[2].
>
> [1] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/slides/kvmforum-2020-edited.pdf
> [2] https://youtu.be/wY-u6n75iXc

Much nicer, I'll make sure to link those in the next cover letter.

Thanks Sean!
Quentin

2021-02-19 18:34:39

by Sean Christopherson

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

On Wed, Feb 03, 2021, Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:

...

> > +static inline unsigned long hyp_s1_pgtable_size(void)
> > +{

...

> > + res += nr_pages << PAGE_SHIFT;
> > + }
> > +
> > + /* Allow 1 GiB for private mappings */
> > + nr_pages = (1 << 30) >> PAGE_SHIFT;
>
> SZ_1G >> PAGE_SHIFT

Where does the 1gb magic number come from? IIUC, this is calculating the number
of pages needed for the hypervisor's Stage-1 page tables. The amount of memory
needed for those page tables should be easily calculated, and assuming huge
pages can be used, should be far less than 1gb.

> > + nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > + res += nr_pages << PAGE_SHIFT;
> > +
> > + return res;

...

> > +void __init kvm_hyp_reserve(void)
> > +{
> > + u64 nr_pages, prev;
> > +
> > + if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > + return;
> > +
> > + if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > + return;
> > +
> > + if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > + kvm_err("Failed to register hyp memblocks\n");
> > + return;
> > + }
> > +
> > + sort_memblock_regions();
> > +
> > + /*
> > + * We don't know the number of possible CPUs yet, so allocate for the
> > + * worst case.
> > + */
> > + hyp_mem_size += NR_CPUS << PAGE_SHIFT;

Is this for per-cpu stack?

If so, what guarantees a single page is sufficient? Mostly a curiosity question,
since it looks like this is an existing assumption by init_hyp_mode(). Shouldn't
the required stack size be defined in bytes and converted to pages, or is there a
guarantee that 64kb pages will be used?
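
To illustrate the "bytes first, pages second" idea, here is a small sketch of how the per-CPU stack reservation could be derived. The helper name and its parameters are hypothetical; it only spells out the arithmetic.

```c
#include <assert.h>

#define DIV_ROUND_UP_SKETCH(n, d)	(((n) + (d) - 1) / (d))

/*
 * Bytes to reserve for per-CPU stacks when the stack size is given in
 * bytes and rounded up to whole pages.
 */
static unsigned long hyp_stack_reserve_bytes_sketch(unsigned long nr_cpus,
						    unsigned long stack_bytes,
						    unsigned int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long pages_per_stack;

	assert(stack_bytes > 0);

	pages_per_stack = DIV_ROUND_UP_SKETCH(stack_bytes, page_size);

	return nr_cpus * pages_per_stack * page_size;
}
```

With 512 CPUs and a 4 KiB stack, this comes to 2 MiB with 4K pages and 32 MiB with 64K pages, since each stack still occupies a whole page.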

> There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
> with 64k pages. Is it possible to return memory to the host later on once
> we have a better handle on the number of CPUs in the system?

Does kvm_hyp_reserve() really need to be called during bootmem_init()? What
prevents doing the reservation during init_hyp_mode()? If the problem is that
pKVM needs a single contiguous chunk of memory, then it might be worth solving
_that_ problem, e.g. letting the host donate memory in N-byte chunks instead of
requiring a single huge blob of memory.

> > + hyp_mem_size += hyp_s1_pgtable_size();

2021-02-22 11:06:17

by Quentin Perret

Subject: Re: [RFC PATCH v2 16/26] KVM: arm64: Prepare Hyp memory protection

Hi Sean,

On Friday 19 Feb 2021 at 10:32:58 (-0800), Sean Christopherson wrote:
> On Wed, Feb 03, 2021, Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
>
> ...
>
> > > +static inline unsigned long hyp_s1_pgtable_size(void)
> > > +{
>
> ...
>
> > > + res += nr_pages << PAGE_SHIFT;
> > > + }
> > > +
> > > + /* Allow 1 GiB for private mappings */
> > > + nr_pages = (1 << 30) >> PAGE_SHIFT;
> >
> > SZ_1G >> PAGE_SHIFT
>
> Where does the 1gb magic number come from?

Admittedly it is arbitrary. It needs to be enough to cover all the
so-called 'private' mappings that EL2 needs, and which can vary a little
depending on the hardware.

> IIUC, this is calculating the number
> of pages needed for the hypervisor's Stage-1 page tables.

Correct. The thing worth noting is that the hypervisor VA space is
essentially split in half. One half is reserved to map portions of
memory with a fixed offset, and the other half is used for a whole bunch
of other things: we have a vmemmap, the 'private' mappings and the idmap
page.

> The amount of memory
> needed for those page tables should be easily calculated

As mentioned above, that is true for pretty much everything in the hyp
VA space except the private mappings as that depends on e.g. the CPU
uarch and such.

> and assuming huge pages can be used, should be far less than 1gb.

Ack, though this is not supported for the EL2 mappings yet. Historically
the amount of contiguous portions of memory mapped at EL2 has been
rather small, so there wasn't really a need, but we might want to
revisit this at some point.
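
For a rough sense of scale, the page-table cost of covering 1 GiB of VA space can be worked out with and without block mappings. The sketch below assumes a 4K granule with 512 entries per table; the helper and the SZ_*_SKETCH constants are made up for this example, and the 2 MiB case is hypothetical since block mappings are not used for EL2 today.

```c
#include <assert.h>

#define ENTRIES_PER_TABLE_SKETCH	512ULL	/* 4K granule, 8-byte descriptors */
#define SZ_4K_SKETCH			(1ULL << 12)
#define SZ_2M_SKETCH			(1ULL << 21)
#define SZ_1G_SKETCH			(1ULL << 30)

/*
 * Number of page-table pages needed to map `range` bytes with leaf
 * entries of `leaf_size` bytes, walking `nr_table_levels` levels of
 * tables from the leaf tables back to the root.
 */
static unsigned long long tables_to_map_sketch(unsigned long long range,
					       unsigned long long leaf_size,
					       unsigned int nr_table_levels)
{
	unsigned long long entries, total = 0;

	assert(leaf_size != 0);

	entries = (range + leaf_size - 1) / leaf_size;
	while (nr_table_levels--) {
		entries = (entries + ENTRIES_PER_TABLE_SKETCH - 1) /
			  ENTRIES_PER_TABLE_SKETCH;
		total += entries;
	}

	return total;
}
```

Under these assumptions, mapping 1 GiB with 4K pages over 4 levels takes 515 tables (a little over 2 MiB of page-table memory), while 2 MiB blocks over 3 levels would take 3.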

> > > + nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > > + res += nr_pages << PAGE_SHIFT;
> > > +
> > > + return res;
>
> ...
>
> > > +void __init kvm_hyp_reserve(void)
> > > +{
> > > + u64 nr_pages, prev;
> > > +
> > > + if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > > + return;
> > > +
> > > + if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > > + return;
> > > +
> > > + if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > > + kvm_err("Failed to register hyp memblocks\n");
> > > + return;
> > > + }
> > > +
> > > + sort_memblock_regions();
> > > +
> > > + /*
> > > + * We don't know the number of possible CPUs yet, so allocate for the
> > > + * worst case.
> > > + */
> > > + hyp_mem_size += NR_CPUS << PAGE_SHIFT;
>
> Is this for per-cpu stack?

Correct.

> If so, what guarantees a single page is sufficient? Mostly a curiosity question,
> since it looks like this is an existing assumption by init_hyp_mode(). Shouldn't
> the required stack size be defined in bytes and converted to pages, or is there a
> guarantee that 64kb pages will be used?

Nope, we have no such guarantees, but 4K has been more than enough for
EL2 so far. The hyp code doesn't use recursion much (I think the only
occurrence we have is Will's pgtable code, and that is architecturally
limited to 4 levels of recursion for obvious reasons) and doesn't use
stack allocations.

It's on my todo list to remap the stack pages in the 'private' range, to
surround them with guard pages so we can at least run-time check this
assumption, so stay tuned :)

> > There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
> > with 64k pages. Is it possible to return memory to the host later on once
> > we have a better handle on the number of CPUs in the system?
>
> Does kvm_hyp_reserve() really need to be called during bootmem_init()? What
> prevents doing the reservation during init_hyp_mode()? If the problem is that
> pKVM needs a single contiguous chunk of memory, then it might be worth solving
> _that_ problem, e.g. letting the host donate memory in N-byte chunks instead of
> requiring a single huge blob of memory.

Right, I've been thinking about this over the weekend and that might
actually be fairly straightforward for stack pages. I'll try to move this
allocation to init_hyp_mode() where it belongs (or better, re-use the
existing one) in the next version.

Thanks,
Quentin