This series is based on v5.17-rc3 and adds the following stack features to
the KVM nVHE hypervisor:
== Hyp Stack Guard Pages ==
Based on the technique used by arm64 VMAP_STACK to detect overflow.
i.e. the stack is aligned to twice its size which ensure that the
'stack shift' bit of any valid SP is 0. The 'stack shift' bit can be
tested in the exception entry to detect overflow without corrupting GPRs.
== Hyp Stack Unwinder ==
Based on the arm64 kernel stack unwinder
(See: arch/arm64/kernel/stacktrace.c)
The unwinding and dumping of the hyp stack is not enabled by default and
depends on CONFIG_NVHE_EL2_DEBUG to avoid potential information leaks.
When CONFIG_NVHE_EL2_DEBUG is enabled the host stage 2 protection is
disabled, allowing the host to read the hypervisor stack pages and unwind
the stack from EL1. This allows us to print the hypervisor stacktrace
before panicking the host; as shown below:
kvm [408]: nVHE hyp panic at: \
[<ffffffc01161460c>] __kvm_nvhe_overflow_stack+0x10/0x34!
kvm [408]: nVHE HYP call trace:
kvm [408]: [<ffffffc011614974>] __kvm_nvhe_hyp_panic_bad_stack+0xc/0x10
kvm [408]: [<ffffffc01160fa48>] __kvm_nvhe___kvm_hyp_host_vector+0x248/0x794
kvm [408]: [<ffffffc01161461c>] __kvm_nvhe_overflow_stack+0x20/0x34
. . .
kvm [408]: [<ffffffc01161461c>] __kvm_nvhe_overflow_stack+0x20/0x34
kvm [408]: [<ffffffc01161421c>] __kvm_nvhe___kvm_vcpu_run+0x2c/0x40c
kvm [408]: [<ffffffc011615e14>] __kvm_nvhe_handle___kvm_vcpu_run+0x1c8/0x36c
kvm [408]: [<ffffffc0116157c4>] __kvm_nvhe_handle_trap+0xa4/0x124
kvm [408]: [<ffffffc01160f060>] __kvm_nvhe___host_exit+0x60/0x64
kvm [408]: ---- end of nVHE HYP call trace ----
Kalesh Singh (3):
KVM: arm64: Add Hyp overflow stack
KVM: arm64: Unwind and dump nVHE HYP stacktrace
KVM: arm64: Symbolize the nVHE HYP backtrace
Quentin Perret (4):
KVM: arm64: Map the stack pages in the 'private' range
KVM: arm64: Factor out private range VA allocation
arm64: asm: Introduce test_sp_overflow macro
KVM: arm64: Allocate guard pages near hyp stacks
arch/arm64/include/asm/assembler.h | 11 +
arch/arm64/include/asm/kvm_asm.h | 17 ++
arch/arm64/kernel/entry.S | 9 +-
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/handle_exit.c | 14 +-
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/nvhe/host.S | 21 ++
arch/arm64/kvm/hyp/nvhe/mm.c | 28 ++-
arch/arm64/kvm/hyp/nvhe/setup.c | 63 +++++-
arch/arm64/kvm/hyp/nvhe/switch.c | 22 ++
arch/arm64/kvm/stacktrace.c | 290 +++++++++++++++++++++++++++
arch/arm64/kvm/stacktrace.h | 17 ++
scripts/kallsyms.c | 2 +-
14 files changed, 468 insertions(+), 30 deletions(-)
create mode 100644 arch/arm64/kvm/stacktrace.c
create mode 100644 arch/arm64/kvm/stacktrace.h
base-commit: dfd42facf1e4ada021b939b4e19c935dcdd55566
--
2.35.1.265.g69c8d7142f-goog
From: Quentin Perret <[email protected]>
In preparation for introducing guard pages for the stacks, map them
in the 'private' range of the EL2 VA space in which the VA to PA
relation is flexible when running in protected mode.
Signed-off-by: Quentin Perret <[email protected]>
[Kalesh - Refactor, add comments, resolve conflicts,
use __pkvm_create_private_mapping()]
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/setup.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 27af337f9fea..99e178cf4249 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -105,11 +105,19 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- end = (void *)per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va;
+ /* Map stack pages in the 'private' VA range */
+ end = (void *)__hyp_pa(per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va);
start = end - PAGE_SIZE;
- ret = pkvm_create_mappings(start, end, PAGE_HYP);
- if (ret)
- return ret;
+ start = (void *)__pkvm_create_private_mapping((phys_addr_t)start,
+ PAGE_SIZE, PAGE_HYP);
+ if (IS_ERR_OR_NULL(start))
+ return PTR_ERR(start);
+ end = start + PAGE_SIZE;
+ /*
+ * Update stack_hyp_va to the end of the stack page's
+ * allocated 'private' VA range.
+ */
+ per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va = (unsigned long) end;
}
/*
--
2.35.1.265.g69c8d7142f-goog
From: Quentin Perret <[email protected]>
Allocate unbacked VA space underneath each stack page to ensure stack
overflows get trapped and don't corrupt memory silently.
The stack is aligned to twice its size (PAGE_SIZE), meaning that any
valid stack address has PAGE_SHIFT bit as 0. This allows us to easily
check for overflow in the exception entry without corrupting any GPRs.
Signed-off-by: Quentin Perret <[email protected]>
[ Kalesh - Update commit text and comments,
refactor, add overflow handling ]
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/host.S | 16 ++++++++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 19 ++++++++++++++++++-
arch/arm64/kvm/hyp/nvhe/switch.c | 5 +++++
3 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
index 3d613e721a75..78e4b612ac06 100644
--- a/arch/arm64/kvm/hyp/nvhe/host.S
+++ b/arch/arm64/kvm/hyp/nvhe/host.S
@@ -153,6 +153,10 @@ SYM_FUNC_END(__host_hvc)
.macro invalid_host_el2_vect
.align 7
+
+ /* Test stack overflow without corrupting GPRs */
+ test_sp_overflow PAGE_SHIFT, .L__hyp_sp_overflow\@
+
/* If a guest is loaded, panic out of it. */
stp x0, x1, [sp, #-16]!
get_loaded_vcpu x0, x1
@@ -165,6 +169,18 @@ SYM_FUNC_END(__host_hvc)
* been partially clobbered by __host_enter.
*/
b hyp_panic
+
+.L__hyp_sp_overflow\@:
+ /*
+ * Reset SP to the top of the stack, to allow handling the hyp_panic.
+ * This corrupts the stack but is ok, since we won't be attempting
+ * any unwinding here.
+ */
+ ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
+ mov sp, x0
+
+ bl hyp_panic_bad_stack
+ ASM_BUG()
.endm
.macro invalid_host_el1_vect
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 99e178cf4249..114053dff228 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -105,7 +105,24 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- /* Map stack pages in the 'private' VA range */
+ /*
+ * Allocate 'private' VA range for stack guard pages.
+ *
+ * The 'private' VA range grows upward and stacks downwards, so
+ * allocate the guard page first. But make sure to align the
+ * stack itself with PAGE_SIZE * 2 granularity to ease overflow
+ * detection in the entry assembly code.
+ */
+ do {
+ start = (void *)hyp_alloc_private_va_range(PAGE_SIZE);
+ if (IS_ERR(start))
+ return PTR_ERR(start);
+ } while (IS_ALIGNED((u64) start, PAGE_SIZE * 2));
+
+ /*
+ * Map stack pages in the 'private' VA range above the allocated
+ * guard pages.
+ */
end = (void *)__hyp_pa(per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va);
start = end - PAGE_SIZE;
start = (void *)__pkvm_create_private_mapping((phys_addr_t)start,
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 6410d21d8695..5a2e1ab79913 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -369,6 +369,11 @@ void __noreturn hyp_panic(void)
unreachable();
}
+void __noreturn hyp_panic_bad_stack(void)
+{
+ hyp_panic();
+}
+
asmlinkage void kvm_unexpected_el2_exception(void)
{
return __kvm_unexpected_el2_exception();
--
2.35.1.265.g69c8d7142f-goog
Unwind the stack in EL1, when CONFIG_NVHE_EL2_DEBUG is enabled. This is
possible because CONFIG_NVHE_EL2_DEBUG disables the host stage 2 protection
which allows host to access the hypervisor stack pages in EL1.
Unwinding and dumping hyp call traces is gated on CONFIG_NVHE_EL2_DEBUG
to avoid the potential leaking of information to the host.
A simple stack overflow test produces the following output:
[ 580.376051][ T412] kvm: nVHE hyp panic at: ffffffc0116145c4!
[ 580.378034][ T412] kvm [412]: nVHE HYP call trace (vmlinux
addresses):
[ 580.378591][ T412] kvm [412]: [<ffffffc011614934>]
[ 580.378993][ T412] kvm [412]: [<ffffffc01160fa48>]
[ 580.379386][ T412] kvm [412]: [<ffffffc0116145dc>] // Non-terminating recursive call
[ 580.379772][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380158][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380544][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380928][ T412] kvm [412]: [<ffffffc0116145dc>]
. . .
Since nVHE hyp symbols are not included by kallsyms to avoid issues
with aliasing, we fallback to the vmlinux addresses. Symbolizing the
addresses is handled in the next patch in this series.
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 17 ++
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/handle_exit.c | 3 +
arch/arm64/kvm/hyp/nvhe/setup.c | 25 +++
arch/arm64/kvm/hyp/nvhe/switch.c | 17 ++
arch/arm64/kvm/stacktrace.c | 290 +++++++++++++++++++++++++++++++
arch/arm64/kvm/stacktrace.h | 17 ++
8 files changed, 371 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/stacktrace.c
create mode 100644 arch/arm64/kvm/stacktrace.h
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index d5b0386ef765..f2b4c2ae5905 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -175,6 +175,23 @@ struct kvm_nvhe_init_params {
unsigned long vtcr;
};
+#ifdef CONFIG_NVHE_EL2_DEBUG
+/*
+ * Used by the host in EL1 to dump the nVHE hypervisor backtrace on
+ * hyp_panic. This is possible because CONFIG_NVHE_EL2_DEBUG disables
+ * the host stage 2 protection. See: __hyp_do_panic()
+ *
+ * @hyp_stack_base: hyp VA of the hyp_stack base.
+ * @hyp_overflow_stack_base: hyp VA of the hyp_overflow_stack base.
+ * @start_fp: hyp FP where the hyp backtrace should begin.
+ */
+struct kvm_nvhe_panic_info {
+ unsigned long hyp_stack_base;
+ unsigned long hyp_overflow_stack_base;
+ unsigned long start_fp;
+};
+#endif
+
/* Translate a kernel address @ptr into its equivalent linear mapping */
#define kvm_ksym_ref(ptr) \
({ \
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 91861fd8b897..262b5c58cc62 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,6 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-its.o vgic/vgic-debug.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o
+kvm-$(CONFIG_NVHE_EL2_DEBUG) += stacktrace.o
always-y := hyp_constants.h hyp-constants.s
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ecc5958e27fe..f779436919ad 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -49,7 +49,7 @@ DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
-static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
+DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e3140abd2e2e..b038c32a3236 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -23,6 +23,7 @@
#define CREATE_TRACE_POINTS
#include "trace_handle_exit.h"
+#include "stacktrace.h"
typedef int (*exit_handle_fn)(struct kvm_vcpu *);
@@ -326,6 +327,8 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr,
kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset);
}
+ hyp_dump_backtrace(hyp_offset);
+
/*
* Hyp has panicked and we're going to handle that by panicking the
* kernel. The kernel offset will be revealed in the panic so we're
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 39937fa6a1b2..3d7720d25acb 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -23,6 +23,29 @@ unsigned long hyp_nr_cpus;
#ifdef CONFIG_NVHE_EL2_DEBUG
DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
__aligned(16);
+
+DEFINE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
+
+static void init_nvhe_panic_info(void)
+{
+ struct kvm_nvhe_panic_info *panic_info;
+ struct kvm_nvhe_init_params *params;
+ int cpu;
+
+ for (cpu = 0; cpu < hyp_nr_cpus; cpu++) {
+ panic_info = per_cpu_ptr(&kvm_panic_info, cpu);
+ params = per_cpu_ptr(&kvm_init_params, cpu);
+
+ panic_info->hyp_stack_base = (unsigned long)(params->stack_hyp_va - PAGE_SIZE);
+ panic_info->hyp_overflow_stack_base
+ = (unsigned long)per_cpu_ptr(hyp_overflow_stack, cpu);
+ panic_info->start_fp = 0;
+ }
+}
+#else
+static inline void init_nvhe_panic_info(void)
+{
+}
#endif
#define hyp_percpu_size ((unsigned long)__per_cpu_end - \
@@ -140,6 +163,8 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
* allocated 'private' VA range.
*/
per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va = (unsigned long) end;
+
+ init_nvhe_panic_info();
}
/*
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 5a2e1ab79913..8f8cd0c02e1a 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -34,6 +34,21 @@ DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
+#ifdef CONFIG_NVHE_EL2_DEBUG
+DECLARE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
+
+static void update_nvhe_panic_info_fp(void)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr(&kvm_panic_info);
+
+ panic_info->start_fp = (unsigned long)__builtin_frame_address(0);
+}
+#else
+static inline void update_nvhe_panic_info_fp(void)
+{
+}
+#endif
+
static void __activate_traps(struct kvm_vcpu *vcpu)
{
u64 val;
@@ -355,6 +370,8 @@ void __noreturn hyp_panic(void)
struct kvm_cpu_context *host_ctxt;
struct kvm_vcpu *vcpu;
+ update_nvhe_panic_info_fp();
+
host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
vcpu = host_ctxt->__hyp_running_vcpu;
diff --git a/arch/arm64/kvm/stacktrace.c b/arch/arm64/kvm/stacktrace.c
new file mode 100644
index 000000000000..3990a616ab66
--- /dev/null
+++ b/arch/arm64/kvm/stacktrace.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Stack unwinder for EL2 nVHE hypervisor.
+ *
+ * Code mostly copied from the arm64 kernel stack unwinder
+ * and adapted to the nVHE hypervisor.
+ *
+ * See: arch/arm64/kernel/stacktrace.c
+ *
+ * CONFIG_NVHE_EL2_DEBUG disables the host stage-2 protection
+ * allowing us to access the hypervisor stack pages and
+ * consequently unwind its stack from the host in EL1.
+ *
+ * See: __hyp_do_panic()
+ */
+
+#include <asm/kvm_asm.h>
+#include <asm/kvm_hyp.h>
+#include <linux/kvm_host.h>
+#include "stacktrace.h"
+
+DECLARE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
+DECLARE_KVM_NVHE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack);
+DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
+
+enum hyp_stack_type {
+ HYP_STACK_TYPE_UNKNOWN,
+ HYP_STACK_TYPE_HYP,
+ HYP_STACK_TYPE_OVERFLOW,
+ __NR_HYP_STACK_TYPES
+};
+
+struct hyp_stack_info {
+ unsigned long low;
+ unsigned long high;
+ enum hyp_stack_type type;
+};
+
+/*
+ * A snapshot of a frame record or fp/lr register values, along with some
+ * accounting information necessary for robust unwinding.
+ *
+ * @fp: The fp value in the frame record (or the real fp)
+ * @pc: The pc value calculated from lr in the frame record.
+ *
+ * @stacks_done: Stacks which have been entirely unwound, for which it is no
+ * longer valid to unwind to.
+ *
+ * @prev_fp: The fp that pointed to this frame record, or a synthetic value
+ * of 0. This is used to ensure that within a stack, each
+ * subsequent frame record is at an increasing address.
+ * @prev_type: The type of stack this frame record was on, or a synthetic
+ * value of HYP_STACK_TYPE_UNKNOWN. This is used to detect a
+ * transition from one stack to another.
+ */
+struct hyp_stackframe {
+ unsigned long fp;
+ unsigned long pc;
+ DECLARE_BITMAP(stacks_done, __NR_HYP_STACK_TYPES);
+ unsigned long prev_fp;
+ enum hyp_stack_type prev_type;
+};
+
+static inline bool __on_hyp_stack(unsigned long hyp_sp, unsigned long size,
+ unsigned long low, unsigned long high,
+ enum hyp_stack_type type,
+ struct hyp_stack_info *info)
+{
+ if (!low)
+ return false;
+
+ if (hyp_sp < low || hyp_sp + size < hyp_sp || hyp_sp + size > high)
+ return false;
+
+ if (info) {
+ info->low = low;
+ info->high = high;
+ info->type = type;
+ }
+ return true;
+}
+
+static inline bool on_hyp_overflow_stack(unsigned long hyp_sp, unsigned long size,
+ struct hyp_stack_info *info)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long low = (unsigned long)panic_info->hyp_overflow_stack_base;
+ unsigned long high = low + PAGE_SIZE;
+
+ return __on_hyp_stack(hyp_sp, size, low, high, HYP_STACK_TYPE_OVERFLOW, info);
+}
+
+static inline bool on_hyp_stack(unsigned long hyp_sp, unsigned long size,
+ struct hyp_stack_info *info)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long low = (unsigned long)panic_info->hyp_stack_base;
+ unsigned long high = low + PAGE_SIZE;
+
+ return __on_hyp_stack(hyp_sp, size, low, high, HYP_STACK_TYPE_HYP, info);
+}
+
+static inline bool on_hyp_accessible_stack(unsigned long hyp_sp, unsigned long size,
+ struct hyp_stack_info *info)
+{
+ if (info)
+ info->type = HYP_STACK_TYPE_UNKNOWN;
+
+ if (on_hyp_stack(hyp_sp, size, info))
+ return true;
+ if (on_hyp_overflow_stack(hyp_sp, size, info))
+ return true;
+
+ return false;
+}
+
+static unsigned long __hyp_stack_kern_va(unsigned long hyp_va)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long hyp_base, kern_base, hyp_offset;
+
+ hyp_base = (unsigned long)panic_info->hyp_stack_base;
+ hyp_offset = hyp_va - hyp_base;
+
+ kern_base = (unsigned long)*this_cpu_ptr(&kvm_arm_hyp_stack_page);
+
+ return kern_base + hyp_offset;
+}
+
+static unsigned long __hyp_overflow_stack_kern_va(unsigned long hyp_va)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long hyp_base, kern_base, hyp_offset;
+
+ hyp_base = (unsigned long)panic_info->hyp_overflow_stack_base;
+ hyp_offset = hyp_va - hyp_base;
+
+ kern_base = (unsigned long)this_cpu_ptr_nvhe_sym(hyp_overflow_stack);
+
+ return kern_base + hyp_offset;
+}
+
+/*
+ * Convert hypervisor stack VA to a kernel VA.
+ *
+ * The hypervisor stack is mapped in the flexible 'private' VA range, to allow
+ * for guard pages below the stack. Consequently, the fixed offset address
+ * translation macros won't work here.
+ *
+ * The kernel VA is calculated as an offset from the kernel VA of the hypervisor
+ * stack base. See: __hyp_stack_kern_va(), __hyp_overflow_stack_kern_va()
+ */
+static unsigned long hyp_stack_kern_va(unsigned long hyp_va,
+ enum hyp_stack_type stack_type)
+{
+ switch (stack_type) {
+ case HYP_STACK_TYPE_HYP:
+ return __hyp_stack_kern_va(hyp_va);
+ case HYP_STACK_TYPE_OVERFLOW:
+ return __hyp_overflow_stack_kern_va(hyp_va);
+ default:
+ return 0UL;
+ }
+}
+
+/*
+ * Unwind from one frame record (A) to the next frame record (B).
+ *
+ * We terminate early if the location of B indicates a malformed chain of frame
+ * records (e.g. a cycle), determined based on the location and fp value of A
+ * and the location (but not the fp value) of B.
+ */
+static int notrace hyp_unwind_frame(struct hyp_stackframe *frame)
+{
+ unsigned long fp = frame->fp, fp_kern_va;
+ struct hyp_stack_info info;
+
+ if (fp & 0x7)
+ return -EINVAL;
+
+ if (!on_hyp_accessible_stack(fp, 16, &info))
+ return -EINVAL;
+
+ if (test_bit(info.type, frame->stacks_done))
+ return -EINVAL;
+
+ /*
+ * As stacks grow downward, any valid record on the same stack must be
+ * at a strictly higher address than the prior record.
+ *
+ * Stacks can nest in the following order:
+ *
+ * HYP -> OVERFLOW
+ *
+ * ... but the nesting itself is strict. Once we transition from one
+ * stack to another, it's never valid to unwind back to that first
+ * stack.
+ */
+ if (info.type == frame->prev_type) {
+ if (fp <= frame->prev_fp)
+ return -EINVAL;
+ } else {
+ set_bit(frame->prev_type, frame->stacks_done);
+ }
+
+ /* Translate the hyp stack address to a kernel address */
+ fp_kern_va = hyp_stack_kern_va(fp, info.type);
+ if (!fp_kern_va)
+ return -EINVAL;
+
+ /*
+ * Record this frame record's values and location. The prev_fp and
+ * prev_type are only meaningful to the next hyp_unwind_frame()
+ * invocation.
+ */
+ frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp_kern_va));
+ /* PC = LR - 4; All aarch64 instructions are 32-bits in size */
+ frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp_kern_va + 8)) - 4;
+ frame->prev_fp = fp;
+ frame->prev_type = info.type;
+
+ return 0;
+}
+
+/*
+ * AArch64 PCS assigns the frame pointer to x29.
+ *
+ * A simple function prologue looks like this:
+ * sub sp, sp, #0x10
+ * stp x29, x30, [sp]
+ * mov x29, sp
+ *
+ * A simple function epilogue looks like this:
+ * mov sp, x29
+ * ldp x29, x30, [sp]
+ * add sp, sp, #0x10
+ */
+static void hyp_start_backtrace(struct hyp_stackframe *frame, unsigned long fp)
+{
+ frame->fp = fp;
+
+ /*
+ * Prime the first unwind.
+ *
+ * In hyp_unwind_frame() we'll check that the FP points to a valid
+ * stack, which can't be HYP_STACK_TYPE_UNKNOWN, and the first unwind
+ * will be treated as a transition to whichever stack that happens to
+ * be. The prev_fp value won't be used, but we set it to 0 such that
+ * it is definitely not an accessible stack address. The first frame
+ * (hyp_panic()) is skipped, so we also set PC to 0.
+ */
+ bitmap_zero(frame->stacks_done, __NR_HYP_STACK_TYPES);
+ frame->pc = frame->prev_fp = 0;
+ frame->prev_type = HYP_STACK_TYPE_UNKNOWN;
+}
+
+static void hyp_dump_backtrace_entry(unsigned long hyp_pc, unsigned long hyp_offset)
+{
+ unsigned long va_mask = GENMASK_ULL(vabits_actual - 1, 0);
+
+ hyp_pc &= va_mask;
+ hyp_pc += hyp_offset;
+
+ kvm_err(" [<%016llx>]\n", hyp_pc);
+}
+
+void hyp_dump_backtrace(unsigned long hyp_offset)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ struct hyp_stackframe frame;
+ int frame_nr = 0;
+ int skip = 1; /* Skip the first frame: hyp_panic() */
+
+ kvm_err("nVHE HYP call trace (vmlinux addresses):\n");
+
+ hyp_start_backtrace(&frame, (unsigned long)panic_info->start_fp);
+
+ do {
+ if (skip) {
+ skip--;
+ continue;
+ }
+
+ hyp_dump_backtrace_entry(frame.pc, hyp_offset);
+
+ frame_nr++;
+ } while (!hyp_unwind_frame(&frame));
+
+ kvm_err("---- end of nVHE HYP call trace ----\n");
+}
diff --git a/arch/arm64/kvm/stacktrace.h b/arch/arm64/kvm/stacktrace.h
new file mode 100644
index 000000000000..40c397394b9b
--- /dev/null
+++ b/arch/arm64/kvm/stacktrace.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Stack unwinder for EL2 nVHE hypervisor.
+ */
+
+#ifndef __KVM_HYP_STACKTRACE_H
+#define __KVM_HYP_STACKTRACE_H
+
+#ifdef CONFIG_NVHE_EL2_DEBUG
+void hyp_dump_backtrace(unsigned long hyp_offset);
+#else
+static inline void hyp_dump_backtrace(unsigned long hyp_offset)
+{
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
+
+#endif /* __KVM_HYP_STACKTRACE_H */
--
2.35.1.265.g69c8d7142f-goog
From: Quentin Perret <[email protected]>
__pkvm_create_private_mapping() is currently responsible for allocating
VA space in the hypervisor's "private" range and creating stage-1
mappings. In order to allow reusing the VA space allocation logic from
other places, let's factor it out in a standalone function. This is will
be used to allocate private VA ranges for hypervisor stack guard pages
in a subsequent patch in this series.
Signed-off-by: Quentin Perret <[email protected]>
[Kalesh - Resolve conflicts and make hyp_alloc_private_va_range
available outside nvhe/mm.c, update commit message]
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/nvhe/mm.c | 28 +++++++++++++++++++---------
2 files changed, 20 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index 2d08510c6cc1..f53fb0e406db 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -21,6 +21,7 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
enum kvm_pgtable_prot prot);
+unsigned long hyp_alloc_private_va_range(size_t size);
static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
unsigned long *start, unsigned long *end)
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index 526a7d6fa86f..e196441e072f 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -37,6 +37,22 @@ static int __pkvm_create_mappings(unsigned long start, unsigned long size,
return err;
}
+unsigned long hyp_alloc_private_va_range(size_t size)
+{
+ unsigned long addr = __io_map_base;
+
+ hyp_assert_lock_held(&pkvm_pgd_lock);
+ __io_map_base += PAGE_ALIGN(size);
+
+ /* Are we overflowing on the vmemmap ? */
+ if (__io_map_base > __hyp_vmemmap) {
+ __io_map_base = addr;
+ addr = (unsigned long)ERR_PTR(-ENOMEM);
+ }
+
+ return addr;
+}
+
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
enum kvm_pgtable_prot prot)
{
@@ -45,16 +61,10 @@ unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
hyp_spin_lock(&pkvm_pgd_lock);
- size = PAGE_ALIGN(size + offset_in_page(phys));
- addr = __io_map_base;
- __io_map_base += size;
-
- /* Are we overflowing on the vmemmap ? */
- if (__io_map_base > __hyp_vmemmap) {
- __io_map_base -= size;
- addr = (unsigned long)ERR_PTR(-ENOMEM);
+ size = size + offset_in_page(phys);
+ addr = hyp_alloc_private_va_range(size);
+ if (IS_ERR((void *)addr))
goto out;
- }
err = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
if (err) {
--
2.35.1.265.g69c8d7142f-goog
On Thu, 10 Feb 2022 22:41:41 +0000,
Kalesh Singh <[email protected]> wrote:
>
> This series is based on v5.17-rc3 and adds the following stack features to
> the KVM nVHE hypervisor:
>
> == Hyp Stack Guard Pages ==
>
> Based on the technique used by arm64 VMAP_STACK to detect overflow.
> i.e. the stack is aligned to twice its size which ensure that the
> 'stack shift' bit of any valid SP is 0. The 'stack shift' bit can be
> tested in the exception entry to detect overflow without corrupting GPRs.
Having quickly parsed the code, this seems to only be effective for
pKVM and the EL2-allocated stack. Is there any technical reason not to
implement this for the much more common case of 'classic' KVM in nVHE
mode?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Thu, 10 Feb 2022 22:41:45 +0000,
Kalesh Singh <[email protected]> wrote:
>
> From: Quentin Perret <[email protected]>
>
> Allocate unbacked VA space underneath each stack page to ensure stack
> overflows get trapped and don't corrupt memory silently.
>
> The stack is aligned to twice its size (PAGE_SIZE), meaning that any
> valid stack address has PAGE_SHIFT bit as 0. This allows us to easily
> check for overflow in the exception entry without corrupting any GPRs.
>
> Signed-off-by: Quentin Perret <[email protected]>
> [ Kalesh - Update commit text and comments,
> refactor, add overflow handling ]
> Signed-off-by: Kalesh Singh <[email protected]>
> ---
> arch/arm64/kvm/hyp/nvhe/host.S | 16 ++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/setup.c | 19 ++++++++++++++++++-
> arch/arm64/kvm/hyp/nvhe/switch.c | 5 +++++
> 3 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
> index 3d613e721a75..78e4b612ac06 100644
> --- a/arch/arm64/kvm/hyp/nvhe/host.S
> +++ b/arch/arm64/kvm/hyp/nvhe/host.S
> @@ -153,6 +153,10 @@ SYM_FUNC_END(__host_hvc)
>
> .macro invalid_host_el2_vect
> .align 7
> +
> + /* Test stack overflow without corrupting GPRs */
> + test_sp_overflow PAGE_SHIFT, .L__hyp_sp_overflow\@
> +
I am definitely concerned with this in a system not using pKVM (which
is on average 100% of the upstream users so far! ;-). This is more or
less guaranteed to do the wrong thing 50% of the times, depending on
the alignment of the stack.
> /* If a guest is loaded, panic out of it. */
> stp x0, x1, [sp, #-16]!
> get_loaded_vcpu x0, x1
> @@ -165,6 +169,18 @@ SYM_FUNC_END(__host_hvc)
> * been partially clobbered by __host_enter.
> */
> b hyp_panic
> +
> +.L__hyp_sp_overflow\@:
> + /*
> + * Reset SP to the top of the stack, to allow handling the hyp_panic.
> + * This corrupts the stack but is ok, since we won't be attempting
> + * any unwinding here.
> + */
> + ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
> + mov sp, x0
> +
> + bl hyp_panic_bad_stack
> + ASM_BUG()
> .endm
>
> .macro invalid_host_el1_vect
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 99e178cf4249..114053dff228 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -105,7 +105,24 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> if (ret)
> return ret;
>
> - /* Map stack pages in the 'private' VA range */
> + /*
> + * Allocate 'private' VA range for stack guard pages.
> + *
> + * The 'private' VA range grows upward and stacks downwards, so
> + * allocate the guard page first. But make sure to align the
> + * stack itself with PAGE_SIZE * 2 granularity to ease overflow
> + * detection in the entry assembly code.
> + */
> + do {
> + start = (void *)hyp_alloc_private_va_range(PAGE_SIZE);
> + if (IS_ERR(start))
> + return PTR_ERR(start);
> + } while (IS_ALIGNED((u64) start, PAGE_SIZE * 2));
This seems cumbersome. Can't we tweak hyp_alloc_private_va_range() to
perform the required alignment? It could easily be convinced to return
an address that is aligned on the size of the region, which would
avoid this sort of loop.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, Feb 14, 2022 at 3:41 AM Marc Zyngier <[email protected]> wrote:
>
> On Thu, 10 Feb 2022 22:41:41 +0000,
> Kalesh Singh <[email protected]> wrote:
> >
> > This series is based on v5.17-rc3 and adds the following stack features to
> > the KVM nVHE hypervisor:
> >
> > == Hyp Stack Guard Pages ==
> >
> > Based on the technique used by arm64 VMAP_STACK to detect overflow.
> > i.e. the stack is aligned to twice its size which ensure that the
> > 'stack shift' bit of any valid SP is 0. The 'stack shift' bit can be
> > tested in the exception entry to detect overflow without corrupting GPRs.
>
> Having quickly parsed the code, this seems to only be effective for
> pKVM and the EL2-allocated stack. Is there any technical reason not to
> implement this for the much more common case of 'classic' KVM in nVHE
> mode?
Hi Marc,
No technical reason. We hadn't thought of it from that perspective.
It's a good idea, I'll look into this and repost a new version.
Thanks,
Kalesh
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
On Mon, Feb 14, 2022 at 6:06 AM Marc Zyngier <[email protected]> wrote:
>
> On Thu, 10 Feb 2022 22:41:45 +0000,
> Kalesh Singh <[email protected]> wrote:
> >
> > From: Quentin Perret <[email protected]>
> >
> > Allocate unbacked VA space underneath each stack page to ensure stack
> > overflows get trapped and don't corrupt memory silently.
> >
> > The stack is aligned to twice its size (PAGE_SIZE), meaning that any
> > valid stack address has PAGE_SHIFT bit as 0. This allows us to easily
> > check for overflow in the exception entry without corrupting any GPRs.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > [ Kalesh - Update commit text and comments,
> > refactor, add overflow handling ]
> > Signed-off-by: Kalesh Singh <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/nvhe/host.S | 16 ++++++++++++++++
> > arch/arm64/kvm/hyp/nvhe/setup.c | 19 ++++++++++++++++++-
> > arch/arm64/kvm/hyp/nvhe/switch.c | 5 +++++
> > 3 files changed, 39 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
> > index 3d613e721a75..78e4b612ac06 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/host.S
> > +++ b/arch/arm64/kvm/hyp/nvhe/host.S
> > @@ -153,6 +153,10 @@ SYM_FUNC_END(__host_hvc)
> >
> > .macro invalid_host_el2_vect
> > .align 7
> > +
> > + /* Test stack overflow without corrupting GPRs */
> > + test_sp_overflow PAGE_SHIFT, .L__hyp_sp_overflow\@
> > +
>
> I am definitely concerned with this in a system not using pKVM (which
> is on average 100% of the upstream users so far! ;-). This is more or
> less guaranteed to do the wrong thing 50% of the times, depending on
> the alignment of the stack.
Thanks for pointing this out Marc. I'll rework this to work for both
protected and non-protected modes.
>
> > /* If a guest is loaded, panic out of it. */
> > stp x0, x1, [sp, #-16]!
> > get_loaded_vcpu x0, x1
> > @@ -165,6 +169,18 @@ SYM_FUNC_END(__host_hvc)
> > * been partially clobbered by __host_enter.
> > */
> > b hyp_panic
> > +
> > +.L__hyp_sp_overflow\@:
> > + /*
> > + * Reset SP to the top of the stack, to allow handling the hyp_panic.
> > + * This corrupts the stack but is ok, since we won't be attempting
> > + * any unwinding here.
> > + */
> > + ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
> > + mov sp, x0
> > +
> > + bl hyp_panic_bad_stack
> > + ASM_BUG()
> > .endm
> >
> > .macro invalid_host_el1_vect
> > diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> > index 99e178cf4249..114053dff228 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> > @@ -105,7 +105,24 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> > if (ret)
> > return ret;
> >
> > - /* Map stack pages in the 'private' VA range */
> > + /*
> > + * Allocate 'private' VA range for stack guard pages.
> > + *
> > + * The 'private' VA range grows upward and stacks downwards, so
> > + * allocate the guard page first. But make sure to align the
> > + * stack itself with PAGE_SIZE * 2 granularity to ease overflow
> > + * detection in the entry assembly code.
> > + */
> > + do {
> > + start = (void *)hyp_alloc_private_va_range(PAGE_SIZE);
> > + if (IS_ERR(start))
> > + return PTR_ERR(start);
> > + } while (IS_ALIGNED((u64) start, PAGE_SIZE * 2));
>
> This seems cumbersome. Can't we tweak hyp_alloc_private_va_range() to
> perform the required alignment? It could easily be convinced to return
> an address that is aligned on the size of the region, which would
> avoid this sort of loop.
Ack. I'll update it for v2.
- Kalesh
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.