Hi all,
This is v6 of the nVHE hypervisor stack enhancements. It addresses
refactoring/cleanup and documentation feedback from Stephen, and is
rebased on 5.17-rc8.
Previous versions can be found at:
v5: https://lore.kernel.org/r/[email protected]/
v4: https://lore.kernel.org/r/[email protected]/
v3: https://lore.kernel.org/r/[email protected]/
v2: https://lore.kernel.org/r/[email protected]/
v1: https://lore.kernel.org/r/[email protected]/
The previous cover letter has been copied below for convenience.
Thanks,
Kalesh
-----
This series is based on 5.17-rc8 and adds the following stack features to
the KVM nVHE hypervisor:
== Hyp Stack Guard Pages ==
This is based on the technique arm64 VMAP_STACK uses to detect overflow:
the stack is aligned such that the 'stack shift' bit of any valid SP is 1.
That bit can be tested on exception entry to detect overflow without
corrupting GPRs.
== Hyp Stack Unwinder ==
The unwinding and dumping of the hyp stack is not enabled by default and
depends on CONFIG_NVHE_EL2_DEBUG to avoid potential information leaks.
When CONFIG_NVHE_EL2_DEBUG is enabled, the host stage-2 protection is
disabled, allowing the host to read the hypervisor stack pages and unwind
the stack from EL1. This allows us to print the hypervisor stacktrace
before panicking the host, as shown below.
Example call trace:
[ 98.916444][ T426] kvm [426]: nVHE hyp panic at: [<ffffffc0096156fc>] __kvm_nvhe_overflow_stack+0x8/0x34!
[ 98.918360][ T426] nVHE HYP call trace:
[ 98.918692][ T426] kvm [426]: [<ffffffc009615aac>] __kvm_nvhe_cpu_prepare_nvhe_panic_info+0x4c/0x68
[ 98.919545][ T426] kvm [426]: [<ffffffc0096159a4>] __kvm_nvhe_hyp_panic+0x2c/0xe8
[ 98.920107][ T426] kvm [426]: [<ffffffc009615ad8>] __kvm_nvhe_hyp_panic_bad_stack+0x10/0x10
[ 98.920665][ T426] kvm [426]: [<ffffffc009610a4c>] __kvm_nvhe___kvm_hyp_host_vector+0x24c/0x794
[ 98.921292][ T426] kvm [426]: [<ffffffc009615718>] __kvm_nvhe_overflow_stack+0x24/0x34
. . .
[ 98.973382][ T426] kvm [426]: [<ffffffc009615718>] __kvm_nvhe_overflow_stack+0x24/0x34
[ 98.973816][ T426] kvm [426]: [<ffffffc0096152f4>] __kvm_nvhe___kvm_vcpu_run+0x38/0x438
[ 98.974255][ T426] kvm [426]: [<ffffffc009616f80>] __kvm_nvhe_handle___kvm_vcpu_run+0x1c4/0x364
[ 98.974719][ T426] kvm [426]: [<ffffffc009616928>] __kvm_nvhe_handle_trap+0xa8/0x130
[ 98.975152][ T426] kvm [426]: [<ffffffc009610064>] __kvm_nvhe___host_exit+0x64/0x64
[ 98.975588][ T426] ---- end of nVHE HYP call trace ----
Kalesh Singh (8):
KVM: arm64: Introduce hyp_alloc_private_va_range()
KVM: arm64: Introduce pkvm_alloc_private_va_range()
KVM: arm64: Add guard pages for KVM nVHE hypervisor stack
KVM: arm64: Add guard pages for pKVM (protected nVHE) hypervisor stack
KVM: arm64: Detect and handle hypervisor stack overflows
KVM: arm64: Add hypervisor overflow stack
KVM: arm64: Unwind and dump nVHE HYP stacktrace
KVM: arm64: Symbolize the nVHE HYP backtrace
arch/arm64/include/asm/kvm_asm.h | 21 +++
arch/arm64/include/asm/kvm_mmu.h | 4 +
arch/arm64/include/asm/stacktrace.h | 12 ++
arch/arm64/kernel/stacktrace.c | 210 ++++++++++++++++++++++++---
arch/arm64/kvm/Kconfig | 5 +-
arch/arm64/kvm/arm.c | 41 +++++-
arch/arm64/kvm/handle_exit.c | 16 +-
arch/arm64/kvm/hyp/include/nvhe/mm.h | 6 +-
arch/arm64/kvm/hyp/nvhe/host.S | 29 ++++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 ++-
arch/arm64/kvm/hyp/nvhe/mm.c | 78 ++++++----
arch/arm64/kvm/hyp/nvhe/setup.c | 31 +++-
arch/arm64/kvm/hyp/nvhe/switch.c | 30 +++-
arch/arm64/kvm/mmu.c | 70 ++++++---
scripts/kallsyms.c | 2 +-
15 files changed, 477 insertions(+), 96 deletions(-)
base-commit: 09688c0166e76ce2fb85e86b9d99be8b0084cdf9
--
2.35.1.723.g4982287a31-goog
Map the stack pages in the flexible private VA range and allocate
guard pages below the stack as unbacked VA space. The stack is aligned
so that any valid stack address has its PAGE_SHIFT bit set to 1. This is
used for overflow detection (implemented in a subsequent patch in the
series).
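
As an illustration of the layout described above, here is a minimal
userspace sketch. It assumes 4 KiB pages and a 2-page VA range aligned to
its own size; the ex_* names are hypothetical and are not part of the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel's real PAGE_SHIFT depends on the config. */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (UINT64_C(1) << EX_PAGE_SHIFT)

/* Hypothetical description of one stack + guard page allocation. */
struct ex_stack_layout {
	uint64_t guard_base; /* lower page, left unbacked */
	uint64_t stack_base; /* upper page, backed by the stack's physical page */
	uint64_t stack_top;  /* initial SP: one past the end of the stack page */
};

/*
 * Derive the layout from the base of a 2-page VA range that is aligned
 * to its own size (2 * EX_PAGE_SIZE).
 */
static struct ex_stack_layout ex_layout_stack(uint64_t range_base)
{
	struct ex_stack_layout l;

	assert((range_base & (2 * EX_PAGE_SIZE - 1)) == 0);

	l.guard_base = range_base;
	l.stack_base = range_base + EX_PAGE_SIZE;
	l.stack_top  = range_base + 2 * EX_PAGE_SIZE;
	return l;
}

/* True if the PAGE_SHIFT bit of addr is set, i.e. addr lies in the upper page. */
static bool ex_page_shift_bit_set(uint64_t addr)
{
	return (addr >> EX_PAGE_SHIFT) & 1;
}
```

With a range base of 0x40000, every address inside the stack page
(0x41000..0x41fff) has bit 12 set, while every address inside the guard
page (0x40000..0x40fff) has it clear.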
Signed-off-by: Kalesh Singh <[email protected]>
---
Changes in v6:
- Update call to pkvm_alloc_private_va_range() (return val and params)
Changes in v5:
- Use a single allocation for stack and guard pages to ensure they
are contiguous, per Marc
Changes in v4:
- Replace IS_ERR_OR_NULL check with IS_ERR check now that
pkvm_alloc_private_va_range() returns an error for null
pointer, per Fuad
Changes in v3:
- Handle null ptr in IS_ERR_OR_NULL checks, per Mark
arch/arm64/kvm/hyp/nvhe/setup.c | 31 ++++++++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 27af337f9fea..e8d4ea2fcfa0 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -99,17 +99,42 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
return ret;
for (i = 0; i < hyp_nr_cpus; i++) {
+ struct kvm_nvhe_init_params *params = per_cpu_ptr(&kvm_init_params, i);
+ unsigned long hyp_addr;
+
start = (void *)kern_hyp_va(per_cpu_base[i]);
end = start + PAGE_ALIGN(hyp_percpu_size);
ret = pkvm_create_mappings(start, end, PAGE_HYP);
if (ret)
return ret;
- end = (void *)per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va;
- start = end - PAGE_SIZE;
- ret = pkvm_create_mappings(start, end, PAGE_HYP);
+ /*
+ * Allocate a contiguous HYP private VA range for the stack
+ * and guard page. The allocation is also aligned based on
+ * the order of its size.
+ */
+ ret = pkvm_alloc_private_va_range(PAGE_SIZE * 2, &hyp_addr);
+ if (ret)
+ return ret;
+
+ /*
+ * Since the stack grows downwards, map the stack to the page
+ * at the higher address and leave the lower guard page
+ * unbacked.
+ *
+ * Any valid stack address now has the PAGE_SHIFT bit as 1
+ * and addresses corresponding to the guard page have the
+ * PAGE_SHIFT bit as 0 - this is used for overflow detection.
+ */
+ hyp_spin_lock(&pkvm_pgd_lock);
+ ret = kvm_pgtable_hyp_map(&pkvm_pgtable, hyp_addr + PAGE_SIZE,
+ PAGE_SIZE, params->stack_pa, PAGE_HYP);
+ hyp_spin_unlock(&pkvm_pgd_lock);
if (ret)
return ret;
+
+ /* Update stack_hyp_va to end of the stack's private VA range */
+ params->stack_hyp_va = hyp_addr + (2 * PAGE_SIZE);
}
/*
--
2.35.1.723.g4982287a31-goog
Allocate and switch to a 16-byte aligned secondary stack on overflow. This
provides stack space to better handle overflows, and is used in
a subsequent patch to dump the hypervisor stacktrace. The overflow stack
is only allocated if CONFIG_NVHE_EL2_DEBUG is enabled, as hypervisor
stacktraces are a debug feature dependent on CONFIG_NVHE_EL2_DEBUG.
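
A rough userspace sketch of how the initial SP of such a stack can be
derived, assuming 4 KiB pages. The ex_* names are hypothetical, and a
plain static array stands in for the per-CPU variable.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* One page of 16-byte aligned storage, standing in for a per-CPU overflow stack. */
static _Alignas(16) unsigned long ex_overflow_stack[EX_PAGE_SIZE / sizeof(unsigned long)];

/* Initial SP for the overflow stack: stacks grow down, so start one past the end. */
static uintptr_t ex_overflow_stack_top(void)
{
	uintptr_t top = (uintptr_t)ex_overflow_stack + EX_PAGE_SIZE;

	/* The AArch64 procedure call standard requires a 16-byte aligned SP. */
	assert((top & 15) == 0);
	return top;
}
```

Because the base is 16-byte aligned and the size is a multiple of 16, the
top of the stack is 16-byte aligned as well.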
Signed-off-by: Kalesh Singh <[email protected]>
---
Changes in v4:
- Update comment to clarify resetting the SP to the top of the stack
only happens if CONFIG_NVHE_EL2_DEBUG is disabled, per Fuad
arch/arm64/kvm/hyp/nvhe/host.S | 11 ++++++++---
arch/arm64/kvm/hyp/nvhe/switch.c | 5 +++++
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
index be6d844279b1..a0c4b4f1549f 100644
--- a/arch/arm64/kvm/hyp/nvhe/host.S
+++ b/arch/arm64/kvm/hyp/nvhe/host.S
@@ -179,13 +179,18 @@ SYM_FUNC_END(__host_hvc)
b hyp_panic
.L__hyp_sp_overflow\@:
+#ifdef CONFIG_NVHE_EL2_DEBUG
+ /* Switch to the overflow stack */
+ adr_this_cpu sp, hyp_overflow_stack + PAGE_SIZE, x0
+#else
/*
- * Reset SP to the top of the stack, to allow handling the hyp_panic.
- * This corrupts the stack but is ok, since we won't be attempting
- * any unwinding here.
+ * If !CONFIG_NVHE_EL2_DEBUG, reset SP to the top of the stack, to
+ * allow handling the hyp_panic. This corrupts the stack but is ok,
+ * since we won't be attempting any unwinding here.
*/
ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
mov sp, x0
+#endif
bl hyp_panic_bad_stack
ASM_BUG()
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 703a5d3f611b..efc20273a352 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -34,6 +34,11 @@ DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
+#ifdef CONFIG_NVHE_EL2_DEBUG
+DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
+ __aligned(16);
+#endif
+
static void __activate_traps(struct kvm_vcpu *vcpu)
{
u64 val;
--
2.35.1.723.g4982287a31-goog
Unwind the stack in EL1, when CONFIG_NVHE_EL2_DEBUG is enabled.
This is possible because CONFIG_NVHE_EL2_DEBUG disables the host
stage-2 protection on hyp_panic(), allowing the host to access
the hypervisor stack pages in EL1.
A simple stack overflow test produces the following output:
[ 580.376051][ T412] kvm: nVHE hyp panic at: ffffffc0116145c4!
[ 580.378034][ T412] kvm [412]: nVHE HYP call trace:
[ 580.378591][ T412] kvm [412]: [<ffffffc011614934>]
[ 580.378993][ T412] kvm [412]: [<ffffffc01160fa48>]
[ 580.379386][ T412] kvm [412]: [<ffffffc0116145dc>] // Non-terminating recursive call
[ 580.379772][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380158][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380544][ T412] kvm [412]: [<ffffffc0116145dc>]
[ 580.380928][ T412] kvm [412]: [<ffffffc0116145dc>]
. . .
Since kallsyms excludes nVHE hyp symbols to avoid aliasing issues, we
fall back to the vmlinux addresses. Symbolizing the addresses is handled
in the next patch in this series.
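
The general idea of unwinding through a translated view of the stack can
be sketched in userspace as below. This is a simplified illustration for
a single stack, not the code in this patch: the ex_* names, the 4 KiB
stack size and the byte-buffer "host view" are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_STACK_SIZE 4096 /* assumption: one 4 KiB stack page */

/* Hypothetical view of a hyp stack: its hyp VA and where the host can read it. */
struct ex_hyp_stack {
	uint64_t hyp_base;        /* hyp VA of the lowest stack address */
	const uint8_t *host_view; /* host-readable alias of the same page */
};

/* Translate a hyp VA to a host pointer, or NULL if a 16-byte record would not fit. */
static const uint8_t *ex_hyp_to_host(const struct ex_hyp_stack *s, uint64_t hyp_va)
{
	if (hyp_va < s->hyp_base || hyp_va - s->hyp_base > EX_STACK_SIZE - 16)
		return NULL;
	return s->host_view + (hyp_va - s->hyp_base);
}

/*
 * Walk AArch64-style frame records ({next fp, return pc} pairs) starting at
 * hyp VA fp, storing up to max return addresses in pcs. Returns the count.
 */
static size_t ex_unwind(const struct ex_hyp_stack *s, uint64_t fp,
			uint64_t *pcs, size_t max)
{
	size_t n = 0;

	while (n < max) {
		const uint8_t *rec;
		uint64_t next_fp;

		if (fp & 0x7)
			break;
		rec = ex_hyp_to_host(s, fp);
		if (!rec)
			break;
		memcpy(&next_fp, rec, sizeof(next_fp));
		memcpy(&pcs[n++], rec + 8, sizeof(uint64_t));
		/* Frames must move strictly towards higher addresses. */
		if (next_fp <= fp)
			break;
		fp = next_fp;
	}
	return n;
}
```

The walk stops as soon as a frame pointer is misaligned, falls outside
the stack, or fails to move towards higher addresses, so a corrupted
chain cannot loop forever.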
Signed-off-by: Kalesh Singh <[email protected]>
---
Changes in v4:
- Update commit text and struct kvm_nvhe_panic_info kernel-doc comment
to clarify that CONFIG_NVHE_EL2_DEBUG only disables the host stage-2
protection on hyp_panic(), per Fuad
- Update NVHE_EL2_DEBUG Kconfig description to clarify that the
hypervisor stack trace is printed when hyp_panic() is called, per Fuad
Changes in v3:
- The nvhe hyp stack unwinder now makes use of the core logic from the
regular kernel unwinder to avoid duplication, per Mark
Changes in v2:
- Add cpu_prepare_nvhe_panic_info()
- Move updating the panic info to hyp_panic(), so that unwinding also
works for conventional nVHE Hyp-mode.
arch/arm64/include/asm/kvm_asm.h | 20 +++
arch/arm64/include/asm/stacktrace.h | 12 ++
arch/arm64/kernel/stacktrace.c | 210 +++++++++++++++++++++++++---
arch/arm64/kvm/Kconfig | 5 +-
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/handle_exit.c | 3 +
arch/arm64/kvm/hyp/nvhe/switch.c | 18 +++
7 files changed, 244 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 2e277f2ed671..4abcf93c6662 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -176,6 +176,26 @@ struct kvm_nvhe_init_params {
unsigned long vtcr;
};
+#ifdef CONFIG_NVHE_EL2_DEBUG
+/**
+ * struct kvm_nvhe_panic_info - nVHE hypervisor panic info.
+ * @hyp_stack_base: hyp VA of the hyp_stack base.
+ * @hyp_overflow_stack_base: hyp VA of the hyp_overflow_stack base.
+ * @fp: hyp FP where the backtrace begins.
+ * @pc: hyp PC where the backtrace begins.
+ *
+ * Used by the host in EL1 to dump the nVHE hypervisor backtrace on
+ * hyp_panic. This is possible because CONFIG_NVHE_EL2_DEBUG disables
+ * the host stage 2 protection on hyp_panic(). See: __hyp_do_panic()
+ */
+struct kvm_nvhe_panic_info {
+ unsigned long hyp_stack_base;
+ unsigned long hyp_overflow_stack_base;
+ unsigned long fp;
+ unsigned long pc;
+};
+#endif /* CONFIG_NVHE_EL2_DEBUG */
+
/* Translate a kernel address @ptr into its equivalent linear mapping */
#define kvm_ksym_ref(ptr) \
({ \
diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
index e77cdef9ca29..18611a51cf14 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -22,6 +22,10 @@ enum stack_type {
STACK_TYPE_OVERFLOW,
STACK_TYPE_SDEI_NORMAL,
STACK_TYPE_SDEI_CRITICAL,
+#ifdef CONFIG_NVHE_EL2_DEBUG
+ STACK_TYPE_KVM_NVHE_HYP,
+ STACK_TYPE_KVM_NVHE_OVERFLOW,
+#endif /* CONFIG_NVHE_EL2_DEBUG */
__NR_STACK_TYPES
};
@@ -147,4 +151,12 @@ static inline bool on_accessible_stack(const struct task_struct *tsk,
return false;
}
+#ifdef CONFIG_NVHE_EL2_DEBUG
+void kvm_nvhe_dump_backtrace(unsigned long hyp_offset);
+#else
+static inline void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
+{
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
+
#endif /* __ASM_STACKTRACE_H */
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index e4103e085681..6ec85cb69b1f 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -15,6 +15,8 @@
#include <asm/irq.h>
#include <asm/pointer_auth.h>
+#include <asm/kvm_asm.h>
+#include <asm/kvm_hyp.h>
#include <asm/stack_pointer.h>
#include <asm/stacktrace.h>
@@ -64,26 +66,15 @@ NOKPROBE_SYMBOL(start_backtrace);
* records (e.g. a cycle), determined based on the location and fp value of A
* and the location (but not the fp value) of B.
*/
-static int notrace unwind_frame(struct task_struct *tsk,
- struct stackframe *frame)
+static int notrace __unwind_frame(struct stackframe *frame, struct stack_info *info,
+ unsigned long (*translate_fp)(unsigned long, enum stack_type))
{
unsigned long fp = frame->fp;
- struct stack_info info;
-
- if (!tsk)
- tsk = current;
-
- /* Final frame; nothing to unwind */
- if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
- return -ENOENT;
if (fp & 0x7)
return -EINVAL;
- if (!on_accessible_stack(tsk, fp, 16, &info))
- return -EINVAL;
-
- if (test_bit(info.type, frame->stacks_done))
+ if (test_bit(info->type, frame->stacks_done))
return -EINVAL;
/*
@@ -94,28 +85,62 @@ static int notrace unwind_frame(struct task_struct *tsk,
*
* TASK -> IRQ -> OVERFLOW -> SDEI_NORMAL
* TASK -> SDEI_NORMAL -> SDEI_CRITICAL -> OVERFLOW
+ * KVM_NVHE_HYP -> KVM_NVHE_OVERFLOW
*
* ... but the nesting itself is strict. Once we transition from one
* stack to another, it's never valid to unwind back to that first
* stack.
*/
- if (info.type == frame->prev_type) {
+ if (info->type == frame->prev_type) {
if (fp <= frame->prev_fp)
return -EINVAL;
} else {
set_bit(frame->prev_type, frame->stacks_done);
}
+ /* Record fp as prev_fp before attempting to get the next fp */
+ frame->prev_fp = fp;
+
+ /*
+ * If fp is not from the current address space perform the
+ * necessary translation before dereferencing it to get next fp.
+ */
+ if (translate_fp)
+ fp = translate_fp(fp, info->type);
+ if (!fp)
+ return -EINVAL;
+
/*
* Record this frame record's values and location. The prev_fp and
- * prev_type are only meaningful to the next unwind_frame() invocation.
+ * prev_type are only meaningful to the next __unwind_frame() invocation.
*/
frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));
- frame->prev_fp = fp;
- frame->prev_type = info.type;
-
frame->pc = ptrauth_strip_insn_pac(frame->pc);
+ frame->prev_type = info->type;
+
+ return 0;
+}
+
+static int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
+{
+ unsigned long fp = frame->fp;
+ struct stack_info info;
+ int err;
+
+ if (!tsk)
+ tsk = current;
+
+ /* Final frame; nothing to unwind */
+ if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
+ return -ENOENT;
+
+ if (!on_accessible_stack(tsk, fp, 16, &info))
+ return -EINVAL;
+
+ err = __unwind_frame(frame, &info, NULL);
+ if (err)
+ return err;
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
if (tsk->ret_stack &&
@@ -143,20 +168,27 @@ static int notrace unwind_frame(struct task_struct *tsk,
}
NOKPROBE_SYMBOL(unwind_frame);
-static void notrace walk_stackframe(struct task_struct *tsk,
- struct stackframe *frame,
- bool (*fn)(void *, unsigned long), void *data)
+static void notrace __walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
+ bool (*fn)(void *, unsigned long), void *data,
+ int (*unwind_frame_fn)(struct task_struct *tsk, struct stackframe *frame))
{
while (1) {
int ret;
if (!fn(data, frame->pc))
break;
- ret = unwind_frame(tsk, frame);
+ ret = unwind_frame_fn(tsk, frame);
if (ret < 0)
break;
}
}
+
+static void notrace walk_stackframe(struct task_struct *tsk,
+ struct stackframe *frame,
+ bool (*fn)(void *, unsigned long), void *data)
+{
+ __walk_stackframe(tsk, frame, fn, data, unwind_frame);
+}
NOKPROBE_SYMBOL(walk_stackframe);
static bool dump_backtrace_entry(void *arg, unsigned long where)
@@ -210,3 +242,135 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
walk_stackframe(task, &frame, consume_entry, cookie);
}
+
+#ifdef CONFIG_NVHE_EL2_DEBUG
+DECLARE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
+DECLARE_KVM_NVHE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack);
+DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
+
+static inline bool kvm_nvhe_on_overflow_stack(unsigned long sp, unsigned long size,
+ struct stack_info *info)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long low = (unsigned long)panic_info->hyp_overflow_stack_base;
+ unsigned long high = low + PAGE_SIZE;
+
+ return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_OVERFLOW, info);
+}
+
+static inline bool kvm_nvhe_on_hyp_stack(unsigned long sp, unsigned long size,
+ struct stack_info *info)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long low = (unsigned long)panic_info->hyp_stack_base;
+ unsigned long high = low + PAGE_SIZE;
+
+ return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_HYP, info);
+}
+
+static inline bool kvm_nvhe_on_accessible_stack(unsigned long sp, unsigned long size,
+ struct stack_info *info)
+{
+ if (info)
+ info->type = STACK_TYPE_UNKNOWN;
+
+ if (kvm_nvhe_on_hyp_stack(sp, size, info))
+ return true;
+ if (kvm_nvhe_on_overflow_stack(sp, size, info))
+ return true;
+
+ return false;
+}
+
+static unsigned long kvm_nvhe_hyp_stack_kern_va(unsigned long addr)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long hyp_base, kern_base, hyp_offset;
+
+ hyp_base = (unsigned long)panic_info->hyp_stack_base;
+ hyp_offset = addr - hyp_base;
+
+ kern_base = (unsigned long)*this_cpu_ptr(&kvm_arm_hyp_stack_page);
+
+ return kern_base + hyp_offset;
+}
+
+static unsigned long kvm_nvhe_overflow_stack_kern_va(unsigned long addr)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ unsigned long hyp_base, kern_base, hyp_offset;
+
+ hyp_base = (unsigned long)panic_info->hyp_overflow_stack_base;
+ hyp_offset = addr - hyp_base;
+
+ kern_base = (unsigned long)this_cpu_ptr_nvhe_sym(hyp_overflow_stack);
+
+ return kern_base + hyp_offset;
+}
+
+/*
+ * Convert KVM nVHE hypervisor stack VA to a kernel VA.
+ *
+ * The nVHE hypervisor stack is mapped in the flexible 'private' VA range, to allow
+ * for guard pages below the stack. Consequently, the fixed offset address
+ * translation macros won't work here.
+ *
+ * The kernel VA is calculated as an offset from the kernel VA of the hypervisor
+ * stack base. See: kvm_nvhe_hyp_stack_kern_va(), kvm_nvhe_overflow_stack_kern_va()
+ */
+static unsigned long kvm_nvhe_stack_kern_va(unsigned long addr,
+ enum stack_type type)
+{
+ switch (type) {
+ case STACK_TYPE_KVM_NVHE_HYP:
+ return kvm_nvhe_hyp_stack_kern_va(addr);
+ case STACK_TYPE_KVM_NVHE_OVERFLOW:
+ return kvm_nvhe_overflow_stack_kern_va(addr);
+ default:
+ return 0UL;
+ }
+}
+
+static int notrace kvm_nvhe_unwind_frame(struct task_struct *tsk,
+ struct stackframe *frame)
+{
+ struct stack_info info;
+
+ if (!kvm_nvhe_on_accessible_stack(frame->fp, 16, &info))
+ return -EINVAL;
+
+ return __unwind_frame(frame, &info, kvm_nvhe_stack_kern_va);
+}
+
+static bool kvm_nvhe_dump_backtrace_entry(void *arg, unsigned long where)
+{
+ unsigned long va_mask = GENMASK_ULL(vabits_actual - 1, 0);
+ unsigned long hyp_offset = (unsigned long)arg;
+
+ where &= va_mask; /* Mask tags */
+ where += hyp_offset; /* Convert to kern addr */
+
+ kvm_err("[<%016lx>] %pB\n", where, (void *)where);
+
+ return true;
+}
+
+static void notrace kvm_nvhe_walk_stackframe(struct task_struct *tsk,
+ struct stackframe *frame,
+ bool (*fn)(void *, unsigned long), void *data)
+{
+ __walk_stackframe(tsk, frame, fn, data, kvm_nvhe_unwind_frame);
+}
+
+void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
+ struct stackframe frame;
+
+ start_backtrace(&frame, panic_info->fp, panic_info->pc);
+ pr_err("nVHE HYP call trace:\n");
+ kvm_nvhe_walk_stackframe(NULL, &frame, kvm_nvhe_dump_backtrace_entry,
+ (void *)hyp_offset);
+ pr_err("---- end of nVHE HYP call trace ----\n");
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 8a5fbbf084df..a7be4ef35fbf 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -51,8 +51,9 @@ config NVHE_EL2_DEBUG
depends on KVM
help
Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
- Failure reports will BUG() in the hypervisor. This is intended for
- local EL2 hypervisor development.
+ Failure reports will BUG() in the hypervisor; and calls to hyp_panic()
+ will result in printing the hypervisor call stack.
+ This is intended for local EL2 hypervisor development.
If unsure, say N.
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 72be7e695d8d..c7216ce1d55c 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -49,7 +49,7 @@ DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
-static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
+DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e3140abd2e2e..ff69dff33700 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -17,6 +17,7 @@
#include <asm/kvm_emulate.h>
#include <asm/kvm_mmu.h>
#include <asm/debug-monitors.h>
+#include <asm/stacktrace.h>
#include <asm/traps.h>
#include <kvm/arm_hypercalls.h>
@@ -326,6 +327,8 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr,
kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset);
}
+ kvm_nvhe_dump_backtrace(hyp_offset);
+
/*
* Hyp has panicked and we're going to handle that by panicking the
* kernel. The kernel offset will be revealed in the panic so we're
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index efc20273a352..b8ecffc47424 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -37,6 +37,22 @@ DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
#ifdef CONFIG_NVHE_EL2_DEBUG
DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
__aligned(16);
+DEFINE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
+
+static inline void cpu_prepare_nvhe_panic_info(void)
+{
+ struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr(&kvm_panic_info);
+ struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params);
+
+ panic_info->hyp_stack_base = (unsigned long)(params->stack_hyp_va - PAGE_SIZE);
+ panic_info->hyp_overflow_stack_base = (unsigned long)this_cpu_ptr(hyp_overflow_stack);
+ panic_info->fp = (unsigned long)__builtin_frame_address(0);
+ panic_info->pc = _THIS_IP_;
+}
+#else
+static inline void cpu_prepare_nvhe_panic_info(void)
+{
+}
#endif
static void __activate_traps(struct kvm_vcpu *vcpu)
@@ -360,6 +376,8 @@ asmlinkage void __noreturn hyp_panic(void)
struct kvm_cpu_context *host_ctxt;
struct kvm_vcpu *vcpu;
+ cpu_prepare_nvhe_panic_info();
+
host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
vcpu = host_ctxt->__hyp_running_vcpu;
--
2.35.1.723.g4982287a31-goog
hyp_alloc_private_va_range() can be used to reserve private VA ranges
in the nVHE hypervisor. Allocations are aligned based on the order of
the requested size.
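
The alignment rule can be illustrated with a small userspace model,
assuming 4 KiB pages. The ex_* helpers are hypothetical stand-ins for
PAGE_ALIGN(), get_order() and ALIGN_DOWN(), and the overflow check
performed by the real function is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define EX_PAGE_SIZE  (UINT64_C(1) << EX_PAGE_SHIFT)

/* Round size up to a whole number of pages. */
static uint64_t ex_page_align(uint64_t size)
{
	return (size + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);
}

/* Smallest order such that (EX_PAGE_SIZE << order) >= size. */
static unsigned int ex_get_order(uint64_t size)
{
	unsigned int order = 0;

	while ((EX_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * Carve size bytes from just below *next_base (allocations grow down),
 * aligned to the order of size. Returns the new base and updates *next_base.
 */
static uint64_t ex_alloc_va_range(uint64_t *next_base, uint64_t size)
{
	uint64_t base;

	assert(size > 0);

	base = *next_base - ex_page_align(size);
	/* Align down to (EX_PAGE_SIZE << order). */
	base &= ~((EX_PAGE_SIZE << ex_get_order(size)) - 1);
	*next_base = base;
	return base;
}
```

Starting from 0x100000, a 2-page request returns 0xfe000 and a following
1-page request returns 0xfd000. A second 2-page request then returns
0xfa000 rather than 0xfb000, because the base is rounded down to an
8 KiB boundary.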
This will be used to implement stack guard pages for the KVM nVHE
hypervisor (nVHE Hyp mode, not pKVM) in a subsequent patch in the series.
Signed-off-by: Kalesh Singh <[email protected]>
---
Changes in v6:
- Update kernel-doc for hyp_alloc_private_va_range()
and add return description, per Stephen
- Update hyp_alloc_private_va_range() to return an int error code,
per Stephen
- Replace IS_ERR() checks with IS_ERR_VALUE() check, per Stephen
- Clean up goto, per Stephen
Changes in v5:
- Align private allocations based on the order of their size, per Marc
Changes in v4:
- Handle null ptr in hyp_alloc_private_va_range() and replace
IS_ERR_OR_NULL checks in callers with IS_ERR checks, per Fuad
- Fix kernel-doc comments format, per Fuad
Changes in v3:
- Handle null ptr in IS_ERR_OR_NULL checks, per Mark
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/kvm/mmu.c | 66 +++++++++++++++++++++-----------
2 files changed, 45 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 81839e9a8a24..3cc9aa25f510 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -153,6 +153,7 @@ static __always_inline unsigned long __kern_hyp_va(unsigned long v)
int kvm_share_hyp(void *from, void *to);
void kvm_unshare_hyp(void *from, void *to);
int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
+int hyp_alloc_private_va_range(size_t size, unsigned long *haddr);
int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
void __iomem **kaddr,
void __iomem **haddr);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index bc2aba953299..7326d683c500 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -457,23 +457,22 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
return 0;
}
-static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
- unsigned long *haddr,
- enum kvm_pgtable_prot prot)
+
+/**
+ * hyp_alloc_private_va_range - Allocates a private VA range.
+ * @size: The size of the VA range to reserve.
+ * @haddr: The hypervisor virtual start address of the allocation.
+ *
+ * The private virtual address (VA) range is allocated below io_map_base
+ * and aligned based on the order of @size.
+ *
+ * Return: 0 on success or negative error code on failure.
+ */
+int hyp_alloc_private_va_range(size_t size, unsigned long *haddr)
{
unsigned long base;
int ret = 0;
- if (!kvm_host_owns_hyp_mappings()) {
- base = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
- phys_addr, size, prot);
- if (IS_ERR_OR_NULL((void *)base))
- return PTR_ERR((void *)base);
- *haddr = base;
-
- return 0;
- }
-
mutex_lock(&kvm_hyp_pgd_mutex);
/*
@@ -484,30 +483,53 @@ static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
*
* The allocated size is always a multiple of PAGE_SIZE.
*/
- size = PAGE_ALIGN(size + offset_in_page(phys_addr));
- base = io_map_base - size;
+ base = io_map_base - PAGE_ALIGN(size);
+
+ /* Align the allocation based on the order of its size */
+ base = ALIGN_DOWN(base, PAGE_SIZE << get_order(size));
/*
* Verify that BIT(VA_BITS - 1) hasn't been flipped by
* allocating the new area, as it would indicate we've
* overflowed the idmap/IO address range.
*/
- if ((base ^ io_map_base) & BIT(VA_BITS - 1))
+ if (!base || (base ^ io_map_base) & BIT(VA_BITS - 1))
ret = -ENOMEM;
else
- io_map_base = base;
+ *haddr = io_map_base = base;
mutex_unlock(&kvm_hyp_pgd_mutex);
+ return ret;
+}
+
+static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
+ unsigned long *haddr,
+ enum kvm_pgtable_prot prot)
+{
+ unsigned long addr;
+ int ret = 0;
+
+ if (!kvm_host_owns_hyp_mappings()) {
+ addr = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
+ phys_addr, size, prot);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ *haddr = addr;
+
+ return 0;
+ }
+
+ size += offset_in_page(phys_addr);
+ ret = hyp_alloc_private_va_range(size, &addr);
if (ret)
- goto out;
+ return ret;
- ret = __create_hyp_mappings(base, size, phys_addr, prot);
+ ret = __create_hyp_mappings(addr, size, phys_addr, prot);
if (ret)
- goto out;
+ return ret;
- *haddr = base + offset_in_page(phys_addr);
-out:
+ *haddr = addr + offset_in_page(phys_addr);
return ret;
}
--
2.35.1.723.g4982287a31-goog
The hypervisor stacks (for both nVHE Hyp mode and nVHE protected mode)
are aligned such that any valid stack address has its PAGE_SHIFT bit set
to 1. This allows us to conveniently check for overflow in the exception
entry without corrupting any GPRs. A stack overflow is not recoverable,
so panic the hypervisor.
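
The register-preserving check can be modelled in userspace with wrapping
64-bit arithmetic. The sketch below assumes 4 KiB pages; struct ex_regs
and ex_sp_overflowed() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical two-register machine state: the stack pointer and one GPR. */
struct ex_regs {
	uint64_t sp;
	uint64_t x0;
};

/*
 * Returns true if SP has overflowed (PAGE_SHIFT bit clear). When no
 * overflow is found, sp and x0 are restored to their entry values. On
 * overflow they are left as they would be at the branch: x0 holds the
 * original sp and sp holds the sum.
 */
static bool ex_sp_overflowed(struct ex_regs *r)
{
	r->sp = r->sp + r->x0;                /* add sp, sp, x0: sp' = sp + x0 */
	r->x0 = r->sp - r->x0;                /* sub x0, sp, x0: x0' = original sp */
	if (!((r->x0 >> EX_PAGE_SHIFT) & 1))  /* tbz x0, #PAGE_SHIFT, overflow */
		return true;
	r->x0 = r->sp - r->x0;                /* sub x0, sp, x0: x0'' = original x0 */
	r->sp = r->sp - r->x0;                /* sub sp, sp, x0: sp'' = original sp */
	return false;
}
```

Unsigned wraparound makes the add/sub pairs exact inverses, so both
registers are recovered for any input values without needing a scratch
register or a store to the (possibly overflowed) stack.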
Signed-off-by: Kalesh Singh <[email protected]>
---
Changes in v5:
- Valid stack addresses now have PAGE_SHIFT bit as 1 instead of 0
Changes in v3:
- Remove test_sp_overflow macro, per Mark
- Add asmlinkage attribute for hyp_panic, hyp_panic_bad_stack, per Ard
arch/arm64/kvm/hyp/nvhe/host.S | 24 ++++++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/switch.c | 7 ++++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
index 3d613e721a75..be6d844279b1 100644
--- a/arch/arm64/kvm/hyp/nvhe/host.S
+++ b/arch/arm64/kvm/hyp/nvhe/host.S
@@ -153,6 +153,18 @@ SYM_FUNC_END(__host_hvc)
.macro invalid_host_el2_vect
.align 7
+
+ /*
+ * Test whether the SP has overflowed, without corrupting a GPR.
+ * nVHE hypervisor stacks are aligned so that the PAGE_SHIFT bit
+ * of SP should always be 1.
+ */
+ add sp, sp, x0 // sp' = sp + x0
+ sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
+ tbz x0, #PAGE_SHIFT, .L__hyp_sp_overflow\@
+ sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
+ sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
+
/* If a guest is loaded, panic out of it. */
stp x0, x1, [sp, #-16]!
get_loaded_vcpu x0, x1
@@ -165,6 +177,18 @@ SYM_FUNC_END(__host_hvc)
* been partially clobbered by __host_enter.
*/
b hyp_panic
+
+.L__hyp_sp_overflow\@:
+ /*
+ * Reset SP to the top of the stack, to allow handling the hyp_panic.
+ * This corrupts the stack but is ok, since we won't be attempting
+ * any unwinding here.
+ */
+ ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
+ mov sp, x0
+
+ bl hyp_panic_bad_stack
+ ASM_BUG()
.endm
.macro invalid_host_el1_vect
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 6410d21d8695..703a5d3f611b 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -347,7 +347,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
return exit_code;
}
-void __noreturn hyp_panic(void)
+asmlinkage void __noreturn hyp_panic(void)
{
u64 spsr = read_sysreg_el2(SYS_SPSR);
u64 elr = read_sysreg_el2(SYS_ELR);
@@ -369,6 +369,11 @@ void __noreturn hyp_panic(void)
unreachable();
}
+asmlinkage void __noreturn hyp_panic_bad_stack(void)
+{
+ hyp_panic();
+}
+
asmlinkage void kvm_unexpected_el2_exception(void)
{
return __kvm_unexpected_el2_exception();
--
2.35.1.723.g4982287a31-goog
On Mon, Mar 14, 2022 at 1:02 PM Kalesh Singh <[email protected]> wrote:
>
> Hi all,
>
> This is v6 of the nVHE hypervisor stack enhancements. Addresses some
> refactoring/cleanup and documentation improvements from Stephen,
> and rebased on 5.17-rc8.
Friendly ping on this :). I've addressed all feedback received in this
latest version.
Thanks,
Kalesh
>
> Previous versions can be found at:
> v5: https://lore.kernel.org/r/[email protected]/
> v4: https://lore.kernel.org/r/[email protected]/
> v3: https://lore.kernel.org/r/[email protected]/
> v2: https://lore.kernel.org/r/[email protected]/
> v1: https://lore.kernel.org/r/[email protected]/
>
> The previous cover letter has been copied below for convenience.
>
> Thanks,
> Kalesh
>
> -----
>
> This series is based on 5.17-rc8 and adds the following stack features to
> the KVM nVHE hypervisor:
>
> == Hyp Stack Guard Pages ==
>
> Based on the technique used by arm64 VMAP_STACK to detect overflow.
> i.e. the stack is aligned such that the 'stack shift' bit of any valid
> SP is 1. The 'stack shift' bit can be tested in the exception entry to
> detect overflow without corrupting GPRs.
>
> == Hyp Stack Unwinder ==
>
> The unwinding and dumping of the hyp stack is not enabled by default and
> depends on CONFIG_NVHE_EL2_DEBUG to avoid potential information leaks.
>
> When CONFIG_NVHE_EL2_DEBUG is enabled the host stage 2 protection is
> disabled, allowing the host to read the hypervisor stack pages and unwind
> the stack from EL1. This allows us to print the hypervisor stacktrace
> before panicking the host, as shown below.
>
> Example call trace:
>
> [ 98.916444][ T426] kvm [426]: nVHE hyp panic at: [<ffffffc0096156fc>] __kvm_nvhe_overflow_stack+0x8/0x34!
> [ 98.918360][ T426] nVHE HYP call trace:
> [ 98.918692][ T426] kvm [426]: [<ffffffc009615aac>] __kvm_nvhe_cpu_prepare_nvhe_panic_info+0x4c/0x68
> [ 98.919545][ T426] kvm [426]: [<ffffffc0096159a4>] __kvm_nvhe_hyp_panic+0x2c/0xe8
> [ 98.920107][ T426] kvm [426]: [<ffffffc009615ad8>] __kvm_nvhe_hyp_panic_bad_stack+0x10/0x10
> [ 98.920665][ T426] kvm [426]: [<ffffffc009610a4c>] __kvm_nvhe___kvm_hyp_host_vector+0x24c/0x794
> [ 98.921292][ T426] kvm [426]: [<ffffffc009615718>] __kvm_nvhe_overflow_stack+0x24/0x34
> . . .
>
> [ 98.973382][ T426] kvm [426]: [<ffffffc009615718>] __kvm_nvhe_overflow_stack+0x24/0x34
> [ 98.973816][ T426] kvm [426]: [<ffffffc0096152f4>] __kvm_nvhe___kvm_vcpu_run+0x38/0x438
> [ 98.974255][ T426] kvm [426]: [<ffffffc009616f80>] __kvm_nvhe_handle___kvm_vcpu_run+0x1c4/0x364
> [ 98.974719][ T426] kvm [426]: [<ffffffc009616928>] __kvm_nvhe_handle_trap+0xa8/0x130
> [ 98.975152][ T426] kvm [426]: [<ffffffc009610064>] __kvm_nvhe___host_exit+0x64/0x64
> [ 98.975588][ T426] ---- end of nVHE HYP call trace ----
>
>
>
>
> Kalesh Singh (8):
> KVM: arm64: Introduce hyp_alloc_private_va_range()
> KVM: arm64: Introduce pkvm_alloc_private_va_range()
> KVM: arm64: Add guard pages for KVM nVHE hypervisor stack
> KVM: arm64: Add guard pages for pKVM (protected nVHE) hypervisor stack
> KVM: arm64: Detect and handle hypervisor stack overflows
> KVM: arm64: Add hypervisor overflow stack
> KVM: arm64: Unwind and dump nVHE HYP stacktrace
> KVM: arm64: Symbolize the nVHE HYP backtrace
>
> arch/arm64/include/asm/kvm_asm.h | 21 +++
> arch/arm64/include/asm/kvm_mmu.h | 4 +
> arch/arm64/include/asm/stacktrace.h | 12 ++
> arch/arm64/kernel/stacktrace.c | 210 ++++++++++++++++++++++++---
> arch/arm64/kvm/Kconfig | 5 +-
> arch/arm64/kvm/arm.c | 41 +++++-
> arch/arm64/kvm/handle_exit.c | 16 +-
> arch/arm64/kvm/hyp/include/nvhe/mm.h | 6 +-
> arch/arm64/kvm/hyp/nvhe/host.S | 29 ++++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 ++-
> arch/arm64/kvm/hyp/nvhe/mm.c | 78 ++++++----
> arch/arm64/kvm/hyp/nvhe/setup.c | 31 +++-
> arch/arm64/kvm/hyp/nvhe/switch.c | 30 +++-
> arch/arm64/kvm/mmu.c | 70 ++++++---
> scripts/kallsyms.c | 2 +-
> 15 files changed, 477 insertions(+), 96 deletions(-)
>
>
> base-commit: 09688c0166e76ce2fb85e86b9d99be8b0084cdf9
> --
> 2.35.1.723.g4982287a31-goog
>
Hi Kalesh,
On Mon, Mar 14, 2022 at 8:04 PM Kalesh Singh <[email protected]> wrote:
>
> Map the stack pages in the flexible private VA range and allocate
> guard pages below the stack as unbacked VA space. The stack is aligned
> so that any valid stack address has the PAGE_SHIFT bit set to 1; this is used
> for overflow detection (implemented in a subsequent patch in the series).
>
> Signed-off-by: Kalesh Singh <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Thanks,
/fuad
> ---
>
> Changes in v6:
> - Update call to pkvm_alloc_private_va_range() (return val and params)
>
> Changes in v5:
> - Use a single allocation for stack and guard pages to ensure they
> are contiguous, per Marc
>
> Changes in v4:
> - Replace IS_ERR_OR_NULL check with IS_ERR check now that
> pkvm_alloc_private_va_range() returns an error for null
> pointer, per Fuad
>
> Changes in v3:
> - Handle null ptr in IS_ERR_OR_NULL checks, per Mark
>
>
> arch/arm64/kvm/hyp/nvhe/setup.c | 31 ++++++++++++++++++++++++++++---
> 1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 27af337f9fea..e8d4ea2fcfa0 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -99,17 +99,42 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> return ret;
>
> for (i = 0; i < hyp_nr_cpus; i++) {
> + struct kvm_nvhe_init_params *params = per_cpu_ptr(&kvm_init_params, i);
> + unsigned long hyp_addr;
> +
> start = (void *)kern_hyp_va(per_cpu_base[i]);
> end = start + PAGE_ALIGN(hyp_percpu_size);
> ret = pkvm_create_mappings(start, end, PAGE_HYP);
> if (ret)
> return ret;
>
> - end = (void *)per_cpu_ptr(&kvm_init_params, i)->stack_hyp_va;
> - start = end - PAGE_SIZE;
> - ret = pkvm_create_mappings(start, end, PAGE_HYP);
> + /*
> + * Allocate a contiguous HYP private VA range for the stack
> + * and guard page. The allocation is also aligned based on
> + * the order of its size.
> + */
> + ret = pkvm_alloc_private_va_range(PAGE_SIZE * 2, &hyp_addr);
> + if (ret)
> + return ret;
> +
> + /*
> + * Since the stack grows downwards, map the stack to the page
> + * at the higher address and leave the lower guard page
> + * unbacked.
> + *
> + * Any valid stack address now has the PAGE_SHIFT bit as 1
> + * and addresses corresponding to the guard page have the
> + * PAGE_SHIFT bit as 0 - this is used for overflow detection.
> + */
> + hyp_spin_lock(&pkvm_pgd_lock);
> + ret = kvm_pgtable_hyp_map(&pkvm_pgtable, hyp_addr + PAGE_SIZE,
> + PAGE_SIZE, params->stack_pa, PAGE_HYP);
> + hyp_spin_unlock(&pkvm_pgd_lock);
> if (ret)
> return ret;
> +
> + /* Update stack_hyp_va to end of the stack's private VA range */
> + params->stack_hyp_va = hyp_addr + (2 * PAGE_SIZE);
> }
>
> /*
> --
> 2.35.1.723.g4982287a31-goog
>
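For readers following the thread, the layout described in the patch comments above can be modelled in plain userspace C. This is only a sketch under stated assumptions: 4 KiB pages (PAGE_SHIFT of 12), and the helper names stack_top() and sp_is_valid() are hypothetical and do not appear in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel supports other page sizes too. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/*
 * Hypothetical helper: initial SP for a two-page allocation starting at
 * base. The guard page is [base, base + PAGE_SIZE) and the stack page is
 * [base + PAGE_SIZE, base + 2 * PAGE_SIZE).
 */
static uint64_t stack_top(uint64_t base)
{
	return base + 2 * PAGE_SIZE;
}

/*
 * Hypothetical helper: mirrors the test on the PAGE_SHIFT bit. It only
 * gives a meaningful answer when base is aligned to 2 * PAGE_SIZE.
 * Addresses inside the stack page have the bit set; addresses inside the
 * guard page have it clear.
 */
static bool sp_is_valid(uint64_t sp)
{
	return (sp >> PAGE_SHIFT) & 1;
}
```

With a base of 0x10000, any address in 0x11000..0x11fff passes the check and any address in 0x10000..0x10fff fails it, which is why the allocation has to be aligned to twice the page size.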
Hi Kalesh,
On Mon, Mar 14, 2022 at 8:05 PM Kalesh Singh <[email protected]> wrote:
>
> The hypervisor stacks (for both nVHE Hyp mode and nVHE protected mode)
> are aligned such that any valid stack address has PAGE_SHIFT bit as 1.
> This allows us to conveniently check for overflow in the exception entry
> without corrupting any GPRs. We won't recover from a stack overflow so
> panic the hypervisor.
>
> Signed-off-by: Kalesh Singh <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Thanks,
/fuad
> ---
>
> Changes in v5:
> - Valid stack addresses now have PAGE_SHIFT bit as 1 instead of 0
>
> Changes in v3:
> - Remove test_sp_overflow macro, per Mark
> - Add asmlinkage attribute for hyp_panic, hyp_panic_bad_stack, per Ard
>
>
> arch/arm64/kvm/hyp/nvhe/host.S | 24 ++++++++++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/switch.c | 7 ++++++-
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
> index 3d613e721a75..be6d844279b1 100644
> --- a/arch/arm64/kvm/hyp/nvhe/host.S
> +++ b/arch/arm64/kvm/hyp/nvhe/host.S
> @@ -153,6 +153,18 @@ SYM_FUNC_END(__host_hvc)
>
> .macro invalid_host_el2_vect
> .align 7
> +
> + /*
> + * Test whether the SP has overflowed, without corrupting a GPR.
> + * nVHE hypervisor stacks are aligned so that the PAGE_SHIFT bit
> + * of SP should always be 1.
> + */
> + add sp, sp, x0 // sp' = sp + x0
> + sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
> + tbz x0, #PAGE_SHIFT, .L__hyp_sp_overflow\@
> + sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
> + sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
> +
> /* If a guest is loaded, panic out of it. */
> stp x0, x1, [sp, #-16]!
> get_loaded_vcpu x0, x1
> @@ -165,6 +177,18 @@ SYM_FUNC_END(__host_hvc)
> * been partially clobbered by __host_enter.
> */
> b hyp_panic
> +
> +.L__hyp_sp_overflow\@:
> + /*
> + * Reset SP to the top of the stack, to allow handling the hyp_panic.
> + * This corrupts the stack but is ok, since we won't be attempting
> + * any unwinding here.
> + */
> + ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
> + mov sp, x0
> +
> + bl hyp_panic_bad_stack
> + ASM_BUG()
> .endm
>
> .macro invalid_host_el1_vect
> diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
> index 6410d21d8695..703a5d3f611b 100644
> --- a/arch/arm64/kvm/hyp/nvhe/switch.c
> +++ b/arch/arm64/kvm/hyp/nvhe/switch.c
> @@ -347,7 +347,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
> return exit_code;
> }
>
> -void __noreturn hyp_panic(void)
> +asmlinkage void __noreturn hyp_panic(void)
> {
> u64 spsr = read_sysreg_el2(SYS_SPSR);
> u64 elr = read_sysreg_el2(SYS_ELR);
> @@ -369,6 +369,11 @@ void __noreturn hyp_panic(void)
> unreachable();
> }
>
> +asmlinkage void __noreturn hyp_panic_bad_stack(void)
> +{
> + hyp_panic();
> +}
> +
> asmlinkage void kvm_unexpected_el2_exception(void)
> {
> return __kvm_unexpected_el2_exception();
> --
> 2.35.1.723.g4982287a31-goog
>
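The add/sub sequence in invalid_host_el2_vect quoted above can be checked with ordinary unsigned arithmetic. The following is a rough userspace model, not code from the series: struct regs and entry_check() are hypothetical names, and PAGE_SHIFT is assumed to be 12. Unsigned wraparound in C matches the modular arithmetic of 64-bit registers, so the identities hold for any x0.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical register snapshot after the entry sequence has run. */
struct regs {
	uint64_t sp;
	uint64_t x0;
	bool overflow;	/* true if the overflow branch would be taken */
};

static struct regs entry_check(uint64_t sp, uint64_t x0)
{
	struct regs r = { .sp = sp, .x0 = x0, .overflow = false };

	r.sp = r.sp + r.x0;	/* add sp, sp, x0 */
	r.x0 = r.sp - r.x0;	/* sub x0, sp, x0: x0 now holds the old sp */
	if (!((r.x0 >> PAGE_SHIFT) & 1)) {
		r.overflow = true;	/* tbz x0, #PAGE_SHIFT, overflow label */
		return r;
	}
	r.x0 = r.sp - r.x0;	/* sub x0, sp, x0: x0 restored */
	r.sp = r.sp - r.x0;	/* sub sp, sp, x0: sp restored */
	return r;
}
```

On the non-overflow path both values come back unchanged, so no scratch register is needed. On the overflow path x0 is left holding the old SP, which is acceptable because the hypervisor panics from there.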
Hi Kalesh,
On Mon, Mar 14, 2022 at 8:02 PM Kalesh Singh <[email protected]> wrote:
>
> hyp_alloc_private_va_range() can be used to reserve private VA ranges
> in the nVHE hypervisor. Allocations are aligned based on the order of
> the requested size.
>
> This will be used to implement stack guard pages for KVM nVHE hypervisor
> (nVHE Hyp mode / not pKVM), in a subsequent patch in the series.
>
> Signed-off-by: Kalesh Singh <[email protected]>
This looks good to me. I have also tested this entire series, and your
enhancements will make debugging easier.
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Thanks,
/fuad
> ---
>
> Changes in v6:
> - Update kernel-doc for hyp_alloc_private_va_range()
> and add return description, per Stephen
> - Update hyp_alloc_private_va_range() to return an int error code,
> per Stephen
> - Replace IS_ERR() checks with IS_ERR_VALUE() check, per Stephen
> - Clean up goto, per Stephen
>
> Changes in v5:
> - Align private allocations based on the order of their size, per Marc
>
> Changes in v4:
> - Handle null ptr in hyp_alloc_private_va_range() and replace
> IS_ERR_OR_NULL checks in callers with IS_ERR checks, per Fuad
> - Fix kernel-doc comments format, per Fuad
>
> Changes in v3:
> - Handle null ptr in IS_ERR_OR_NULL checks, per Mark
>
>
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/kvm/mmu.c | 66 +++++++++++++++++++++-----------
> 2 files changed, 45 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 81839e9a8a24..3cc9aa25f510 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -153,6 +153,7 @@ static __always_inline unsigned long __kern_hyp_va(unsigned long v)
> int kvm_share_hyp(void *from, void *to);
> void kvm_unshare_hyp(void *from, void *to);
> int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int hyp_alloc_private_va_range(size_t size, unsigned long *haddr);
> int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
> void __iomem **kaddr,
> void __iomem **haddr);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index bc2aba953299..7326d683c500 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -457,23 +457,22 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> return 0;
> }
>
> -static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
> - unsigned long *haddr,
> - enum kvm_pgtable_prot prot)
> +
> +/**
> + * hyp_alloc_private_va_range - Allocates a private VA range.
> + * @size: The size of the VA range to reserve.
> + * @haddr: The hypervisor virtual start address of the allocation.
> + *
> + * The private virtual address (VA) range is allocated below io_map_base
> + * and aligned based on the order of @size.
> + *
> + * Return: 0 on success or negative error code on failure.
> + */
> +int hyp_alloc_private_va_range(size_t size, unsigned long *haddr)
> {
> unsigned long base;
> int ret = 0;
>
> - if (!kvm_host_owns_hyp_mappings()) {
> - base = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
> - phys_addr, size, prot);
> - if (IS_ERR_OR_NULL((void *)base))
> - return PTR_ERR((void *)base);
> - *haddr = base;
> -
> - return 0;
> - }
> -
> mutex_lock(&kvm_hyp_pgd_mutex);
>
> /*
> @@ -484,30 +483,53 @@ static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
> *
> * The allocated size is always a multiple of PAGE_SIZE.
> */
> - size = PAGE_ALIGN(size + offset_in_page(phys_addr));
> - base = io_map_base - size;
> + base = io_map_base - PAGE_ALIGN(size);
> +
> + /* Align the allocation based on the order of its size */
> + base = ALIGN_DOWN(base, PAGE_SIZE << get_order(size));
>
> /*
> * Verify that BIT(VA_BITS - 1) hasn't been flipped by
> * allocating the new area, as it would indicate we've
> * overflowed the idmap/IO address range.
> */
> - if ((base ^ io_map_base) & BIT(VA_BITS - 1))
> + if (!base || (base ^ io_map_base) & BIT(VA_BITS - 1))
> ret = -ENOMEM;
> else
> - io_map_base = base;
> + *haddr = io_map_base = base;
>
> mutex_unlock(&kvm_hyp_pgd_mutex);
>
> + return ret;
> +}
> +
> +static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
> + unsigned long *haddr,
> + enum kvm_pgtable_prot prot)
> +{
> + unsigned long addr;
> + int ret = 0;
> +
> + if (!kvm_host_owns_hyp_mappings()) {
> + addr = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
> + phys_addr, size, prot);
> + if (IS_ERR_VALUE(addr))
> + return addr;
> + *haddr = addr;
> +
> + return 0;
> + }
> +
> + size += offset_in_page(phys_addr);
> + ret = hyp_alloc_private_va_range(size, &addr);
> if (ret)
> - goto out;
> + return ret;
>
> - ret = __create_hyp_mappings(base, size, phys_addr, prot);
> + ret = __create_hyp_mappings(addr, size, phys_addr, prot);
> if (ret)
> - goto out;
> + return ret;
>
> - *haddr = base + offset_in_page(phys_addr);
> -out:
> + *haddr = addr + offset_in_page(phys_addr);
> return ret;
> }
>
> --
> 2.35.1.723.g4982287a31-goog
>
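To make the "aligned based on the order of the requested size" behaviour concrete, here is a small userspace model of the address computation only. It is a sketch under assumptions: 4 KiB pages, no locking, and no VA_BITS overflow check. The names page_align(), order_of() and alloc_below() are hypothetical stand-ins for PAGE_ALIGN(), get_order() and the body of hyp_alloc_private_va_range().

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Round size up to a multiple of PAGE_SIZE. */
static uint64_t page_align(uint64_t size)
{
	return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Smallest order such that (PAGE_SIZE << order) >= size; size must be > 0. */
static unsigned int order_of(uint64_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * Hypothetical model: carve size bytes out of the space below io_map_base
 * and align the start down to the power-of-two size of the allocation.
 * The return value would become the next io_map_base.
 */
static uint64_t alloc_below(uint64_t io_map_base, uint64_t size)
{
	uint64_t base = io_map_base - page_align(size);
	uint64_t align = PAGE_SIZE << order_of(size);

	return base & ~(align - 1);	/* same effect as ALIGN_DOWN(base, align) */
}
```

For a two-page request, the result is always aligned to 2 * PAGE_SIZE, even when io_map_base itself is only page aligned. The cost is that up to one page of VA space may be skipped, as in the second test case below where io_map_base is 0xff000.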
Hi Kalesh,
On Mon, Mar 14, 2022 at 8:05 PM Kalesh Singh <[email protected]> wrote:
>
> Allocate and switch to a 16-byte aligned secondary stack on overflow. This
> provides stack space to better handle overflows, and is used in
> a subsequent patch to dump the hypervisor stacktrace. The overflow stack
> is only allocated if CONFIG_NVHE_EL2_DEBUG is enabled, as hypervisor
> stacktraces are a debug feature dependent on CONFIG_NVHE_EL2_DEBUG.
>
> Signed-off-by: Kalesh Singh <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Thanks,
/fuad
> ---
>
> Changes in v4:
> - Update comment to clarify resetting the SP to the top of the stack
> only happens if CONFIG_NVHE_EL2_DEBUG is disabled, per Fuad
>
>
> arch/arm64/kvm/hyp/nvhe/host.S | 11 ++++++++---
> arch/arm64/kvm/hyp/nvhe/switch.c | 5 +++++
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
> index be6d844279b1..a0c4b4f1549f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/host.S
> +++ b/arch/arm64/kvm/hyp/nvhe/host.S
> @@ -179,13 +179,18 @@ SYM_FUNC_END(__host_hvc)
> b hyp_panic
>
> .L__hyp_sp_overflow\@:
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> + /* Switch to the overflow stack */
> + adr_this_cpu sp, hyp_overflow_stack + PAGE_SIZE, x0
> +#else
> /*
> - * Reset SP to the top of the stack, to allow handling the hyp_panic.
> - * This corrupts the stack but is ok, since we won't be attempting
> - * any unwinding here.
> + * If !CONFIG_NVHE_EL2_DEBUG, reset SP to the top of the stack, to
> + * allow handling the hyp_panic. This corrupts the stack but is ok,
> + * since we won't be attempting any unwinding here.
> */
> ldr_this_cpu x0, kvm_init_params + NVHE_INIT_STACK_HYP_VA, x1
> mov sp, x0
> +#endif
>
> bl hyp_panic_bad_stack
> ASM_BUG()
> diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
> index 703a5d3f611b..efc20273a352 100644
> --- a/arch/arm64/kvm/hyp/nvhe/switch.c
> +++ b/arch/arm64/kvm/hyp/nvhe/switch.c
> @@ -34,6 +34,11 @@ DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
> DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
> DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
>
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> +DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
> + __aligned(16);
> +#endif
> +
> static void __activate_traps(struct kvm_vcpu *vcpu)
> {
> u64 val;
> --
> 2.35.1.723.g4982287a31-goog
>
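As a side note on `hyp_overflow_stack + PAGE_SIZE` in the hunk above: because the stack grows downwards, the initial SP is the address one past the end of the array. A minimal userspace sketch, assuming 4 KiB pages and a C11 compiler; overflow_stack and overflow_stack_top() are hypothetical stand-ins for the per-CPU variable and the adr_this_cpu computation.

```c
#include <stdint.h>

#define PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Userspace stand-in for the per-CPU hyp_overflow_stack array. */
static _Alignas(16) unsigned long overflow_stack[PAGE_SIZE / sizeof(long)];

/* Hypothetical helper: the initial SP is one past the highest element. */
static uintptr_t overflow_stack_top(void)
{
	return (uintptr_t)overflow_stack + sizeof(overflow_stack);
}
```

Since the array is 16-byte aligned and PAGE_SIZE is a multiple of 16, the resulting top address is 16-byte aligned as well.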
On Tue, Mar 29, 2022 at 1:51 AM Fuad Tabba <[email protected]> wrote:
>
> Hi Kalesh,
>
>
> On Mon, Mar 14, 2022 at 8:02 PM Kalesh Singh <[email protected]> wrote:
> >
> > hyp_alloc_private_va_range() can be used to reserve private VA ranges
> > in the nVHE hypervisor. Allocations are aligned based on the order of
> > the requested size.
> >
> > This will be used to implement stack guard pages for KVM nVHE hypervisor
> > (nVHE Hyp mode / not pKVM), in a subsequent patch in the series.
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
>
> This looks good to me. I have also tested this entire series, and your
> enhancements will make debugging easier.
>
> Tested-by: Fuad Tabba <[email protected]>
> Reviewed-by: Fuad Tabba <[email protected]>
Thanks for the reviews Fuad :)
- Kalesh
Hi Kalesh,
On Mon, Mar 14, 2022 at 8:06 PM Kalesh Singh <[email protected]> wrote:
>
> Unwind the stack in EL1, when CONFIG_NVHE_EL2_DEBUG is enabled.
> This is possible because CONFIG_NVHE_EL2_DEBUG disables the host
> stage-2 protection on hyp_panic(), allowing the host to access
> the hypervisor stack pages in EL1.
>
> A simple stack overflow test produces the following output:
>
> [ 580.376051][ T412] kvm: nVHE hyp panic at: ffffffc0116145c4!
> [ 580.378034][ T412] kvm [412]: nVHE HYP call trace:
> [ 580.378591][ T412] kvm [412]: [<ffffffc011614934>]
> [ 580.378993][ T412] kvm [412]: [<ffffffc01160fa48>]
> [ 580.379386][ T412] kvm [412]: [<ffffffc0116145dc>] // Non-terminating recursive call
> [ 580.379772][ T412] kvm [412]: [<ffffffc0116145dc>]
> [ 580.380158][ T412] kvm [412]: [<ffffffc0116145dc>]
> [ 580.380544][ T412] kvm [412]: [<ffffffc0116145dc>]
> [ 580.380928][ T412] kvm [412]: [<ffffffc0116145dc>]
> . . .
>
> Since nVHE hyp symbols are not included by kallsyms to avoid issues
> with aliasing, we fall back to the vmlinux addresses. Symbolizing the
> addresses is handled in the next patch in this series.
>
> Signed-off-by: Kalesh Singh <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Thanks,
/fuad
> ---
>
> Changes in v4:
> - Update commit text and struct kvm_nvhe_panic_info kernel-doc comment
> to clarify that CONFIG_NVHE_EL2_DEBUG only disables the host stage-2
> protection on hyp_panic(), per Fuad
> - Update NVHE_EL2_DEBUG Kconfig description to clarify that the
> hypervisor stack trace is printed when hyp_panic() is called, per Fuad
>
> Changes in v3:
> - The nvhe hyp stack unwinder now makes use of the core logic from the
> regular kernel unwinder to avoid duplication, per Mark
>
> Changes in v2:
> - Add cpu_prepare_nvhe_panic_info()
> - Move updating the panic info to hyp_panic(), so that unwinding also
> works for conventional nVHE Hyp-mode.
>
>
> arch/arm64/include/asm/kvm_asm.h | 20 +++
> arch/arm64/include/asm/stacktrace.h | 12 ++
> arch/arm64/kernel/stacktrace.c | 210 +++++++++++++++++++++++++---
> arch/arm64/kvm/Kconfig | 5 +-
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/handle_exit.c | 3 +
> arch/arm64/kvm/hyp/nvhe/switch.c | 18 +++
> 7 files changed, 244 insertions(+), 26 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 2e277f2ed671..4abcf93c6662 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -176,6 +176,26 @@ struct kvm_nvhe_init_params {
> unsigned long vtcr;
> };
>
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> +/**
> + * struct kvm_nvhe_panic_info - nVHE hypervisor panic info.
> + * @hyp_stack_base: hyp VA of the hyp_stack base.
> + * @hyp_overflow_stack_base: hyp VA of the hyp_overflow_stack base.
> + * @fp: hyp FP where the backtrace begins.
> + * @pc: hyp PC where the backtrace begins.
> + *
> + * Used by the host in EL1 to dump the nVHE hypervisor backtrace on
> + * hyp_panic. This is possible because CONFIG_NVHE_EL2_DEBUG disables
> + * the host stage 2 protection on hyp_panic(). See: __hyp_do_panic()
> + */
> +struct kvm_nvhe_panic_info {
> + unsigned long hyp_stack_base;
> + unsigned long hyp_overflow_stack_base;
> + unsigned long fp;
> + unsigned long pc;
> +};
> +#endif /* CONFIG_NVHE_EL2_DEBUG */
> +
> /* Translate a kernel address @ptr into its equivalent linear mapping */
> #define kvm_ksym_ref(ptr) \
> ({ \
> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
> index e77cdef9ca29..18611a51cf14 100644
> --- a/arch/arm64/include/asm/stacktrace.h
> +++ b/arch/arm64/include/asm/stacktrace.h
> @@ -22,6 +22,10 @@ enum stack_type {
> STACK_TYPE_OVERFLOW,
> STACK_TYPE_SDEI_NORMAL,
> STACK_TYPE_SDEI_CRITICAL,
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> + STACK_TYPE_KVM_NVHE_HYP,
> + STACK_TYPE_KVM_NVHE_OVERFLOW,
> +#endif /* CONFIG_NVHE_EL2_DEBUG */
> __NR_STACK_TYPES
> };
>
> @@ -147,4 +151,12 @@ static inline bool on_accessible_stack(const struct task_struct *tsk,
> return false;
> }
>
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> +void kvm_nvhe_dump_backtrace(unsigned long hyp_offset);
> +#else
> +static inline void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
> +{
> +}
> +#endif /* CONFIG_NVHE_EL2_DEBUG */
> +
> #endif /* __ASM_STACKTRACE_H */
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index e4103e085681..6ec85cb69b1f 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -15,6 +15,8 @@
>
> #include <asm/irq.h>
> #include <asm/pointer_auth.h>
> +#include <asm/kvm_asm.h>
> +#include <asm/kvm_hyp.h>
> #include <asm/stack_pointer.h>
> #include <asm/stacktrace.h>
>
> @@ -64,26 +66,15 @@ NOKPROBE_SYMBOL(start_backtrace);
> * records (e.g. a cycle), determined based on the location and fp value of A
> * and the location (but not the fp value) of B.
> */
> -static int notrace unwind_frame(struct task_struct *tsk,
> - struct stackframe *frame)
> +static int notrace __unwind_frame(struct stackframe *frame, struct stack_info *info,
> + unsigned long (*translate_fp)(unsigned long, enum stack_type))
> {
> unsigned long fp = frame->fp;
> - struct stack_info info;
> -
> - if (!tsk)
> - tsk = current;
> -
> - /* Final frame; nothing to unwind */
> - if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
> - return -ENOENT;
>
> if (fp & 0x7)
> return -EINVAL;
>
> - if (!on_accessible_stack(tsk, fp, 16, &info))
> - return -EINVAL;
> -
> - if (test_bit(info.type, frame->stacks_done))
> + if (test_bit(info->type, frame->stacks_done))
> return -EINVAL;
>
> /*
> @@ -94,28 +85,62 @@ static int notrace unwind_frame(struct task_struct *tsk,
> *
> * TASK -> IRQ -> OVERFLOW -> SDEI_NORMAL
> * TASK -> SDEI_NORMAL -> SDEI_CRITICAL -> OVERFLOW
> + * KVM_NVHE_HYP -> KVM_NVHE_OVERFLOW
> *
> * ... but the nesting itself is strict. Once we transition from one
> * stack to another, it's never valid to unwind back to that first
> * stack.
> */
> - if (info.type == frame->prev_type) {
> + if (info->type == frame->prev_type) {
> if (fp <= frame->prev_fp)
> return -EINVAL;
> } else {
> set_bit(frame->prev_type, frame->stacks_done);
> }
>
> + /* Record fp as prev_fp before attempting to get the next fp */
> + frame->prev_fp = fp;
> +
> + /*
> + * If fp is not from the current address space perform the
> + * necessary translation before dereferencing it to get next fp.
> + */
> + if (translate_fp)
> + fp = translate_fp(fp, info->type);
> + if (!fp)
> + return -EINVAL;
> +
> /*
> * Record this frame record's values and location. The prev_fp and
> - * prev_type are only meaningful to the next unwind_frame() invocation.
> + * prev_type are only meaningful to the next __unwind_frame() invocation.
> */
> frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
> frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));
> - frame->prev_fp = fp;
> - frame->prev_type = info.type;
> -
> frame->pc = ptrauth_strip_insn_pac(frame->pc);
> + frame->prev_type = info->type;
> +
> + return 0;
> +}
> +
> +static int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
> +{
> + unsigned long fp = frame->fp;
> + struct stack_info info;
> + int err;
> +
> + if (!tsk)
> + tsk = current;
> +
> + /* Final frame; nothing to unwind */
> + if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
> + return -ENOENT;
> +
> + if (!on_accessible_stack(tsk, fp, 16, &info))
> + return -EINVAL;
> +
> + err = __unwind_frame(frame, &info, NULL);
> + if (err)
> + return err;
>
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> if (tsk->ret_stack &&
> @@ -143,20 +168,27 @@ static int notrace unwind_frame(struct task_struct *tsk,
> }
> NOKPROBE_SYMBOL(unwind_frame);
>
> -static void notrace walk_stackframe(struct task_struct *tsk,
> - struct stackframe *frame,
> - bool (*fn)(void *, unsigned long), void *data)
> +static void notrace __walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
> + bool (*fn)(void *, unsigned long), void *data,
> + int (*unwind_frame_fn)(struct task_struct *tsk, struct stackframe *frame))
> {
> while (1) {
> int ret;
>
> if (!fn(data, frame->pc))
> break;
> - ret = unwind_frame(tsk, frame);
> + ret = unwind_frame_fn(tsk, frame);
> if (ret < 0)
> break;
> }
> }
> +
> +static void notrace walk_stackframe(struct task_struct *tsk,
> + struct stackframe *frame,
> + bool (*fn)(void *, unsigned long), void *data)
> +{
> + __walk_stackframe(tsk, frame, fn, data, unwind_frame);
> +}
> NOKPROBE_SYMBOL(walk_stackframe);
>
> static bool dump_backtrace_entry(void *arg, unsigned long where)
> @@ -210,3 +242,135 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
>
> walk_stackframe(task, &frame, consume_entry, cookie);
> }
> +
> +#ifdef CONFIG_NVHE_EL2_DEBUG
> +DECLARE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> +DECLARE_KVM_NVHE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack);
> +DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
> +
> +static inline bool kvm_nvhe_on_overflow_stack(unsigned long sp, unsigned long size,
> + struct stack_info *info)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> + unsigned long low = (unsigned long)panic_info->hyp_overflow_stack_base;
> + unsigned long high = low + PAGE_SIZE;
> +
> + return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_OVERFLOW, info);
> +}
> +
> +static inline bool kvm_nvhe_on_hyp_stack(unsigned long sp, unsigned long size,
> + struct stack_info *info)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> + unsigned long low = (unsigned long)panic_info->hyp_stack_base;
> + unsigned long high = low + PAGE_SIZE;
> +
> + return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_HYP, info);
> +}
> +
> +static inline bool kvm_nvhe_on_accessible_stack(unsigned long sp, unsigned long size,
> + struct stack_info *info)
> +{
> + if (info)
> + info->type = STACK_TYPE_UNKNOWN;
> +
> + if (kvm_nvhe_on_hyp_stack(sp, size, info))
> + return true;
> + if (kvm_nvhe_on_overflow_stack(sp, size, info))
> + return true;
> +
> + return false;
> +}
> +
> +static unsigned long kvm_nvhe_hyp_stack_kern_va(unsigned long addr)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> + unsigned long hyp_base, kern_base, hyp_offset;
> +
> + hyp_base = (unsigned long)panic_info->hyp_stack_base;
> + hyp_offset = addr - hyp_base;
> +
> + kern_base = (unsigned long)*this_cpu_ptr(&kvm_arm_hyp_stack_page);
> +
> + return kern_base + hyp_offset;
> +}
> +
> +static unsigned long kvm_nvhe_overflow_stack_kern_va(unsigned long addr)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> + unsigned long hyp_base, kern_base, hyp_offset;
> +
> + hyp_base = (unsigned long)panic_info->hyp_overflow_stack_base;
> + hyp_offset = addr - hyp_base;
> +
> + kern_base = (unsigned long)this_cpu_ptr_nvhe_sym(hyp_overflow_stack);
> +
> + return kern_base + hyp_offset;
> +}
> +
> +/*
> + * Convert KVM nVHE hypervisor stack VA to a kernel VA.
> + *
> + * The nVHE hypervisor stack is mapped in the flexible 'private' VA range, to allow
> + * for guard pages below the stack. Consequently, the fixed offset address
> + * translation macros won't work here.
> + *
> + * The kernel VA is calculated as an offset from the kernel VA of the hypervisor
> + * stack base. See: kvm_nvhe_hyp_stack_kern_va(), kvm_nvhe_overflow_stack_kern_va()
> + */
> +static unsigned long kvm_nvhe_stack_kern_va(unsigned long addr,
> + enum stack_type type)
> +{
> + switch (type) {
> + case STACK_TYPE_KVM_NVHE_HYP:
> + return kvm_nvhe_hyp_stack_kern_va(addr);
> + case STACK_TYPE_KVM_NVHE_OVERFLOW:
> + return kvm_nvhe_overflow_stack_kern_va(addr);
> + default:
> + return 0UL;
> + }
> +}
> +
> +static int notrace kvm_nvhe_unwind_frame(struct task_struct *tsk,
> + struct stackframe *frame)
> +{
> + struct stack_info info;
> +
> + if (!kvm_nvhe_on_accessible_stack(frame->fp, 16, &info))
> + return -EINVAL;
> +
> + return __unwind_frame(frame, &info, kvm_nvhe_stack_kern_va);
> +}
> +
> +static bool kvm_nvhe_dump_backtrace_entry(void *arg, unsigned long where)
> +{
> + unsigned long va_mask = GENMASK_ULL(vabits_actual - 1, 0);
> + unsigned long hyp_offset = (unsigned long)arg;
> +
> + where &= va_mask; /* Mask tags */
> + where += hyp_offset; /* Convert to kern addr */
> +
> + kvm_err("[<%016lx>] %pB\n", where, (void *)where);
> +
> + return true;
> +}
> +
> +static void notrace kvm_nvhe_walk_stackframe(struct task_struct *tsk,
> + struct stackframe *frame,
> + bool (*fn)(void *, unsigned long), void *data)
> +{
> + __walk_stackframe(tsk, frame, fn, data, kvm_nvhe_unwind_frame);
> +}
> +
> +void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> + struct stackframe frame;
> +
> + start_backtrace(&frame, panic_info->fp, panic_info->pc);
> + pr_err("nVHE HYP call trace:\n");
> + kvm_nvhe_walk_stackframe(NULL, &frame, kvm_nvhe_dump_backtrace_entry,
> + (void *)hyp_offset);
> + pr_err("---- end of nVHE HYP call trace ----\n");
> +}
> +#endif /* CONFIG_NVHE_EL2_DEBUG */
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 8a5fbbf084df..a7be4ef35fbf 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -51,8 +51,9 @@ config NVHE_EL2_DEBUG
> depends on KVM
> help
> Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
> - Failure reports will BUG() in the hypervisor. This is intended for
> - local EL2 hypervisor development.
> + Failure reports will BUG() in the hypervisor; and calls to hyp_panic()
> + will result in printing the hypervisor call stack.
> + This is intended for local EL2 hypervisor development.
>
> If unsure, say N.
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 72be7e695d8d..c7216ce1d55c 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -49,7 +49,7 @@ DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
>
> DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
>
> -static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> +DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
> DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
>
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index e3140abd2e2e..ff69dff33700 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -17,6 +17,7 @@
> #include <asm/kvm_emulate.h>
> #include <asm/kvm_mmu.h>
> #include <asm/debug-monitors.h>
> +#include <asm/stacktrace.h>
> #include <asm/traps.h>
>
> #include <kvm/arm_hypercalls.h>
> @@ -326,6 +327,8 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr,
> kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset);
> }
>
> + kvm_nvhe_dump_backtrace(hyp_offset);
> +
> /*
> * Hyp has panicked and we're going to handle that by panicking the
> * kernel. The kernel offset will be revealed in the panic so we're
> diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
> index efc20273a352..b8ecffc47424 100644
> --- a/arch/arm64/kvm/hyp/nvhe/switch.c
> +++ b/arch/arm64/kvm/hyp/nvhe/switch.c
> @@ -37,6 +37,22 @@ DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
> #ifdef CONFIG_NVHE_EL2_DEBUG
> DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
> __aligned(16);
> +DEFINE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
> +
> +static inline void cpu_prepare_nvhe_panic_info(void)
> +{
> + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr(&kvm_panic_info);
> + struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params);
> +
> + panic_info->hyp_stack_base = (unsigned long)(params->stack_hyp_va - PAGE_SIZE);
> + panic_info->hyp_overflow_stack_base = (unsigned long)this_cpu_ptr(hyp_overflow_stack);
> + panic_info->fp = (unsigned long)__builtin_frame_address(0);
> + panic_info->pc = _THIS_IP_;
> +}
> + #else
> +static inline void cpu_prepare_nvhe_panic_info(void)
> +{
> +}
> #endif
>
> static void __activate_traps(struct kvm_vcpu *vcpu)
> @@ -360,6 +376,8 @@ asmlinkage void __noreturn hyp_panic(void)
> struct kvm_cpu_context *host_ctxt;
> struct kvm_vcpu *vcpu;
>
> + cpu_prepare_nvhe_panic_info();
> +
> host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
> vcpu = host_ctxt->__hyp_running_vcpu;
>
> --
> 2.35.1.723.g4982287a31-goog
>
On Tue, Mar 29, 2022 at 1:52 AM Fuad Tabba <[email protected]> wrote:
>
> Hi Kalesh,
>
> On Mon, Mar 14, 2022 at 8:06 PM Kalesh Singh <[email protected]> wrote:
> >
> > Unwind the stack in EL1, when CONFIG_NVHE_EL2_DEBUG is enabled.
> > This is possible because CONFIG_NVHE_EL2_DEBUG disables the host
> > stage-2 protection on hyp_panic(), allowing the host to access
> > the hypervisor stack pages in EL1.
Hi everyone,
There has been interest in having hypervisor stack unwinding
in production Android builds.
The current design targets NVHE_EL2_DEBUG enabled builds and is not
suitable for production environments, since this config disables host
stage-2 protection on hyp_panic() which breaks security guarantees.
The benefit of this approach is that the stack unwinding can happen at
EL1 and allows us to reuse most of the unwinding logic from the host
kernel unwinder.
Proposal for how this can be done without disabling host stage-2 protection:
- The host allocates a "panic_info" page and shares it with the hypervisor.
- On hyp_panic(), the hypervisor can unwind and dump its stack
addresses to the shared page.
- The host can read out this information and symbolize these addresses.
This would allow for getting hyp stack traces in production while
preserving the security model. The downside is that the core
unwinding logic would be duplicated at EL2.
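As a rough illustration of the three steps above, here is a userspace
sketch (not kernel code). The "hypervisor" side walks AArch64-style frame
records (saved FP followed by saved LR) and writes a zero-terminated list
of PCs into a buffer standing in for the shared page; the "host" side
reads that list back and applies the hyp offset, much like
kvm_nvhe_dump_backtrace_entry() does in the patch. Every name in it
(struct panic_info_page, PANIC_INFO_MAX_ENTRIES, hyp_save_stacktrace(),
host_read_stacktrace()) is hypothetical, and a 64-bit build is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared-page layout: a zero-terminated list of PCs. */
#define PANIC_INFO_MAX_ENTRIES 64

struct panic_info_page {
	uint64_t pcs[PANIC_INFO_MAX_ENTRIES];
};

/* AArch64 frame record: saved FP followed by saved LR. */
struct frame_record {
	const struct frame_record *fp;
	uint64_t lr;
};

/*
 * "EL2" side: starting from (fp, pc), record each PC in the page and
 * follow the frame records while they stay inside [low, high).
 * Returns the number of PCs saved (excluding the terminating zero).
 */
static size_t hyp_save_stacktrace(struct panic_info_page *page,
				  const struct frame_record *fp, uint64_t pc,
				  uintptr_t low, uintptr_t high)
{
	uintptr_t prev = 0;
	size_t n = 0;

	assert(page != NULL);

	/* Always leave room for the terminating zero entry. */
	while (n < PANIC_INFO_MAX_ENTRIES - 1) {
		uintptr_t addr = (uintptr_t)fp;

		page->pcs[n++] = pc;

		/* Frame records must be 8-byte aligned. */
		if (addr & 0x7)
			break;
		/* The whole record must lie on the stack being unwound. */
		if (addr < low || addr >= high || high - addr < sizeof(*fp))
			break;
		/* Records must move strictly up the stack (no cycles). */
		if (addr <= prev)
			break;
		prev = addr;

		pc = fp->lr;
		fp = fp->fp;
	}

	page->pcs[n] = 0;
	return n;
}

/*
 * "EL1" side: copy the saved PCs out of the page, converting each one
 * to a kernel address by adding hyp_offset. Symbolization would happen
 * on the returned addresses. Returns the number of entries copied.
 */
static size_t host_read_stacktrace(const struct panic_info_page *page,
				   uint64_t hyp_offset,
				   uint64_t *out, size_t max)
{
	size_t n = 0;

	while (n < max && n < PANIC_INFO_MAX_ENTRIES && page->pcs[n]) {
		out[n] = page->pcs[n] + hyp_offset;
		n++;
	}
	return n;
}
```

The host never dereferences a hypervisor frame pointer in this scheme; it
only sees the PCs the hypervisor chose to publish, which is why host
stage-2 protection would not need to be lifted.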
Are there any objections to making this change?
Mark Rutland, I'm interested to hear your thoughts on this.
Thanks,
Kalesh
> >
> > A simple stack overflow test produces the following output:
> >
> > [ 580.376051][ T412] kvm: nVHE hyp panic at: ffffffc0116145c4!
> > [ 580.378034][ T412] kvm [412]: nVHE HYP call trace:
> > [ 580.378591][ T412] kvm [412]: [<ffffffc011614934>]
> > [ 580.378993][ T412] kvm [412]: [<ffffffc01160fa48>]
> > [ 580.379386][ T412] kvm [412]: [<ffffffc0116145dc>] // Non-terminating recursive call
> > [ 580.379772][ T412] kvm [412]: [<ffffffc0116145dc>]
> > [ 580.380158][ T412] kvm [412]: [<ffffffc0116145dc>]
> > [ 580.380544][ T412] kvm [412]: [<ffffffc0116145dc>]
> > [ 580.380928][ T412] kvm [412]: [<ffffffc0116145dc>]
> > . . .
> >
> > Since nVHE hyp symbols are not included by kallsyms to avoid issues
> > with aliasing, we fallback to the vmlinux addresses. Symbolizing the
> > addresses is handled in the next patch in this series.
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
>
> Tested-by: Fuad Tabba <[email protected]>
> Reviewed-by: Fuad Tabba <[email protected]>
>
> Thanks,
> /fuad
>
>
>
> > ---
> >
> > Changes in v4:
> > - Update commit text and struct kvm_nvhe_panic_info kernel-doc comment
> > to clarify that CONFIG_NVHE_EL2_DEBUG only disables the host stage-2
> > protection on hyp_panic(), per Fuad
> > - Update NVHE_EL2_DEBUG Kconfig description to clarify that the
> > hypervisor stack trace is printed when hyp_panic() is called, per Fuad
> >
> > Changes in v3:
> > - The nvhe hyp stack unwinder now makes use of the core logic from the
> > regular kernel unwinder to avoid duplication, per Mark
> >
> > Changes in v2:
> > - Add cpu_prepare_nvhe_panic_info()
> > - Move updating the panic info to hyp_panic(), so that unwinding also
> > works for conventional nVHE Hyp-mode.
> >
> >
> > arch/arm64/include/asm/kvm_asm.h | 20 +++
> > arch/arm64/include/asm/stacktrace.h | 12 ++
> > arch/arm64/kernel/stacktrace.c | 210 +++++++++++++++++++++++++---
> > arch/arm64/kvm/Kconfig | 5 +-
> > arch/arm64/kvm/arm.c | 2 +-
> > arch/arm64/kvm/handle_exit.c | 3 +
> > arch/arm64/kvm/hyp/nvhe/switch.c | 18 +++
> > 7 files changed, 244 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 2e277f2ed671..4abcf93c6662 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -176,6 +176,26 @@ struct kvm_nvhe_init_params {
> > unsigned long vtcr;
> > };
> >
> > +#ifdef CONFIG_NVHE_EL2_DEBUG
> > +/**
> > + * struct kvm_nvhe_panic_info - nVHE hypervisor panic info.
> > + * @hyp_stack_base: hyp VA of the hyp_stack base.
> > + * @hyp_overflow_stack_base: hyp VA of the hyp_overflow_stack base.
> > + * @fp: hyp FP where the backtrace begins.
> > + * @pc: hyp PC where the backtrace begins.
> > + *
> > + * Used by the host in EL1 to dump the nVHE hypervisor backtrace on
> > + * hyp_panic. This is possible because CONFIG_NVHE_EL2_DEBUG disables
> > + * the host stage 2 protection on hyp_panic(). See: __hyp_do_panic()
> > + */
> > +struct kvm_nvhe_panic_info {
> > + unsigned long hyp_stack_base;
> > + unsigned long hyp_overflow_stack_base;
> > + unsigned long fp;
> > + unsigned long pc;
> > +};
> > +#endif /* CONFIG_NVHE_EL2_DEBUG */
> > +
> > /* Translate a kernel address @ptr into its equivalent linear mapping */
> > #define kvm_ksym_ref(ptr) \
> > ({ \
> > diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
> > index e77cdef9ca29..18611a51cf14 100644
> > --- a/arch/arm64/include/asm/stacktrace.h
> > +++ b/arch/arm64/include/asm/stacktrace.h
> > @@ -22,6 +22,10 @@ enum stack_type {
> > STACK_TYPE_OVERFLOW,
> > STACK_TYPE_SDEI_NORMAL,
> > STACK_TYPE_SDEI_CRITICAL,
> > +#ifdef CONFIG_NVHE_EL2_DEBUG
> > + STACK_TYPE_KVM_NVHE_HYP,
> > + STACK_TYPE_KVM_NVHE_OVERFLOW,
> > +#endif /* CONFIG_NVHE_EL2_DEBUG */
> > __NR_STACK_TYPES
> > };
> >
> > @@ -147,4 +151,12 @@ static inline bool on_accessible_stack(const struct task_struct *tsk,
> > return false;
> > }
> >
> > +#ifdef CONFIG_NVHE_EL2_DEBUG
> > +void kvm_nvhe_dump_backtrace(unsigned long hyp_offset);
> > +#else
> > +static inline void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
> > +{
> > +}
> > +#endif /* CONFIG_NVHE_EL2_DEBUG */
> > +
> > #endif /* __ASM_STACKTRACE_H */
> > diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> > index e4103e085681..6ec85cb69b1f 100644
> > --- a/arch/arm64/kernel/stacktrace.c
> > +++ b/arch/arm64/kernel/stacktrace.c
> > @@ -15,6 +15,8 @@
> >
> > #include <asm/irq.h>
> > #include <asm/pointer_auth.h>
> > +#include <asm/kvm_asm.h>
> > +#include <asm/kvm_hyp.h>
> > #include <asm/stack_pointer.h>
> > #include <asm/stacktrace.h>
> >
> > @@ -64,26 +66,15 @@ NOKPROBE_SYMBOL(start_backtrace);
> > * records (e.g. a cycle), determined based on the location and fp value of A
> > * and the location (but not the fp value) of B.
> > */
> > -static int notrace unwind_frame(struct task_struct *tsk,
> > - struct stackframe *frame)
> > +static int notrace __unwind_frame(struct stackframe *frame, struct stack_info *info,
> > + unsigned long (*translate_fp)(unsigned long, enum stack_type))
> > {
> > unsigned long fp = frame->fp;
> > - struct stack_info info;
> > -
> > - if (!tsk)
> > - tsk = current;
> > -
> > - /* Final frame; nothing to unwind */
> > - if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
> > - return -ENOENT;
> >
> > if (fp & 0x7)
> > return -EINVAL;
> >
> > - if (!on_accessible_stack(tsk, fp, 16, &info))
> > - return -EINVAL;
> > -
> > - if (test_bit(info.type, frame->stacks_done))
> > + if (test_bit(info->type, frame->stacks_done))
> > return -EINVAL;
> >
> > /*
> > @@ -94,28 +85,62 @@ static int notrace unwind_frame(struct task_struct *tsk,
> > *
> > * TASK -> IRQ -> OVERFLOW -> SDEI_NORMAL
> > * TASK -> SDEI_NORMAL -> SDEI_CRITICAL -> OVERFLOW
> > + * KVM_NVHE_HYP -> KVM_NVHE_OVERFLOW
> > *
> > * ... but the nesting itself is strict. Once we transition from one
> > * stack to another, it's never valid to unwind back to that first
> > * stack.
> > */
> > - if (info.type == frame->prev_type) {
> > + if (info->type == frame->prev_type) {
> > if (fp <= frame->prev_fp)
> > return -EINVAL;
> > } else {
> > set_bit(frame->prev_type, frame->stacks_done);
> > }
> >
> > + /* Record fp as prev_fp before attempting to get the next fp */
> > + frame->prev_fp = fp;
> > +
> > + /*
> > + * If fp is not from the current address space perform the
> > + * necessary translation before dereferencing it to get next fp.
> > + */
> > + if (translate_fp)
> > + fp = translate_fp(fp, info->type);
> > + if (!fp)
> > + return -EINVAL;
> > +
> > /*
> > * Record this frame record's values and location. The prev_fp and
> > - * prev_type are only meaningful to the next unwind_frame() invocation.
> > + * prev_type are only meaningful to the next __unwind_frame() invocation.
> > */
> > frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
> > frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));
> > - frame->prev_fp = fp;
> > - frame->prev_type = info.type;
> > -
> > frame->pc = ptrauth_strip_insn_pac(frame->pc);
> > + frame->prev_type = info->type;
> > +
> > + return 0;
> > +}
> > +
> > +static int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
> > +{
> > + unsigned long fp = frame->fp;
> > + struct stack_info info;
> > + int err;
> > +
> > + if (!tsk)
> > + tsk = current;
> > +
> > + /* Final frame; nothing to unwind */
> > + if (fp == (unsigned long)task_pt_regs(tsk)->stackframe)
> > + return -ENOENT;
> > +
> > + if (!on_accessible_stack(tsk, fp, 16, &info))
> > + return -EINVAL;
> > +
> > + err = __unwind_frame(frame, &info, NULL);
> > + if (err)
> > + return err;
> >
> > #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> > if (tsk->ret_stack &&
> > @@ -143,20 +168,27 @@ static int notrace unwind_frame(struct task_struct *tsk,
> > }
> > NOKPROBE_SYMBOL(unwind_frame);
> >
> > -static void notrace walk_stackframe(struct task_struct *tsk,
> > - struct stackframe *frame,
> > - bool (*fn)(void *, unsigned long), void *data)
> > +static void notrace __walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
> > + bool (*fn)(void *, unsigned long), void *data,
> > + int (*unwind_frame_fn)(struct task_struct *tsk, struct stackframe *frame))
> > {
> > while (1) {
> > int ret;
> >
> > if (!fn(data, frame->pc))
> > break;
> > - ret = unwind_frame(tsk, frame);
> > + ret = unwind_frame_fn(tsk, frame);
> > if (ret < 0)
> > break;
> > }
> > }
> > +
> > +static void notrace walk_stackframe(struct task_struct *tsk,
> > + struct stackframe *frame,
> > + bool (*fn)(void *, unsigned long), void *data)
> > +{
> > + __walk_stackframe(tsk, frame, fn, data, unwind_frame);
> > +}
> > NOKPROBE_SYMBOL(walk_stackframe);
> >
> > static bool dump_backtrace_entry(void *arg, unsigned long where)
> > @@ -210,3 +242,135 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
> >
> > walk_stackframe(task, &frame, consume_entry, cookie);
> > }
> > +
> > +#ifdef CONFIG_NVHE_EL2_DEBUG
> > +DECLARE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> > +DECLARE_KVM_NVHE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack);
> > +DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
> > +
> > +static inline bool kvm_nvhe_on_overflow_stack(unsigned long sp, unsigned long size,
> > + struct stack_info *info)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> > + unsigned long low = (unsigned long)panic_info->hyp_overflow_stack_base;
> > + unsigned long high = low + PAGE_SIZE;
> > +
> > + return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_OVERFLOW, info);
> > +}
> > +
> > +static inline bool kvm_nvhe_on_hyp_stack(unsigned long sp, unsigned long size,
> > + struct stack_info *info)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> > + unsigned long low = (unsigned long)panic_info->hyp_stack_base;
> > + unsigned long high = low + PAGE_SIZE;
> > +
> > + return on_stack(sp, size, low, high, STACK_TYPE_KVM_NVHE_HYP, info);
> > +}
> > +
> > +static inline bool kvm_nvhe_on_accessible_stack(unsigned long sp, unsigned long size,
> > + struct stack_info *info)
> > +{
> > + if (info)
> > + info->type = STACK_TYPE_UNKNOWN;
> > +
> > + if (kvm_nvhe_on_hyp_stack(sp, size, info))
> > + return true;
> > + if (kvm_nvhe_on_overflow_stack(sp, size, info))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static unsigned long kvm_nvhe_hyp_stack_kern_va(unsigned long addr)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> > + unsigned long hyp_base, kern_base, hyp_offset;
> > +
> > + hyp_base = (unsigned long)panic_info->hyp_stack_base;
> > + hyp_offset = addr - hyp_base;
> > +
> > + kern_base = (unsigned long)*this_cpu_ptr(&kvm_arm_hyp_stack_page);
> > +
> > + return kern_base + hyp_offset;
> > +}
> > +
> > +static unsigned long kvm_nvhe_overflow_stack_kern_va(unsigned long addr)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> > + unsigned long hyp_base, kern_base, hyp_offset;
> > +
> > + hyp_base = (unsigned long)panic_info->hyp_overflow_stack_base;
> > + hyp_offset = addr - hyp_base;
> > +
> > + kern_base = (unsigned long)this_cpu_ptr_nvhe_sym(hyp_overflow_stack);
> > +
> > + return kern_base + hyp_offset;
> > +}
> > +
> > +/*
> > + * Convert KVM nVHE hypervisor stack VA to a kernel VA.
> > + *
> > + * The nVHE hypervisor stack is mapped in the flexible 'private' VA range, to allow
> > + * for guard pages below the stack. Consequently, the fixed offset address
> > + * translation macros won't work here.
> > + *
> > + * The kernel VA is calculated as an offset from the kernel VA of the hypervisor
> > + * stack base. See: kvm_nvhe_hyp_stack_kern_va(), kvm_nvhe_overflow_stack_kern_va()
> > + */
> > +static unsigned long kvm_nvhe_stack_kern_va(unsigned long addr,
> > + enum stack_type type)
> > +{
> > + switch (type) {
> > + case STACK_TYPE_KVM_NVHE_HYP:
> > + return kvm_nvhe_hyp_stack_kern_va(addr);
> > + case STACK_TYPE_KVM_NVHE_OVERFLOW:
> > + return kvm_nvhe_overflow_stack_kern_va(addr);
> > + default:
> > + return 0UL;
> > + }
> > +}
> > +
> > +static int notrace kvm_nvhe_unwind_frame(struct task_struct *tsk,
> > + struct stackframe *frame)
> > +{
> > + struct stack_info info;
> > +
> > + if (!kvm_nvhe_on_accessible_stack(frame->fp, 16, &info))
> > + return -EINVAL;
> > +
> > + return __unwind_frame(frame, &info, kvm_nvhe_stack_kern_va);
> > +}
> > +
> > +static bool kvm_nvhe_dump_backtrace_entry(void *arg, unsigned long where)
> > +{
> > + unsigned long va_mask = GENMASK_ULL(vabits_actual - 1, 0);
> > + unsigned long hyp_offset = (unsigned long)arg;
> > +
> > + where &= va_mask; /* Mask tags */
> > + where += hyp_offset; /* Convert to kern addr */
> > +
> > + kvm_err("[<%016lx>] %pB\n", where, (void *)where);
> > +
> > + return true;
> > +}
> > +
> > +static void notrace kvm_nvhe_walk_stackframe(struct task_struct *tsk,
> > + struct stackframe *frame,
> > + bool (*fn)(void *, unsigned long), void *data)
> > +{
> > + __walk_stackframe(tsk, frame, fn, data, kvm_nvhe_unwind_frame);
> > +}
> > +
> > +void kvm_nvhe_dump_backtrace(unsigned long hyp_offset)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr_nvhe_sym(kvm_panic_info);
> > + struct stackframe frame;
> > +
> > + start_backtrace(&frame, panic_info->fp, panic_info->pc);
> > + pr_err("nVHE HYP call trace:\n");
> > + kvm_nvhe_walk_stackframe(NULL, &frame, kvm_nvhe_dump_backtrace_entry,
> > + (void *)hyp_offset);
> > + pr_err("---- end of nVHE HYP call trace ----\n");
> > +}
> > +#endif /* CONFIG_NVHE_EL2_DEBUG */
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 8a5fbbf084df..a7be4ef35fbf 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -51,8 +51,9 @@ config NVHE_EL2_DEBUG
> > depends on KVM
> > help
> > Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
> > - Failure reports will BUG() in the hypervisor. This is intended for
> > - local EL2 hypervisor development.
> > + Failure reports will BUG() in the hypervisor; and calls to hyp_panic()
> > + will result in printing the hypervisor call stack.
> > + This is intended for local EL2 hypervisor development.
> >
> > If unsure, say N.
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 72be7e695d8d..c7216ce1d55c 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -49,7 +49,7 @@ DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> >
> > DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
> >
> > -static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> > +DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
> > unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
> > DECLARE_KVM_NVHE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
> >
> > diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> > index e3140abd2e2e..ff69dff33700 100644
> > --- a/arch/arm64/kvm/handle_exit.c
> > +++ b/arch/arm64/kvm/handle_exit.c
> > @@ -17,6 +17,7 @@
> > #include <asm/kvm_emulate.h>
> > #include <asm/kvm_mmu.h>
> > #include <asm/debug-monitors.h>
> > +#include <asm/stacktrace.h>
> > #include <asm/traps.h>
> >
> > #include <kvm/arm_hypercalls.h>
> > @@ -326,6 +327,8 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr,
> > kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset);
> > }
> >
> > + kvm_nvhe_dump_backtrace(hyp_offset);
> > +
> > /*
> > * Hyp has panicked and we're going to handle that by panicking the
> > * kernel. The kernel offset will be revealed in the panic so we're
> > diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
> > index efc20273a352..b8ecffc47424 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/switch.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/switch.c
> > @@ -37,6 +37,22 @@ DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);
> > #ifdef CONFIG_NVHE_EL2_DEBUG
> > DEFINE_PER_CPU(unsigned long [PAGE_SIZE/sizeof(long)], hyp_overflow_stack)
> > __aligned(16);
> > +DEFINE_PER_CPU(struct kvm_nvhe_panic_info, kvm_panic_info);
> > +
> > +static inline void cpu_prepare_nvhe_panic_info(void)
> > +{
> > + struct kvm_nvhe_panic_info *panic_info = this_cpu_ptr(&kvm_panic_info);
> > + struct kvm_nvhe_init_params *params = this_cpu_ptr(&kvm_init_params);
> > +
> > + panic_info->hyp_stack_base = (unsigned long)(params->stack_hyp_va - PAGE_SIZE);
> > + panic_info->hyp_overflow_stack_base = (unsigned long)this_cpu_ptr(hyp_overflow_stack);
> > + panic_info->fp = (unsigned long)__builtin_frame_address(0);
> > + panic_info->pc = _THIS_IP_;
> > +}
> > + #else
> > +static inline void cpu_prepare_nvhe_panic_info(void)
> > +{
> > +}
> > #endif
> >
> > static void __activate_traps(struct kvm_vcpu *vcpu)
> > @@ -360,6 +376,8 @@ asmlinkage void __noreturn hyp_panic(void)
> > struct kvm_cpu_context *host_ctxt;
> > struct kvm_vcpu *vcpu;
> >
> > + cpu_prepare_nvhe_panic_info();
> > +
> > host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
> > vcpu = host_ctxt->__hyp_running_vcpu;
> >
> > --
> > 2.35.1.723.g4982287a31-goog
> >
Hi Kalesh,
Sorry for the radio silence.
I see that in v7 you've dropped the stacktrace bits for now; I'm just
commenting here for future reference.
On Thu, Mar 31, 2022 at 12:22:05PM -0700, Kalesh Singh wrote:
> Hi everyone,
>
> There has been interest in having hypervisor stack unwinding
> in production Android builds.
>
> The current design targets NVHE_EL2_DEBUG enabled builds and is not
> suitable for production environments, since this config disables host
> stage-2 protection on hyp_panic() which breaks security guarantees.
> The benefit of this approach is that the stack unwinding can happen at
> EL1 and allows us to reuse most of the unwinding logic from the host
> kernel unwinder.
>
> Proposal for how this can be done without disabling host stage-2 protection:
> - The host allocates a "panic_info" page and shares it with the hypervisor.
> - On hyp_panic(), the hypervisor can unwind and dump its stack
> addresses to the shared page.
> - The host can read out this information and symbolize these addresses.
>
> This would allow for getting hyp stack traces in production while
> preserving the security model. The downside is that the core
> unwinding logic would be duplicated at EL2.
>
> Are there any objections to making this change?
I'm fine with the concept of splitting the unwind and logging steps; this is
akin to doing:
stack_trace_save_tsk(...);
...
stack_trace_print(...);
... and I'm fine with having a stack_trace_save_hyp(...) variant.
However, I would like to ensure that we're reusing logic rather than
duplicating it wholesale. There are some changes I would like to make to the
stacktrace code in the near future that might make that a bit easier, e.g.
reworking the stack transition checks to be table-driven, and factoring out the
way we handle return trampolines.
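One possible reading of "table-driven" (an assumption; the thread does
not spell out the design) is that each unwinder describes its stacks as
an array of ranges, and a single loop replaces the chain of
on_*_stack() helpers. The userspace sketch below mirrors the bounds
check done by the kernel's on_stack() and reuses the stack type names
from this patch; find_stack() and the example addresses are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum stack_type {
	STACK_TYPE_UNKNOWN,
	STACK_TYPE_KVM_NVHE_HYP,
	STACK_TYPE_KVM_NVHE_OVERFLOW,
};

/* One table entry: the [low, high) range covered by a stack. */
struct stack_info {
	uintptr_t low;
	uintptr_t high;
	enum stack_type type;
};

/* True if the object [sp, sp + size) lies entirely within the stack. */
static bool on_stack(uintptr_t sp, size_t size, const struct stack_info *s)
{
	if (!s->low)
		return false;
	if (sp < s->low || sp + size < sp || sp + size > s->high)
		return false;
	return true;
}

/*
 * Hypothetical table lookup: return the type of the first stack in the
 * table that fully contains [sp, sp + size), or STACK_TYPE_UNKNOWN.
 */
static enum stack_type find_stack(const struct stack_info *table, size_t nr,
				  uintptr_t sp, size_t size)
{
	assert(table != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (on_stack(sp, size, &table[i]))
			return table[i].type;
	}
	return STACK_TYPE_UNKNOWN;
}
```

With this shape, the host and hyp unwinders would differ only in the
table they pass in, which is one way the common logic could be shared.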
I'll Cc you on changes to the stacktrace code. There are some preparatory
cleanups I'd like to get out of the way first which I'll send shortly.
Thanks,
Mark.
On Wed, Apr 13, 2022 at 6:59 AM Mark Rutland <[email protected]> wrote:
>
> Hi Kalesh,
>
> Sorry for the radio silence.
>
> I see that in v7 you've dropped the stacktrace bits for now; I'm just
> commenting here for future reference.
>
> On Thu, Mar 31, 2022 at 12:22:05PM -0700, Kalesh Singh wrote:
> > Hi everyone,
> >
> > There has been interest in having hypervisor stack unwinding
> > in production Android builds.
> >
> > The current design targets NVHE_EL2_DEBUG enabled builds and is not
> > suitable for production environments, since this config disables host
> > stage-2 protection on hyp_panic() which breaks security guarantees.
> > The benefit of this approach is that the stack unwinding can happen at
> > EL1 and allows us to reuse most of the unwinding logic from the host
> > kernel unwinder.
> >
> > Proposal for how this can be done without disabling host stage-2 protection:
> > - The host allocates a "panic_info" page and shares it with the hypervisor.
> > - On hyp_panic(), the hypervisor can unwind and dump its stack
> > addresses to the shared page.
> > - The host can read out this information and symbolize these addresses.
> >
> > This would allow for getting hyp stack traces in production while
> > preserving the security model. The downside is that the core
> > unwinding logic would be duplicated at EL2.
> >
> > Are there any objections to making this change?
>
> I'm fine with the concept of splitting the unwind and logging steps; this is
> akin to doing:
>
> stack_trace_save_tsk(...);
> ...
> stack_trace_print(...);
>
> ... and I'm fine with having a stack_trace_save_hyp(...) variant.
>
> However, I would like to ensure that we're reusing logic rather than
> duplicating it wholesale.
Agreed. Although some reimplementation may be unavoidable, as we can't
safely link against kernel code from the protected KVM hypervisor.
Perhaps we can move some of the common logic to a shared header that
can be included in both places (host, hyp), WDYT?
> There are some changes I would like to make to the
> stacktrace code in the near future that might make that a bit easier, e.g.
> reworking the stack transition checks to be table-driven, and factoring out the
> way we handle return trampolines.
Sounds good to me.
Thanks,
Kalesh
>
> I'll Cc you on changes to the stacktrace code. There are some preparatory
> cleanups I'd like to get out of the way first which I'll send shortly.
>
> Thanks,
> Mark.
On Tue, Apr 19, 2022 at 10:37:56AM -0700, Kalesh Singh wrote:
> On Wed, Apr 13, 2022 at 6:59 AM Mark Rutland <[email protected]> wrote:
> > I'm fine with the concept of splitting the unwind and logging steps; this is
> > akin to doing:
> >
> > stack_trace_save_tsk(...);
> > ...
> > stack_trace_print(...);
> >
> > ... and I'm fine with having a stack_trace_save_hyp(...) variant.
> >
> > However, I would like to ensure that we're reusing logic rather than
> > duplicating it wholesale.
>
> Agreed. Although some reimplementation may be unavoidable, as we can't
> safely link against kernel code from the protected KVM hypervisor.
Sure; I just mean that we have one implementation, even if that gets recompiled
in separate objects for different contexts.
> Perhaps we can move some of the common logic to a shared header that
> can be included in both places (host, hyp), WDYT?
My rough thinking was that we'd build the same stacktrace.c file (reworked from
the current one) as stacktrace.o and stacktrace.nvhe.o, but moving things
around into headers is also an option. Either way will need some
experimentation.
Thanks,
Mark.