From: Andrey Konovalov <[email protected]>
kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
Currently, KASAN always uses the normal stack trace collection routines,
which rely on the unwinder, when saving alloc and free stack traces.
Instead of invoking the unwinder, collect the stack trace by copying
frames from the Shadow Call Stack whenever it is enabled. This reduces
boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
Stack traces are collected from the Shadow Call Stack via a new
stack_trace_save_shadow() interface.
Note that the frame of the interrupted function is not included into
the stack trace, as it is not yet saved on the SCS when an interrupt
happens.
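As a rough illustration of the idea (a simplified sketch only, not the
actual arm64 implementation from patch 3; copy_scs_frames, scs_base, and
scs_top are placeholder names), the collection boils down to a linear copy
of return addresses:

/*
 * Sketch: copy frames from the Shadow Call Stack into a storage array.
 * scs_top stands for the current SCS pointer (kept live in x18 on arm64)
 * and scs_base for the bottom of the task's shadow stack. PAC bits are
 * stripped and READ_ONCE_NOCHECK() is used to avoid instrumented
 * accesses, as noted in the changelog below.
 */
static int copy_scs_frames(unsigned long *store, unsigned int size,
			   unsigned long *scs_base, unsigned long *scs_top)
{
	unsigned int len = 0;
	unsigned long *frame;

	/* The SCS grows up: walk from the newest frame back to the base. */
	for (frame = scs_top - 1; frame >= scs_base && len < size; frame--)
		store[len++] = ptrauth_strip_insn_pac(READ_ONCE_NOCHECK(*frame));

	return len;
}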
---
To deal with this last thing, we could save the interrupted frame address
in another per-CPU variable. I'll look into implementing this for v3.
I decided to postpone the changes to stack depot that avoid copying
frames twice until a planned upcoming update for stack depot.
Changes v1->v2:
- Provide a kernel-wide stack_trace_save_shadow() interface for collecting
stack traces from shadow stack.
- Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
- Get SCS pointer from x18, as per-task value is meant to save the SCS
value on CPU switches.
- Collect stack frames from SDEI and IRQ contexts.
Andrey Konovalov (4):
stacktrace: add interface based on shadow call stack
arm64, scs: save scs_sp values per-cpu when switching stacks
arm64: implement stack_trace_save_shadow
kasan: use stack_trace_save_shadow
arch/Kconfig | 6 +++
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/assembler.h | 12 +++++
arch/arm64/include/asm/scs.h | 13 ++++-
arch/arm64/kernel/entry.S | 28 ++++++++--
arch/arm64/kernel/irq.c | 4 +-
arch/arm64/kernel/sdei.c | 5 +-
arch/arm64/kernel/stacktrace.c | 83 ++++++++++++++++++++++++++++++
include/linux/stacktrace.h | 15 ++++++
kernel/stacktrace.c | 21 ++++++++
mm/kasan/common.c | 9 ++--
11 files changed, 183 insertions(+), 14 deletions(-)
--
2.25.1
From: Andrey Konovalov <[email protected]>
Add a new interface stack_trace_save_shadow() for collecting stack traces
by copying frames from the Shadow Call Stack.
Collecting stack traces this way is significantly faster: boot time
of a defconfig build with KASAN enabled gets decreased by ~30%.
The few patches following this one add an implementation of
stack_trace_save_shadow() for arm64.
The implementation of the added interface is not meant to use
stack_trace_consume_fn to avoid making a function call for each
collected frame to further improve performance.
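To illustrate the intended use (a sketch only: save_stack() below is a
hypothetical caller, and the actual KASAN hook-up is done in the last patch
of this series), a user of the interface can fall back to the regular
unwinder when the shadow-based collection is unavailable:

/* Hypothetical caller; not part of this patch. */
static unsigned int save_stack(unsigned long *entries, unsigned int size)
{
	int nr;

	/* Returns -ENOSYS when CONFIG_HAVE_SHADOW_STACKTRACE is not set. */
	nr = stack_trace_save_shadow(entries, size, 0);
	if (nr > 0)
		return nr;

	/* Fall back to the unwinder-based collection. */
	return stack_trace_save(entries, size, 0);
}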
Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/Kconfig | 6 ++++++
include/linux/stacktrace.h | 15 +++++++++++++++
kernel/stacktrace.c | 21 +++++++++++++++++++++
3 files changed, 42 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index e12a4268c01d..207c1679c53a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1041,6 +1041,12 @@ config HAVE_RELIABLE_STACKTRACE
arch_stack_walk_reliable() function which only returns a stack trace
if it can guarantee the trace is reliable.
+config HAVE_SHADOW_STACKTRACE
+ bool
+ help
+ If this is set, the architecture provides the arch_stack_walk_shadow()
+ function, which collects the stack trace from the shadow call stack.
+
config HAVE_ARCH_HASH
bool
default n
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 97455880ac41..b74d1e42e157 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -60,6 +60,9 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry, void *cookie,
void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
const struct pt_regs *regs);
+
+int arch_stack_walk_shadow(unsigned long *store, unsigned int size,
+ unsigned int skipnr);
#endif /* CONFIG_ARCH_STACKWALK */
#ifdef CONFIG_STACKTRACE
@@ -108,4 +111,16 @@ static inline int stack_trace_save_tsk_reliable(struct task_struct *tsk,
}
#endif
+#if defined(CONFIG_STACKTRACE) && defined(CONFIG_HAVE_SHADOW_STACKTRACE)
+int stack_trace_save_shadow(unsigned long *store, unsigned int size,
+ unsigned int skipnr);
+#else
+static inline int stack_trace_save_shadow(unsigned long *store,
+ unsigned int size,
+ unsigned int skipnr)
+{
+ return -ENOSYS;
+}
+#endif
+
#endif /* __LINUX_STACKTRACE_H */
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9ed5ce989415..fe305861fd55 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -237,6 +237,27 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size)
}
#endif
+#ifdef CONFIG_HAVE_SHADOW_STACKTRACE
+/**
+ * stack_trace_save_shadow - Save a stack trace based on shadow call stack
+ * @store: Pointer to the storage array
+ * @size: Size of the storage array
+ * @skipnr: Number of entries to skip at the start of the stack trace
+ *
+ * Return: Number of trace entries stored.
+ */
+int stack_trace_save_shadow(unsigned long *store, unsigned int size,
+ unsigned int skipnr)
+{
+ /*
+ * Do not use stack_trace_consume_fn to avoid making a function
+ * call for each collected frame to improve performance.
+ * Skip + 1 frame to skip stack_trace_save_shadow.
+ */
+ return arch_stack_walk_shadow(store, size, skipnr + 1);
+}
+#endif
+
#else /* CONFIG_ARCH_STACKWALK */
/*
--
2.25.1
From: Andrey Konovalov <[email protected]>
When an interrupt happens, the current Shadow Call Stack (SCS) pointer
is switched to a per-interrupt one stored in a per-CPU variable. The old
pointer is then saved on the normal stack and restored when the interrupt
is handled.
To collect the current stack trace based on SCS when the interrupt is
being handled, we need to know the SCS pointers that belonged to the
task and potentially other interrupts that were interrupted.
Instead of trying to retrieve the SCS pointers from the stack, change
interrupt handlers (for hard IRQ, Normal and Critical SDEI) to save the
previous SCS pointer in a per-CPU variable.
Note that interrupts stack. A task can be interrupted by a hard IRQ,
which can then be interrupted by a normal SDEI, etc. This is handled by
using a separate per-CPU variable for each interrupt type.
Also reset the saved SCS pointer when exiting the interrupt. This allows
checking whether we should include any interrupt frames when collecting
the stack trace. While we could use in_hardirq(), there seems to be no
easy way to check whether we are in an SDEI handler. Directly checking
the per-CPU variables for being non-zero is more resilient.
Also expose both the added saved SCS variables and the existing SCS
base variables in arch/arm64/include/asm/scs.h so that the stack trace
collection implementation can use them.
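As a minimal sketch (not part of this patch; on_interrupt_scs() is a
hypothetical helper), the collection code can tell that it is running on an
interrupt SCS simply by checking the saved pointers for being non-zero:

static bool on_interrupt_scs(void)
{
	return this_cpu_read(irq_shadow_call_stack_saved_ptr) ||
	       this_cpu_read(sdei_shadow_call_stack_normal_saved_ptr) ||
	       this_cpu_read(sdei_shadow_call_stack_critical_saved_ptr);
}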
Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/include/asm/assembler.h | 12 ++++++++++++
arch/arm64/include/asm/scs.h | 13 ++++++++++++-
arch/arm64/kernel/entry.S | 28 ++++++++++++++++++++++++----
arch/arm64/kernel/irq.c | 4 +---
arch/arm64/kernel/sdei.c | 5 ++---
5 files changed, 51 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 8c5a61aeaf8e..ca018e981d13 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -270,6 +270,18 @@ alternative_endif
ldr \dst, [\dst, \tmp]
.endm
+ /*
+ * @src: Register whose value gets stored in sym
+ * @sym: The name of the per-cpu variable
+ * @tmp0: Scratch register
+ * @tmp1: Another scratch register
+ */
+ .macro str_this_cpu src, sym, tmp0, tmp1
+ adr_l \tmp0, \sym
+ get_this_cpu_offset \tmp1
+ str \src, [\tmp0, \tmp1]
+ .endm
+
/*
* vma_vm_mm - get mm pointer from vma pointer (vma->vm_mm)
*/
diff --git a/arch/arm64/include/asm/scs.h b/arch/arm64/include/asm/scs.h
index 8297bccf0784..2bb2b32f787b 100644
--- a/arch/arm64/include/asm/scs.h
+++ b/arch/arm64/include/asm/scs.h
@@ -24,6 +24,17 @@
.endm
#endif /* CONFIG_SHADOW_CALL_STACK */
-#endif /* __ASSEMBLY __ */
+#else /* __ASSEMBLY__ */
+
+#include <linux/percpu.h>
+
+DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
+DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_saved_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_saved_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_saved_ptr);
+
+#endif /* __ASSEMBLY__ */
#endif /* _ASM_SCS_H */
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ede028dee81b..1c62fecda172 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -880,7 +880,8 @@ NOKPROBE(ret_from_fork)
*/
SYM_FUNC_START(call_on_irq_stack)
#ifdef CONFIG_SHADOW_CALL_STACK
- stp scs_sp, xzr, [sp, #-16]!
+ /* Save the current SCS pointer and load the per-IRQ one. */
+ str_this_cpu scs_sp, irq_shadow_call_stack_saved_ptr, x15, x17
ldr_this_cpu scs_sp, irq_shadow_call_stack_ptr, x17
#endif
/* Create a frame record to save our LR and SP (implicit in FP) */
@@ -902,7 +903,9 @@ SYM_FUNC_START(call_on_irq_stack)
mov sp, x29
ldp x29, x30, [sp], #16
#ifdef CONFIG_SHADOW_CALL_STACK
- ldp scs_sp, xzr, [sp], #16
+ /* Restore saved SCS pointer and reset the saved value. */
+ ldr_this_cpu scs_sp, irq_shadow_call_stack_saved_ptr, x17
+ str_this_cpu xzr, irq_shadow_call_stack_saved_ptr, x15, x17
#endif
ret
SYM_FUNC_END(call_on_irq_stack)
@@ -1024,11 +1027,16 @@ SYM_CODE_START(__sdei_asm_handler)
#endif
#ifdef CONFIG_SHADOW_CALL_STACK
- /* Use a separate shadow call stack for normal and critical events */
+ /*
+ * Use a separate shadow call stack for normal and critical events.
+ * Save the current SCS pointer and load the per-SDEI one.
+ */
cbnz w4, 3f
+ str_this_cpu src=scs_sp, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp0=x5, tmp1=x6
ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_normal_ptr, tmp=x6
b 4f
-3: ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_ptr, tmp=x6
+3: str_this_cpu src=scs_sp, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp0=x5, tmp1=x6
+ ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_ptr, tmp=x6
4:
#endif
@@ -1062,6 +1070,18 @@ SYM_CODE_START(__sdei_asm_handler)
ldp lr, x1, [x4, #SDEI_EVENT_INTREGS + S_LR]
mov sp, x1
+#ifdef CONFIG_SHADOW_CALL_STACK
+ /* Restore saved SCS pointer and reset the saved value. */
+ ldrb w5, [x4, #SDEI_EVENT_PRIORITY]
+ cbnz w5, 5f
+ ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp=x6
+ str_this_cpu src=xzr, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp0=x5, tmp1=x6
+ b 6f
+5: ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp=x6
+ str_this_cpu src=xzr, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp0=x5, tmp1=x6
+6:
+#endif
+
mov x1, x0 // address to complete_and_resume
/* x0 = (x0 <= SDEI_EV_FAILED) ?
* EVENT_COMPLETE:EVENT_COMPLETE_AND_RESUME
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index bda49430c9ea..4199f900714a 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -28,11 +28,9 @@ DEFINE_PER_CPU(struct nmi_ctx, nmi_contexts);
DEFINE_PER_CPU(unsigned long *, irq_stack_ptr);
-
-DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
-
#ifdef CONFIG_SHADOW_CALL_STACK
DEFINE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
+DEFINE_PER_CPU(unsigned long *, irq_shadow_call_stack_saved_ptr);
#endif
static void init_irq_scs(void)
diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index d20620a1c51a..269adcb9e854 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -39,12 +39,11 @@ DEFINE_PER_CPU(unsigned long *, sdei_stack_normal_ptr);
DEFINE_PER_CPU(unsigned long *, sdei_stack_critical_ptr);
#endif
-DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
-DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
-
#ifdef CONFIG_SHADOW_CALL_STACK
DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
+DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_saved_ptr);
DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
+DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_saved_ptr);
#endif
static void _free_sdei_stack(unsigned long * __percpu *ptr, int cpu)
--
2.25.1
Hi,
I love your patch! Yet something to improve:
[auto build test ERROR on next-20220323]
[also build test ERROR on v5.17]
[cannot apply to arm64/for-next/core hnaz-mm/master linus/master v5.17 v5.17-rc8 v5.17-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/andrey-konovalov-linux-dev/kasan-arm64-scs-stacktrace-collect-stack-traces-from-Shadow-Call-Stack/20220323-233436
base: b61581ae229d8eb9f21f8753be3f4011f7692384
config: arm64-allyesconfig (https://download.01.org/0day-ci/archive/20220325/[email protected]/config)
compiler: aarch64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/da5bedb1ac7aa0b303f6d996d306e675860b6e12
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review andrey-konovalov-linux-dev/kasan-arm64-scs-stacktrace-collect-stack-traces-from-Shadow-Call-Stack/20220323-233436
git checkout da5bedb1ac7aa0b303f6d996d306e675860b6e12
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
arch/arm64/kernel/sdei.c: In function 'free_sdei_scs':
>> arch/arm64/kernel/sdei.c:124:33: error: 'sdei_shadow_call_stack_normal_ptr' undeclared (first use in this function)
124 | _free_sdei_scs(&sdei_shadow_call_stack_normal_ptr, cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/arm64/kernel/sdei.c:124:33: note: each undeclared identifier is reported only once for each function it appears in
>> arch/arm64/kernel/sdei.c:125:33: error: 'sdei_shadow_call_stack_critical_ptr' undeclared (first use in this function); did you mean 'sdei_stack_critical_ptr'?
125 | _free_sdei_scs(&sdei_shadow_call_stack_critical_ptr, cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| sdei_stack_critical_ptr
arch/arm64/kernel/sdei.c: In function 'init_sdei_scs':
arch/arm64/kernel/sdei.c:150:39: error: 'sdei_shadow_call_stack_normal_ptr' undeclared (first use in this function)
150 | err = _init_sdei_scs(&sdei_shadow_call_stack_normal_ptr, cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/arm64/kernel/sdei.c:153:39: error: 'sdei_shadow_call_stack_critical_ptr' undeclared (first use in this function); did you mean 'sdei_stack_critical_ptr'?
153 | err = _init_sdei_scs(&sdei_shadow_call_stack_critical_ptr, cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| sdei_stack_critical_ptr
vim +/sdei_shadow_call_stack_normal_ptr +124 arch/arm64/kernel/sdei.c
ac20ffbb0279aae Sami Tolvanen 2020-11-30 118
ac20ffbb0279aae Sami Tolvanen 2020-11-30 119 static void free_sdei_scs(void)
ac20ffbb0279aae Sami Tolvanen 2020-11-30 120 {
ac20ffbb0279aae Sami Tolvanen 2020-11-30 121 int cpu;
ac20ffbb0279aae Sami Tolvanen 2020-11-30 122
ac20ffbb0279aae Sami Tolvanen 2020-11-30 123 for_each_possible_cpu(cpu) {
ac20ffbb0279aae Sami Tolvanen 2020-11-30 @124 _free_sdei_scs(&sdei_shadow_call_stack_normal_ptr, cpu);
ac20ffbb0279aae Sami Tolvanen 2020-11-30 @125 _free_sdei_scs(&sdei_shadow_call_stack_critical_ptr, cpu);
ac20ffbb0279aae Sami Tolvanen 2020-11-30 126 }
ac20ffbb0279aae Sami Tolvanen 2020-11-30 127 }
ac20ffbb0279aae Sami Tolvanen 2020-11-30 128
--
0-DAY CI Kernel Test Service
https://01.org/lkp
On Wed, 23 Mar 2022 16:32:52 +0100 [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Add a new interface stack_trace_save_shadow() for collecting stack traces
> by copying frames from the Shadow Call Stack.
>
> Collecting stack traces this way is significantly faster: boot time
> of a defconfig build with KASAN enabled gets decreased by ~30%.
>
> The few patches following this one add an implementation of
> stack_trace_save_shadow() for arm64.
>
> The implementation of the added interface is not meant to use
> stack_trace_consume_fn to avoid making a function call for each
> collected frame to further improve performance.
>
> ...
>
> @@ -108,4 +111,16 @@ static inline int stack_trace_save_tsk_reliable(struct task_struct *tsk,
> }
> #endif
>
> +#if defined(CONFIG_STACKTRACE) && defined(CONFIG_HAVE_SHADOW_STACKTRACE)
> +int stack_trace_save_shadow(unsigned long *store, unsigned int size,
> + unsigned int skipnr);
> +#else
> +static inline int stack_trace_save_shadow(unsigned long *store,
> + unsigned int size,
> + unsigned int skipnr)
> +{
> + return -ENOSYS;
> +}
> +#endif
checkpatch sayeth "WARNING: ENOSYS means 'invalid syscall nr' and
nothing else".
checkpatch also picked up a typo in a changelog. Useful thing to run,
is checkpatch.
On Wed, 23 Mar 2022 at 16:33, <[email protected]> wrote:
>
> From: Andrey Konovalov <[email protected]>
>
> kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
>
> Currently, KASAN always uses the normal stack trace collection routines,
> which rely on the unwinder, when saving alloc and free stack traces.
>
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack whenever it is enabled. This reduces
> boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
>
> Stack traces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
>
> Note that the frame of the interrupted function is not included into
> the stack trace, as it is not yet saved on the SCS when an interrupt
> happens.
>
> ---
>
> To deal with this last thing, we could save the interrupted frame address
> in another per-CPU variable. I'll look into implementing this for v3.
>
> I decided to postpone the changes to stack depot that avoid copying
> frames twice until a planned upcoming update for stack depot.
That's fair.
> Changes v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
> value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.
Do any of these new changes introduce new (noticeable) overhead (in
particular patch 2)?
On Mon, Mar 28, 2022 at 2:36 PM Marco Elver <[email protected]> wrote:
>
> > Changes v1->v2:
> > - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> > stack traces from shadow stack.
> > - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> > - Get SCS pointer from x18, as per-task value is meant to save the SCS
> > value on CPU switches.
> > - Collect stack frames from SDEI and IRQ contexts.
>
> Do any of these new changes introduce new (noticeable) overhead (in
> particular patch 2)?
I'll measure the overheads and include the results into v3. Thanks!
On Fri, Mar 25, 2022 at 9:46 PM Andrew Morton <[email protected]> wrote:
>
> On Wed, 23 Mar 2022 16:32:52 +0100 [email protected] wrote:
>
> > From: Andrey Konovalov <[email protected]>
> >
> > Add a new interface stack_trace_save_shadow() for collecting stack traces
> > by copying frames from the Shadow Call Stack.
> >
> > Collecting stack traces this way is significantly faster: boot time
> > of a defconfig build with KASAN enabled gets decreased by ~30%.
> >
> > The few patches following this one add an implementation of
> > stack_trace_save_shadow() for arm64.
> >
> > The implementation of the added interface is not meant to use
> > stack_trace_consume_fn to avoid making a function call for each
> > collected frame to further improve performance.
> >
> > ...
> >
> > @@ -108,4 +111,16 @@ static inline int stack_trace_save_tsk_reliable(struct task_struct *tsk,
> > }
> > #endif
> >
> > +#if defined(CONFIG_STACKTRACE) && defined(CONFIG_HAVE_SHADOW_STACKTRACE)
> > +int stack_trace_save_shadow(unsigned long *store, unsigned int size,
> > + unsigned int skipnr);
> > +#else
> > +static inline int stack_trace_save_shadow(unsigned long *store,
> > + unsigned int size,
> > + unsigned int skipnr)
> > +{
> > + return -ENOSYS;
> > +}
> > +#endif
>
> checkpatch sayeth "WARNING: ENOSYS means 'invalid syscall nr' and
> nothing else".
This is done deliberately to mimic stack_trace_save_tsk_reliable().
I'll mention this in the changelog.
> checkpatch also picked up a typo in a changelog. Useful thing to run,
> is checkpatch.
My bad, I ran it on diff instead of the patch. Will fix in v3. Thanks!
On Tue, Mar 29, 2022 at 8:36 PM Andrey Konovalov <[email protected]> wrote:
>
> On Mon, Mar 28, 2022 at 2:36 PM Marco Elver <[email protected]> wrote:
> >
> > > Changes v1->v2:
> > > - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> > > stack traces from shadow stack.
> > > - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> > > - Get SCS pointer from x18, as per-task value is meant to save the SCS
> > > value on CPU switches.
> > > - Collect stack frames from SDEI and IRQ contexts.
> >
> > Do any of these new changes introduce new (noticeable) overhead (in
> > particular patch 2)?
>
> I'll measure the overheads and include the results into v3. Thanks!
Hm, looks like the overhead is overly significant: ~5%. I'll explore a
different approach in v3 instead.
Hi Andrey,
On Wed, Mar 23, 2022 at 04:32:51PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
>
> Currently, KASAN always uses the normal stack trace collection routines,
> which rely on the unwinder, when saving alloc and free stack traces.
>
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack whenever it is enabled. This reduces
> boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
That is an impressive number. TBH, I'm shocked that this has *that* much of an
improvement, and I suspect this means we're doing something unnecessarily
expensive in the regular unwinder.
I've given some specific comments on patches, but at a high level, I don't want
to add yet another unwind mechanism. For maintenance and correctness reasons,
we've spent the last few years consolidating various unwinders, which this
unfortunately goes against.
I see that there are a number of cases this unwinder will fall afoul of (e.g.
kretprobes and ftrace graph trampolines), and making those work correctly will
require changes elsewhere (e.g. as we rely upon a snapshot of the FP to
disambiguate cases today).
I'm also very much not keen on having to stash things in the entry assembly for
this distinct unwinder.
Going forward, I'm also planning on making changes to the way we unwind across
exception boundaries (e.g. to report the LR and FP), and as that depends on
finding the pt_regs based on the FP, that's not going to work with SCS.
So at a high level, I don't want to add an SCS based unwinder.
However, I'm very much open to how we could improve the standard unwinder to be
faster, which would be more generally beneficial. I can see that there are some
things we could reasonably do with simple refactoring.
Thanks,
Mark.
> Stack traces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
>
> Note that the frame of the interrupted function is not included into
> the stack trace, as it is not yet saved on the SCS when an interrupt
> happens.
>
> ---
>
> To deal with this last thing, we could save the interrupted frame address
> in another per-CPU variable. I'll look into implementing this for v3.
>
> I decided to postpone the changes to stack depot that avoid copying
> frames twice until a planned upcoming update for stack depot.
>
> Changes v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
> value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.
>
> Andrey Konovalov (4):
> stacktrace: add interface based on shadow call stack
> arm64, scs: save scs_sp values per-cpu when switching stacks
> arm64: implement stack_trace_save_shadow
> kasan: use stack_trace_save_shadow
>
> arch/Kconfig | 6 +++
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/assembler.h | 12 +++++
> arch/arm64/include/asm/scs.h | 13 ++++-
> arch/arm64/kernel/entry.S | 28 ++++++++--
> arch/arm64/kernel/irq.c | 4 +-
> arch/arm64/kernel/sdei.c | 5 +-
> arch/arm64/kernel/stacktrace.c | 83 ++++++++++++++++++++++++++++++
> include/linux/stacktrace.h | 15 ++++++
> kernel/stacktrace.c | 21 ++++++++
> mm/kasan/common.c | 9 ++--
> 11 files changed, 183 insertions(+), 14 deletions(-)
>
> --
> 2.25.1
>
On Wed, Mar 23, 2022 at 04:32:53PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> When an interrupt happens, the current Shadow Call Stack (SCS) pointer
> is switched to a per-interrupt one stored in a per-CPU variable. The old
> pointer is then saved on the normal stack and restored when the interrupt
> is handled.
>
> To collect the current stack trace based on SCS when the interrupt is
> being handled, we need to know the SCS pointers that belonged to the
> task and potentially other interrupts that were interrupted.
>
> Instead of trying to retrieve the SCS pointers from the stack, change
> interrupt handlers (for hard IRQ, Normal and Critical SDEI) to save the
> previous SCS pointer in a per-CPU variable.
I'm *really* not keen on *always* poking this in the entry code for the
uncommon case of unwind. It complicates the entry code and means we're always
paying a cost for potentially no benefit. At a high-level, I don't think this
is the right approach.
For the regular unwinder, I want to rework things such that we can identify
exception boundaries and look into the regs (e.g. so that we can recover the
PC+LR+FP and avoid duplicating part of this in a frame record), and I'd much
prefer that we did the same here.
Thanks,
Mark.
> Note that interrupts stack. A task can be interrupted by a hard IRQ,
> which can then be interrupted by a normal SDEI, etc. This is handled by
> using a separate per-CPU variable for each interrupt type.
>
> Also reset the saved SCS pointer when exiting the interrupt. This allows
> checking whether we should include any interrupt frames when collecting
> the stack trace. While we could use in_hardirq(), there seems to be no
> easy way to check whether we are in an SDEI handler. Directly checking
> the per-CPU variables for being non-zero is more resilient.
>
> Also expose both the added saved SCS variables and the existing SCS
> base variables in arch/arm64/include/asm/scs.h so that the stack trace
> collection implementation can use them.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---
> arch/arm64/include/asm/assembler.h | 12 ++++++++++++
> arch/arm64/include/asm/scs.h | 13 ++++++++++++-
> arch/arm64/kernel/entry.S | 28 ++++++++++++++++++++++++----
> arch/arm64/kernel/irq.c | 4 +---
> arch/arm64/kernel/sdei.c | 5 ++---
> 5 files changed, 51 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..ca018e981d13 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -270,6 +270,18 @@ alternative_endif
> ldr \dst, [\dst, \tmp]
> .endm
>
> + /*
> + * @src: Register whose value gets stored in sym
> + * @sym: The name of the per-cpu variable
> + * @tmp0: Scratch register
> + * @tmp1: Another scratch register
> + */
> + .macro str_this_cpu src, sym, tmp0, tmp1
> + adr_l \tmp0, \sym
> + get_this_cpu_offset \tmp1
> + str \src, [\tmp0, \tmp1]
> + .endm
> +
> /*
> * vma_vm_mm - get mm pointer from vma pointer (vma->vm_mm)
> */
> diff --git a/arch/arm64/include/asm/scs.h b/arch/arm64/include/asm/scs.h
> index 8297bccf0784..2bb2b32f787b 100644
> --- a/arch/arm64/include/asm/scs.h
> +++ b/arch/arm64/include/asm/scs.h
> @@ -24,6 +24,17 @@
> .endm
> #endif /* CONFIG_SHADOW_CALL_STACK */
>
> -#endif /* __ASSEMBLY __ */
> +#else /* __ASSEMBLY__ */
> +
> +#include <linux/percpu.h>
> +
> +DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
> +DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_saved_ptr);
> +DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
> +DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_saved_ptr);
> +DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
> +DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_saved_ptr);
> +
> +#endif /* __ASSEMBLY__ */
>
> #endif /* _ASM_SCS_H */
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ede028dee81b..1c62fecda172 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -880,7 +880,8 @@ NOKPROBE(ret_from_fork)
> */
> SYM_FUNC_START(call_on_irq_stack)
> #ifdef CONFIG_SHADOW_CALL_STACK
> - stp scs_sp, xzr, [sp, #-16]!
> + /* Save the current SCS pointer and load the per-IRQ one. */
> + str_this_cpu scs_sp, irq_shadow_call_stack_saved_ptr, x15, x17
> ldr_this_cpu scs_sp, irq_shadow_call_stack_ptr, x17
> #endif
> /* Create a frame record to save our LR and SP (implicit in FP) */
> @@ -902,7 +903,9 @@ SYM_FUNC_START(call_on_irq_stack)
> mov sp, x29
> ldp x29, x30, [sp], #16
> #ifdef CONFIG_SHADOW_CALL_STACK
> - ldp scs_sp, xzr, [sp], #16
> + /* Restore saved SCS pointer and reset the saved value. */
> + ldr_this_cpu scs_sp, irq_shadow_call_stack_saved_ptr, x17
> + str_this_cpu xzr, irq_shadow_call_stack_saved_ptr, x15, x17
> #endif
> ret
> SYM_FUNC_END(call_on_irq_stack)
> @@ -1024,11 +1027,16 @@ SYM_CODE_START(__sdei_asm_handler)
> #endif
>
> #ifdef CONFIG_SHADOW_CALL_STACK
> - /* Use a separate shadow call stack for normal and critical events */
> + /*
> + * Use a separate shadow call stack for normal and critical events.
> + * Save the current SCS pointer and load the per-SDEI one.
> + */
> cbnz w4, 3f
> + str_this_cpu src=scs_sp, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp0=x5, tmp1=x6
> ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_normal_ptr, tmp=x6
> b 4f
> -3: ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_ptr, tmp=x6
> +3: str_this_cpu src=scs_sp, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp0=x5, tmp1=x6
> + ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_ptr, tmp=x6
> 4:
> #endif
>
> @@ -1062,6 +1070,18 @@ SYM_CODE_START(__sdei_asm_handler)
> ldp lr, x1, [x4, #SDEI_EVENT_INTREGS + S_LR]
> mov sp, x1
>
> +#ifdef CONFIG_SHADOW_CALL_STACK
> + /* Restore saved SCS pointer and reset the saved value. */
> + ldrb w5, [x4, #SDEI_EVENT_PRIORITY]
> + cbnz w5, 5f
> + ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp=x6
> + str_this_cpu src=xzr, sym=sdei_shadow_call_stack_normal_saved_ptr, tmp0=x5, tmp1=x6
> + b 6f
> +5: ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp=x6
> + str_this_cpu src=xzr, sym=sdei_shadow_call_stack_critical_saved_ptr, tmp0=x5, tmp1=x6
> +6:
> +#endif
> +
> mov x1, x0 // address to complete_and_resume
> /* x0 = (x0 <= SDEI_EV_FAILED) ?
> * EVENT_COMPLETE:EVENT_COMPLETE_AND_RESUME
> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
> index bda49430c9ea..4199f900714a 100644
> --- a/arch/arm64/kernel/irq.c
> +++ b/arch/arm64/kernel/irq.c
> @@ -28,11 +28,9 @@ DEFINE_PER_CPU(struct nmi_ctx, nmi_contexts);
>
> DEFINE_PER_CPU(unsigned long *, irq_stack_ptr);
>
> -
> -DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
> -
> #ifdef CONFIG_SHADOW_CALL_STACK
> DEFINE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
> +DEFINE_PER_CPU(unsigned long *, irq_shadow_call_stack_saved_ptr);
> #endif
>
> static void init_irq_scs(void)
> diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
> index d20620a1c51a..269adcb9e854 100644
> --- a/arch/arm64/kernel/sdei.c
> +++ b/arch/arm64/kernel/sdei.c
> @@ -39,12 +39,11 @@ DEFINE_PER_CPU(unsigned long *, sdei_stack_normal_ptr);
> DEFINE_PER_CPU(unsigned long *, sdei_stack_critical_ptr);
> #endif
>
> -DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
> -DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
> -
> #ifdef CONFIG_SHADOW_CALL_STACK
> DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
> +DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_saved_ptr);
> DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
> +DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_saved_ptr);
> #endif
>
> static void _free_sdei_stack(unsigned long * __percpu *ptr, int cpu)
> --
> 2.25.1
>
On Wed, Mar 23, 2022 at 04:32:52PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Add a new interface stack_trace_save_shadow() for collecting stack traces
> by copying frames from the Shadow Call Stack.
>
> Collecting stack traces this way is significantly faster: boot time
> of a defconfig build with KASAN enabled gets decreased by ~30%.
Hmm... just to check, do you know if that's just because of the linear copy, or
because we're skipping other work we have to do in the regular stacktrace?
> The few patches following this one add an implementation of
> stack_trace_save_shadow() for arm64.
>
> The implementation of the added interface is not meant to use
> stack_trace_consume_fn to avoid making a function call for each
> collected frame to further improve performance.
... because we could easily provide an inline-optimized stack copy *without*
having to write a distinct unwinder, and I'd *really* like to avoid having a
bunch of distinct unwinders for arm64, as it really hinders maintenance. We're
working on fixing/improving the arm64 unwinder for things like
RELIABLE_STACKTRACE, and I know that some of that work is non-trivial to make
work with an SCS-based unwind rather than an FP-based unwind, and/or will
undermine the saving anyway.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---
> arch/Kconfig | 6 ++++++
> include/linux/stacktrace.h | 15 +++++++++++++++
> kernel/stacktrace.c | 21 +++++++++++++++++++++
> 3 files changed, 42 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index e12a4268c01d..207c1679c53a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1041,6 +1041,12 @@ config HAVE_RELIABLE_STACKTRACE
> arch_stack_walk_reliable() function which only returns a stack trace
> if it can guarantee the trace is reliable.
>
> +config HAVE_SHADOW_STACKTRACE
> + bool
> + help
> + If this is set, the architecture provides the arch_stack_walk_shadow()
> + function, which collects the stack trace from the shadow call stack.
> +
> config HAVE_ARCH_HASH
> bool
> default n
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 97455880ac41..b74d1e42e157 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -60,6 +60,9 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry, void *cookie,
>
> void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
> const struct pt_regs *regs);
> +
> +int arch_stack_walk_shadow(unsigned long *store, unsigned int size,
> + unsigned int skipnr);
> #endif /* CONFIG_ARCH_STACKWALK */
>
> #ifdef CONFIG_STACKTRACE
> @@ -108,4 +111,16 @@ static inline int stack_trace_save_tsk_reliable(struct task_struct *tsk,
> }
> #endif
>
> +#if defined(CONFIG_STACKTRACE) && defined(CONFIG_HAVE_SHADOW_STACKTRACE)
> +int stack_trace_save_shadow(unsigned long *store, unsigned int size,
> + unsigned int skipnr);
> +#else
> +static inline int stack_trace_save_shadow(unsigned long *store,
> + unsigned int size,
> + unsigned int skipnr)
> +{
> + return -ENOSYS;
> +}
> +#endif
> +
> #endif /* __LINUX_STACKTRACE_H */
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 9ed5ce989415..fe305861fd55 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -237,6 +237,27 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size)
> }
> #endif
>
> +#ifdef CONFIG_HAVE_SHADOW_STACKTRACE
> +/**
> + * stack_trace_save_shadow - Save a stack trace based on shadow call stack
> + * @store: Pointer to the storage array
> + * @size: Size of the storage array
> + * @skipnr: Number of entries to skip at the start of the stack trace
> + *
> + * Return: Number of trace entries stored.
> + */
> +int stack_trace_save_shadow(unsigned long *store, unsigned int size,
> + unsigned int skipnr)
> +{
> + /*
> + * Do not use stack_trace_consume_fn to avoid making a function
> + * call for each collected frame to improve performance.
> + * Skip + 1 frame to skip stack_trace_save_shadow.
> + */
> + return arch_stack_walk_shadow(store, size, skipnr + 1);
> +}
> +#endif
If we really need this, can we make it an __always_inline in a header so that
we can avoid the skip? Generally the skipping is problematic due to
inlining/outlining and LTO, and I'd like to avoid adding more of it
unnecessarily.
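Something along these lines (sketch only, untested) in <linux/stacktrace.h>
should be enough:

/*
 * Being __always_inline, this adds no frame of its own, so no skipnr
 * adjustment is needed.
 */
static __always_inline int stack_trace_save_shadow(unsigned long *store,
						   unsigned int size,
						   unsigned int skipnr)
{
	return arch_stack_walk_shadow(store, size, skipnr);
}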
Thanks,
Mark.
> +
> #else /* CONFIG_ARCH_STACKWALK */
>
> /*
> --
> 2.25.1
>
On Thu, Mar 31, 2022 at 10:54:08AM +0100, Mark Rutland wrote:
> On Wed, Mar 23, 2022 at 04:32:51PM +0100, [email protected] wrote:
> > From: Andrey Konovalov <[email protected]>
> >
> > kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
> >
> > Currently, KASAN always uses the normal stack trace collection routines,
> > which rely on the unwinder, when saving alloc and free stack traces.
> >
> > Instead of invoking the unwinder, collect the stack trace by copying
> > frames from the Shadow Call Stack whenever it is enabled. This reduces
> > boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
>
> That is an impressive number. TBH, I'm shocked that this has *that* much of an
> improvement, and I suspect this means we're doing something unnecessarily
> expensive in the regular unwinder.
I've had a quick look into this, to see what we could do to improve the regular
unwinder, but I can't reproduce that 30% number.
In local testing the worst case I could get to was 6-13% (with both the
stacktrace *and* stackdepot logic hacked out entirely).
I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
+ SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
(commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
kernel bootargs, timing the whole run from the outside with "perf stat --null".
The 6% figure is if I count boot as a whole including VM startup and teardown
(i.e. an under-estimate of the proportion), the 13% figure is if I subtract a
baseline timing from a run without KASAN (i.e. an over-estimate of the
proportion).
Could you let me know how you're measuring this, and which platform+config
you're using?
I'll have a play with some configs in case there's a pathological
configuration, but if you could let me know how/what you're testing that'd be a
great help.
Thanks,
Mark.
On Thu, Mar 31, 2022 at 11:54 AM Mark Rutland <[email protected]> wrote:
>
> That is an impressive number. TBH, I'm shocked that this has *that* much of an
> improvement, and I suspect this means we're doing something unnecessarily
> expensive in the regular unwinder.
>
> I've given some specific comments on patches, but at a high level, I don't want
> to add yet another unwind mechanism. For maintenance and correctness reasons,
> we've spent the last few years consolidating various unwinders, which this
> unfortunately goes against.
>
> I see that there are a number of cases this unwinder will fall afoul of (e.g.
> kretprobes and ftrace graph trampolines), and making those work correctly will
> require changes elsewhere (e.g. as we rely upon a snapshot of the FP to
> disambiguate cases today).
Do I understand correctly that kretprobes and ftrace modify frames
saved on SCS? So, if either is enabled, SCS frames might contain their
addresses instead of actual PCs?
If so, this is good enough for our use case. Having kretprobes or
ftrace enabled is an unusual setting and there's no requirement to
support it.
The goal is to have stack trace collection working in most cases
during normal usage of an Android device. Not being feature-complete
and not covering all possible peculiar cases is fine.
> I'm also very much not keen on having to stash things in the entry assembly for
> this distinct unwinder.
I'll drop these changes, I'll respond on that patch.
> Going forward, I'm also planning on making changes to the way we unwind across
> exception boundaries (e.g. to report the LR and FP), and as that depends on
> finding the pt_regs based on the FP, that's not going to work with SCS.
>
> So at a high level, I don't want to add an SCS based unwinder.
>
> However, I'm very much open to how we could improve the standard unwinder to be
> faster, which would be more generally beneficial. I can see that there are some
> things we could reasonably do with simple refactoring.
The intention of adding an SCS-based unwinder is to use it in production
together with MTE-based KASAN. Thus, it needs to be as fast as
possible. I doubt even a very optimized FP-based unwinder will compare
with a simple loop over SCS frames.
It seems a pity to let the kernel maintain the current call trace via
SCS and then not to use it to collect stack traces.
Would it be acceptable if we keep the SCS unwinder code in mm/kasan
and not integrate with the common stacktrace mechanisms?
Thanks!
On Thu, Mar 31, 2022 at 11:19 AM Mark Rutland <[email protected]> wrote:
>
> > Collecting stack traces this way is significantly faster: boot time
> > of a defconfig build with KASAN enabled gets decreased by ~30%.
>
> Hmm... just to check, do you know if that's just because of the linear copy, or
> because we're skipping other work we have to do in the regular stacktrace?
No, I haven't looked into this.
> > The implementation of the added interface is not meant to use
> > stack_trace_consume_fn to avoid making a function call for each
> > collected frame to further improve performance.
>
> ... because we could easily provide an inline-optimized stack copy *without*
> having to write a distinct unwinder, and I'd *really* like to avoid having a
> bunch of distinct unwinders for arm64, as it really hinders maintenance. We're
> working on fixing/improving the arm64 unwinder for things like
> RELIABLE_STACKTRACE, and I know that some of that work is non-trivial to make
> work with an SCS-based unwind rather than an FP-based unwind, and/or will
> undermine the saving anyway.
Responded on the cover letter wrt this.
> > +int stack_trace_save_shadow(unsigned long *store, unsigned int size,
> > + unsigned int skipnr)
> > +{
> > + /*
> > + * Do not use stack_trace_consume_fn to avoid making a function
> > + * call for each collected frame to improve performance.
> > + * Skip + 1 frame to skip stack_trace_save_shadow.
> > + */
> > + return arch_stack_walk_shadow(store, size, skipnr + 1);
> > +}
> > +#endif
>
> If we really need this, can we make it an __always_inline in a header so that
> we can avoid the skip? Generally the skipping is problematic due to
> inlining/outlining and LTO, and I'd like to avoid adding more of it
> unnecessarily.
Yes, I think this should work.
However, if we keep the implementation in mm/kasan, this integration
will not be required.
Thanks!
On Thu, Mar 31, 2022 at 2:39 PM Mark Rutland <[email protected]> wrote:
>
> I've had a quick look into this, to see what we could do to improve the regular
> unwinder, but I can't reproduce that 30% number.
>
> In local testing the worst case I could get to was 6-13% (with both the
> stacktrace *and* stackdepot logic hacked out entirely).
>
> I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
> + SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
> (commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
> KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
> kernel bootargs, timing the whole run from the outside with "perf stat --null".
>
> The 6% figure is if I count boot as a whole including VM startup and teardown
> (i.e. an under-estimate of the proportion), the 13% figure is if I subtract a
> baseline timing from a run without KASAN (i.e. an over-estimate of the
> proportion).
I think this is the reason for the limited improvement that you
observe. If you measure the time throughout VM startup and teardown,
you include the time required for userspace apps, which is irrelevant.
I measure boot time until a certain point during kernel boot. E.g.,
with the attached config, I measure the time until test_meminit starts
running.
It takes 6 seconds for the kernel to reach test_meminit as is, and 4
seconds with kasan_save_stack() commented out. Only commenting out
__stack_depot_save() gives 5.9 seconds, so stack_trace_save() is the
slow part.
> Could you let me know how you're measuring this, and which platform+config
> you're using?
I've attached the config that I use. It's essentially defconfig + SCS
+ KASAN + maybe a few other options.
> I'll have a play with some configs in case there's a pathological
> configuration, but if you could let me know how/what you're testing that'd be a
> great help.
Thanks!
On Thu, Mar 31, 2022 at 11:24 AM Mark Rutland <[email protected]> wrote:
>
> On Wed, Mar 23, 2022 at 04:32:53PM +0100, [email protected] wrote:
> > From: Andrey Konovalov <[email protected]>
> >
> > Instead of trying to retrieve the SCS pointers from the stack, change
> > interrupt handlers (for hard IRQ, Normal and Critical SDEI) to save the
> > previous SCS pointer in a per-CPU variable.
>
> I'm *really* not keen on *always* poking this in the entry code for the
> uncommon case of unwind. It complicates the entry code and means we're always
> paying a cost for potentially no benefit. At a high-level, I don't think this
> is the right approach.
This also gives a 5% slowdown, which is not acceptable.
What we can do instead is to not collect frames from the higher
exception levels at all. This would leave the SCS-based stack collection
method impaired, but this is probably fine for KASAN's use case:
currently, stack depot filters out higher-level frames anyway, so
KASAN never saves them. And the lower-level part of the stack trace is
enough to identify the allocation.
Thanks!
> For the regular unwinder, I want to rework things such that we can identify
> exception boundaries and look into the regs (e.g. so that we can recover the
> PC+LR+FP and avoid duplicating part of this in a frame record), and I'd much
> prefer that we did the same here.
On Tue, Apr 05, 2022 at 05:10:02PM +0200, Andrey Konovalov wrote:
> On Thu, Mar 31, 2022 at 2:39 PM Mark Rutland <[email protected]> wrote:
> >
> > I've had a quick look into this, to see what we could do to improve the regular
> > unwinder, but I can't reproduce that 30% number.
> >
> > In local testing the worst case I could get to was 6-13% (with both the
> > stacktrace *and* stackdepot logic hacked out entirely).
> >
> > I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
> > + SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
> > (commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
> > KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
> > kernel bootargs, timing the whole run from the outside with "perf stat --null".
> >
> > The 6% figure is if I count boot as a whole including VM startup and teardown
> > (i.e. an under-estimate of the proportion), the 13% figure is if I subtract a
> > baseline timing from a run without KASAN (i.e. an over-estimate of the
> > proportion).
>
> I think this is the reason for the limited improvement that you
> observe. If you measure the time throughout VM startup and teardown,
> you include the time required for userspace apps, which is irrelevant.
Yes, that's the case for the 6% figure. However I also mentioned how I
accounted for that to get the 13% figure, which does not include those
irrelevant timings (and is an over-estimate of that proportion).
I think the discrepancy lies elsewhere, e.g. kernel version, boot arguments,
platform (and hence driver / subsystem behaviour), etc.
Can you share any of those details? Are you able to test with a mainline
kernel, e.g. v5.18-rc1?
I have a bunch of numbers below, and I think those imply one of the following:
* In your test setup, there are significantly more allocs/frees for which a
stacktrace is being recorded. That could be down to the platform you're
testing on, and the drivers that are in use.
* In your test setup, for some reason, some aspect of the stacktrace is
significantly more expensive than in my setup. There's the potential that a
hardware quirk has some impact here, so knowing which hardware you're testing
on would be very helpful.
* There is a secondary effect at play. There are a number of potential things
here (e.g. console accesses, the number of stacktraces taken in interrupt
context, etc).
I'd like to figure out which (if any) of those apply.
> I measure boot time until a certain point during kernel boot. E.g.,
> with the attached config, I measure the time until test_meminit starts
> running.
How exactly are you measuring the point at which test_meminit() starts running?
Are you looking through dmesg, or passing some debug options? I ask because
that doesn't seem to dump anything into dmesg until at least one test has run.
FWIW, I was measuring the kernel boot up to the point we'd run the userspace
init program, by booting the kernel with:
init=/sbin/reboot -- -f
... which I think is more representative of the full boot time.
I can instead avoid that by not passing a filesystem and booting with:
panic=-1
... to trigger an instant reboot when we'd normally mount the filesystem, but
the numbers are similar either way for me.
I've followed the latter approach for my numbers below, since it's easier to
reproduce.
> It takes 6 seconds for the kernel to reach test_meminit as is, and 4
> seconds with kasan_save_stack() commented out. Only commenting out
> __stack_depot_save() gives 5.9 seconds, so stack_trace_save() is the
> slow part.
As above, how are you measuring this?
... and since your config has CONFIG_KASAN_KUNIT_TEST=y, which console
options (e.g. 'quiet', 'loglevel=') are you passing on the kernel command line?
I ask because in my local testing, that option results in a load of test
results being dumped to the console, and the time taken to do so dominates
everything else. Hacking out the stack sampling reduces the amount of data we
log to the console, and this secondary effect reduces boot time.
> > Could you let me know how you're measuring this, and which platform+config
> > you're using?
>
> I've attached the config that I use. It's essentially defconfig + SCS
> + KASAN + maybe a few other options.
>
> > I'll have a play with some configs in case there's a pathological
> > configuration, but if you could let me know how/what you're testing that'd be a
> > great help.
I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
get anywhere near your figures. I've tried to match toolchain versions with
what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
binary releases, and binutils from the kernel.org crosstool 11.1.0 release.
I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
output completely suppressed with 'quiet loglevel=0':
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
| loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
|
| 0.512626031 seconds time elapsed ( +- 0.26% )
|
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
| loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
|
| 0.523245952 seconds time elapsed ( +- 0.18% )
Then I tried the same with your config, without your patches:
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
| loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
|
| 1.994692366 seconds time elapsed ( +- 0.05% )
Then with your config, without your patches, with the stacktrace hacked out:
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
| -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
|
| 1.861823869 seconds time elapsed ( +- 0.05% )
If I use those numbers to estimate the proportion of time spent stacktracing,
with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
I get:
(1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
--------------------------------------------------------- = 0.09029788358
(1.994692366 - 0.523245952)
So roughly 9% when I try to maximize that figure. When actually poking hardware
and doing real work, that figure goes down. For example, if just using "quiet":
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
| loglevel=9 earlycon panic=-1 quiet' (50 runs):
|
| 4.653286475 seconds time elapsed ( +- 0.06% )
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
| -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
|
| 4.585750154 seconds time elapsed ( +- 0.05% )
Which gives an estimate of:
(4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
--------------------------------------------------------- = 0.01635245964
(4.653286475 - 0.523245952)
... or ~1.6% of time spent backtracing.
FWIW, applying your patches does show some benefit, but not as drastic as I was
expecting:
With console output suppressed:
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
| -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
|
| 1.920300410 seconds time elapsed ( +- 0.05% )
... down from ~9% to ~4%
With console output merely reduced:
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
|
| Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
| -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
|
| 4.611277833 seconds time elapsed ( +- 0.04% )
... down from 1.6% to 0.6%
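(For anyone wanting to reproduce this without my vmboot.sh wrapper: the
"Performance counter stats ... (50 runs)" summaries above are 'perf stat -r 50'
output, so something along the lines of the sketch below should be roughly
equivalent; the binary and image paths are just the ones from my runs.)

  # Rough sketch: measure mean boot time over 50 runs, pinned to one CPU.
  taskset -c 64 perf stat -r 50 -- \
      /home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 \
      -m 2048 -smp 1 -nographic -no-reboot \
      -machine virt,accel=kvm,gic-version=host -cpu host \
      -kernel /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image \
      -append 'loglevel=9 earlycon panic=-1 quiet loglevel=0'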
Given the above, I still think we need to understand this a bit better before we
consider pursuing the SCS unwinder, in light of the issues I laid out in my prior mails.
My hope is that we can improve the regular unwinder or other code such that
this becomes moot. I'm aware of a few things we could try, but given it's very
easy to sink a lot of time and effort into this, I'd like to first get some
more details, as above.
Thanks,
Mark.
On Thu, Apr 7, 2022 at 8:42 PM Mark Rutland <[email protected]> wrote:
>
> I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
> get anywhere near your figures. I've tried to match toolchain versions with
> what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
> binary releases, and binutils from the kernel.org crosstool 11.1.0 release.
>
> I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
> output completely suppressed with 'quiet loglevel=0':
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
> | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> | 0.512626031 seconds time elapsed ( +- 0.26% )
> |
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
> | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> | 0.523245952 seconds time elapsed ( +- 0.18% )
>
> Then I tried the same with your config, without your patches:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> | 1.994692366 seconds time elapsed ( +- 0.05% )
>
> Then with your config, without your patches, with the stacktrace hacked out:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> | 1.861823869 seconds time elapsed ( +- 0.05% )
>
> If I use those numbers to estimate the proportion of time spent stacktracing,
> with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
> I get:
>
> (1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
> --------------------------------------------------------- = 0.09029788358
> (1.994692366 - 0.523245952)
>
> So roughly 9% when I try to maximize that figure. When actually poking hardware
> and doing real work, that figure goes down. For example, if just using "quiet":
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> | loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> | 4.653286475 seconds time elapsed ( +- 0.06% )
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> | 4.585750154 seconds time elapsed ( +- 0.05% )
>
> Which gives an estimate of:
>
> (4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
> --------------------------------------------------------- = 0.01635245964
> (4.653286475 - 0.523245952)
>
> ... or ~1.6% of time spent backtracing.
>
> FWIW, applying your patches does show some benefit, but not as drastic as I was
> expecting:
>
> With console output suppressed:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> | 1.920300410 seconds time elapsed ( +- 0.05% )
>
> ... down from ~9% to ~4%
>
> With console output merely reduced:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
> |
> | Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> | 4.611277833 seconds time elapsed ( +- 0.04% )
>
> ... down from 1.6% to 0.6%
>
> Given the above, I still think we need to understand this a bit better before we
> consider pursuing the SCS unwinder, in light of the issues I laid out in my prior mails.
>
> My hope is that we can improve the regular unwinder or other code such that
> this becomes moot. I'm aware of a few things we could try, but given it's very
> easy to sink a lot of time and effort into this, I'd like to first get some
> more details, as above.
Hi Mark,
I'm about to publish v3, where I'll include a detailed description of
how I measured the performance.
Perhaps we see different performance numbers because you're using a
KVM-enabled VM on an Arm host and I'm using QEMU on an x86-64 host.
That said, it's suspicious that the difference is so drastic. I'll try
to get my hands on some Arm hardware in the next few days and do the
measurements there.
This new version also will not make any changes to the entry code, as
those changes add unwanted additional slowdown. It would be great if
you could check the performance impact of v3 with your setup.
Thanks!
On Wed, Apr 13, 2022 at 09:28:00PM +0200, Andrey Konovalov wrote:
> On Thu, Apr 7, 2022 at 8:42 PM Mark Rutland <[email protected]> wrote:
> > [...]
>
> Hi Mark,
>
> I'm about to publish v3, where I'll include a detailed description of
> how I measured the performance.
>
> Perhaps we see different performance numbers because you're using a
> KVM-enabled VM on an Arm host and I'm using QEMU on an x86-64 host.
Hold on; are you using QEMU in TCG mode? If so, that's in no way representative
of real HW performance, and there are operations it simply cannot make as fast
as HW can (e.g. pointer authentication using the architected QARMA variants).
> That said, it's suspicious that the difference is so drastic.
I'm not surprised at all. Some operations can be *orders of magnitude slower*
under TCG than on real HW, even when considered relative to other operations,
and this can drastically skew benchmarks. We recently hit a case where PACIASP
and AUTIASP were so slow under TCG mode that they appeared to be causing a boot
hang, and we eventually figured out that they were just *very* slow, adding
minutes to the boot time. Richard Henderson added options to QEMU to mitigate
that (either disabling authentication, or using an IMPLEMENTATION DEFINED
algorithm).
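For reference (and assuming a QEMU recent enough to include those changes,
roughly 6.0 onwards), they are exposed as properties of the 'max' CPU model:

  # Use the IMPLEMENTATION DEFINED (fast-to-emulate) PAC algorithm instead of QARMA:
  qemu-system-aarch64 -machine virt -cpu max,pauth-impdef=on ...

  # Or disable pointer authentication altogether:
  qemu-system-aarch64 -machine virt -cpu max,pauth=off ...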
The bottom line is that QEMU TCG mode is in no way representative of real-world
performance, and *cannot* be used for benchmarking.
I think we first need to understand *what* is so slow under QEMU TCG mode, and
*why* TCG mode performance matters. I suspect there are other ways we could
avoid this overhead *without* adding another unwinder, but even then we need a
justification for *why* we should care.
> I'll try to get my hands on some Arm hardware in the next few days and do the
> measurements there.
>
> This new version also will not make any changes to the entry code, as
> those changes add unwanted additional slowdown. It would be great if
> you could check the performance impact of v3 with your setup.
I'll take a look at the series, but as before I do not think we should add
another unwinder. As above, I *certainly* do not think we should add another
unwinder based on TCG performance.
Thanks,
Mark.