2022-04-14 08:04:44

by andrey.konovalov

[permalink] [raw]
Subject: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

From: Andrey Konovalov <[email protected]>

Currently, when saving alloc and free stack traces, KASAN uses the normal
stack trace collection routines, which rely on the unwinder.

Instead of invoking the unwinder, collect the stack trace by copying
frames from the Shadow Call Stack. This reduces boot time by ~30% for
all KASAN modes when Shadow Call Stack is enabled. See below for the
details of how the measurements were performed.

Stack staces are collected from the Shadow Call Stack via a new
stack_trace_save_shadow() interface.

Note that the implementation is best-effort and only works in certain
contexts. See patch #3 for details.

---

Changes
=======

v2->v3:
- Limit hardirq and drop SDEI support for performance and simplicity.
- Move stack_trace_save_shadow() implementation back to mm/kasan:
it's not mature enough to be used as a system-wide stack trace
collection replacement.
- Clarify -ENOSYS return value from stack_trace_save_shadow().
- Don't rename nr_entries to size in kasan_save_stack().
- Check return value of stack_trace_save_shadow() instead of checking
CONFIG_HAVE_SHADOW_STACKTRACE in kasan_save_stack().

v1->v2:
- Provide a kernel-wide stack_trace_save_shadow() interface for collecting
stack traces from shadow stack.
- Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
- Get SCS pointer from x18, as per-task value is meant to save the SCS
value on CPU switches.
- Collect stack frames from SDEI and IRQ contexts.

Perf
====

To measure performance impact, I used QEMU in full system emulation mode
on an x86-64 host.

As proposed by Mark, I passed no filesystem to QEMU and booted with panic=-1:

qemu-system-aarch64 \
-machine virt,mte=on -cpu max \
-m 2G -smp 1 -nographic \
-kernel ./xbins/Image \
-append "console=ttyAMA0 earlyprintk=serial panic=-1" \
-no-shutdown -no-reboot

Just in case, the QEMU version is:

$ qemu-system-aarch64 --version
QEMU emulator version 6.2.94 (v5.2.0-rc3-12124-g81c7ed41a1)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

Then, I recorded the timestamp of when the "Kernel panic" line was printed
to the kernel log.

The measurements were done on 5 kernel flavors:

master (mainline commit a19944809fe99):
master-no-stack-traces (stack trace collection commented out)
master-no-stack-depot (saving to stack depot commented out)
up-scs-stacks-v3 (collecting stack traces from SCS)
up-scs-stacks-v3-noscs (up-scs-stacks-v3 with __noscs marking)

(The last flavor is included just for the record: it produces an unexpected
slowdown. The likely reason is that helper functions stop getting inlined.)

All the branches can be found here:

https://github.com/xairy/linux/branches/all

The measurements were performed for Generic and HW_TAGS KASAN modes.

The .configs are here (essentially, defconfig + SCS + KASAN):

Generic KASAN: https://gist.github.com/xairy/d527ad31c0b54898512c92898d62beed
HW_TAGS KASAN: https://gist.github.com/xairy/390e4ef0140de3f4f9a49efe20708d21

The results:

Generic KASAN
-------------

master-no-stack-traces: 8.03
master: 11.55 (+43.8%)
master-no-stack-depot: 11.53 (+43.5%)
up-scs-stacks-v3: 8.31 (+3.4%)
up-scs-stacks-v3-noscs: 9.11 (+13.4%)

HW_TAGS KASAN
-------------

master-no-stack-traces: 3.31
master: 5.01 (+51%)
master-no-stack-depot: 4.85 (+47%)
up-scs-stacks-v3: 3.49 (+5.4%)
up-scs-stacks-v3-noscs: 4.27 (+29%)

The deviation for all numbers above is ~0.05.

As can be seen, the up-scs-stacks-v3 flavor results in a significantly
faster boot compared to master.

Andrey Konovalov (3):
arm64, scs: expose irq_shadow_call_stack_ptr
kasan, arm64: implement stack_trace_save_shadow
kasan: use stack_trace_save_shadow

arch/arm64/include/asm/scs.h | 10 +++++-
arch/arm64/kernel/irq.c | 4 +--
arch/arm64/kernel/sdei.c | 3 --
mm/kasan/common.c | 66 +++++++++++++++++++++++++++++++++++-
4 files changed, 75 insertions(+), 8 deletions(-)

--
2.25.1


2022-04-14 13:31:19

by andrey.konovalov

[permalink] [raw]
Subject: [PATCH v3 2/3] kasan, arm64: implement stack_trace_save_shadow

From: Andrey Konovalov <[email protected]>

Implement stack_trace_save_shadow() that collects stack traces based on
the Shadow Call Stack (SCS) for arm64 by copiing the frames from SCS.

The implementation is best-effort and thus has limitations.

stack_trace_save_shadow() fully handles task and softirq contexts, which
are both processed on the per-task SCS.

For hardirqs, the support is limited: stack_trace_save_shadow() does not
collect the task part of the stack trace. For KASAN, this is not a problem,
as stack depot only saves the interrupt part of the stack anyway.

Otherwise, stack_trace_save_shadow() also takes a best-effort approach
with a focus on performance. Thus, it:

- Does not try to collect stack traces from other exceptions like SDEI.
- Does not try to recover frames modified by KRETPROBES or by FTRACE.

However, stack_trace_save_shadow() does strip PTR_AUTH tags to avoid
leaking them in stack traces.

The -ENOSYS return value is deliberatly used to match
stack_trace_save_tsk_reliable().

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/kasan/common.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index d9079ec11f31..23b30fa6e270 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -30,6 +30,68 @@
#include "kasan.h"
#include "../slab.h"

+#ifdef CONFIG_SHADOW_CALL_STACK
+#include <linux/scs.h>
+#include <asm/scs.h>
+
+/*
+ * Collect the stack trace from the Shadow Call Stack in a best-effort manner:
+ *
+ * - Do not collect the task part of the stack trace when in a hardirq.
+ * - Do not collect stack traces from other exception levels like SDEI.
+ * - Do not recover frames modified by KRETPROBES or by FTRACE.
+ *
+ * Note that marking the function with __noscs leads to unnacceptable
+ * performance impact, as helper functions stop being inlined.
+ */
+static inline int stack_trace_save_shadow(unsigned long *store,
+ unsigned int size)
+{
+ unsigned long *scs_top, *scs_base, *frame;
+ unsigned int len = 0;
+
+ /* Get the SCS base. */
+ if (in_task() || in_serving_softirq()) {
+ /* Softirqs reuse the task SCS area. */
+ scs_base = task_scs(current);
+ } else if (in_hardirq()) {
+ /* Hardirqs use a per-CPU SCS area. */
+ scs_base = *this_cpu_ptr(&irq_shadow_call_stack_ptr);
+ } else {
+ /* Ignore other exception levels. */
+ return 0;
+ }
+
+ /*
+ * Get the SCS pointer.
+ *
+ * Note that this assembly might be placed before the function's
+ * prologue. In this case, the last stack frame will be lost. This is
+ * acceptable: the lost frame will correspond to an internal KASAN
+ * function, which is not relevant to identify the external call site.
+ */
+ asm volatile("mov %0, x18" : "=&r" (scs_top));
+
+ /* The top SCS slot is empty. */
+ scs_top -= 1;
+
+ for (frame = scs_top; frame >= scs_base; frame--) {
+ if (len >= size)
+ break;
+ /* Do not leak PTR_AUTH tags in stack traces. */
+ store[len++] = ptrauth_strip_insn_pac(*frame);
+ }
+
+ return len;
+}
+#else /* CONFIG_SHADOW_CALL_STACK */
+static inline int stack_trace_save_shadow(unsigned long *store,
+ unsigned int size)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_SHADOW_CALL_STACK */
+
depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
{
unsigned long entries[KASAN_STACK_DEPTH];
--
2.25.1

2022-04-14 14:12:22

by andrey.konovalov

[permalink] [raw]
Subject: [PATCH v3 3/3] kasan: use stack_trace_save_shadow

From: Andrey Konovalov <[email protected]>

Use stack_trace_save_shadow() to collect stack traces whenever
CONFIG_HAVE_SHADOW_STACKTRACE is enabled. This improves the
boot time of a defconfig build by ~30% for all KASAN modes.

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/kasan/common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 23b30fa6e270..00fef2e5fe90 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -97,7 +97,9 @@ depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
unsigned long entries[KASAN_STACK_DEPTH];
unsigned int nr_entries;

- nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+ nr_entries = stack_trace_save_shadow(entries, ARRAY_SIZE(entries));
+ if (nr_entries < 0)
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
return __stack_depot_save(entries, nr_entries, flags, can_alloc);
}

--
2.25.1

2022-04-14 14:57:25

by andrey.konovalov

[permalink] [raw]
Subject: [PATCH v3 1/3] arm64, scs: expose irq_shadow_call_stack_ptr

From: Andrey Konovalov <[email protected]>

When a hardirq happens, the Shadow Call Stack (SCS) pointer is switched
to a per-hardirq one, which is stored in irq_shadow_call_stack_ptr per-CPU.

Collecting stack traces based on SCS in hardirqs requires access to
irq_shadow_call_stack_ptr.

Expose irq_shadow_call_stack_ptr in arch/arm64/include/asm/scs.h.

For symmetry, move sdei_shadow_call_stack_normal_ptr and
sdei_shadow_call_stack_critical_ptr declarations there too.

Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/include/asm/scs.h | 10 +++++++++-
arch/arm64/kernel/irq.c | 4 +---
arch/arm64/kernel/sdei.c | 3 ---
3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/scs.h b/arch/arm64/include/asm/scs.h
index 8297bccf0784..9d9ac04014f1 100644
--- a/arch/arm64/include/asm/scs.h
+++ b/arch/arm64/include/asm/scs.h
@@ -24,6 +24,14 @@
.endm
#endif /* CONFIG_SHADOW_CALL_STACK */

-#endif /* __ASSEMBLY __ */
+#else /* __ASSEMBLY__ */
+
+#include <linux/percpu.h>
+
+DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
+DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
+
+#endif /* __ASSEMBLY__ */

#endif /* _ASM_SCS_H */
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index bda49430c9ea..9d85c82f0b1c 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -21,6 +21,7 @@
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <asm/daifflags.h>
+#include <asm/scs.h>
#include <asm/vmap_stack.h>

/* Only access this in an NMI enter/exit */
@@ -28,9 +29,6 @@ DEFINE_PER_CPU(struct nmi_ctx, nmi_contexts);

DEFINE_PER_CPU(unsigned long *, irq_stack_ptr);

-
-DECLARE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
-
#ifdef CONFIG_SHADOW_CALL_STACK
DEFINE_PER_CPU(unsigned long *, irq_shadow_call_stack_ptr);
#endif
diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index d20620a1c51a..882f88ce8d4d 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -39,9 +39,6 @@ DEFINE_PER_CPU(unsigned long *, sdei_stack_normal_ptr);
DEFINE_PER_CPU(unsigned long *, sdei_stack_critical_ptr);
#endif

-DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
-DECLARE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
-
#ifdef CONFIG_SHADOW_CALL_STACK
DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_normal_ptr);
DEFINE_PER_CPU(unsigned long *, sdei_shadow_call_stack_critical_ptr);
--
2.25.1

2022-04-15 10:47:19

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

On Wed, Apr 13, 2022 at 09:26:43PM +0200, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Currently, when saving alloc and free stack traces, KASAN uses the normal
> stack trace collection routines, which rely on the unwinder.
>
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack. This reduces boot time by ~30% for
> all KASAN modes when Shadow Call Stack is enabled. See below for the
> details of how the measurements were performed.
>
> Stack staces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
>
> Note that the implementation is best-effort and only works in certain
> contexts. See patch #3 for details.
>
> ---
>
> Changes
> =======
>
> v2->v3:
> - Limit hardirq and drop SDEI support for performance and simplicity.
> - Move stack_trace_save_shadow() implementation back to mm/kasan:
> it's not mature enough to be used as a system-wide stack trace
> collection replacement.
> - Clarify -ENOSYS return value from stack_trace_save_shadow().
> - Don't rename nr_entries to size in kasan_save_stack().
> - Check return value of stack_trace_save_shadow() instead of checking
> CONFIG_HAVE_SHADOW_STACKTRACE in kasan_save_stack().
>
> v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
> value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.
>
> Perf
> ====
>
> To measure performance impact, I used QEMU in full system emulation mode
> on an x86-64 host.

Just to be clear: QEMU TCG mode is *in no way* representative of HW
performance, and has drastically different performance characteristics
compared to real HW. Please be very clear when you are quoting
performance figures from QEMU TCG mode.

Previously you said you were trying to optimize this so that some
version of KASAN could be enabled in production builds, and the above is
not a suitable benchmark system for that.

Is that *actually* what you're trying to enable, or are you just trying
to speed up running instances under QEMU (e.g. for arm64 Syzkaller runs
on GCE)?

> As proposed by Mark, I passed no filesystem to QEMU and booted with panic=-1:
>
> qemu-system-aarch64 \
> -machine virt,mte=on -cpu max \
> -m 2G -smp 1 -nographic \
> -kernel ./xbins/Image \
> -append "console=ttyAMA0 earlyprintk=serial panic=-1" \
> -no-shutdown -no-reboot
>
> Just in case, the QEMU version is:
>
> $ qemu-system-aarch64 --version
> QEMU emulator version 6.2.94 (v5.2.0-rc3-12124-g81c7ed41a1)

Thanks for this information, this makes it *much* easier to see what's
going on here.

As I suspected, you're hitting a known performance oddity with QEMU TCG
mode where pointer authentication is *incredibly* slow when using the
architected QARMA5 algorithm (enabled by default with `-cpu max`).

Practically speaking, when using TCG mode with `-cpu max`, PACIASP and
AUTIASP instructions in function prologues/epilogues take much longer
than they would on real hardware, and this means that those skew
performance with an overhead whcih scales with the number of function
calls. The regular unwinder is hit pretty bad by this because it has a
few function calls (including KASAN instrumentation), and the overhead
of this dominates the cost of the actual unwind logic. Your SCS unwinder
isn't as badly hit by that because it has fewer function calls.

This overhead has nothing to do with the *nature* of the unwinder, and
is an artifact of the *platform* and the *structure* of the code.
There's plenty that can be done to avoid that overhead, including basic
QEMU options which don't require kernel changes.

For example, if you replace:

-cpu max

With:

-cpu max,pauth-impdef=true

... QEMU will use an IMPLEMENTATION DEFINED pointer authentication
algorithm which is *much* faster than its implementation of QARMA5 (and
tests below show that make the kernel reach the panic in ~1/6 the time).

Since you're testing with shadow call stack, you could decide to disable
pointer authentication entirely. You can tell QEMU to not provide that
with:

-cpu max,pauth=false

... which in tests below makes the kernel reach the panic in 1/9 the
time.

You could instead change your config to have
CONFIG_ARM64_PTR_AUTH_KERNEL=n, which would cause the kernel to be built
without the instructions which are causing the slowdown.

For comparisons below, I've built the same version of QEMU as you're
using. The numbering in that version string is a bit confusing, so I've
gone by the git commit ID:

| commit 81c7ed41a1b33031f3e4fe24191a998a492044b8 (HEAD, tag: v7.0.0-rc4, origin/master, origin/HEAD)
| Author: Peter Maydell <[email protected]>
| AuthorDate: Tue Apr 12 17:01:34 2022 +0100
| Commit: Peter Maydell <[email protected]>
| CommitDate: Tue Apr 12 17:01:34 2022 +0100
|
| Update version for v7.0.0-rc4 release
|
| Signed-off-by: Peter Maydell <[email protected]>
|
| diff --git a/VERSION b/VERSION
| index 80d0afb063..9c57137cdf 100644
| --- a/VERSION
| +++ b/VERSION
| @@ -1 +1 @@
| -6.2.93
| +6.2.94

My host machine is an Intel Xeon E5-2660.

> Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
>
> Then, I recorded the timestamp of when the "Kernel panic" line was printed
> to the kernel log.
>
> The measurements were done on 5 kernel flavors:
>
> master (mainline commit a19944809fe99):
> master-no-stack-traces (stack trace collection commented out)
> master-no-stack-depot (saving to stack depot commented out)
> up-scs-stacks-v3 (collecting stack traces from SCS)
> up-scs-stacks-v3-noscs (up-scs-stacks-v3 with __noscs marking)
>
> (The last flavor is included just for the record: it produces an unexpected
> slowdown. The likely reason is that helper functions stop getting inlined.)

As above, that case is hitting the overhead of QEMU TCG mode's
incredibly slow pointer authentication.

> All the branches can be found here:
>
> https://github.com/xairy/linux/branches/all
>
> The measurements were performed for Generic and HW_TAGS KASAN modes.
>
> The .configs are here (essentially, defconfig + SCS + KASAN):
>
> Generic KASAN: https://gist.github.com/xairy/d527ad31c0b54898512c92898d62beed
> HW_TAGS KASAN: https://gist.github.com/xairy/390e4ef0140de3f4f9a49efe20708d21
>
> The results:
>
> Generic KASAN
> -------------
>
> master-no-stack-traces: 8.03
> master: 11.55 (+43.8%)
> master-no-stack-depot: 11.53 (+43.5%)
> up-scs-stacks-v3: 8.31 (+3.4%)
> up-scs-stacks-v3-noscs: 9.11 (+13.4%)

I made the same measurements, reading the timestamp on the panic
message. From my local results, the performance oddity with pointer
authentication dominates everything else, and by changing QEMU options
to minimize or eliminate that overhead the difference in unwinder
overhead becomes far less pronouced and the overall boot time is reduced
to a fraction of the time taken when the incredibly slow imlpementation
of pointer authentication is used:

Generic KASAN w/ `-cpu max`
---------------------------

master-no-stack-traces: 12.66
master: 18.39 (+45.2%)
master-no-stack-depot: 17.85 (+40.1%)
up-scs-stacks-v3: 13.54 (+7.0%)

Generic KASAN w/ `-cpu max,pauth-impdef=true`
---------------------------------------------

master-no-stack-traces: 2.69
master: 3.35 (+24.5%)
master-no-stack-depot: 3.54 (+31.5%)
up-scs-stacks-v3: 2.80 (+4.1%)

All results are 5x to 6x faster; the regular unwinder overhead is
reduced by ~20% relative to `-cpu max`.

Generic KASAN w/ `-cpu max,pauth=false`
---------------------------------------

master-no-stack-traces: 1.92
master: 2.27 (+18.2%)
master-no-stack-depot: 2.22 (+15.6%)
up-scs-stacks-v3: 2.06 (+7.3%)

All results are 6x to 9x faster; the regular unwinder overhead is
reduced by 27% relative to `-cpu max`.

To speed up your QEMU runs, there are a couple of trivial options
available to you which'll improve your runs by ~6x to ~9x, which vastly
outstrips any benefit gained from changing the unwinder. I'd recommend
you at least consider moving to `pauth-impdef=true`.

While the SCS unwinder is still faster, the difference is nowhere near
as pronounced. As I mentioned before, there are changes that we can make
to the regular unwinder to close that gap somewhat, some of which I
intend to make as part of ongoing cleanup/rework in that area.

I haven't bothered testing HW_TAGS, because the performance
characteristics of emulated MTE are also nothing like that of a real HW
implementation.

So, given that and the problems I mentioned before, I don't think
there's a justification for adding a separate SCS unwinder. As before,
I'm still happy to try to make the regular unwinder faster (and I'm
happy to make changes which benefit QEMU TCG mode if those don't harm
the maintainability of the unwinder).

NAK to adding an SCS-specific unwinder, regardless of where in the
source tree that is placed.

Thanks,
Mark.

2022-04-15 23:07:54

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

On Thu, Apr 14, 2022 at 01:36:59PM +0100, Mark Rutland wrote:
> As I suspected, you're hitting a known performance oddity with QEMU TCG
> mode where pointer authentication is *incredibly* slow when using the
> architected QARMA5 algorithm (enabled by default with `-cpu max`).

> This overhead has nothing to do with the *nature* of the unwinder, and
> is an artifact of the *platform* and the *structure* of the code.
> There's plenty that can be done to avoid that overhead

FWIW, from a quick look, disabling KASAN instrumentation for the
stacktrace object alone (with no other changes) has a significant impact
(compounded by the TCG QARMA5 slowdown), and I note that x86 doesn't
both instrumenting its stacktrace code anyway, so we could consider
doing likewise.

Atop that, replacing set_bit() with __set_bit() brings the regular
unwinder *really* close to the earlier SCS unwinder figures. I know that
the on_accessible_stack() calculations and checks could be ammortized
with some refactoring (which I'd planned to do anyway), so I think it's
plausible that with some changes to the existing unwinder we can bring
the difference into the noise.

> generic kasan w/ `-cpu max`
> ---------------------------
>
> master-no-stack-traces: 12.66
> master: 18.39 (+45.2%)
> master-no-stack-depot: 17.85 (+40.1%)
> up-scs-stacks-v3: 13.54 (+7.0%)

master-noasan: 15.67 (+23.8%)
master-noasan-__set_bit: 14.61 (+15.5%)

> Generic KASAN w/ `-cpu max,pauth-impdef=true`
> ---------------------------------------------
>
> master-no-stack-traces: 2.69
> master: 3.35 (+24.5%)
> master-no-stack-depot: 3.54 (+31.5%)
> up-scs-stacks-v3: 2.80 (+4.1%)

master-noasan: 3.05 (+13.0%)
master-noasan-__set_bit: 2.96 (+10.0%)

> Generic KASAN w/ `-cpu max,pauth=false`
> ---------------------------------------
>
> master-no-stack-traces: 1.92
> master: 2.27 (+18.2%)
> master-no-stack-depot: 2.22 (+15.6%)
> up-scs-stacks-v3: 2.06 (+7.3%)

master-noasan: 2.14 (+11.4%)
master-noasan-__set_bit: 2.10 (+9.4%)

Thanks,
Mark.

2022-04-16 01:36:49

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] kasan, arm64: implement stack_trace_save_shadow

On Wed, Apr 13, 2022 at 09:26:45PM +0200, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Implement stack_trace_save_shadow() that collects stack traces based on
> the Shadow Call Stack (SCS) for arm64 by copiing the frames from SCS.
>
> The implementation is best-effort and thus has limitations.
>
> stack_trace_save_shadow() fully handles task and softirq contexts, which
> are both processed on the per-task SCS.
>
> For hardirqs, the support is limited: stack_trace_save_shadow() does not
> collect the task part of the stack trace. For KASAN, this is not a problem,
> as stack depot only saves the interrupt part of the stack anyway.
>
> Otherwise, stack_trace_save_shadow() also takes a best-effort approach
> with a focus on performance. Thus, it:
>
> - Does not try to collect stack traces from other exceptions like SDEI.
> - Does not try to recover frames modified by KRETPROBES or by FTRACE.
>
> However, stack_trace_save_shadow() does strip PTR_AUTH tags to avoid
> leaking them in stack traces.
>
> The -ENOSYS return value is deliberatly used to match
> stack_trace_save_tsk_reliable().
>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---
> mm/kasan/common.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)

As things stand, NAK to this patch, for the reasons I have laid out in
my replies to earlier postings and to my reply to the cover letter of
this posting.

To be clear, that NAK applies regardless of where this is placed within
the kernel tree. If we *really* need to have a special unwinder, that
should live under arch/arm64/, but my first objection is that it is not
necessary.

I am more than happy to extend the existing unwinder with some options
to minimize overhead (e.g. to stop dumping at an exception boundary),
since that sounds useful to you, and I know is relatively simple to
implement.

Thanks,
Mark.

> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index d9079ec11f31..23b30fa6e270 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -30,6 +30,68 @@
> #include "kasan.h"
> #include "../slab.h"
>
> +#ifdef CONFIG_SHADOW_CALL_STACK
> +#include <linux/scs.h>
> +#include <asm/scs.h>
> +
> +/*
> + * Collect the stack trace from the Shadow Call Stack in a best-effort manner:
> + *
> + * - Do not collect the task part of the stack trace when in a hardirq.
> + * - Do not collect stack traces from other exception levels like SDEI.
> + * - Do not recover frames modified by KRETPROBES or by FTRACE.
> + *
> + * Note that marking the function with __noscs leads to unnacceptable
> + * performance impact, as helper functions stop being inlined.
> + */
> +static inline int stack_trace_save_shadow(unsigned long *store,
> + unsigned int size)
> +{
> + unsigned long *scs_top, *scs_base, *frame;
> + unsigned int len = 0;
> +
> + /* Get the SCS base. */
> + if (in_task() || in_serving_softirq()) {
> + /* Softirqs reuse the task SCS area. */
> + scs_base = task_scs(current);
> + } else if (in_hardirq()) {
> + /* Hardirqs use a per-CPU SCS area. */
> + scs_base = *this_cpu_ptr(&irq_shadow_call_stack_ptr);
> + } else {
> + /* Ignore other exception levels. */
> + return 0;
> + }
> +
> + /*
> + * Get the SCS pointer.
> + *
> + * Note that this assembly might be placed before the function's
> + * prologue. In this case, the last stack frame will be lost. This is
> + * acceptable: the lost frame will correspond to an internal KASAN
> + * function, which is not relevant to identify the external call site.
> + */
> + asm volatile("mov %0, x18" : "=&r" (scs_top));
> +
> + /* The top SCS slot is empty. */
> + scs_top -= 1;
> +
> + for (frame = scs_top; frame >= scs_base; frame--) {
> + if (len >= size)
> + break;
> + /* Do not leak PTR_AUTH tags in stack traces. */
> + store[len++] = ptrauth_strip_insn_pac(*frame);
> + }
> +
> + return len;
> +}
> +#else /* CONFIG_SHADOW_CALL_STACK */
> +static inline int stack_trace_save_shadow(unsigned long *store,
> + unsigned int size)
> +{
> + return -ENOSYS;
> +}
> +#endif /* CONFIG_SHADOW_CALL_STACK */
> +
> depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
> {
> unsigned long entries[KASAN_STACK_DEPTH];
> --
> 2.25.1
>

2022-05-23 07:56:10

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

On Thu, Apr 14, 2022 at 2:37 PM Mark Rutland <[email protected]> wrote:
>

Hi Mark,

Sorry for the delayed response, it took some time getting my hands on
hardware for testing these changes.

> Just to be clear: QEMU TCG mode is *in no way* representative of HW
> performance, and has drastically different performance characteristics
> compared to real HW. Please be very clear when you are quoting
> performance figures from QEMU TCG mode.
>
> Previously you said you were trying to optimize this so that some
> version of KASAN could be enabled in production builds, and the above is
> not a suitable benchmark system for that.

Understood.

My expectation was that performance numbers from QEMU would be close
to hardware. I knew that there are instructions that take longer to be
emulated, but I expected that they would be uniformly spread across
the code.

However, your explanation proved this wrong. This indeed doesn't apply
when measuring the performance of a piece of code with a different
density of function calls.

Thank you for the detailed explanation! Those QEMU arguments will
definitely be handy when I need a faster QEMU setup.

> Is that *actually* what you're trying to enable, or are you just trying
> to speed up running instances under QEMU (e.g. for arm64 Syzkaller runs
> on GCE)?

No, I'm not trying to speed up QEMU. QEMU was just the only setup that
I had access to at that moment.

The goal is to allow enabling stack trace collection in production on
HW_TAGS-enabled devices once those are created.

[...]

> While the SCS unwinder is still faster, the difference is nowhere near
> as pronounced. As I mentioned before, there are changes that we can make
> to the regular unwinder to close that gap somewhat, some of which I
> intend to make as part of ongoing cleanup/rework in that area.

I tried running the same experiments on Pixel 6.

Unfortunately, I was only able to test the OUTLINE SW_TAGS mode
(without STACK instrumentation, as HW_TAGS doesn't support STACK at
the moment.) All of the other modes either fail to flash or fail to
boot with AOSP on Pixel 6 :(

The results are (timestamps were measured when "ALSA device list" was
printed to the kernel log):

sw_tags outline nostacks: 2.218
sw_tags outline: 2.516 (+13.4%)
sw_tags outline nosanitize: 2.364 (+6.5%)
sw_tags outline nosanitize __set_bit: 2.364 (+6.5%)
sw_tags outline nosanitize scs: 2.236 (+0.8%)

Used markings:

nostacks: patch from master-no-stack-traces applied
nosanitize: KASAN_SANITIZE_stacktrace.o := n
__set_bit: set_bit -> __set_bit change applied
scs: patches from up-scs-stacks-v3 applied

First, disabling instrumentation of stacktrace.c is indeed a great
idea for software KASAN modes! I will send a patch for this later.

Changing set_bit to __set_bit seems to make no difference on Pixel 6.

The awesome part is that the overhead of collecting stack traces with
SCS and even saving them into the stack depot is less than 1%.

However once again note, that this is for OUTLINE SW_TAGS without STACK.

> I haven't bothered testing HW_TAGS, because the performance
> characteristics of emulated MTE are also nothing like that of a real HW
> implementation.
>
> So, given that and the problems I mentioned before, I don't think
> there's a justification for adding a separate SCS unwinder. As before,
> I'm still happy to try to make the regular unwinder faster (and I'm
> happy to make changes which benefit QEMU TCG mode if those don't harm
> the maintainability of the unwinder).
>
> NAK to adding an SCS-specific unwinder, regardless of where in the
> source tree that is placed.

I see.

Perhaps, it makes sense to wait until there's HW_TAGS-enabled hardware
available before continuing to look into this. At the end, the
performance overhead for that setup is what matters.

I'll look into improving the performance of the existing unwinder a
bit more. However, I don't think I'll be able to speed it up to < 1%.
Which means that we'll likely need a sample-based approach for HW_TAGS
stack collection to reduce the overhead.

Thank you!