Hi Alexander, Kees,
This is the v2 I promised. Since Alexander wanted to look at this in
more detail (and since this is subtle and needs review), I'm assuming
that Kees will pick this up some time next week after that's happened,
if all goes well. :)
This series reworks the stackleak code and the associated LKDTM test.
The first patch fixes some latent issues on arm64, and the subsequent
patches rework the code to improve clarity and permit better code
generation. Patches 8-11 address some latent issues in the LKDTM test
and add more diagnostic output.
Since v1 [1]:
* Fix and rework the LKDTM test
* Rework the poison scan
[1] https://lore.kernel.org/lkml/[email protected]/
Benchmarking arm64 with a QEMU HVF VM on an M1 Macbook Pro shows a
reasonable improvement when stackleak is enabled. I've included figures
for when stackleak is dynamically disabled with v2:
* Calling getpid 2^22 times in a loop (avg 50 runs)
5.18-rc1/on: 0.652099387s ( +- 0.13% )
v1/on: 0.641005661s ( +- 0.13% ) ; 1.7% improvement
v2/on: 0.611699824s ( +- 0.08% ) ; 6.2% improvement
v2/off: 0.507505632s ( +- 0.15% )
* perf bench sched pipe (single run)
5.18-rc1/on: 2.138s total
v1/on: 2.118s total ; 0.93% improvement
v2/on: 2.116s total ; 1.03% improvement
v2/off: 1.708s total
The large jump between v1 and v2 is mostly due to the changes to poison
scanning, which avoid redundantly overwriting poison. For the getpid
syscall the poison which gets overwritten is a substantial fraction of
the stack usage, while for more involved syscalls it may be a trivial
fraction, so the average benefit is fairly small.
With the whole series applied, the LKDTM test reliably passes on both
arm64 and x86_64, e.g.
| # uname -a
| Linux buildroot 5.18.0-rc1-00013-g26f638ab0d7c #3 SMP PREEMPT Wed Apr 27 16:21:37 BST 2022 aarch64 GNU/Linux
| # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
| lkdtm: Performing direct entry STACKLEAK_ERASING
| lkdtm: stackleak stack usage:
| high offset: 336 bytes
| current: 688 bytes
| lowest: 1232 bytes
| tracked: 1232 bytes
| untracked: 672 bytes
| poisoned: 14136 bytes
| low offset: 8 bytes
| lkdtm: OK: the rest of the thread stack is properly erased
The first patch fixes the major issues on arm64, and is Cc'd to stable
for backporting.
The second patch is a trivial optimization for when stackleak is
dynamically disabled.
The subsequent patches rework the way stackleak manipulates the stack
boundary values. This is partially for clarity (e.g. with separate 'low'
and 'high' boundary variables), and also permits the compiler to
generate more optimal assembly by generating the high and low bounds
from the same base.
Patch 6 changes the way that `current->lowest_stack` is reset prior to
return to userspace. The existing code uses an undocumented offset
relative to the top of the stack which doesn't make much sense (as this
sometimes falls within the task's pt_regs, or sometimes adds 600+ bytes
to erase upon the next exit to userspace). For now I've removed the
offset entirely.
Patch 7 changes the way we scan for poison. This fixes what I believe
are fencepost errors, and also avoids the need to redundantly overwrite
poison.
Patches 8-11 rework the LKDTM test, fixing cases where the test could
spuriously fail and improving the diagnostic output.
Patches 12 and 13 add stackleak_erase_on_task_stack() and
stackleak_erase_off_task_stack() that can be used when a caller knows
they're always on or off the task stack respectively, avoiding redundant
logic to check this and generate the high boundary value. The former is
used on arm64 where we always perform the erase on the task stack, and
I've included the latter for completeness (as e.g. it could be used on
x86) given it is trivial.
Thanks,
Mark.
Mark Rutland (13):
arm64: stackleak: fix current_top_of_stack()
stackleak: move skip_erasing() check earlier
stackleak: remove redundant check
stackleak: rework stack low bound handling
stackleak: clarify variable names
stackleak: rework stack high bound handling
stackleak: rework poison scanning
lkdtm/stackleak: avoid spurious failure
lkdtm/stackleak: rework boundary management
lkdtm/stackleak: prevent unexpected stack usage
lkdtm/stackleak: check stack boundaries
stackleak: add on/off stack variants
arm64: entry: use stackleak_erase_on_task_stack()
arch/arm64/include/asm/processor.h | 10 +--
arch/arm64/kernel/entry.S | 2 +-
drivers/misc/lkdtm/stackleak.c | 133 +++++++++++++++++++----------
include/linux/stackleak.h | 55 +++++++++++-
kernel/stackleak.c | 105 ++++++++++++++---------
5 files changed, 210 insertions(+), 95 deletions(-)
--
2.30.2
The logic within __stackleak_erase() can be a little hard to follow, as
`boundary` switches from being the low bound to the high bound mid way
through the function, and `kstack_ptr` is used to represent the start of
the region to erase while `boundary` represents the end of the region to
erase.
Make this clearer by consistently using better variable names. The
`boundary` variable is removed, the bounds of the region to erase are
described by `erase_low` and `erase_high`, and the bounds of the task
stack are described by `task_stack_low` and `task_stack_high`.
At the same time, remove the comment above the variables, since it is
unclear whether it's intended as rationale, a complaint, or a TODO, and
is more confusing than helpful.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
kernel/stackleak.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index 24b7cf01b2972..d5f684dc0a2d9 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -73,40 +73,38 @@ late_initcall(stackleak_sysctls_init);
static __always_inline void __stackleak_erase(void)
{
const unsigned long task_stack_low = stackleak_task_low_bound(current);
-
- /* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
- unsigned long kstack_ptr = current->lowest_stack;
- unsigned long boundary = task_stack_low;
+ unsigned long erase_low = current->lowest_stack;
+ unsigned long erase_high;
unsigned int poison_count = 0;
const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
/* Search for the poison value in the kernel stack */
- while (kstack_ptr > boundary && poison_count <= depth) {
- if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
+ while (erase_low > task_stack_low && poison_count <= depth) {
+ if (*(unsigned long *)erase_low == STACKLEAK_POISON)
poison_count++;
else
poison_count = 0;
- kstack_ptr -= sizeof(unsigned long);
+ erase_low -= sizeof(unsigned long);
}
#ifdef CONFIG_STACKLEAK_METRICS
- current->prev_lowest_stack = kstack_ptr;
+ current->prev_lowest_stack = erase_low;
#endif
/*
- * Now write the poison value to the kernel stack. Start from
- * 'kstack_ptr' and move up till the new 'boundary'. We assume that
- * the stack pointer doesn't change when we write poison.
+ * Now write the poison value to the kernel stack between 'erase_low'
+ * and 'erase_high'. We assume that the stack pointer doesn't change
+ * when we write poison.
*/
if (on_thread_stack())
- boundary = current_stack_pointer;
+ erase_high = current_stack_pointer;
else
- boundary = current_top_of_stack();
+ erase_high = current_top_of_stack();
- while (kstack_ptr < boundary) {
- *(unsigned long *)kstack_ptr = STACKLEAK_POISON;
- kstack_ptr += sizeof(unsigned long);
+ while (erase_low < erase_high) {
+ *(unsigned long *)erase_low = STACKLEAK_POISON;
+ erase_low += sizeof(unsigned long);
}
/* Reset the 'lowest_stack' value for the next syscall */
--
2.30.2
Prior to returning to userspace, we reset current->lowest_stack to a
reasonable high bound. Currently we do this by subtracting the arbitrary
value `THREAD_SIZE/64` from the top of the stack, for reasons lost to
history.
Looking at configurations today:
* On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The
pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of
padding above), and so this covers an additional portion of 44 to 60
bytes.
* On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the
bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs
at the top of the stack is 168 bytes, and so this covers an additional
88 bytes of stack (up to 344 with KASAN).
* On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages
and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with
KASAN). The pt_regs at the top of the stack is 336 bytes, so this can
fall within the pt_regs, or can cover an additional 688 bytes of
stack.
Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the
worst case, this will cause more than 600 bytes of stack to be erased
for every syscall, even if actual stack usage were substantially
smaller.
This patch makes this slightly less nonsensical by consistently
resetting current->lowest_stack to the base of the task pt_regs. For
clarity and for consistency with the handling of the low bound, the
generation of the high bound is split into a helper with commentary
explaining why.
Since the pt_regs at the top of the stack will be clobbered upon the
next exception entry, we don't need to poison these at exception exit.
By using task_pt_regs() as the high stack boundary instead of
current_top_of_stack() we avoid some redundant poisoning, and the
compiler can share the address generation between the poisoning and
resetting of `current->lowest_stack`, making the generated code more
optimal.
It's not clear to me whether the existing `THREAD_SIZE/64` offset was a
dodgy heuristic to skip the pt_regs, or whether it was attempting to
minimize the number of times stackleak_check_stack() would have to
update `current->lowest_stack` when stack usage was shallow at the cost
of unconditionally poisoning a small portion of the stack for every exit
to userspace.
For now I've simply removed the offset, and if we need/want to minimize
updates for shallow stack usage it should be easy to add a better
heuristic atop, with appropriate commentary so we know what's going on.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
include/linux/stackleak.h | 14 ++++++++++++++
kernel/stackleak.c | 19 ++++++++++++++-----
2 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
index 67430faa5c518..467661aeb4136 100644
--- a/include/linux/stackleak.h
+++ b/include/linux/stackleak.h
@@ -28,6 +28,20 @@ stackleak_task_low_bound(const struct task_struct *tsk)
return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
}
+/*
+ * The address immediately after the highest address on tsk's stack which we
+ * can plausibly erase.
+ */
+static __always_inline unsigned long
+stackleak_task_high_bound(const struct task_struct *tsk)
+{
+ /*
+ * The task's pt_regs lives at the top of the task stack and will be
+ * overwritten by exception entry, so there's no need to erase them.
+ */
+ return (unsigned long)task_pt_regs(tsk);
+}
+
static inline void stackleak_task_init(struct task_struct *t)
{
t->lowest_stack = stackleak_task_low_bound(t);
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index d5f684dc0a2d9..ba346d46218f5 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -73,6 +73,7 @@ late_initcall(stackleak_sysctls_init);
static __always_inline void __stackleak_erase(void)
{
const unsigned long task_stack_low = stackleak_task_low_bound(current);
+ const unsigned long task_stack_high = stackleak_task_high_bound(current);
unsigned long erase_low = current->lowest_stack;
unsigned long erase_high;
unsigned int poison_count = 0;
@@ -93,14 +94,22 @@ static __always_inline void __stackleak_erase(void)
#endif
/*
- * Now write the poison value to the kernel stack between 'erase_low'
- * and 'erase_high'. We assume that the stack pointer doesn't change
- * when we write poison.
+ * Write poison to the task's stack between 'erase_low' and
+ * 'erase_high'.
+ *
+ * If we're running on a different stack (e.g. an entry trampoline
+ * stack) we can erase everything below the pt_regs at the top of the
+ * task stack.
+ *
+ * If we're running on the task stack itself, we must not clobber any
+ * stack used by this function and its caller. We assume that this
+ * function has a fixed-size stack frame, and the current stack pointer
+ * doesn't change while we write poison.
*/
if (on_thread_stack())
erase_high = current_stack_pointer;
else
- erase_high = current_top_of_stack();
+ erase_high = task_stack_high;
while (erase_low < erase_high) {
*(unsigned long *)erase_low = STACKLEAK_POISON;
@@ -108,7 +117,7 @@ static __always_inline void __stackleak_erase(void)
}
/* Reset the 'lowest_stack' value for the next syscall */
- current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
+ current->lowest_stack = task_stack_high;
}
asmlinkage void noinstr stackleak_erase(void)
--
2.30.2
Currently we over-estimate the region of stack which must be erased.
To determine the region to be erased, we scan downwards for a contiguous
block of poison values (or the low bound of the stack). There are a few
minor problems with this today:
* When we find a block of poison values, we include this block within
the region to erase.
As this is included within the region to erase, this causes us to
redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
poison.
* As the loop condition checks 'poison_count <= depth', it will run an
additional iteration after finding the contiguous block of poison,
decrementing 'erase_low' once more than necessary.
As this is included within the region to erase, this causes us to
redundantly overwrite an additional unsigned long with poison.
* As we always decrement 'erase_low' after checking an element on the
stack, we always include the element below this within the region to
erase.
As this is included within the region to erase, this causes us to
redundantly overwrite an additional unsigned long with poison.
Note that this is not a functional problem. As the loop condition
checks 'erase_low > task_stack_low', we'll never clobber the
STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
never fail to erase the element immediately above the STACK_END_MAGIC.
In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
bytes more than necessary, which is unfortunate.
This patch reworks the logic to find the address immediately above the
poisoned region, by finding the lowest non-poisoned address. This is
factored into a stackleak_find_top_of_poison() helper both for clarity
and so that this can be shared with the LKDTM test in subsequent
patches.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
kernel/stackleak.c | 18 ++++--------------
2 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
index 467661aeb4136..c36e7a3b45e7e 100644
--- a/include/linux/stackleak.h
+++ b/include/linux/stackleak.h
@@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
return (unsigned long)task_pt_regs(tsk);
}
+/*
+ * Find the address immediately above the poisoned region of the stack, where
+ * that region falls between 'low' (inclusive) and 'high' (exclusive).
+ */
+static __always_inline unsigned long
+stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
+{
+ const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
+ unsigned int poison_count = 0;
+ unsigned long poison_high = high;
+ unsigned long sp = high;
+
+ while (sp > low && poison_count < depth) {
+ sp -= sizeof(unsigned long);
+
+ if (*(unsigned long *)sp == STACKLEAK_POISON) {
+ poison_count++;
+ } else {
+ poison_count = 0;
+ poison_high = sp;
+ }
+ }
+
+ return poison_high;
+}
+
static inline void stackleak_task_init(struct task_struct *t)
{
t->lowest_stack = stackleak_task_low_bound(t);
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index ba346d46218f5..afd54b8e10b83 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
{
const unsigned long task_stack_low = stackleak_task_low_bound(current);
const unsigned long task_stack_high = stackleak_task_high_bound(current);
- unsigned long erase_low = current->lowest_stack;
- unsigned long erase_high;
- unsigned int poison_count = 0;
- const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
-
- /* Search for the poison value in the kernel stack */
- while (erase_low > task_stack_low && poison_count <= depth) {
- if (*(unsigned long *)erase_low == STACKLEAK_POISON)
- poison_count++;
- else
- poison_count = 0;
-
- erase_low -= sizeof(unsigned long);
- }
+ unsigned long erase_low, erase_high;
+
+ erase_low = stackleak_find_top_of_poison(task_stack_low,
+ current->lowest_stack);
#ifdef CONFIG_STACKLEAK_METRICS
current->prev_lowest_stack = erase_low;
--
2.30.2
In stackleak_task_init(), stackleak_track_stack(), and
__stackleak_erase(), we open-code skipping the STACK_END_MAGIC at the
bottom of the stack. Each case is implemented slightly differently, and
only the __stackleak_erase() case is commented.
In stackleak_task_init() and stackleak_track_stack() we unconditionally
add sizeof(unsigned long) to the lowest stack address. In
stackleak_task_init() we use end_of_stack() for this, and in
stackleak_track_stack() we use task_stack_page(). In __stackleak_erase()
we handle this by detecting if `kstack_ptr` has hit the stack end
boundary, and if so, conditionally moving it above the magic.
This patch adds a new stackleak_task_low_bound() helper which is used in
all three cases, which unconditionally adds sizeof(unsigned long) to the
lowest address on the task stack, with commentary as to why. This uses
end_of_stack() as stackleak_task_init() did prior to this patch, as this
is consistent with the code in kernel/fork.c which initializes the
STACK_END_MAGIC value.
In __stackleak_erase() we no longer need to check whether we've spilled
into the STACK_END_MAGIC value, as stackleak_track_stack() ensures that
`current->lowest_stack` stops immediately above this, and similarly the
poison scan will stop immediately above this.
For stackleak_task_init() and stackleak_track_stack() this results in no
change to code generation. For __stackleak_erase() the generated
assembly is slightly simpler and shorter.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
include/linux/stackleak.h | 15 ++++++++++++++-
kernel/stackleak.c | 14 ++++----------
2 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
index ccaab2043fcd5..67430faa5c518 100644
--- a/include/linux/stackleak.h
+++ b/include/linux/stackleak.h
@@ -15,9 +15,22 @@
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
#include <asm/stacktrace.h>
+/*
+ * The lowest address on tsk's stack which we can plausibly erase.
+ */
+static __always_inline unsigned long
+stackleak_task_low_bound(const struct task_struct *tsk)
+{
+ /*
+ * The lowest unsigned long on the task stack contains STACK_END_MAGIC,
+ * which we must not corrupt.
+ */
+ return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
+}
+
static inline void stackleak_task_init(struct task_struct *t)
{
- t->lowest_stack = (unsigned long)end_of_stack(t) + sizeof(unsigned long);
+ t->lowest_stack = stackleak_task_low_bound(t);
# ifdef CONFIG_STACKLEAK_METRICS
t->prev_lowest_stack = t->lowest_stack;
# endif
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index f7a0f8cf73c37..24b7cf01b2972 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -72,9 +72,11 @@ late_initcall(stackleak_sysctls_init);
static __always_inline void __stackleak_erase(void)
{
+ const unsigned long task_stack_low = stackleak_task_low_bound(current);
+
/* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
unsigned long kstack_ptr = current->lowest_stack;
- unsigned long boundary = (unsigned long)end_of_stack(current);
+ unsigned long boundary = task_stack_low;
unsigned int poison_count = 0;
const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
@@ -88,13 +90,6 @@ static __always_inline void __stackleak_erase(void)
kstack_ptr -= sizeof(unsigned long);
}
- /*
- * One 'long int' at the bottom of the thread stack is reserved and
- * should not be poisoned (see CONFIG_SCHED_STACK_END_CHECK=y).
- */
- if (kstack_ptr == boundary)
- kstack_ptr += sizeof(unsigned long);
-
#ifdef CONFIG_STACKLEAK_METRICS
current->prev_lowest_stack = kstack_ptr;
#endif
@@ -140,8 +135,7 @@ void __used __no_caller_saved_registers noinstr stackleak_track_stack(void)
/* 'lowest_stack' should be aligned on the register width boundary */
sp = ALIGN(sp, sizeof(unsigned long));
if (sp < current->lowest_stack &&
- sp >= (unsigned long)task_stack_page(current) +
- sizeof(unsigned long)) {
+ sp >= stackleak_task_low_bound(current)) {
current->lowest_stack = sp;
}
}
--
2.30.2
The lkdtm_STACKLEAK_ERASING() test scans for a contiguous block of
poison values between the low stack bound and the stack pointer, and
fails if it does not find a sufficiently large block.
This can happen legitimately if the scan reaches the low stack bound,
which could occur if functions called prior to lkdtm_STACKLEAK_ERASING()
used a large amount of stack. If this were to occur, it means that the
erased portion of the stack is smaller than the size checked by the
scan, but this does not cause a functional problem.
In practice this is unlikely to happen, but as this is legitimate and
would not result in a functional problem, the test should not fail in
this case.
Remove the spurious failure case.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
drivers/misc/lkdtm/stackleak.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/misc/lkdtm/stackleak.c b/drivers/misc/lkdtm/stackleak.c
index 00db21ff115e4..707d530d509b7 100644
--- a/drivers/misc/lkdtm/stackleak.c
+++ b/drivers/misc/lkdtm/stackleak.c
@@ -53,13 +53,6 @@ void lkdtm_STACKLEAK_ERASING(void)
found = 0;
}
- if (found <= check_depth) {
- pr_err("FAIL: the erased part is not found (checked %lu bytes)\n",
- i * sizeof(unsigned long));
- test_failed = true;
- goto end;
- }
-
pr_info("the erased part begins after %lu not poisoned bytes\n",
(i - found) * sizeof(unsigned long));
--
2.30.2
The lkdtm_STACKLEAK_ERASING() test is instrumentable and runs with IRQs
unmasked, so it's possible for unrelated code to clobber the task stack
and/or manipulate current->lowest_stack while the test is running,
resulting in spurious failures.
The regular stackleak erasing code is non-instrumentable and runs with
IRQs masked, preventing similar issues.
Make the body of the test non-instrumentable, and run it with IRQs
masked, avoiding such spurious failures.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Will Deacon <[email protected]>
---
drivers/misc/lkdtm/stackleak.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/lkdtm/stackleak.c b/drivers/misc/lkdtm/stackleak.c
index 0aafa46ced7d6..46c60761a05ea 100644
--- a/drivers/misc/lkdtm/stackleak.c
+++ b/drivers/misc/lkdtm/stackleak.c
@@ -11,7 +11,20 @@
#include "lkdtm.h"
#include <linux/stackleak.h>
-void lkdtm_STACKLEAK_ERASING(void)
+/*
+ * Check that stackleak tracks the lowest stack pointer and erases the stack
+ * below this as expected.
+ *
+ * To prevent the lowest stack pointer changing during the test, IRQs are
+ * masked and instrumentation of this function is disabled. We assume that the
+ * compiler will create a fixed-size stack frame for this function.
+ *
+ * Any non-inlined function may make further use of the stack, altering the
+ * lowest stack pointer and/or clobbering poison values. To avoid spurious
+ * failures we must avoid printing until the end of the test or have already
+ * encountered a failure condition.
+ */
+static void noinstr check_stackleak_irqoff(void)
{
const unsigned long task_stack_base = (unsigned long)task_stack_page(current);
const unsigned long task_stack_low = stackleak_task_low_bound(current);
@@ -81,3 +94,12 @@ void lkdtm_STACKLEAK_ERASING(void)
pr_info("OK: the rest of the thread stack is properly erased\n");
}
}
+
+void lkdtm_STACKLEAK_ERASING(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ check_stackleak_irqoff();
+ local_irq_restore(flags);
+}
--
2.30.2
The stackleak_erase() code dynamically handles being on a task stack or
another stack. In most cases, this is a fixed property of the caller,
which the caller is aware of, as an architecture might always return
using the task stack, or might always return using a trampoline stack.
This patch adds stackleak_erase_on_task_stack() and
stackleak_erase_off_task_stack() functions which callers can use to
avoid the on_thread_stack() check and associated redundant work when the
calling stack is known. The existing stackleak_erase() is retained as a
safe default.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
kernel/stackleak.c | 35 ++++++++++++++++++++++++++++++++---
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index afd54b8e10b83..c2c33d2202e9a 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -70,7 +70,7 @@ late_initcall(stackleak_sysctls_init);
#define skip_erasing() false
#endif /* CONFIG_STACKLEAK_RUNTIME_DISABLE */
-static __always_inline void __stackleak_erase(void)
+static __always_inline void __stackleak_erase(bool on_task_stack)
{
const unsigned long task_stack_low = stackleak_task_low_bound(current);
const unsigned long task_stack_high = stackleak_task_high_bound(current);
@@ -96,7 +96,7 @@ static __always_inline void __stackleak_erase(void)
* function has a fixed-size stack frame, and the current stack pointer
* doesn't change while we write poison.
*/
- if (on_thread_stack())
+ if (on_task_stack)
erase_high = current_stack_pointer;
else
erase_high = task_stack_high;
@@ -110,12 +110,41 @@ static __always_inline void __stackleak_erase(void)
current->lowest_stack = task_stack_high;
}
+/*
+ * Erase and poison the portion of the task stack used since the last erase.
+ * Can be called from the task stack or an entry stack when the task stack is
+ * no longer in use.
+ */
asmlinkage void noinstr stackleak_erase(void)
{
if (skip_erasing())
return;
- __stackleak_erase();
+ __stackleak_erase(on_thread_stack());
+}
+
+/*
+ * Erase and poison the portion of the task stack used since the last erase.
+ * Can only be called from the task stack.
+ */
+asmlinkage void noinstr stackleak_erase_on_task_stack(void)
+{
+ if (skip_erasing())
+ return;
+
+ __stackleak_erase(true);
+}
+
+/*
+ * Erase and poison the portion of the task stack used since the last erase.
+ * Can only be called from a stack other than the task stack.
+ */
+asmlinkage void noinstr stackleak_erase_off_task_stack(void)
+{
+ if (skip_erasing())
+ return;
+
+ __stackleak_erase(false);
}
void __used __no_caller_saved_registers noinstr stackleak_track_stack(void)
--
2.30.2
On arm64 we always call stackleak_erase() on a task stack, and never
call it on another stack. We can avoid some redundant work by using
stackleak_erase_on_task_stack(), telling the stackleak code that it's
being called on a task stack.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/kernel/entry.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ede028dee81b0..5b82b92924005 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -596,7 +596,7 @@ SYM_CODE_START_LOCAL(ret_to_user)
ldr x19, [tsk, #TSK_TI_FLAGS] // re-check for single-step
enable_step_tsk x19, x2
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
- bl stackleak_erase
+ bl stackleak_erase_on_task_stack
#endif
kernel_exit 0
SYM_CODE_END(ret_to_user)
--
2.30.2
There are a few problems with the way the LKDTM STACKLEAK_ERASING test
manipulates the stack pointer and boundary values:
* It uses the address of a local variable to determine the current stack
pointer, rather than using current_stack_pointer directly. As the
local variable could be placed anywhere within the stack frame, this
can be an over-estimate of the true stack pointer value.
* It uses an estimate of the current stack pointer as the upper boundary
when scanning for poison, even though prior functions could have used
more stack (and may have updated current->lowest stack accordingly).
* A pr_info() call is made in the middle of the test. As the printk()
code is out-of-line and will make use of the stack, this could clobber
poison and/or adjust current->lowest_stack. It would be better to log
the metadata after the body of the test to avoid such problems.
These have been observed to result in spurious test failures on arm64.
In addition to this there are a couple of things which are sub-optimal:
* To avoid the STACK_END_MAGIC value, it conditionally modifies 'left'
if this contains more than a single element, when it could instead
calculate the bound unconditionally using stackleak_task_low_bound().
* It open-codes the poison scanning. It would be better if this used the
same helper code as the erasing function so that the two cannot
diverge.
This patch reworks the test to avoid these issues, making use of the
recently introduced helpers to ensure this is aligned with the regular
stackleak code.
As the new code tests stack boundaries before accessing the stack, there
is no need to fail early when the tracked or untracked portions of the
stack extend all the way to the low stack boundary.
As stackleak_find_top_of_poison() is now used to find the top of the
poisoned region of the stack, the subsequent poison checking starts at
this boundary and verifies that stackleak_find_top_of_poison() is
working correctly.
The pr_info() which logged the untracked portion of stack is now moved
to the end of the function, and logs the size of all the portions of the
stack relevant to the test, including the portions at the top and bottom
of the stack which are not erased or scanned, and the current / lowest
recorded stack usage.
Tested on x86_64:
| # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
| lkdtm: Performing direct entry STACKLEAK_ERASING
| lkdtm: stackleak stack usage:
| high offset: 168 bytes
| current: 336 bytes
| lowest: 656 bytes
| tracked: 656 bytes
| untracked: 400 bytes
| poisoned: 15152 bytes
| low offset: 8 bytes
| lkdtm: OK: the rest of the thread stack is properly erased
Tested on arm64:
| # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
| lkdtm: Performing direct entry STACKLEAK_ERASING
| lkdtm: stackleak stack usage:
| high offset: 336 bytes
| current: 656 bytes
| lowest: 1232 bytes
| tracked: 1232 bytes
| untracked: 672 bytes
| poisoned: 14136 bytes
| low offset: 8 bytes
| lkdtm: OK: the rest of the thread stack is properly erased
Tested on arm64 with deliberate breakage to the starting stack value and
poison scanning:
| # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
| lkdtm: Performing direct entry STACKLEAK_ERASING
| lkdtm: FAIL: non-poison value 24 bytes below poison boundary: 0x0
| lkdtm: FAIL: non-poison value 32 bytes below poison boundary: 0xffff8000083dbc00
...
| lkdtm: FAIL: non-poison value 1912 bytes below poison boundary: 0x78b4b9999e8cb15
| lkdtm: FAIL: non-poison value 1920 bytes below poison boundary: 0xffff8000083db400
| lkdtm: stackleak stack usage:
| high offset: 336 bytes
| current: 688 bytes
| lowest: 1232 bytes
| tracked: 576 bytes
| untracked: 288 bytes
| poisoned: 15176 bytes
| low offset: 8 bytes
| lkdtm: FAIL: the thread stack is NOT properly erased!
| lkdtm: Unexpected! This kernel (5.18.0-rc1-00013-g1f7b1f1e29e0-dirty aarch64) was built with CONFIG_GCC_PLUGIN_STACKLEAK=y
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
---
drivers/misc/lkdtm/stackleak.c | 86 +++++++++++++++++++---------------
1 file changed, 47 insertions(+), 39 deletions(-)
diff --git a/drivers/misc/lkdtm/stackleak.c b/drivers/misc/lkdtm/stackleak.c
index 707d530d509b7..0aafa46ced7d6 100644
--- a/drivers/misc/lkdtm/stackleak.c
+++ b/drivers/misc/lkdtm/stackleak.c
@@ -13,59 +13,67 @@
void lkdtm_STACKLEAK_ERASING(void)
{
- unsigned long *sp, left, found, i;
- const unsigned long check_depth =
- STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
+ const unsigned long task_stack_base = (unsigned long)task_stack_page(current);
+ const unsigned long task_stack_low = stackleak_task_low_bound(current);
+ const unsigned long task_stack_high = stackleak_task_high_bound(current);
+ const unsigned long current_sp = current_stack_pointer;
+ const unsigned long lowest_sp = current->lowest_stack;
+ unsigned long untracked_high;
+ unsigned long poison_high, poison_low;
bool test_failed = false;
/*
- * For the details about the alignment of the poison values, see
- * the comment in stackleak_track_stack().
+ * Depending on what has run prior to this test, the lowest recorded
+ * stack pointer could be above or below the current stack pointer.
+ * Start from the lowest of the two.
+ *
+ * Poison values are naturally-aligned unsigned longs. As the current
+ * stack pointer might not be sufficiently aligned, we must align
+ * downwards to find the lowest known stack pointer value. This is the
+ * high boundary for a portion of the stack which may have been used
+ * without being tracked, and has to be scanned for poison.
*/
- sp = PTR_ALIGN(&i, sizeof(unsigned long));
-
- left = ((unsigned long)sp & (THREAD_SIZE - 1)) / sizeof(unsigned long);
- sp--;
+ untracked_high = min(current_sp, lowest_sp);
+ untracked_high = ALIGN_DOWN(untracked_high, sizeof(unsigned long));
/*
- * One 'long int' at the bottom of the thread stack is reserved
- * and not poisoned.
+ * Find the top of the poison in the same way as the erasing code.
*/
- if (left > 1) {
- left--;
- } else {
- pr_err("FAIL: not enough stack space for the test\n");
- test_failed = true;
- goto end;
- }
-
- pr_info("checking unused part of the thread stack (%lu bytes)...\n",
- left * sizeof(unsigned long));
+ poison_high = stackleak_find_top_of_poison(task_stack_low, untracked_high);
/*
- * Search for 'check_depth' poison values in a row (just like
- * stackleak_erase() does).
+ * Check whether the poisoned portion of the stack (if any) consists
+ * entirely of poison. This verifies the entries that
+ * stackleak_find_top_of_poison() should have checked.
*/
- for (i = 0, found = 0; i < left && found <= check_depth; i++) {
- if (*(sp - i) == STACKLEAK_POISON)
- found++;
- else
- found = 0;
- }
+ poison_low = poison_high;
+ while (poison_low > task_stack_low) {
+ poison_low -= sizeof(unsigned long);
- pr_info("the erased part begins after %lu not poisoned bytes\n",
- (i - found) * sizeof(unsigned long));
+ if (*(unsigned long *)poison_low == STACKLEAK_POISON)
+ continue;
- /* The rest of thread stack should be erased */
- for (; i < left; i++) {
- if (*(sp - i) != STACKLEAK_POISON) {
- pr_err("FAIL: bad value number %lu in the erased part: 0x%lx\n",
- i, *(sp - i));
- test_failed = true;
- }
+ pr_err("FAIL: non-poison value %lu bytes below poison boundary: 0x%lx\n",
+ poison_high - poison_low, *(unsigned long *)poison_low);
+ test_failed = true;
}
-end:
+ pr_info("stackleak stack usage:\n"
+ " high offset: %lu bytes\n"
+ " current: %lu bytes\n"
+ " lowest: %lu bytes\n"
+ " tracked: %lu bytes\n"
+ " untracked: %lu bytes\n"
+ " poisoned: %lu bytes\n"
+ " low offset: %lu bytes\n",
+ task_stack_base + THREAD_SIZE - task_stack_high,
+ task_stack_high - current_sp,
+ task_stack_high - lowest_sp,
+ task_stack_high - untracked_high,
+ untracked_high - poison_high,
+ poison_high - task_stack_low,
+ task_stack_low - task_stack_base);
+
if (test_failed) {
pr_err("FAIL: the thread stack is NOT properly erased!\n");
pr_expected_config(CONFIG_GCC_PLUGIN_STACKLEAK);
--
2.30.2
The stackleak code relies upon the current SP and lowest recorded SP
falling within expected task stack boundaries.
Check this at the start of the test.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Will Deacon <[email protected]>
---
drivers/misc/lkdtm/stackleak.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/misc/lkdtm/stackleak.c b/drivers/misc/lkdtm/stackleak.c
index 46c60761a05ea..52800583fd051 100644
--- a/drivers/misc/lkdtm/stackleak.c
+++ b/drivers/misc/lkdtm/stackleak.c
@@ -35,6 +35,25 @@ static void noinstr check_stackleak_irqoff(void)
unsigned long poison_high, poison_low;
bool test_failed = false;
+ /*
+ * Check that the current and lowest recorded stack pointer values fall
+ * within the expected task stack boundaries. These tests should never
+ * fail unless the boundaries are incorrect or we're clobbering the
+ * STACK_END_MAGIC, and in either case something is seriously wrong.
+ */
+ if (current_sp < task_stack_low || current_sp >= task_stack_high) {
+ pr_err("FAIL: current_stack_pointer (0x%lx) outside of task stack bounds [0x%lx..0x%lx]\n",
+ current_sp, task_stack_low, task_stack_high - 1);
+ test_failed = true;
+ goto out;
+ }
+ if (lowest_sp < task_stack_low || lowest_sp >= task_stack_high) {
+ pr_err("FAIL: current->lowest_stack (0x%lx) outside of task stack bounds [0x%lx..0x%lx]\n",
+ lowest_sp, task_stack_low, task_stack_high - 1);
+ test_failed = true;
+ goto out;
+ }
+
/*
* Depending on what has run prior to this test, the lowest recorded
* stack pointer could be above or below the current stack pointer.
@@ -87,6 +106,7 @@ static void noinstr check_stackleak_irqoff(void)
poison_high - task_stack_low,
task_stack_low - task_stack_base);
+out:
if (test_failed) {
pr_err("FAIL: the thread stack is NOT properly erased!\n");
pr_expected_config(CONFIG_GCC_PLUGIN_STACKLEAK);
--
2.30.2
On Wed, 27 Apr 2022 18:31:15 +0100, Mark Rutland wrote:
> This is the vs I promised. Since Alexander wanted to look at this in
> more detail (and since this is subtle and needs review), I'm assuming
> that Kees will pick this up some time next week after that's happened,
> if all goes well. :)
>
> This series reworks the stackleak code and the associated LKDTM test.
> The first patch fixes some latent issues on arm64, and the subsequent
> patches improve the code to improve clarity and permit better code
> generation. Patches 8-10 address some latent issues in the LKDTM test
> and add more diagnostic output.
>
> [...]
I fixed some small commit log typos, but otherwise this looks great. If
anything new comes up we can adjust it.
Applied to for-next/hardening, thanks!
[01/13] arm64: stackleak: fix current_top_of_stack()
https://git.kernel.org/kees/c/4c849d27b729
[02/13] stackleak: move skip_erasing() check earlier
https://git.kernel.org/kees/c/e98a7c56d73c
[03/13] stackleak: remove redundant check
https://git.kernel.org/kees/c/e45d9f71deea
[04/13] stackleak: rework stack low bound handling
https://git.kernel.org/kees/c/cbe7edb47d3c
[05/13] stackleak: clarify variable names
https://git.kernel.org/kees/c/e9da2241ed85
[06/13] stackleak: rework stack high bound handling
https://git.kernel.org/kees/c/cfef4372a4b7
[07/13] stackleak: rework poison scanning
https://git.kernel.org/kees/c/ff5f6d37e5bc
[08/13] lkdtm/stackleak: avoid spurious failure
https://git.kernel.org/kees/c/23fd893fa0d7
[09/13] lkdtm/stackleak: rework boundary management
https://git.kernel.org/kees/c/f4cfacd92972
[10/13] lkdtm/stackleak: prevent unexpected stack usage
https://git.kernel.org/kees/c/c393c0b98d75
[11/13] lkdtm/stackleak: check stack boundaries
https://git.kernel.org/kees/c/b6bf5a354eca
[12/13] stackleak: add on/off stack variants
https://git.kernel.org/kees/c/96c59349a56c
[13/13] arm64: entry: use stackleak_erase_on_task_stack()
https://git.kernel.org/kees/c/d46ac904fd35
--
Kees Cook
On Wed, Apr 27, 2022 at 06:31:24PM +0100, Mark Rutland wrote:
> There are a few problems with the way the LKDTM STACKLEAK_ERASING test
> manipulates the stack pointer and boundary values:
>
> * It uses the address of a local variable to determine the current stack
> pointer, rather than using current_stack_pointer directly. As the
> local variable could be placed anywhere within the stack frame, this
> can be an over-estimate of the true stack pointer value.
>
> * It uses an estimate of the current stack pointer as the upper boundary
> when scanning for poison, even though prior functions could have used
> more stack (and may have updated current->lowest_stack accordingly).
>
> * A pr_info() call is made in the middle of the test. As the printk()
> code is out-of-line and will make use of the stack, this could clobber
> poison and/or adjust current->lowest_stack. It would be better to log
> the metadata after the body of the test to avoid such problems.
Yeah, I noticed this too when I was testing the v1 series. I started
cleaning it up, but your version is much better. :)
--
Kees Cook
On Wed, Apr 27, 2022 at 06:31:28PM +0100, Mark Rutland wrote:
> On arm64 we always call stackleak_erase() on a task stack, and never
> call it on another stack. We can avoid some redundant work by using
> stackleak_erase_on_task_stack(), telling the stackleak code that it's
> being called on a task stack.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexander Popov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Will Deacon <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
On 27.04.2022 20:31, Mark Rutland wrote:
> The logic within __stackleak_erase() can be a little hard to follow, as
> `boundary` switches from being the low bound to the high bound mid way
> through the function, and `kstack_ptr` is used to represent the start of
> the region to erase while `boundary` represents the end of the region to
> erase.
>
> Make this a little clearer by consistently using clearer variable names.
> The `boundary` variable is removed, the bounds of the region to erase
> are described by `erase_low` and `erase_high`, and bounds of the task
> stack are described by `task_stack_low` and `task_stck_high`.
A typo here in `task_stck_high`.
> At the same time, remove the comment above the variables, since it is
> unclear whether it's intended as rationale, a complaint, or a TODO, and
> is more confusing than helpful.
Yes, this comment is a bit confusing :) I can elaborate.
In the original grsecurity patch, the stackleak erasing was written in asm.
When I adopted it and proposed for the upstream, Linus strongly opposed this.
So I developed stackleak erasing in C.
And I wrote this comment to remember that having 'kstack_ptr' and 'boundary'
variables on the stack (which we are clearing) would not be good.
That was also the main reason why I reused the 'boundary' variable: I wanted the
compiler to allocate it in the register and I avoided creating many local variables.
Mark, did your refactoring make the compiler allocate local variables on the
stack instead of the registers?
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexander Popov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> kernel/stackleak.c | 30 ++++++++++++++----------------
> 1 file changed, 14 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> index 24b7cf01b2972..d5f684dc0a2d9 100644
> --- a/kernel/stackleak.c
> +++ b/kernel/stackleak.c
> @@ -73,40 +73,38 @@ late_initcall(stackleak_sysctls_init);
> static __always_inline void __stackleak_erase(void)
> {
> const unsigned long task_stack_low = stackleak_task_low_bound(current);
> -
> - /* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
> - unsigned long kstack_ptr = current->lowest_stack;
> - unsigned long boundary = task_stack_low;
> + unsigned long erase_low = current->lowest_stack;
> + unsigned long erase_high;
> unsigned int poison_count = 0;
> const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
>
> /* Search for the poison value in the kernel stack */
> - while (kstack_ptr > boundary && poison_count <= depth) {
> - if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
> + while (erase_low > task_stack_low && poison_count <= depth) {
> + if (*(unsigned long *)erase_low == STACKLEAK_POISON)
> poison_count++;
> else
> poison_count = 0;
>
> - kstack_ptr -= sizeof(unsigned long);
> + erase_low -= sizeof(unsigned long);
> }
>
> #ifdef CONFIG_STACKLEAK_METRICS
> - current->prev_lowest_stack = kstack_ptr;
> + current->prev_lowest_stack = erase_low;
> #endif
>
> /*
> - * Now write the poison value to the kernel stack. Start from
> - * 'kstack_ptr' and move up till the new 'boundary'. We assume that
> - * the stack pointer doesn't change when we write poison.
> + * Now write the poison value to the kernel stack between 'erase_low'
> + * and 'erase_high'. We assume that the stack pointer doesn't change
> + * when we write poison.
> */
> if (on_thread_stack())
> - boundary = current_stack_pointer;
> + erase_high = current_stack_pointer;
> else
> - boundary = current_top_of_stack();
> + erase_high = current_top_of_stack();
>
> - while (kstack_ptr < boundary) {
> - *(unsigned long *)kstack_ptr = STACKLEAK_POISON;
> - kstack_ptr += sizeof(unsigned long);
> + while (erase_low < erase_high) {
> + *(unsigned long *)erase_low = STACKLEAK_POISON;
> + erase_low += sizeof(unsigned long);
> }
>
> /* Reset the 'lowest_stack' value for the next syscall */
On 27.04.2022 20:31, Mark Rutland wrote:
> Prior to returning to userpace, we reset current->lowest_stack to a
> reasonable high bound. Currently we do this by subtracting the arbitrary
> value `THREAD_SIZE/64` from the top of the stack, for reasons lost to
> history.
>
> Looking at configurations today:
>
> * On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The
> pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of
> padding above), and so this covers an additional portion of 44 to 60
> bytes.
>
> * On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the
> bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs
> at the top of the stack is 168 bytes, and so this covers an additional
> 88 bytes of stack (up to 344 with KASAN).
>
> * On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages
> and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with
> KASAN). The pt_regs at the top of the stack is 336 bytes, so this can
> fall within the pt_regs, or can cover an additional 688 bytes of
> stack.
>
> Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the
> worst case, this will cause more than 600 bytes of stack to be erased
> for every syscall, even if actual stack usage were substantially
> smaller.
>
> This patch makes this slightly less nonsensical by consistently
> resetting current->lowest_stack to the base of the task pt_regs. For
> clarity and for consistency with the handling of the low bound, the
> generation of the high bound is split into a helper with commentary
> explaining why.
>
> Since the pt_regs at the top of the stack will be clobbered upon the
> next exception entry, we don't need to poison these at exception exit.
> By using task_pt_regs() as the high stack boundary instead of
> current_top_of_stack() we avoid some redundant poisoning, and the
> compiler can share the address generation between the poisoning and
> resetting of `current->lowest_stack`, making the generated code more
> optimal.
>
> It's not clear to me whether the existing `THREAD_SIZE/64` offset was a
> dodgy heuristic to skip the pt_regs, or whether it was attempting to
> minimize the number of times stackleak_check_stack() would have to
> update `current->lowest_stack` when stack usage was shallow at the cost
> of unconditionally poisoning a small portion of the stack for every exit
> to userspace.
I inherited this 'THREAD_SIZE/64' logic from the original grsecurity patch.
As I mentioned, originally this was written in asm.
For x86_64:
mov TASK_thread_sp0(%r11), %rdi
sub $256, %rdi
mov %rdi, TASK_lowest_stack(%r11)
For x86_32:
mov TASK_thread_sp0(%ebp), %edi
sub $128, %edi
mov %edi, TASK_lowest_stack(%ebp)
256 bytes for x86_64 and 128 bytes for x86_32 are exactly THREAD_SIZE/64.
I think this value was chosen as optimal for minimizing poison scanning.
It's possible that stackleak_track_stack() is not called during the syscall
because all the called functions have small stack frames.
> For now I've simply removed the offset, and if we need/want to minimize
> updates for shallow stack usage it should be easy to add a better
> heuristic atop, with appropriate commentary so we know what's going on.
I like your idea to erase the thread stack up to pt_regs if we call the
stackleak erasing from the trampoline stack.
But here I don't understand where task_pt_regs() points to...
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexander Popov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> include/linux/stackleak.h | 14 ++++++++++++++
> kernel/stackleak.c | 19 ++++++++++++++-----
> 2 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
> index 67430faa5c518..467661aeb4136 100644
> --- a/include/linux/stackleak.h
> +++ b/include/linux/stackleak.h
> @@ -28,6 +28,20 @@ stackleak_task_low_bound(const struct task_struct *tsk)
> return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
> }
>
> +/*
> + * The address immediately after the highest address on tsk's stack which we
> + * can plausibly erase.
> + */
> +static __always_inline unsigned long
> +stackleak_task_high_bound(const struct task_struct *tsk)
> +{
> + /*
> + * The task's pt_regs lives at the top of the task stack and will be
> + * overwritten by exception entry, so there's no need to erase them.
> + */
> + return (unsigned long)task_pt_regs(tsk);
> +}
> +
> static inline void stackleak_task_init(struct task_struct *t)
> {
> t->lowest_stack = stackleak_task_low_bound(t);
> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> index d5f684dc0a2d9..ba346d46218f5 100644
> --- a/kernel/stackleak.c
> +++ b/kernel/stackleak.c
> @@ -73,6 +73,7 @@ late_initcall(stackleak_sysctls_init);
> static __always_inline void __stackleak_erase(void)
> {
> const unsigned long task_stack_low = stackleak_task_low_bound(current);
> + const unsigned long task_stack_high = stackleak_task_high_bound(current);
> unsigned long erase_low = current->lowest_stack;
> unsigned long erase_high;
> unsigned int poison_count = 0;
> @@ -93,14 +94,22 @@ static __always_inline void __stackleak_erase(void)
> #endif
>
> /*
> - * Now write the poison value to the kernel stack between 'erase_low'
> - * and 'erase_high'. We assume that the stack pointer doesn't change
> - * when we write poison.
> + * Write poison to the task's stack between 'erase_low' and
> + * 'erase_high'.
> + *
> + * If we're running on a different stack (e.g. an entry trampoline
> + * stack) we can erase everything below the pt_regs at the top of the
> + * task stack.
> + *
> + * If we're running on the task stack itself, we must not clobber any
> + * stack used by this function and its caller. We assume that this
> + * function has a fixed-size stack frame, and the current stack pointer
> + * doesn't change while we write poison.
> */
> if (on_thread_stack())
> erase_high = current_stack_pointer;
> else
> - erase_high = current_top_of_stack();
> + erase_high = task_stack_high;
>
> while (erase_low < erase_high) {
> *(unsigned long *)erase_low = STACKLEAK_POISON;
> @@ -108,7 +117,7 @@ static __always_inline void __stackleak_erase(void)
> }
>
> /* Reset the 'lowest_stack' value for the next syscall */
> - current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
> + current->lowest_stack = task_stack_high;
> }
>
> asmlinkage void noinstr stackleak_erase(void)
Hello Mark!
On 27.04.2022 20:31, Mark Rutland wrote:
> Currently we over-estimate the region of stack which must be erased.
>
> To determine the region to be erased, we scan downwards for a contiguous
> block of poison values (or the low bound of the stack). There are a few
> minor problems with this today:
>
> * When we find a block of poison values, we include this block within
> the region to erase.
>
> As this is included within the region to erase, this causes us to
> redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
> poison.
Right, this can be improved.
> * As the loop condition checks 'poison_count <= depth', it will run an
> additional iteration after finding the contiguous block of poison,
> decrementing 'erase_low' once more than necessary.
Actually, I think the current code is correct.
I'm intentionally searching one poison value more than STACKLEAK_SEARCH_DEPTH to
avoid the corner case. See the BUILD_BUG_ON assertion in stackleak_track_stack()
that describes it:
/*
* Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
* STACKLEAK_SEARCH_DEPTH makes the poison search in
* stackleak_erase() unreliable. Let's prevent that.
*/
BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);
> As this is included within the region to erase, this causes us to
> redundantly overwrite an additional unsigned long with poison.
>
> * As we always decrement 'erase_low' after checking an element on the
> stack, we always include the element below this within the region to
> erase.
>
> As this is included within the region to erase, this causes us to
> redundantly overwrite an additional unsigned long with poison.
>
> Note that this is not a functional problem. As the loop condition
> checks 'erase_low > task_stack_low', we'll never clobber the
> STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
> never fail to erase the element immediately above the STACK_END_MAGIC.
Right, I don't see any bug in the current erasing code.
When I wrote the current code, I carefully checked all the corner cases. For
example, on the first stack erasing, the STACK_END_MAGIC was not overwritten,
but the memory next to it was erased. Same for the beginning of the stack: I
carefully checked that no unpoisoned bytes were left on the thread stack.
> In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
> bytes more than necessary, which is unfortunate.
>
> This patch reworks the logic to find the address immediately above the
> poisoned region, by finding the lowest non-poisoned address. This is
> factored into a stackleak_find_top_of_poison() helper both for clarity
> and so that this can be shared with the LKDTM test in subsequent
> patches.
You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
generate assembly that is very close to the original asm version. I worried that
compilers might do weird stuff with the local variables and the stack pointer.
So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on x86_64,
i386 and arm64. This is my project that helped with this work:
https://github.com/a13xp0p0v/kernel-build-containers
Mark, in this patch series you use many local variables and helper functions.
Honestly, this worries me. For example, compilers can (and usually do) ignore
the presence of the 'inline' specifier for the purpose of optimization.
Thanks!
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexander Popov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
> kernel/stackleak.c | 18 ++++--------------
> 2 files changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
> index 467661aeb4136..c36e7a3b45e7e 100644
> --- a/include/linux/stackleak.h
> +++ b/include/linux/stackleak.h
> @@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
> return (unsigned long)task_pt_regs(tsk);
> }
>
> +/*
> + * Find the address immediately above the poisoned region of the stack, where
> + * that region falls between 'low' (inclusive) and 'high' (exclusive).
> + */
> +static __always_inline unsigned long
> +stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
> +{
> + const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> + unsigned int poison_count = 0;
> + unsigned long poison_high = high;
> + unsigned long sp = high;
> +
> + while (sp > low && poison_count < depth) {
> + sp -= sizeof(unsigned long);
> +
> + if (*(unsigned long *)sp == STACKLEAK_POISON) {
> + poison_count++;
> + } else {
> + poison_count = 0;
> + poison_high = sp;
> + }
> + }
> +
> + return poison_high;
> +}
> +
> static inline void stackleak_task_init(struct task_struct *t)
> {
> t->lowest_stack = stackleak_task_low_bound(t);
> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> index ba346d46218f5..afd54b8e10b83 100644
> --- a/kernel/stackleak.c
> +++ b/kernel/stackleak.c
> @@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
> {
> const unsigned long task_stack_low = stackleak_task_low_bound(current);
> const unsigned long task_stack_high = stackleak_task_high_bound(current);
> - unsigned long erase_low = current->lowest_stack;
> - unsigned long erase_high;
> - unsigned int poison_count = 0;
> - const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> -
> - /* Search for the poison value in the kernel stack */
> - while (erase_low > task_stack_low && poison_count <= depth) {
> - if (*(unsigned long *)erase_low == STACKLEAK_POISON)
> - poison_count++;
> - else
> - poison_count = 0;
> -
> - erase_low -= sizeof(unsigned long);
> - }
> + unsigned long erase_low, erase_high;
> +
> + erase_low = stackleak_find_top_of_poison(task_stack_low,
> + current->lowest_stack);
>
> #ifdef CONFIG_STACKLEAK_METRICS
> current->prev_lowest_stack = erase_low;
On Mon, May 09, 2022 at 04:51:35PM +0300, Alexander Popov wrote:
> Hello Mark!
>
> On 27.04.2022 20:31, Mark Rutland wrote:
> > Currently we over-estimate the region of stack which must be erased.
> >
> > To determine the region to be erased, we scan downwards for a contiguous
> > block of poison values (or the low bound of the stack). There are a few
> > minor problems with this today:
> >
> > * When we find a block of poison values, we include this block within
> > the region to erase.
> >
> > As this is included within the region to erase, this causes us to
> > redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
> > poison.
>
> Right, this can be improved.
>
> > * As the loop condition checks 'poison_count <= depth', it will run an
> > additional iteration after finding the contiguous block of poison,
> > decrementing 'erase_low' once more than necessary.
>
> Actually, I think the current code is correct.
>
> I'm intentionally searching one poison value more than
> STACKLEAK_SEARCH_DEPTH to avoid the corner case. See the BUILD_BUG_ON
> assertion in stackleak_track_stack() that describes it:
>
> /*
> * Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
> * STACKLEAK_SEARCH_DEPTH makes the poison search in
> * stackleak_erase() unreliable. Let's prevent that.
> */
> BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);
I had read that, but as written that doesn't imply that it's necessary to scan
one more element than STACKLEAK_SEARCH_DEPTH, nor why. I'm more than happy to
change the logic, but I think we need a very clear explanation as to why we
need to scan the specific number of bytes we scan, and we should account for
that *within* STACKLEAK_SEARCH_DEPTH for clarity.
> > As this is included within the region to erase, this causes us to
> > redundantly overwrite an additional unsigned long with poison.
> >
> > * As we always decrement 'erase_low' after checking an element on the
> > stack, we always include the element below this within the region to
> > erase.
> >
> > As this is included within the region to erase, this causes us to
> > redundantly overwrite an additional unsigned long with poison.
> >
> > Note that this is not a functional problem. As the loop condition
> > checks 'erase_low > task_stack_low', we'll never clobber the
> > STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
> > never fail to erase the element immediately above the STACK_END_MAGIC.
>
> Right, I don't see any bug in the current erasing code.
>
> When I wrote the current code, I carefully checked all the corner cases. For
> example, on the first stack erasing, the STACK_END_MAGIC was not
> overwritten, but the memory next to it was erased. Same for the beginning of
> the stack: I carefully checked that no unpoisoned bytes were left on the
> thread stack.
>
> > In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
> > bytes more than necessary, which is unfortunate.
> >
> > This patch reworks the logic to find the address immediately above the
> > poisoned region, by finding the lowest non-poisoned address. This is
> > factored into a stackleak_find_top_of_poison() helper both for clarity
> > and so that this can be shared with the LKDTM test in subsequent
> > patches.
>
> You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
> generate assembly that is very close to the original asm version. I worried
> that compilers might do weird stuff with the local variables and the stack
> pointer.
>
> So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on
> x86_64, i386 and arm64. This is my project that helped with this work:
> https://github.com/a13xp0p0v/kernel-build-containers
I've used the kernel.org cross toolchains, as published at:
https://mirrors.edge.kernel.org/pub/tools/crosstool/
> Mark, in this patch series you use many local variables and helper functions.
> Honestly, this worries me. For example, compilers can (and usually do)
> ignore the presence of the 'inline' specifier for the purpose of
> optimization.
I've deliberately used `__always_inline` rather than regular `inline` to
prevent code being placed out-of-line. As mentioned in other replies, it has
stronger semantics.
Thanks,
Mark.
>
> Thanks!
>
> > Signed-off-by: Mark Rutland <[email protected]>
> > Cc: Alexander Popov <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
> > kernel/stackleak.c | 18 ++++--------------
> > 2 files changed, 30 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
> > index 467661aeb4136..c36e7a3b45e7e 100644
> > --- a/include/linux/stackleak.h
> > +++ b/include/linux/stackleak.h
> > @@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
> > return (unsigned long)task_pt_regs(tsk);
> > }
> > +/*
> > + * Find the address immediately above the poisoned region of the stack, where
> > + * that region falls between 'low' (inclusive) and 'high' (exclusive).
> > + */
> > +static __always_inline unsigned long
> > +stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
> > +{
> > + const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> > + unsigned int poison_count = 0;
> > + unsigned long poison_high = high;
> > + unsigned long sp = high;
> > +
> > + while (sp > low && poison_count < depth) {
> > + sp -= sizeof(unsigned long);
> > +
> > + if (*(unsigned long *)sp == STACKLEAK_POISON) {
> > + poison_count++;
> > + } else {
> > + poison_count = 0;
> > + poison_high = sp;
> > + }
> > + }
> > +
> > + return poison_high;
> > +}
> > +
> > static inline void stackleak_task_init(struct task_struct *t)
> > {
> > t->lowest_stack = stackleak_task_low_bound(t);
> > diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> > index ba346d46218f5..afd54b8e10b83 100644
> > --- a/kernel/stackleak.c
> > +++ b/kernel/stackleak.c
> > @@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
> > {
> > const unsigned long task_stack_low = stackleak_task_low_bound(current);
> > const unsigned long task_stack_high = stackleak_task_high_bound(current);
> > - unsigned long erase_low = current->lowest_stack;
> > - unsigned long erase_high;
> > - unsigned int poison_count = 0;
> > - const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> > -
> > - /* Search for the poison value in the kernel stack */
> > - while (erase_low > task_stack_low && poison_count <= depth) {
> > - if (*(unsigned long *)erase_low == STACKLEAK_POISON)
> > - poison_count++;
> > - else
> > - poison_count = 0;
> > -
> > - erase_low -= sizeof(unsigned long);
> > - }
> > + unsigned long erase_low, erase_high;
> > +
> > + erase_low = stackleak_find_top_of_poison(task_stack_low,
> > + current->lowest_stack);
> > #ifdef CONFIG_STACKLEAK_METRICS
> > current->prev_lowest_stack = erase_low;
>
On Mon, May 09, 2022 at 12:27:22AM +0300, Alexander Popov wrote:
> On 27.04.2022 20:31, Mark Rutland wrote:
> > Prior to returning to userspace, we reset current->lowest_stack to a
> > reasonable high bound. Currently we do this by subtracting the arbitrary
> > value `THREAD_SIZE/64` from the top of the stack, for reasons lost to
> > history.
> >
> > Looking at configurations today:
> >
> > * On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The
> > pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of
> > padding above), and so this covers an additional portion of 44 to 60
> > bytes.
> >
> > * On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the
> > bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs
> > at the top of the stack is 168 bytes, and so this covers an additional
> > 88 bytes of stack (up to 344 with KASAN).
> >
> > * On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages
> > and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with
> > KASAN). The pt_regs at the top of the stack is 336 bytes, so this can
> > fall within the pt_regs, or can cover an additional 688 bytes of
> > stack.
> >
> > Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the
> > worst case, this will cause more than 600 bytes of stack to be erased
> > for every syscall, even if actual stack usage were substantially
> > smaller.
> >
> > This patch makes this slightly less nonsensical by consistently
> > resetting current->lowest_stack to the base of the task pt_regs. For
> > clarity and for consistency with the handling of the low bound, the
> > generation of the high bound is split into a helper with commentary
> > explaining why.
> >
> > Since the pt_regs at the top of the stack will be clobbered upon the
> > next exception entry, we don't need to poison these at exception exit.
> > By using task_pt_regs() as the high stack boundary instead of
> > current_top_of_stack() we avoid some redundant poisoning, and the
> > compiler can share the address generation between the poisoning and
> > resetting of `current->lowest_stack`, making the generated code more
> > optimal.
> >
> > It's not clear to me whether the existing `THREAD_SIZE/64` offset was a
> > dodgy heuristic to skip the pt_regs, or whether it was attempting to
> > minimize the number of times stackleak_check_stack() would have to
> > update `current->lowest_stack` when stack usage was shallow at the cost
> > of unconditionally poisoning a small portion of the stack for every exit
> > to userspace.
>
> I inherited this 'THREAD_SIZE/64' logic from the original grsecurity patch.
> As I mentioned, originally this was written in asm.
>
> For x86_64:
> mov TASK_thread_sp0(%r11), %rdi
> sub $256, %rdi
> mov %rdi, TASK_lowest_stack(%r11)
>
> For x86_32:
> mov TASK_thread_sp0(%ebp), %edi
> sub $128, %edi
> mov %edi, TASK_lowest_stack(%ebp)
>
> 256 bytes for x86_64 and 128 bytes for x86_32 are exactly THREAD_SIZE/64.
To check my understanding, did you arrive at the `THREAD_SIZE/64`
calculation by reverse-engineering the assembly?
I strongly suspect that has nothing to do with THREAD_SIZE, and was trying to
fall just below the task's pt_regs (and maybe some unconditional stack usage
just below that).
> I think this value was chosen as optimal for minimizing poison scanning.
> It's possible that stackleak_track_stack() is not called during the syscall
> because all the called functions have small stack frames.
I can believe that, but given the clearing cost appears to dominate the
scanning cost, I suspect we can live with making this precisely the bottom of
the pt_regs.
> > For now I've simply removed the offset, and if we need/want to minimize
> > updates for shallow stack usage it should be easy to add a better
> > heuristic atop, with appropriate commentary so we know what's going on.
>
> I like your idea to erase the thread stack up to pt_regs if we call the
> stackleak erasing from the trampoline stack.
>
> But here I don't understand where task_pt_regs() points to...
As mentioned in the commit message, there's a struct pt_regs at the top of each
task stack (created at exception entry, and consumed by exception return),
which contains the user register state. On arm64 and x86_64, that's *right* at
the top of the stack, but on i386 there can be some padding above that.
task_pt_regs(tsk) points at the base of that pt_regs.
That looks something like the following (with increasing addresses going
upwards):
---------------- <--- task_stack_page(tsk) + THREAD_SIZE
(optional padding)
---------------- <--- task_top_of_stack(tsk) // x86 only
/\
||
|| pt_regs
||
\/
---------------- <--- task_pt_regs(tsk)
/\
||
||
||
|| Usable task stack
||
||
||
\/
----------------
STACK_END_MAGIC
---------------- <--- task_stack_page(tsk)
Thanks,
Mark.
> > Signed-off-by: Mark Rutland <[email protected]>
> > Cc: Alexander Popov <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > include/linux/stackleak.h | 14 ++++++++++++++
> > kernel/stackleak.c | 19 ++++++++++++++-----
> > 2 files changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
> > index 67430faa5c518..467661aeb4136 100644
> > --- a/include/linux/stackleak.h
> > +++ b/include/linux/stackleak.h
> > @@ -28,6 +28,20 @@ stackleak_task_low_bound(const struct task_struct *tsk)
> > return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
> > }
> > +/*
> > + * The address immediately after the highest address on tsk's stack which we
> > + * can plausibly erase.
> > + */
> > +static __always_inline unsigned long
> > +stackleak_task_high_bound(const struct task_struct *tsk)
> > +{
> > + /*
> > + * The task's pt_regs lives at the top of the task stack and will be
> > + * overwritten by exception entry, so there's no need to erase them.
> > + */
> > + return (unsigned long)task_pt_regs(tsk);
> > +}
> > +
> > static inline void stackleak_task_init(struct task_struct *t)
> > {
> > t->lowest_stack = stackleak_task_low_bound(t);
> > diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> > index d5f684dc0a2d9..ba346d46218f5 100644
> > --- a/kernel/stackleak.c
> > +++ b/kernel/stackleak.c
> > @@ -73,6 +73,7 @@ late_initcall(stackleak_sysctls_init);
> > static __always_inline void __stackleak_erase(void)
> > {
> > const unsigned long task_stack_low = stackleak_task_low_bound(current);
> > + const unsigned long task_stack_high = stackleak_task_high_bound(current);
> > unsigned long erase_low = current->lowest_stack;
> > unsigned long erase_high;
> > unsigned int poison_count = 0;
> > @@ -93,14 +94,22 @@ static __always_inline void __stackleak_erase(void)
> > #endif
> > /*
> > - * Now write the poison value to the kernel stack between 'erase_low'
> > - * and 'erase_high'. We assume that the stack pointer doesn't change
> > - * when we write poison.
> > + * Write poison to the task's stack between 'erase_low' and
> > + * 'erase_high'.
> > + *
> > + * If we're running on a different stack (e.g. an entry trampoline
> > + * stack) we can erase everything below the pt_regs at the top of the
> > + * task stack.
> > + *
> > + * If we're running on the task stack itself, we must not clobber any
> > + * stack used by this function and its caller. We assume that this
> > + * function has a fixed-size stack frame, and the current stack pointer
> > + * doesn't change while we write poison.
> > */
> > if (on_thread_stack())
> > erase_high = current_stack_pointer;
> > else
> > - erase_high = current_top_of_stack();
> > + erase_high = task_stack_high;
> > while (erase_low < erase_high) {
> > *(unsigned long *)erase_low = STACKLEAK_POISON;
> > @@ -108,7 +117,7 @@ static __always_inline void __stackleak_erase(void)
> > }
> > /* Reset the 'lowest_stack' value for the next syscall */
> > - current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
> > + current->lowest_stack = task_stack_high;
> > }
> > asmlinkage void noinstr stackleak_erase(void)
>
On Sun, May 08, 2022 at 11:49:46PM +0300, Alexander Popov wrote:
> On 27.04.2022 20:31, Mark Rutland wrote:
> > The logic within __stackleak_erase() can be a little hard to follow, as
> > `boundary` switches from being the low bound to the high bound mid way
> > through the function, and `kstack_ptr` is used to represent the start of
> > the region to erase while `boundary` represents the end of the region to
> > erase.
> >
> > Make this a little clearer by consistently using clearer variable names.
> > The `boundary` variable is removed, the bounds of the region to erase
> > are described by `erase_low` and `erase_high`, and bounds of the task
> > stack are described by `task_stack_low` and `task_stck_high`.
>
> A typo here in `task_stck_high`.
Ah; whoops.
> > As the same time, remove the comment above the variables, since it is
> > unclear whether it's intended as rationale, a complaint, or a TODO, and
> > is more confusing than helpful.
>
> Yes, this comment is a bit confusing :) I can elaborate.
>
> In the original grsecurity patch, the stackleak erasing was written in asm.
> When I adopted it and proposed for the upstream, Linus strongly opposed this.
> So I developed stackleak erasing in C.
>
> And I wrote this comment to remember that having 'kstack_ptr' and 'boundary'
> variables on the stack (which we are clearing) would not be good.
Ok, so I think that falls into the "complaint" bucket I mentioned. I understand
that we don't have any guarantee from the compiler as to how it will use the
stack, and that's obviously a potential problem.
> That was also the main reason why I reused the 'boundary' variable: I wanted
> the compiler to allocate it in the register and I avoided creating many
> local variables.
>
> Mark, did your refactoring make the compiler allocate local variables on the
> stack instead of the registers?
Considering the whole series, testing with GCC 11.1.0:
* On arm64:
before: stackleak_erase() uses 48 bytes of stack
after: stackleak_erase() uses 0 bytes of stack
Note: this is entirely due to patch 1; arm64 has enough GPRs that it
doesn't need to use the stack.
* On x86_64:
before: stackleak_erase() uses 0 bytes of stack
after: stackleak_erase() uses 0 bytes of stack
* On i386
before: stackleak_erase() uses 8 bytes of stack
after: stackleak_erase() uses 16 bytes of stack
The i386 case isn't ideal, but given that those bytes will easily be used by
the entry triage code before getting to any syscall handling, I don't believe
that's an issue in practice.
Thanks,
Mark.
> > There should be no functional change as a result of this patch.
> >
> > Signed-off-by: Mark Rutland <[email protected]>
> > Cc: Alexander Popov <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > kernel/stackleak.c | 30 ++++++++++++++----------------
> > 1 file changed, 14 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> > index 24b7cf01b2972..d5f684dc0a2d9 100644
> > --- a/kernel/stackleak.c
> > +++ b/kernel/stackleak.c
> > @@ -73,40 +73,38 @@ late_initcall(stackleak_sysctls_init);
> > static __always_inline void __stackleak_erase(void)
> > {
> > const unsigned long task_stack_low = stackleak_task_low_bound(current);
> > -
> > - /* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
> > - unsigned long kstack_ptr = current->lowest_stack;
> > - unsigned long boundary = task_stack_low;
> > + unsigned long erase_low = current->lowest_stack;
> > + unsigned long erase_high;
> > unsigned int poison_count = 0;
> > const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> > /* Search for the poison value in the kernel stack */
> > - while (kstack_ptr > boundary && poison_count <= depth) {
> > - if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
> > + while (erase_low > task_stack_low && poison_count <= depth) {
> > + if (*(unsigned long *)erase_low == STACKLEAK_POISON)
> > poison_count++;
> > else
> > poison_count = 0;
> > - kstack_ptr -= sizeof(unsigned long);
> > + erase_low -= sizeof(unsigned long);
> > }
> > #ifdef CONFIG_STACKLEAK_METRICS
> > - current->prev_lowest_stack = kstack_ptr;
> > + current->prev_lowest_stack = erase_low;
> > #endif
> > /*
> > - * Now write the poison value to the kernel stack. Start from
> > - * 'kstack_ptr' and move up till the new 'boundary'. We assume that
> > - * the stack pointer doesn't change when we write poison.
> > + * Now write the poison value to the kernel stack between 'erase_low'
> > + * and 'erase_high'. We assume that the stack pointer doesn't change
> > + * when we write poison.
> > */
> > if (on_thread_stack())
> > - boundary = current_stack_pointer;
> > + erase_high = current_stack_pointer;
> > else
> > - boundary = current_top_of_stack();
> > + erase_high = current_top_of_stack();
> > - while (kstack_ptr < boundary) {
> > - *(unsigned long *)kstack_ptr = STACKLEAK_POISON;
> > - kstack_ptr += sizeof(unsigned long);
> > + while (erase_low < erase_high) {
> > + *(unsigned long *)erase_low = STACKLEAK_POISON;
> > + erase_low += sizeof(unsigned long);
> > }
> > /* Reset the 'lowest_stack' value for the next syscall */
>
On Tue, May 10, 2022 at 02:01:49PM +0100, Mark Rutland wrote:
> On Sun, May 08, 2022 at 11:49:46PM +0300, Alexander Popov wrote:
> > On 27.04.2022 20:31, Mark Rutland wrote:
> > > The logic within __stackleak_erase() can be a little hard to follow, as
> > > `boundary` switches from being the low bound to the high bound mid way
> > > through the function, and `kstack_ptr` is used to represent the start of
> > > the region to erase while `boundary` represents the end of the region to
> > > erase.
> > >
> > > Make this a little clearer by consistently using clearer variable names.
> > > The `boundary` variable is removed, the bounds of the region to erase
> > > are described by `erase_low` and `erase_high`, and bounds of the task
> > > stack are described by `task_stack_low` and `task_stck_high`.
> >
> > A typo here in `task_stck_high`.
>
> Ah; whoops.
No worries; I fixed this when I took the patch.
> > That was also the main reason why I reused the 'boundary' variable: I wanted
> > the compiler to allocate it in the register and I avoided creating many
> > local variables.
> >
> > Mark, did your refactoring make the compiler allocate local variables on the
> > stack instead of the registers?
>
> Considering the whole series, testing with GCC 11.1.0:
>
> * On arm64:
> before: stackleak_erase() uses 48 bytes of stack
> after: stackleak_erase() uses 0 bytes of stack
>
> Note: this is entirely due to patch 1; arm64 has enough GPRs that it
> doesn't need to use the stack.
>
> * On x86_64:
> before: stackleak_erase() uses 0 bytes of stack
> after: stackleak_erase() uses 0 bytes of stack
>
> * On i386
> before: stackleak_erase() uses 8 bytes of stack
> after: stackleak_erase() uses 16 bytes of stack
>
> The i386 case isn't ideal, but given that those bytes will easily be used by
> the entry triage code before getting to any syscall handling, I don't believe
> that's an issue in practice.
I am biased and totally fine with choosing a solution where 64-bit
improvement comes at a 32-bit cost.
--
Kees Cook
On 10.05.2022 14:22, Mark Rutland wrote:
> On Mon, May 09, 2022 at 12:27:22AM +0300, Alexander Popov wrote:
>> On 27.04.2022 20:31, Mark Rutland wrote:
>>> Prior to returning to userspace, we reset current->lowest_stack to a
>>> reasonable high bound. Currently we do this by subtracting the arbitrary
>>> value `THREAD_SIZE/64` from the top of the stack, for reasons lost to
>>> history.
>>>
>>> Looking at configurations today:
>>>
>>> * On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The
>>> pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of
>>> padding above), and so this covers an additional portion of 44 to 60
>>> bytes.
>>>
>>> * On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the
>>> bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs
>>> at the top of the stack is 168 bytes, and so this covers an additional
>>> 88 bytes of stack (up to 344 with KASAN).
>>>
>>> * On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages
>>> and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with
>>> KASAN). The pt_regs at the top of the stack is 336 bytes, so this can
>>> fall within the pt_regs, or can cover an additional 688 bytes of
>>> stack.
>>>
>>> Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the
>>> worst case, this will cause more than 600 bytes of stack to be erased
>>> for every syscall, even if actual stack usage were substantially
>>> smaller.
>>>
>>> This patch makes this slightly less nonsensical by consistently
>>> resetting current->lowest_stack to the base of the task pt_regs. For
>>> clarity and for consistency with the handling of the low bound, the
>>> generation of the high bound is split into a helper with commentary
>>> explaining why.
>>>
>>> Since the pt_regs at the top of the stack will be clobbered upon the
>>> next exception entry, we don't need to poison these at exception exit.
>>> By using task_pt_regs() as the high stack boundary instead of
>>> current_top_of_stack() we avoid some redundant poisoning, and the
>>> compiler can share the address generation between the poisoning and
>>> resetting of `current->lowest_stack`, making the generated code more
>>> optimal.
>>>
>>> It's not clear to me whether the existing `THREAD_SIZE/64` offset was a
>>> dodgy heuristic to skip the pt_regs, or whether it was attempting to
>>> minimize the number of times stackleak_check_stack() would have to
>>> update `current->lowest_stack` when stack usage was shallow at the cost
>>> of unconditionally poisoning a small portion of the stack for every exit
>>> to userspace.
>>
>> I inherited this 'THREAD_SIZE/64' logic from the original grsecurity patch.
>> As I mentioned, originally this was written in asm.
>>
>> For x86_64:
>> mov TASK_thread_sp0(%r11), %rdi
>> sub $256, %rdi
>> mov %rdi, TASK_lowest_stack(%r11)
>>
>> For x86_32:
>> mov TASK_thread_sp0(%ebp), %edi
>> sub $128, %edi
>> mov %edi, TASK_lowest_stack(%ebp)
>>
>> 256 bytes for x86_64 and 128 bytes for x86_32 are exactly THREAD_SIZE/64.
>
> To check my understanding, did you arrive at the `THREAD_SIZE/64`
> calculation by reverse-engineering the assembly?
Yes, exactly.
> I strongly suspect that has nothing to do with THREAD_SIZE, and was trying to
> fall just below the task's pt_regs (and maybe some unconditional stack usage
> just below that).
Aha, that's probable.
>> I think this value was chosen as optimal for minimizing poison scanning.
>> It's possible that stackleak_track_stack() is not called during the syscall
>> because all the called functions have small stack frames.
>
> I can believe that, but given the clearing cost appears to dominate the
> scanning cost, I suspect we can live with making this precisely the bottom of
> the pt_regs.
Sounds good!
>>> For now I've simply removed the offset, and if we need/want to minimize
>>> updates for shallow stack usage it should be easy to add a better
>>> heuristic atop, with appropriate commentary so we know what's going on.
>>
>> I like your idea to erase the thread stack up to pt_regs if we call the
>> stackleak erasing from the trampoline stack.
>>
>> But here I don't understand where task_pt_regs() points to...
>
> As mentioned in the commit message, there's a struct pt_regs at the top of each
> task stack (created at exception entry, and consumed by exception return),
> which contains the user register state. On arm64 and x86_64, that's *right* at
> the top of the stack, but on i386 there can be some padding above that.
> task_pt_regs(tsk) points at the base of that pt_regs.
>
> That looks something like the following (with increasing addresses going
> upwards):
>
> ---------------- <--- task_stack_page(tsk) + THREAD_SIZE
> (optional padding)
> ---------------- <--- task_top_of_stack(tsk) // x86 only
> /\
> ||
> || pt_regs
> ||
> \/
> ---------------- <--- task_pt_regs(tsk)
> /\
> ||
> ||
> ||
> || Usable task stack
> ||
> ||
> ||
> \/
> ----------------
> STACK_END_MAGIC
> ---------------- <--- task_stack_page(tsk)
>
> Thanks,
> Mark.
>
>>> Signed-off-by: Mark Rutland <[email protected]>
>>> Cc: Alexander Popov <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Kees Cook <[email protected]>
>>> ---
>>> include/linux/stackleak.h | 14 ++++++++++++++
>>> kernel/stackleak.c | 19 ++++++++++++++-----
>>> 2 files changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
>>> index 67430faa5c518..467661aeb4136 100644
>>> --- a/include/linux/stackleak.h
>>> +++ b/include/linux/stackleak.h
>>> @@ -28,6 +28,20 @@ stackleak_task_low_bound(const struct task_struct *tsk)
>>> return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
>>> }
>>> +/*
>>> + * The address immediately after the highest address on tsk's stack which we
>>> + * can plausibly erase.
>>> + */
>>> +static __always_inline unsigned long
>>> +stackleak_task_high_bound(const struct task_struct *tsk)
>>> +{
>>> + /*
>>> + * The task's pt_regs lives at the top of the task stack and will be
>>> + * overwritten by exception entry, so there's no need to erase them.
>>> + */
>>> + return (unsigned long)task_pt_regs(tsk);
>>> +}
>>> +
>>> static inline void stackleak_task_init(struct task_struct *t)
>>> {
>>> t->lowest_stack = stackleak_task_low_bound(t);
>>> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
>>> index d5f684dc0a2d9..ba346d46218f5 100644
>>> --- a/kernel/stackleak.c
>>> +++ b/kernel/stackleak.c
>>> @@ -73,6 +73,7 @@ late_initcall(stackleak_sysctls_init);
>>> static __always_inline void __stackleak_erase(void)
>>> {
>>> const unsigned long task_stack_low = stackleak_task_low_bound(current);
>>> + const unsigned long task_stack_high = stackleak_task_high_bound(current);
>>> unsigned long erase_low = current->lowest_stack;
>>> unsigned long erase_high;
>>> unsigned int poison_count = 0;
>>> @@ -93,14 +94,22 @@ static __always_inline void __stackleak_erase(void)
>>> #endif
>>> /*
>>> - * Now write the poison value to the kernel stack between 'erase_low'
>>> - * and 'erase_high'. We assume that the stack pointer doesn't change
>>> - * when we write poison.
>>> + * Write poison to the task's stack between 'erase_low' and
>>> + * 'erase_high'.
>>> + *
>>> + * If we're running on a different stack (e.g. an entry trampoline
>>> + * stack) we can erase everything below the pt_regs at the top of the
>>> + * task stack.
>>> + *
>>> + * If we're running on the task stack itself, we must not clobber any
>>> + * stack used by this function and its caller. We assume that this
>>> + * function has a fixed-size stack frame, and the current stack pointer
>>> + * doesn't change while we write poison.
>>> */
>>> if (on_thread_stack())
>>> erase_high = current_stack_pointer;
>>> else
>>> - erase_high = current_top_of_stack();
>>> + erase_high = task_stack_high;
>>> while (erase_low < erase_high) {
>>> *(unsigned long *)erase_low = STACKLEAK_POISON;
>>> @@ -108,7 +117,7 @@ static __always_inline void __stackleak_erase(void)
>>> }
>>> /* Reset the 'lowest_stack' value for the next syscall */
>>> - current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
>>> + current->lowest_stack = task_stack_high;
>>> }
>>> asmlinkage void noinstr stackleak_erase(void)
>>
On 10.05.2022 16:13, Mark Rutland wrote:
> On Mon, May 09, 2022 at 04:51:35PM +0300, Alexander Popov wrote:
>> Hello Mark!
>>
>> On 27.04.2022 20:31, Mark Rutland wrote:
>>> Currently we over-estimate the region of stack which must be erased.
>>>
>>> To determine the region to be erased, we scan downwards for a contiguous
>>> block of poison values (or the low bound of the stack). There are a few
>>> minor problems with this today:
>>>
>>> * When we find a block of poison values, we include this block within
>>> the region to erase.
>>>
>>> As this is included within the region to erase, this causes us to
>>> redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
>>> poison.
>>
>> Right, this can be improved.
>>
>>> * As the loop condition checks 'poison_count <= depth', it will run an
>>> additional iteration after finding the contiguous block of poison,
>>> decrementing 'erase_low' once more than necessary.
>>
>> Actually, I think the current code is correct.
>>
>> I'm intentionally searching one poison value more than
>> STACKLEAK_SEARCH_DEPTH to avoid the corner case. See the BUILD_BUG_ON
>> assertion in stackleak_track_stack() that describes it:
>>
>> /*
>> * Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
>> * STACKLEAK_SEARCH_DEPTH makes the poison search in
>> * stackleak_erase() unreliable. Let's prevent that.
>> */
>> BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);
>
> I had read that, but as written that doesn't imply that it's necessary to scan
> one more element than STACKLEAK_SEARCH_DEPTH, nor why. I'm more than happy to
> change the logic, but I think we need a very clear explanation as to why we
> need to scan the specific number of bytes we scan, and we should account for
> that *within* STACKLEAK_SEARCH_DEPTH for clarity.
I'll try to explain.
The stackleak gcc plugin instruments the kernel code inserting the
'stackleak_track_stack()' calls for the functions with a stack frame size
greater than or equal to CONFIG_STACKLEAK_TRACK_MIN_SIZE.
The kernel functions with a smaller stack frame are not instrumented (the
'lowest_stack' value is not updated for them).
Any kernel function may leave uninitialized data in its stack frame. The poison
scanning must handle that correctly: any stale run of poison values left this way
must be shorter than the search depth, otherwise 'stackleak_erase()' is unreliable.
So with this BUILD_BUG_ON I control that
CONFIG_STACKLEAK_TRACK_MIN_SIZE <= STACKLEAK_SEARCH_DEPTH.
To be safe and avoid mistakes in the edge cases, 'stackleak_erase()' searches
one poison value more than STACKLEAK_SEARCH_DEPTH.
If you don't like this one additional poison value in the search, I would
propose to change the assertion.
What do you think?
>>> As this is included within the region to erase, this causes us to
>>> redundantly overwrite an additional unsigned long with poison.
>>>
>>> * As we always decrement 'erase_low' after checking an element on the
>>> stack, we always include the element below this within the region to
>>> erase.
>>>
>>> As this is included within the region to erase, this causes us to
>>> redundantly overwrite an additional unsigned long with poison.
>>>
>>> Note that this is not a functional problem. As the loop condition
>>> checks 'erase_low > task_stack_low', we'll never clobber the
>>> STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
>>> never fail to erase the element immediately above the STACK_END_MAGIC.
>>
>> Right, I don't see any bug in the current erasing code.
>>
>> When I wrote the current code, I carefully checked all the corner cases. For
>> example, on the first stack erasing, the STACK_END_MAGIC was not
>> overwritten, but the memory next to it was erased. Same for the beginning of
>> the stack: I carefully checked that no unpoisoned bytes were left on the
>> thread stack.
>>
>>> In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
>>> bytes more than necessary, which is unfortunate.
>>>
>>> This patch reworks the logic to find the address immediately above the
>>> poisoned region, by finding the lowest non-poisoned address. This is
>>> factored into a stackleak_find_top_of_poison() helper both for clarity
>>> and so that this can be shared with the LKDTM test in subsequent
>>> patches.
>>
>> You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
>> generate assembly that is very close to the original asm version. I worried
>> that compilers might do weird stuff with the local variables and the stack
>> pointer.
>>
>> So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on
>> x86_64, i386 and arm64. This is my project that helped with this work:
>> https://github.com/a13xp0p0v/kernel-build-containers
>
> I've used the kernel.org cross toolchains, as published at:
>
> https://mirrors.edge.kernel.org/pub/tools/crosstool/
>
>> Mark, in this patch series you use many local variables and helper functions.
>> Honestly, this worries me. For example, compilers can (and usually do)
>> ignore the presence of the 'inline' specifier for the purpose of
>> optimization.
>
> I've deliberately used `__always_inline` rather than regular `inline` to
> prevent code being placed out-of-line. As mentioned in oether replies it has a
> stronger semantic.
>
> Thanks,
> Mark.
>
>>
>> Thanks!
>>
>>> Signed-off-by: Mark Rutland <[email protected]>
>>> Cc: Alexander Popov <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Kees Cook <[email protected]>
>>> ---
>>> include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
>>> kernel/stackleak.c | 18 ++++--------------
>>> 2 files changed, 30 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
>>> index 467661aeb4136..c36e7a3b45e7e 100644
>>> --- a/include/linux/stackleak.h
>>> +++ b/include/linux/stackleak.h
>>> @@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
>>> return (unsigned long)task_pt_regs(tsk);
>>> }
>>> +/*
>>> + * Find the address immediately above the poisoned region of the stack, where
>>> + * that region falls between 'low' (inclusive) and 'high' (exclusive).
>>> + */
>>> +static __always_inline unsigned long
>>> +stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
>>> +{
>>> + const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
>>> + unsigned int poison_count = 0;
>>> + unsigned long poison_high = high;
>>> + unsigned long sp = high;
>>> +
>>> + while (sp > low && poison_count < depth) {
>>> + sp -= sizeof(unsigned long);
>>> +
>>> + if (*(unsigned long *)sp == STACKLEAK_POISON) {
>>> + poison_count++;
>>> + } else {
>>> + poison_count = 0;
>>> + poison_high = sp;
>>> + }
>>> + }
>>> +
>>> + return poison_high;
>>> +}
>>> +
>>> static inline void stackleak_task_init(struct task_struct *t)
>>> {
>>> t->lowest_stack = stackleak_task_low_bound(t);
>>> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
>>> index ba346d46218f5..afd54b8e10b83 100644
>>> --- a/kernel/stackleak.c
>>> +++ b/kernel/stackleak.c
>>> @@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
>>> {
>>> const unsigned long task_stack_low = stackleak_task_low_bound(current);
>>> const unsigned long task_stack_high = stackleak_task_high_bound(current);
>>> - unsigned long erase_low = current->lowest_stack;
>>> - unsigned long erase_high;
>>> - unsigned int poison_count = 0;
>>> - const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
>>> -
>>> - /* Search for the poison value in the kernel stack */
>>> - while (erase_low > task_stack_low && poison_count <= depth) {
>>> - if (*(unsigned long *)erase_low == STACKLEAK_POISON)
>>> - poison_count++;
>>> - else
>>> - poison_count = 0;
>>> -
>>> - erase_low -= sizeof(unsigned long);
>>> - }
>>> + unsigned long erase_low, erase_high;
>>> +
>>> + erase_low = stackleak_find_top_of_poison(task_stack_low,
>>> + current->lowest_stack);
>>> #ifdef CONFIG_STACKLEAK_METRICS
>>> current->prev_lowest_stack = erase_low;
>>
On Sun, May 15, 2022 at 08:33:01PM +0300, Alexander Popov wrote:
> On 10.05.2022 16:13, Mark Rutland wrote:
> > On Mon, May 09, 2022 at 04:51:35PM +0300, Alexander Popov wrote:
> > > Hello Mark!
> > >
> > > On 27.04.2022 20:31, Mark Rutland wrote:
> > > > Currently we over-estimate the region of stack which must be erased.
> > > >
> > > > To determine the region to be erased, we scan downwards for a contiguous
> > > > block of poison values (or the low bound of the stack). There are a few
> > > > minor problems with this today:
> > > >
> > > > * When we find a block of poison values, we include this block within
> > > > the region to erase.
> > > >
> > > > As this is included within the region to erase, this causes us to
> > > > redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
> > > > poison.
> > >
> > > Right, this can be improved.
> > >
> > > > * As the loop condition checks 'poison_count <= depth', it will run an
> > > > additional iteration after finding the contiguous block of poison,
> > > > decrementing 'erase_low' once more than necessary.
> > >
> > > Actually, I think the current code is correct.
> > >
> > > I'm intentionally searching one poison value more than
> > > STACKLEAK_SEARCH_DEPTH to avoid the corner case. See the BUILD_BUG_ON
> > > assertion in stackleak_track_stack() that describes it:
> > >
> > > /*
> > > * Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
> > > * STACKLEAK_SEARCH_DEPTH makes the poison search in
> > > * stackleak_erase() unreliable. Let's prevent that.
> > > */
> > > BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);
> >
> > I had read that, but as written that doesn't imply that it's necessary to scan
> > one more element than STACKLEAK_SEARCH_DEPTH, nor why. I'm more than happy to
> > change the logic, but I think we need a very clear explanation as to why we
> > need to scan the specific number of bytes we scan, and we should account for
> > that *within* STACKLEAK_SEARCH_DEPTH for clarity.
>
> I'll try to explain.
>
> The stackleak gcc plugin instruments the kernel code inserting the
> 'stackleak_track_stack()' calls for the functions with a stack frame size
> greater than or equal to CONFIG_STACKLEAK_TRACK_MIN_SIZE.
>
> The kernel functions with a smaller stack frame are not instrumented (the
> 'lowest_stack' value is not updated for them).
I understood these points.
It's also worth noting that `noinstr` code will also not be instrumented
regardless of frame size -- we might want some build-time assertion for those.
> Any kernel function may leave uninitialized data on its stack frame. The
> poison scanning must handle that correctly. The uninitialized groups of
> poison values must be smaller than the search depth, otherwise
> 'stackleak_erase()' is unreliable.
I had understood this, but I had assumed that for a caller->callee pair,
*something* would be pushed onto the stack. On arm64 that'd be a frame record
in the caller's stack frame, and on x86 that would be the return address
between the stack frames of the caller and callee. Since any unrecorded frame
is less than CONFIG_STACKLEAK_TRACK_MIN_SIZE, the non-poison bytes would fall
within CONFIG_STACKLEAK_TRACK_MIN_SIZE bytes.
There is a potential edge case on arm64, since the frame record is permitted to
be placed anywhere within the stack frame, and in theory it could be placed
high on the caller and low on the callee. If we wanted to handle that, we'd
need to scan 2 times the tracking size. In practice compilers consistently
place the frame record at one end (usually the low end, as part of constructing
the frame).
> So with this BUILD_BUG_ON I control that
> CONFIG_STACKLEAK_TRACK_MIN_SIZE <= STACKLEAK_SEARCH_DEPTH.
>
> To be sure and avoid mistakes in the edge cases, 'stackleak_erase()' is
> searching one poison value more than STACKLEAK_SEARCH_DEPTH.
Just to check my understanding, did you have a specific edge-case in mind, or
was that "just in case"?
It would be really nice if we had an example.
> If you don't like this one additional poison value in the search, I would
> propose to change the assertion.
If we clearly document *why*, then changing the assertion is fine by me.
However, as above I don't think that this is necessary.
As an aside, why is it possible to configure CONFIG_STACKLEAK_TRACK_MIN_SIZE to
be bigger than STACKLEAK_SEARCH_DEPTH in the first place?
In security/Kconfig.hardening we have:
| config STACKLEAK_TRACK_MIN_SIZE
| int "Minimum stack frame size of functions tracked by STACKLEAK"
| default 100
| range 0 4096
| depends on GCC_PLUGIN_STACKLEAK
| help
| The STACKLEAK gcc plugin instruments the kernel code for tracking
| the lowest border of the kernel stack (and for some other purposes).
| It inserts the stackleak_track_stack() call for the functions with
| a stack frame size greater than or equal to this parameter.
| If unsure, leave the default value 100.
... where the vast majority of that range is going to lead to a BUILD_BUG().
Thanks,
Mark.
> What do you think?
>
> > > > As this is included within the region to erase, this causes us to
> > > > redundantly overwrite an additional unsigned long with poison.
> > > >
> > > > * As we always decrement 'erase_low' after checking an element on the
> > > > stack, we always include the element below this within the region to
> > > > erase.
> > > >
> > > > As this is included within the region to erase, this causes us to
> > > > redundantly overwrite an additional unsigned long with poison.
> > > >
> > > > Note that this is not a functional problem. As the loop condition
> > > > checks 'erase_low > task_stack_low', we'll never clobber the
> > > > STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
> > > > never fail to erase the element immediately above the STACK_END_MAGIC.
> > >
> > > Right, I don't see any bug in the current erasing code.
> > >
> > > When I wrote the current code, I carefully checked all the corner cases. For
> > > example, on the first stack erasing, the STACK_END_MAGIC was not
> > > overwritten, but the memory next to it was erased. Same for the beginning of
> > > the stack: I carefully checked that no unpoisoned bytes were left on the
> > > thread stack.
> > >
> > > > In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
> > > > bytes more than necessary, which is unfortunate.
> > > >
> > > > This patch reworks the logic to find the address immediately above the
> > > > poisoned region, by finding the lowest non-poisoned address. This is
> > > > factored into a stackleak_find_top_of_poison() helper both for clarity
> > > > and so that this can be shared with the LKDTM test in subsequent
> > > > patches.
> > >
> > > You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
> > > generate assembly that is very close to the original asm version. I worried
> > > that compilers might do weird stuff with the local variables and the stack
> > > pointer.
> > >
> > > So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on
> > > x86_64, i386 and arm64. This is my project that helped with this work:
> > > https://github.com/a13xp0p0v/kernel-build-containers
> >
> > I've used the kernel.org cross toolchains, as published at:
> >
> > https://mirrors.edge.kernel.org/pub/tools/crosstool/
> >
> > > Mark, in this patch series you use many local variables and helper functions.
> > > Honestly, this worries me. For example, compilers can (and usually do)
> > > ignore the presence of the 'inline' specifier for the purpose of
> > > optimization.
> >
> > I've deliberately used `__always_inline` rather than regular `inline` to
> > prevent code being placed out-of-line. As mentioned in other replies, it has
> > stronger semantics.
> >
> > Thanks,
> > Mark.
> >
> > >
> > > Thanks!
> > >
> > > > Signed-off-by: Mark Rutland <[email protected]>
> > > > Cc: Alexander Popov <[email protected]>
> > > > Cc: Andrew Morton <[email protected]>
> > > > Cc: Andy Lutomirski <[email protected]>
> > > > Cc: Kees Cook <[email protected]>
> > > > ---
> > > > include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
> > > > kernel/stackleak.c | 18 ++++--------------
> > > > 2 files changed, 30 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
> > > > index 467661aeb4136..c36e7a3b45e7e 100644
> > > > --- a/include/linux/stackleak.h
> > > > +++ b/include/linux/stackleak.h
> > > > @@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
> > > > return (unsigned long)task_pt_regs(tsk);
> > > > }
> > > > +/*
> > > > + * Find the address immediately above the poisoned region of the stack, where
> > > > + * that region falls between 'low' (inclusive) and 'high' (exclusive).
> > > > + */
> > > > +static __always_inline unsigned long
> > > > +stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
> > > > +{
> > > > + const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> > > > + unsigned int poison_count = 0;
> > > > + unsigned long poison_high = high;
> > > > + unsigned long sp = high;
> > > > +
> > > > + while (sp > low && poison_count < depth) {
> > > > + sp -= sizeof(unsigned long);
> > > > +
> > > > + if (*(unsigned long *)sp == STACKLEAK_POISON) {
> > > > + poison_count++;
> > > > + } else {
> > > > + poison_count = 0;
> > > > + poison_high = sp;
> > > > + }
> > > > + }
> > > > +
> > > > + return poison_high;
> > > > +}
> > > > +
> > > > static inline void stackleak_task_init(struct task_struct *t)
> > > > {
> > > > t->lowest_stack = stackleak_task_low_bound(t);
> > > > diff --git a/kernel/stackleak.c b/kernel/stackleak.c
> > > > index ba346d46218f5..afd54b8e10b83 100644
> > > > --- a/kernel/stackleak.c
> > > > +++ b/kernel/stackleak.c
> > > > @@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
> > > > {
> > > > const unsigned long task_stack_low = stackleak_task_low_bound(current);
> > > > const unsigned long task_stack_high = stackleak_task_high_bound(current);
> > > > - unsigned long erase_low = current->lowest_stack;
> > > > - unsigned long erase_high;
> > > > - unsigned int poison_count = 0;
> > > > - const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
> > > > -
> > > > - /* Search for the poison value in the kernel stack */
> > > > - while (erase_low > task_stack_low && poison_count <= depth) {
> > > > - if (*(unsigned long *)erase_low == STACKLEAK_POISON)
> > > > - poison_count++;
> > > > - else
> > > > - poison_count = 0;
> > > > -
> > > > - erase_low -= sizeof(unsigned long);
> > > > - }
> > > > + unsigned long erase_low, erase_high;
> > > > +
> > > > + erase_low = stackleak_find_top_of_poison(task_stack_low,
> > > > + current->lowest_stack);
> > > > #ifdef CONFIG_STACKLEAK_METRICS
> > > > current->prev_lowest_stack = erase_low;
> > >
>
On 24.05.2022 16:31, Mark Rutland wrote:
> On Sun, May 15, 2022 at 08:33:01PM +0300, Alexander Popov wrote:
>> On 10.05.2022 16:13, Mark Rutland wrote:
>>> On Mon, May 09, 2022 at 04:51:35PM +0300, Alexander Popov wrote:
>>>> Hello Mark!
>>>>
>>>> On 27.04.2022 20:31, Mark Rutland wrote:
>>>>> Currently we over-estimate the region of stack which must be erased.
>>>>>
>>>>> To determine the region to be erased, we scan downwards for a contiguous
>>>>> block of poison values (or the low bound of the stack). There are a few
>>>>> minor problems with this today:
>>>>>
>>>>> * When we find a block of poison values, we include this block within
>>>>> the region to erase.
>>>>>
>>>>> As this is included within the region to erase, this causes us to
>>>>> redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
>>>>> poison.
>>>>
>>>> Right, this can be improved.
>>>>
>>>>> * As the loop condition checks 'poison_count <= depth', it will run an
>>>>> additional iteration after finding the contiguous block of poison,
>>>>> decrementing 'erase_low' once more than necessary.
>>>>
>>>> Actually, I think the current code is correct.
>>>>
>>>> I'm intentionally searching one poison value more than
>>>> STACKLEAK_SEARCH_DEPTH to avoid the corner case. See the BUILD_BUG_ON
>>>> assertion in stackleak_track_stack() that describes it:
>>>>
>>>> /*
>>>> * Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
>>>> * STACKLEAK_SEARCH_DEPTH makes the poison search in
>>>> * stackleak_erase() unreliable. Let's prevent that.
>>>> */
>>>> BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);
>>>
>>> I had read that, but as written that doesn't imply that it's necessary to scan
>>> one more element than STACKLEAK_SEARCH_DEPTH, nor why. I'm more than happy to
>>> change the logic, but I think we need a very clear explanation as to why we
>>> need to scan the specific number of bytes we scan, and we should account for
>>> that *within* STACKLEAK_SEARCH_DEPTH for clarity.
>>
>> I'll try to explain.
>>
>> The stackleak gcc plugin instruments the kernel code inserting the
>> 'stackleak_track_stack()' calls for the functions with a stack frame size
>> greater than or equal to CONFIG_STACKLEAK_TRACK_MIN_SIZE.
>>
>> The kernel functions with a smaller stack frame are not instrumented (the
>> 'lowest_stack' value is not updated for them).
>
> I understood these points.
>
> It's also worth noting that `noinstr` code will also not be instrumented
> regardless of frame size -- we might want some build-time assertion for those.
I developed a trick that shows noinstr functions that stackleak would like to instrument:
diff --git a/scripts/gcc-plugins/stackleak_plugin.c b/scripts/gcc-plugins/stackleak_plugin.c
index 42f0252ee2a4..6db748d44957 100644
--- a/scripts/gcc-plugins/stackleak_plugin.c
+++ b/scripts/gcc-plugins/stackleak_plugin.c
@@ -397,6 +397,9 @@ static unsigned int stackleak_cleanup_execute(void)
const char *fn = DECL_NAME_POINTER(current_function_decl);
bool removed = false;
+ if (verbose)
+ fprintf(stderr, "stackleak: I see noinstr function %s()\n", fn);
+
/*
* Leave stack tracking in functions that call alloca().
* Additional case:
@@ -464,12 +467,12 @@ static bool stackleak_gate(void)
if (STRING_EQUAL(section, ".meminit.text"))
return false;
if (STRING_EQUAL(section, ".noinstr.text"))
- return false;
+ return true;
if (STRING_EQUAL(section, ".entry.text"))
return false;
}
- return track_frame_size >= 0;
+ return false;
}
/* Build the function declaration for stackleak_track_stack() */
@@ -589,8 +592,6 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
build_for_x86 = true;
} else if (!strcmp(argv[i].key, "disable")) {
disable = true;
- } else if (!strcmp(argv[i].key, "verbose")) {
- verbose = true;
} else {
error(G_("unknown option '-fplugin-arg-%s-%s'"),
plugin_name, argv[i].key);
@@ -598,6 +599,8 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
}
}
+ verbose = true;
+
if (disable) {
if (verbose)
fprintf(stderr, "stackleak: disabled for this translation unit\n");
Building defconfig for x86_64 gives this:
stackleak: I see noinstr function __do_fast_syscall_32()
stackleak: instrument __do_fast_syscall_32(): calls_alloca
--
stackleak: I see noinstr function do_syscall_64()
stackleak: instrument do_syscall_64(): calls_alloca
--
stackleak: I see noinstr function do_int80_syscall_32()
stackleak: instrument do_int80_syscall_32(): calls_alloca
--
stackleak: I see noinstr function do_machine_check()
stackleak: instrument do_machine_check()
--
stackleak: I see noinstr function exc_general_protection()
stackleak: instrument exc_general_protection()
--
stackleak: I see noinstr function fixup_bad_iret()
stackleak: instrument fixup_bad_iret()
The cases with calls_alloca are caused by CONFIG_RANDOMIZE_KSTACK_OFFSET=y.
Kees knows about that peculiarity.
Other cases are noinstr functions with large stack frame:
do_machine_check(), exc_general_protection(), and fixup_bad_iret().
I think adding a build-time assertion is not possible, since it would break
building the kernel.
>> Any kernel function may leave uninitialized data on its stack frame. The
>> poison scanning must handle that correctly. The uninitialized groups of
>> poison values must be smaller than the search depth, otherwise
>> 'stackleak_erase()' is unreliable.
>
> I had understood this, but I had assumed that for a caller->callee pair,
> *something* would be pushed onto the stack. On arm64 that'd be a frame record
> in the caller's stack frame, and on x86 that would be the return address
> between the stack frames of the caller and callee. Since any unrecorded frame
> is less than CONFIG_STACKLEAK_TRACK_MIN_SIZE, the non-poison bytes would fall
> within CONFIG_STACKLEAK_TRACK_MIN_SIZE bytes.
Yes, exactly.
> There is a potential edge case on arm64, since the frame record is permitted to
> be placed anywhere within the stack frame, and in theory it could be placed
> high on the caller and low on the callee. If we wanted to handle that, we'd
> need to scan 2 times the tracking size. In practice compilers consistently
> place the frame record at one end (usually the low end, as part of constructing
> the frame).
Good to hear that.
>> So with this BUILD_BUG_ON I control that
>> CONFIG_STACKLEAK_TRACK_MIN_SIZE <= STACKLEAK_SEARCH_DEPTH.
>>
>> To be sure and avoid mistakes in the edge cases, 'stackleak_erase()' is
>> searching one poison value more than STACKLEAK_SEARCH_DEPTH.
>
> Just to check my understanding, did you have a specific edge-case in mind, or
> was that "just in case"?
I didn't have a specific edge-case, but I wanted to avoid possible weird problems.
> It would be really nice if we had an example.
>
>> If you don't like this one additional poison value in the search, I would
>> propose to change the assertion.
>
> If we clearly document *why*, then changing the assertion is fine by me.
> However, as above I don't think that this is necessary.
>
> As an aside, why is it possible to configure CONFIG_STACKLEAK_TRACK_MIN_SIZE to
> be bigger than STACKLEAK_SEARCH_DEPTH in the first place?
>
> In security/Kconfig.hardening we have:
>
> | config STACKLEAK_TRACK_MIN_SIZE
> | int "Minimum stack frame size of functions tracked by STACKLEAK"
> | default 100
> | range 0 4096
> | depends on GCC_PLUGIN_STACKLEAK
> | help
> | The STACKLEAK gcc plugin instruments the kernel code for tracking
> | the lowest border of the kernel stack (and for some other purposes).
> | It inserts the stackleak_track_stack() call for the functions with
> | a stack frame size greater than or equal to this parameter.
> | If unsure, leave the default value 100.
>
> ... where the vast majority of that range is going to lead to a BUILD_BUG().
Honestly, I don't like the idea of having the STACKLEAK_TRACK_MIN_SIZE option in the Kconfig.
I was forced by the maintainers to introduce it when I was working on the stackleak patchset.
How about dropping CONFIG_STACKLEAK_TRACK_MIN_SIZE from Kconfig?
That would also allow us to drop this build-time assertion.
>> What do you think?
>>
>>>>> As this is included within the region to erase, this causes us to
>>>>> redundantly overwrite an additional unsigned long with poison.
>>>>>
>>>>> * As we always decrement 'erase_low' after checking an element on the
>>>>> stack, we always include the element below this within the region to
>>>>> erase.
>>>>>
>>>>> As this is included within the region to erase, this causes us to
>>>>> redundantly overwrite an additional unsigned long with poison.
>>>>>
>>>>> Note that this is not a functional problem. As the loop condition
>>>>> checks 'erase_low > task_stack_low', we'll never clobber the
>>>>> STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
>>>>> never fail to erase the element immediately above the STACK_END_MAGIC.
>>>>
>>>> Right, I don't see any bug in the current erasing code.
>>>>
>>>> When I wrote the current code, I carefully checked all the corner cases. For
>>>> example, on the first stack erasing, the STACK_END_MAGIC was not
>>>> overwritten, but the memory next to it was erased. Same for the beginning of
>>>> the stack: I carefully checked that no unpoisoned bytes were left on the
>>>> thread stack.
>>>>
>>>>> In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
>>>>> bytes more than necessary, which is unfortunate.
>>>>>
>>>>> This patch reworks the logic to find the address immediately above the
>>>>> poisoned region, by finding the lowest non-poisoned address. This is
>>>>> factored into a stackleak_find_top_of_poison() helper both for clarity
>>>>> and so that this can be shared with the LKDTM test in subsequent
>>>>> patches.
>>>>
>>>> You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
>>>> generate assembly that is very close to the original asm version. I worried
>>>> that compilers might do weird stuff with the local variables and the stack
>>>> pointer.
>>>>
>>>> So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on
>>>> x86_64, i386 and arm64. This is my project that helped with this work:
>>>> https://github.com/a13xp0p0v/kernel-build-containers
>>>
>>> I've used the kernel.org cross toolchains, as published at:
>>>
>>> https://mirrors.edge.kernel.org/pub/tools/crosstool/
>>>
>>>> Mark, in this patch series you use many local variables and helper functions.
>>>> Honestly, this worries me. For example, compilers can (and usually do)
>>>> ignore the presence of the 'inline' specifier for the purpose of
>>>> optimization.
>>>
>>> I've deliberately used `__always_inline` rather than regular `inline` to
>>> prevent code being placed out-of-line. As mentioned in other replies, it has
>>> stronger semantics.
>>>
>>> Thanks,
>>> Mark.
>>>
>>>>
>>>> Thanks!
>>>>
>>>>> Signed-off-by: Mark Rutland <[email protected]>
>>>>> Cc: Alexander Popov <[email protected]>
>>>>> Cc: Andrew Morton <[email protected]>
>>>>> Cc: Andy Lutomirski <[email protected]>
>>>>> Cc: Kees Cook <[email protected]>
>>>>> ---
>>>>> include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
>>>>> kernel/stackleak.c | 18 ++++--------------
>>>>> 2 files changed, 30 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
>>>>> index 467661aeb4136..c36e7a3b45e7e 100644
>>>>> --- a/include/linux/stackleak.h
>>>>> +++ b/include/linux/stackleak.h
>>>>> @@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
>>>>> return (unsigned long)task_pt_regs(tsk);
>>>>> }
>>>>> +/*
>>>>> + * Find the address immediately above the poisoned region of the stack, where
>>>>> + * that region falls between 'low' (inclusive) and 'high' (exclusive).
>>>>> + */
>>>>> +static __always_inline unsigned long
>>>>> +stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
>>>>> +{
>>>>> + const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
>>>>> + unsigned int poison_count = 0;
>>>>> + unsigned long poison_high = high;
>>>>> + unsigned long sp = high;
>>>>> +
>>>>> + while (sp > low && poison_count < depth) {
>>>>> + sp -= sizeof(unsigned long);
>>>>> +
>>>>> + if (*(unsigned long *)sp == STACKLEAK_POISON) {
>>>>> + poison_count++;
>>>>> + } else {
>>>>> + poison_count = 0;
>>>>> + poison_high = sp;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + return poison_high;
>>>>> +}
>>>>> +
>>>>> static inline void stackleak_task_init(struct task_struct *t)
>>>>> {
>>>>> t->lowest_stack = stackleak_task_low_bound(t);
>>>>> diff --git a/kernel/stackleak.c b/kernel/stackleak.c
>>>>> index ba346d46218f5..afd54b8e10b83 100644
>>>>> --- a/kernel/stackleak.c
>>>>> +++ b/kernel/stackleak.c
>>>>> @@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
>>>>> {
>>>>> const unsigned long task_stack_low = stackleak_task_low_bound(current);
>>>>> const unsigned long task_stack_high = stackleak_task_high_bound(current);
>>>>> - unsigned long erase_low = current->lowest_stack;
>>>>> - unsigned long erase_high;
>>>>> - unsigned int poison_count = 0;
>>>>> - const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
>>>>> -
>>>>> - /* Search for the poison value in the kernel stack */
>>>>> - while (erase_low > task_stack_low && poison_count <= depth) {
>>>>> - if (*(unsigned long *)erase_low == STACKLEAK_POISON)
>>>>> - poison_count++;
>>>>> - else
>>>>> - poison_count = 0;
>>>>> -
>>>>> - erase_low -= sizeof(unsigned long);
>>>>> - }
>>>>> + unsigned long erase_low, erase_high;
>>>>> +
>>>>> + erase_low = stackleak_find_top_of_poison(task_stack_low,
>>>>> + current->lowest_stack);
>>>>> #ifdef CONFIG_STACKLEAK_METRICS
>>>>> current->prev_lowest_stack = erase_low;
>>>>
>>
On Fri, May 27, 2022 at 02:25:12AM +0300, Alexander Popov wrote:
> On 24.05.2022 16:31, Mark Rutland wrote:
> > [...]
> > It's also worth noting that `noinstr` code will also not be instrumented
> > regardless of frame size -- we might want some build-time assertion for those.
>
> I developed a trick that shows noinstr functions that stackleak would like to instrument:
>
> diff --git a/scripts/gcc-plugins/stackleak_plugin.c b/scripts/gcc-plugins/stackleak_plugin.c
> index 42f0252ee2a4..6db748d44957 100644
> --- a/scripts/gcc-plugins/stackleak_plugin.c
> +++ b/scripts/gcc-plugins/stackleak_plugin.c
> @@ -397,6 +397,9 @@ static unsigned int stackleak_cleanup_execute(void)
> const char *fn = DECL_NAME_POINTER(current_function_decl);
> bool removed = false;
>
> + if (verbose)
> + fprintf(stderr, "stackleak: I see noinstr function %s()\n", fn);
> +
> /*
> * Leave stack tracking in functions that call alloca().
> * Additional case:
> @@ -464,12 +467,12 @@ static bool stackleak_gate(void)
> if (STRING_EQUAL(section, ".meminit.text"))
> return false;
> if (STRING_EQUAL(section, ".noinstr.text"))
> - return false;
> + return true;
> if (STRING_EQUAL(section, ".entry.text"))
> return false;
> }
>
> - return track_frame_size >= 0;
> + return false;
> }
>
> /* Build the function declaration for stackleak_track_stack() */
> @@ -589,8 +592,6 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
> build_for_x86 = true;
> } else if (!strcmp(argv[i].key, "disable")) {
> disable = true;
> - } else if (!strcmp(argv[i].key, "verbose")) {
> - verbose = true;
> } else {
> error(G_("unknown option '-fplugin-arg-%s-%s'"),
> plugin_name, argv[i].key);
> @@ -598,6 +599,8 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
> }
> }
>
> + verbose = true;
> +
> if (disable) {
> if (verbose)
> fprintf(stderr, "stackleak: disabled for this translation unit\n");
>
>
> Building defconfig for x86_64 gives this:
>
> stackleak: I see noinstr function __do_fast_syscall_32()
> stackleak: instrument __do_fast_syscall_32(): calls_alloca
> --
> stackleak: I see noinstr function do_syscall_64()
> stackleak: instrument do_syscall_64(): calls_alloca
> --
> stackleak: I see noinstr function do_int80_syscall_32()
> stackleak: instrument do_int80_syscall_32(): calls_alloca
As you say, these are from RANDOMIZE_KSTACK_OFFSET, and are
bounds-checked, and should already be getting wiped since these will
call into deeper (non-noinstr) functions.
> stackleak: I see noinstr function do_machine_check()
> stackleak: instrument do_machine_check()
> --
> stackleak: I see noinstr function exc_general_protection()
> stackleak: instrument exc_general_protection()
> --
> stackleak: I see noinstr function fixup_bad_iret()
> stackleak: instrument fixup_bad_iret()
>
>
> The cases with calls_alloca are caused by CONFIG_RANDOMIZE_KSTACK_OFFSET=y.
> Kees knows about that peculiarity.
>
> Other cases are noinstr functions with large stack frame:
> do_machine_check(), exc_general_protection(), and fixup_bad_iret().
>
> I think adding a build-time assertion is not possible, since it would break
> building the kernel.
Do these functions share the syscall behavior of always calling into
non-noinstr functions that _do_ have stack depth instrumentation?
> [...]
> > In security/Kconfig.hardening we have:
> >
> > | config STACKLEAK_TRACK_MIN_SIZE
> > | int "Minimum stack frame size of functions tracked by STACKLEAK"
> > | default 100
> > | range 0 4096
> > | depends on GCC_PLUGIN_STACKLEAK
> > | help
> > | The STACKLEAK gcc plugin instruments the kernel code for tracking
> > | the lowest border of the kernel stack (and for some other purposes).
> > | It inserts the stackleak_track_stack() call for the functions with
> > | a stack frame size greater than or equal to this parameter.
> > | If unsure, leave the default value 100.
> >
> > ... where the vast majority of that range is going to lead to a BUILD_BUG().
>
> Honestly, I don't like the idea of having the STACKLEAK_TRACK_MIN_SIZE option in the Kconfig.
>
> I was forced by the maintainers to introduce it when I was working on the stackleak patchset.
>
> How about dropping CONFIG_STACKLEAK_TRACK_MIN_SIZE from Kconfig?
>
> That would also allow us to drop this build-time assertion.
Should this be arch-specific? (i.e. just make it a per-arch Kconfig
default, instead of user-selectable into weird values?)
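For illustration only (the arch condition and values here are hypothetical,
not a proposal), a prompt-less symbol with per-arch defaults might look like:

```kconfig
config STACKLEAK_TRACK_MIN_SIZE
	int
	default 192 if ARM64
	default 100
	depends on GCC_PLUGIN_STACKLEAK
	help
	  Minimum stack frame size of functions tracked by STACKLEAK,
	  chosen per architecture rather than prompted for.
```

Dropping the prompt string after "int" means the value is no longer
user-selectable, so the weird out-of-range values (and the associated
BUILD_BUG_ON) go away.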
--
Kees Cook
On 31.05.2022 21:13, Kees Cook wrote:
> On Fri, May 27, 2022 at 02:25:12AM +0300, Alexander Popov wrote:
>> On 24.05.2022 16:31, Mark Rutland wrote:
>>> [...]
>>> It's also worth noting that `noinstr` code will also not be instrumented
>>> regardless of frame size -- we might want some build-time assertion for those.
>>
>> I developed a trick that shows noinstr functions that stackleak would like to instrument:
>>
>> diff --git a/scripts/gcc-plugins/stackleak_plugin.c b/scripts/gcc-plugins/stackleak_plugin.c
>> index 42f0252ee2a4..6db748d44957 100644
>> --- a/scripts/gcc-plugins/stackleak_plugin.c
>> +++ b/scripts/gcc-plugins/stackleak_plugin.c
>> @@ -397,6 +397,9 @@ static unsigned int stackleak_cleanup_execute(void)
>> const char *fn = DECL_NAME_POINTER(current_function_decl);
>> bool removed = false;
>>
>> + if (verbose)
>> + fprintf(stderr, "stackleak: I see noinstr function %s()\n", fn);
>> +
>> /*
>> * Leave stack tracking in functions that call alloca().
>> * Additional case:
>> @@ -464,12 +467,12 @@ static bool stackleak_gate(void)
>> if (STRING_EQUAL(section, ".meminit.text"))
>> return false;
>> if (STRING_EQUAL(section, ".noinstr.text"))
>> - return false;
>> + return true;
>> if (STRING_EQUAL(section, ".entry.text"))
>> return false;
>> }
>>
>> - return track_frame_size >= 0;
>> + return false;
>> }
>>
>> /* Build the function declaration for stackleak_track_stack() */
>> @@ -589,8 +592,6 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
>> build_for_x86 = true;
>> } else if (!strcmp(argv[i].key, "disable")) {
>> disable = true;
>> - } else if (!strcmp(argv[i].key, "verbose")) {
>> - verbose = true;
>> } else {
>> error(G_("unknown option '-fplugin-arg-%s-%s'"),
>> plugin_name, argv[i].key);
>> @@ -598,6 +599,8 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
>> }
>> }
>>
>> + verbose = true;
>> +
>> if (disable) {
>> if (verbose)
>> fprintf(stderr, "stackleak: disabled for this translation unit\n");
>>
>>
>> Building defconfig for x86_64 gives this:
>>
>> stackleak: I see noinstr function __do_fast_syscall_32()
>> stackleak: instrument __do_fast_syscall_32(): calls_alloca
>> --
>> stackleak: I see noinstr function do_syscall_64()
>> stackleak: instrument do_syscall_64(): calls_alloca
>> --
>> stackleak: I see noinstr function do_int80_syscall_32()
>> stackleak: instrument do_int80_syscall_32(): calls_alloca
>
> As you say, these are from RANDOMIZE_KSTACK_OFFSET, and are
> bounds-checked, and should already be getting wiped since these will
> call into deeper (non-noinstr) functions.
Kees, it crossed my mind that, for correct stack erasing, a kernel with
RANDOMIZE_KSTACK_OFFSET needs at least one stackleak_track_stack() call
during syscall handling.
Otherwise current->lowest_stack would point to a stack address where no
stack frame was placed, because of the alloca() with a random size.
Am I right?
How about calling stackleak_track_stack() explicitly after the kernel stack
randomization?
>> stackleak: I see noinstr function do_machine_check()
>> stackleak: instrument do_machine_check()
>> --
>> stackleak: I see noinstr function exc_general_protection()
>> stackleak: instrument exc_general_protection()
>> --
>> stackleak: I see noinstr function fixup_bad_iret()
>> stackleak: instrument fixup_bad_iret()
>>
>>
>> The cases with calls_alloca are caused by CONFIG_RANDOMIZE_KSTACK_OFFSET=y.
>> Kees knows about that peculiarity.
>>
>> Other cases are noinstr functions with large stack frame:
>> do_machine_check(), exc_general_protection(), and fixup_bad_iret().
>>
>> I think adding a build-time assertion is not possible, since it would break
>> building the kernel.
>
> Do these functions share the syscall behavior of always calling into
> non-noinstr functions that _do_ have stack depth instrumentation?
That's the right question.
I can't say for sure, but it looks like do_machine_check(),
exc_general_protection() and fixup_bad_iret() do some low-level exception/trap
handling and don't affect syscall handling. Do you agree?
>> [...]
>>> In security/Kconfig.hardening we have:
>>>
>>> | config STACKLEAK_TRACK_MIN_SIZE
>>> | int "Minimum stack frame size of functions tracked by STACKLEAK"
>>> | default 100
>>> | range 0 4096
>>> | depends on GCC_PLUGIN_STACKLEAK
>>> | help
>>> | The STACKLEAK gcc plugin instruments the kernel code for tracking
>>> | the lowest border of the kernel stack (and for some other purposes).
>>> | It inserts the stackleak_track_stack() call for the functions with
>>> | a stack frame size greater than or equal to this parameter.
>>> | If unsure, leave the default value 100.
>>>
>>> ... where the vast majority of that range is going to lead to a BUILD_BUG().
>>
>> Honestly, I don't like the idea of having the STACKLEAK_TRACK_MIN_SIZE option in the Kconfig.
>>
>> I was forced by the maintainers to introduce it when I was working on the stackleak patchset.
>>
>> How about dropping CONFIG_STACKLEAK_TRACK_MIN_SIZE from Kconfig?
>>
>> That would also allow us to drop this build-time assertion.
>
> Should this be arch-specific? (i.e. just make it a per-arch Kconfig
> default, instead of user-selectable into weird values?)
I don't think CONFIG_STACKLEAK_TRACK_MIN_SIZE is arch-specific, since
STACKLEAK_SEARCH_DEPTH is the same for all architectures that support stackleak.
Best regards,
Alexander