If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected, the kernel stack offset
is randomized upon each entry to a system call, after the fixed
location of the pt_regs struct.
This feature is based on the original idea from
PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the design and implementation of
RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
feature (see below).
Reasoning for the feature:
This feature aims to make various stack-based attacks that rely on a
deterministic stack structure considerably harder.
We have seen many such attacks in the past [1],[2],[3]
(just to name a few), and as Linux kernel stack protections
have been constantly improving (vmap-based stack
allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have to find new ways for their
exploits to work.
It is important to note that we currently cannot show
a concrete attack that would be stopped by this new
feature (given that other existing stack protections
are enabled), so this is an attempt to be on the proactive
side vs. catching up with existing successful exploits.
The main idea is that since the stack offset is
randomized upon each system call, it is very hard for an
attacker to reliably land in any particular place on
the thread stack when the attack is performed.
Also, since the randomization is performed *after* pt_regs,
the ptrace-based approach to discovering the randomization
offset during a long-running syscall should not be
possible.
[1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
[2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
[3] googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
Design description:
During most of the kernel's execution, it runs on the "thread
stack", which is allocated at fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downward,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address of tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.
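Conceptually, task_top_of_stack() boils down to something like the
sketch below (illustrative only, not the exact kernel definition):

#include <linux/sched.h>

static inline unsigned long task_top_of_stack_sketch(struct task_struct *tsk)
{
        /* the stack occupies [tsk->stack, tsk->stack + THREAD_SIZE) and grows down */
        return (unsigned long)tsk->stack + THREAD_SIZE;
}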
The thread stack is fairly deterministic in its structure: it is fixed
in size, and upon every syscall entry from userspace to the kernel it
starts being constructed from an address fetched from the per-cpu
cpu_current_top_of_stack variable.
The first element pushed onto the thread stack is the pt_regs struct,
which stores all required CPU registers and syscall parameters.
The goal of the RANDOMIZE_KSTACK_OFFSET feature is to add a random
offset between the pt_regs that has been pushed onto the stack and the
rest of the thread stack (used during syscall processing) every time a
process issues a syscall. The source of randomness is currently the
prandom_u32() pseudo-random generator (not cryptographically secure).
The offset is added using an alloca() call, since that avoids changes
to the assembly syscall entry code and to the unwinder.
This is an example of the assembly code produced by gcc for x86_64:
...
add_random_stack_offset();
0xffffffff810022e9 callq 0xffffffff81459570 <prandom_u32>
0xffffffff810022ee movzbl %al,%eax
0xffffffff810022f1 add $0x16,%rax
0xffffffff810022f5 and $0x1f8,%eax
0xffffffff810022fa sub %rax,%rsp
0xffffffff810022fd lea 0xf(%rsp),%rax
0xffffffff81002302 and $0xfffffffffffffff0,%rax
...
As a result of the above gcc-produced code, this patch introduces
a bit more than 5 bits of randomness after the pt_regs location on
the thread stack (33 different offsets are generated
randomly for x86_64 and 47 for i386).
The amount of randomness can be adjusted based on how much of the
stack space we wish/can trade for security.
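For illustration only (this is not part of the patch below, which
simply uses % 256), the trade-off could be made explicit with a mask:

/* Illustrative variant: the mask controls the stack-vs-entropy trade-off.
 * 0xFF keeps up to 255 bytes of extra stack and yields ~5 usable bits after
 * alloca alignment; a wider mask costs more stack but adds random bits. */
#define KSTACK_OFFSET_MASK 0xFF
#define add_random_stack_offset_masked() do { \
        size_t offset = prandom_u32() & KSTACK_OFFSET_MASK; \
        char *ptr = __builtin_alloca(offset); \
        asm volatile("":"=m"(*ptr)); \
} while (0)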
Performance (x86_64 measurements only):
1) lmbench: ./lat_syscall -N 1000000 null
base: Simple syscall: 0.1774 microseconds
random_offset (prandom_u32() every syscall): Simple syscall: 0.1822 microseconds
2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
base: 10000000 loops in 1.62224s = 162.22 nsec / loop
random_offset (prandom_u32() every syscall): 10000000 loops in 1.64660s = 166.26 nsec / loop
Comparison to grsecurity RANDKSTACK feature:
The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. the location of the pt_regs structure
itself on the stack. Initially this patch followed the same approach,
but during the recent discussions [4] it has been determined
to be of little value since, if ptrace functionality is available
to an attacker, they can use the PTRACE_PEEKUSR/PTRACE_POKEUSR API to
read/write different offsets in the pt_regs struct, observe the cache
behavior of the pt_regs accesses, and figure out the random stack offset.
Another big difference is that randomization is done upon
syscall entry and not upon exit, as in RANDKSTACK.
Also, as a result of the above two differences, the implementations
of RANDKSTACK and RANDOMIZE_KSTACK_OFFSET have nothing in common.
[4] https://www.openwall.com/lists/kernel-hardening/2019/02/08/6
Signed-off-by: Elena Reshetova <[email protected]>
---
arch/Kconfig | 15 +++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 18 ++++++++++++++++++
3 files changed, 34 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..9a2557b0cfce 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,21 @@ config VMAP_STACK
the stack to map directly to the KASAN shadow map using a formula
that is incorrect if the stack is in vmalloc space.
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+ def_bool n
+ help
+ An arch should select this symbol if it can support kernel stack
+ offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+ default n
+ bool "Randomize kernel stack offset on syscall entry"
+ depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+ help
+ Enable this if you want to randomize the kernel stack offset upon
+ each syscall entry. This causes the kernel stack (after pt_regs) to
+ have a randomized offset upon executing each system call.
+
config ARCH_OPTIONAL_KERNEL_RWX
def_bool n
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ade12ec4224b..87e5444cd366 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -131,6 +131,7 @@ config X86
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
select HAVE_ARCH_VMAP_STACK if X86_64
+ select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..076085611e94 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,20 @@
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#include <linux/random.h>
+
+void *__builtin_alloca(size_t size);
+
+#define add_random_stack_offset() do { \
+ size_t offset = ((size_t)prandom_u32()) % 256; \
+ char *ptr = __builtin_alloca(offset); \
+ asm volatile("":"=m"(*ptr)); \
+} while (0)
+#else
+#define add_random_stack_offset() do {} while (0)
+#endif
+
#ifdef CONFIG_CONTEXT_TRACKING
/* Called on entry from user mode with IRQs off. */
__visible inline void enter_from_user_mode(void)
@@ -273,6 +287,7 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
+ add_random_stack_offset();
enter_from_user_mode();
local_irq_enable();
ti = current_thread_info();
@@ -344,6 +359,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
/* Handles int $0x80 */
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
+ add_random_stack_offset();
enter_from_user_mode();
local_irq_enable();
do_syscall_32_irqs_on(regs);
@@ -360,6 +376,8 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
vdso_image_32.sym_int80_landing_pad;
+ add_random_stack_offset();
+
/*
* SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
* so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
--
2.17.1
* Elena Reshetova <[email protected]> wrote:
> This is an example of produced assembly code for gcc x86_64:
>
> ...
> add_random_stack_offset();
> 0xffffffff810022e9 callq 0xffffffff81459570 <prandom_u32>
> 0xffffffff810022ee movzbl %al,%eax
> 0xffffffff810022f1 add $0x16,%rax
> 0xffffffff810022f5 and $0x1f8,%eax
> 0xffffffff810022fa sub %rax,%rsp
> 0xffffffff810022fd lea 0xf(%rsp),%rax
> 0xffffffff81002302 and $0xfffffffffffffff0,%rax
> ...
Hm, that call to prandom_u32() then does:
ffffffff8146cd70 <prandom_u32>:
ffffffff8146cd70: 48 c7 c7 10 e3 01 00 mov $0x1e310,%rdi
ffffffff8146cd77: 65 48 03 3d d9 23 ba add %gs:0x7eba23d9(%rip),%rdi # f158 <this_cpu_off>
ffffffff8146cd7e: 7e
ffffffff8146cd7f: e8 6c ff ff ff callq ffffffff8146ccf0 <prandom_u32_state>
ffffffff8146cd84: c3 retq
which then does:
ffffffff8146ccf0 <prandom_u32_state>:
ffffffff8146ccf0: 8b 0f mov (%rdi),%ecx
ffffffff8146ccf2: 8b 47 04 mov 0x4(%rdi),%eax
ffffffff8146ccf5: 44 8b 47 08 mov 0x8(%rdi),%r8d
ffffffff8146ccf9: 89 ca mov %ecx,%edx
ffffffff8146ccfb: 8d 34 85 00 00 00 00 lea 0x0(,%rax,4),%esi
ffffffff8146cd02: c1 e2 06 shl $0x6,%edx
ffffffff8146cd05: 31 f0 xor %esi,%eax
ffffffff8146cd07: 83 e6 e0 and $0xffffffe0,%esi
ffffffff8146cd0a: 31 ca xor %ecx,%edx
ffffffff8146cd0c: c1 e1 12 shl $0x12,%ecx
ffffffff8146cd0f: 81 e1 00 00 f8 ff and $0xfff80000,%ecx
ffffffff8146cd15: c1 ea 0d shr $0xd,%edx
ffffffff8146cd18: 31 ca xor %ecx,%edx
ffffffff8146cd1a: 89 c1 mov %eax,%ecx
ffffffff8146cd1c: 44 89 c0 mov %r8d,%eax
ffffffff8146cd1f: c1 e0 0d shl $0xd,%eax
ffffffff8146cd22: c1 e9 1b shr $0x1b,%ecx
ffffffff8146cd25: 89 17 mov %edx,(%rdi)
ffffffff8146cd27: 44 31 c0 xor %r8d,%eax
ffffffff8146cd2a: 41 c1 e0 07 shl $0x7,%r8d
ffffffff8146cd2e: 31 f1 xor %esi,%ecx
ffffffff8146cd30: c1 e8 15 shr $0x15,%eax
ffffffff8146cd33: 41 81 e0 00 f8 ff ff and $0xfffff800,%r8d
ffffffff8146cd3a: 89 4f 04 mov %ecx,0x4(%rdi)
ffffffff8146cd3d: 41 31 c0 xor %eax,%r8d
ffffffff8146cd40: 8b 47 0c mov 0xc(%rdi),%eax
ffffffff8146cd43: 44 89 47 08 mov %r8d,0x8(%rdi)
ffffffff8146cd47: 8d 34 c5 00 00 00 00 lea 0x0(,%rax,8),%esi
ffffffff8146cd4e: 31 c6 xor %eax,%esi
ffffffff8146cd50: c1 e0 0d shl $0xd,%eax
ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax
ffffffff8146cd58: c1 ee 0c shr $0xc,%esi
ffffffff8146cd5b: 31 c6 xor %eax,%esi
ffffffff8146cd5d: 89 d0 mov %edx,%eax
ffffffff8146cd5f: 31 c8 xor %ecx,%eax
ffffffff8146cd61: 89 77 0c mov %esi,0xc(%rdi)
ffffffff8146cd64: 44 31 c0 xor %r8d,%eax
ffffffff8146cd67: 31 f0 xor %esi,%eax
ffffffff8146cd69: c3 retq
and just to recover ~5 bits of randomness:
> As a result of the above gcc-produce code this patch introduces
> a bit more than 5 bits of randomness after pt_regs location on
> the thread stack (33 different offsets are generated
> randomly for x86_64 and 47 for i386).
> The amount of randomness can be adjusted based on how much of the
> stack space we wish/can trade for security.
Note that the reduction in bits is mostly due to:
> +#define add_random_stack_offset() do { \
> + size_t offset = ((size_t)prandom_u32()) % 256; \
> + char *ptr = __builtin_alloca(offset); \
> + asm volatile("":"=m"(*ptr)); \
So if we are really sticking this into the syscall path, could we please
be a bit more smart about it? Note how we are already masking off bits in
prandom_u32_state():
ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax
and then we mask *again*, in a rather stupid way:
> 0xffffffff810022ee movzbl %al,%eax
> 0xffffffff810022f1 add $0x16,%rax
> 0xffffffff810022f5 and $0x1f8,%eax
> 0xffffffff810022fa sub %rax,%rsp
> 0xffffffff810022fd lea 0xf(%rsp),%rax
> 0xffffffff81002302 and $0xfffffffffffffff0,%rax
What I'd suggest is:
1)
Put the PRNG state not into a per-cpu area which has to be recovered, but
put it somewhere convenient, already easily accessible at that point in
the syscall path - such as struct thread_info? (But don't take my word
for it - if task_struct is a better place then use that.)
This avoids the whole percpu redirection.
2)
Then also please make a private version of that function which is inlined
into the straight execution path: this code is going to be executed for
every single syscall, so inlining is 1000% justified.
3)
Have a really good look at the generated assembly code and justify every
single instruction, and try to eliminate it. Maybe even take a look at
whether there's any simpler PRNG that is better suited: if we have per
thread_info random state then there will be no cross-task leakage of the
"secret" and we can actually weaken the PRNG.
4)
But before you tweak the patch, a more fundamental question:
Does the stack offset have to be per *syscall execution* randomized?
Which threats does this protect against that a simpler per task syscall
random offset wouldn't protect against?
It would be far, far more performant to just add a fork() time randomized
kernel stack offset (and that can be a *real* random offset, not PRNG) to
syscalls. We could even make that an overall default setting if it's fast
enough, which I think it could be made.
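(For illustration, a rough sketch of the per-task direction described in
1)-4) above; the field and function names are hypothetical and not part
of the posted patch. The state would be seeded once with real randomness
at fork time and advanced by a trivial inlined step on each syscall:)

#include <linux/sched.h>

static __always_inline u32 kstack_offset_next(struct task_struct *tsk)
{
        /* hypothetical field, seeded in dup_task_struct() via get_random_u32() */
        u32 x = tsk->stack_rand_state;

        x ^= x << 13;   /* trivial xorshift32 step; any leakage stays within this task */
        x ^= x >> 17;
        x ^= x << 5;
        tsk->stack_rand_state = x;

        return x & 0xFF;
}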
> Performance (x86_64 measuments only):
>
> 1) lmbench: ./lat_syscall -N 1000000 null
> base: Simple syscall: 0.1774 microseconds
> random_offset (prandom_u32() every syscall): Simple syscall: 0.1822 microseconds
>
> 2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> random_offset (prandom_u32() every syscall): 10000000 loops in 1.64660s = 166.26 nsec / loop
Was this measured with CONFIG_PAGE_TABLE_ISOLATION turned off? By the
time such patches get to real Linux users we'll likely see modern
processors with Meltdown fixed.
I.e. the relative cost of syscall entry code changes should generally be
measured with CONFIG_PAGE_TABLE_ISOLATION off.
Thanks,
Ingo
Hi Ingo,
Thank you for your feedback! See my comments below.
> * Elena Reshetova <[email protected]> wrote:
>
> > This is an example of produced assembly code for gcc x86_64:
> >
> > ...
> > add_random_stack_offset();
> > 0xffffffff810022e9 callq 0xffffffff81459570 <prandom_u32>
> > 0xffffffff810022ee movzbl %al,%eax
> > 0xffffffff810022f1 add $0x16,%rax
> > 0xffffffff810022f5 and $0x1f8,%eax
> > 0xffffffff810022fa sub %rax,%rsp
> > 0xffffffff810022fd lea 0xf(%rsp),%rax
> > 0xffffffff81002302 and $0xfffffffffffffff0,%rax
> > ...
>
> Hm, that all to prandom_u32() then does:
>
> ffffffff8146cd70 <prandom_u32>:
> ffffffff8146cd70: 48 c7 c7 10 e3 01 00 mov $0x1e310,%rdi
> ffffffff8146cd77: 65 48 03 3d d9 23 ba add %gs:0x7eba23d9(%rip),%rdi #
> f158 <this_cpu_off>
> ffffffff8146cd7e: 7e
> ffffffff8146cd7f: e8 6c ff ff ff callq ffffffff8146ccf0 <prandom_u32_state>
> ffffffff8146cd84: c3 retq
>
> which then does:
>
> ffffffff8146ccf0 <prandom_u32_state>:
> ffffffff8146ccf0: 8b 0f mov (%rdi),%ecx
> ffffffff8146ccf2: 8b 47 04 mov 0x4(%rdi),%eax
> ffffffff8146ccf5: 44 8b 47 08 mov 0x8(%rdi),%r8d
> ffffffff8146ccf9: 89 ca mov %ecx,%edx
> ffffffff8146ccfb: 8d 34 85 00 00 00 00 lea 0x0(,%rax,4),%esi
> ffffffff8146cd02: c1 e2 06 shl $0x6,%edx
> ffffffff8146cd05: 31 f0 xor %esi,%eax
> ffffffff8146cd07: 83 e6 e0 and $0xffffffe0,%esi
> ffffffff8146cd0a: 31 ca xor %ecx,%edx
> ffffffff8146cd0c: c1 e1 12 shl $0x12,%ecx
> ffffffff8146cd0f: 81 e1 00 00 f8 ff and $0xfff80000,%ecx
> ffffffff8146cd15: c1 ea 0d shr $0xd,%edx
> ffffffff8146cd18: 31 ca xor %ecx,%edx
> ffffffff8146cd1a: 89 c1 mov %eax,%ecx
> ffffffff8146cd1c: 44 89 c0 mov %r8d,%eax
> ffffffff8146cd1f: c1 e0 0d shl $0xd,%eax
> ffffffff8146cd22: c1 e9 1b shr $0x1b,%ecx
> ffffffff8146cd25: 89 17 mov %edx,(%rdi)
> ffffffff8146cd27: 44 31 c0 xor %r8d,%eax
> ffffffff8146cd2a: 41 c1 e0 07 shl $0x7,%r8d
> ffffffff8146cd2e: 31 f1 xor %esi,%ecx
> ffffffff8146cd30: c1 e8 15 shr $0x15,%eax
> ffffffff8146cd33: 41 81 e0 00 f8 ff ff and $0xfffff800,%r8d
> ffffffff8146cd3a: 89 4f 04 mov %ecx,0x4(%rdi)
> ffffffff8146cd3d: 41 31 c0 xor %eax,%r8d
> ffffffff8146cd40: 8b 47 0c mov 0xc(%rdi),%eax
> ffffffff8146cd43: 44 89 47 08 mov %r8d,0x8(%rdi)
> ffffffff8146cd47: 8d 34 c5 00 00 00 00 lea 0x0(,%rax,8),%esi
> ffffffff8146cd4e: 31 c6 xor %eax,%esi
> ffffffff8146cd50: c1 e0 0d shl $0xd,%eax
> ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax
> ffffffff8146cd58: c1 ee 0c shr $0xc,%esi
> ffffffff8146cd5b: 31 c6 xor %eax,%esi
> ffffffff8146cd5d: 89 d0 mov %edx,%eax
> ffffffff8146cd5f: 31 c8 xor %ecx,%eax
> ffffffff8146cd61: 89 77 0c mov %esi,0xc(%rdi)
> ffffffff8146cd64: 44 31 c0 xor %r8d,%eax
> ffffffff8146cd67: 31 f0 xor %esi,%eax
> ffffffff8146cd69: c3 retq
>
> and just to recover ~5 bits of randomness:
Yes, I agree this is a lot of assembly code and that's why initially I had the
whole thing implemented in asm to control what is being produced,
but that had its own issues...
>
> > As a result of the above gcc-produce code this patch introduces
> > a bit more than 5 bits of randomness after pt_regs location on
> > the thread stack (33 different offsets are generated
> > randomly for x86_64 and 47 for i386).
> > The amount of randomness can be adjusted based on how much of the
> > stack space we wish/can trade for security.
>
> Note that the reduction in bits is mostly due to:
>
> > +#define add_random_stack_offset() do { \
> > + size_t offset = ((size_t)prandom_u32()) % 256; \
> > + char *ptr = __builtin_alloca(offset); \
> > + asm volatile("":"=m"(*ptr)); \
>
> So if we are really sticking this into the syscall path, could we please
> be a bit more smart about it? Note how we are already masking off bits in
> prandom_u32_state():
>
> ffffffff8146cd53: 25 00 00 f0 ff and $0xfff00000,%eax
>
> and then we mask *again*, in a rather stupid way:
>
> > 0xffffffff810022ee movzbl %al,%eax
> > 0xffffffff810022f1 add $0x16,%rax
> > 0xffffffff810022f5 and $0x1f8,%eax
> > 0xffffffff810022fa sub %rax,%rsp
> > 0xffffffff810022fd lea 0xf(%rsp),%rax
> > 0xffffffff81002302 and $0xfffffffffffffff0,%rax
>
> What I'd suggest is:
>
> 1)
>
> Put the PRNG state not into a per-cpu area which has to be recovered, but
> put it somewhere convenient, already accessibe easily at that point in
> the syscall path - such as struct thread_info? (But don't take my word
> for it - if task_struct is a better place then use that.)
>
> This avoids the whole percpu redirection.
Hm... but wouldn't this mean that one can easily probe the PRNG state
from the task or thread_info struct and as a result predict the next randomization
offset for that task? My understanding is that the current prandom state underlying
prandom_u32() is shared between tasks and reseeded often enough (every 60 sec),
which makes it hard to predict the next generated random number and the place where
it will get used. If we now introduce a per-task PRNG, how do we manage its state sanely
and securely? It is already per-task, so it is clear who its user is; the state is stored
in task/thread_info, so we know exactly where to probe (if we are able to, via arbitrary read or
some other read); and then with high odds it is used for stack randomization only, so
we can predict all future stack offsets.
>
> 2)
>
> Then also please make a private version of that function which is inlined
> into the straight execution path: this code is going to be executed for
> every single syscall, so inlining is 1000% justified.
>
> 3)
>
> Have a really good look at the generated assembly code and justify every
> single instruction, and try to eliminate it. Maybe even take a look at
> whether there's any simpler PRNG that is better suited: if we have per
> thread_info random state then there will be no cross-task leakage of the
> "secret" and we can actually weaken the PRNG.
Not sure about this, see above. Unless we go back to the original way of just
using rdtsc like I had in the original patch:
pushq %rax
xorq %rax, %rax
rdtsc
andq $0xFF0, %rax
And then call this from the do_syscall code paths and use alloca for applying the actual offset.
It would seem to me that it is better to do it this way for security vs. storing the PRNG state in
the task/thread_info structs.
>
> 4)
>
> But before you tweak the patch, a more fundamental question:
>
> Does the stack offset have to be per *syscall execution* randomized?
> Which threats does this protect against that a simpler per task syscall
> random offset wouldn't protect against?
We *really* need it per syscall. If you take a look at the recent stack attacks
[1],[2],[3],[4], they all do some initial probing via syscalls first to discover stack addresses
or leftover data on the stack (or to pre-populate the stack with some attacker-controlled data),
and then in a following syscall execute the actual attack (leak data, use the
pre-populated data for execution, etc.). If the offset stays the same during
the task's lifetime, it can be easily recovered during this initial probing phase, and
then nothing changes for the attacker.
[1] Kernel Exploitation Via Uninitialized Stack, 2011
https://www.defcon.org/images/defcon-19/dc-19-presentations/Cook/DEFCON-19-Cook-Kernel-Exploitation.pdf
[2] Stackjacking, 2011, https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
[3] The Stack is Back, 2012, https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
[4] Exploiting Recursion in the Linux Kernel, 2016,
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
> It would be far, far more performant to just add a fork() time randomized
> kernel stack offset (and that can be a *real* random offset, not PRNG) to
> syscalls. We could even make that an overall default setting if it's fast
> enough, which I think it could be made.
>
> > Performance (x86_64 measuments only):
> >
> > 1) lmbench: ./lat_syscall -N 1000000 null
> > base: Simple syscall: 0.1774 microseconds
> > random_offset (prandom_u32() every syscall): Simple syscall: 0.1822
> microseconds
> >
> > 2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > random_offset (prandom_u32() every syscall): 10000000 loops in 1.64660s =
> 166.26 nsec / loop
>
> Was this measured with CONFIG_PAGE_TABLE_ISOLATION turned off? By the
> time such patches get to real Linux users we'll likely see modern
> processors with Meltdown fixed.
No, the above numbers are with CONFIG_PAGE_TABLE_ISOLATION=y for x86_64;
I will also test with CONFIG_PAGE_TABLE_ISOLATION turned off from now on.
Best Regards,
Elena.
* Reshetova, Elena <[email protected]> wrote:
> > 4)
> >
> > But before you tweak the patch, a more fundamental question:
> >
> > Does the stack offset have to be per *syscall execution* randomized?
> > Which threats does this protect against that a simpler per task syscall
> > random offset wouldn't protect against?
>
> We *really* need it per syscall. If you take a look on the recent stack attacks
> [1],[2],[3],[4], they all do some initial probing on syscalls first to discover stack addresses
> or leftover data on the stack (or pre-populate stack with some attacker-controlled data),
> and then in the following syscall execute the actual attack (leak data, use
> pre-populated data for execution, etc.). If the offset stays the same during
> task life time, it can be easily recovered during this initial probing phase, and
> then nothing changes for the attacker.
>
> [1] Kernel Exploitation Via Uninitialized Stack, 2011
> https://www.defcon.org/images/defcon-19/dc-19-presentations/Cook/DEFCON-19-Cook-Kernel-Exploitation.pdf
> [2] Stackjacking, 2011, https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> [3] The Stack is Back, 2012, https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> [4] Exploiting Recursion in the Linux Kernel, 2016,
> https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
Yeah, so if there's an information leak from the kernel stack, don't we
now effectively store 5 PRNG bits there for every syscall, allowing the
systematic probing of the generic PRNG?
The kernel can execute millions of syscalls per second, I'm pretty sure
there's a statistical attack against:
* This is a maximally equidistributed combined Tausworthe generator
* based on code from GNU Scientific Library 1.5 (30 Jun 2004)
*
* lfsr113 version:
*
* x_n = (s1_n ^ s2_n ^ s3_n ^ s4_n)
*
* s1_{n+1} = (((s1_n & 4294967294) << 18) ^ (((s1_n << 6) ^ s1_n) >> 13))
* s2_{n+1} = (((s2_n & 4294967288) << 2) ^ (((s2_n << 2) ^ s2_n) >> 27))
* s3_{n+1} = (((s3_n & 4294967280) << 7) ^ (((s3_n << 13) ^ s3_n) >> 21))
* s4_{n+1} = (((s4_n & 4294967168) << 13) ^ (((s4_n << 3) ^ s4_n) >> 12))
*
* The period of this generator is about 2^113 (see erratum paper).
... which recovers the real PRNG state much faster than the ~60 seconds
seeding interval and allows the prediction of the next stack offset?
I.e. I don't see how kernel stack PRNG randomization protects against
information leaks from the kernel stack. By putting PRNG information into
the kernel stack for *every* system call we add a broad attack surface:
any obscure ioctl information leak can now be escalated into an attack
against the net_rand_state PRNG, right?
> No the above numbers are with CONFIG_PAGE_TABLE_ISOLATION=y for x86_64,
> I will test with CONFIG_PAGE_TABLE_ISOLATION turned off from now on
> also.
Thanks!
Ingo
Adding Theodore & Daniel since I guess they are best positioned to comment on
the exact strength of prandom. See my comments below.
> * Reshetova, Elena <[email protected]> wrote:
>
> > > 4)
> > >
> > > But before you tweak the patch, a more fundamental question:
> > >
> > > Does the stack offset have to be per *syscall execution* randomized?
> > > Which threats does this protect against that a simpler per task syscall
> > > random offset wouldn't protect against?
> >
> > We *really* need it per syscall. If you take a look on the recent stack attacks
> > [1],[2],[3],[4], they all do some initial probing on syscalls first to discover stack
> addresses
> > or leftover data on the stack (or pre-populate stack with some attacker-controlled
> data),
> > and then in the following syscall execute the actual attack (leak data, use
> > pre-populated data for execution, etc.). If the offset stays the same during
> > task life time, it can be easily recovered during this initial probing phase, and
> > then nothing changes for the attacker.
> >
> > [1] Kernel Exploitation Via Uninitialized Stack, 2011
> > https://www.defcon.org/images/defcon-19/dc-19-presentations/Cook/DEFCON-
> 19-Cook-Kernel-Exploitation.pdf
> > [2] Stackjacking, 2011, https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] The Stack is Back, 2012, https://jon.oberheide.org/files/infiltrate12-
> thestackisback.pdf
> > [4] Exploiting Recursion in the Linux Kernel, 2016,
> > https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-
> kernel_20.html
>
> Yeah, so if there's an information leak from the kernel stack, don't we
> now effectively store 5 PRNG bits there for every syscall, allowing the
> systematic probing of the generic PRNG?
Yes, so if there is a reliable stack-based leak that allows you to leak the 5
PRNG bits for every syscall within the 60 sec (until the reseeding happens), then
we indeed leak these 5 bits with every syscall. After we have accumulated
enough bits (not sure what the correct bound is here on how many bits we need
at minimum), we can predict the PRNG output until the reseed happens.
>
> The kernel can execute millions of syscalls per second, I'm pretty sure
> there's a statistical attack against:
>
> * This is a maximally equidistributed combined Tausworthe generator
> * based on code from GNU Scientific Library 1.5 (30 Jun 2004)
> *
> * lfsr113 version:
> *
> * x_n = (s1_n ^ s2_n ^ s3_n ^ s4_n)
> *
> * s1_{n+1} = (((s1_n & 4294967294) << 18) ^ (((s1_n << 6) ^ s1_n) >> 13))
> * s2_{n+1} = (((s2_n & 4294967288) << 2) ^ (((s2_n << 2) ^ s2_n) >> 27))
> * s3_{n+1} = (((s3_n & 4294967280) << 7) ^ (((s3_n << 13) ^ s3_n) >> 21))
> * s4_{n+1} = (((s4_n & 4294967168) << 13) ^ (((s4_n << 3) ^ s4_n) >> 12))
> *
> * The period of this generator is about 2^113 (see erratum paper).
>
> ... which recovers the real PRNG state much faster than the ~60 seconds
> seeding interval and allows the prediction of the next stack offset?
I hope Theodore can comment on bounds here: how many syscalls we would need
to issue, assuming that each leaks 5 pseudorandom bits out of the 32-bit
pseudorandom number produced by the PRNG, before we can predict the
PRNG output.
Daniel made a change to prandom a while back to allow storing the state & doing
the seeding of the PRNG separately, so I guess this is what you meant initially when proposing
to move the state from per-cpu to per-task/thread.
I have been thinking more about this now; it is certainly good to limit the
potential damage to just one task/thread by having it depend on its own PRNG,
and I guess my previous argument about the amount of control an attacker has
when trying to break a per-cpu PRNG vs. a per-thread PRNG does not really hold,
because the attacker can probably "create/adjust" the environment quite a bit
(i.e. make sure that it is mostly the attacker's process that executes syscalls
during the desired small attack window).
However, what I am still wondering
is how safe it is to store a per-task PRNG's clear state in something like the task struct or thread_info.
If it leaks, only one leak is enough; these structs are very well known and
well-studied by attackers, and thread_info even used to be on the stack itself
(and still is if you disable CONFIG_THREAD_INFO_IN_TASK)...
More thoughts on options below...
>
> I.e. I don't see how kernel stack PRNG randomization protects against
> information leaks from the kernel stack.
It does not protect against information leaks from the kernel stack; we have
other protections for that: STACKLEAK, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL.
What it aims to do is to make the stack structure different and unpredictable on
each syscall, so that an attacker cannot reliably dereference any particular
spot on the stack. So, if as an attacker you do something like this:
1) Make some probing syscalls to discover some info about the stack,
where some data structures land on the stack during a particular syscall,
or pre-fill particular spots on the stack with attacker-controlled data.
2) Execute the actual attack via a new syscall where you attempt to
either reliably use the data you prefilled on the stack or, for example, leak some
particular spot from the stack via an overlapping data structure.
In order to do 2) successfully you need the stack structure over different
invocations of syscalls to be somewhat predictable. This is exactly
what this protection aims to remove.
> By putting PRNG information into
> the kernel stack for *every* system call we add a broad attack surface:
> any obscure ioctl information leak can now be escalated into an attack
> against the net_rand_state PRNG, right?
I was previously thinking that the net code doesn't share the PRNG state with the rest of
the system, but looking into the code now, it does exactly that.
Only bpf & netfilter out of the net-related code manage their own states...
So, yes, it looks like the use of the shared prandom state throughout the kernel is already
pretty widespread, and making it bigger is a bad idea, indeed.
What if we define separate new per-cpu states for this stack randomization and seed/reseed
them separately & regularly (I just have to understand where to do the initial seeding code-wise, etc.)?
This way we don't expose any other PRNG use and still
don't have the issue of storing PRNG state in the task struct.
However, this won't address your concern about needing to dereference this per-cpu variable
upon each syscall. But if I make a specific inline function (like you recommended before) to extract
the small number of bits (5-6) every time we need them, it would already be
much smaller than the current construct, and if the performance is acceptable,
wouldn't it be a good alternative?
It still of course does not cancel the fact that if you have a reliable leak and do enough syscalls
within the reseeding period, you can break the PRNG, but the worst consequence of
that would be breaking stack randomization only.
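(A rough sketch of that idea, with hypothetical names and the seeding/reseeding
plumbing omitted:)

#include <linux/random.h>
#include <linux/percpu.h>

/* Per-cpu state used only for stack offset randomization, reseeded
 * periodically from the CRNG; deliberately not shared with net_rand_state. */
static DEFINE_PER_CPU(struct rnd_state, kstack_rand_state);

static __always_inline u32 kstack_random_bits(void)
{
        /* extract only the 5-6 bits needed per syscall */
        return prandom_u32_state(this_cpu_ptr(&kstack_rand_state)) & 0x3F;
}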
Best Regards,
Elena.
On Tue, Apr 16, 2019 at 11:10:16AM +0000, Reshetova, Elena wrote:
> >
> > The kernel can execute millions of syscalls per second, I'm pretty sure
> > there's a statistical attack against:
> >
> > * This is a maximally equidistributed combined Tausworthe generator
> > * based on code from GNU Scientific Library 1.5 (30 Jun 2004)
> > *
> > * lfsr113 version:
> > *
> > * x_n = (s1_n ^ s2_n ^ s3_n ^ s4_n)
> > *
> > * s1_{n+1} = (((s1_n & 4294967294) << 18) ^ (((s1_n << 6) ^ s1_n) >> 13))
> > * s2_{n+1} = (((s2_n & 4294967288) << 2) ^ (((s2_n << 2) ^ s2_n) >> 27))
> > * s3_{n+1} = (((s3_n & 4294967280) << 7) ^ (((s3_n << 13) ^ s3_n) >> 21))
> > * s4_{n+1} = (((s4_n & 4294967168) << 13) ^ (((s4_n << 3) ^ s4_n) >> 12))
> > *
> > * The period of this generator is about 2^113 (see erratum paper).
> >
> > ... which recovers the real PRNG state much faster than the ~60 seconds
> > seeding interval and allows the prediction of the next stack offset?
>
> I hope Theodore can comment on bounds here. How many syscalls we need
> to issue assuming that each leaks 5 presudorandom bits out of 32 bit
> presudorandom number produced by PRGN before we can predict the
> PRNG output.
So the argument against using TSC directly was that it might be easy to
guess most of the TSC bits in a timing attack. But IIRC there is fairly
solid evidence that the lowest TSC bits are very hard to guess and might
in fact be a very good random source.
So what one could do, is for each invocation mix in the low (2?) bits of
the TSC into a per-cpu/task PRNG state. By always adding some fresh
entropy it would become very hard indeed to predict the outcome, even
for otherwise 'trivial' PRNGs.
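(A minimal sketch of that mixing idea, purely illustrative and with made-up
names; the mixing step itself is just a placeholder for "something cheap":)

#include <linux/percpu.h>
#include <asm/msr.h>

static DEFINE_PER_CPU(u32, kstack_mix_state);

static __always_inline u32 kstack_mix_next(void)
{
        /* fold 2 fresh low TSC bits into the state, then advance it cheaply */
        u32 x = this_cpu_read(kstack_mix_state) ^ (rdtsc() & 0x3);

        x = x * 2654435761u + 1;
        this_cpu_write(kstack_mix_state, x);

        return x >> 24;
}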
From: Peter Zijlstra
> Sent: 16 April 2019 13:08
...
> So the argument against using TSC directly was that it might be easy to
> guess most of the TSC bits in timing attack. But IIRC there is fairly
> solid evidence that the lowest TSC bits are very hard to guess and might
> in fact be a very good random source.
>
> So what one could do, is for each invocation mix in the low (2?) bits of
> the TSC into a per-cpu/task PRNG state. By always adding some fresh
> entropy it would become very hard indeed to predict the outcome, even
> for otherwise 'trivial' PRNGs.
You could just feed 8 bits of TSC into a CRC.
Or even xor the entire TSC over a CRC state and then cycle it at least 6 bits.
Probably doesn't matter which CRC - but you may want one that is
cheap in software.
Even a 16bit CRC might be enough.
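(For concreteness, one way that CRC variant might look, using the kernel's
crc32_le() purely as an example of a cheap CRC; names are illustrative:)

#include <linux/crc32.h>
#include <linux/percpu.h>
#include <asm/msr.h>

static DEFINE_PER_CPU(u32, kstack_crc_state);

static __always_inline u32 kstack_crc_next(void)
{
        u64 tsc = rdtsc();
        /* xor the whole TSC over the CRC state, then cycle it through the CRC */
        u32 crc = crc32_le(this_cpu_read(kstack_crc_state) ^ (u32)tsc,
                           (const u8 *)&tsc, sizeof(tsc));

        this_cpu_write(kstack_crc_state, crc);
        return crc & 0xFF;
}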
David
So a couple of comments; I wasn't able to find the full context for
this patch, and looking over the thread on kernel-hardening from late
February still left me confused about exactly which attacks this would help
us protect against (since this isn't my area and I didn't take the
time to read all of the links to slide decks, etc.)
So I'm not going to comment on the utility of this patch, but just on
the random number generator issues. If you're only going to be using
the low 8 bits of the output of get_prandom_u32(), even if two
adjacent calls to get_prandom_u32() (for which only the low 8 bits are
revealed) can be used to precisely identify which set of 2**120
potential prandom states could have generated that pair of outputs, it's
still going to take a lot of calls before you'd be able to figure out
the prandom's internal state.
It seems, though, that we're assuming the attacker has the
arbitrary ability to get the low bits of the stack, so *if* that's
true, then eventually you'd be able to get enough samples that you
could reverse engineer the prandom state. This could take long enough
that the process will have gotten rescheduled to another CPU, and
since the prandom state is per-cpu, that adds another wrinkle.
> > So the argument against using TSC directly was that it might be easy to
> > guess most of the TSC bits in timing attack. But IIRC there is fairly
> > solid evidence that the lowest TSC bits are very hard to guess and might
> > in fact be a very good random source.
> >
> > So what one could do, is for each invocation mix in the low (2?) bits of
> > the TSC into a per-cpu/task PRNG state. By always adding some fresh
> > entropy it would become very hard indeed to predict the outcome, even
> > for otherwise 'trivial' PRNGs.
>
> You could just feed 8 bits of TSC into a CRC. Or even xor the
> entire TSC over a CRC state and then cycle it at least 6 bits.
> Probably doesn't matter which CRC - but you may want one that is
> cheap in software. Even a 16bit CRC might be enough.
Do we only care about x86 in this discussion? Given "x86/entry/64",
I'm guessing the answer we're not trying to worry about how to protect
other architectures, like say ARM, that don't have a TSC?
If we do care about architectures w/o a TSC, how much cost are we
willing to pay as far as system call overhead is concerned?
If it's x86 specific, maybe the simplest thing to do is to use RDRAND
if it exists, and fall back to something involving a TSC and maybe
prandom_u32 (assuming on how bad you think the stack leak is going to
be) if RDRAND isn't available?
- Ted
On Tue, Apr 16, 2019 at 11:43:49AM -0400, Theodore Ts'o wrote:
> If it's x86 specific, maybe the simplest thing to do is to use RDRAND
> if it exists, and fall back to something involving a TSC and maybe
> prandom_u32 (assuming on how bad you think the stack leak is going to
> be) if RDRAND isn't available?
From https://lkml.kernel.org/r/[email protected]
Performance:
1) lmbench: ./lat_syscall -N 1000000 null
base: Simple syscall: 0.1774 microseconds
random_offset (rdtsc): Simple syscall: 0.1803 microseconds
random_offset (rdrand): Simple syscall: 0.3702 microseconds
2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
base: 10000000 loops in 1.62224s = 162.22 nsec / loop
random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
Basically, RDRAND is frigging slow...
> So a couple of comments; I wasn't able to find the full context for
> this patch, and looking over the thread on kernel-hardening from late
> February still left me confused exactly what attacks this would help
> us protect against (since this isn't my area and I didn't take the
> time to read all of the links to slide decks, etc.)
>
> So I'm not going to comment on the utility of this patch, but just on
> the random number generator issues. If you're only going to be using
> the low 8 bits of the output of get_prandom_u32(), even if two
> adjacent calls to get_prandom_u32() (for which only the low 8 bits are
> revealed) can be used to precisely identify which set of 2**120
> potential prandom states could have generate that pair of states, it's
> still going to take a lot of calls before you'd be able to figure out
> the prandom's internal state.
>
> It seems though the assumption that we're assuming the attacker has
> arbitrary ability to get the low bits of the stack, so *if* that's
> true, then eventually, you'd be able to get enough samples that you
> could reverse engineer the prandom state. This could take long enough
> that the process will have gotten rescheduled to another CPU, and
> since the prandom state is per-cpu, that adds another wrinkle.
Well, yes, my feeling is also that it is going to be hard to do, but can we get
some more concrete numbers on this? We can forget about per-cpu rescheduling
for simplicity and just calculate how many calls it would take to recover the state,
given that each call leaks 5 bits.
I can try to make this calculation based on my limited knowledge of crypto,
but I would have to read papers on this PRNG first, etc., so I was just checking whether
people already have a feeling for it given how common this generator is in the kernel.
>
> > > So the argument against using TSC directly was that it might be easy to
> > > guess most of the TSC bits in timing attack. But IIRC there is fairly
> > > solid evidence that the lowest TSC bits are very hard to guess and might
> > > in fact be a very good random source.
> > >
> > > So what one could do, is for each invocation mix in the low (2?) bits of
> > > the TSC into a per-cpu/task PRNG state. By always adding some fresh
> > > entropy it would become very hard indeed to predict the outcome, even
> > > for otherwise 'trivial' PRNGs.
> >
> > You could just feed 8 bits of TSC into a CRC. Or even xor the
> > entire TSC over a CRC state and then cycle it at least 6 bits.
> > Probably doesn't matter which CRC - but you may want one that is
> > cheap in software. Even a 16bit CRC might be enough.
>
> Do we only care about x86 in this discussion? Given "x86/entry/64",
> I'm guessing the answer we're not trying to worry about how to protect
> other architectures, like say ARM, that don't have a TSC?
Well, this patch is for x86 only, but other arches might want to have similar
functionality, I guess...
>
> If we do care about architectures w/o a TSC, how much cost are we
> willing to pay as far as system call overhead is concerned?
Good question. I don't know the exact answer as to what overhead per syscall
is acceptable for such a feature, but it should be very light to be useful; otherwise
the config would never be turned on.
>
> If it's x86 specific, maybe the simplest thing to do is to use RDRAND
> if it exists, and fall back to something involving a TSC and maybe
> prandom_u32 (assuming on how bad you think the stack leak is going to
> be) if RDRAND isn't available?
>
RDRAND is way too slow, so it is out. That's why we were looking into other
options for fast randomness. Unfortunately it looks like we don't have
that many options in the kernel.
rdtsc (its low bits) was the original candidate, and Peter Zijlstra pointed to a paper
claiming that it is a good source of randomness:
http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html
But I don't have enough knowledge on this to make a good judgment.
The original grsecurity patch used the low bits of rdtsc for the in-stack random offset.
Best Regards,
Elena
> On Tue, Apr 16, 2019 at 11:10:16AM +0000, Reshetova, Elena wrote:
> > >
> > > The kernel can execute millions of syscalls per second, I'm pretty sure
> > > there's a statistical attack against:
> > >
> > > * This is a maximally equidistributed combined Tausworthe generator
> > > * based on code from GNU Scientific Library 1.5 (30 Jun 2004)
> > > *
> > > * lfsr113 version:
> > > *
> > > * x_n = (s1_n ^ s2_n ^ s3_n ^ s4_n)
> > > *
> > > * s1_{n+1} = (((s1_n & 4294967294) << 18) ^ (((s1_n << 6) ^ s1_n) >> 13))
> > > * s2_{n+1} = (((s2_n & 4294967288) << 2) ^ (((s2_n << 2) ^ s2_n) >> 27))
> > > * s3_{n+1} = (((s3_n & 4294967280) << 7) ^ (((s3_n << 13) ^ s3_n) >> 21))
> > > * s4_{n+1} = (((s4_n & 4294967168) << 13) ^ (((s4_n << 3) ^ s4_n) >> 12))
> > > *
> > > * The period of this generator is about 2^113 (see erratum paper).
> > >
> > > ... which recovers the real PRNG state much faster than the ~60 seconds
> > > seeding interval and allows the prediction of the next stack offset?
> >
> > I hope Theodore can comment on bounds here. How many syscalls we need
> > to issue assuming that each leaks 5 presudorandom bits out of 32 bit
> > presudorandom number produced by PRGN before we can predict the
> > PRNG output.
>
> So the argument against using TSC directly was that it might be easy to
> guess most of the TSC bits in timing attack. But IIRC there is fairly
> solid evidence that the lowest TSC bits are very hard to guess and might
> in fact be a very good random source.
It might be the case, especially for this particular use case, but I do not
consider myself the right person to judge the evidence (proper
randomness *is* hard), so I would prefer to have something stronger, if possible.
>
> So what one could do, is for each invocation mix in the low (2?) bits of
> the TSC into a per-cpu/task PRNG state. By always adding some fresh
> entropy it would become very hard indeed to predict the outcome, even
> for otherwise 'trivial' PRNGs.
Again, with only very limited training in crypto: this might work if our
entropy were real entropy, but if we are mixing one pseudo-randomness
(potentially predictable) with another, does it really make things much stronger?
Of course, in this case the two sources we mix would be independent, so it would
kind of add non-linearity into the PRNG.... Maybe I should sleep over
all these options and think in the morning.
Best Regards,
Elena.
* Theodore Ts'o <[email protected]> wrote:
> It seems though the assumption that we're assuming the attacker has
> arbitrary ability to get the low bits of the stack, so *if* that's
> true, then eventually, you'd be able to get enough samples that you
> could reverse engineer the prandom state. This could take long enough
> that the process will have gotten rescheduled to another CPU, and since
> the prandom state is per-cpu, that adds another wrinkle.
Yeah.
Note that if the attacker has this level of local access then they can
probably also bind the task to a CPU, which would increase the
statistical stability of any attack. Plus, with millions of system calls
per second executed in an attack, each of which exposes a
couple of bits of prandom state, I'm pretty sure some prandom attack
exists that can make the extraction of the full internal state probable
within the ~60 seconds reseeding interval. (Is there any research on this
perhaps, or do researchers not even bother, because this isn't really a
secure algorithm in any reasonable meaning of the word?)
Thanks,
Ingo
From: Reshetova, Elena
> Sent: 16 April 2019 17:47
..
> > It seems though the assumption that we're assuming the attacker has
> > arbitrary ability to get the low bits of the stack, so *if* that's
> > true, then eventually, you'd be able to get enough samples that you
> > could reverse engineer the prandom state. This could take long enough
> > that the process will have gotten rescheduled to another CPU, and
> > since the prandom state is per-cpu, that adds another wrinkle.
>
> Well, yes, this is also my feeling that it is going to be hard to do, but can we get
> some more concrete numbers of this? We can forget about per-cpu rescheduling
> for simplicity and just calculate how many calls it would take to recover the state
> given that each call leaks 5 bits.
If you can guarantee back-to-back requests on the PRNG then it is probably
possible to recalculate its state from 'bits of state'/5 calls.
Depending on the PRNG this might be computationally expensive.
For some PRNGs it will be absolutely trivial.
IIRC the kernel PRNG is just the xor of several CRCs; without any
addition or multiply operations it is probably analytically reversible.
Thinking further...
If it has 128 state bits, 26 consecutive outputs of 5 bits give
you 130 linear equations in 128 unknowns - trivially solvable.
Stirring in a little bit of entropy doesn't help much either.
The entropy bits are effectively initial state bits.
Adding 4 in with each request and taking 128 outputs gives 640 linear
equations in the (128 + 4 * 128) unknowns - still solvable.
The extra entropy does stop you completely predicting the following
outputs.
Of course, if the PRNG is non-linear solving the equations is hard.
In that case you don't need to worry anywhere near as much about
feeding in more entropy bits than are being removed.
David
On Wed, Apr 17, 2019 at 09:28:35AM +0000, David Laight wrote:
>
> If you can guarantee back to back requests on the PRNG then it is probably
> possible to recalculate its state from 'bits of state'/5 calls.
> Depend on the PRNG this might be computationally expensive.
> For some PRNG it will be absolutely trivial.
> ...
> Stirring in a little bit of entropy doesn't help much either.
> The entropy bits are effectively initial state bits.
> Add 4 in with each request and 128 outputs gives 640 linear
> equations in the (128 + 4 * 128) unknowns - still solvable.
This is basically a scenario where the attacker has already taken
control of Ring 3 execution and the question is how hard it is for
them to perform a privilege escalation attack to ring 0, right? I'm
sure the security folks will think I'm defeatist, but my personal rule
of thumb is that if the attacker has ring 3 control, you've already lost
--- I figure there are so many zero days that getting ring 0 control
is a foregone conclusion. :-(
So that basically means that if we want to protect against this, we're
going to do something which involves Real Crypto (tm). Whether that's
RDRAND, or using Chacha20, etc., or something that has some attack
resistance, such as "half MD5", etc., but is eminently crackable by
brute force, is essentially an overhead vs. security argument, and a
question of what we are willing to pay.
- Ted
On Wed, Apr 17, 2019 at 10:17 AM Theodore Ts'o <[email protected]> wrote:
>
> On Wed, Apr 17, 2019 at 09:28:35AM +0000, David Laight wrote:
> >
> > If you can guarantee back to back requests on the PRNG then it is probably
> > possible to recalculate its state from 'bits of state'/5 calls.
> > Depend on the PRNG this might be computationally expensive.
> > For some PRNG it will be absolutely trivial.
> > ...
> > Stirring in a little bit of entropy doesn't help much either.
> > The entropy bits are effectively initial state bits.
> > Add 4 in with each request and 128 outputs gives 640 linear
> > equations in the (128 + 4 * 128) unknowns - still solvable.
>
> This is basically a scenario where the attacker has already taken
> control of Ring 3 execution and the question is how hard is it for
> them to perform privilege escalation attack to ring 0, right? I'm
> sure the security folks will think I'm defeatist, but my personal rule
> of thumb is if the attacker has ring 3 control, you've already lost
> --- I figure there are so many zero days that getting ring 0 control
> is a foregone conclusion. :-(
I think this attitude comes from Linux traditionally having had such a
weak line between ring 3 and ring 0. That's what we're trying to fix,
generally speaking. :)
> So that basically means if we want to protect against this, we're
> going to do something which involves Real Crypto (tm). Whether that's
> RDRAND, or using Chacha20, etc., or something that has some attack
> resistance, such as "half MD5", etc., but emminently crackable by
> brute force, is essentially a overhead vs. security argument, and what
> it is we are willing to pay.
I wonder how a separate per-cpu state combined with frequent reseeding
would compare to chacha20 (or RDRAND)?
Another point to consider is that this weakness depends on a separate
bug existing, which is becoming less and less likely, given the
always-init options now available. I don't think we should try to
over-engineer this too much. Best-effort here seems fine. Using a
stack leak when the stack is randomized may also prove difficult, so
there's some chicken-and-egg problems with the proposed threat...
--
Kees Cook
From: Theodore Ts'o
> Sent: 17 April 2019 16:16
> On Wed, Apr 17, 2019 at 09:28:35AM +0000, David Laight wrote:
> >
> > If you can guarantee back to back requests on the PRNG then it is probably
> > possible to recalculate its state from 'bits of state'/5 calls.
> > Depend on the PRNG this might be computationally expensive.
> > For some PRNG it will be absolutely trivial.
> > ...
> > Stirring in a little bit of entropy doesn't help much either.
> > The entropy bits are effectively initial state bits.
> > Add 4 in with each request and 128 outputs gives 640 linear
> > equations in the (128 + 4 * 128) unknowns - still solvable.
>
> This is basically a scenario where the attacker has already taken
> control of Ring 3 execution and the question is how hard is it for
> them to perform privilege escalation attack to ring 0, right?
Or extract information that should only be known by ring 0.
I'm fairly sure many of the side-channel attacks not only require
ring 3 access, but also the ability to request that ring 0 repeatedly
perform a specific action on an otherwise idle system.
> I'm sure the security folks will think I'm defeatist, but my personal rule
> of thumb is if the attacker has ring 3 control, you've already lost
> --- I figure there are so many zero days that getting ring 0 control
> is a foregone conclusion. :-(
>
> So that basically means if we want to protect against this, we're
> going to do something which involves Real Crypto (tm). Whether that's
> RDRAND, or using Chacha20, etc., or something that has some attack
> resistance, such as "half MD5", etc., but emminently crackable by
> brute force, is essentially a overhead vs. security argument, and what
> it is we are willing to pay.
Some of these 'random' values have a short lifetime - and would need
to be cracked quickly to be of any use.
I suspect that combining the outputs of three linear generators with
addition rather than xor would make it computationally much harder to
reverse.
David
Hi,
Sorry for the delay - Easter holidays + I was trying to arrange my brain around the proposed options.
Here is what I think our options are with regard to the source of randomness:
1) rdtsc or variations based on it (David proposed some CRC-based variants for example)
2) prandom-based options
3) some proper crypto (chacha8 for example seems to be the lightest of the existing options,
and probably enough for our purpose, but it looks like the kernel has only chacha20)
4) rdrand or other HW-based crypto
Option 4 was measured to be heavy for the purpose:
base: Simple syscall: 0.1774 microseconds
random_offset (rdtsc): Simple syscall: 0.1803 microseconds
random_offset (rdrand): Simple syscall: 0.3702 microseconds
Option 2 (even if we fork our own state(s), do it per-cpu, reseed, etc.) starts to look to me like the least desirable.
The existing generator's state, as people mentioned before, is trivially solvable given the very small number of
equations (syscalls in our case) you need to issue and offsets to leak.
Even if we isolate the state/seed to just this purpose of stack randomization (and don't leak anything about the rest
of the system or the net prandom usage), it still probably makes the
randomization more easily solvable than some constructs based on the lower bits of rdtsc.
In addition, building on top of the existing kernel LFSR would add more code (probably not useful for any other purpose),
a possible misconception that it can be used for "real security", etc. So I would propose to abandon this idea.
Option 3 we have to measure, I guess, but if it is as heavy as rdrand, then it is also out.
In that case, we are only left with the rdtsc-based variants.
I guess for measuring I will have to use the existing chacha20 kernel implementation; I have never used it before,
so I will check how to use/initialize it, etc.
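(One easy starting point for that measurement might be the existing
chacha20-backed get_random_u32() helper; this is a sketch only, and whether
its batched output is fast enough per syscall is exactly what needs measuring:)

#include <linux/random.h>

/* Same alloca construct as in the patch, but fed from the chacha20-based
 * CRNG via get_random_u32() instead of prandom_u32(). */
#define add_random_stack_offset_crng() do { \
        size_t offset = get_random_u32() & 0xFF; \
        char *ptr = __builtin_alloca(offset); \
        asm volatile("":"=m"(*ptr)); \
} while (0)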
Any other option I missed?
Best Regards,
Elena.
From: Reshetova, Elena
> Sent: 24 April 2019 12:43
>
> Sorry for the delay - Easter holidays + I was trying to arrange my brain around proposed options.
> Here what I think our options are with regards to the source of randomness:
>
> 1) rdtsc or variations based on it (David proposed some CRC-based variants for example)
Do I remember something about rdtsc being made less accurate in order to
make it (slightly) more difficult to use it to measure timing attacks?
If true, and it applies to the kernel (eg in a VM) then this is probably
all pointless!
David
> From: Reshetova, Elena
> > Sent: 24 April 2019 12:43
> >
> > Sorry for the delay - Easter holidays + I was trying to arrange my brain around
> proposed options.
> > Here what I think our options are with regards to the source of randomness:
> >
> > 1) rdtsc or variations based on it (David proposed some CRC-based variants for
> example)
>
> Do I remember something about rdtsc being made less accurate in order to
> make it (slightly) more difficult to use it to measure timing attacks?
Do you have any pointers on this? I did an online search, but could not find anything
concrete. The Intel manual doesn't talk about precision at all, only about protected
mode.
>
> If true, and it applies to the kernel (eg in a VM) then this is probably
> all pointless!
You mean additional constructions on top of the TSC are pointless?
Best Regards,
Elena.
> Hi,
>
> Sorry for the delay - Easter holidays + I was trying to arrange my brain around
> proposed options.
> Here what I think our options are with regards to the source of randomness:
>
> 1) rdtsc or variations based on it (David proposed some CRC-based variants for
> example)
> 2) prandom-based options
> 3) some proper crypto (chacha8 for example seems to be the lightest out of existing
> options,
> and probably enough for our purpose, but looks like kernel has only chacha20)
> 4) rdrand or other HW-based crypto
>
> Option 4 was measured to be heavy for the purpose:
> base: Simple syscall: 0.1774 microseconds
> random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> random_offset (rdrand): Simple syscall: 0.3702 microseconds
>
>
> Option 2 (even if we fork our own state(s), do it per-cpu, reseed, etc.) starts to look
> for me as the least desired.
> The existing generator's state, as people mentioned before, is trivially solvable given
> a very little amount of
> equations (syscalls in our case) you need to issue and offsets to leak.
> Even if we isolate the state/seed to just this purpose of stack randomization (and
> don't leak anything about the rest
> of the system or net prandom usage), it still probably makes the
> randomization more easily solvable than some constructs based on lower bits of
> rdtsc.
> In addition building on top of existing kernel LFSR would add more (probably not
> useful for any other purpose)
> code, a possible misconception that it can be used for "real security", etc. So, I
> would propose to abandon this idea.
>
> Option 3 we have to measure I guess, but if it is as heavy as rdrand, then this is also
> out.
Adding Eric and Herbert to continue the discussion on the chacha part.
So, as a short summary, I am trying to find a source of random bits that is fast (fast enough to be used on every syscall
invocation) and has good enough security properties.
I started to look into the kernel's chacha implementation, and while it seems to be designed to
work with any number of rounds, it does not expose a primitive with fewer than 12 rounds.
I guess this is done for security's sake, since 12 is probably the lowest bound we want people
to use for encryption/decryption, but if we are to build an efficient RNG,
chacha8 is probably a good tradeoff between security and speed.
What are people's opinions/perceptions on this? Has creating a
kernel RNG based on chacha been considered before?
Best Regards,
Elena.
On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
> Adding Eric and Herbert to continue discussion for the chacha part.
> So, as a short summary I am trying to find out a fast (fast enough to be used per syscall
> invocation) source of random bits with good enough security properties.
> I started to look into chacha kernel implementation and while it seems that it is designed to
> work with any number of rounds, it does not expose less than 12 rounds primitive.
> I guess this is done for security sake, since 12 is probably the lowest bound we want people
> to use for the purpose of encryption/decryption, but if we are to build an efficient RNG,
> chacha8 probably is a good tradeoff between security and speed.
>
> What are people's opinions/perceptions on this? Has it been considered before to create a
> kernel RNG based on chacha?
Well, sure. The get_random_bytes() kernel interface and the
getrandom(2) system call uses a CRNG based on chacha20. See
extract_crng() and crng_reseed() in drivers/char/random.c.
It *is* possible to use an arbitrary number of rounds if you use the
low level interface exposed as chacha_block(), which is an
EXPORT_SYMBOL interface so even modules can use it. "Does not expose
less than 12 rounds" applies only if you are using the high-level
crypto interface.
We have used cut down crypto algorithms for performance critical
applications before; at one point, we were using a cut down MD4(!) for
initial TCP sequence number generation. But that was getting rekeyed
every five minutes, and the goal was to make it just hard enough that
there were other easier ways of DOS attacking a server.
I'm not a cryptographer, so I'd really like us to hear from multiple
experts about the security level of, say, ChaCha8 so we understand
exactly what kind of security we'd be offering. And I'd want that interface
to be named so that it's clear it's only intended for a very specific
use case, since it will be tempting for other kernel developers to use
it in other contexts, without due consideration.
- Ted
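For illustration only, a minimal sketch of the kind of cut-down per-cpu generator being discussed might look roughly like the below. The chacha_block(state, out, nrounds) call and CHACHA_BLOCK_SIZE are assumptions taken from the description above, the struct and function names are hypothetical, and keying the state, counter handling, and initialising 'used' to CHACHA_BLOCK_SIZE at boot are all omitted:

#include <linux/percpu.h>
#include <crypto/chacha.h>

struct chacha_pool {
	u32 state[16];			/* keyed elsewhere at init time */
	u8 out[CHACHA_BLOCK_SIZE];
	unsigned int used;		/* initialised to CHACHA_BLOCK_SIZE at boot (omitted) */
};
static DEFINE_PER_CPU(struct chacha_pool, chacha_pool);

static u8 chacha_pool_get_byte(void)
{
	struct chacha_pool *p = &get_cpu_var(chacha_pool);
	u8 byte;

	if (p->used >= CHACHA_BLOCK_SIZE) {
		/* nrounds is the speed/security knob under discussion (8/12/20) */
		chacha_block(p->state, p->out, 20);
		p->used = 0;
	}
	byte = p->out[p->used];
	p->out[p->used++] = 0;		/* never hand out the same byte twice */
	put_cpu_var(chacha_pool);
	return byte;
}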
On Fri, 2019-04-26 at 12:33 +0100, Reshetova, Elena wrote:
> 1) rdtsc or variations based on it (David proposed some CRC-based variants for
> > example)
Hi,
Could we repeatedly measure the time for a short syscall on a quiet system and
estimate the entropy we get from this? In some scenarios the attacker might have
less control of the timing as well.
Since this is a statistical defense, assuming the argument can be made that
there is at least some randomness in the timer, and that it is at least sometimes
out of an attacker's control, I wonder if this feature could be valuable
even before the search for a faster, stronger random number generator completes.
Could that be a way forward for now?
Thanks,
Rick
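As a rough illustration of the kind of measurement being suggested, a quick user-space sketch (hypothetical; it times a cheap syscall with RDTSC and derives a crude min-entropy estimate from the low byte of the delta) could be:

/* build with: gcc -O2 tsc_entropy.c -lm */
#define _GNU_SOURCE
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <x86intrin.h>

int main(void)
{
	enum { N = 1 << 20 };
	static unsigned int hist[256];
	unsigned int i, max = 0;

	for (i = 0; i < N; i++) {
		unsigned long long t0 = __rdtsc();
		syscall(SYS_getpid);			/* short, cheap syscall */
		hist[(__rdtsc() - t0) & 0xff]++;	/* low byte of the TSC delta */
	}
	for (i = 0; i < 256; i++)
		if (hist[i] > max)
			max = hist[i];
	/* min-entropy of the low byte: -log2(p_max) */
	printf("min-entropy estimate: %.2f bits\n", -log2((double)max / N));
	return 0;
}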
On Fri, Apr 26, 2019 at 10:01:02AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
> > Adding Eric and Herbert to continue discussion for the chacha part.
> > So, as a short summary I am trying to find out a fast (fast enough to be used per syscall
> > invocation) source of random bits with good enough security properties.
> > I started to look into chacha kernel implementation and while it seems that it is designed to
> > work with any number of rounds, it does not expose less than 12 rounds primitive.
> > I guess this is done for security sake, since 12 is probably the lowest bound we want people
> > to use for the purpose of encryption/decryption, but if we are to build an efficient RNG,
> > chacha8 probably is a good tradeoff between security and speed.
> >
> > What are people's opinions/perceptions on this? Has it been considered before to create a
> > kernel RNG based on chacha?
>
> Well, sure. The get_random_bytes() kernel interface and the
> getrandom(2) system call uses a CRNG based on chacha20. See
> extract_crng() and crng_reseed() in drivers/char/random.c.
>
> It *is* possible to use an arbitrary number of rounds if you use the
> low level interface exposed as chacha_block(), which is an
> EXPORT_SYMBOL interface so even modules can use it. "Does not expose
> less than 12 rounds" applies only if you are using the high-level
> crypto interface.
chacha_block() actually WARNs if the round count isn't 12 or 20, because I
didn't want people to sneak in uses of other variants without discussion :-)
(Possibly I should have made chacha_block() 'static' and only exported
chacha12_block() and chacha20_block(). But the 'nrounds' parameter is
convenient for crypto/chacha_generic.c.)
>
> We have used cut down crypto algorithms for performance critical
> applications before; at one point, we were using a cut down MD4(!) for
> initial TCP sequence number generation. But that was getting rekeyed
> every five minutes, and the goal was to make it just hard enough that
> there were other easier ways of DOS attacking a server.
>
> I'm not a cryptographer, so I'd really us to hear from multiple
> experts about the security level of, say, ChaCha8 so we understand
> exactly kind of security we'd offering. And I'd want that interface
> to be named so that it's clear it's only intended for a very specific
> use case, since it will be tempting for other kernel developers to use
> it in other contexts, with undue consideration.
>
> - Ted
The best attack on ChaCha is against 7 rounds and has time complexity 2^235. So
while there's no publicly known attack on ChaCha8, its security margin is too
small for it to be recommended for typical cryptographic use. I wouldn't be
surprised to see an attack published on ChaCha8 in the not-too-distant future.
(*Probably* not a practical one, but the crypto will be technically "broken"
regardless.)
I don't think it's completely out of the question for this specific use case,
since apparently you only need random numbers that are used temporarily for
runtime memory layout. Thus the algorithm can be upgraded at any time without
spending decades deprecating it from network protocols and on-disk formats.
But if you actually need cryptographically secure random numbers, it would be
much better to start with something with a higher security margin like ChaCha20,
optimize it, and only go lower if you actually need to.
Would it be possible to call ChaCha20 through the actual crypto API so that SIMD
instructions (e.g. AVX-2) could be used? That would make it *much* faster.
Also consider AES-CTR with AES-NI instructions.
- Eric
On Fri, Apr 26, 2019 at 10:44:20AM -0700, Eric Biggers wrote:
> Would it be possibly to call ChaCha20 through the actual crypto API so that SIMD
> instructions (e.g. AVX-2) could be used? That would make it *much* faster.
> Also consider AES-CTR with AES-NI instructions.
It's not obvious SIMD instructions will be faster in practice, since
they require saving and restoring the vector/FPU registers. If you're
going to be doing a *lot* of vector processing (for example when doing
block-level RAID-5 / RAID-6 computations), it might be worth it. But
if you're only going to be turning the crank for 12 or 20 rounds, the
overhead of calling kernel_fpu_begin() and kernel_fpu_end() is
probably going to make this not worth it.
As far as using aesni (if available) is concerned, since we don't need
to worry about interoperability between two systems a la IPsec, it's
definitely something that's worth investigating.
- Ted
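For reference, the bracketing whose cost is at issue looks roughly like this; fill_random_simd() here is a purely hypothetical vectorised generator, while kernel_fpu_begin()/kernel_fpu_end() are the calls referred to above:

#include <linux/types.h>
#include <asm/fpu/api.h>

/* hypothetical vectorised generator, not a real kernel function */
static void fill_random_simd(u8 *buf, unsigned int len);

static void refill_buffer(u8 *buf, unsigned int len)
{
	kernel_fpu_begin();	/* saves FPU/vector state and disables preemption */
	fill_random_simd(buf, len);
	kernel_fpu_end();	/* restores state */
}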
> On Apr 26, 2019, at 7:01 AM, Theodore Ts'o <[email protected]> wrote:
>
>> On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
>> Adding Eric and Herbert to continue discussion for the chacha part.
>> So, as a short summary I am trying to find out a fast (fast enough to be used per syscall
>> invocation) source of random bits with good enough security properties.
>> I started to look into chacha kernel implementation and while it seems that it is designed to
>> work with any number of rounds, it does not expose less than 12 rounds primitive.
>> I guess this is done for security sake, since 12 is probably the lowest bound we want people
>> to use for the purpose of encryption/decryption, but if we are to build an efficient RNG,
>> chacha8 probably is a good tradeoff between security and speed.
>>
>> What are people's opinions/perceptions on this? Has it been considered before to create a
>> kernel RNG based on chacha?
>
> Well, sure. The get_random_bytes() kernel interface and the
> getrandom(2) system call uses a CRNG based on chacha20. See
> extract_crng() and crng_reseed() in drivers/char/random.c.
>
> It *is* possible to use an arbitrary number of rounds if you use the
> low level interface exposed as chacha_block(), which is an
> EXPORT_SYMBOL interface so even modules can use it. "Does not expose
> less than 12 rounds" applies only if you are using the high-level
> crypto interface.
>
> We have used cut down crypto algorithms for performance critical
> applications before; at one point, we were using a cut down MD4(!) for
> initial TCP sequence number generation. But that was getting rekeyed
> every five minutes, and the goal was to make it just hard enough that
> there were other easier ways of DOS attacking a server.
>
> I'm not a cryptographer, so I'd really us to hear from multiple
> experts about the security level of, say, ChaCha8 so we understand
> exactly kind of security we'd offering. And I'd want that interface
> to be named so that it's clear it's only intended for a very specific
> use case, since it will be tempting for other kernel developers to use
> it in other contexts, with undue consideration.
>
>
I don’t understand why we’re even considering weaker primitives. It seems to me that we should be using the “fast-erasure” construction for all get_random_bytes() invocations. Specifically, we should have a per cpu buffer that stores some random bytes and a count of how many random bytes there are. get_random_bytes() should take bytes from that buffer and *immediately* zero those bytes in memory. When the buffer is empty, it gets refilled with the full strength CRNG.
The obvious objection is “oh no, a side channel could leak the buffer,” to which I say so what? A side channel could just as easily leak the entire CRNG state.
For Elena’s specific use case, we would probably want a try_get_random_bytes_notrace() that *only* tries the percpu buffer, since this code runs so early in the syscall path that we can’t run real C code. Or it could be moved a bit later, I suppose — the really early part is not really an interesting attack surface.
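A rough sketch of the helper described here; the buffer layout and names are hypothetical, and the point is only that it consumes bytes already sitting in the per-cpu buffer and never falls back to the full CRNG, so it can run very early in the entry path with preemption/IRQs off:

#include <linux/percpu.h>
#include <linux/string.h>
#include <linux/types.h>

struct fe_buffer {			/* hypothetical fast-erasure buffer */
	u8 bytes[4096];
	unsigned int pos;		/* next unused byte */
};
static DEFINE_PER_CPU(struct fe_buffer, fe_buffer);

static bool try_get_random_bytes_notrace(u8 *out, unsigned int len)
{
	/* assumes it is called with preemption/IRQs already off */
	struct fe_buffer *buf = raw_cpu_ptr(&fe_buffer);

	if (buf->pos + len > sizeof(buf->bytes))
		return false;			/* not enough bytes left: caller skips randomisation */
	memcpy(out, buf->bytes + buf->pos, len);
	memset(buf->bytes + buf->pos, 0, len);	/* erase immediately after use */
	buf->pos += len;
	return true;
}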
> On Apr 26, 2019, at 11:02 AM, Theodore Ts'o <[email protected]> wrote:
>
>> On Fri, Apr 26, 2019 at 10:44:20AM -0700, Eric Biggers wrote:
>> Would it be possibly to call ChaCha20 through the actual crypto API so that SIMD
>> instructions (e.g. AVX-2) could be used? That would make it *much* faster.
>> Also consider AES-CTR with AES-NI instructions.
>
> It's not obvious SIMD instructions will be faster in practice, since
> it requires saving and restoring the vector/FPU registers. If you're
> going to be doing a *lot* of vector processing (for example when doing
> block-level RAID-5 / RAID-6 computations), it might be worth it. But
> if you're only going to be turning the crank for 12 or 20 rounds, the
> overhead of calling kernel_fpu_begin() and kernel_fpu_end() is
> probably going to make this worth it.
>
So generate a whole page or more of random bytes at a time and save them up for when they’re needed.
> > On Apr 26, 2019, at 7:01 AM, Theodore Ts'o <[email protected]> wrote:
> >
> >> On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
> >> Adding Eric and Herbert to continue discussion for the chacha part.
> >> So, as a short summary I am trying to find out a fast (fast enough to be used per
> syscall
> >> invocation) source of random bits with good enough security properties.
> >> I started to look into chacha kernel implementation and while it seems that it is
> designed to
> >> work with any number of rounds, it does not expose less than 12 rounds primitive.
> >> I guess this is done for security sake, since 12 is probably the lowest bound we
> want people
> >> to use for the purpose of encryption/decryption, but if we are to build an efficient
> RNG,
> >> chacha8 probably is a good tradeoff between security and speed.
> >>
> >> What are people's opinions/perceptions on this? Has it been considered before to
> create a
> >> kernel RNG based on chacha?
> >
> > Well, sure. The get_random_bytes() kernel interface and the
> > getrandom(2) system call uses a CRNG based on chacha20. See
> > extract_crng() and crng_reseed() in drivers/char/random.c.
> >
> > It *is* possible to use an arbitrary number of rounds if you use the
> > low level interface exposed as chacha_block(), which is an
> > EXPORT_SYMBOL interface so even modules can use it. "Does not expose
> > less than 12 rounds" applies only if you are using the high-level
> > crypto interface.
> >
> > We have used cut down crypto algorithms for performance critical
> > applications before; at one point, we were using a cut down MD4(!) for
> > initial TCP sequence number generation. But that was getting rekeyed
> > every five minutes, and the goal was to make it just hard enough that
> > there were other easier ways of DOS attacking a server.
> >
> > I'm not a cryptographer, so I'd really us to hear from multiple
> > experts about the security level of, say, ChaCha8 so we understand
> > exactly kind of security we'd offering. And I'd want that interface
> > to be named so that it's clear it's only intended for a very specific
> > use case, since it will be tempting for other kernel developers to use
> > it in other contexts, with undue consideration.
> >
> >
>
> I don’t understand why we’re even considering weaker primitives.
I guess one reason here was that cryptographic primitives are expensive performance-wise
and we don't really have a full crypto use case here with all the standard requirements
for a CRNG, such as protection against reconstructing earlier outputs, etc. So, it was a natural wish to try to find something
cheaper that does the job, but if we can make the performance reasonable, I am all for the
proper primitives.
>It seems to me
> that we should be using the “fast-erasure” construction for all get_random_bytes()
> invocations. Specifically, we should have a per cpu buffer that stores some random
> bytes and a count of how many random bytes there are. get_random_bytes() should
> take bytes from that buffer and *immediately* zero those bytes in memory. When
> the buffer is empty, it gets refilled with the full strength CRNG.
Ideally it would be great to call something fast and secure on each syscall without a per-cpu
buffer; that's why I was asking about chacha8. As Eric pointed out, it should not be used for
cryptographic purposes, but I think it is reasonably secure for our purpose, especially if
the generator is sometimes reseeded with fresh entropy.
However, it may very well be too slow anyway.
So, I think we can then do the per-cpu approach you are suggesting:
have a per-cpu buffer big enough (a couple of pages, as you suggested) from which
we regularly read 8 bits at a time and zero them as we go.
I am just not sure about the right refill strategy in this case.
Should we try to keep this per-cpu buffer always stocked with some random bytes by
queuing work that would refill it (or part of it, e.g. one page out of a set of 4 pages)
regularly from the CRNG source?
I guess how often we need to refill will depend heavily on the syscall rate
on that cpu, so it might be hard to find a reasonable period.
In any case we need to be prepared to do a refill straight from a syscall,
if despite our best efforts to keep the buffer refilled we run out of bits.
Is it ok to take a visible performance hit at that point? In the worst case we will need to
generate n pages' worth of random numbers, which is going to take a
while.
I will try doing this PoC and measure the implications (without the worker
refill to start with). Let's see how bad it looks performance-wise.
Best Regards,
Elena.
> The obvious objection is “oh no, a side channel could leak the buffer,” to which I say
> so what? A side channel could just as easily leak the entire CRNG state.
>
> For Elena’s specific use case, we would probably want a
> try_get_random_bytes_notrace() that *only* tries the percpu buffer, since this code
> runs so early in the syscall path that we can’t run real C code. Or it could be moved a
> bit later, I suppose — the really early part is not really an interesting attack surface.
> On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
> > Adding Eric and Herbert to continue discussion for the chacha part.
> > So, as a short summary I am trying to find out a fast (fast enough to be used per
> syscall
> > invocation) source of random bits with good enough security properties.
> > I started to look into chacha kernel implementation and while it seems that it is
> designed to
> > work with any number of rounds, it does not expose less than 12 rounds primitive.
> > I guess this is done for security sake, since 12 is probably the lowest bound we
> want people
> > to use for the purpose of encryption/decryption, but if we are to build an efficient
> RNG,
> > chacha8 probably is a good tradeoff between security and speed.
> >
> > What are people's opinions/perceptions on this? Has it been considered before to
> create a
> > kernel RNG based on chacha?
>
> Well, sure. The get_random_bytes() kernel interface and the
> getrandom(2) system call uses a CRNG based on chacha20. See
> extract_crng() and crng_reseed() in drivers/char/random.c.
Oh, indeed, I completely missed this link when I was trying to trace chacha
usages in the kernel. I am not familiar with the kernel crypto API, and it looks like
my source code cross-referencing failed miserably here.
The only question left is how fast/slow this is...
Best Regards,
Elena.
> On Fri, Apr 26, 2019 at 10:01:02AM -0400, Theodore Ts'o wrote:
> > On Fri, Apr 26, 2019 at 11:33:09AM +0000, Reshetova, Elena wrote:
> > > Adding Eric and Herbert to continue discussion for the chacha part.
> > > So, as a short summary I am trying to find out a fast (fast enough to be used per
> syscall
> > > invocation) source of random bits with good enough security properties.
> > > I started to look into chacha kernel implementation and while it seems that it is
> designed to
> > > work with any number of rounds, it does not expose less than 12 rounds
> primitive.
> > > I guess this is done for security sake, since 12 is probably the lowest bound we
> want people
> > > to use for the purpose of encryption/decryption, but if we are to build an
> efficient RNG,
> > > chacha8 probably is a good tradeoff between security and speed.
> > >
> > > What are people's opinions/perceptions on this? Has it been considered before
> to create a
> > > kernel RNG based on chacha?
> >
> > Well, sure. The get_random_bytes() kernel interface and the
> > getrandom(2) system call uses a CRNG based on chacha20. See
> > extract_crng() and crng_reseed() in drivers/char/random.c.
> >
> > It *is* possible to use an arbitrary number of rounds if you use the
> > low level interface exposed as chacha_block(), which is an
> > EXPORT_SYMBOL interface so even modules can use it. "Does not expose
> > less than 12 rounds" applies only if you are using the high-level
> > crypto interface.
>
> chacha_block() actually WARNs if the round count isn't 12 or 20, because I
> didn't want people to sneak in uses of other variants without discussion :-)
>
> (Possibly I should have made chacha_block() 'static' and only exported
> chacha12_block() and chacha20_block(). But the 'nrounds' parameter is
> convenient for crypto/chacha_generic.c.)
>
> >
> > We have used cut down crypto algorithms for performance critical
> > applications before; at one point, we were using a cut down MD4(!) for
> > initial TCP sequence number generation. But that was getting rekeyed
> > every five minutes, and the goal was to make it just hard enough that
> > there were other easier ways of DOS attacking a server.
> >
> > I'm not a cryptographer, so I'd really us to hear from multiple
> > experts about the security level of, say, ChaCha8 so we understand
> > exactly kind of security we'd offering. And I'd want that interface
> > to be named so that it's clear it's only intended for a very specific
> > use case, since it will be tempting for other kernel developers to use
> > it in other contexts, with undue consideration.
> >
> > - Ted
>
> The best attack on ChaCha is against 7 rounds and has time complexity 2^235. So
> while there's no publicly known attack on ChaCha8, its security margin is too
> small for it to be recommended for typical cryptographic use. I wouldn't be
> suprised to see an attack published on ChaCha8 in the not-too-distant future.
> (*Probably* not a practical one, but the crypto will be technically "broken"
> regardless.)
Yes, this is also my understanding of chacha's official
security strength. But our use case and requirements are slightly different
and can be upgraded in the future if needed. Still, let's indeed first try the per-cpu
buffer solution that Andy is proposing and see whether it is satisfactory
performance-wise with chacha20, which will probably stay secure for much longer unless the whole
construction is broken outright.
>
> I don't think it's completely out of the question for this specific use case,
> since apparently you only need random numbers that are used temporarily for
> runtime memory layout. Thus the algorithm can be upgraded at any time without
> spending decades deprecating it from network protocols and on-disk formats.
>
> But if you actually need cryptographically secure random numbers, it would be
> much better to start with something with a higher security margin like ChaCha20,
> optimizing it, and only going lower if you actually need to.
Yes, agree, so this is what I am going to try then.
> Would it be possibly to call ChaCha20 through the actual crypto API so that SIMD
> instructions (e.g. AVX-2) could be used? That would make it *much* faster.
I can try measuring both ways, given that we ask for enough random bits at a time, as Andy suggested -
a couple of pages or so, if it helps with the overhead. Also, I hope none of these specific
instructions (including AES-NI) can block, as was pointed out to me for RDRAND; otherwise
I guess we have a problem...
> Also consider AES-CTR with AES-NI instructions.
Yes, I guess based on these numbers it goes hand in hand with chacha8 (depending on the CPU):
https://bench.cr.yp.to/results-stream.html
We also need to compare the case when no special instructions are available, I guess, when choosing a primitive
here...
Best Regards,
Elena.
> On Apr 29, 2019, at 12:46 AM, Reshetova, Elena <[email protected]> wrote:
>
>
>>>> On Apr 26, 2019, at 7:01 AM, Theodore Ts'o <[email protected]> wrote:
>>>
>
>> It seems to me
>> that we should be using the “fast-erasure” construction for all get_random_bytes()
>> invocations. Specifically, we should have a per cpu buffer that stores some random
>> bytes and a count of how many random bytes there are. get_random_bytes() should
>> take bytes from that buffer and *immediately* zero those bytes in memory. When
>> the buffer is empty, it gets refilled with the full strength CRNG.
>
> Ideally it would be great to call smth fast and secure on each syscall without a per-cpu
> buffer,
Why? You only need a few bits, and any sensible crypto primitive is going to be much better at producing lots of bits than producing just a few bits. Even ignoring that, avoiding the I-cache hit on every syscall has value. And I still don’t see what’s wrong with a percpu buffer.
> so that's why I was asking on chacha8. As Eric pointed it should not be used for
> cryptographic purpose, but I think it is reasonably secure for our purpose, especially if
> the generator is sometimes reseeded with fresh entropy.
> However, it very well might be that is too slow anyway.
>
> So, I think then we can do the per-cpu approach as you suggesting.
> Have a per-cpu buffer big enough as you suggested (couple of pages) from where
> we regularly read 8 bits at the time and zero them as we go.
>
> I am just not sure on the right refill strategy in this case?
> Should we try to maintain this per-cpu buffer always with some random bytes by
> having a work queued that would refill it (or part of it, i.e. a page from a set of 4 pages)
> regularly from CRNG source?
> I guess how often we need to refill will depend so much on the syscall rate
> on that cpu, so it might be hard to find a reasonable period.
> In any case we need to prepare to do a refill straight from a syscall,
> if despite our best efforts to keep the buffer refilled we run out of bits.
> Is it ok to get a visible performance hit at this point? In worse case we will need to
> generate n pages worth of random numbers, which is going to take a
> while.
I think a small hit every few syscalls is okay. The real time people will turn this off regardless.
>
> I will try doing this PoC and measure implications (without the worker
> refill to start with). Let's see how bad (performance wise it looks).
>
> Best Regards,
> Elena.
>
>
>
>
>
>> The obvious objection is “oh no, a side channel could leak the buffer,” to which I say
>> so what? A side channel could just as easily leak the entire CRNG state.
>>
>> For Elena’s specific use case, we would probably want a
>> try_get_random_bytes_notrace() that *only* tries the percpu buffer, since this code
>> runs so early in the syscall path that we can’t run real C code. Or it could be moved a
>> bit later, I suppose — the really early part is not really an interesting attack surface.
>
> > On Apr 29, 2019, at 12:46 AM, Reshetova, Elena <[email protected]>
> wrote:
> >
> >
> >>>> On Apr 26, 2019, at 7:01 AM, Theodore Ts'o <[email protected]> wrote:
> >>>
> >
> >> It seems to me
> >> that we should be using the “fast-erasure” construction for all
> get_random_bytes()
> >> invocations. Specifically, we should have a per cpu buffer that stores some
> random
> >> bytes and a count of how many random bytes there are. get_random_bytes()
> should
> >> take bytes from that buffer and *immediately* zero those bytes in memory.
> When
> >> the buffer is empty, it gets refilled with the full strength CRNG.
> >
> > Ideally it would be great to call smth fast and secure on each syscall without a per-
> cpu
> > buffer,
>
> Why? You only need a few bits, and any sensible crypto primitive is going to be
> much better at producing lots of bits than producing just a few bits. Even ignoring
> that, avoiding the I-cache hit on every syscall has value. And I still don’t see what’s
> wrong with a percpu buffer.
I guess this is true, so I did a quick implementation now to estimate the
performance hit.
Here are the preliminary numbers (proper ones will take a bit more time):
base: Simple syscall: 0.1761 microseconds
get_random_bytes (4096 bytes per-cpu buffer): 0.1793 microseconds
get_random_bytes (64 bytes per-cpu buffer): 0.1866 microseconds
It does not make sense to go below 64 bytes, since this seems to be the
ChaCha20 block size, so if we went lower we would waste useful bits.
You could go even higher than 4096 bytes, but even this looks like
OK-ish performance to me.
Below is a snippet of what I quickly did (relevant parts) to get these numbers.
I do the initial population of the per-cpu buffers in a late_initcall, but
practice shows that the rng might not always be in a good state by then.
So we might not have really good randomness at that point, but I am not sure
this is a practical problem, since it only applies to early boot, and by
the time the system has booted it has already issued enough syscalls that the buffer gets refilled
with really good numbers.
Alternatively we could do it on the first syscall that each cpu gets, but I
am not sure that is always guaranteed to have good randomness either.
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#include <linux/random.h>
+
+void *__builtin_alloca(size_t size);
+
+#define add_random_stack_offset() do { \
+ size_t offset = random_get_byte(); \
+ char *ptr = __builtin_alloca(offset); \
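+	/* dummy "=m" output below keeps the compiler from optimizing the alloca away */ \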
+ asm volatile("":"=m"(*ptr)); \
+} while (0)
+#else
+#define add_random_stack_offset() do {} while (0)
+#endif
...
diff --git a/include/linux/random.h b/include/linux/random.h
index 445a0ea4ff49..9fbce9d6ee70 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -115,6 +115,15 @@ struct rnd_state {
__u32 s1, s2, s3, s4;
};
+#define RANDOM_BUFFER_SIZE 64
+/* structure to hold random bits */
+struct rnd_buffer {
+ unsigned char buffer[RANDOM_BUFFER_SIZE];
+ __u64 byte_counter;
+};
+unsigned char random_get_byte(void);
+
+
....
diff --git a/lib/percpu-random.c b/lib/percpu-random.c
new file mode 100644
index 000000000000..3f92c44fbc1a
--- /dev/null
+++ b/lib/percpu-random.c
@@ -0,0 +1,49 @@
+#include <linux/types.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+
+static DEFINE_PER_CPU(struct rnd_buffer, stack_rand_offset) __latent_entropy;
+
+
+/*
+ * Initial population of the per-cpu random byte buffers; this may be
+ * weak if the rng is not yet fully initialized at late_initcall time.
+ */
+static int __init stack_rand_offset_init(void)
+{
+ int i;
+
+ /* extract bits into our per-cpu rand buffers */
+ for_each_possible_cpu(i) {
+ struct rnd_buffer *buffer = &per_cpu(stack_rand_offset, i);
+ buffer->byte_counter = 0;
+ /* if rng is not initialized, this won't extract us good stuff
+ * but we cannot wait for rng to initialize either */
+ get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
+
+ }
+
+ return 0;
+}
+late_initcall(stack_rand_offset_init);
+
+unsigned char random_get_byte(void)
+{
+ struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
+ unsigned char res;
+
+ if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
+ get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
+ buffer->byte_counter = 0;
+ }
+
+ res = buffer->buffer[buffer->byte_counter];
+ buffer->buffer[buffer->byte_counter] = 0;
+ buffer->byte_counter ++;
+ put_cpu_var(stack_rand_offset);
+ return res;
+}
+EXPORT_SYMBOL(random_get_byte);
On Tue, Apr 30, 2019 at 10:51 AM Reshetova, Elena
<[email protected]> wrote:
> base: Simple syscall: 0.1761 microseconds
> get_random_bytes (4096 bytes per-cpu buffer): 0.1793 microsecons
> get_random_bytes (64 bytes per-cpu buffer): 0.1866 microsecons
The 4096 size seems pretty good.
> Below is a snip of what I quickly did (relevant parts) to get these numbers.
> I do initial population of per-cpu buffers in late_initcall, but
> practice shows that rng might not always be in good state by then.
> So, we might not have really good randomness then, but I am not sure
> if this is a practical problem since it only applies to system boot and by
> the time it booted, it already issued enough syscalls that buffer gets refilled
> with really good numbers.
> Alternatively we can also do it on the first syscall that each cpu gets, but I
> am not sure if that is always guaranteed to have a good randomness.
Populating at first syscall seems like a reasonable way to delay. And
I agree: I think we should not be too concerned about early RNG state:
we should design for the "after boot" behaviors.
> diff --git a/lib/percpu-random.c b/lib/percpu-random.c
> new file mode 100644
> index 000000000000..3f92c44fbc1a
> --- /dev/null
> +++ b/lib/percpu-random.c
> @@ -0,0 +1,49 @@
> +#include <linux/types.h>
> +#include <linux/percpu.h>
> +#include <linux/random.h>
> +
> +static DEFINE_PER_CPU(struct rnd_buffer, stack_rand_offset) __latent_entropy;
> +
> +
> +/*
> + * Generate some initially weak seeding values to allow
> + * to start the prandom_u32() engine.
> + */
> +static int __init stack_rand_offset_init(void)
> +{
> + int i;
> +
> + /* exctract bits to out per-cpu rand buffers */
> + for_each_possible_cpu(i) {
> + struct rnd_buffer *buffer = &per_cpu(stack_rand_offset, i);
> + buffer->byte_counter = 0;
> + /* if rng is not initialized, this won't extract us good stuff
> + * but we cannot wait for rng to initialize either */
> + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
Instead of doing get_random_bytes() here, just set byte_counter =
RANDOM_BUFFER_SIZE and let random_get_byte() do the work on a per-cpu
basis?
> +
> + }
> +
> + return 0;
> +}
> +late_initcall(stack_rand_offset_init);
> +
> +unsigned char random_get_byte(void)
> +{
> + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> + unsigned char res;
> +
> + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> + buffer->byte_counter = 0;
> + }
> +
> + res = buffer->buffer[buffer->byte_counter];
> + buffer->buffer[buffer->byte_counter] = 0;
> + buffer->byte_counter ++;
> + put_cpu_var(stack_rand_offset);
> + return res;
> +}
> +EXPORT_SYMBOL(random_get_byte);
Otherwise, sure, looks good. I remain worried about info leaks of the
percpu area causing pain down the road, but if we find a safer way to do
this, we can do it later.
--
Kees Cook
From: Reshetova, Elena
> Sent: 30 April 2019 18:51
...
> I guess this is true, so I did a quick implementation now to estimate the
> performance hits.
> Here are the preliminary numbers (proper ones will take a bit more time):
>
> base: Simple syscall: 0.1761 microseconds
> get_random_bytes (4096 bytes per-cpu buffer): 0.1793 microsecons
> get_random_bytes (64 bytes per-cpu buffer): 0.1866 microsecons
>
> It does not make sense to go less than 64 bytes since this seems to be
> Chacha20 block size, so if we go lower, we will trash useful bits.
> You can go even higher than 4096 bytes, but even this looks like
> okish performance to me.
>
> Below is a snip of what I quickly did (relevant parts) to get these numbers.
> I do initial population of per-cpu buffers in late_initcall, but
> practice shows that rng might not always be in good state by then.
> So, we might not have really good randomness then, but I am not sure
> if this is a practical problem since it only applies to system boot and by
> the time it booted, it already issued enough syscalls that buffer gets refilled
> with really good numbers.
> Alternatively we can also do it on the first syscall that each cpu gets, but I
> am not sure if that is always guaranteed to have a good randomness.
...
> +unsigned char random_get_byte(void)
> +{
> + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> + unsigned char res;
> +
> + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> + buffer->byte_counter = 0;
> + }
> +
> + res = buffer->buffer[buffer->byte_counter];
> + buffer->buffer[buffer->byte_counter] = 0;
> + buffer->byte_counter ++;
> + put_cpu_var(stack_rand_offset);
> + return res;
> +}
> +EXPORT_SYMBOL(random_get_byte);
You'll almost certainly get better code if you copy buffer->byte_counter
to a local 'unsigned long' variable.
David
From: Reshetova, Elena
> Sent: 30 April 2019 18:51
...
> +unsigned char random_get_byte(void)
> +{
> + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> + unsigned char res;
> +
> + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> + buffer->byte_counter = 0;
> + }
> +
> + res = buffer->buffer[buffer->byte_counter];
> + buffer->buffer[buffer->byte_counter] = 0;
Is it really worth dirtying a cache line to zero data we've used?
The unused bytes that follow are much more interesting.
Actually, if you moved 'byte_counter' into a completely different
area of memory (in data that is changed more often, to avoid
dirtying an extra cache line) then not zeroing the used data
would make it harder to determine which byte will be used next.
I'm also guessing that get_cpu_var() disables pre-emption?
This code could probably run 'fast and loose' and just ignore
the fact that pre-emption would have odd effects.
All it would do is perturb the randomness!
David
> + buffer->byte_counter ++;
> + put_cpu_var(stack_rand_offset);
> + return res;
> +}
> +EXPORT_SYMBOL(random_get_byte);
On Wed, May 1, 2019 at 1:42 AM David Laight <[email protected]> wrote:
>
> From: Reshetova, Elena
> > Sent: 30 April 2019 18:51
> ...
> > +unsigned char random_get_byte(void)
> > +{
> > + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> > + unsigned char res;
> > +
> > + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> > + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> > + buffer->byte_counter = 0;
> > + }
> > +
> > + res = buffer->buffer[buffer->byte_counter];
> > + buffer->buffer[buffer->byte_counter] = 0;
>
> If is really worth dirtying a cache line to zero data we've used?
> The unused bytes following are much more interesting.
>
For this particular use case, zeroing is probably worthless. But, for
the general case of get_random_bytes(), we need to zero, and I would
argue that get_random_bytes() should be doing exactly this in general.
> From: Reshetova, Elena
> > Sent: 30 April 2019 18:51
> ...
> > I guess this is true, so I did a quick implementation now to estimate the
> > performance hits.
> > Here are the preliminary numbers (proper ones will take a bit more time):
> >
> > base: Simple syscall: 0.1761 microseconds
> > get_random_bytes (4096 bytes per-cpu buffer): 0.1793 microsecons
> > get_random_bytes (64 bytes per-cpu buffer): 0.1866 microsecons
> >
> > It does not make sense to go less than 64 bytes since this seems to be
> > Chacha20 block size, so if we go lower, we will trash useful bits.
> > You can go even higher than 4096 bytes, but even this looks like
> > okish performance to me.
> >
> > Below is a snip of what I quickly did (relevant parts) to get these numbers.
> > I do initial population of per-cpu buffers in late_initcall, but
> > practice shows that rng might not always be in good state by then.
> > So, we might not have really good randomness then, but I am not sure
> > if this is a practical problem since it only applies to system boot and by
> > the time it booted, it already issued enough syscalls that buffer gets refilled
> > with really good numbers.
> > Alternatively we can also do it on the first syscall that each cpu gets, but I
> > am not sure if that is always guaranteed to have a good randomness.
> ...
> > +unsigned char random_get_byte(void)
> > +{
> > + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> > + unsigned char res;
> > +
> > + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> > + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> > + buffer->byte_counter = 0;
> > + }
> > +
> > + res = buffer->buffer[buffer->byte_counter];
> > + buffer->buffer[buffer->byte_counter] = 0;
> > + buffer->byte_counter ++;
> > + put_cpu_var(stack_rand_offset);
> > + return res;
> > +}
> > +EXPORT_SYMBOL(random_get_byte);
>
> You'll almost certainly get better code if you copy buffer->byte_counter
> to a local 'unsigned long' variable.
>
Noted, will fix, thank you for the suggestion!
I haven't yet worked on this code for a "proper version", so I think many
things would need polishing for both speed and code size.
Best Regards,
Elena.
From: Reshetova, Elena
> > Sent: 30 April 2019 18:51
> ...
> > +unsigned char random_get_byte(void)
> > +{
> > + struct rnd_buffer *buffer = &get_cpu_var(stack_rand_offset);
> > + unsigned char res;
> > +
> > + if (buffer->byte_counter >= RANDOM_BUFFER_SIZE) {
> > + get_random_bytes(&(buffer->buffer), sizeof(buffer->buffer));
> > + buffer->byte_counter = 0;
> > + }
> > +
> > + res = buffer->buffer[buffer->byte_counter];
> > + buffer->buffer[buffer->byte_counter] = 0;
>
> If is really worth dirtying a cache line to zero data we've used?
> The unused bytes following are much more interesting.
>
> Actually if you got 'byte_counter' into a completely different
> area of memory (in data that is changed more often to avoid
> dirtying an extra cache line) then not zeroing the used data
> would make it harder to determine which byte will be used next.
Interesting idea, but what would this area be?
I am not that familiar with different data usage patterns.
>
> I'm also guessing that get_cpu_var() disables pre-emption?
Yes, in my understanding:
#define get_cpu_var(var) \
(*({ \
preempt_disable(); \
this_cpu_ptr(&var); \
}))
> This code could probably run 'fast and loose' and just ignore
> the fact that pre-emption would have odd effects.
> All it would do is perturb the randomness!
Hm.. I see your point, but I am wondering what the odd effects might
be, i.e. could we end up using the same random bits twice for two or more
different syscalls, and could attackers try to trigger this situation?
Best Regards,
Elena.
From: Reshetova, Elena
> Sent: 02 May 2019 09:16
...
> > I'm also guessing that get_cpu_var() disables pre-emption?
>
> Yes, in my understanding:
>
> #define get_cpu_var(var) \
> (*({ \
> preempt_disable(); \
> this_cpu_ptr(&var); \
> }))
>
> > This code could probably run 'fast and loose' and just ignore
> > the fact that pre-emption would have odd effects.
> > All it would do is perturb the randomness!
>
> Hm.. I see your point, but I am wondering what the odd effects might
> be.. i.e. can we end up using the same random bits twice for two or more
> different syscalls and attackers can try to trigger this situation?
To trigger it you'd need to arrange for an interrupt in the right
timing window to cause another process to run.
There are almost certainly easier ways to break things.
I think the main effects would be the increment writing to a different
cpu's local data (causing the same data to be used again and/or skipped)
and the potential for updating the random buffer on the 'wrong' cpu.
So something like:
/* We don't really care if the update is written to the 'wrong'
 * cpu or if the value comes from the wrong buffer. */
offset = *this_cpu_ptr(&cpu_syscall_rand_offset);
*this_cpu_ptr(&cpu_syscall_rand_offset) = offset + 1;
if ((offset &= 4095)) return this_cpu_ptr(&cpu_syscall_rand_buffer)[offset];
buffer = get_cpu_var(cpu_syscall_rand_buffer);
get_random_bytes(buffer, 4096);
val = buffer[0];
/* maybe set cpu_syscall_rand_offset to 1 */
put_cpu_var(cpu_syscall_rand_buffer);
return val;
The whole thing might even work with a global buffer!
David
On Thu, May 2, 2019 at 2:23 AM David Laight <[email protected]> wrote:
>
> From: Reshetova, Elena
> > Sent: 02 May 2019 09:16
> ...
> > > I'm also guessing that get_cpu_var() disables pre-emption?
> >
> > Yes, in my understanding:
> >
> > #define get_cpu_var(var) \
> > (*({ \
> > preempt_disable(); \
> > this_cpu_ptr(&var); \
> > }))
> >
> > > This code could probably run 'fast and loose' and just ignore
> > > the fact that pre-emption would have odd effects.
> > > All it would do is perturb the randomness!
> >
> > Hm.. I see your point, but I am wondering what the odd effects might
> > be.. i.e. can we end up using the same random bits twice for two or more
> > different syscalls and attackers can try to trigger this situation?
>
> To trigger it you'd need to arrange for an interrupt in the right
> timing window to cause another process to run.
> There are almost certainly easier ways to break things.
>
> I think the main effects would be the increment writing to a different
> cpu local data (causing the same data to be used again and/or skipped)
> and the potential for updating the random buffer on the 'wrong cpu'.
>
> So something like:
> /* We don't really care if the update is written to the 'wrong'
> * cpu or if the vale comes from the wrong buffer. */
> offset = *this_cpu_ptr(&cpu_syscall_rand_offset);
> *this_cpu_ptr(&cpu_syscall_rand_offset) = offset + 1;
>
> if ((offset &= 4095)) return this_cpu_ptr(&cpu_syscall_rand_buffer)[offset];
>
> buffer = get_cpu_var((&cpu_syscall_rand_buffer);
> get_random_bytes();
> val = buffer[0];
> /* maybe set cpu_syscall_rand_offset to 1 */
> put_cpu_var();
> return val;
>
> The whole thing might even work with a global buffer!
>
I don't see how this makes sense in the context of the actual entry
code. The code looks like this right now:
enter_from_user_mode();
<--- IRQs off here
local_irq_enable();
Presumably this could become:
enter_from_user_mode();
if (the percpu buffer has enough bytes) {
use them;
local_irq_enable();
} else {
local_irq_enable();
get more bytes;
if (get_cpu() == the old cpu) {
refill the buffer;
} else {
feel rather silly;
}
put_cpu();
}
everything after the enter_from_user_mode() could get renamed
get_randstack_offset_and_irq_enable().
Or we decide that calling get_random_bytes() is okay with IRQs off and
this all gets a bit simpler.
--Andy
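A hedged sketch of what that helper might look like, reusing the rnd_buffer/stack_rand_offset/random_get_byte() names from Elena's PoC above; the structure is hypothetical and only meant to mirror the pseudocode:

static __always_inline unsigned long get_randstack_offset_and_irq_enable(void)
{
	struct rnd_buffer *buf = raw_cpu_ptr(&stack_rand_offset);
	unsigned long offset;

	if (likely(buf->byte_counter < RANDOM_BUFFER_SIZE)) {
		/* fast path: consume a byte with IRQs still off, then enable */
		offset = buf->buffer[buf->byte_counter];
		buf->buffer[buf->byte_counter++] = 0;
		local_irq_enable();
	} else {
		/* slow path: enable IRQs first, then refill via the normal helper */
		local_irq_enable();
		offset = random_get_byte();
	}
	return offset;
}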
* Andy Lutomirski <[email protected]> wrote:
> Or we decide that calling get_random_bytes() is okay with IRQs off and
> this all gets a bit simpler.
BTW., before we go down this path any further, is the plan to bind this
feature to a real CPU-RNG capability, i.e. to the RDRAND instruction,
which excludes a significant group of x86 CPUs?
Because calling tens of millions of system calls per second will deplete
any non-CPU-RNG sources of entropy and will also starve all other users
of random numbers, which might have a more legitimate need for
randomness, such as the networking stack ...
I.e. I'm really *super sceptical* of this whole plan, as currently
formulated.
If we bind it to RDRAND then we shouldn't be using the generic
drivers/char/random.c pool *at all*, but just call the darn instruction
directly. This is an x86 patch-set after all, right?
Furthermore the following post suggests that RDRAND isn't a per CPU
capability, but a core or socket level facility, depending on CPU make:
https://stackoverflow.com/questions/10484164/what-is-the-latency-and-throughput-of-the-rdrand-instruction-on-ivy-bridge
8 gigabits/sec sounds like good throughput in principle, if there are no
scalability pathologies with that.
It would also be nice to know whether RDRAND does buffering *internally*,
in which case it might be better to buffer as little at the system call
level as possible, to allow the hardware RNG buffer to rebuild between
system calls.
I.e. I'd suggest retrieving randomness via a fixed number of RDRAND r64
calls (where '1' is a perfectly valid block size - it should be
measured), and those random bits are then used as-is for the ~6 bits of
system call stack offset. (I'd even suggest 7 bits: that skips a full
cache line almost for free and makes the fuzz actually meaningful: no
spear attacker will take a 1/128, 0.8% chance to successfully attack a
critical system.)
Then those 64*N random bits get buffered and consumed in 5-7 bit chunks,
in a super efficient fashion, possibly inlining the fast path, totally
outside the flow of drivers/char/random.c.
Any non-CPU source of randomness for system calls and plans to add
several extra function calls to every x86 system call is crazy talk I
believe...
Thanks,
Ingo
On Thu, May 2, 2019 at 8:09 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > Or we decide that calling get_random_bytes() is okay with IRQs off and
> > this all gets a bit simpler.
>
> BTW., before we go down this path any further, is the plan to bind this
> feature to a real CPU-RNG capability, i.e. to the RDRAND instruction,
> which excludes a significant group of x86 of CPUs?
It's kind of the opposite. Elena benchmarked it, and RDRAND's
performance was truly awful here.
>
> Because calling tens of millions of system calls per second will deplete
> any non-CPU-RNG sources of entropy and will also starve all other users
> of random numbers, which might have a more legitimate need for
> randomness, such as the networking stack ...
There's no such thing as "starving" other users in this context. The
current core RNG code uses a cryptographic RNG that has no limits on
the number of bytes extracted. If you want the entropy-accounted
stuff, you can use /dev/random, which is separate.
> 8 gigabits/sec sounds good throughput in principle, if there's no
> scalability pathologies with that.
The latency is horrible.
>
> It would also be nice to know whether RDRAND does buffering *internally*,
Not in a useful way :(
> Any non-CPU source of randomness for system calls and plans to add
> several extra function calls to every x86 system call is crazy talk I
> believe...
I think that, in practice, the only real downside to enabling this
thing will be the jitter in syscall times. Although we could decide
that the benefit is a bit dubious and the whole thing isn't worth it.
But it will definitely be optional.
From: Ingo Molnar
> Sent: 02 May 2019 16:09
> * Andy Lutomirski <[email protected]> wrote:
>
> > Or we decide that calling get_random_bytes() is okay with IRQs off and
> > this all gets a bit simpler.
>
> BTW., before we go down this path any further, is the plan to bind this
> feature to a real CPU-RNG capability, i.e. to the RDRAND instruction,
> which excludes a significant group of x86 of CPUs?
It has already been measured - it is far too slow.
Even using just 6 bits at a time, so that it doesn't have to be read on every system call, is
probably a significant overhead (I don't think that was tested though).
I do agree that using 'real' randomness is probably OTT here.
> Because calling tens of millions of system calls per second will deplete
> any non-CPU-RNG sources of entropy and will also starve all other users
> of random numbers, which might have a more legitimate need for
> randomness, such as the networking stack ...
If the function you use to generate random numbers from the 'entropy
pool' isn't reversible (in a finite time) I don't think you really need
to worry about bits-in v bits-out.
David
* Andy Lutomirski <[email protected]> wrote:
> > 8 gigabits/sec sounds good throughput in principle, if there's no
> > scalability pathologies with that.
>
> The latency is horrible.
Latency would be amortized via batching anyway, so 8 gigabits/sec
suggests something on the order of 4 bits per cycle, right?
With 64-bit extraction at a time that would be 16 cycles per 64-bit
word, which isn't too bad, is it?
But you are right that get_random_bytes() is probably faster, and also
more generic.
> > It would also be nice to know whether RDRAND does buffering
> > *internally*,
>
> Not in a useful way :(
Too bad ...
> > Any non-CPU source of randomness for system calls and plans to add
> > several extra function calls to every x86 system call is crazy talk I
> > believe...
>
> I think that, in practice, the only real downside to enabling this
> thing will be the jitter in syscall times. Although we could decide
> that the benefit is a bit dubious and the whole thing isn't worth it.
> But it will definitely be optional.
Making it "optional" is not really a technical argument in any way
though: either distros enable it, in which case it's a de facto default
setting, or they don't, in which case it de facto almost doesn't exist.
Thanks,
Ingo
* David Laight <[email protected]> wrote:
> It has already been measured - it is far too slow.
I don't think proper buffering was tested, was it? Only a per-syscall
RDRAND overhead, which I can imagine is not too good.
> > Because calling tens of millions of system calls per second will
> > deplete any non-CPU-RNG sources of entropy and will also starve all
> > other users of random numbers, which might have a more legitimate
> > need for randomness, such as the networking stack ...
>
> If the function you use to generate random numbers from the 'entropy
> pool' isn't reversible (in a finite time) I don't think you really need
> to worry about bits-in v bits-out.
Ok.
Thanks,
Ingo
> * David Laight <[email protected]> wrote:
>
> > It has already been measured - it is far too slow.
>
> I don't think proper buffering was tested, was it? Only a per syscall
> RDRAND overhead which I can imagine being not too good.
>
Well, I have some numbers, but I am struggling to understand one
aspect there.
So, this is how it looks when PAGE_TABLE_ISOLATION is off:
base: Simple syscall: 0.0516 microseconds
rdrand (calling every 8 syscalls): Simple syscall: 0.0795 microseconds
get_random_bytes (4096 bytes buffer): Simple syscall: 0.0597 microseconds
But then it looks like this when PAGE_TABLE_ISOLATION is on:
base: Simple syscall: 0.1761 microseconds
get_random_bytes (4096 bytes buffer): Simple syscall: 0.1793 microseconds
get_random_bytes (64 bytes buffer): Simple syscall: 0.1866 microseconds
rdrand (calling every 8 syscalls): Simple syscall: 0.3131 microseconds
So, suddenly calling rdrand is much pricier...
Either something really weird is going on when PAGE_TABLE_ISOLATION is enabled, or
I managed to do something wrong (no idea what, though). I will continue
investigating.
Best Regards,
Elena.
From: Reshetova, Elena
> Sent: 03 May 2019 17:17
...
> rdrand (calling every 8 syscalls): Simple syscall: 0.0795 microseconds
You could try something like:
u64 rand_val = cpu_var->syscall_rand;
while (unlikely(rand_val == 0))
rand_val = rdrand64();
stack_offset = rand_val & 0xff;
rand_val >>= 6;
if (likely(rand_val >= 4))
cpu_var->syscall_rand = rand_val;
else
cpu_var->syscall_rand = rdrand64();
return stack_offset;
That gives you 10 system calls per rdrand instruction
and mostly takes the latency out of line.
David
> On May 2, 2019, at 9:43 AM, Ingo Molnar <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
>>> 8 gigabits/sec sounds good throughput in principle, if there's no
>>> scalability pathologies with that.
>>
>> The latency is horrible.
>
> Latency would be amortized via batching anyway, so 8 gigabits/sec
> suggests something on the order of magnitude of 4 bits per cycle, right?
> With 64 bits extraction at a time that would be 16 cycles per 64-bit
> word, which isn't too bad, is it?
I haven’t really dug in, but some Googling suggests that the 8Gbps figure is what you get with all cores doing RDRAND. It sounds like the actual RDRAND instruction doesn’t pipeline.
> Making it "optional" is not really a technical argument in any way
> though, either distros enable it in which case it's a de-facto default
> setting, or they don't, in which case it de-facto almost doesn't exist.
>
>
True.
On Fri, May 3, 2019 at 9:40 AM David Laight <[email protected]> wrote:
>
> That gives you 10 system calls per rdrand instruction
> and mostly takes the latency out of line.
Do we really want to do this? What is the attack scenario?
With no VLA's, and the stackleak plugin, what's the upside? Are we
adding random code (literally) "just because"?
Linus
>> On Fri, May 3, 2019 at 9:40 AM David Laight <[email protected]> wrote:
> >
> > That gives you 10 system calls per rdrand instruction
> > and mostly takes the latency out of line.
>
> Do we really want to do this? What is the attack scenario?
>
> With no VLA's, and the stackleak plugin, what's the upside? Are we
> adding random code (literally) "just because"?
>
> Linus
Hi Linus,
So, if your question is "do we know a thread stack-based
attack that would succeed given that all countermeasures are active
(stackleak, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y,
assume no VLAs left, etc.) ?" Then the answer is "no", I don't know
and people I have asked also (this of course by itself does not provide any
guarantees of any kind in security world).
However, the whole nature of the thread stack is its predictable and deterministic
structure that attracted attackers over decades to mount various attacks
(ab)using it. So, while stackleak and others address concrete attack paths,
such as uninitialized variables on the stack, arbitrary stack jump primitives (VLAs),
etc. We don't really have anything in place that addresses the core feature:
"simple and deterministic structure that is persistent across syscalls".
So, this feature is an attempt to be more proactive (vs. reacting to already published
attack) and add randomization in a place where it would likely make most of thread-based
attacks harder for attackers. If we can make the measure cheap enough not to affect
performance considerably and get a security benefit, why not?
Best Regards,
Elena.
> From: Reshetova, Elena
> > Sent: 03 May 2019 17:17
> ...
> > rdrand (calling every 8 syscalls): Simple syscall: 0.0795 microseconds
>
> You could try something like:
> u64 rand_val = cpu_var->syscall_rand
>
> while (unlikely(rand_val == 0))
> rand_val = rdrand64();
>
> stack_offset = rand_val & 0xff;
> rand_val >>= 6;
> if (likely(rand_val >= 4))
> cpu_var->syscall_rand = rand_val;
> else
> cpu_var->syscall_rand = rdrand64();
>
> return stack_offset;
>
> That gives you 10 system calls per rdrand instruction
> and mostly takes the latency out of line.
I am not really happy going down the rdrand path for a couple of reasons:
- it is not available on older PCs
- its performance varies across CPUs that support it (and as I understood it, varies quite a lot)
- it is x86-centric and not generic
So, if we can use the get_random_bytes() interface without tying ourselves to
a particular instruction, I think it would be better.
The numbers I have measured so far for buffer size of 4096 is SW only,
I will try to measure today what boost (if any) we can have if we use SIMD
code for it.
Best Regards,
Elena.
> * Andy Lutomirski <[email protected]> wrote:
>
> > Or we decide that calling get_random_bytes() is okay with IRQs off and
> > this all gets a bit simpler.
>
> BTW., before we go down this path any further, is the plan to bind this
> feature to a real CPU-RNG capability, i.e. to the RDRAND instruction,
> which excludes a significant group of x86 of CPUs?
I would not like to bind this only to CPUs that have RDRAND.
That's why I was looking into using the kernel's CSPRNG (we can also use it
as a backup when rdrand is not available).
> Because calling tens of millions of system calls per second will deplete
> any non-CPU-RNG sources of entropy and will also starve all other users
> of random numbers, which might have a more legitimate need for
> randomness, such as the networking stack ...
This should not apply to a proper CSPRNG. It of course also has a
limit on the number of bits it can produce safely (as any crypto
primitive does), but this period is very large, and within it our usage does
not affect any other user of the CSPRNG; otherwise all its guarantees would
be broken.
> I.e. I'm really *super sceptical* of this whole plan, as currently
> formulated.
>
> If we bind it to RDRAND then we shouldn't be using the generic
> drivers/char/random.c pool *at all*, but just call the darn instruction
> directly. This is an x86 patch-set after all, right?
Yes, but my main issues with RDRAND (even if we focus strictly on x86) are:
- it is not available on older PCs
- its performance varies across CPUs that support it (and as I understood it, varies quite a lot)
The last one can actually give unpleasant surprises...
> Furthermore the following post suggests that RDRAND isn't a per CPU
> capability, but a core or socket level facility, depending on CPU make:
>
> https://stackoverflow.com/questions/10484164/what-is-the-latency-and-
> throughput-of-the-rdrand-instruction-on-ivy-bridge
>
> 8 gigabits/sec sounds good throughput in principle, if there's no
> scalability pathologies with that.
>
> It would also be nice to know whether RDRAND does buffering *internally*,
> in which case it might be better to buffer as little at the system call
> level as possible, to allow the hardware RNG buffer to rebuild between
> system calls.
I will try asking around about concrete details on RDRAND behavior.
I have various bits and pieces I have been told plus measurements I did, but things
don't quite add up..
>
> I.e. I'd suggest to retrieve randomness via a fixed number of RDRAND-r64
> calls (where '1' is a perfectly valid block size - it should be
> measured), which random bits are then used as-is for the ~6 bits of
> system call stack offset. (I'd even suggest 7 bits: that skips a full
> cache line almost for free and makes the fuzz actually meaningful: no
> spear attacker will take a 1/128, 0.8% chance to successfully attack a
> critical system.)
>
> Then those 64*N random bits get buffered and consumed in 5-7 bit chunks,
> in a super efficient fashion, possibly inlining the fast path, totally
> outside the flow of the drivers/char/random.c
I will ask around on what is the best way to use RDRAND for our purpose.
>
> Any non-CPU source of randomness for system calls and plans to add
> several extra function calls to every x86 system call is crazy talk I
> believe...
So, if we go down the CPU-randomness path, what do we fall back to when
RDRAND is not available? Skip randomization altogether or fall back to the
CSPRNG?
Best Regards,
Elena.
..
> > rdrand (calling every 8 syscalls): Simple syscall: 0.0795 microseconds
>
> You could try something like:
> u64 rand_val = cpu_var->syscall_rand
>
> while (unlikely(rand_val == 0))
> rand_val = rdrand64();
>
> stack_offset = rand_val & 0xff;
> rand_val >>= 6;
> if (likely(rand_val >= 4))
> cpu_var->syscall_rand = rand_val;
> else
> cpu_var->syscall_rand = rdrand64();
>
> return stack_offset;
>
> That gives you 10 system calls per rdrand instruction
> and mostly takes the latency out of line.
I have experimented more (including the version above) and here are
more stable numbers:
CONFIG_PAGE_TABLE_ISOLATION=y:
base: Simple syscall: 0.1761 microseconds
get_random_bytes (4096 bytes buffer): Simple syscall: 0.1793 microseconds
rdrand(every 10 syscalls):Simple syscall: 0.1905 microseconds
rdrand(every 8 syscalls): Simple syscall: 0.1980 microseconds
CONFIG_PAGE_TABLE_ISOLATION=n:
base: Simple syscall: 0.0510 microseconds
get_random_bytes(4096 bytes buffer): Simple syscall: 0.0597 microseconds
rdrand (every 10 syscalls): Simple syscall: 0.0719 microseconds
rdrand (every 8 syscalls): Simple syscall: 0.0783 microseconds
So, pure speed wise get_random_bytes() with 1 page per-cpu buffer wins.
Also, I haven't yet found anyone with in-depth knowledge of the generator
behind rdrand, but the public design docs say it does indeed have internal
buffering, so my understanding is that as long as there is material available
in this internal buffer (shared between all CPUs), the rdrand instruction is
fast, but if the buffer needs refilling, then it is slow.
However, you can only ask it for a register's worth of randomness (you can't
ask for 5 bits, for example), so a strategy of fetching one full 64-bit
register and storing the result in a per-cpu buffer seems reasonable. The only
other way I can think of doing this is to call rdrand multiple times in a row
in one syscall to fill a bigger buffer and then consume bits from there;
I can try to measure whether this is faster (I doubt it).
Best Regards,
Elena.
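For reference, the measured get_random_bytes() variant with a one-page per-CPU
buffer can be sketched roughly like this. The random_get_byte() name matches
the symbol visible in the perf traces later in the thread, but the rest
(struct layout, refill policy) is an assumption for illustration only:

#include <linux/percpu.h>
#include <linux/random.h>

#define RANDOM_BUF_SIZE 4096	/* one page of buffered random bytes */

struct rnd_buf {
	u8  bytes[RANDOM_BUF_SIZE];
	u32 pos;
};
static DEFINE_PER_CPU(struct rnd_buf, stack_rand_buf);

/* Return one random byte; refill the per-CPU page every 4096 calls. */
static u8 random_get_byte(void)
{
	struct rnd_buf *buf = this_cpu_ptr(&stack_rand_buf);
	u8 val;

	if (buf->pos == 0)
		get_random_bytes(buf->bytes, sizeof(buf->bytes));

	val = buf->bytes[buf->pos];
	buf->pos = (buf->pos + 1) % RANDOM_BUF_SIZE;
	return val;
}

(A real implementation would also need to deal with preemption and reentrancy
from interrupts; this only illustrates where the amortization comes from.)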
* Reshetova, Elena <[email protected]> wrote:
> CONFIG_PAGE_TABLE_ISOLATION=n:
>
> base: Simple syscall: 0.0510 microseconds
> get_random_bytes(4096 bytes buffer): Simple syscall: 0.0597 microseconds
>
> So, pure speed wise get_random_bytes() with 1 page per-cpu buffer wins.
It still adds +17% overhead to the system call path, which is sad.
Why is it so expensive?
Thanks,
Ingo
> * Reshetova, Elena <[email protected]> wrote:
>
> > CONFIG_PAGE_TABLE_ISOLATION=n:
> >
> > base: Simple syscall: 0.0510 microseconds
> > get_random_bytes(4096 bytes buffer): Simple syscall: 0.0597 microseconds
> >
> > So, pure speed wise get_random_bytes() with 1 page per-cpu buffer wins.
>
> It still adds +17% overhead to the system call path, which is sad.
> Why is it so expensive?
I guess I can experiment further with increasing the buffer size and/or
using HW acceleration (so far I have mostly played around with different
rdrand paths).
What would be an acceptable overhead, approximately (so that I know how
much I need to squeeze this thing)?
Best Regards,
Elena.
* Reshetova, Elena <[email protected]> wrote:
> > * Reshetova, Elena <[email protected]> wrote:
> >
> > > CONFIG_PAGE_TABLE_ISOLATION=n:
> > >
> > > base: Simple syscall: 0.0510 microseconds
> > > get_random_bytes(4096 bytes buffer): Simple syscall: 0.0597 microseconds
> > >
> > > So, pure speed wise get_random_bytes() with 1 page per-cpu buffer wins.
> >
> > It still adds +17% overhead to the system call path, which is sad.
> > Why is it so expensive?
>
> I guess I can experiment further with buffer size increase and/or
> using HW acceleration (I mostly played around different rdrand paths now).
>
> What would be acceptable overhead approximately (so that I know how
> much I need to squeeze this thing)?
As much as possible? No idea, I'm sad about anything that is more than
0%, and I'd be *really* sad about anything more than say 1-2%.
I find it ridiculous that even with 4K blocked get_random_bytes(), which
gives us 32k bits, which with 5 bits should amortize the RNG call to
something like "once per 6553 calls", we still see 17% overhead? It's
either a measurement artifact, or something doesn't compute.
Thanks,
Ingo
> * Reshetova, Elena <[email protected]> wrote:
>
> > > * Reshetova, Elena <[email protected]> wrote:
> > >
> > > > CONFIG_PAGE_TABLE_ISOLATION=n:
> > > >
> > > > base: Simple syscall: 0.0510 microseconds
> > > > get_random_bytes(4096 bytes buffer): Simple syscall: 0.0597 microseconds
> > > >
> > > > So, pure speed wise get_random_bytes() with 1 page per-cpu buffer wins.
> > >
> > > It still adds +17% overhead to the system call path, which is sad.
> > > Why is it so expensive?
> >
> > I guess I can experiment further with buffer size increase and/or
> > using HW acceleration (I mostly played around different rdrand paths now).
> >
> > What would be acceptable overhead approximately (so that I know how
> > much I need to squeeze this thing)?
>
> As much as possible? No idea, I'm sad about anything that is more than
> 0%, and I'd be *really* sad about anything more than say 1-2%.
Ok, understood.
>
> I find it ridiculous that even with 4K blocked get_random_bytes(), which
> gives us 32k bits, which with 5 bits should amortize the RNG call to
> something like "once per 6553 calls", we still see 17% overhead? It's
> either a measurement artifact, or something doesn't compute.
If you check what happens underneath get_random_bytes(), there is
a fair amount of stuff going on, including reseeding the CRNG if the
reseeding interval has passed (see _extract_crng()). It also even attempts
to stir in more entropy from rdrand if available:
I will now look into this whole construction
slowly to investigate. I also didn't optimize anything yet (I take 8 bits at
a time for the offset), but such small optimizations won't bring the
performance impact from 17% --> 2%, so they are pointless for now; a more
radical shift is needed.
Best Regards,
Elena.
> > I find it ridiculous that even with 4K blocked get_random_bytes(), which
> > gives us 32k bits, which with 5 bits should amortize the RNG call to
> > something like "once per 6553 calls", we still see 17% overhead? It's
> > either a measurement artifact, or something doesn't compute.
>
> If you check what happens underneath of get_random_bytes(), there is
> a fair amount of stuff that is going on, including reseeding CRNG if reseeding
> interval has passed (see _extract_crng()). It also even attempts to stir in more
> entropy from rdrand if available:
Sorry pressed wrong button instead of copy pasting the code.
This is where it adds entropy:
if (arch_get_random_long(&v))
crng->state[14] ^= v;
> I will look into this whole construction
> slowly now to investigate. I didn't optimize anything yet also (I take 8 bits at
> the time for offset), but these small optimization won't make performance
> impact from 17% --> 2%, so pointless for now, need a more radical shift.
>
> Best Regards,
> Elena.
* Reshetova, Elena <[email protected]> wrote:
> > I find it ridiculous that even with 4K blocked get_random_bytes(),
> > which gives us 32k bits, which with 5 bits should amortize the RNG
> > call to something like "once per 6553 calls", we still see 17%
> > overhead? It's either a measurement artifact, or something doesn't
> > compute.
>
> If you check what happens underneath of get_random_bytes(), there is a
> fair amount of stuff that is going on, including reseeding CRNG if
> reseeding interval has passed (see _extract_crng()). It also even
> attempts to stir in more entropy from rdrand if available:
>
> I will look into this whole construction slowly now to investigate. I
> didn't optimize anything yet also (I take 8 bits at the time for
> offset), but these small optimization won't make performance impact
> from 17% --> 2%, so pointless for now, need a more radical shift.
So assuming that the 17% overhead primarily comes from get_random_bytes()
(does it? I don't know), that's incredibly slow for something like the
system call entry path, even if it's batched.
If so then maybe we have to find something else? RDRAND was just my
desperate suggestion to find something faster, I assumed a modern chip
would find no trouble finding plenty of source of quantum randomness,
given how much it has to fight it just to keep working - but that
assumption was apparently wrong. ;-)
Is there no cryptographically robust yet fast (pseudo-)random sequence
generator that is unlikely to be cracked real-time?
The performance bar that such a feature has to cross should be very high,
given how the whole threat model and justification is dubious to begin
with.
Or just give up on the whole notion if it cannot be done? ;-)
Thanks,
Ingo
On Thu, May 9, 2019 at 1:43 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Reshetova, Elena <[email protected]> wrote:
>
> > > I find it ridiculous that even with 4K blocked get_random_bytes(),
> > > which gives us 32k bits, which with 5 bits should amortize the RNG
> > > call to something like "once per 6553 calls", we still see 17%
> > > overhead? It's either a measurement artifact, or something doesn't
> > > compute.
> >
> > If you check what happens underneath of get_random_bytes(), there is a
> > fair amount of stuff that is going on, including reseeding CRNG if
> > reseeding interval has passed (see _extract_crng()). It also even
> > attempts to stir in more entropy from rdrand if available:
> >
> > I will look into this whole construction slowly now to investigate. I
> > didn't optimize anything yet also (I take 8 bits at the time for
> > offset), but these small optimization won't make performance impact
> > from 17% --> 2%, so pointless for now, need a more radical shift.
>
> So assuming that the 17% overhead primarily comes from get_random_bytes()
> (does it? I don't know), that's incredibly slow for something like the
> system call entry path, even if it's batched.
>
ISTM maybe a better first step would be to make get_random_bytes() be
much faster? :)
--Andy
On Sat, May 11, 2019 at 03:45:19PM -0700, Andy Lutomirski wrote:
> ISTM maybe a better first step would be to make get_random_bytes() be
> much faster? :)
I'm not opposed to that, but I want to make sure we don't break it for
"real" crypto uses...
I still think just using something very simple like rdtsc would be
good enough. This isn't meant to be a perfect defense: it's meant to
disrupt the ability to trivially predict (usually another thread's)
stack offset. And any sufficiently well-positioned local attacker can
defeat this no matter what the entropy source, given how small the number
of bits actually ends up being, assuming they can just keep launching
whatever they're trying to attack. (They can just hold still and try the
same offset until the randomness aligns: but that comes back to us also
needing a brute-force exec deterrence, which is a separate subject...)
The entropy source bikeshedding doesn't seem helpful given how few bits
we're dealing with.
--
Kees Cook
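For reference, the rdtsc-only variant under discussion is essentially just the
following (a sketch; the mask width and the 16-byte alignment shift are
arbitrary choices for illustration, not anyone's proposed values):

#include <linux/timex.h>	/* get_cycles() wraps RDTSC on x86 */

/* Use a few low TSC bits as the per-syscall stack offset. */
static unsigned long rdtsc_stack_offset(void)
{
	return (get_cycles() & 0x1f) << 4;	/* 32 offsets, 0..496 bytes */
}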
* Kees Cook <[email protected]> wrote:
> On Sat, May 11, 2019 at 03:45:19PM -0700, Andy Lutomirski wrote:
> > ISTM maybe a better first step would be to make get_random_bytes() be
> > much faster? :)
>
> I'm not opposed to that, but I want to make sure we don't break it for
> "real" crypto uses...
I'm quite sure Andy implied that.
> I still think just using something very simple like rdtsc would be good
> enough.
>
> This isn't meant to be a perfect defense: it's meant to disrupt the
> ability to trivially predict (usually another thread's) stack offset.
But aren't most local kernel exploit attacks against the current task?
Are there any statistics about this?
> And any sufficiently well-positioned local attacker can defeat this no
> matter what the entropy source, given how small the number of bits
> actually ends up being, assuming they can just keep launching whatever
> they're trying to attack. (They can just hold still and try the same
> offset until the randomness aligns: but that comes back to us also
> needing a brute-force exec deterrence, which is a separate subject...)
>
> The entropy source bikeshedding doesn't seem helpful given how few bits
> we're dealing with.
The low number of bits is still useful in terms of increasing the
probability of crashing the system if the attacker cannot guess the stack
offset.
With 5 bits there's a ~96.9% chance of crashing the system in an attempt,
the exploit cannot be used for a range of attacks, including spear
attacks and fast-spreading worms, right? A crashed and inaccessible
system also increases the odds of leaving around unfinished attack code
and leaking a zero-day attack.
Thanks,
Ingo
On Sun, May 12, 2019 at 10:02:45AM +0200, Ingo Molnar wrote:
> * Kees Cook <[email protected]> wrote:
> > I still think just using something very simple like rdtsc would be good
> > enough.
> >
> > This isn't meant to be a perfect defense: it's meant to disrupt the
> > ability to trivially predict (usually another thread's) stack offset.
>
> But aren't most local kernel exploit attacks against the current task?
> Are there any statistics about this?
I don't think I have any meaningful statistics on this any more. As
mentioned earlier, virtually all the known stack-based attack methods
have been mitigated:
- stack canaries: no linear buffer overflows
- thread_info moved: no more addr_limit games
- VMAP_STACK (i.e. guard page): no more linear overflows/exhaustion
overwriting neighboring memory
- no VLAs: no more index overflows of stack arrays
- stack variable initialization: vastly reduced chance of memory exposures
or use of "uninitialized" variables
One thing we don't have is a shadow call stack. i.e. the return addresses
are still mixed in with local variables, argument spills, etc. This is
still a potential target for attack, as ROP would need to get at the
return addresses. (Though creating a shadow stack just moves the problem,
but in theory to a place where there would be better control over it.)
As for current vs not, yes, many past exploits have been against the
current thread, though it's always been via the various low-hanging fruit
we've hopefully eliminated now. What I think remains interesting are
the attacks where there is some level of "spraying". It seems like this
happens more as mitigations land. For example, when trying to overwrite
a close() function pointer, if an attacker can't target a specific heap
allocation, they make lots of sockets and just close them all after
blindly poking memory. Similar things have been seen for thread-based
attacks (though that predated VMAP_STACK).
So, right now this mitigation remains a "what are we able to do to disrupt
the reliability of an attack that is targeting the stack?" without
a specific threat model in mind. I don't want the answer to be "we do
nothing because we can't find a way to perfectly construct entropy that
resists one threat model of many possible threats."
> > And any sufficiently well-positioned local attacker can defeat this no
> > matter what the entropy source, given how small the number of bits
> > actually ends up being, assuming they can just keep launching whatever
> > they're trying to attack. (They can just hold still and try the same
> > offset until the randomness aligns: but that comes back to us also
> > needing a brute-force exec deterrence, which is a separate subject...)
> >
> > The entropy source bikeshedding doesn't seem helpful given how few bits
> > we're dealing with.
>
> The low number of bits is still useful in terms of increasing the
> probability of crashing the system if the attacker cannot guess the stack
> offset.
Right, we have the benefit of many attacks in the kernel being fragile,
but it's possible that it may only wreck the thread and not take out
the kernel itself. (Or the attack is depending on some kind of stack
information exposure, not a write.)
> With 5 bits there's a ~96.9% chance of crashing the system in an attempt,
> the exploit cannot be used for a range of attacks, including spear
> attacks and fast-spreading worms, right? A crashed and inaccessible
> system also increases the odds of leaving around unfinished attack code
> and leaking a zero-day attack.
Yup, which is why I'd like to have _something_ here without us getting
lost in the "perfect entropy" weeds. :)
--
Kees Cook
> > With 5 bits there's a ~96.9% chance of crashing the system in an attempt,
> > the exploit cannot be used for a range of attacks, including spear
> > attacks and fast-spreading worms, right? A crashed and inaccessible
> > system also increases the odds of leaving around unfinished attack code
> > and leaking a zero-day attack.
>
> Yup, which is why I'd like to have _something_ here without us getting
> lost in the "perfect entropy" weeds. :)
I am really starting to believe that we cannot make good randomness sources
behave fast enough for per-syscall usage if our target is 1-2% overhead under
the worst possible (and potentially unrealistic) scenario (a stress test of
some simple syscall like getpid()).
The only thing that would fit within that margin is indeed rdtsc().
I profiled the path using get_random_bytes() and the results look like this
(arch_get_random_long is not inlined here, for measurement purposes):
| | | --9.44%--random_get_byte
| | | |
| | | --8.08%--get_random_bytes
| | | |
| | | --7.80%--_extract_crng.constprop.45
| | | |
| | | |--4.95%--arch_get_random_long
| | | |
| | | --2.39%--chacha_block
And here is the proof that under such usage _extract_crng bottlenecks on rdrand:
PerfTop: 5877 irqs/sec kernel:78.6% exact: 100.0% [4000Hz cycles:ppp], (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------------------
Showing cycles:ppp for _extract_crng.constprop.46
Events Pcnt (>=5%)
Percent | Source code & Disassembly of kcore for cycles:ppp (2104 samples, percent: local period)
-------------------------------------------------------------------------------------------------------
0.00 : ffffffff9abd1a62: mov $0xa,%edx
97.94 : ffffffff9abd1a67: rdrand %rax
And then of course there is the chacha permutation itself. So, I think Andy's
proposal to rewrite get_random_bytes() for speed is not so easy to implement.
So, given that all we want is to raise the bar for attackers trying to predict
the stack location on a subsequent syscall, is it really worth trying to come
up with more complex solutions than just using the lower bits of rdtsc() by
default?
One idea that was suggested to me last week is to create a pool of good
randomness and then, during a syscall, select a random number from the pool
using something like rdtsc() % POOL_SIZE.
The pool would need to be refilled periodically, outside of the syscall path,
to maintain diversity.
I can try this approach if people believe it would address the security
concerns around rdtsc() (my personal feeling is that one can still time-attack
this, if we assume rdtsc can be attacked, and the complexity of the whole
thing increases considerably).
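A very rough sketch of that pool idea, just to make the shape concrete
(POOL_SIZE, the refill mechanism and all names here are made up for
illustration, not a concrete proposal):

#include <linux/percpu.h>
#include <linux/random.h>
#include <linux/timex.h>	/* get_cycles(): RDTSC on x86 */

#define OFFSET_POOL_SIZE 256	/* illustrative */

struct offset_pool {
	u8 entry[OFFSET_POOL_SIZE];
};
static DEFINE_PER_CPU(struct offset_pool, offset_pool);

/* Refilled periodically outside the syscall path, e.g. from a timer. */
static void refill_offset_pool(void)
{
	get_random_bytes(this_cpu_ptr(&offset_pool)->entry, OFFSET_POOL_SIZE);
}

/* Syscall fast path: pick a pool entry using the low TSC bits as index. */
static u8 pick_stack_offset(void)
{
	unsigned int idx = (unsigned int)get_cycles() % OFFSET_POOL_SIZE;

	return this_cpu_ptr(&offset_pool)->entry[idx];
}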
If we decide that this is too much trouble for just the 5 bits of randomness
we need per syscall, I would still propose we reconsider the original rdtsc()
approach, since it is still better than nothing.
We can have the whole thing on three levels:
CONFIG_RANDOMIZE_KSTACK_OFFSET - off - no randomization, like now
CONFIG_RANDOMIZE_KSTACK_OFFSET on with rdtsc(), fast, better than nothing, but prone to
timing attacks
CONFIG_RANDOMIZE_KSTACK_OFFSET based on get_random_bytes() with better security guarantees.
Performance numbers will approximately look like this:
No randomization: Simple syscall: 0.0534 microseconds
With rdtsc(): Simple syscall: 0.0539 microseconds
With get_random_bytes (4096 buffer): Simple syscall: 0.0597 microseconds
The pure rdrand option, calling rdrand_long every 10th syscall, is
considerably slower:
With rdrand (every 10th syscall): Simple syscall: 0.0719 microseconds
And I guess we should once again remember that these are *not* the numbers
that real users will see in practice, since I doubt there are real workloads
issuing millions of *very lightweight* syscalls in a loop, so these are really
more "theoretical, worst case ever" numbers.
If someone could actually propose a reasonable *practical* workload to measure
with, then we can see what the overhead on it is, both for rdtsc and
get_random_bytes().
Best Regards,
Elena.
I confess I've kind of lost the plot on the performance requirements
at this point. Instead of measuring and evaluating potential
solutions, can we try to approach this from the opposite direction and
ask what the requirements are?
What's the maximum number of CPU cycles that we are allowed to burn
such that we can meet the 1-2% overhead?
And how many bits of uncertainty are we trying to present to the
attacker? What's the minimum beyond which we shouldn't bother? (Perhaps
because rdtsc will give us that many bits?) And does that change if
we vary the reseed window in terms of the number of system calls
between reseeding?
And what are the ideal parameters after which point we're just gilding
the lily?
- Ted
> I confess I've kind of lost the plot on the performance requirements
> at this point. Instead of measuring and evaluating potential
> solutions, can we try to approach this from the opposite direction and
> ask what the requirements are?
>
> What's the maximum number of CPU cycles that we are allowed to burn
> such that we can meet the 1-2% overhead?
This is a good question to which I have no answer, so I have tried to
explore the performance numbers as much as I could. I don't know what is
considered acceptable for the syscall path, and I guess the answer can also
be presented as two security configurations, weak (fewer CPU cycles)
and strong (more cycles).
>
> And how many bits of uncertainty are we trying to present to the
> attacker?
I think currently we are talking about 5 bits. Ideally maybe 8 bits;
given that the offset is new for each syscall, that should be enough to remove
the "stability and predictability" property of the kernel thread stack that
attackers rely on. If your chances of crashing the kernel are 255 out of 256,
you might look for some other attack path instead.
> What's the minimum beyond which we shouldn't bother? (Perhaps
> because rdtsc will give us that many bits?)
Again, it is all about probabilities. 4 bits gives an attacker a success
chance of 1 in 16 (of not crashing the kernel); this is already starting to be
dangerous in my view, so maybe no fewer than 5?
> And does that change if
> we vary the reseed window in terms of the number of system calls
> between reseeding?
I think if we talk about prng, reseeding should be done periodically to
make sure we never overrun the safe generation period and also for
additional security (seeding with proper entropy).
But the most important part is to have a distinct offset (random bits) for
each consecutive syscall.
> And what are the ideal parameters after which point we're just gilding
> the lily?
Not sure about ideal params for the whole combination here, since security
and performance basically conflict with each other (as usual).
So, that's why I was trying to propose having two versions of this:
- one tilted towards performance (rdtsc-based)
- one tilted towards security (CRNG-based)
And then let users choose what matters more for their workload.
For normal things like desktops, etc., the CRNG-based version won't introduce
any noticeable overhead. It might only matter for syscall-sensitive workloads,
which, btw, most likely don't enable quite a few other security protections
either, so I would say that for them even the rdtsc() version is actually an
improvement in their stack defenses (and has basically no impact on performance).
On a related note: the current prng we have in the kernel (prandom) is based
on a *very old* style of prng, basically 4 linear LFSRs xored together.
Nowadays we have much more powerful prngs that show much better
statistical and even security properties (not cryptographically secure, but
still not as linear as the one above).
Why do we still use a prng that is a couple of decades behind the
state of the art in the area? It is actively used, especially by the network
stack; should we update it to something more appropriate (speed would be
comparable)?
I am mostly talking about PCG-based generators:
http://www.pcg-random.org/
If people are interested, I could put together a PoC and we have an expert here we can
consult for providing calculations for min-entropy, HILL entropy and whatever
is requested.
Best Regards,
Elena.
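For anyone curious what a PCG-style generator looks like, this is roughly the
minimal 32-bit variant from pcg-random.org (userspace form, shown only to
illustrate how small it is; it is not kernel code and not part of any patch):

#include <stdint.h>

/* Minimal PCG32: a 64-bit LCG state whose output is permuted by an
 * xorshift followed by a data-dependent rotate. */
struct pcg32 {
	uint64_t state;
	uint64_t inc;	/* must be odd; selects the stream */
};

static uint32_t pcg32_next(struct pcg32 *rng)
{
	uint64_t oldstate = rng->state;
	uint32_t xorshifted, rot;

	rng->state = oldstate * 6364136223846793005ULL + rng->inc;
	xorshifted = (uint32_t)(((oldstate >> 18) ^ oldstate) >> 27);
	rot = (uint32_t)(oldstate >> 59);

	return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}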
From: Reshetova, Elena
> Sent: 29 May 2019 11:14
....
> On related note: the current prng we have in kernel (prandom) is based on a
> *very old* style of prngs, which is basically 4 linear LFSRs xored together.
I'm no expert here (apart from some knowledge of LFSR/CRC) but
even adding the results of the 4 LFSRs (instead of xoring them) will make
the generator much more secure (aka computationally expensive to
reverse) without affecting the randomness or repeat cycle.
FWIW if you are going to merge LFSRs you probably want to clock
them different numbers of times (+ve or -ve), otherwise the
output 'mostly' shifts one bit per clock and the same bits
tend to get merged.
David
On Wed, May 29, 2019 at 10:13:43AM +0000, Reshetova, Elena wrote:
> Not sure about ideal params for the whole combination here since security
> and performance are basically conflicting with each other (as usual).
> So, that's why I was trying to propose to have two version of this:
> - one with tilt towards performance (rdtsc based)
> - one with tilt towards security (CRNG-based)
> And then let users choose what matters more for their workload.
> For normal things like desktops, etc. CRNG based version won't provide
> any noticeable overhead. It might only matter for syscall sensitive workloads,
> which btw, most likely not enable quite a bunch of other security protections,
I think people that care about this would prefer the CRNG, and will be
much less interested in the performance issues. But giving people the
option to choose it at runtime seems sensible. Though really, for any
"real" workloads, it's totally lost in the noise, even with the CRNG.
> so I would say that for them to have even rdtsc() version is actually an
> improvement in their defenses for stack (and basically no impact on performance).
Yeah, I think a static-key based version of this would be very nice and
would stay out of anyone's way that didn't want it.
--
Kees Cook
On Wed, May 29, 2019 at 10:13:43AM +0000, Reshetova, Elena wrote:
> On related note: the current prng we have in kernel (prandom) is based on a
> *very old* style of prngs, which is basically 4 linear LFSRs xored together.
> Nowadays, we have much more powerful prngs that show much better
> statistical and even security properties (not cryptographically secure, but still
> not so linear like the one above).
> What is the reason why we still use a prng that is couple of decades away from the
> state of art in the area? It is actively used, especially by network stack,
> should we update it to smth that is more appropriate (speed would be comparable)?
>
> I am mostly talking about PCG-based generators:
> http://www.pcg-random.org/
>
> If people are interested, I could put together a PoC and we have an expert here we can
> consult for providing calculations for min-entropy, HILL entropy and whatever
> is requested.
If we get better generators with no speed loss, I can't imagine anyone
objecting. :)
--
Kees Cook
Ingo, Andy,
I want to summarize here the data (including the performance numbers)
and reasoning for the in-stack randomization feature. I have organized
it in a simple set of Q&A below.
Q: Why do we need in-stack per-syscall randomization when we already have
all known attack vectors covered with existing protections?
A: First, we have all known (to us) attack vectors covered *only* if all
existing related protections are enabled. While we do expect vmalloc-based
stack allocation with guard pages, as well as a VLA-free kernel, to be the
default setup for most of the kernels we are concerned about, the same cannot
be said about GCC-plugin-based features such as STACKLEAK. The reasons are:
a) STACKLEAK requires GCC plugin enablement during compilation, which is not
the default case now for many distros; b) STACKLEAK can introduce very visible
performance overhead.
Please see the microbenchmarks below for the cases you asked me to measure
earlier. The performance hit can be 79% for a simple syscall and, I believe,
even bigger for complex syscalls where a lot of stack needs to be cleaned.
Q: What is the performance impact both micro benchmark and on real workloads?
A: Here is a table summarizing the perf numbers depending on the subroutine
used for obtaining random bits, with STACKLEAK numbers for comparison:
1. Microbenchmarking:
lmbench: ./lat_syscall -N 1000000 null:
CONFIG_PAGE_TABLE_ISOLATION=y:
base: 0.1737
rdtsc: 0.1803 (+3.89%)
get_random_bytes (4096 byte buffer): 0.1811 (+4.26%)
RDRAND (every 10th syscall): 0.1910 (+9.96%)
STACKLEAK: 0.2150 (+23.78%)
CONFIG_PAGE_TABLE_ISOLATION=n:
base: 0.0522
rdtsc: 0.0556 (+6.51%)
get_random_bytes (4096 byte buffer): 0.0599 (+14.75%)
RDRAND (every 10th syscall): 0.0706 (+35.25%)
STACKLEAK: 0.0935 (+79.12%)
So, at least one thing that follows from the above numbers is that STACKLEAK's
performance is much worse even for the simple-syscall case (more complex
syscalls likely cause an even bigger perf hit since the stack must be cleaned
deeper). However, this hasn't prevented STACKLEAK from being merged upstream,
and STACKLEAK is a much more complex feature both in code size and code logic...
2. Kernel compilation time (example of real workload):
Base (CONFIG_PAGE_TABLE_ISOLATION=n):
real: 31m17,601s
user: 224m56,144s
sys: 18m26,946s
get_random_bytes (4096 byte buffer, CONFIG_PAGE_TABLE_ISOLATION=n):
real: 31m22,684s (+0.271% increase)
user: 225m4,766s
sys: 18m27,113s
STACKLEAK (CONFIG_PAGE_TABLE_ISOLATION=n):
real: 31m23,289s (+0.303% increase)
user: 224m24,001s
sys: 19m36,882s
Q: Is there a way to drastically improve the microbenchmark results
for get_random_bytes() without decreasing security?
A: As far as I can see, no, but I am not a perf optimization expert.
My measurements and profiling showed that the biggest load comes from
the _extract_crng function and the underlying arch_get_random_long()
(basically calling RDRAND on x86) and the chacha_block function (the core
chacha functionality and the permutation itself). So, unless the permutation
is rewritten or the additional seeding is dropped (which happens automatically
if the arch does not support RDRAND), there is no way, in my understanding, to
improve things drastically. See traces below.
perf (4096 isolation off, get_random_byte, arch_get_random_long not inlined):
98.94% 0.00% lat_syscall [unknown] [k] 0x48fffffe27e80f74
|
---0x48fffffe27e80f74
|
|--97.73%--__GI___getppid
| |
| |--53.29%--entry_SYSCALL_64_after_hwframe
| | |
| | |--29.73%--do_syscall_64
| | | |
| | | |--11.05%--__x64_sys_getppid
| | | | |
| | | | --10.64%--__task_pid_nr_ns
| | | |
| | | --9.44%--random_get_byte
| | | |
| | | --8.08%--get_random_bytes
| | | |
| | | --7.80%--_extract_crng.constprop.45
| | | |
| | | |--4.95%--arch_get_random_long
| | | |
| | | --2.39%--chacha_block
17.22% [kernel] [k] syscall_return_via_sysret
15.32% libc-2.27.so [.] __GI___getppid
13.33% [kernel] [k] __x64_sys_getppid
10.04% [kernel] [k] __task_pid_nr_ns
8.60% [kernel] [k] do_syscall_64
7.94% [kernel] [k] entry_SYSCALL_64
7.08% [kernel] [k] entry_SYSCALL_64_after_hwframe
4.45% [kernel] [k] _extract_crng.constprop.46
2.28% [kernel] [k] chacha_permute
1.92% [kernel] [k] __indirect_thunk_start
1.69% [kernel] [k] random_get_byte
0.75% lat_syscall [.] do_getppid
0.50% [kernel] [k] fw_domains_get_with_fallback
0.43% lat_syscall [.] getppid@plt
PerfTop: 5877 irqs/sec kernel:78.6% exact: 100.0% [4000Hz cycles:ppp], (all, 8 CPUs)
---------------------------------------------------------------------------------------------
Showing cycles:ppp for _extract_crng.constprop.46
Events Pcnt (>=5%)
Percent | Source code & Disassembly of kcore for cycles:ppp (2104 samples, percent: local period)
--------------------------------------------------------------------------------------------------
0.00 : ffffffff9abd1a62: mov $0xa,%edx
97.94 : ffffffff9abd1a67: rdrand %rax
Q: So, what is the resulting benefit of in-stack per-syscall randomization?
A: Randomizing the kernel stack on each syscall removes one of the main
properties that has attracted attackers to the kernel thread stack for the
last couple of decades - its deterministic structure.
Reliability is one of the most important features of kernel exploits (not so
much of userspace ones, since they can always survive the fact that a process
can be terminated). An attacker cannot afford to panic the kernel, because it
normally means that the attack target is dead and at best they can continue
from square one after the target reboots, with intrusion detection systems
already alert. So, removing the ability to reliably predict what the thread
stack layout will be on a subsequent syscall is a big hurdle for an attacker
constructing an exploit chain that involves the kernel stack.
It basically means that a class of stack-based attacks that do some preparation
(or information discovery) work in a first syscall and then utilize this
knowledge on a subsequent syscall cannot be reliable anymore (unless the target
is pt_regs, since we don't change the pt_regs position), which means that they
are much less attractive for attackers.
The in-stack randomization is really a very small change both code wise and
logic wise.
It does not affect real workloads and does not require enablement of other
features (such as GCC plugins).
So, I think we should really reconsider its inclusion.
On Mon, Jul 29, 2019 at 11:41:11AM +0000, Reshetova, Elena wrote:
> I want to summarize here the data (including the performance numbers)
> and reasoning for the in-stack randomization feature. I have organized
> it in a simple set of Q&A below.
Thanks for these!
> The in-stack randomization is really a very small change both code wise and
> logic wise.
> It does not affect real workloads and does not require enablement of other
> features (such as GCC plugins).
> So, I think we should really reconsider its inclusion.
I'd agree: the code is tiny and while the benefit can't point to a
specific issue, it does point to the general weakness of the stack
offset being predictable which has been a core observation for many
stack-based attacks.
If we're going to save state between syscalls (like the 4096 random
bytes pool), how about instead we just use a single per-CPU long mixed
with rdtsc saved at syscall exit. That should be a reasonable balance
between all the considerations and make it trivial for the feature to
be a boot flag without the extra page of storage, etc.
--
Kees Cook
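A minimal sketch of what that could look like (all names and the exact mixing
here are illustrative assumptions, not a concrete patch):

#include <linux/percpu.h>
#include <linux/timex.h>	/* get_cycles(): RDTSC on x86 */

static DEFINE_PER_CPU(unsigned long, kstack_offset_seed);

/* Syscall exit: fold the current TSC value into the per-CPU seed. */
static void choose_next_kstack_offset(void)
{
	unsigned long seed = this_cpu_read(kstack_offset_seed);

	this_cpu_write(kstack_offset_seed, (seed << 5) ^ get_cycles());
}

/* Syscall entry: the handful of offset bits comes from the stored seed. */
static unsigned long kstack_offset(void)
{
	return this_cpu_read(kstack_offset_seed) & 0x3f;
}

This keeps the syscall-entry cost to a per-CPU read plus a mask, and moves the
TSC read and the mixing to the exit path, as suggested.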
>> The in-stack randomization is really a very small change both code wise and
>> logic wise.
>> It does not affect real workloads and does not require enablement of other
>> features (such as GCC plugins).
>> So, I think we should really reconsider its inclusion.
>I'd agree: the code is tiny and while the benefit can't point to a
>specific issue, it does point to the general weakness of the stack
>offset being predictable which has been a core observation for many
>stack-based attacks.
>If we're going to save state between syscalls (like the 4096 random
>bytes pool), how about instead we just use a single per-CPU long mixed
>with rdtsc saved at syscall exit. That should be a reasonable balance
>between all the considerations and make it trivial for the feature to
>be a boot flag without the extra page of storage, etc.
Sounds like a viable compromise to me.
Ingo, Andy?
Best Regards,
Elena.