Currently, kstack randomization adds roughly 5x more variation in
syscall response latencies compared to unrandomized syscalls. This is
down from ~10x on pre-6.2 kernels, before the RNG reseeding was moved
out of the syscall path, but get_random_uXX() still performs a fair
amount of global state manipulation, which is problematic.
So, let's replace the full get_random_u16() in the syscall path with
prandom_u32_state(). This also has the advantage of bringing the
randomized and unrandomized overall syscall performance much closer
together. Even so, in the syscall path prandom_u32_state() remains
measurably worse in perf/latency measurements than the non-random
functions (cycle counters) other architectures rely on. By comparison,
the algorithm presented in the RFC had basically no impact, since
recent OoO cores are able to hide all of the overhead of the handful
of additional instructions.
I'm still looking for suggestions on reseeding prandom_u32_state() if
needed, or on improving the performance of get_random_u16(). So
consider this somewhat more than an RFC and maybe somewhat less than a
full patch request.
RFC->V1:
Replace custom inline RNG with prandom_u32_state
Jeremy Linton (1):
arm64: syscall: Direct PRNG kstack randomization
arch/arm64/kernel/syscall.c | 42 ++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)
--
2.43.0
The existing arm64 stack randomization uses the kernel rng to acquire
5 bits of address space randomization. This is problematic because it
creates non-determinism in the syscall path whenever the rng output
needs to be generated or its pool reseeded. This shows up as large
tail latencies in some benchmarks and directly affects the minimum RT
latencies seen by cyclictest.
Other architectures use timers/cycle counters for this function, which
is sketchy from a randomization perspective because it should be
possible to estimate the value from knowledge of the syscall return
time and from reading the current value of the timer/counter.
So, a poor rng should be better than the cycle counter, provided it is
sufficiently hard to extract the stack offsets and thereby detect the
PRNG's period. Let's downgrade from get_random_u16() to
prandom_u32_state() under the theory that the danger of someone
guessing the 1-in-32 per-call offset is larger than that of someone
extracting sufficient history to accurately predict future
offsets. Further, it should be safer to run with prandom_u32_state()
than to disable stack randomization for the subset of applications
where the difference in latency is on the order of ~5x.
Reported-by: James Yang <[email protected]>
Reported-by: Shiyou Huang <[email protected]>
Signed-off-by: Jeremy Linton <[email protected]>
---
arch/arm64/kernel/syscall.c | 42 ++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 9a70d9746b66..33b3ea4adff8 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -5,6 +5,7 @@
#include <linux/errno.h>
#include <linux/nospec.h>
#include <linux/ptrace.h>
+#include <linux/prandom.h>
#include <linux/randomize_kstack.h>
#include <linux/syscalls.h>
@@ -37,6 +38,45 @@ static long __invoke_syscall(struct pt_regs *regs, syscall_fn_t syscall_fn)
return syscall_fn(regs);
}
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+DEFINE_PER_CPU(struct rnd_state, kstackrng);
+
+static u16 kstack_rng(void)
+{
+ u32 rng = prandom_u32_state(this_cpu_ptr(&kstackrng));
+
+ return rng & 0x1ff;
+}
+
+/* Should we reseed? */
+static int kstack_rng_setup(unsigned int cpu)
+{
+ u32 rng_seed;
+
+ /* zero should be avoided as a seed */
+ do {
+ rng_seed = get_random_u32();
+ } while (!rng_seed);
+ prandom_seed_state(this_cpu_ptr(&kstackrng), rng_seed);
+ return 0;
+}
+
+static int kstack_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:kstackrandomize",
+ kstack_rng_setup, NULL);
+ if (ret < 0)
+ pr_err("kstack: failed to register rng callbacks.\n");
+ return 0;
+}
+
+arch_initcall(kstack_init);
+#else
+static u16 kstack_rng(void) { return 0; }
+#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
+
static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
unsigned int sc_nr,
const syscall_fn_t syscall_table[])
@@ -66,7 +106,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
*
* The resulting 5 bits of entropy is seen in SP[8:4].
*/
- choose_random_kstack_offset(get_random_u16() & 0x1FF);
+ choose_random_kstack_offset(kstack_rng());
}
static inline bool has_syscall_work(unsigned long flags)
--
2.43.0
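For readers following along, here is a rough user-space model of what
the patch above does: a per-CPU-style PRNG state seeded once from a
strong source, then a cheap generator masked down to the bits that
choose_random_kstack_offset() keeps. All `model_*` names are invented
for this sketch, and the xorshift32 step is only a stand-in; the
kernel's prandom_u32_state() is actually an LFSR113-style generator.

```c
/* User-space model of the patch's fast path; illustration only. */
#include <stdint.h>

struct rnd_state_model {
	uint32_t s;
};

/* Stand-in for prandom_u32_state(); NOT the kernel's generator. */
static uint32_t model_prandom_u32(struct rnd_state_model *st)
{
	uint32_t x = st->s;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return st->s = x;
}

/* Mirrors kstack_rng(): keep 9 bits, of which 5 land in SP[8:4]. */
static uint16_t model_kstack_rng(struct rnd_state_model *st)
{
	return model_prandom_u32(st) & 0x1ff;
}

/* Mirrors kstack_rng_setup(): a zero seed would pin xorshift at 0. */
static void model_seed(struct rnd_state_model *st, uint32_t seed)
{
	st->s = seed ? seed : 1;
}
```

The point of the model is that the per-syscall cost is a handful of
shifts and XORs with no shared global state, unlike get_random_u16().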
On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
> The existing arm64 stack randomization uses the kernel rng to acquire
> 5 bits of address space randomization. This is problematic because it
> creates non determinism in the syscall path when the rng needs to be
> generated or reseeded. This shows up as large tail latencies in some
> benchmarks and directly affects the minimum RT latencies as seen by
> cyclictest.
>
> Other architectures are using timers/cycle counters for this function,
> which is sketchy from a randomization perspective because it should be
> possible to estimate this value from knowledge of the syscall return
> time, and from reading the current value of the timer/counters.
>
> So, a poor rng should be better than the cycle counter if it is hard
> to extract the stack offsets sufficiently to be able to detect the
> PRNG's period. Lets downgrade from get_random_u16() to
> prandom_u32_state() under the theory that the danger of someone
> guessing the 1 in 32 per call offset, is larger than that of being
> able to extract sufficient history to accurately predict future
> offsets. Further it should be safer to run with prandom_u32_state than
> disabling stack randomization for those subset of applications where the
> difference in latency is on the order of ~5X worse.
>
> Reported-by: James Yang <[email protected]>
> Reported-by: Shiyou Huang <[email protected]>
> Signed-off-by: Jeremy Linton <[email protected]>
> ---
> arch/arm64/kernel/syscall.c | 42 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 9a70d9746b66..33b3ea4adff8 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -5,6 +5,7 @@
> #include <linux/errno.h>
> #include <linux/nospec.h>
> #include <linux/ptrace.h>
> +#include <linux/prandom.h>
> #include <linux/randomize_kstack.h>
> #include <linux/syscalls.h>
>
> @@ -37,6 +38,45 @@ static long __invoke_syscall(struct pt_regs *regs, syscall_fn_t syscall_fn)
> return syscall_fn(regs);
> }
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +DEFINE_PER_CPU(struct rnd_state, kstackrng);
> +
> +static u16 kstack_rng(void)
> +{
> + u32 rng = prandom_u32_state(this_cpu_ptr(&kstackrng));
> +
> + return rng & 0x1ff;
> +}
> +
> +/* Should we reseed? */
> +static int kstack_rng_setup(unsigned int cpu)
> +{
> + u32 rng_seed;
> +
> + /* zero should be avoided as a seed */
> + do {
> + rng_seed = get_random_u32();
> + } while (!rng_seed);
> + prandom_seed_state(this_cpu_ptr(&kstackrng), rng_seed);
> + return 0;
> +}
> +
> +static int kstack_init(void)
> +{
> + int ret;
> +
> + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:kstackrandomize",
> + kstack_rng_setup, NULL);
This will run initial seeding, but don't we need to reseed this with
some kind of frequency?
Otherwise, seems fine to me.
--
Kees Cook
On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
> On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
>> The existing arm64 stack randomization uses the kernel rng to acquire
>> 5 bits of address space randomization. This is problematic because it
>> creates non determinism in the syscall path when the rng needs to be
>> generated or reseeded. This shows up as large tail latencies in some
>> benchmarks and directly affects the minimum RT latencies as seen by
>> cyclictest.
>>
>> Other architectures are using timers/cycle counters for this function,
>> which is sketchy from a randomization perspective because it should be
>> possible to estimate this value from knowledge of the syscall return
>> time, and from reading the current value of the timer/counters.
As I commented on the previous version, I don't want to see
a change that only addresses one architecture like this. If you
are convinced that using a cycle counter is a mistake, then we
should do the same thing on the other architectures as well
that currently use a cycle counter.
>> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>> +DEFINE_PER_CPU(struct rnd_state, kstackrng);
>> +
>> +static u16 kstack_rng(void)
>> +{
>> + u32 rng = prandom_u32_state(this_cpu_ptr(&kstackrng));
>> +
>> + return rng & 0x1ff;
>> +}
>> +
>> +/* Should we reseed? */
>> +static int kstack_rng_setup(unsigned int cpu)
>> +{
>> + u32 rng_seed;
>> +
>> + /* zero should be avoided as a seed */
>> + do {
>> + rng_seed = get_random_u32();
>> + } while (!rng_seed);
>> + prandom_seed_state(this_cpu_ptr(&kstackrng), rng_seed);
>> + return 0;
>> +}
>> +
>> +static int kstack_init(void)
>> +{
>> + int ret;
>> +
>> + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:kstackrandomize",
>> + kstack_rng_setup, NULL);
>
> This will run initial seeding, but don't we need to reseed this with
> some kind of frequency?
Won't that defeat the purpose of the patch that was intended
to make the syscall latency more predictable? At least the
simpler approaches of reseeding from the kstack_rng()
function itself would have this problem; deferring it to
another context comes with a separate set of problems.
Arnd
Hi,
On 3/6/24 14:46, Arnd Bergmann wrote:
> On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
>> On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
>>> The existing arm64 stack randomization uses the kernel rng to acquire
>>> 5 bits of address space randomization. This is problematic because it
>>> creates non determinism in the syscall path when the rng needs to be
>>> generated or reseeded. This shows up as large tail latencies in some
>>> benchmarks and directly affects the minimum RT latencies as seen by
>>> cyclictest.
>>>
>>> Other architectures are using timers/cycle counters for this function,
>>> which is sketchy from a randomization perspective because it should be
>>> possible to estimate this value from knowledge of the syscall return
>>> time, and from reading the current value of the timer/counters.
>
> As I commented on the previous version, I don't want to see
> a change that only addresses one architecture like this. If you
> are convinced that using a cycle counter is a mistake, then we
> should do the same thing on the other architectures as well
> that currently use a cycle counter.
I personally tend to agree as long as we aren't creating a similar set
of problems for those architectures as we are seeing on arm. Currently
the kstack rng on/off choice is basically zero overhead for them.
>
>>> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>>> +DEFINE_PER_CPU(struct rnd_state, kstackrng);
>>> +
>>> +static u16 kstack_rng(void)
>>> +{
>>> + u32 rng = prandom_u32_state(this_cpu_ptr(&kstackrng));
>>> +
>>> + return rng & 0x1ff;
>>> +}
>>> +
>>> +/* Should we reseed? */
>>> +static int kstack_rng_setup(unsigned int cpu)
>>> +{
>>> + u32 rng_seed;
>>> +
>>> + /* zero should be avoided as a seed */
>>> + do {
>>> + rng_seed = get_random_u32();
>>> + } while (!rng_seed);
>>> + prandom_seed_state(this_cpu_ptr(&kstackrng), rng_seed);
>>> + return 0;
>>> +}
>>> +
>>> +static int kstack_init(void)
>>> +{
>>> + int ret;
>>> +
>>> + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:kstackrandomize",
>>> + kstack_rng_setup, NULL);
>>
>> This will run initial seeding, but don't we need to reseed this with
>> some kind of frequency?
>
> Won't that defeat the purpose of the patch that was intended
> to make the syscall latency more predictable? At least the
> simpler approaches of reseeding from the kstack_rng()
> function itself would have this problem, deferring it to
> another context comes with a separate set of problems.
And that describes why I've not come up with an inline reseeding
solution. Which of course isn't a problem on !arm if one just pushes a
few bits of a cycle counter into the rnd_state every few dozen syscalls,
or whatever. Mark R. mentioned offline the idea of just picking a few
bits off CNTVCT as a seed, but it's so slow it basically has to be used
to fuzz a bit or two of rnd_state on some fairly long interval. Long
enough that if someone has a solution for extracting rnd_state it might
not add any additional security. Or that is my take, since I'm not a big
fan of any independent counter/clock based RNG seeding (AFAIK, entropy
from clocks requires multiple _independent_ sources).
This is a bit out of my wheelhouse, so I defer to anyone with a better
feel or some actual data.
The best plan I have at the moment is just some deferred work to call
kstack_rng_setup on some call or time based interval, which AFAIK isn't
ideal for RT workloads which expect ~100% CPU isolation. Plus, that
solution assumes we have some handle on how fast an attacker can extract
kstackrng sufficiently to make predictions.
Again, thanks to everyone for looking at this,
Jeremy
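The deferred-reseed idea being discussed can be sketched in user space
as follows: the fast path only steps the cheap generator, and a
separate housekeeping function (imagined as running from a timer or
workqueue, i.e. outside the syscall path) folds fresh entropy into the
state. All names are made up for illustration, and the xorshift step
is a stand-in for the real generator.

```c
/* Sketch of splitting generation (fast path) from reseeding. */
#include <stdint.h>

static uint32_t model_state = 1;

/* Fast path: cheap xorshift step, no reseeding, bounded cost. */
static uint32_t fast_path_rand(void)
{
	uint32_t x = model_state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return model_state = x;
}

/* Deferred path: mix in fresh seed material, avoiding the zero state. */
static void deferred_reseed(uint32_t fresh_entropy)
{
	model_state ^= fresh_entropy;
	if (!model_state)
		model_state = 1;
}
```

The open question in the thread is exactly when deferred_reseed() runs,
since any fixed schedule either perturbs isolated RT CPUs or leaves a
window in which the state is never refreshed.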
On Wed, Mar 6, 2024, at 22:54, Jeremy Linton wrote:
> On 3/6/24 14:46, Arnd Bergmann wrote:
>> On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
>>> On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
>>>> The existing arm64 stack randomization uses the kernel rng to acquire
>>>> 5 bits of address space randomization. This is problematic because it
>>>> creates non determinism in the syscall path when the rng needs to be
>>>> generated or reseeded. This shows up as large tail latencies in some
>>>> benchmarks and directly affects the minimum RT latencies as seen by
>>>> cyclictest.
>>>>
>>>> Other architectures are using timers/cycle counters for this function,
>>>> which is sketchy from a randomization perspective because it should be
>>>> possible to estimate this value from knowledge of the syscall return
>>>> time, and from reading the current value of the timer/counters.
>>
>> As I commented on the previous version, I don't want to see
>> a change that only addresses one architecture like this. If you
>> are convinced that using a cycle counter is a mistake, then we
>> should do the same thing on the other architectures as well
>> that currently use a cycle counter.
>
> I personally tend to agree as long as we aren't creating a similar set
> of problems for those architectures as we are seeing on arm. Currently
> the kstack rng on/off choice is basically zero overhead for them.
I think we have two separate problems to solve here: how
strong a kernel stack randomization should be on a given system,
and who gets to make the decision.
For the strength, we have at least four options:
- strong rng, most expensive
- your new prng, less strong but somewhat cheaper and/or more
predictable overhead
- cycle counter, cheap but probably even less strong,
needs architecture code.
- no rng, no overhead and no protection.
On architectures that have a cheap hardware rng instruction, you
can count that one as well.
For picking between the options, we have
- architecture maintainer
- defconfig (CONFIG_RANDOMIZE_KSTACK_OFFSET)
- boot time (randomize_kstack_offset=on)
For both of these lists, I would like to see as few options
as possible, and in particular I think the architecture
maintainer should not make that decision for the users.
If we want to improve this, I would start by changing
the binary CONFIG_RANDOMIZE_KSTACK_OFFSET option into
a three-way choice between cycle counter (where available),
strong rng and off, possibly adding the cycle counter
option to the two architectures that currently don't use
it, while moving the strong rng into common code.
After that, we can debate adding a fourth option, or
replacing one of the existing two that is too slow or
not random enough.
>> Won't that defeat the purpose of the patch that was intended
>> to make the syscall latency more predictable? At least the
>> simpler approaches of reseeding from the kstack_rng()
>> function itself would have this problem, deferring it to
>> another context comes with a separate set of problems.
>
> And that describes why I've not come up with an inline reseeding
> solution. Which of course isn't a problem on !arm if one just pushes a
> few bits of a cycle counter into the rnd_state every few dozen syscalls,
> or whatever. Mark R, mentioned offline the idea of just picking a few
> bits off CNTVCT as a seed, but its so slow it basically has to be used
> to fuzz a bit or two of rnd_state on some fairly long interval. Long
> enough that if someone has a solution for extracting rnd_state it might
> not add any additional security. Or that is my take, since i'm not a big
> fan of any independent counter/clock based RNG seeding (AFAIK, entropy
> from clocks requires multiple _independent_ sources).
I'm not sure I understand the logic. Do you mean that accessing
CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
because of the overhead of reseeding?
On powerpc/s390/x86, the low bits of the respective cycle counter
is simply used without any kind of bit shuffling, and we already
rely on the cycle counter to be fast to access since it's used
for timekeeping anyway.
There is not even any attempt to use the most random bits of
the cycle counter, as both the high 22 to 24 bits get masked
out (to keep the wasted stack space small) and the low 3 or 4
bits get ignored because of stack alignment. If there was
any desire to make it more random, a trivial improvement
would be:
+++ b/include/linux/randomize_kstack.h
@@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
u32 offset = raw_cpu_read(kstack_offset); \
- offset ^= (rand); \
+ offset = ror32(offset, 5) & (rand); \
raw_cpu_write(kstack_offset, offset); \
} \
} while (0)
My impression is that this is already bordering on becoming
a "bespoke rng" implementation that Jason was objecting to,
so the current version is intentionally left weak in order
to not even give the appearance of being a security relevant
feature.
> This is a bit out of my wheelhouse, so I defer to anyone with a better
> feel or some actual data.
>
> The best plan I have at the moment is just some deferred work to call
> kstack_rng_setup on some call or time based interval, which AFAIK isn't
> ideal for RT workloads which expect ~100% CPU isolation. Plus, that
> solution assumes we have some handle on how fast an attacker can extract
> kstackrng sufficiently to make predictions.
I think you fundamentally can't have both. If you rely on the
reseeding to happen for a particular number of syscalls, you
eventually end up blocking on it, regardless of the context
it runs in. Deferring it to another process means blocking
for longer, and deferring it to a low-priority task would mean
that real-time tasks get less randomness.
Arnd
Hi Jeremy,
kernel test robot noticed the following build warnings:
[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on arm/for-next arm/fixes kvmarm/next soc/for-next linus/master v6.8-rc7 next-20240307]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jeremy-Linton/arm64-syscall-Direct-PRNG-kstack-randomization/20240306-061930
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link: https://lore.kernel.org/r/20240305221824.3300322-2-jeremy.linton%40arm.com
patch subject: [PATCH 1/1] arm64: syscall: Direct PRNG kstack randomization
config: arm64-randconfig-r063-20240307 (https://download.01.org/0day-ci/archive/20240308/[email protected]/config)
compiler: aarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240308/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> arch/arm64/kernel/syscall.c:74:12: warning: 'kstack_rng' defined but not used [-Wunused-function]
74 | static u16 kstack_rng(void) { return 0; }
| ^~~~~~~~~~
vim +/kstack_rng +74 arch/arm64/kernel/syscall.c
71
72 arch_initcall(kstack_init);
73 #else
> 74 static u16 kstack_rng(void) { return 0; }
75 #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
76
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Thu, Mar 07, 2024 at 12:10:34PM +0100, Arnd Bergmann wrote:
> For the strength, we have at least four options:
>
> - strong rng, most expensive
> - your new prng, less strong but somewhat cheaper and/or more
> predictable overhead
> - cycle counter, cheap but probably even less strong,
> needs architecture code.
Are the low bits of a cycle counter really less safe than a
deterministic pRNG?
> - no rng, no overhead and no protection.
For the pRNG, why not just add a reseed timer or something that'll
happen outside the syscall window, if that's the concern about reseeding
delay? (In which case, why not continue to use the strong rng?)
--
Kees Cook
On Thu, Mar 07, 2024 at 12:10:34PM +0100, Arnd Bergmann wrote:
> There is not even any attempt to use the most random bits of
> the cycle counter, as both the high 22 to 24 bits get masked
> out (to keep the wasted stack space small) and the low 3 or 4
> bits get ignored because of stack alignment. If there was
> any desire to make it more random, a trivial improvement
> would be:
>
> +++ b/include/linux/randomize_kstack.h
> @@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> u32 offset = raw_cpu_read(kstack_offset); \
> - offset ^= (rand); \
> + offset = ror32(offset, 5) & (rand); \
Shouldn't this stay ^ instead of & ?
> raw_cpu_write(kstack_offset, offset); \
> } \
> } while (0)
But yeah, we should likely make this change regardless.
> My impression is that is is already bordering on becoming
> a "bespoke rng" implementation that Jason was objecting to,
> so the current version is intentionally left weak in order
> to not even give the appearance of being a security relevant
> feature.
I don't think it's bad to make a trivial improvement to entropy diffusion.
--
Kees Cook
On Thu, Mar 7, 2024, at 20:10, Kees Cook wrote:
> On Thu, Mar 07, 2024 at 12:10:34PM +0100, Arnd Bergmann wrote:
>> For the strength, we have at least four options:
>>
>> - strong rng, most expensive
>> - your new prng, less strong but somewhat cheaper and/or more
>> predictable overhead
>> - cycle counter, cheap but probably even less strong,
>> needs architecture code.
>
> Are the low bits of a cycler counter really less safe than a
> deterministic pRNG?
I don't know, that was based on Jeremy's assertion. I'm not
even convinced that either one is actually safer than no
randomization. ;-)
For both the timer and the pRNG, I guess you'd need a gadget
to extract the current offset and from there extract the
per-cpu state value. Once you know the state and there is no
reseed, you can pick the value you want for the next syscall
by either calling getpid() repeatedly or waiting for the
cycle counter to reach the desired value before calling
a vulnerable syscall. At that point the pRNG and no
randomization are reliably predictable, while the timer
may still have some jitter left.
>> - no rng, no overhead and no protection.
>
> For the pRNG, why not just add a reseed timer or something that'll
> happen outside the syscall window, if that's the concern about reseeding
> delay? (In which case, why not continue to use the strong rng?)
I fear a timer or workqueue would be both weaker than a forced
reseed if the attacker wins the race against the reseed, and
still cause latency that hurts realtime usecases.
Arnd
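The extraction scenario described above can be made concrete with a
short sketch: if an attacker recovers the per-CPU PRNG state once and
there is no reseeding, every future offset is predictable by cloning
the state and stepping it. This again uses an invented xorshift
stand-in rather than the kernel's actual generator, purely for
illustration.

```c
/* Sketch: a leaked PRNG state makes all future offsets predictable. */
#include <stdint.h>

/* One generator step (stand-in for the real PRNG update). */
static uint32_t step(uint32_t x)
{
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return x;
}

/* Given one leaked state value, predict the next n masked offsets. */
static void predict_offsets(uint32_t leaked_state, uint16_t *out, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		leaked_state = step(leaked_state);
		out[i] = leaked_state & 0x1ff;
	}
}
```

Once the predictions match, the attacker can call getpid() repeatedly
to advance the victim state to an offset of their choosing, which is
the "reliably predictable" failure mode Arnd describes.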
On Thu, Mar 7, 2024, at 20:15, Kees Cook wrote:
> On Thu, Mar 07, 2024 at 12:10:34PM +0100, Arnd Bergmann wrote:
>> There is not even any attempt to use the most random bits of
>> the cycle counter, as both the high 22 to 24 bits get masked
>> out (to keep the wasted stack space small) and the low 3 or 4
>> bits get ignored because of stack alignment. If there was
>> any desire to make it more random, a trivial improvement
>> would be:
>>
>> +++ b/include/linux/randomize_kstack.h
>> @@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
>> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
>> &randomize_kstack_offset)) { \
>> u32 offset = raw_cpu_read(kstack_offset); \
>> - offset ^= (rand); \
>> + offset = ror32(offset, 5) & (rand); \
>
> Shouldn't this stay ^ instead of & ?
Yes, my mistake.
>> raw_cpu_write(kstack_offset, offset); \
>> } \
>> } while (0)
>
> But yeah, we should likely make this change regardless.
Ok, do you want to send a patch, or should I?
Arnd
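The diffusion tweak agreed on above (with the '^' that the thread
settled on; the '&' in the quoted hunk was a typo) can be sketched in
isolation. The helper mirrors the kernel's ror32(), and the two mix
functions contrast the current accumulation with the proposed rotated
form; this is a sketch of the idea, not the kernel code.

```c
/* Sketch of kstack_offset accumulation, plain XOR vs rotate-then-XOR. */
#include <stdint.h>

static uint32_t ror32(uint32_t w, unsigned int shift)
{
	return (w >> shift) | (w << ((32 - shift) & 31));
}

/* Current behaviour: offset ^= rand */
static uint32_t mix_plain(uint32_t offset, uint32_t rand)
{
	return offset ^ rand;
}

/* Proposed behaviour: offset = ror32(offset, 5) ^ rand */
static uint32_t mix_ror(uint32_t offset, uint32_t rand)
{
	return ror32(offset, 5) ^ rand;
}
```

The rotate means a repeated rand value no longer cancels the
accumulated offset back to zero, spreading past entropy across more
bit positions of the 9-bit window.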
Hi,
On 3/7/24 05:10, Arnd Bergmann wrote:
> On Wed, Mar 6, 2024, at 22:54, Jeremy Linton wrote:
>> On 3/6/24 14:46, Arnd Bergmann wrote:
>>> On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
>>>> On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
>>>>> The existing arm64 stack randomization uses the kernel rng to acquire
>>>>> 5 bits of address space randomization. This is problematic because it
>>>>> creates non determinism in the syscall path when the rng needs to be
>>>>> generated or reseeded. This shows up as large tail latencies in some
>>>>> benchmarks and directly affects the minimum RT latencies as seen by
>>>>> cyclictest.
>>>>>
>>>>> Other architectures are using timers/cycle counters for this function,
>>>>> which is sketchy from a randomization perspective because it should be
>>>>> possible to estimate this value from knowledge of the syscall return
>>>>> time, and from reading the current value of the timer/counters.
>>>
>>> As I commented on the previous version, I don't want to see
>>> a change that only addresses one architecture like this. If you
>>> are convinced that using a cycle counter is a mistake, then we
>>> should do the same thing on the other architectures as well
>>> that currently use a cycle counter.
>>
>> I personally tend to agree as long as we aren't creating a similar set
>> of problems for those architectures as we are seeing on arm. Currently
>> the kstack rng on/off choice is basically zero overhead for them.
>
> I think we have two separate problems to solve here: how
> strong a kernel stack randomization should be on a given system,
> and who gets to make the decision.
>
> For the strength, we have at least four options:
>
> - strong rng, most expensive
> - your new prng, less strong but somewhat cheaper and/or more
> predictable overhead
> - cycle counter, cheap but probably even less strong,
> needs architecture code.
> - no rng, no overhead and no protection.
>
> On architectures that have a cheap hardware rng instruction, you
> can count that one as well.
>
> For picking between the options, we have
>
> - architecture maintainer
> - defconfig (CONFIG_RANDOMIZE_KSTACK_OFFSET)
> - boot time (randomize_kstack_offset=on)
>
> For both of these lists, I would like to see as few options
> as possible, and in particular I think the architecture
> maintainer should not make that decision for the users.
>
> If we want to improve this, I would start by changing
> the binary CONFIG_RANDOMIZE_KSTACK_OFFSET option into
> a three-way choice between cycle counter (where available),
> strong rng and off, possibly adding the cycle counter
> option to the two architectures that currently don't use
> it, while moving the strong rng into common code.
>
> After that, we can debate adding a fourth option, or
> replacing one of the existing two that is too slow or
> not random enough.
>
>>> Won't that defeat the purpose of the patch that was intended
>>> to make the syscall latency more predictable? At least the
>>> simpler approaches of reseeding from the kstack_rng()
>>> function itself would have this problem, deferring it to
>>> another context comes with a separate set of problems.
>>
>> And that describes why I've not come up with an inline reseeding
>> solution. Which of course isn't a problem on !arm if one just pushes a
>> few bits of a cycle counter into the rnd_state every few dozen syscalls,
>> or whatever. Mark R, mentioned offline the idea of just picking a few
>> bits off CNTVCT as a seed, but its so slow it basically has to be used
>> to fuzz a bit or two of rnd_state on some fairly long interval. Long
>> enough that if someone has a solution for extracting rnd_state it might
>> not add any additional security. Or that is my take, since i'm not a big
>> fan of any independent counter/clock based RNG seeding (AFAIK, entropy
>> from clocks requires multiple _independent_ sources).
>
> I'm not sure I understand the logic. Do you mean that accessing
> CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
> because of the overhead of reseeding?
Slow, as in, it's running at a much lower frequency than a cycle counter.
>
> On powerpc/s390/x86, the low bits of the respective cycle counter
> is simply used without any kind of bit shuffling, and we already
> rely on the cycle counter to be fast to access since it's used
> for timekeeping anywhere.
>
> There is not even any attempt to use the most random bits of
> the cycle counter, as both the high 22 to 24 bits get masked
> out (to keep the wasted stack space small) and the low 3 or 4
> bits get ignored because of stack alignment. If there was
> any desire to make it more random, a trivial improvement
> would be:
>
> +++ b/include/linux/randomize_kstack.h
> @@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> u32 offset = raw_cpu_read(kstack_offset); \
> - offset ^= (rand); \
> + offset = ror32(offset, 5) & (rand); \
> raw_cpu_write(kstack_offset, offset); \
> } \
> } while (0)
>
> My impression is that is is already bordering on becoming
> a "bespoke rng" implementation that Jason was objecting to,
> so the current version is intentionally left weak in order
> to not even give the appearance of being a security relevant
> feature.
>
>> This is a bit out of my wheelhouse, so I defer to anyone with a better
>> feel or some actual data.
>>
>> The best plan I have at the moment is just some deferred work to call
>> kstack_rng_setup on some call or time based interval, which AFAIK isn't
>> ideal for RT workloads which expect ~100% CPU isolation. Plus, that
>> solution assumes we have some handle on how fast an attacker can extract
>> kstackrng sufficiently to make predictions.
>
> I think you fundamentally can't have both. If you rely on the
> reseeding to happen for a particular number of syscalls, you
> eventually end up blocking on it, regardless of the context
> it runs in. Deferring it to another process means blocking
> for longer, and deferring it to a low-priority task would mean
> that real-time tasks get less randomness.
>
> Arnd
On Fri, Mar 8, 2024, at 17:49, Jeremy Linton wrote:
> On 3/7/24 05:10, Arnd Bergmann wrote:
>>
>> I'm not sure I understand the logic. Do you mean that accessing
>> CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
>> because of the overhead of reseeding?
>
> Slow, as in, its running at a much lower frequency than a cycle counter.
Ok, I see. Would it be possible to use PMEVCNTR0 instead?
Arnd
Hi,
Sorry about the delay here, PTO and I actually wanted to verify my
assumptions.
On 3/8/24 14:29, Arnd Bergmann wrote:
> On Fri, Mar 8, 2024, at 17:49, Jeremy Linton wrote:
>> On 3/7/24 05:10, Arnd Bergmann wrote:
>>>
>>> I'm not sure I understand the logic. Do you mean that accessing
>>> CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
>>> because of the overhead of reseeding?
>>
>> Slow, as in, its running at a much lower frequency than a cycle counter.
>
> Ok, I see. Would it be possible to use PMEVCNTR0 instead?
So, I presume you actually mean PMCCNTR_EL0 because I don't see the
point of a dedicated event, although maybe...
So, the first and maybe largest problem is the PMxxx registers are all
optional because the PMU is optional! Although, they are strongly
recommended and most (AFAIK) machines do implement them. So, maybe if
it's just using a cycle counter to dump some entropy into rnd_state it
becomes a statement that kstack randomization is slower or unsupported
if there isn't a PMU?
But then we have to basically enable the PMU cycle counter globally,
which requires reworking how it works, because the cycle counter is
enabled/disabled and reset on the fly depending on whether the user is
trying to profile something. So, I have hacked that up, and it appears
to be working, although i'm not sure what kind of interaction will
happen with KVM yet.
But I guess the larger question is whether its worth changing the PMU
behavior for this?
Thanks,
On Sat, Mar 23, 2024, at 00:40, Jeremy Linton wrote:
> On 3/8/24 14:29, Arnd Bergmann wrote:
>> On Fri, Mar 8, 2024, at 17:49, Jeremy Linton wrote:
>>> On 3/7/24 05:10, Arnd Bergmann wrote:
>>>>
>>>> I'm not sure I understand the logic. Do you mean that accessing
>>>> CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
>>>> because of the overhead of reseeding?
>>>
>>> Slow, as in, its running at a much lower frequency than a cycle counter.
>>
>> Ok, I see. Would it be possible to use PMEVCNTR0 instead?
>
> So, I presume you actually mean PMCCNTR_EL0 because I don't see the
> point of a dedicated event, although maybe...
Right, that would make more sense.
> So, the first and maybe largest problem is the PMxxx registers are all
> optional because the PMU is optional! Although, they are strongly
> recommended and most (AFAIK) machines do implement them. So, maybe if
> its just using a cycle counter to dump some entropy into rnd_state it
> becomes a statement that kstack randomization is slower or unsupported
> if there isn't a PMU?
I think that sounds workable, especially as there is already
the randomize_kstack_offset=on/off conditional at boot time, it
could fall back to just not randomizing and print a warning
if the feature is enabled but unavailable at boot time.
> But then we have to basically enable the PMU cycle counter globally,
> which requires reworking how it works, because the cycle counter is
> enabled/disabled and reset on the fly depending on whether the user is
> trying to profile something. So, I have hacked that up, and it appears
> to be working, although i'm not sure what kind of interaction will
> happen with KVM yet.
>
> But I guess the larger question is whether its worth changing the PMU
> behavior for this?
I don't know too much about how the PMU works in detail, but I'm
also worried about two possible issues that end up preventing us
from using it in practice:
- if enabling PMCCNTR_EL0 takes away one of the limited number
of available counters, we probably don't want to go there
- similarly, I would expect it to have a nonzero power
consumption if the default is to have the clock disabled
and non-counting. Probably not a big deal for server machines,
but could be an issue on battery powered embedded devices.
Arnd