2024-04-28 22:29:34

by H. Peter Anvin

Subject: x86: dynamic pt_regs pointer and kernel stack randomization

So the issue of having an explicit pointer to the user space pt_regs
structure has come up in the context of future FRED extensions, which
may increase the size of the exception stack frame under some
circumstances; that size may not be constant in the first place.

It struck me that this can be combined with kernel stack randomization
in such a way that it mostly amortizes the cost of both.

Downside: for best performance, it should be pushed into the assembly
entry/exit code, although that is technically not *required* (with IDT
it is of course possible to do it in C, but for FRED it belongs in the
one single assembly entry stub.)

In the FRED code it would look like [simplified]:

asm_fred_entrypoint_user:
/* Current code */
/* ... */
movq %rsp,%rdi /* part of FRED_ENTER */

/* New code */
movq %rdi,PER_CPU_VAR(user_pt_regs_ptr) /* [1] */
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
subq PER_CPU_VAR(kstack_offset),%rsp /* [2] */
#endif
/* CFI annotation */

/* Current code */
call fred_entry_from_user

asm_fred_exit_user:
/* New code */
movq PER_CPU_VAR(user_pt_regs_ptr),%rsp
/* CFI annotation */

/* Current code */
/* ... */
ERETU

[1] - This instruction can be pushed into the C code in
fred_entry_from_user() without changing the functionality in any way.

[2] - This is the ONLY instruction in this sequence that would be
specific to CONFIG_RANDOMIZE_KSTACK_OFFSET, and it probably isn't even
worth patching out.

This requires a 64-bit premasked value to be generated by
choose_random_kstack_offset(), which would seem to be a better option for
performance, especially since there is already arithmetic done at that
time. Otherwise it requires three instructions. This means the
randomness accumulator ends up being in a separate variable from the
premasked value. This could be further very slightly optimized by adding
the actual stack location and making this a movq, but then that value
has to be context-switched; this is probably not all that useful.

The masking needs to consider alignment, which the current code doesn't;
that by itself adds instructions to the existing code sequences.



That is literally *all* the code that is needed to replace
add_random_kstack_offset(). It doesn't block tailcall optimization
anywhere. If user_pt_regs_ptr and kstack_offset share a cache line it
becomes even cheaper.

Note that at least in the FRED case this code would be invoked even on
events other than system calls, some of which may be controllable by the
user, like page faults. I am guessing that this is actually a plus.

-hpa



2024-04-29 16:16:51

by Kees Cook

Subject: Re: x86: dynamic pt_regs pointer and kernel stack randomization

On Sun, Apr 28, 2024 at 03:28:44PM -0700, H. Peter Anvin wrote:
> So the issue of having an explicit pointer to the user space pt_regs
> structure has come up in the context of future FRED extensions, which may
> increase the size of the exception stack frame under some circumstances,
> which may not be constant in the first place.
>
> It struck me that this can be combined with kernel stack randomization in
> such a way that it mostly amortizes the cost of both.

I've tried to get up to speed with what FRED changes in x86, and IIUC
it effectively redefines interrupt handling? Does it also change
syscall entry? (I got the impression it does not, but the mention of
syscalls later in your email makes me think I've gotten this wrong.)

I think currently kstack randomization isn't applied to interrupts (but
it should be), so I'm all for finding a way to make this happen.

> Downside: for best performance, it should be pushed into assembly entry/exit
> code, although that is technically not *required* (and it is of course
> possible to do it in C on IDT but in the one single assembly entry stub for
> FRED.)
>
> In the FRED code it would look like [simplified]:
>
> asm_fred_entrypoint_user:
> /* Current code */
> /* ... */
> movq %rsp,%rdi /* part of FRED_ENTER */
>
> /* New code */
> movq %rdi,PER_CPU_VAR(user_pt_regs_ptr) /* [1] */
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> subq PER_CPU_VAR(kstack_offset),%rsp /* [2] */
> #endif
> /* CFI annotation */
>
> /* Current code */
> call fred_entry_from_user
>
> asm_fred_exit_user:
> /* New code */
> movq PER_CPU_VAR(user_pt_regs_ptr),%rsp
> /* CFI annotation */
>
> /* Current code */
> /* ... */
> ERETU

If I'm understanding FRED correctly, I think this exit path
would need to call choose_random_kstack_offset() as well. (For
syscalls, add_random_kstack_offset() is used on entry and
choose_random_kstack_offset() is used on exit, so I think we'd want the
same pattern for interrupt handling.)

> [1] - This instruction can be pushed into the C code in
> fred_entry_from_user() without changing the functionality in any way.
>
> [2] - This is the ONLY instruction in this sequence that would be specific
> to CONFIG_RANDOMIZE_KSTACK_OFFSET, and it probably isn't even worth patching
> out.
>
> This requires a 64-bit premasked value to be generated by
> choose_random_kstack_offset() which would seem to be a better option for
> performance, especially since there is already arithmetic done at that time.
> Otherwise it requires three instructions. This means the randomness
> accumulator ends up being in a separate variable from the premasked value.
> This could be further very slightly optimized by adding the actual stack
> location and making this a movq, but then that value has to be
> context-switched; this is probably not all that useful.

Yeah, having a pre-masked value ready to go seems fine to me. It's
simply a space trade-off.

> The masking needs to consider alignment, which the current code doesn't;
> that by itself adds additional code to the current code sequences.
>
>
> That is literally *all* the code that is needed to replace
> add_random_kstack_offset(). It doesn't block tailcall optimization anywhere.
> If user_pt_regs_ptr and kstack_offset share a cache line it becomes even
> cheaper.
>
> Note that at least in the FRED case this code would be invoked even on
> events other than system calls, some of which may be controllable by the
> user, like page faults. I am guessing that this is actually a plus.

Yeah, I'd like greater coverage for ring 3->0 transitions. We do want to
double-check the original design choices, though. I *think* they still
stand (see the comments for choose_random_kstack_offset() about entropy
location (per-cpu area) and lifetime (across userspace execution)).

-Kees

--
Kees Cook

2024-04-29 19:18:31

by H. Peter Anvin

Subject: Re: x86: dynamic pt_regs pointer and kernel stack randomization

On April 29, 2024 9:09:02 AM PDT, Kees Cook <[email protected]> wrote:
>On Sun, Apr 28, 2024 at 03:28:44PM -0700, H. Peter Anvin wrote:
>> So the issue of having an explicit pointer to the user space pt_regs
>> structure has come up in the context of future FRED extensions, which may
>> increase the size of the exception stack frame under some circumstances,
>> which may not be constant in the first place.
>>
>> It struck me that this can be combined with kernel stack randomization in
>> such a way that it mostly amortizes the cost of both.
>
>I've tried to get up to speed with what FRED changes in x86, and IIUC
>it effectively redefines interrupt handling? Does it also change
>syscall entry? (I got the impression it does not, but the mention of
>syscalls later in your email makes me think I've gotten this wrong.)
>
>I think currently kstack randomization isn't applied to interrupts (but
>it should be), so I'm all for finding a way to make this happen.
>
>> Downside: for best performance, it should be pushed into assembly entry/exit
>> code, although that is technically not *required* (and it is of course
>> possible to do it in C on IDT but in the one single assembly entry stub for
>> FRED.)
>>
>> In the FRED code it would look like [simplified]:
>>
>> asm_fred_entrypoint_user:
>> /* Current code */
>> /* ... */
>> movq %rsp,%rdi /* part of FRED_ENTER */
>>
>> /* New code */
>> movq %rdi,PER_CPU_VAR(user_pt_regs_ptr) /* [1] */
>> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>> subq PER_CPU_VAR(kstack_offset),%rsp /* [2] */
>> #endif
>> /* CFI annotation */
>>
>> /* Current code */
>> call fred_entry_from_user
>>
>> asm_fred_exit_user:
>> /* New code */
>> movq PER_CPU_VAR(user_pt_regs_ptr),%rsp
>> /* CFI annotation */
>>
>> /* Current code */
>> /* ... */
>> ERETU
>
>If I'm understanding FRED correctly, I think this exit path
>would need to call choose_random_kstack_offset() as well. (For
>syscalls, add_random_kstack_offset() is used on entry and
>choose_random_kstack_offset() is used on exit, so I think we'd want the
>same pattern for interrupt handling.)
>
>> [1] - This instruction can be pushed into the C code in
>> fred_entry_from_user() without changing the functionality in any way.
>>
>> [2] - This is the ONLY instruction in this sequence that would be specific
>> to CONFIG_RANDOMIZE_KSTACK_OFFSET, and it probably isn't even worth patching
>> out.
>>
>> This requires a 64-bit premasked value to be generated by
>> choose_random_kstack_offset() which would seem to be a better option for
>> performance, especially since there is already arithmetic done at that time.
>> Otherwise it requires three instructions. This means the randomness
>> accumulator ends up being in a separate variable from the premasked value.
>> This could be further very slightly optimized by adding the actual stack
>> location and making this a movq, but then that value has to be
>> context-switched; this is probably not all that useful.
>
>Yeah, having a pre-masked value ready to go seems fine to me. It's
>simply a space trade-off.
>
>> The masking needs to consider alignment, which the current code doesn't;
>> that by itself adds additional code to the current code sequences.
>>
>>
>> That is literally *all* the code that is needed to replace
>> add_random_kstack_offset(). It doesn't block tailcall optimization anywhere.
>> If user_pt_regs_ptr and kstack_offset share a cache line it becomes even
>> cheaper.
>>
>> Note that at least in the FRED case this code would be invoked even on
>> events other than system calls, some of which may be controllable by the
>> user, like page faults. I am guessing that this is actually a plus.
>
>Yeah, I'd like greater coverage for ring 3->0 transitions. We do want to
>double-check the original design choices, though. I *think* they still
>stand (see the comments for choose_random_kstack_offset() about entropy
>location (per-cpu area) and lifetime (across userspace execution)).
>
>-Kees
>

FRED replaces ALL exception handling and ring transitions, including system calls and call gates.

You can't enter ring 0 without going through the FRED entry point (and you can't enter rings 1 or 2 at all, officially deprecating them from the architecture.)

2024-04-29 20:58:09

by H. Peter Anvin

Subject: Re: x86: dynamic pt_regs pointer and kernel stack randomization

On 4/29/24 09:09, Kees Cook wrote:
>
> If I'm understanding FRED correctly, I think this exit path
> would need to call choose_random_kstack_offset() as well. (For
> syscalls, add_random_kstack_offset() is used on entry and
> choose_random_kstack_offset() is used on exit, so I think we'd want the
> same pattern for interrupt handling.)
>

Yes; I didn't include it here because it doesn't affect the assembly
flow per se, since it "simply" sets up the parameters for the next entry
and so at least logically can be executed more or less anywhere.

> Yeah, I'd like greater coverage for ring 3->0 transitions. We do want to
> double-check the original design choices, though. I *think* they still
> stand (see the comments for choose_random_kstack_offset() about entropy
> location (per-cpu area), and lifetime (across userspace execution).

Yeah, I'm not super sure what exactly the constraints really are;
they are written in a way that signals to me that there is implied
context that isn't clear to me, especially the bit about "long-running
syscalls".

-hpa