Hi!
I was hoping to start a conversation about PKU usage for threads in two
places, which Stephen R?ttger brought to my attention, with the hope of
being able to use these in Chrome:
1) It appears to be a bug that a thread without the correct PK can make
VMAs covered by a separate PK, out from under other threads. (e.g. mmap
a new mapping to wipe out the defined PK for it.) It seems that PK checks
should be made when modifying VMAs.
2) It would be very helpful to have a mechanism for the signal stack to
be PK aware, in the sense that the kernel would switch to a predefined
PK. i.e. having a new interface to sigaltstack() which includes a PK.
Are either of these something the PKU authors have considered? (Or are
there some details we're missing in this area?)
Thanks!
-Kees
--
Kees Cook
On 8/22/22 13:40, Kees Cook wrote:
> 1) It appears to be a bug that a thread without the correct PK can make
> VMAs covered by a separate PK, out from under other threads. (e.g. mmap
> a new mapping to wipe out the defined PK for it.) It seems that PK checks
> should be made when modifying VMAs.
Hi Kees,
Could you give an example of this? Is this something along the lines of
a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
than that?
I'm not sure I know of any bugs in the area.
> 2) It would be very helpful to have a mechanism for the signal stack to
> be PK aware, in the sense that the kernel would switch to a predefined
> PK. i.e. having a new interface to sigaltstack() which includes a PK.
Are you thinking that when switching to the sigaltstack that it would
also pick up a specific PKRU value? Or, that it would ensure that PKRU
allows access to the sigaltstack's pkey? Logically something like this:
stack_t sas = {
ss_sp = stack_ptr;
ss_flags = ... flags;
ss_size = ...;
ss_pkey = 12;
};
Then the kernel would set up RSP to point to ss_sp, and do (logically):
pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
before building the signal frame running the signal handler?
> Are either of these something the PKU authors have considered? (Or are
> there some details we're missing in this area?)
We've talked about having signal behavior which might give different
PKRU values at signal entry, but nothing too concrete. Something like
that wouldn't be *awful* to implement. It would also be nice that it
would be confined to folks that set up special signal handlers anyway
and that context is already pretty special.
I'd love to hear more why this behavior is useful and how it will be used.
Hey Dave,
> I'd love to hear more why this behavior is useful and how it will be used.
The background is that we want to build a way to temporarily isolate threads
from each other, which we then want to use for CFI for V8 (Chrome's
JavaScript engine).
One of the main challenges we have is to protect the JIT compiled code from
an attacker with an arbitrary read/write primitive. For that, we're
planning to use PKU
to have access to some trusted memory, i.e. memory that can't be
corrupted by the
attacker.
You can find more details on our plans here:
https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit?usp=sharing
While designing this mitigation, we came up with two attack vectors for which we
believe we need help from the kernel to fix them, which are the two cases Kees
mentioned:
1) an attacker controlling some argument to certain VMA operations
like munmap().
2) an attacker corrupting the saved signal context on a signal handler stack.
On Mon, Aug 22, 2022 at 11:11 PM Dave Hansen <[email protected]> wrote:
>
> On 8/22/22 13:40, Kees Cook wrote:
> > 1) It appears to be a bug that a thread without the correct PK can make
> > VMAs covered by a separate PK, out from under other threads. (e.g. mmap
> > a new mapping to wipe out the defined PK for it.) It seems that PK checks
> > should be made when modifying VMAs.
>
> Hi Kees,
>
> Could you give an example of this? Is this something along the lines of
> a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
> than that?
Yes, that's one example. And the same applies to other operations on the
VMA. E.g. another case we'd like to prevent would be munmap(addr) where
addr is covered by a pkey to which the calling thread doesn't have access
permissions to.
You can find more details in this doc:
https://docs.google.com/document/d/1qqVoVfRiF2nRylL3yjZyCQvzQaej1HRPh3f5wj1AS9I/edit?usp=sharing
> > 2) It would be very helpful to have a mechanism for the signal stack to
> > be PK aware, in the sense that the kernel would switch to a predefined
> > PK. i.e. having a new interface to sigaltstack() which includes a PK.
>
> Are you thinking that when switching to the sigaltstack that it would
> also pick up a specific PKRU value? Or, that it would ensure that PKRU
> allows access to the sigaltstack's pkey?
Either of those would work for us.
> Logically something like this:
>
> stack_t sas = {
> ss_sp = stack_ptr;
> ss_flags = ... flags;
> ss_size = ...;
> ss_pkey = 12;
> };
>
> Then the kernel would set up RSP to point to ss_sp, and do (logically):
>
> pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
>
> before building the signal frame running the signal handler?
Yeah, that would work for our use case.
We also have a doc discussing this in more detail :) :
https://docs.google.com/document/d/1OlnJbR5TMoaOAJsf4hHOc-FdTmYK2aDUI7d2hfCZSOo/edit?usp=sharing&resourcekey=0-v9UJXONYsnG5PlCBbcYqIw
On Tue, Aug 23, 2022, at 11:12 AM, Dave Hansen wrote:
> On 8/23/22 04:08, Stephen Röttger wrote:
>> On Mon, Aug 22, 2022 at 11:11 PM Dave Hansen <[email protected]> wrote:
>>> On 8/22/22 13:40, Kees Cook wrote:
>>>> 1) It appears to be a bug that a thread without the correct PK can make
>>>> VMAs covered by a separate PK, out from under other threads. (e.g. mmap
>>>> a new mapping to wipe out the defined PK for it.) It seems that PK checks
>>>> should be made when modifying VMAs.
>>>
>>> Could you give an example of this? Is this something along the lines of
>>> a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
>>> than that?
>>
>> Yes, that's one example. And the same applies to other operations on the
>> VMA. E.g. another case we'd like to prevent would be munmap(addr) where
>> addr is covered by a pkey to which the calling thread doesn't have access
>> permissions to.
>
> Yeah, that's something for which our defenses are quite weak. But, it
> also calls for a very generic mm/ solution and not something specific at
> all to pkeys.
>
> I assume that you wouldn't want to turn off *all* mmap(), MAP_FIXED or
> munmap() in the process. You just want to make one or more VMAs more or
> less immutable. That _sounds_ like a topic that would have broached at
> some point in the past, although it doesn't ring any bells.
>
> The concept would make a good lightning talk at Plumbers of LSF/MM.
This kind of thing seems questionable to me. If the attacker controls syscall arguments, they can do almost anything. ISTM a CFI scheme should aim to prevent that bogus call in the first place, e.g. by preventing a problematic call.
Which makes me think that the actual solution is to have syscall interception support changing PKRU, perhaps via sigaltstack.
>
>>>> 2) It would be very helpful to have a mechanism for the signal stack to
>>>> be PK aware, in the sense that the kernel would switch to a predefined
>>>> PK. i.e. having a new interface to sigaltstack() which includes a PK.
>>>
>>> Are you thinking that when switching to the sigaltstack that it would
>>> also pick up a specific PKRU value? Or, that it would ensure that PKRU
>>> allows access to the sigaltstack's pkey?
>>
>> Either of those would work for us.
>>
>>> Logically something like this:
>>>
>>> stack_t sas = {
>>> ss_sp = stack_ptr;
>>> ss_flags = ... flags;
>>> ss_size = ...;
>>> ss_pkey = 12;
>>> };
>>>
>>> Then the kernel would set up RSP to point to ss_sp, and do (logically):
>>>
>>> pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
>>>
>>> before building the signal frame running the signal handler?
>>
>> Yeah, that would work for our use case.
>> We also have a doc discussing this in more detail :) :
>
> That doesn't seem like it would be too much of a stretch. There's a
> delicate point when building the stack frame that the kernel would need
> to move over to the new PKRU value to build the frame before it writes
> the *OLD* value to the frame. But, it's far from impossible.
>
> I also bet we could do this with minimal new ABI. There's already a
> ->ss_flags field. We could assign a flag to mean that stack_t doesn't
> end at '->ss_size' and that there's a pkey value *after* ss_size. I do
> think having a single pkey that is made accessible before signal entry
> is a more flexible ABI than taking an explicit PKRU value.
>
> I think that would allow just reusing sys_sigaltstack().
sys_sigaltstack() is already pretty much useless with SHSTK, and it’s kinda busted with AVX512. How about we just add a whole new non-kludgy API?
On 8/23/22 04:08, Stephen Röttger wrote:
> On Mon, Aug 22, 2022 at 11:11 PM Dave Hansen <[email protected]> wrote:
>> On 8/22/22 13:40, Kees Cook wrote:
>>> 1) It appears to be a bug that a thread without the correct PK can make
>>> VMAs covered by a separate PK, out from under other threads. (e.g. mmap
>>> a new mapping to wipe out the defined PK for it.) It seems that PK checks
>>> should be made when modifying VMAs.
>>
>> Could you give an example of this? Is this something along the lines of
>> a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
>> than that?
>
> Yes, that's one example. And the same applies to other operations on the
> VMA. E.g. another case we'd like to prevent would be munmap(addr) where
> addr is covered by a pkey to which the calling thread doesn't have access
> permissions to.
Yeah, that's something for which our defenses are quite weak. But, it
also calls for a very generic mm/ solution and not something specific at
all to pkeys.
I assume that you wouldn't want to turn off *all* mmap(), MAP_FIXED or
munmap() in the process. You just want to make one or more VMAs more or
less immutable. That _sounds_ like a topic that would have broached at
some point in the past, although it doesn't ring any bells.
The concept would make a good lightning talk at Plumbers of LSF/MM.
>>> 2) It would be very helpful to have a mechanism for the signal stack to
>>> be PK aware, in the sense that the kernel would switch to a predefined
>>> PK. i.e. having a new interface to sigaltstack() which includes a PK.
>>
>> Are you thinking that when switching to the sigaltstack that it would
>> also pick up a specific PKRU value? Or, that it would ensure that PKRU
>> allows access to the sigaltstack's pkey?
>
> Either of those would work for us.
>
>> Logically something like this:
>>
>> stack_t sas = {
>> ss_sp = stack_ptr;
>> ss_flags = ... flags;
>> ss_size = ...;
>> ss_pkey = 12;
>> };
>>
>> Then the kernel would set up RSP to point to ss_sp, and do (logically):
>>
>> pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
>>
>> before building the signal frame running the signal handler?
>
> Yeah, that would work for our use case.
> We also have a doc discussing this in more detail :) :
That doesn't seem like it would be too much of a stretch. There's a
delicate point when building the stack frame that the kernel would need
to move over to the new PKRU value to build the frame before it writes
the *OLD* value to the frame. But, it's far from impossible.
I also bet we could do this with minimal new ABI. There's already a
->ss_flags field. We could assign a flag to mean that stack_t doesn't
end at '->ss_size' and that there's a pkey value *after* ss_size. I do
think having a single pkey that is made accessible before signal entry
is a more flexible ABI than taking an explicit PKRU value.
I think that would allow just reusing sys_sigaltstack().
On Tue, Aug 23, 2022 at 8:24 PM Andy Lutomirski <[email protected]> wrote:
>
>
>
> On Tue, Aug 23, 2022, at 11:12 AM, Dave Hansen wrote:
> > On 8/23/22 04:08, Stephen Röttger wrote:
> >> On Mon, Aug 22, 2022 at 11:11 PM Dave Hansen <[email protected]> wrote:
> >>> On 8/22/22 13:40, Kees Cook wrote:
> >>>> 1) It appears to be a bug that a thread without the correct PK can make
> >>>> VMAs covered by a separate PK, out from under other threads. (e.g. mmap
> >>>> a new mapping to wipe out the defined PK for it.) It seems that PK checks
> >>>> should be made when modifying VMAs.
> >>>
> >>> Could you give an example of this? Is this something along the lines of
> >>> a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
> >>> than that?
> >>
> >> Yes, that's one example. And the same applies to other operations on the
> >> VMA. E.g. another case we'd like to prevent would be munmap(addr) where
> >> addr is covered by a pkey to which the calling thread doesn't have access
> >> permissions to.
> >
> > Yeah, that's something for which our defenses are quite weak. But, it
> > also calls for a very generic mm/ solution and not something specific at
> > all to pkeys.
We were also thinking about if this should be a more generic feature instead of
being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
something like a memory seal/unseal syscall.
I was personally leaning towards using pkeys for this for a few reasons:
* intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
to also mean disable all changes to the memory, not just the data.
* We couldn't come up with another use case for the more generic API that
doesn't include pkeys.
* Fewer syscalls for our use case, since we already set the pkey on the mappings
we want to become immutable.
That being said, either approach works for us.
> > The concept would make a good lightning talk at Plumbers of LSF/MM.
>
> This kind of thing seems questionable to me. If the attacker controls syscall arguments, they can do almost anything. ISTM a CFI scheme should aim to prevent that bogus call in the first place, e.g. by preventing a problematic call.
I'm not sure that CFI can solve this problem since the attacker
doesn't change the
control flow in this attack. We will definitely have to ensure that
certain syscall
arguments can't be controlled by the attacker, e.g. if the protection
bits of an mmap
syscall are ever spilled to the stack, that would be one way to bypass
the mitigation
and we have to ensure in userspace that this doesn't happen.
However, this seems infeasible for common syscalls like munmap. If any thread
wants to unmap a page, it's likely that the data for that came from a
place that the
attacker might have corrupted.
> Which makes me think that the actual solution is to have syscall interception support changing PKRU, perhaps via sigaltstack.
Can you expand on this? I didn't really understand what you meant.
>
> >
> >>>> 2) It would be very helpful to have a mechanism for the signal stack to
> >>>> be PK aware, in the sense that the kernel would switch to a predefined
> >>>> PK. i.e. having a new interface to sigaltstack() which includes a PK.
> >>>
> >>> Are you thinking that when switching to the sigaltstack that it would
> >>> also pick up a specific PKRU value? Or, that it would ensure that PKRU
> >>> allows access to the sigaltstack's pkey?
> >>
> >> Either of those would work for us.
> >>
> >>> Logically something like this:
> >>>
> >>> stack_t sas = {
> >>> ss_sp = stack_ptr;
> >>> ss_flags = ... flags;
> >>> ss_size = ...;
> >>> ss_pkey = 12;
> >>> };
> >>>
> >>> Then the kernel would set up RSP to point to ss_sp, and do (logically):
> >>>
> >>> pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
> >>>
> >>> before building the signal frame running the signal handler?
> >>
> >> Yeah, that would work for our use case.
> >> We also have a doc discussing this in more detail :) :
> >
> > That doesn't seem like it would be too much of a stretch. There's a
> > delicate point when building the stack frame that the kernel would need
> > to move over to the new PKRU value to build the frame before it writes
> > the *OLD* value to the frame. But, it's far from impossible.
> >
> > I also bet we could do this with minimal new ABI. There's already a
> > ->ss_flags field. We could assign a flag to mean that stack_t doesn't
> > end at '->ss_size' and that there's a pkey value *after* ss_size. I do
> > think having a single pkey that is made accessible before signal entry
> > is a more flexible ABI than taking an explicit PKRU value.
> >
> > I think that would allow just reusing sys_sigaltstack().
>
> sys_sigaltstack() is already pretty much useless with SHSTK, and it’s kinda busted with AVX512. How about we just add a whole new non-kludgy API?
On 8/24/22 01:51, Stephen Röttger wrote:
>>> Yeah, that's something for which our defenses are quite weak. But, it
>>> also calls for a very generic mm/ solution and not something specific at
>>> all to pkeys.
> We were also thinking about if this should be a more generic feature instead of
> being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
> something like a memory seal/unseal syscall.
> I was personally leaning towards using pkeys for this for a few reasons:
> * intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
> to also mean disable all changes to the memory, not just the data.
It would make some sense, but we can't do it with the existing
PKEY_DISABLE_ACCESS ABI. It would surely break existing users if they
couldn't munmap() memory that was PKEY_DISABLE_ACCESS.
But, making it part of the mprotect() ABI wouldn't be the worst thing in
the world. Since we have a pkey_mprotect(), any mprotect()-based
mechanism could even reuse the existing pkey syscalls.
I do agree with Andy, though, that I'm not quite sure what the attack
model is here. If an attacker can make arbitrary system calls, surely
protecting one little altstack VMA isn't doing to help much.
On Wed, Aug 24, 2022, at 1:51 AM, Stephen Röttger wrote:
> On Tue, Aug 23, 2022 at 8:24 PM Andy Lutomirski <[email protected]> wrote:
>>
>>
>>
>> On Tue, Aug 23, 2022, at 11:12 AM, Dave Hansen wrote:
>> > On 8/23/22 04:08, Stephen Röttger wrote:
>> >> On Mon, Aug 22, 2022 at 11:11 PM Dave Hansen <[email protected]> wrote:
>> >>> On 8/22/22 13:40, Kees Cook wrote:
>> >>>> 1) It appears to be a bug that a thread without the correct PK can make
>> >>>> VMAs covered by a separate PK, out from under other threads. (e.g. mmap
>> >>>> a new mapping to wipe out the defined PK for it.) It seems that PK checks
>> >>>> should be made when modifying VMAs.
>> >>>
>> >>> Could you give an example of this? Is this something along the lines of
>> >>> a mmap(MAP_FIXED) wiping out an earlier mapping? Or, is it more subtle
>> >>> than that?
>> >>
>> >> Yes, that's one example. And the same applies to other operations on the
>> >> VMA. E.g. another case we'd like to prevent would be munmap(addr) where
>> >> addr is covered by a pkey to which the calling thread doesn't have access
>> >> permissions to.
>> >
>> > Yeah, that's something for which our defenses are quite weak. But, it
>> > also calls for a very generic mm/ solution and not something specific at
>> > all to pkeys.
>
> We were also thinking about if this should be a more generic feature instead of
> being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
> something like a memory seal/unseal syscall.
> I was personally leaning towards using pkeys for this for a few reasons:
> * intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
> to also mean disable all changes to the memory, not just the data.
> * We couldn't come up with another use case for the more generic API that
> doesn't include pkeys.
> * Fewer syscalls for our use case, since we already set the pkey on the mappings
> we want to become immutable.
> That being said, either approach works for us.
>
>> > The concept would make a good lightning talk at Plumbers of LSF/MM.
>>
>> This kind of thing seems questionable to me. If the attacker controls syscall arguments, they can do almost anything. ISTM a CFI scheme should aim to prevent that bogus call in the first place, e.g. by preventing a problematic call.
What I'm trying to say is: CFI, by itself, can protect syscalls by making sure that callers are safe. So, for example, if all munmap() callers do:
if (addr is dangerous)
abort();
else
munmap();
Then, with CFI, an attacker can't get to the actual munmap without first doing the dangerous check. And you can implement this entirely in user code.
With syscall user dispatch (this thing: https://lwn.net/Articles/828510/ -- sorry, I meant that when I typed interception), you even have a way to intercept *all* munmap() calls, for example.
Regardless, I think it would be reasonable to have a nice way to create a sigaltstack and make it harder for the stack and the associated code to get corrupted. I"m not sure that pkeys are the right solution, but a nice coherent package that makes it difficult to accidentally change what happens when a specific signal is delivered could be quite powerful.
>
> I'm not sure that CFI can solve this problem since the attacker
> doesn't change the
> control flow in this attack. We will definitely have to ensure that
> certain syscall
> arguments can't be controlled by the attacker, e.g. if the protection
> bits of an mmap
> syscall are ever spilled to the stack, that would be one way to bypass
> the mitigation
> and we have to ensure in userspace that this doesn't happen.
> However, this seems infeasible for common syscalls like munmap. If any thread
> wants to unmap a page, it's likely that the data for that came from a
> place that the
> attacker might have corrupted.
>
>> Which makes me think that the actual solution is to have syscall interception support changing PKRU, perhaps via sigaltstack.
>
> Can you expand on this? I didn't really understand what you meant.
>
>>
>> >
>> >>>> 2) It would be very helpful to have a mechanism for the signal stack to
>> >>>> be PK aware, in the sense that the kernel would switch to a predefined
>> >>>> PK. i.e. having a new interface to sigaltstack() which includes a PK.
>> >>>
>> >>> Are you thinking that when switching to the sigaltstack that it would
>> >>> also pick up a specific PKRU value? Or, that it would ensure that PKRU
>> >>> allows access to the sigaltstack's pkey?
>> >>
>> >> Either of those would work for us.
>> >>
>> >>> Logically something like this:
>> >>>
>> >>> stack_t sas = {
>> >>> ss_sp = stack_ptr;
>> >>> ss_flags = ... flags;
>> >>> ss_size = ...;
>> >>> ss_pkey = 12;
>> >>> };
>> >>>
>> >>> Then the kernel would set up RSP to point to ss_sp, and do (logically):
>> >>>
>> >>> pkkru &= ~(3<<(12*2)); // clear Write and Access-disable for pkey-12
>> >>>
>> >>> before building the signal frame running the signal handler?
>> >>
>> >> Yeah, that would work for our use case.
>> >> We also have a doc discussing this in more detail :) :
>> >
>> > That doesn't seem like it would be too much of a stretch. There's a
>> > delicate point when building the stack frame that the kernel would need
>> > to move over to the new PKRU value to build the frame before it writes
>> > the *OLD* value to the frame. But, it's far from impossible.
>> >
>> > I also bet we could do this with minimal new ABI. There's already a
>> > ->ss_flags field. We could assign a flag to mean that stack_t doesn't
>> > end at '->ss_size' and that there's a pkey value *after* ss_size. I do
>> > think having a single pkey that is made accessible before signal entry
>> > is a more flexible ABI than taking an explicit PKRU value.
>> >
>> > I think that would allow just reusing sys_sigaltstack().
>>
>> sys_sigaltstack() is already pretty much useless with SHSTK, and it’s kinda busted with AVX512. How about we just add a whole new non-kludgy API?
>
> Attachments:
> * smime.p7s
On Wed, Aug 24, 2022 at 6:28 PM Dave Hansen <[email protected]> wrote:
>
> On 8/24/22 01:51, Stephen Röttger wrote:
> >>> Yeah, that's something for which our defenses are quite weak. But, it
> >>> also calls for a very generic mm/ solution and not something specific at
> >>> all to pkeys.
> > We were also thinking about if this should be a more generic feature instead of
> > being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
> > something like a memory seal/unseal syscall.
> > I was personally leaning towards using pkeys for this for a few reasons:
> > * intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
> > to also mean disable all changes to the memory, not just the data.
>
> It would make some sense, but we can't do it with the existing
> PKEY_DISABLE_ACCESS ABI. It would surely break existing users if they
> couldn't munmap() memory that was PKEY_DISABLE_ACCESS.
Our thought was that this could be opt-in with a prctl().
> But, making it part of the mprotect() ABI wouldn't be the worst thing in
> the world. Since we have a pkey_mprotect(), any mprotect()-based
> mechanism could even reuse the existing pkey syscalls.
>
> I do agree with Andy, though, that I'm not quite sure what the attack
> model is here. If an attacker can make arbitrary system calls, surely
> protecting one little altstack VMA isn't doing to help much.
Note that we don't assume arbitrary syscalls. We only expect the
attacker to be able to control a subset of arguments to a subset of
syscalls.
We run with a seccomp filter that greatly limits available syscalls.
And for arguments, we will ensure in code that sensitive arguments
won't touch attacker-writable memory (e.g. the prot argument in
mprotect()).
But this is hard to do for things like munmap(addr), that's why we're
hoping that the kernel can help us out for that subset of syscalls.
> >> This kind of thing seems questionable to me. If the attacker controls syscall arguments, they can do almost anything. ISTM a CFI scheme should aim to prevent that bogus call in the first place, e.g. by preventing a problematic call.
>
>
> What I'm trying to say is: CFI, by itself, can protect syscalls by making sure that callers are safe. So, for example, if all munmap() callers do:
>
> if (addr is dangerous)
> abort();
> else
> munmap();
>
> Then, with CFI, an attacker can't get to the actual munmap without first doing the dangerous check. And you can implement this entirely in user code.
>
> With syscall user dispatch (this thing: https://lwn.net/Articles/828510/ -- sorry, I meant that when I typed interception), you even have a way to intercept *all* munmap() calls, for example.
Ah I see. Yeah, you're right. It should be possible to do these checks
in user code. Though, it would come with some challenges.
For once, the `addr is dangerous` check is not as easy as in kernel
space since AFAIK we can't check the pkey of a given address.
So we'd probably switch our stack and pkey state to traverse some data
structure to verify this. That, or keeping a dedicated range for these
protected mappings.
I'm also not sure if this would lead to noticeable perf regressions,
hopefully these memory management operations should not be in any hot
paths.
I'll investigate more if this would be a feasible alternative for us.
Side note: it seems like seccomp already allows us to do this instead
of syscall user dispatch or is there a feature in the latter that
would be useful?
On 8/25/22 05:30, Stephen Röttger wrote:
>>> We were also thinking about if this should be a more generic feature instead of
>>> being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
>>> something like a memory seal/unseal syscall.
>>> I was personally leaning towards using pkeys for this for a few reasons:
>>> * intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
>>> to also mean disable all changes to the memory, not just the data.
>> It would make some sense, but we can't do it with the existing
>> PKEY_DISABLE_ACCESS ABI. It would surely break existing users if they
>> couldn't munmap() memory that was PKEY_DISABLE_ACCESS.
> Our thought was that this could be opt-in with a prctl().
So, today, you have this:
foo = malloc(PAGE_SIZE);
pkey_mprotect(foo, PAGE_SIZE, READ|WRITE, pkey=1);
munmap(foo); // <-- works fine
mmap(hint=foo, ...); // now attacker controls &foo
Which is problematic. What you want instead is something like this:
prctl(PR_ARCH_NO_MUNMAP_ON_PKEY); // or whatever
foo = malloc(PAGE_SIZE);
pkey_mprotect(foo, PAGE_SIZE, READ|WRITE, pkey=1);
wrpkru(PKEY_DISABLE_ACCESS<<pkey*2);
munmap(foo); // returns -EPERM (or whatever)
Which requires the kernel to check when it's modifying a VMA (like the
munmap() above) to see if PKRU _currently_ permits access to the VMA's
contents. If not, the kernel should refuse to modify the VMA.
Like I said, I don't think this is _insane_, but I can see it breaking
perfectly innocent things. For instance, an app that today does a
free() if pkey-assigned memory might work perfectly fine for a long time
since that memory is rarely unmapped. But, the minute that malloc()
decides it needs to zap the memory, *malloc()* will fail.
I also wonder how far these semantics would go. Would madvise() work on
these access-denied VMAs?
My gut says that we don't want to mix up pkey semantics with this new
mechanism.
On Thu, Aug 25, 2022, at 7:36 AM, Dave Hansen wrote:
> On 8/25/22 05:30, Stephen Röttger wrote:
>>>> We were also thinking about if this should be a more generic feature instead of
>>>> being tied to pkeys. I.e. the doc above has an alternative proposal to introduce
>>>> something like a memory seal/unseal syscall.
>>>> I was personally leaning towards using pkeys for this for a few reasons:
>>>> * intuitively it would make sense to me to extend PKEY_DISABLE_ACCESS
>>>> to also mean disable all changes to the memory, not just the data.
>>> It would make some sense, but we can't do it with the existing
>>> PKEY_DISABLE_ACCESS ABI. It would surely break existing users if they
>>> couldn't munmap() memory that was PKEY_DISABLE_ACCESS.
>> Our thought was that this could be opt-in with a prctl().
I know Linux never copies other OSes, but OpenBSD is considering this:
https://undeadly.org/cgi?action=article;sid=20220902100648
If it works well, we could implement it.
>
> So, today, you have this:
>
> foo = malloc(PAGE_SIZE);
> pkey_mprotect(foo, PAGE_SIZE, READ|WRITE, pkey=1);
> munmap(foo); // <-- works fine
> mmap(hint=foo, ...); // now attacker controls &foo
>
> Which is problematic. What you want instead is something like this:
>
> prctl(PR_ARCH_NO_MUNMAP_ON_PKEY); // or whatever
> foo = malloc(PAGE_SIZE);
> pkey_mprotect(foo, PAGE_SIZE, READ|WRITE, pkey=1);
> wrpkru(PKEY_DISABLE_ACCESS<<pkey*2);
> munmap(foo); // returns -EPERM (or whatever)
>
> Which requires the kernel to check when it's modifying a VMA (like the
> munmap() above) to see if PKRU _currently_ permits access to the VMA's
> contents. If not, the kernel should refuse to modify the VMA.
>
> Like I said, I don't think this is _insane_, but I can see it breaking
> perfectly innocent things. For instance, an app that today does a
> free() if pkey-assigned memory might work perfectly fine for a long time
> since that memory is rarely unmapped. But, the minute that malloc()
> decides it needs to zap the memory, *malloc()* will fail.
>
> I also wonder how far these semantics would go. Would madvise() work on
> these access-denied VMAs?
>
> My gut says that we don't want to mix up pkey semantics with this new
> mechanism.
> sys_sigaltstack() is already pretty much useless with SHSTK, and it’s kinda busted with AVX512. How about we just add a whole new non-kludgy API?
Hi, why is that useless? Signal is not going to change the SSP, all it needs to do is creating a token that can be restored by sigreturn in the future on the current shadow stack.
Can you provide some examples about that? Thank you.
I guess the question here is whether the code to call sigaltstack and signal handler is considered part of the security code (sigreturn obviously has to be, since the kernel has to restore the PKRU based on the saved fpu).
I think to a large extent this is necessary, at least for the signal handler to be able to access the relevant registers at the time of the interrupt, which may contain data that the handler should not have access to. Even specifying a PKRU at the time of signal registration would make the system functionally sound and safe since the relevant calls must be protected.
It's just that the design here should be such as to minimize the ways in which the interface can be abused (e.g., accidental override access) as well as to simplify the difficulty of writing secure code. It might be reasonable, then, to save the PKRU when the `sigaltstack` is called.
The main purpose is to simplify the design of the handler entry point without adding new system calls, while not accidentally gaining privileges that do not belong to the current PKRU because of the system call, whether immediately, or later in signal delivery.
This is because this part of the design can be largely made easier if additional source checking and PKRU switching by the handler at the entry point can be avoided.
As `WRPKRU` can be abused, if the handler uses this instruction, additional SP as well as PKRU checks must be performed to prevent malicious programs from forging signals, and the check must get multiplex among all threads. However, for the kernel, it takes very little code to avoid these checks by giving the handler the PKRU it wants.
If only one PKEY is specified, then it is likely that `WRPKRU` is still needed, since the TCB itself may occupy multiple PKEYs, or, the handler need to access the memory of other PKEYs (e.g., complex multi-domain signal designs).
And, logically, it makes sense for a signal context (sigaltstack) to have the same PKRU when it is registered, and when it is used in the future. Thus, a special flag in `ss_flags & SS_SAVEPKRU` to ask the kernel to save the current PKRU would be sufficient.
From the security side, if the current PKRU does not have access to the signal stack, then a future signal occurring when the kernel uses this PKRU to write will also result in an segfault, thus avoiding unwanted access through sigaltstack.
This is also more accurate than checking the PKEY of the page when registering the signal stack (if we restricted the PKRU when registering the sigaltstack). Consider a possible error: a page is accidentally unmaped after being registered as a signal stack, and then another page that should not have been accessed by this PKRU is mapped to the same location, thus causing an override during signal delivery.
> I also bet we could do this with minimal new ABI. There's already a
> ->ss_flags field. We could assign a flag to mean that stack_t doesn't
> end at '->ss_size' and that there's a pkey value *after* ss_size. I do
> think having a single pkey that is made accessible before signal entry
> is a more flexible ABI than taking an explicit PKRU value.
Agreed, the most flexible way should be allow setting the PKRU to any subset of the current PKRU. So we can check `(~new_pkru) & current_pkru == 0` when calling sigaltstack.
However, no matter how it is done, one of the more disgusting thing is that code like this appears in the program that handles the signal.
```
old_pkru = read_pkru();
write_pkru(stack_pkru);
do_xsave();
*(fpu_saved + pkru_offset()) = old_pkru; // this may be an argument of fpu function call
```
And when restoring, you also need
```
old_pkru = *(fpu_saved + pkru_offset());
*(fpu_saved + pkru_offset()) = stack_pkru;
do_xstor();
write_pkru(old_pkru);
```
These plus the testing of the current runtime environment (MPK) are truly disgusting. It's just structually ugly.
On Fri, Sep 2, 2022, at 5:14 PM, Fangfei Yang wrote:
> I guess the question here is whether the code to call sigaltstack and
> signal handler is considered part of the security code (sigreturn
> obviously has to be, since the kernel has to restore the PKRU based on
> the saved fpu).
> I think to a large extent this is necessary, at least for the signal
> handler to be able to access the relevant registers at the time of the
> interrupt, which may contain data that the handler should not have
> access to. Even specifying a PKRU at the time of signal registration
> would make the system functionally sound and safe since the relevant
> calls must be protected.
>
> It's just that the design here should be such as to minimize the ways
> in which the interface can be abused (e.g., accidental override access)
> as well as to simplify the difficulty of writing secure code. It might
> be reasonable, then, to save the PKRU when the `sigaltstack` is called.
>
> The main purpose is to simplify the design of the handler entry point
> without adding new system calls, while not accidentally gaining
> privileges that do not belong to the current PKRU because of the system
> call, whether immediately, or later in signal delivery.
I think you might be so much more familiar with the system you’re working on than anyone else that you’re not explaining some basics and we’re all lost.
How is PKRU a “privilege” and what do you mean my “immediately”? I can’t follow this.
>
> This is because this part of the design can be largely made easier if
> additional source checking and PKRU switching by the handler at the
> entry point can be avoided.
Why would the entry point check a source? Or change PKRU? What would its PKRU logic be and why?
As I see it, the handler can (awkwardly, perhaps) manage PKRU just fine for all purposes except kernel access to the signal stack.
>
> As `WRPKRU` can be abused, if the handler uses this instruction,
> additional SP as well as PKRU checks must be performed to prevent
> malicious programs from forging signals, and the check must get
> multiplex among all threads. However, for the kernel, it takes very
> little code to avoid these checks by giving the handler the PKRU it
> wants.
Can you elaborate? Of course WRPKRU can be abused to fully bypass PKRU protection.
>
> If only one PKEY is specified, then it is likely that `WRPKRU` is still
> needed, since the TCB itself may occupy multiple PKEYs, or, the handler
> need to access the memory of other PKEYs (e.g., complex multi-domain
> signal designs).
>
> And, logically, it makes sense for a signal context (sigaltstack) to
> have the same PKRU when it is registered, and when it is used in the
> future. Thus, a special flag in `ss_flags & SS_SAVEPKRU` to ask the
> kernel to save the current PKRU would be sufficient.
This isn’t logical at all to me. It makes some sense as an API simplification to avoid a new syscall, and it makes sense in a bizarre (to me) world in which user code can control access to PKRU but not to sigaltstack(), but why do we live in that world?
>
> From the security side, if the current PKRU does not have access to the
> signal stack, then a future signal occurring when the kernel uses this
> PKRU to write will also result in an segfault, thus avoiding unwanted
> access through sigaltstack.
Do you mean in current kernels?
> This is also more accurate than checking the PKEY of the page when
> registering the signal stack (if we restricted the PKRU when
What do you mean “accurate”?
> registering the sigaltstack). Consider a possible error: a page is
> accidentally unmaped after being registered as a signal stack, and then
> another page that should not have been accessed by this PKRU is mapped
> to the same location, thus causing an override during signal delivery.
>
>> I also bet we could do this with minimal new ABI. There's already a
>> ->ss_flags field. We could assign a flag to mean that stack_t doesn't
>> end at '->ss_size' and that there's a pkey value *after* ss_size. I do
>> think having a single pkey that is made accessible before signal entry
>> is a more flexible ABI than taking an explicit PKRU value.
>
> Agreed, the most flexible way should be allow setting the PKRU to any
> subset of the current PKRU. So we can check `(~new_pkru) & current_pkru
> == 0` when calling sigaltstack.
>
> However, no matter how it is done, one of the more disgusting thing is
> that code like this appears in the program that handles the signal.
> ```
> old_pkru = read_pkru();
> write_pkru(stack_pkru);
> do_xsave();
> *(fpu_saved + pkru_offset()) = old_pkru; // this may be an argument of
> fpu function call
> ```
> And when restoring, you also need
> ```
> old_pkru = *(fpu_saved + pkru_offset());
> *(fpu_saved + pkru_offset()) = stack_pkru;
> do_xstor();
> write_pkru(old_pkru);
Sorry, what code does XSAVE here?
> ```
> These plus the testing of the current runtime environment (MPK) are
> truly disgusting. It's just structually ugly.
> How is PKRU a “privilege” and what do you mean my “immediately”? I can’t follow this.
If you use PKRU, then you add a layer of permissions controlled by the user
to the existing memory management to determine whether a particular PKEY's
memory can be read or written by the current context.
Then, we don't want this PKRU to be accidentally modified.
From this perspective, PKRU is a privilege in user space.
In addition to the restrictions on the user program itself,
some system calls can also cause the PKRU to be modified.
For the kernel, this user space is designed to be agnostic
(and can be anything, having all kinds of guarantees),
and only a few policies can be sent to the kernel via seccomp-bpf or other security apporches.
Examples of "immediately", e.g., sigreturn-oriented programming,
can change the current PKRU value while achieving the effect of ROP,
thus allowing an attacker to access memory that was not read or written in the original context.
This compared to attacks that takes time to work, sigaltstack once, and then wait the signal get triggered.
> Why would the entry point check a source? Or change PKRU? What would its PKRU logic be and why?
> As I see it, the handler can (awkwardly, perhaps) manage PKRU just fine for all purposes except kernel access to the signal stack.
Yes, it can be correct and safe, but it'd be better if kernel can just provide
the PKRU desired by the user when sending signal to the user.
> Can you elaborate? Of course WRPKRU can be abused to fully bypass PKRU protection.
This is determined by the characteristics of MPK and signal.
First, currently, the kernel uses a default PKRU value of 0x55555554 when it signals,
which means that only PKEY=0 is writable.
However, the program that actually handles the signal may need other permissions,
for example it may need to access PKEY=0,1,2 at the same time.
Then, I would need to write the following assembly code in the preamble of the handling function.
```
mov eax, 0x55555540
xor ecx, ecx
xor edx, edx
wrpkru
cmp eax, 0x55555540
jne fail
```
However, this leads to another problem, if, for example,
I am an attacker (assuming you have gone through some method that prevents sigreturn-oriented programming),
I can modify the RSP to jump to this signal handling function.
Then, you will see that in the code afterwards,
you have no way to tell if there is actually a signal currently being sent by the kernel to our handler function,
or if this was jumped over by the attacker since the previous PKRU has gone.
Of course this is not insurmountable.
For example, we can save a token to the memory that PKEY=0 before the very first `mov eax, 0x55555540`,
thus ensuring that PKEY=0 is writable before executing wrpkru
(assuming here that PKEY=0 indicates some higher privilege).
However, in a multi-threaded environment, saving such a token also becomes more complicated.
If the kernel can give a correct PKRU according to the user's specific design needs,
then there is no need for additional token here.
Then, even if the attacker jumps here, its PKRU unchanged,
and thus its privileges can be restricted by other methods
(e.g., it has been proposed to add a function to read pkru in seccomp-bpf[1]).
> This isn’t logical at all to me. It makes some sense as an API simplification to avoid a new syscall, and it makes sense in a bizarre (to me) world in which user code can control access to PKRU but not to sigaltstack(), but why do we live in that world?
If we can control WRPKRU, of course sigaltstack can also be controlled,
whether by seccomp or something else.
My point is that doing so would reduce the exposure of risky syscalls
by ensuring that no matter how the syscall is used,
it does not expand the privileges of the caller.
There is always a way to make the these system calls secure,
as long as the availability of the signal is addressed.
> Do you mean in current kernels?
No, what I mean is that if you save the current PKRU,
rather than allowing the caller to specify one at will.
Then, in the future, when doing a signal deliver,
if it tries to write to a stack address that is not allowed to be accessed by this saved PKRU,
it will segfault.
This is not necessary, as you can restrict access to sigaltstack,
I'm just saying that it makes sigaltstack itself a little more secure.
(Because PKRU is also in effect in the kernel, in the vast majority of cases, system calls
still follow the restrictions of the current context's PKRU when accessing user pointers.
This assumption is not absolute, but it is not a bad thing
if syscalls can satisfy this requirement)
> What do you mean “accurate”?
If the accessibility of the memory itself is only checked at the time of the system call,
then there is a possibility that this address is unmap and mapped again,
at which point the PKEY check based on is inaccurate.
> Sorry, what code does XSAVE here?
I simplified the process of signal delivery a bit.
For xsave here the process is: handle_signal->setup_rt_frame -> __setup_rt_frame->get_sigframe->copy_fpstate_to_sigframe->copy_fpregs_to_sigframe-> copy_xregs_to_user->xsave
It will be slightly different if other instructions are used.
Because this xsave will use the user's stack address directly,
the PKRU must be changed to the PKRU value obtained
by any of the methods discussed here before the xsave is made.
However, this results in the xsave saving the fpu registers with the new PKRU value,
but the xsave should save the old PKRU value (before the siganl).
So, we must backup the before write to it, and then after xsave,
the old value is used to replace the value in the signal frame.
Something similar needs to be done at `rt_sigreturn`,
because the restore_sigcontext will modify the PKRU register
to the PKRU value in the frame, but the `restore_altstack` that
follows will immediately access `&frame->uc.uc_stack`.
If, however, the restored PKRU does not have access to the `&frame->uc.uc_stack`,
it will cause a kernel panic.
[1] https://lore.kernel.org/linux-api/[email protected]/