2015-12-18 01:49:20

by Andy Lutomirski

[permalink] [raw]
Subject: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

Hi all-

I think that, for PKRU in particular, we want the default signal
handling behavior to be a bit unusual.

When a signal is delivered, I think we should save the entire xstate
including PKRU. I see no reason to do anything other than that.

When a signal returns (sigreturn is called), though, I think we should
*not* restore PKRU. To me, PKRU seems like a global per-thread state,
not something that signal handlers are likely to clobber and should
therefore have restored. It's also unusual in that it doesn't fit
into the usual xstate feature model of being a bunch of registers that
are mostly caller-saved.

Does this make sense? Should we do this?

We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures. They
appear to do more or less the same thing. Could we say that, for
certain new features (e.g. PKRU), if they're not in
_fpx_sw_bytes.xfeatures, then sigreturn will preserve the old content
rather than copying it? User code that wants to change it on
sigreturn would manually or the feature in to xfeatures, which would
cause the feature to go to its init state if it's not in
_header.xfeatures or to go into the saved state if it is in
_header.xfeatures?

Other means of signaling this could work, too.

It's not all that much code, I think.

--Andy


2015-12-18 02:13:42

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/17/2015 05:48 PM, Andy Lutomirski wrote:
> I think that, for PKRU in particular, we want the default signal
> handling behavior to be a bit unusual.
>
> When a signal is delivered, I think we should save the entire xstate
> including PKRU. I see no reason to do anything other than that.

Yep, agreed.

But what about the register state when delivering a signal? Don't we
set the registers to the init state? Do we need to preserve PKRU state
instead of init'ing it? The init state _is_ nice here because it is
permissive allows us to do something useful no matter what PKRU gets set to.

But, if we leave the init state in place when entering a delivering a
signal, we _can't_ decide to (by default at least) preserve the
in-signal state.

> When a signal returns (sigreturn is called), though, I think we should
> *not* restore PKRU. To me, PKRU seems like a global per-thread state,
> not something that signal handlers are likely to clobber and should
> therefore have restored. It's also unusual in that it doesn't fit
> into the usual xstate feature model of being a bunch of registers that
> are mostly caller-saved.
>
> Does this make sense? Should we do this?

Well, the signal handler isn't necessarily going to clobber it, but the
delivery code already clobbers it when going to the init state.

> We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures. They
> appear to do more or less the same thing.

I thought the _fpx_sw_bytes were only for 32-bit (or FXSAVE/FXRSTOR?).

> Could we say that, for
> certain new features (e.g. PKRU), if they're not in
> _fpx_sw_bytes.xfeatures, then sigreturn will preserve the old content
> rather than copying it? User code that wants to change it on
> sigreturn would manually or the feature in to xfeatures, which would
> cause the feature to go to its init state if it's not in
> _header.xfeatures or to go into the saved state if it is in
> _header.xfeatures?

I think we first need to decide on the state upon signal delivery.

A practial problem at the moment is that we always call XRSTOR (aka
copy_user_to_xregs()) with RFBM (aka 'mask') with all of the supported
xfeatures. So RFBM[i]=1 for each state, effectively. A state with
XSTATE_BV[i]=0 (aka header.xfeatures) and RFBM[i]=1 will init the state.

We'd need to rig up the copy_user_to_xregs() to first read in
header.xfeatures and then or RFBM with it.

Not a huge deal, but something we want to think about, especially as it
pertains to the init/modified optimizations.

2015-12-18 02:32:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen
<[email protected]> wrote:
> On 12/17/2015 05:48 PM, Andy Lutomirski wrote:
>> I think that, for PKRU in particular, we want the default signal
>> handling behavior to be a bit unusual.
>>
>> When a signal is delivered, I think we should save the entire xstate
>> including PKRU. I see no reason to do anything other than that.
>
> Yep, agreed.
>
> But what about the register state when delivering a signal? Don't we
> set the registers to the init state? Do we need to preserve PKRU state
> instead of init'ing it? The init state _is_ nice here because it is
> permissive allows us to do something useful no matter what PKRU gets set to.

I think we leave the extended regs alone. Don't we?

I think that, for most signals, we want to leave PKRU as is,
especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
an option to reset PKRU on delivery (and then set the flag to restore
on return?).

In any case, I think there are a decent number of programs out there
that use siglongjmp and therefore never actually hit sigreturn. They
probably won't restore PKRU, so we shouldn't zero it out when
delivering most signals, I think.

>
> But, if we leave the init state in place when entering a delivering a
> signal, we _can't_ decide to (by default at least) preserve the
> in-signal state.
>
>> When a signal returns (sigreturn is called), though, I think we should
>> *not* restore PKRU. To me, PKRU seems like a global per-thread state,
>> not something that signal handlers are likely to clobber and should
>> therefore have restored. It's also unusual in that it doesn't fit
>> into the usual xstate feature model of being a bunch of registers that
>> are mostly caller-saved.
>>
>> Does this make sense? Should we do this?
>
> Well, the signal handler isn't necessarily going to clobber it, but the
> delivery code already clobbers it when going to the init state.

Can you point to that code?

>
>> We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures. They
>> appear to do more or less the same thing.
>
> I thought the _fpx_sw_bytes were only for 32-bit (or FXSAVE/FXRSTOR?).

I thought they were everywhere. fpu/signal.c looks that way to me. I
could be missing something -- this code isn't the most straightforward
in the world.

>
>> Could we say that, for
>> certain new features (e.g. PKRU), if they're not in
>> _fpx_sw_bytes.xfeatures, then sigreturn will preserve the old content
>> rather than copying it? User code that wants to change it on
>> sigreturn would manually or the feature in to xfeatures, which would
>> cause the feature to go to its init state if it's not in
>> _header.xfeatures or to go into the saved state if it is in
>> _header.xfeatures?
>
> I think we first need to decide on the state upon signal delivery.
>

Agreed.

> A practial problem at the moment is that we always call XRSTOR (aka
> copy_user_to_xregs()) with RFBM (aka 'mask') with all of the supported
> xfeatures. So RFBM[i]=1 for each state, effectively. A state with
> XSTATE_BV[i]=0 (aka header.xfeatures) and RFBM[i]=1 will init the state.
>
> We'd need to rig up the copy_user_to_xregs() to first read in
> header.xfeatures and then or RFBM with it.

Indeed.

>
> Not a huge deal, but something we want to think about, especially as it
> pertains to the init/modified optimizations.

Fair point. FWIW, I don't think that sigreturn performance matters
all that much, so if we inadvertently lose some of the optimizations,
it may not be the end of the world.

I do wonder: are there any modern CPUs on which copy_user_to_xregs (as
opposed to first copying to the task_struct buffer and then
copy_kernel_to_xregs) is ever a win? It avoids clobbering a few
cachelines and saves some memory bandwidth, but that may come at the
cost of disabling the init/modified optimizations when the subsequent
save on context switch hits a different VA/PA than the
copy_user_to_xregs used.

--Andy

2015-12-18 02:53:04

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen <[email protected]> wrote:
>> But what about the register state when delivering a signal? Don't we
>> set the registers to the init state? Do we need to preserve PKRU state
>> instead of init'ing it? The init state _is_ nice here because it is
>> permissive allows us to do something useful no matter what PKRU gets set to.
>
> I think we leave the extended regs alone. Don't we?
>
> I think that, for most signals, we want to leave PKRU as is,
> especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
> an option to reset PKRU on delivery (and then set the flag to restore
> on return?).

Is there some precedent for doing the state differently for different
signals?

>> Well, the signal handler isn't necessarily going to clobber it, but the
>> delivery code already clobbers it when going to the init state.
>
> Can you point to that code?

handle_signal() -> fpu__clear()

The comment around it says:

"Ensure the signal handler starts with the new fpu state."

>>> We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures. They
>>> appear to do more or less the same thing.
>>
>> I thought the _fpx_sw_bytes were only for 32-bit (or FXSAVE/FXRSTOR?).
>
> I thought they were everywhere. fpu/signal.c looks that way to me. I
> could be missing something -- this code isn't the most straightforward
> in the world.

I think there's some extra space on the ia32 frame for this stuff, but
some clarity from someone who knows the history would be appreciated.

>> Not a huge deal, but something we want to think about, especially as it
>> pertains to the init/modified optimizations.
>
> Fair point. FWIW, I don't think that sigreturn performance matters
> all that much, so if we inadvertently lose some of the optimizations,
> it may not be the end of the world.

Once we lose the init optimization, we're sunk for good. We never get
it back until we restore the init state again. Having it in the init
state can save writing the state at XSAVE time, which can now save up to
~2k of writes at each context switch.

I'm sure we can preserve it, we just need to be _careful_.

2015-12-18 05:29:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Dec 17, 2015 6:53 PM, "Dave Hansen" <[email protected]> wrote:
>
> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen <[email protected]> wrote:
> >> But what about the register state when delivering a signal? Don't we
> >> set the registers to the init state? Do we need to preserve PKRU state
> >> instead of init'ing it? The init state _is_ nice here because it is
> >> permissive allows us to do something useful no matter what PKRU gets set to.
> >
> > I think we leave the extended regs alone. Don't we?
> >
> > I think that, for most signals, we want to leave PKRU as is,
> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
> > an option to reset PKRU on delivery (and then set the flag to restore
> > on return?).
>
> Is there some precedent for doing the state differently for different
> signals?

Yes, to a very limited extent: SA_ONSTACK.

>
> >> Well, the signal handler isn't necessarily going to clobber it, but the
> >> delivery code already clobbers it when going to the init state.
> >
> > Can you point to that code?
>
> handle_signal() -> fpu__clear()
>
> The comment around it says:
>
> "Ensure the signal handler starts with the new fpu state."
>

You win this round :)

So maybe we should have a mask of xfeatures that aren't cleared on
signal delivery (e.g. PKRU, perhaps) and that are, by default,
preserved across signal returns.

> >>> We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures. They
> >>> appear to do more or less the same thing.
> >>
> >> I thought the _fpx_sw_bytes were only for 32-bit (or FXSAVE/FXRSTOR?).
> >
> > I thought they were everywhere. fpu/signal.c looks that way to me. I
> > could be missing something -- this code isn't the most straightforward
> > in the world.
>
> I think there's some extra space on the ia32 frame for this stuff, but
> some clarity from someone who knows the history would be appreciated.
>
> >> Not a huge deal, but something we want to think about, especially as it
> >> pertains to the init/modified optimizations.
> >
> > Fair point. FWIW, I don't think that sigreturn performance matters
> > all that much, so if we inadvertently lose some of the optimizations,
> > it may not be the end of the world.
>
> Once we lose the init optimization, we're sunk for good. We never get
> it back until we restore the init state again. Having it in the init
> state can save writing the state at XSAVE time, which can now save up to
> ~2k of writes at each context switch.
>
> I'm sure we can preserve it, we just need to be _careful_.

Right.

How much does XSAVEOPT help here? IOW if we're careful to save to the
same place we restored from and we don't modify the state in the mean
time, how much of the time do we save? In the best case, I guess we
save the memory writes but not the reads?

--Andy

2015-12-18 06:43:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On December 17, 2015 9:29:21 PM PST, Andy Lutomirski <[email protected]> wrote:
>On Dec 17, 2015 6:53 PM, "Dave Hansen" <[email protected]>
>wrote:
>>
>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen
><[email protected]> wrote:
>> >> But what about the register state when delivering a signal? Don't
>we
>> >> set the registers to the init state? Do we need to preserve PKRU
>state
>> >> instead of init'ing it? The init state _is_ nice here because it
>is
>> >> permissive allows us to do something useful no matter what PKRU
>gets set to.
>> >
>> > I think we leave the extended regs alone. Don't we?
>> >
>> > I think that, for most signals, we want to leave PKRU as is,
>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we
>want
>> > an option to reset PKRU on delivery (and then set the flag to
>restore
>> > on return?).
>>
>> Is there some precedent for doing the state differently for different
>> signals?
>
>Yes, to a very limited extent: SA_ONSTACK.
>
>>
>> >> Well, the signal handler isn't necessarily going to clobber it,
>but the
>> >> delivery code already clobbers it when going to the init state.
>> >
>> > Can you point to that code?
>>
>> handle_signal() -> fpu__clear()
>>
>> The comment around it says:
>>
>> "Ensure the signal handler starts with the new fpu state."
>>
>
>You win this round :)
>
>So maybe we should have a mask of xfeatures that aren't cleared on
>signal delivery (e.g. PKRU, perhaps) and that are, by default,
>preserved across signal returns.
>
>> >>> We have _fpx_sw_bytes.xfeatures and _xstate._header.xfeatures.
>They
>> >>> appear to do more or less the same thing.
>> >>
>> >> I thought the _fpx_sw_bytes were only for 32-bit (or
>FXSAVE/FXRSTOR?).
>> >
>> > I thought they were everywhere. fpu/signal.c looks that way to me.
> I
>> > could be missing something -- this code isn't the most
>straightforward
>> > in the world.
>>
>> I think there's some extra space on the ia32 frame for this stuff,
>but
>> some clarity from someone who knows the history would be appreciated.
>>
>> >> Not a huge deal, but something we want to think about, especially
>as it
>> >> pertains to the init/modified optimizations.
>> >
>> > Fair point. FWIW, I don't think that sigreturn performance matters
>> > all that much, so if we inadvertently lose some of the
>optimizations,
>> > it may not be the end of the world.
>>
>> Once we lose the init optimization, we're sunk for good. We never
>get
>> it back until we restore the init state again. Having it in the init
>> state can save writing the state at XSAVE time, which can now save up
>to
>> ~2k of writes at each context switch.
>>
>> I'm sure we can preserve it, we just need to be _careful_.
>
>Right.
>
>How much does XSAVEOPT help here? IOW if we're careful to save to the
>same place we restored from and we don't modify the state in the mean
>time, how much of the time do we save? In the best case, I guess we
>save the memory writes but not the reads?
>
>--Andy

I find the notion of PKRU not being initialized at the entry to a signal handler a bit odd. Saving/restoring it seems like the right thing to me.

What am I missing?
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

2015-12-18 08:32:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?


* Andy Lutomirski <[email protected]> wrote:

> > >> But what about the register state when delivering a signal? Don't we
> > >> set the registers to the init state? Do we need to preserve PKRU state
> > >> instead of init'ing it? The init state _is_ nice here because it is
> > >> permissive allows us to do something useful no matter what PKRU gets set to.
> > >
> > > I think we leave the extended regs alone. Don't we?
> > >
> > > I think that, for most signals, we want to leave PKRU as is,
> > > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
> > > an option to reset PKRU on delivery (and then set the flag to restore
> > > on return?).
> >
> > Is there some precedent for doing the state differently for different
> > signals?
>
> Yes, to a very limited extent: SA_ONSTACK.
>
> >
> > >> Well, the signal handler isn't necessarily going to clobber it, but the
> > >> delivery code already clobbers it when going to the init state.
> > >
> > > Can you point to that code?
> >
> > handle_signal() -> fpu__clear()
> >
> > The comment around it says:
> >
> > "Ensure the signal handler starts with the new fpu state."
> >
>
> You win this round :)
>
> So maybe we should have a mask of xfeatures that aren't cleared on
> signal delivery (e.g. PKRU, perhaps) and that are, by default,
> preserved across signal returns.

So the principle is this: signals are conceptually like creating a new thread of
execution, and that's a very powerful programming concept, like fork() or
pthread_create() are powerful concepts. So we definitely want to keep that default
behavior, and I don't think we want to deviate from that for typical new extended
CPU context features, even if signal delivery slows down as a result.

But we've been arguing about 'lightweight signals' for up to two decades that I
can remember. (The first such suggestion was to not save the FPU state, back when
FPU saves were ridiculously slow compared to other parts of saving/restoring a
context.)

So having a well-enumerated, extensible opt-in mask (which defaults to 'all state
saved') that allows smart signal handlers to skip the save/restore of certain CPU
context components would be acceptable I think.

But I'd still expect this to be limited to closely coded, specialistic signal
handlers - as the trend goes against such type of specialization: compilers and
runtime environments do take advantage of new CPU features so if you want to have
an 'easy to use' signal handler, you should use the default one.

I'd not be surprised if large-scale signal users like Valgrind could benefit.

> > I'm sure we can preserve it, we just need to be _careful_.
>
> Right.
>
> How much does XSAVEOPT help here? IOW if we're careful to save to the same
> place we restored from and we don't modify the state in the mean time, how much
> of the time do we save? In the best case, I guess we save the memory writes but
> not the reads?

So I'd not design new signal interfaces around current behavior, I'd design them
for the existing patterns (which center around programming ease of use) - with
opt-in, performance-enhancing specializations.

Thanks,

Ingo

2015-12-18 08:59:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Thu, Dec 17, 2015 at 05:48:56PM -0800, Andy Lutomirski wrote:
> Hi all-
>
> I think that, for PKRU in particular, we want the default signal
> handling behavior to be a bit unusual.

Stupid question, but what the heck is PKRU? A grep of the kernel tree
shows no results, and a web search returns mostly Thai language results.

2015-12-18 12:58:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:59:14AM -0800, Christoph Hellwig wrote:
> Stupid question, but what the heck is PKRU? A grep of the kernel tree
> shows no results, and a web search returns mostly Thai language results.

That should explain it:

https://lkml.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-18 16:04:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin <[email protected]> wrote:
> On December 17, 2015 9:29:21 PM PST, Andy Lutomirski <[email protected]> wrote:
>>On Dec 17, 2015 6:53 PM, "Dave Hansen" <[email protected]>
>>wrote:
>>>
>>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
>>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen
>><[email protected]> wrote:
>>> >> But what about the register state when delivering a signal? Don't
>>we
>>> >> set the registers to the init state? Do we need to preserve PKRU
>>state
>>> >> instead of init'ing it? The init state _is_ nice here because it
>>is
>>> >> permissive allows us to do something useful no matter what PKRU
>>gets set to.
>>> >
>>> > I think we leave the extended regs alone. Don't we?
>>> >
>>> > I think that, for most signals, we want to leave PKRU as is,
>>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we
>>want
>>> > an option to reset PKRU on delivery (and then set the flag to
>>restore
>>> > on return?).
>>>
>>> Is there some precedent for doing the state differently for different
>>> signals?
>>
>>Yes, to a very limited extent: SA_ONSTACK.
>>
>>>
>>> >> Well, the signal handler isn't necessarily going to clobber it,
>>but the
>>> >> delivery code already clobbers it when going to the init state.
>>> >
>>> > Can you point to that code?
>>>
>>> handle_signal() -> fpu__clear()
>>>
>>> The comment around it says:
>>>
>>> "Ensure the signal handler starts with the new fpu state."
>>>
>>
>>You win this round :)
>>
>>So maybe we should have a mask of xfeatures that aren't cleared on
>>signal delivery (e.g. PKRU, perhaps) and that are, by default,
>>preserved across signal returns.
>>

[...]

>
> I find the notion of PKRU not being initialized at the entry to a signal handler a bit odd. Saving/restoring it seems like the right thing to me.
>
> What am I missing?

On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar <[email protected]> wrote:
>
> So the principle is this: signals are conceptually like creating a new thread of
> execution, and that's a very powerful programming concept, like fork() or
> pthread_create() are powerful concepts. So we definitely want to keep that default
> behavior, and I don't think we want to deviate from that for typical new extended
> CPU context features, even if signal delivery slows down as a result.

I think that PKRU is in almost no respect a typical no extended CPU
context feature. It's a control register, not a data register.

Let's try a concrete example. One big use case for PKRU will be in
conjunction with DAX-style pmem. Imagine some program (database
server, perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ |
PROT_WRITE with protection key 1. It sets PKRU so that key 1 can't be
written. All of its accesses to the pmem store are in tight code
regions that do wrpkru; mov; wrpkru (just like what the kernel does
with stac/clac on SMAP systems).

>From the app's perspective, it's imperative that stray writes that
coincidentally hit the pmem region fail. (1 TB is big enough that a
random write to canonical user memory hits it almost 1% of the time.)
This means:

1. If a signal is delivered, the app wants PKRU to be set to some safe
value as early as possible. The app can do it by itself if it doesn't
use libraries that interfere, but it would be IMO much, much nicer if
the kernel made this happen automatically.

1b. If the app malfunctions such that RSP points to pmem, the kernel
MUST NOT clobber the pmem space. I think that this basically mandates
that PKRU needs to have some safe state (i.e. definitely not the init
state) on signal delivery: the kernel is going to write a signal frame
at the address identified by RSP, and that address is in pmem, so
those writes need to fail.

2. clone with shared VM should preserve PKRU.

So in this regard, I think signals are kind of like new threads after all.

Now that I think about it more, there are at least two possibilities
that work for this use case.

Option A: signal delivery preserves PKRU. If we go this route, I
think we should decide whether PKRU is preserved on exit. Given that
we're talking about adding syscalls to adjust PKRU, it seems a bit odd
to me that sigreturn would, by default, undo whatever those syscalls
do.

Option B: signal delivery resets PKRU to user-specified values. We're
talking about having syscalls to write PKRU. We could have clone and
signal entry use the values set by syscall instead of the values in
the actual PKRU register. That way:

set_protection_key(1, no writes);
...
wrpkru(allow writes to key 1)
mov something <<<<<<<<<<< fault or async signal here
wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler.

If we go that route, I think we must restore PKRU just like any other
xstate on sigreturn, so in some sense it's the simplest of all to
implement. We'll need to change the signal delivery stuff to do the
fpu__clear and the automatic PKRU write before trying to set up the
signal frame so that the frame is written with the syscall-specified
protections instead of the wrpkru-overridden values, but that should
be easy.


Which reminds me: __get_user, etc all respect PKRU because the SDM
says that PKRU affects all user *and* kernel accesses to user memory
(i.e. anything with _PAGE_USER). Should get_user_pages also respect
PKRU?

>
> But we've been arguing about 'lightweight signals' for up to two decades that I
> can remember. (The first such suggestion was to not save the FPU state, back when
> FPU saves were ridiculously slow compared to other parts of saving/restoring a
> context.)

I'm not making any claims about lightweightness here. While
performance is an issue in some cases, I'd like to hash out the
baseline functionality first.

>
> So having a well-enumerated, extensible opt-in mask (which defaults to 'all state
> saved') that allows smart signal handlers to skip the save/restore of certain CPU
> context components would be acceptable I think.

I have a patch that helps considerably with a probably-unnoticeable
ABI change by letting signal delivery use sysret. I'll dust it off.

--Andy

2015-12-18 16:57:14

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> Which reminds me: __get_user, etc all respect PKRU because the SDM
> says that PKRU affects all user *and* kernel accesses to user memory
> (i.e. anything with _PAGE_USER). Should get_user_pages also respect
> PKRU?

It does. There are a couple of parts to it, but the most important are
in these patches:

> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=c64acbe036487d07968ea6a5b9090169d6e40160
> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=331c62eb789f969e187964e28ac80b1e0e26b69d

2015-12-18 18:42:59

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> 1b. If the app malfunctions such that RSP points to pmem, the kernel
> MUST NOT clobber the pmem space. I think that this basically mandates
> that PKRU needs to have some safe state (i.e. definitely not the init
> state) on signal delivery: the kernel is going to write a signal frame
> at the address identified by RSP, and that address is in pmem, so
> those writes need to fail.

The kernel is writing the signal frame using normal old copy_to_user().
Those are writing through mappings with _PAGE_USER set and should be
subject to the PKRU state of the thread before the signal started to be
delivered.

We don't do the fpu__clear() until after this copy, so I think pkeys
enforcement is being done properly for this today.

2015-12-18 19:22:18

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen
<[email protected]> wrote:
> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>> MUST NOT clobber the pmem space. I think that this basically mandates
>> that PKRU needs to have some safe state (i.e. definitely not the init
>> state) on signal delivery: the kernel is going to write a signal frame
>> at the address identified by RSP, and that address is in pmem, so
>> those writes need to fail.
>
> The kernel is writing the signal frame using normal old copy_to_user().
> Those are writing through mappings with _PAGE_USER set and should be
> subject to the PKRU state of the thread before the signal started to be
> delivered.
>
> We don't do the fpu__clear() until after this copy, so I think pkeys
> enforcement is being done properly for this today.

True, but I think only in a very limited sense. Your average signal
handler is reasonably like to execute "push $rbp" as its very first
instruction, at which point we're immediately screwed with the current
arrangement.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-12-18 20:07:19

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen
> <[email protected]> wrote:
>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>> state) on signal delivery: the kernel is going to write a signal frame
>>> at the address identified by RSP, and that address is in pmem, so
>>> those writes need to fail.
>>
>> The kernel is writing the signal frame using normal old copy_to_user().
>> Those are writing through mappings with _PAGE_USER set and should be
>> subject to the PKRU state of the thread before the signal started to be
>> delivered.
>>
>> We don't do the fpu__clear() until after this copy, so I think pkeys
>> enforcement is being done properly for this today.
>
> True, but I think only in a very limited sense. Your average signal
> handler is reasonably like to execute "push $rbp" as its very first
> instruction, at which point we're immediately screwed with the current
> arrangement.

I completely agree that there's a window for corruption.

But, I think it's a small one. Basically, RSP would have to pointing at
a place which was allowed by protection keys for all of the sigframe
setup. Then, _just_ happened to be at a place which was denied by
protection keys when it enters the signal handler back in userspace.
It's possible, but it's a small window.

2015-12-18 20:28:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen
<[email protected]> wrote:
> On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen
>> <[email protected]> wrote:
>>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>>> state) on signal delivery: the kernel is going to write a signal frame
>>>> at the address identified by RSP, and that address is in pmem, so
>>>> those writes need to fail.
>>>
>>> The kernel is writing the signal frame using normal old copy_to_user().
>>> Those are writing through mappings with _PAGE_USER set and should be
>>> subject to the PKRU state of the thread before the signal started to be
>>> delivered.
>>>
>>> We don't do the fpu__clear() until after this copy, so I think pkeys
>>> enforcement is being done properly for this today.
>>
>> True, but I think only in a very limited sense. Your average signal
>> handler is reasonably like to execute "push $rbp" as its very first
>> instruction, at which point we're immediately screwed with the current
>> arrangement.
>
> I completely agree that there's a window for corruption.
>
> But, I think it's a small one. Basically, RSP would have to pointing at
> a place which was allowed by protection keys for all of the sigframe
> setup. Then, _just_ happened to be at a place which was denied by
> protection keys when it enters the signal handler back in userspace.
> It's possible, but it's a small window.

That's true.

OTOH, I think that the signal in the middle of the wrpkru-allowed
window is also a compelling case. So maybe my new preference is
option B, where we save and restore PKRU like all the other xfeatures
but, rather than clearing it in the signal context, we set it to the
syscall-specified value.

If we do this, then we'd have to clearly separate the
syscall-specified value from the xstate value, and we'd probably want
one of the syscalls to offer an option to force user PKRU to match the
syscall value. Maybe your code already does this -- I didn't check.

This has the nice property that sigreturn isn't affected at all except
insofar as we should give a little bit of thought to the ordering of
restoring PKRU relative to any other uaccess. I'd imagine that we'd
just restore PKRU last, in case the stack we're restoring from has
access disallowed by the saved PKRU value.

--Andy

2015-12-18 20:37:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen
<[email protected]> wrote:
>
> But, I think it's a small one. Basically, RSP would have to pointing at
> a place which was allowed by protection keys for all of the sigframe
> setup.

Note that the whole "stack is special" is not at all a new issue.

It's the main reason why sigaltstack() and SS_ONSTACK exists. This is
in no way new to PKRU, people have had to handle the issue of
stack-related SIGSEGV faults for a long time.

So any application that uses PKRU and may play games that affects the
stack, will always have to have a separate "safe stack" that it uses
for signal handling. But that is in no way PKRU-specific, it's been
the case for a lot of other memory management faults.

Linus

2015-12-18 20:49:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:37 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen
> <[email protected]> wrote:
>>
>> But, I think it's a small one. Basically, RSP would have to pointing at
>> a place which was allowed by protection keys for all of the sigframe
>> setup.
>
> Note that the whole "stack is special" is not at all a new issue.
>
> It's the main reason why sigaltstack() and SS_ONSTACK exists. This is
> in no way new to PKRU, people have had to handle the issue of
> stack-related SIGSEGV faults for a long time.
>
> So any application that uses PKRU and may play games that affects the
> stack, will always have to have a separate "safe stack" that it uses
> for signal handling. But that is in no way PKRU-specific, it's been
> the case for a lot of other memory management faults.
>

The trouble is that this isn't just limited to special "safe stack"
SA_ONSTACK aware code. It even affects normally innocuous things.

The thing that's new is that, before PKRU, the HW stack might have
been unsafe in the sense that accessing it would fault and we need a
different backup stack for some signals. With PKRU, we might be in a
transient extra-privileged (permissive PKRU) state, and we want to
tighten permissions for *all* signal deliveries so that, in case the
current stack is erroneous, we don't corrupt it.

Admittedly, ending up with permissive PKRU in some short sequence and
corrupt RSP is likely to corrupt things no matter what the kernel
does. But I don't think we want to deliver SIGALRM to a library while
the main program is in a specially permissive PKRU region.

IOW, I like my idea in which signal delivery always sets PKRU to the
application-requested-by-syscall values and sigreturn restores it.
Kinda like sigaltstack, but applies to all signals and affects PKRU
instead of RSP.

--Andy

2015-12-18 20:59:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 12:49 PM, Andy Lutomirski wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.
> Kinda like sigaltstack, but applies to all signals and affects PKRU
> instead of RSP.
>

I think this is the only sensible option, with the default being all zero.

-hpa

2015-12-18 21:03:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin <[email protected]> wrote:
> On 12/18/2015 12:49 PM, Andy Lutomirski wrote:
>>
>> IOW, I like my idea in which signal delivery always sets PKRU to the
>> application-requested-by-syscall values and sigreturn restores it.
>> Kinda like sigaltstack, but applies to all signals and affects PKRU
>> instead of RSP.
>>
>
> I think this is the only sensible option, with the default being all zero.
>

Or not quite all zero if we do Dave's experimental PROT_EXEC thing.

Actually, I want to introduce a set of per-mm "incompatible" bits. By
default, they'd be zero. We can, as needed, define bits that do
something nice but break old code. I want one of the bits to turn
vsyscalls off entirely. Another bit could say that the kernel is
allowed to steal a protection key for PROT_EXEC.

These bits would be read and written by prctl, but there could also be
an ELF note mechanism to initialize them on execve without a syscall.

--Andy

2015-12-18 21:04:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski <[email protected]> wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.

So I don't mind that, as long as the whole "sigreturn restores it" is
part of things.

Your original email with the suggestion to *not* resture PKRU I didn't
like. Setting it and restoring it is fine.

I do wonder if you need an explicit value, though. I think it's
reasonable to say that PKRU value 0 is special. It's what we'd start
processes with, and why not just say that it's what we run signal
handlers in?

Would any other value ever make sense, really?

Linus

2015-12-18 21:08:16

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 01:02 PM, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin <[email protected]> wrote:
>> > On 12/18/2015 12:49 PM, Andy Lutomirski wrote:
>>> >>
>>> >> IOW, I like my idea in which signal delivery always sets PKRU to the
>>> >> application-requested-by-syscall values and sigreturn restores it.
>>> >> Kinda like sigaltstack, but applies to all signals and affects PKRU
>>> >> instead of RSP.
>>> >>
>> > I think this is the only sensible option, with the default being all zero.
>> >
> Or not quite all zero if we do Dave's experimental PROT_EXEC thing.
>
> Actually, I want to introduce a set of per-mm "incompatible" bits. By
> default, they'd be zero. We can, as needed, define bits that do
> something nice but break old code. I want one of the bits to turn
> vsyscalls off entirely. Another bit could say that the kernel is
> allowed to steal a protection key for PROT_EXEC.

That really only makes sense if we have userspace that expects all the
protection keys to be available. If we go the route of having
pkey_alloc/free() syscalls, then the kernel can easily tell userspace to
keep its mitts off a particular pkey by not ever returning it from a
pkey_alloc().

2015-12-18 21:10:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 1:04 PM, Linus Torvalds
<[email protected]> wrote:
>
> I do wonder if you need an explicit value, though. I think it's
> reasonable to say that PKRU value 0 is special. It's what we'd start
> processes with, and why not just say that it's what we run signal
> handlers in?
>
> Would any other value ever make sense, really?

Ahh. Your point about the PROT_EXEC handling means that maybe we don't
want to default to zero. Maybe we want to make the default PKRU
startup value be 1 instead, enabling access disable key for key 0?

Linus

2015-12-18 21:12:46

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 01:04 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> IOW, I like my idea in which signal delivery always sets PKRU to the
>> application-requested-by-syscall values and sigreturn restores it.
>
> So I don't mind that, as long as the whole "sigreturn restores it" is
> part of things.
>
> Your original email with the suggestion to *not* resture PKRU I didn't
> like. Setting it and restoring it is fine.
>
> I do wonder if you need an explicit value, though. I think it's
> reasonable to say that PKRU value 0 is special. It's what we'd start
> processes with, and why not just say that it's what we run signal
> handlers in?
>
> Would any other value ever make sense, really?

Having a PKRU with the execute-only permissions set is the only one I
can think of. For a system with a _dedicated_ PKEY for execute-only,
this is easy and could even be made a part of init_fpstate with no other
code changes.

But, if we are picking out an execute-only pkey more dynamically, we've
got to keep the default value for the entire process somewhere.

2015-12-18 21:45:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen
<[email protected]> wrote:
>
> But, if we are picking out an execute-only pkey more dynamically, we've
> got to keep the default value for the entire process somewhere.

How dynamic do we want to make this, though?

I haven't looked at the details, and perhaps more importantly, I don't
know what exactly are the requirements you've gotten from the people
who are expected to actually use this.

I think we might want to hardcode a couple of keys as "kernel
reserved". And I'd rather reserve them up-front than have some user
program be unhappy later when we want to use them.

I guess we want to leave key #0 for "normal page", so my suggesting to
use that for the execute-only was probably misguided.

But I do think we might want to have that "no read access" as a real
fixed key too, because I think the kernel itself would want to use it:

(a) to make sure that it gets the right fault when user space passes
in a execute-only address to a system call.

(b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.

so I do think that we'd want to reserve two of the 16 keys up front.

Would it be ok for the expected users to have those keys simply be
fixed? With key 0 being used for all default pages, and key 1 being
used for all execute-only pages? And then defaulting PKRU to 4,
disallowing access to that key #1?

I could imagine that some kernel person would want to use even more
keys, but I think two fixed keys are kind of the minimal we'd want to
use.

Linus

2015-12-18 22:29:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 1:45 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen
> <[email protected]> wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?
>
> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved". And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.
>
> I guess we want to leave key #0 for "normal page", so my suggesting to
> use that for the execute-only was probably misguided.
>
> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
> (a) to make sure that it gets the right fault when user space passes
> in a execute-only address to a system call.
>
> (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.
>
> so I do think that we'd want to reserve two of the 16 keys up front.
>
> Would it be ok for the expected users to have those keys simply be
> fixed? With key 0 being used for all default pages, and key 1 being
> used for all execute-only pages? And then defaulting PKRU to 4,
> disallowing access to that key #1?
>
> I could imagine that some kernel person would want to use even more
> keys, but I think two fixed keys are kind of the minimal we'd want to
> use.

I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
Let me be a bit more concrete about what I'm suggesting:

We'd have thread_struct.baseline_pkru. It would start with key 0
allowing all access and key 1 denying reads.

We'd have a syscall like set_protection_key that could allocate unused
keys and change the values of keys that have been allocated. Those
changes would be reflected in baseline_pkru. Changes to keys 0 and 1
in baseline_pkru would not be allowed.

Signal delivery would load baseline_pkru into the PKRU register.
Signal restore would restore PKRU to its previous value.

WRPKRU would, of course, override baseline_pkru, but it wouldn't
change baseline_pkru. The set_protection_key syscall would modify
*both* real PKRU and baseline_pkru.

Apps that don't want to use the baseline_pkru mechanism could use
syscalls to claim ownership of protection keys but then manage them
purely with WRPKRU directly. We could optionally disallow
mprotect_key on keys that weren't allocated in advance.

Does that seem sane?

--Andy

2015-12-18 23:08:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski <[email protected]> wrote:
>
> Apps that don't want to use the baseline_pkru mechanism could use
> syscalls to claim ownership of protection keys but then manage them
> purely with WRPKRU directly. We could optionally disallow
> mprotect_key on keys that weren't allocated in advance.
>
> Does that seem sane?

So everything seems sane except for the need for that baseline_pkru.

I'm not seeing why it couldn't just be a fixed value. Is there any
real downside to it?

Linus

2015-12-18 23:17:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 3:08 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Apps that don't want to use the baseline_pkru mechanism could use
>> syscalls to claim ownership of protection keys but then manage them
>> purely with WRPKRU directly. We could optionally disallow
>> mprotect_key on keys that weren't allocated in advance.
>>
>> Does that seem sane?
>
> So everything seems sane except for the need for that baseline_pkru.
>
> I'm not seeing why it couldn't just be a fixed value. Is there any
> real downside to it?

Yes, I think. If I'm using protection keys to protect some critical
data structure (important stuff in shared memory, important memory
mapped files, pmem, etc), then I'll allocate a protection key and set
PKRU to deny writes. The problem is that I really, really want writes
denied except when explicitly enabled in narrow regions of code that
use wrpkru to enable them, and I don't want an asynchronous signal
delivered in those narrow regions of code or newly cloned threads to
pick up the write-allow value. So I want baseline_pkru to have the
deny writes entry.

I think I would do exactly this in my production code here if my
server supported it. Some day...

Hrm. We might also want an option to change pkru and/or baseline_pkru
in all threads in the current mm. That's optional but it could be
handy. Maybe it would be as simple as having the allocate-a-pkey call
have an option to set an initial baseline value and an option to
propagate that initial value to pre-existing threads.

--Andy

2015-12-18 23:20:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Fri, Dec 18, 2015 at 3:16 PM, Andy Lutomirski <[email protected]> wrote:
>
> Yes, I think. If I'm using protection keys to protect some critical
> data structure (important stuff in shared memory, important memory
> mapped files, pmem, etc), then I'll allocate a protection key and set
> PKRU to deny writes. The problem is that I really, really want writes
> denied except when explicitly enabled in narrow regions of code that
> use wrpkru to enable them, and I don't want an asynchronous signal
> delivered in those narrow regions of code or newly cloned threads to
> pick up the write-allow value. So I want baseline_pkru to have the
> deny writes entry.

Hmm. Ok, that does sound like a valid and interesting usage case. Fair enough.

Linus

2015-12-21 17:04:39

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> Hrm. We might also want an option to change pkru and/or baseline_pkru
> in all threads in the current mm. That's optional but it could be
> handy. Maybe it would be as simple as having the allocate-a-pkey call
> have an option to set an initial baseline value and an option to
> propagate that initial value to pre-existing threads.

Do you mean actively going in and changing PKRU in other threads? I
fear that will be dangerous.

IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
are allowed (somehow). Raw in this case would mean a thread calling
WRPKRU without a system call and without checking in with what any other
threads are doing.

Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
it for execute-only). Now, a thread is trying to do this:

pkey2 = sys_pkey_alloc(); // now pkey2=2
tmp = rdpkru(); // 0x004
tmp |= 0x10; // set PKRU[2].AD=1
wrpkru(tmp);

While another thread does:

pkey4 = pkey_alloc(); // pkey4=4
sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);

Without some kind of locking, that's going to race. We could do all the
locking in the kernel, but that requires that the kernel do all the PKRU
writing, which I'd really like to avoid.

I think the closest we can get reasonably is to have the kernel track
the baseline_pkru and then allow userspace to query it in case userspace
decides that thread needs to update its thread-local PKRU from the baseline.

2015-12-21 22:53:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Dec 22, 2015 2:04 AM, "Dave Hansen" <[email protected]> wrote:
>
> On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> > Hrm. We might also want an option to change pkru and/or baseline_pkru
> > in all threads in the current mm. That's optional but it could be
> > handy. Maybe it would be as simple as having the allocate-a-pkey call
> > have an option to set an initial baseline value and an option to
> > propagate that initial value to pre-existing threads.
>
> Do you mean actively going in and changing PKRU in other threads? I
> fear that will be dangerous.
>
> IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
> are allowed (somehow). Raw in this case would mean a thread calling
> WRPKRU without a system call and without checking in with what any other
> threads are doing.
>
> Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
> it for execute-only). Now, a thread is trying to do this:
>
> pkey2 = sys_pkey_alloc(); // now pkey2=2
> tmp = rdpkru(); // 0x004
> tmp |= 0x10; // set PKRU[2].AD=1
> wrpkru(tmp);
>
> While another thread does:
>
> pkey4 = pkey_alloc(); // pkey4=4
> sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);
>
> Without some kind of locking, that's going to race. We could do all the
> locking in the kernel, but that requires that the kernel do all the PKRU
> writing, which I'd really like to avoid.
>
> I think the closest we can get reasonably is to have the kernel track
> the baseline_pkru and then allow userspace to query it in case userspace
> decides that thread needs to update its thread-local PKRU from the baseline.

Yeah, fair point. Let's skip the modify-other-threads thing.

Perhaps this is silly, but what if the default were changed to deny
reads and writes for unallocated keys? Is there a use case that
breaks?

--Andy

2015-12-21 23:00:41

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
> Perhaps this is silly, but what if the default were changed to deny
> reads and writes for unallocated keys? Is there a use case that
> breaks?

It's probably a reasonable debugging feature.

But, anything that takes an XSAVE feature out of its "init state" has
the potential to do a bit of harm because it increases the potential
size of writes during XSAVE. XSAVEOPT will _help_ here, but we probably
don't want to go out of our way to take things out of the init state
when we're unsure of the benefits.

2015-12-21 23:02:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen
<[email protected]> wrote:
> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>> Perhaps this is silly, but what if the default were changed to deny
>> reads and writes for unallocated keys? Is there a use case that
>> breaks?
>
> It's probably a reasonable debugging feature.
>
> But, anything that takes an XSAVE feature out of its "init state" has
> the potential to do a bit of harm because it increases the potential
> size of writes during XSAVE. XSAVEOPT will _help_ here, but we probably
> don't want to go out of our way to take things out of the init state
> when we're unsure of the benefits.

Aren't you already doing that with your magic execute-only thing?

Also, if we ever do the deferred-xstate-restore thing that Rik was
playing with awhile back, then we'll want to switch to using rdpkru
and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
the XRSTOR and XSAVEOPT state, and this particular issue will become
irrelevant.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-12-21 23:04:30

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
...
>> I could imagine that some kernel person would want to use even more
>> keys, but I think two fixed keys are kind of the minimal we'd want to
>> use.
>
> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
> Let me be a bit more concrete about what I'm suggesting:
>
> We'd have thread_struct.baseline_pkru. It would start with key 0
> allowing all access and key 1 denying reads.

Are you sure thread_struct is the right place for this? I think of
signal handlers as a process-wide thing, and it seems a bit goofy if we
have the PKRU value in a signal handler depend on the PKRU of the thread
that got interrupted.

> We'd have a syscall like set_protection_key that could allocate unused
> keys and change the values of keys that have been allocated. Those
> changes would be reflected in baseline_pkru. Changes to keys 0 and 1
> in baseline_pkru would not be allowed.

FWIW, I think we can do this without *actually* dedicating key 1 to
execute-only. But that's a side issue.

> Signal delivery would load baseline_pkru into the PKRU register.
> Signal restore would restore PKRU to its previous value.

Do you really mean "its previous value" or are you OK with the existing
behavior which restores PKRU from the XSAVE buffer in the sigcontext?

> WRPKRU would, of course, override baseline_pkru, but it wouldn't
> change baseline_pkru. The set_protection_key syscall would modify
> *both* real PKRU and baseline_pkru.

How about this:

We make baseline_pkru a process-wide baseline and store it in
mm->context. That way, no matter which thread gets interrupted for a
signal, they see consistent values. We only write to it when an app
_specifically_ asks for it to be updated with a special flag to
sys_pkey_set().

When an app uses the execute-only support, we implicitly set the
read-disable bit in baseline_pkru for the execute-only pkey.

2015-12-21 23:05:47

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/21/2015 03:02 PM, Andy Lutomirski wrote:
> On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen
> <[email protected]> wrote:
>> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>>> Perhaps this is silly, but what if the default were changed to deny
>>> reads and writes for unallocated keys? Is there a use case that
>>> breaks?
>>
>> It's probably a reasonable debugging feature.
>>
>> But, anything that takes an XSAVE feature out of its "init state" has
>> the potential to do a bit of harm because it increases the potential
>> size of writes during XSAVE. XSAVEOPT will _help_ here, but we probably
>> don't want to go out of our way to take things out of the init state
>> when we're unsure of the benefits.
>
> Aren't you already doing that with your magic execute-only thing?

Yep, but that's with a concrete benefit in mind.

> Also, if we ever do the deferred-xstate-restore thing that Rik was
> playing with awhile back, then we'll want to switch to using rdpkru
> and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
> the XRSTOR and XSAVEOPT state, and this particular issue will become
> irrelevant.

Yep, agreed.

2015-12-21 23:07:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On Mon, Dec 21, 2015 at 3:04 PM, Dave Hansen
<[email protected]> wrote:
> On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
> ...
>>> I could imagine that some kernel person would want to use even more
>>> keys, but I think two fixed keys are kind of the minimal we'd want to
>>> use.
>>
>> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
>> Let me be a bit more concrete about what I'm suggesting:
>>
>> We'd have thread_struct.baseline_pkru. It would start with key 0
>> allowing all access and key 1 denying reads.
>
> Are you sure thread_struct is the right place for this? I think of
> signal handlers as a process-wide thing, and it seems a bit goofy if we
> have the PKRU value in a signal handler depend on the PKRU of the thread
> that got interrupted.

I think you're right. mmu_context_t might be a better choice.

>
>> We'd have a syscall like set_protection_key that could allocate unused
>> keys and change the values of keys that have been allocated. Those
>> changes would be reflected in baseline_pkru. Changes to keys 0 and 1
>> in baseline_pkru would not be allowed.
>
> FWIW, I think we can do this without *actually* dedicating key 1 to
> execute-only. But that's a side issue.
>
>> Signal delivery would load baseline_pkru into the PKRU register.
>> Signal restore would restore PKRU to its previous value.
>
> Do you really mean "its previous value" or are you OK with the existing
> behavior which restores PKRU from the XSAVE buffer in the sigcontext?

By "its previous value" I meant the value in the XSAVE buffer in the
sigcontext. So I think I'm okay with that :)

>
>> WRPKRU would, of course, override baseline_pkru, but it wouldn't
>> change baseline_pkru. The set_protection_key syscall would modify
>> *both* real PKRU and baseline_pkru.
>
> How about this:
>
> We make baseline_pkru a process-wide baseline and store it in
> mm->context. That way, no matter which thread gets interrupted for a
> signal, they see consistent values. We only write to it when an app
> _specifically_ asks for it to be updated with a special flag to
> sys_pkey_set().
>
> When an app uses the execute-only support, we implicitly set the
> read-disable bit in baseline_pkru for the execute-only pkey.

Sounds good, I think.

--Andy

2015-12-29 23:48:29

by Dave Hansen

[permalink] [raw]
Subject: Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?

On 12/18/2015 01:45 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen
> <[email protected]> wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?

Right now, all I plan to do is make it a one-way trip: if a process does
a prot=PROT_EXEC mapping, we dedicate a key local to that process, and
it gets 14 usable keys. If it doesn't use prot=PROT_EXEC, then it gets
15 usable keys.

> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved". And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.

The one constant I've heard from the folks that are going to use this is
that 15 keys is not enough. That's why I'm hesitant to remove _any_ more.

> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
> (a) to make sure that it gets the right fault when user space passes
> in a execute-only address to a system call.

Having a dedicated or static key for execute-only doesn't really change
this code. We just have one extra step to go look in the mm->context
and see which pkey (if any) is assigned to be execute-only in the fault
code.

> (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.

The current hardware only applies the keys on _PAGE_USER mappings, so we
can't use it for kernel mappings.