[new thread because this sort of combines two threads]
There is recent interest in having a way to turn generally-available
kernel features off. Maybe we should add a good one so we can stop
bikeshedding and avoid proliferating dumb interfaces.
Things that might want to be turn-off-able include:
- getrandom with GRND_RANDOM [from the getrandom threads]
- Any lookup of a non-self pid [from the capsicum thread]
- Any lookup of a pid outside the caller thread group [capsicum]
- Various architectural things (personal wishlist), e.g.:
- RDTSC and userspace HPET access
- CPUID?
- 32-bit GDT code segments [huge attack surface]
- 64-bit GDT code segments [probably pointless]
I would propose a new syscall for this:
long restrict_userspace(int mode, int type, int value, int flags);
mode is RESTRICT_SET, RESTRICT_GET, or RESTRICT_LOCK.
type is RESTRICT_GRND_RANDOM, RESTRICT_PID_SCOPE, RESTRICT_X86_TIMING, etc.
value is zero for RESTRICT_GET. Otherwise value is the desired value,
generally 0 or 1. For RESTRICT_PID_SCOPE, value would be
RESTRICT_PID_SCOPE_ANY, RESTRICT_PID_SCOPE_THREADGROUP, or
RESTRICT_PID_SCOPE_SELF.
flags must be zero. Someday, someone will propose a thread-sync flag.
restrict_userspace requires either no_new_privs or CAP_SYS_ADMIN in
the current user namespace.
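
For illustration only, the semantics proposed above could be modelled in userspace roughly as follows. This is a sketch, not kernel code: restrict_userspace does not exist, every constant value here is made up, and the behaviour of RESTRICT_LOCK (freeze one type so later RESTRICT_SET calls fail with -EPERM) is an assumption, since the proposal does not spell it out. The no_new_privs / CAP_SYS_ADMIN check is not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical constants; only the names come from the proposal. */
enum { RESTRICT_SET, RESTRICT_GET, RESTRICT_LOCK };
enum { RESTRICT_GRND_RANDOM, RESTRICT_PID_SCOPE, RESTRICT_X86_TIMING,
       RESTRICT_NR_TYPES };
enum { RESTRICT_PID_SCOPE_ANY, RESTRICT_PID_SCOPE_THREADGROUP,
       RESTRICT_PID_SCOPE_SELF };

/* Per-process restriction state (inherited by children in the proposal). */
struct restrict_state {
    int value[RESTRICT_NR_TYPES];
    int locked[RESTRICT_NR_TYPES];
};

/*
 * Userspace model of the proposed call. Returns the current value for
 * RESTRICT_GET, 0 on success otherwise, or a negative errno.
 */
long restrict_userspace_model(struct restrict_state *st,
                              int mode, int type, int value, int flags)
{
    assert(st);
    if (flags != 0)                     /* "flags must be zero" */
        return -EINVAL;
    if (type < 0 || type >= RESTRICT_NR_TYPES)
        return -EINVAL;

    switch (mode) {
    case RESTRICT_GET:
        /* "value is zero for RESTRICT_GET" */
        return value == 0 ? st->value[type] : -EINVAL;
    case RESTRICT_SET:
        if (st->locked[type])           /* assumed meaning of a lock */
            return -EPERM;
        if (value < 0)
            return -EINVAL;
        st->value[type] = value;
        return 0;
    case RESTRICT_LOCK:
        st->locked[type] = 1;           /* value is ignored here (assumption) */
        return 0;
    default:
        return -EINVAL;
    }
}
```

A caller would set RESTRICT_PID_SCOPE to RESTRICT_PID_SCOPE_SELF, lock it, and from then on any attempt to widen the scope again would be refused.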
Thoughts?
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On Fri, Jul 25, 2014 at 11:30:48AM -0700, Andy Lutomirski wrote:
> There is recent interest in having a way to turn generally-available
> kernel features off. Maybe we should add a good one so we can stop
> bikeshedding and avoid proliferating dumb interfaces.
>
> Things that might want to be turn-off-able include:
> - getrandom with GRND_RANDOM [from the getrandom threads]
> - Any lookup of a non-self pid [from the capsicum thread]
> - Any lookup of a pid outside the caller thread group [capsicum]
> - Various architectural things (personal wishlist), e.g.:
> - RDTSC and userspace HPET access
> - CPUID?
> - 32-bit GDT code segments [huge attack surface]
> - 64-bit GDT code segments [probably pointless]
I'm not sure there's value in disabling the cpuid dev interface
when the instruction is unprivileged.
> I would propose a new syscall for this:
>
> long restrict_userspace(int mode, int type, int value, int flags);
Do the restrictions happen system-wide, as in, say, SELinux,
or only within the calling process, like seccomp?
Dave
On Fri, Jul 25, 2014 at 1:15 PM, Dave Jones <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 11:30:48AM -0700, Andy Lutomirski wrote:
>
> > There is recent interest in having a way to turn generally-available
> > kernel features off. Maybe we should add a good one so we can stop
> > bikeshedding and avoid proliferating dumb interfaces.
> >
> > Things that might want to be turn-off-able include:
> > - getrandom with GRND_RANDOM [from the getrandom threads]
> > - Any lookup of a non-self pid [from the capsicum thread]
> > - Any lookup of a pid outside the caller thread group [capsicum]
> > - Various architectural things (personal wishlist), e.g.:
> > - RDTSC and userspace HPET access
> > - CPUID?
> > - 32-bit GDT code segments [huge attack surface]
> > - 64-bit GDT code segments [probably pointless]
>
> I'm not sure there's value in disabling cpuid dev interface,
> when the instruction is unprivileged.
I meant the CPUID instruction. Some CPUs have a setting that turns
off the CPUID instruction for user code. In principle, all VMs can do
this, too, if the hypervisor would be kind enough to help out.
I only mentioned the x86 stuff here to make the point that there are
quite a few possibilities along these lines. There's actually already
a way to turn off RDTSC, but it's not currently very useful because it
doesn't do the right thing for the vDSO. That could be fixed, but
there's certainly no reason to make any of the other stuff here wait
for that.
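
The existing RDTSC switch is presumably prctl(PR_SET_TSC); the message does not name it, so that identification is an assumption. A minimal sketch of querying and setting it on x86 might look like this (on other architectures the prctl is expected to fail, which the helper reports as -1):

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * Returns PR_TSC_ENABLE or PR_TSC_SIGSEGV for the calling thread,
 * or -1 if the kernel/architecture does not support PR_GET_TSC.
 */
int get_tsc_mode(void)
{
    int mode = 0;

    if (prctl(PR_GET_TSC, &mode) != 0)
        return -1;
    assert(mode == PR_TSC_ENABLE || mode == PR_TSC_SIGSEGV);
    return mode;
}

/*
 * Make RDTSC raise SIGSEGV in the calling thread. Returns 0 on success,
 * -1 on failure. As noted above, code that reaches the TSC through the
 * vDSO is not handled gracefully by this switch.
 */
int disable_tsc(void)
{
    return prctl(PR_SET_TSC, PR_TSC_SIGSEGV);
}
```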
>
> > I would propose a new syscall for this:
> >
> > long restrict_userspace(int mode, int type, int value, int flags);
>
> do the restrictions happen system-wide like in say SELinux,
> or only within the calling process, like seccomp ?
>
The calling process and children, like seccomp.
--Andy
On Fri, 25 Jul 2014 11:30:48 -0700
Andy Lutomirski <[email protected]> wrote:
> [new thread because this sort of combines two threads]
>
> There is recent interest in having a way to turn generally-available
> kernel features off. Maybe we should add a good one so we can stop
> bikeshedding and avoid proliferating dumb interfaces.
We sort of have one. It's called capable(). Just needs extending to cover
anything else you care about, and probably all the numeric constants
replacing with textual names.
Alan
On Fri, Jul 25, 2014 at 2:35 PM, One Thousand Gnomes
<[email protected]> wrote:
> On Fri, 25 Jul 2014 11:30:48 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>> [new thread because this sort of combines two threads]
>>
>> There is recent interest in having a way to turn generally-available
>> kernel features off. Maybe we should add a good one so we can stop
>> bikeshedding and avoid proliferating dumb interfaces.
>
> We sort of have one. It's called capable(). Just needs extending to cover
> anything else you care about, and probably all the numeric constants
> replacing with textual names.
>
Except that it's all backwards: these are things that default to *on*,
and people might want them to turn off. capable() is totally fscked
if you want otherwise unprivileged users to carry capabilities around,
and fixing it seems to run into endless claims that the "capability"
system is carefully designed, flawless, perfect, ideal, amazing, and
shouldn't be changed, despite the fact that it's empirically damn near
useless.
Also, capabilities do the wrong thing wrt namespaces. The things I'm
talking about aren't namespaced. They're either on or off.
--Andy
On 07/25/2014 11:30 AM, Andy Lutomirski wrote:
> - 32-bit GDT code segments [huge attack surface]
> - 64-bit GDT code segments [probably pointless]
I presume you mean s/GDT/LDT/.
We already don't allow 64-bit LDT code segments. Also, it is unclear to
me how 32-bit LDT segments have a huge attack surface, given that there
will realistically always be a 32-bit *GDT* segment present.
-hpa
On Fri, Jul 25, 2014 at 4:43 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/25/2014 11:30 AM, Andy Lutomirski wrote:
>> - 32-bit GDT code segments [huge attack surface]
>> - 64-bit GDT code segments [probably pointless]
>
> I presume you mean s/GDT/LDT/.
>
> We already don't allow 64-bit LDT code segments. Also, it is unclear to
> me how 32-bit LDT segments have a huge attack surface, given that there
> will realistically always be a 32-bit *GDT* segment present.
I really did mean GDT :) Setting the 32-bit code segment to "not
present" (and using seccomp to block modify_ldt) prevents any attempt
to exploit bugs in the sysenter and cstar code. It also might prevent
exploiting CPU bugs, although I've never heard of a relevant CPU bug
in this area.
If I actually tried to implement this (which wouldn't be part of the
initial implementation), I'd split out the unusual things in
__switch_to and friends to a slow path that's only used if weird
settings are present (e.g. this, TSC restrictions, etc). But
twiddling the present bit on a GDT entry is very fast, I assume --
it's just memory, and I don't think that any flush is needed.
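
To make the "it's just memory" point concrete: in an x86 segment descriptor the present (P) flag is bit 47 of the 8-byte entry, so toggling it is a single read-modify-write. The sketch below works on a plain integer rather than a live GDT; the helper names are made up for illustration, and nothing here says how the kernel would actually do it.

```c
#include <assert.h>
#include <stdint.h>

/* Present (P) flag: bit 7 of the access byte, i.e. bit 47 of the entry. */
#define DESC_PRESENT (UINT64_C(1) << 47)

/* Returns 1 if the descriptor is marked present, 0 otherwise. */
static inline int desc_is_present(uint64_t desc)
{
    return (desc & DESC_PRESENT) != 0;
}

/* Returns a copy of desc with the present flag set or cleared. */
static inline uint64_t desc_set_present(uint64_t desc, int present)
{
    assert(present == 0 || present == 1);
    return present ? (desc | DESC_PRESENT) : (desc & ~DESC_PRESENT);
}
```

For a flat 32-bit user code descriptor such as 0x00cffb000000ffff, clearing the flag changes only the access byte, from 0xfb to 0x7b.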
Also, if I implement this, I will curse Xen. I might even go so far
as to disable the feature entirely if there's a paravirt GDT.
Hmm. A separate flag to turn int $0x80 into GPF could have some value, too.
--Andy
Andy Lutomirski <[email protected]> writes:
> On Fri, Jul 25, 2014 at 2:35 PM, One Thousand Gnomes
> <[email protected]> wrote:
>> On Fri, 25 Jul 2014 11:30:48 -0700
>> Andy Lutomirski <[email protected]> wrote:
>>
>>> [new thread because this sort of combines two threads]
>>>
>>> There is recent interest in having a way to turn generally-available
>>> kernel features off. Maybe we should add a good one so we can stop
>>> bikeshedding and avoid proliferating dumb interfaces.
>>
>> We sort of have one. It's called capable(). Just needs extending to cover
>> anything else you care about, and probably all the numeric constants
>> replacing with textual names.
The big difference is that capable only subdivides root's powers (aka things
most applications should not have). When we start talking about things
that are safe for most applications, capable is probably not
the right tool for the job.
A much closer match is the personality system call. Look at setarch
to see how it is used. My biggest concern with personality is there
are only 32 bits to play with. Still, I expect what you want may be a
sandbox personality, that disables everything that could possibly be
a problem (including access to the personality syscall).
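
As a rough sketch of how personality flags are used today (this is what setarch -R does for ADDR_NO_RANDOMIZE), the helpers below query the current persona and OR in one flag. A "sandbox" flag does not exist, so none is defined here; the helper names are made up for illustration.

```c
#include <assert.h>
#include <sys/personality.h>

/* Query the current persona without changing it; -1 on error. */
long query_persona(void)
{
    return personality(0xffffffff);
}

/* Pure helper: is every bit of flag set in persona? */
int persona_has_flag(unsigned long persona, unsigned long flag)
{
    assert(flag != 0);
    return (persona & flag) == flag;
}

/*
 * Add one flag to the current persona. The new persona is inherited
 * across fork and execve. Returns 0 on success, -1 on error.
 */
int persona_add_flag(unsigned long flag)
{
    long cur = query_persona();

    if (cur < 0)
        return -1;
    return personality((unsigned long)cur | flag) < 0 ? -1 : 0;
}
```

Calling persona_add_flag(ADDR_NO_RANDOMIZE) before an execve is the setarch -R pattern; all flags have to share the one 32-bit word the concern above is about.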
Eric
On Fri, Jul 25, 2014 at 7:30 PM, Andy Lutomirski <[email protected]> wrote:
> [new thread because this sort of combines two threads]
>
> There is recent interest in having a way to turn generally-available
> kernel features off. Maybe we should add a good one so we can stop
> bikeshedding and avoid proliferating dumb interfaces.
>
> Things that might want to be turn-off-able include:
> - getrandom with GRND_RANDOM [from the getrandom threads]
> - Any lookup of a non-self pid [from the capsicum thread]
> - Any lookup of a pid outside the caller thread group [capsicum]
> - Various architectural things (personal wishlist), e.g.:
> - RDTSC and userspace HPET access
> - CPUID?
> - 32-bit GDT code segments [huge attack surface]
> - 64-bit GDT code segments [probably pointless]
>
> I would propose a new syscall for this:
>
> long restrict_userspace(int mode, int type, int value, int flags);
>
> mode is RESTRICT_SET, RESTRICT_GET, or RESTRICT_LOCK.
>
> type is RESTRICT_GRND_RANDOM, RESTRICT_PID_SCOPE, RESTRICT_X86_TIMING, etc.
>
> Value is zero if RESTRICT_GET. Otherwise value is the desired value,
> generally 0 or 1. For RESTRICT_PID_SCOPE, value would be
> RESTRICT_PID_SCOPE_ANY, RESTRICT_PID_SCOPE_THREADGROUP, or
> RESTRICT_PID_SCOPE_SELF.
>
> flags must be zero. Someday, someone will propose a thread-sync flag.
Today, me: proposed :-)
> restrict_userspace requires either no_new_privs or CAP_SYS_ADMIN in
> the current user namespace.
>
> Thoughts?
>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
On Fri, Jul 25, 2014 at 11:30:48AM -0700, Andy Lutomirski wrote:
>
> There is recent interest in having a way to turn generally-available
> kernel features off. Maybe we should add a good one so we can stop
> bikeshedding and avoid proliferating dumb interfaces.
I believe the seccomp infrastructure (which is already upstream)
should be able to do most of what you want, at least with respect to
features which are exposed via system calls (which was most of your
list).
It won't cover x86 specific things like restricting RDTSC or CPUID
(and as far as I know you can't intercept the CPUID instruction), but
I'm not sure it matters. I don't really see the point, myself.
- Ted
On Jul 27, 2014 5:06 PM, "Theodore Ts'o" <[email protected]> wrote:
>
> On Fri, Jul 25, 2014 at 11:30:48AM -0700, Andy Lutomirski wrote:
> >
> > There is recent interest in having a way to turn generally-available
> > kernel features off. Maybe we should add a good one so we can stop
> > bikeshedding and avoid proliferating dumb interfaces.
>
> I believe the seccomp infrastructure (which is already upstream)
> should be able to do most of what you want, at least with respect to
> features which are exposed via system calls (which was most of your
> list).
Seccomp can't really restrict lookups of non-self pids. In fact, this
feature idea started out as a response to a patch adding a kind of
nasty seccomp feature to make it sort of possible.
I agree that seccomp can turn off GRND_RANDOM, but how is it
supposed to do it in such a way that the filtered software will fall
back to something sensible? -ENOSYS? -EPERM? Something else?
I think that -ENOSYS is clearly wrong, but standardizing this would be
nice. Admittedly, adding something fancy like this for GRND_RANDOM
may not be appropriate.
--Andy
>
> It won't cover x86 specific things like restricting RDTSC or CPUID
> (and as far as I know you can't intercept the CPUID instruction), but
> I'm not sure it matters. I don't really see the point, myself.
>
> - Ted
Andy Lutomirski <[email protected]> writes:
> On Jul 27, 2014 5:06 PM, "Theodore Ts'o" <[email protected]> wrote:
>>
>> On Fri, Jul 25, 2014 at 11:30:48AM -0700, Andy Lutomirski wrote:
>> >
>> > There is recent interest in having a way to turn generally-available
>> > kernel features off. Maybe we should add a good one so we can stop
>> > bikeshedding and avoid proliferating dumb interfaces.
>>
>> I believe the seccomp infrastructure (which is already upstream)
>> should be able to do most of what you want, at least with respect to
>> features which are exposed via system calls (which was most of your
>> list).
>
> Seccomp can't really restrict lookups of non-self pids. In fact, this
> feature idea started out as a response to a patch adding a kind of
> nasty seccomp feature to make it sort of possible.
>
> I agree that that seccomp can turn off GRND_RANDOM, but how is it
> supposed to do it in such a way that the filtered software will fall
> back to something sensible? -ENOSYS? -EPERM? Something else?
>
> I think that -ENOSYS is clearly wrong, but standardizing this would be
> nice. Admittedly, adding something fancy like this for GRND_RANDOM
> may not be appropriate.
Andy, you seem to be arguing here for two system calls:
get_urandom() and get_random().
Where get_urandom only blocks if there is not enough starting entropy,
and get_random(GRND_RANDOM) blocks if there is currently not enough
entropy.
That would allow -ENOSYS to be the right return value and it would
simplify things for everyone.
Eric
> Andy you seem to be arguing here for two system calls.
> get_urandom() and get_random().
>
> Where get_urandom only blocks if there is not enough starting entropy,
> and get_random(GRND_RANDOM) blocks if there is currently not enough
> entropy.
>
> That would allow -ENOSYS to be the right return value and it would
> simply things for everyone.
So you replace the "no file handle" special case with the "unsupported or
disabled syscall" special case, which is even harder to test.
Interfaces have failure modes. People who can't deal with that shouldn't
be writing code that does anything important in languages which don't
handle it for them.
Alan
> > We sort of have one. It's called capable(). Just needs extending to cover
> > anything else you care about, and probably all the numeric constants
> > replacing with textual names.
> >
>
> Except that it's all backwards: these are things that default to *on*,
> and people might want them to turn off. capable() is totally fscked
> if you want otherwise unprivileged users to carry capabilities around
The userspace API is, but capable() as a userspace API and capable() as
an in kernel check are only connected by history.
For the in kernel part you can either teach everyone another disjoint API
or we can have a single API in kernel for saying "is XYZ allowed".
One Thousand Gnomes <[email protected]> writes:
>> Andy you seem to be arguing here for two system calls.
>> get_urandom() and get_random().
>>
>> Where get_urandom only blocks if there is not enough starting entropy,
>> and get_random(GRND_RANDOM) blocks if there is currently not enough
>> entropy.
>>
>> That would allow -ENOSYS to be the right return value and it would
>> simply things for everyone.
>
> So you replace the "no file handle" special case with the "unsupported or
> disabled syscall" special case, which is even harder to test.
>
> Interfaces have failure modes. People who can't deal with that shouldn't
> be writing code that does anything important in languages which don't
> handle it for them.
Perhaps I misread the earlier conversation, but in what I have read of
this discussion people want to disable some of get_random()'s modes with
seccomp. Today get_random does not have any failure codes defined except
-ENOSYS.
get_random(0) succeeding and get_random(GRND_RANDOM) returning -ENOSYS
has every chance of causing applications to legitimately assume the
get_random system call is not available in any mode.
So the code either needs a defined error code for bad flags (-EINVAL) or
we need to split the syscall in two. Now that I think about it having
the seccomp filter return -EINVAL if it doesn't like the parameter is
better than splitting a syscall. Presumably that is what
get_random(UNSUPPORTED_FLAG) returns.
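
A seccomp filter along those lines could be sketched as below. It assumes the syscall is the one merged as getrandom(2) (the thread writes get_random), that its flags are the third argument, and a little-endian machine so that the low 32 bits of args[2] sit at the start of the field. A production filter should also check seccomp_data.arch before trusting the syscall number; that check is left out to keep the sketch short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/random.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Fail getrandom(..., GRND_RANDOM) with EINVAL; allow everything else. */
static struct sock_filter grnd_filter[] = {
    /* [0] A = syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not getrandom: skip ahead to the ALLOW at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getrandom, 0, 3),
    /* [2] A = low 32 bits of the flags argument (little-endian layout) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, args[2])),
    /* [3] GRND_RANDOM set: fall through to [4], else skip to [5] */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, GRND_RANDOM, 0, 1),
    /* [4] refuse with -EINVAL, as for an unsupported flag */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EINVAL),
    /* [5] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter for this thread. Returns 0 on success, -1 on error. */
static int install_grnd_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(grnd_filter) / sizeof(grnd_filter[0]),
        .filter = grnd_filter,
    };

    /* The errno value must fit in the SECCOMP_RET_DATA bits. */
    assert((EINVAL & ~SECCOMP_RET_DATA) == 0);

    /* An unprivileged caller must set no_new_privs before filtering. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

With the filter installed, getrandom(buf, n, 0) keeps working while getrandom(buf, n, GRND_RANDOM) fails with EINVAL, so callers cannot mistake it for a missing syscall.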
Eric
On Wed, 30 Jul 2014 11:41:41 -0700
[email protected] (Eric W. Biederman) wrote:
> One Thousand Gnomes <[email protected]> writes:
>
> >> Andy you seem to be arguing here for two system calls.
> >> get_urandom() and get_random().
> >>
> >> Where get_urandom only blocks if there is not enough starting entropy,
> >> and get_random(GRND_RANDOM) blocks if there is currently not enough
> >> entropy.
> >>
> >> That would allow -ENOSYS to be the right return value and it would
> >> simply things for everyone.
> >
> > So you replace the "no file handle" special case with the "unsupported or
> > disabled syscall" special case, which is even harder to test.
> >
> > Interfaces have failure modes. People who can't deal with that shouldn't
> > be writing code that does anything important in languages which don't
> > handle it for them.
>
> Perhaps I misread the earlier conversation but it what I have read of
> this discussion people want to disable some of get_random() modes with
> seccomp. Today get_random does not have any failure codes define except
> -ENOSYS.
>
> get_random(0) succeeding and get_random(GRND_RANDOM) returning -ENOSYS
> has every chance of causing applications to legitimately assume the
> get_random system call is not available in any mode.
Or more likely it'll be used like this
get_random(foo); /* always works */
Now the existing failure mode is
    open(...)
    /* forget the check */
    read()
    /* forget the check */
and triggered by evil local attacks on file handles. The "improved"
behaviour is unchecked -ENOSYS returns which are likely to occur
systemically when users run stuff on old kernels, in VMs with it off etc.
So you've swapped the odd evil user attack on a single target for the
likelihood of mass generation of flawed keys with no error reporting.
In fact you could do a better job of the whole mess in libc rather than
the kernel, because in libc you'd write it like this
    fd = open(...);
    if (fd < 0)
        kill(getpid(), 9);
    if (read(fd, ...) < expected)
        kill(getpid(), 9);
    close(fd);
and
a) on an older library you'd get a good failure (unable to execute the
binary)
b) on a newer system you'd get "do or die" behaviour and can improve its
robustness as desired
Alan
One Thousand Gnomes <[email protected]> writes:
> On Wed, 30 Jul 2014 11:41:41 -0700
> [email protected] (Eric W. Biederman) wrote:
>
>> One Thousand Gnomes <[email protected]> writes:
>>
>> >> Andy you seem to be arguing here for two system calls.
>> >> get_urandom() and get_random().
>> >>
>> >> Where get_urandom only blocks if there is not enough starting entropy,
>> >> and get_random(GRND_RANDOM) blocks if there is currently not enough
>> >> entropy.
>> >>
>> >> That would allow -ENOSYS to be the right return value and it would
>> >> simply things for everyone.
>> >
>> > So you replace the "no file handle" special case with the "unsupported or
>> > disabled syscall" special case, which is even harder to test.
>> >
>> > Interfaces have failure modes. People who can't deal with that shouldn't
>> > be writing code that does anything important in languages which don't
>> > handle it for them.
>>
>> Perhaps I misread the earlier conversation but it what I have read of
>> this discussion people want to disable some of get_random() modes with
>> seccomp. Today get_random does not have any failure codes define except
>> -ENOSYS.
>>
>> get_random(0) succeeding and get_random(GRND_RANDOM) returning -ENOSYS
>> has every chance of causing applications to legitimately assume the
>> get_random system call is not available in any mode.
>
> Or more likely it'll be used like this
>
> get_random(foo); /* always works */
>
>
> Now the existing failure mode is is
>
> open(...)
> /* forget the check */
> read()
> /* forget the check */
>
> and triggered by evil local attacks on file handles. The "improved"
> behaviour is unchecked -ENOSYS returns which are likely to occur
> systemically when users run stuff on old kernels, in vm's with it off etc.
>
> So you've swapped the odd evil user attack on a single target for the
> likelyhood of mass generation of flawed keys with no error reporting.
>
> In fact you could do a better job of the whole mess in libc rather than
> the kernel, because in libc you'd write it like this
>
> if (open(.. ) < 0)
> kill(getpid(), 9);
> if (read(...) < expected)
> kill(getpid(), 9);
> close(fd);
>
> and
> a) on an older library you'd get a good failure (unable to execute the
> binary)
> b) on a newer system you'd get "do or die" behaviour and can improve its
> robustness as desired
I have said enough about the silliness of disabling this syscall with
seccomp or related infrastructure.
The aspect I like about get_random() is that it will silence the
requests from people to enable binary sysctl support in the kernel
just so they can get random numbers when /dev/random and /dev/urandom
are absent in their chroots.
sysctl(2) is finally, legitimately fading away.
Eric