2018-01-06 19:33:33

by Avi Kivity

Subject: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

Meltdown and Spectre mitigations focus on protecting the kernel from a
hostile userspace. However, it's not a given that the kernel is the most
important target in the system. It is common in server workloads that a
single userspace application contains the valuable data on a system, and
if it were hostile, the game would already be over, without the need to
compromise the kernel.


In these workloads, a single application performs most system calls, and
so it pays the cost of protection, without benefiting from it directly
(since it is the target, rather than the kernel).


I propose to create a new capability, CAP_PAYLOAD, that allows the
system administrator to designate an application as the main workload in
that system. Other processes (like sshd or monitoring daemons) exist to
support it, and so it makes sense to protect the rest of the system from
their being compromised.


When the kernel switches to user mode of a CAP_PAYLOAD process, it
doesn't switch page tables and instead leaves the kernel mapped into the
address space (still with supervisor protection, of course). This
reduces context switch cost, and will also reduce interrupt costs if the
interrupt happens while that process executes. Since a CAP_PAYLOAD
process is likely to consume the majority of CPU time, the costs
associated with Meltdown mitigation are almost completely nullified.
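
To make the mechanism above concrete, here is a minimal user-space C model of the decision the return-to-user path would have to make. It is only a sketch under stated assumptions: `struct task_model`, `CAP_PAYLOAD_BIT` and `must_switch_page_tables()` are hypothetical names, not kernel symbols, and the bit number is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit number for the proposed capability (not a real value). */
#define CAP_PAYLOAD_BIT 63

/* Minimal stand-in for the per-task state the exit path would consult. */
struct task_model {
    uint64_t cap_effective;   /* bitmask of effective capabilities */
};

/*
 * Returns true if the return-to-user path has to switch to the separate
 * user page tables, false if the kernel may stay mapped (supervisor-only).
 */
static bool must_switch_page_tables(const struct task_model *t, bool pti_enabled)
{
    assert(t != NULL);

    if (!pti_enabled)
        return false;   /* PTI off system-wide: nothing to switch */

    /* With PTI on, only tasks lacking the payload bit pay for the switch. */
    return !(t->cap_effective & (UINT64_C(1) << CAP_PAYLOAD_BIT));
}
```

In this model a task with the payload bit set behaves as if PTI were off, while every other task keeps the full protection.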


CAP_PAYLOAD has potential to be abused; every software vendor will be
absolutely certain that their application is the reason the universe
(let alone that server) exists and they will turn it on, so init systems
will have to work to make sure it doesn't get turned on without administrator
opt-in. It's also not perfect, since if the payload application is
compromised, ssh keys can be stolen in addition to the application's
data. But I think it's better than having to choose between
significantly reduced performance and security. You get performance for
your important application, and protection against the possibility that
a remote exploit against a supporting process turns into a remote
exploit against that important application.




2018-01-06 20:02:36

by Alan Cox

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

> I propose to create a new capability, CAP_PAYLOAD, that allows the
> system administrator to designate an application as the main workload in
> that system. Other processes (like sshd or monitoring daemons) exist to
> support it, and so it makes sense to protect the rest of the system from
> their being compromised.

Much more general would be to do this with cgroups both for group-group
trust and group-kernel trust levels.

Alan

2018-01-06 20:24:38

by Willy Tarreau

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

Hi Avi,

On Sat, Jan 06, 2018 at 09:33:28PM +0200, Avi Kivity wrote:
> Meltdown and Spectre mitigations focus on protecting the kernel from a
> hostile userspace. However, it's not a given that the kernel is the most
> important target in the system. It is common in server workloads that a
> single userspace application contains the valuable data on a system, and if
> it were hostile, the game would already be over, without the need to
> compromise the kernel.
>
> In these workloads, a single application performs most system calls, and so
> it pays the cost of protection, without benefiting from it directly (since
> it is the target, rather than the kernel).

Definitely :-)

> I propose to create a new capability, CAP_PAYLOAD, that allows the system
> administrator to designate an application as the main workload in that
> system. Other processes (like sshd or monitoring daemons) exist to support
> it, and so it makes sense to protect the rest of the system from their being
> compromised.

Initially I was thinking about letting applications disable PTI using
prctl() when running under a certain capability (I initially thought
about CAP_SYS_ADMIN though I changed my mind). One advantage of
proceeding like this is that it would have to be explicitly implemented
in the application, which limits the risk of running by default.

I later thought that we could use CAP_SYS_RAWIO for this, given that such
processes already have access to the hardware anyway. We could even
imagine not switching the page tables on such a capability without
requiring prctl(), though it would mean that processes running as root
(as is often found on a number of servers) would automatically present
a risk for the system. But maybe CAP_SYS_RAWIO + prctl() could be a good
solution.
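
A rough model of the "capability + prctl()" combination, written as plain user-space C, might look like the following. This is a sketch only: `struct task_model`, the `pti_disabled` flag and `prctl_disable_pti_model()` are hypothetical and stand in for whatever a real patch would add. The only real value is the bit number 17, which is CAP_SYS_RAWIO in <linux/capability.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CAP_SYS_RAWIO is capability number 17 in <linux/capability.h>. */
#define CAP_SYS_RAWIO_BIT 17

/* Minimal stand-in for per-task state. */
struct task_model {
    uint64_t cap_effective;   /* bitmask of effective capabilities */
    bool pti_disabled;        /* hypothetical per-task opt-out flag */
};

/*
 * Hypothetical handler for a "disable PTI for this task" prctl() request.
 * Holding the capability alone changes nothing; the task must also ask.
 */
static int prctl_disable_pti_model(struct task_model *t)
{
    assert(t != NULL);

    if (!(t->cap_effective & (UINT64_C(1) << CAP_SYS_RAWIO_BIT)))
        return -EPERM;        /* caller lacks the required capability */

    t->pti_disabled = true;   /* explicit, per-task opt-out */
    return 0;
}
```

The point of the design is that a root process which never issues the request keeps full protection, so nothing is switched off by default.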

I'm interested in helping to work on such a solution, given
that haproxy is severely impacted by "pti=on" and that for now we'll
have to run with "pti=off" on the whole system until a more suitable
solution is found.

I'd rather not rush anything and let things calm down for a while to
avoid adding disturbance to the current situation. But I'm willing to
continue this discussion and even test patches.

Cheers,
Willy

2018-01-07 09:14:31

by Avi Kivity

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs



On 01/06/2018 10:24 PM, Willy Tarreau wrote:
> Hi Avi,
>
> On Sat, Jan 06, 2018 at 09:33:28PM +0200, Avi Kivity wrote:
>> Meltdown and Spectre mitigations focus on protecting the kernel from a
>> hostile userspace. However, it's not a given that the kernel is the most
>> important target in the system. It is common in server workloads that a
>> single userspace application contains the valuable data on a system, and if
>> it were hostile, the game would already be over, without the need to
>> compromise the kernel.
>>
>> In these workloads, a single application performs most system calls, and so
>> it pays the cost of protection, without benefiting from it directly (since
>> it is the target, rather than the kernel).
> Definitely :-)
>
>> I propose to create a new capability, CAP_PAYLOAD, that allows the system
>> administrator to designate an application as the main workload in that
>> system. Other processes (like sshd or monitoring daemons) exist to support
>> it, and so it makes sense to protect the rest of the system from their being
>> compromised.
> Initially I was thinking about letting applications disable PTI using
> prctl() when running under a certain capability (I initially thought
> about CAP_SYSADMIN though I changed my mind). One advantage of
> proceeding like this is that it would have to be explicitly implemented
> in the application, which limits the risk of running by default.
>
> I later thought that we could use CAP_RAWIO for this, given that such
> processes already have access to the hardware anyway. We could even
> imagine not switching the page tables on such a capability without
> requiring prctl(), though it would mean that processes running as root
> (as is often found on a number of servers) would automatically present
> a risk for the system. But maybe CAP_RAWIO + prctl() could be a good
> solution.

CAP_SYS_RAWIO is like CAP_PAYLOAD in that both allow you to read stuff you
shouldn't have access to on a vulnerable CPU. But CAP_PAYLOAD won't give
you that access on a non-vulnerable CPU, so it's safer.

The advantage of not requiring prctl() is that it will work on
unmodified applications, requiring only sysadmin intervention (and it's
the sysadmin's role to designate an application as payload, not the
application's).

>
> I'm interested in participating to working on such a solution, given
> that haproxy is severely impacted by "pti=on" and that for now we'll
> have to run with "pti=off" on the whole system until a more suitable
> solution is found.
>
> I'd rather not rush anything and let things calm down for a while to
> avoid adding disturbance to the current situation. But I'm willing to
> continue this discussion and even test patches.
>
>

Then you might want to test
https://www.spinics.net/lists/kernel/msg2689101.html and its companion
patchset https://www.spinics.net/lists/kernel/msg2689134.html, which as
a side effect significantly reduce KPTI impact on C10K applications (and
as their main effect improve their performance).

2018-01-07 09:16:36

by Avi Kivity

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On 01/06/2018 10:02 PM, Alan Cox wrote:
>> I propose to create a new capability, CAP_PAYLOAD, that allows the
>> system administrator to designate an application as the main workload in
>> that system. Other processes (like sshd or monitoring daemons) exist to
>> support it, and so it makes sense to protect the rest of the system from
>> their being compromised.
> Much more general would be to do this with cgroups both for group-group
> trust and group-kernel trust levels.
>

I think capabilities will work just as well with cgroups. The container
manager will set CAP_PAYLOAD for payload containers; and if those run an
init system or a container manager themselves, they'll drop CAP_PAYLOAD
for all processes/sub-containers but their payloads.

2018-01-07 12:29:24

by Theodore Ts'o

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:
> I think capabilities will work just as well with cgroups. The container
> manager will set CAP_PAYLOAD to payload containers; and if those run an init
> system or a container manager themselves, they'll drop CAP_PAYLOAD for all
> process/sub-containers but their payloads.

The reason why cgroups are better is that Spectre can be used to steal
information from within the same privilege level --- e.g., you could
use JavaScript to steal a user's Coindesk credentials or LastPass
data, which is going to be *way* more lucrative than trying to mine
cryptocurrency on the sly in a user's browser. :-)

As a result, you probably want Spectre mitigations to be enabled in a
root process --- which means capabilities aren't the right answer.

Regards,

- Ted

2018-01-07 12:34:34

by Ozgur

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs



07.01.2018, 15:29, "Theodore Ts'o" <[email protected]>:
> On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:
>>  I think capabilities will work just as well with cgroups. The container
>>  manager will set CAP_PAYLOAD to payload containers; and if those run an init
>>  system or a container manager themselves, they'll drop CAP_PAYLOAD for all
>>  process/sub-containers but their payloads.
>
> The reason why cgroups are better is Spectre can be used to steal
> information from within the same privilege level --- e.g., you could
> use Javascript to steal a user's Coindesk credentials or Lastpass
> data, which is going to be *way* more lucrative than trying to mine
> cryptocurrency in the sly in a user's browser. :-)

I think the web coin mining pages also work this way: they probably run JS in the background, but currently it is impossible to do kernel-level operations from there.
All the processes start at the browser level, and Spectre does not read kernel memory, right?

Ozgur

> As a result, you probably want Spectre mitigations to be enabled in a
> root process --- which means capabilities aren't the right answer.
>
> Regards,
>
>                                                 - Ted

2018-01-07 12:52:05

by Avi Kivity

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs



On 01/07/2018 02:29 PM, Theodore Ts'o wrote:
> On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:
>> I think capabilities will work just as well with cgroups. The container
>> manager will set CAP_PAYLOAD to payload containers; and if those run an init
>> system or a container manager themselves, they'll drop CAP_PAYLOAD for all
>> process/sub-containers but their payloads.
> The reason why cgroups are better is Spectre can be used to steal
> information from within the same privilege level --- e.g., you could
> use Javascript to steal a user's Coindesk credentials or Lastpass
> data, which is going to be *way* more lucrative than trying to mine
> cryptocurrency in the sly in a user's browser. :-)
>
> As a result, you probably want Spectre mitigations to be enabled in a
> root process --- which means capabilities aren't the right answer.
>
>

I don't see the connection. The browser wouldn't run with CAP_PAYLOAD set.

On a desktop system, only init retains CAP_PAYLOAD.

On a server that runs one application (and some supporting processes),
only init and that one application have CAP_PAYLOAD (if the sysadmin
makes it so).

On a containerized server that happens to run just one application, init
will retain CAP_PAYLOAD, as well as the process in the container (if the
sysadmin makes it so).

On a containerized server that happens to run just one application,
which itself runs an init system, the two inits will retain CAP_PAYLOAD,
as well as the application process (if the sysadmin makes it so).

2018-01-07 14:36:33

by Alan Cox

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

> I'm interested in participating to working on such a solution, given
> that haproxy is severely impacted by "pti=on" and that for now we'll
> have to run with "pti=off" on the whole system until a more suitable
> solution is found.

I'm still trying to work out what cases there are for this. I can see the
cases for pti-off. I've got minecraft servers for example where there
isn't anyone running untrusted code on the box (*) and the only data of
value is owned by the minecraft processes. If someone gets to the point
pti matters then I already lost.

What I struggle to see is why I'd want to nominate specific processes for
this except in very special cases (like your packet generator). Even then
it would make me nervous, as the packet generator, if trusted that way, is
effectively CAP_SYS_RAWIO or close to it and can steal any ssh keys or
similar on that guest.

I still prefer cgroups because once you include the branch predictions it
suddenly becomes very interesting to be able to say 'this pile of stuff
trusts itself' and avoid user/user protection costs while keeping
user/kernel.

Alan
(*) I am sure that java programs can do sandbox breaking via spectre just
as js can. Bonus points to anyone however who can do spectre through java
from redstone 8)

2018-01-07 15:15:24

by Avi Kivity

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs



On 01/07/2018 04:36 PM, Alan Cox wrote:
>> I'm interested in participating to working on such a solution, given
>> that haproxy is severely impacted by "pti=on" and that for now we'll
>> have to run with "pti=off" on the whole system until a more suitable
>> solution is found.
> I'm still trying to work out what cases there are for this. I can see the
> cases for pti-off. I've got minecraft servers for example where there
> isn't anyone running untrusted code on the box (*) and the only data of
> value is owned by the minecraft processes. If someone gets to the point
> pti matters then I already lost.

You don't want, say, a local out-of-process DNS resolver compromised and
exploited, then the absence of PTI used to read all of the machine's memory.

>
> What I struggle to see is why I'd want to nominate specific processes for
> this except in very special cases


Suppose you are running a database (I hope you'll agree that's not a
special case). Then, "all of your data was stolen but the good news is
that your ssh keys are safe" is not something that brightens your day.
Your ssh keys are worth a lot less than your data.

In that case disabling PTI just for the database can be a good trade-off
between security and performance. You lose almost nothing by disabling
PTI for the database, yet you're still secure* from a remote exploit in
any supporting processes, or from a malicious local user.


* as secure as you were with full PTI

> (like your packet generator). Even then
> it would make me nervous as the packet generator if that trusted is
> effectively CAP_SYS_RAWIO or close to it and can steal any ssh keys or
> similar on that guest.
>
> I still prefer cgroups because once you include the branch predictions it
> suddenly becomes very interesting to be able to say 'this pile of stuff
> trusts itself' and avoid user/user protection costs while keeping
> user/kernel.

By "this pile of stuff" do you mean a group of mutually-trusting
processes? I don't see how that can be implemented, because any
cross-process switch has to cross the kernel boundary. In any case a
cross-process switch will destroy the effectiveness of branch prediction
history, so preserving it doesn't buy you much.

>
> Alan
> (*) I am sure that java programs can do sandbox breaking via spectre just
> as js can. Bonus points to anyone however who can do spectre through java
> from redstone 8)

2018-01-07 17:26:43

by Willy Tarreau

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On Sun, Jan 07, 2018 at 02:36:28PM +0000, Alan Cox wrote:
> What I struggle to see is why I'd want to nominate specific processes for
> this except in very special cases (like your packet generator). Even then
> it would make me nervous as the packet generator if that trusted is
> effectively CAP_SYS_RAWIO or close to it and can steal any ssh keys or
> similar on that guest.

Sure but we can also say that a process with CAP_SYS_RAWIO can manipulate
the hardware using iopl() and reprogram memory controllers, PCI bridges
and various stuff to have direct access wherever it wants. That's why I
thought that grouping the risks reduces the attack surface in the end.

> I still prefer cgroups because once you include the branch predictions it
> suddenly becomes very interesting to be able to say 'this pile of stuff
> trusts itself' and avoid user/user protection costs while keeping
> user/kernel.

To be honest, I don't know what it would imply in terms of management
(for the admin). Also, I'm really focused on the extra work to add to
syscalls, which should remain very minimalistic. Checking a flag on the
current task sounds reasonable. I don't know how far we might have to
go with cgroups. I remember a very long time ago you once told me "we
have fast syscalls", I'd like this statement to remain true for those
who continue to rely on this property ;-)

Willy

2018-01-07 17:39:52

by Willy Tarreau

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On Sun, Jan 07, 2018 at 11:14:21AM +0200, Avi Kivity wrote:
> CAP_RAWIO is like CAP_PAYLOAD in that both allow you to read stuff you
> shouldn't have access to on a vulnerable CPU. But CAP_PAYLOAD won't give you
> that access on a non-vulnerable CPU, so it's safer.

But it's still a wider surface for something quite similar. With
CAP_SYS_RAWIO you already have /dev/mem, iopl(), etc. I don't think
it's unreasonable to require that prctl() is added to applications
that require such functionality, it's not really more difficult to
deal with than dealing with an extra capability and managing its
impacts. And prctl() already does quite a lot of similar stuff like
enabling/disabling access to the TSC for example.
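
For reference, the TSC control mentioned above is reachable through the existing PR_GET_TSC and PR_SET_TSC options of prctl() (x86 only). The small C sketch below only queries the current mode; the helper name `query_tsc_mode()` is made up for this illustration, and on architectures or sandboxes where the option is unsupported it simply reports failure.

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * Returns PR_TSC_ENABLE or PR_TSC_SIGSEGV for the calling task,
 * or -1 if the kernel rejects PR_GET_TSC (for example on non-x86).
 */
static int query_tsc_mode(void)
{
    int mode = 0;

    /* PR_GET_TSC stores the current setting through the pointer argument. */
    if (prctl(PR_GET_TSC, (unsigned long)&mode, 0, 0, 0) != 0)
        return -1;

    /* These are the only two modes the interface documents. */
    assert(mode == PR_TSC_ENABLE || mode == PR_TSC_SIGSEGV);
    return mode;
}
```

A per-task PTI switch could follow the same get/set pattern, though that is an assumption about a possible design rather than something any existing kernel provides.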

> The advantage of not requiring prctl() is that it will work on unmodified
> applications, requiring only sysadmin intervention (and it's the sysadmin's
> role to designate an application as payload, not the application's).

It can as well be seen as a configuration option. And not opening this
to any random application by default sounds reasonable as well. I'm not
saying it's perfect, just trying to figure a reasonable path here.

> > I'm interested in participating to working on such a solution, given
> > that haproxy is severely impacted by "pti=on" and that for now we'll
> > have to run with "pti=off" on the whole system until a more suitable
> > solution is found.
> >
> > I'd rather not rush anything and let things calm down for a while to
> > avoid adding disturbance to the current situation. But I'm willing to
> > continue this discussion and even test patches.
> >
> >
>
> Then you might want to test
> https://www.spinics.net/lists/kernel/msg2689101.html and its companion
> patchset https://www.spinics.net/lists/kernel/msg2689134.html, which as a
> side effect significantly reduce KPTI impact on C10K applications (and as
> their main effect improve their performance).

I saw that two days ago but didn't read more. Now I've checked a bit
more but it seems very focused on block I/O (which makes sense for a DB
or for a server for example), which will not help for my specific use
case. In my case I'm wasting a lot of time in accept(), setsockopt(),
fcntl(), bind(), connect(), recv(), send(), shutdown() or close(). The
poller is almost unnoticeable since I/O events are grouped.

Cheers,
Willy

2018-01-07 18:06:50

by Theodore Ts'o

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On Sun, Jan 07, 2018 at 02:51:59PM +0200, Avi Kivity wrote:
>
> I don't see the connection. The browser wouldn't run with CAP_PAYLOAD set.
>
> In a desktop system, only init retains CAP_PAYLOAD.
>
> On a server that runs one application (and some supporting processes), only
> init and that one application have CAP_PAYLOAD (if the sysadmin makes it
> so).

In the classical (as defined by the withdrawn POSIX draft spec)
capabilities model, if you have a setuid root process it gets all the
capabilities, and capabilities are used to limit what privileges a
root process has. Hence using strict capabilities, any setuid root
process would have CAP_PAYLOAD.

Linux has extensions which allow you to bound which
capabilities can be obtained by a process, so you _could_ make it
work, but it just seems like a bad fit, since it's not strictly
speaking a root-owned privilege. It's more like a configuration
setting, and so modulating it via a cgroups attribute seems to make a
lot more sense --- it's certainly (IMHO) less confusing than trying to
(ab)use the capabilities system and its extensions in this fashion.
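
The interaction described above can be sketched with the execve() rule from capabilities(7), P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & P(bounding)), where a set-user-ID-root binary is treated as having all file capabilities. The C model below is an illustration only: `CAP_PAYLOAD_MASK` is a hypothetical bit, the struct and function names are made up, and ambient capabilities are left out because they are cleared for set-user-ID programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_ALL          UINT64_MAX
#define CAP_PAYLOAD_MASK (UINT64_C(1) << 63)   /* hypothetical capability bit */

/* Capability sets of the process calling execve(). */
struct cap_sets {
    uint64_t inheritable;
    uint64_t bounding;
};

/* Permitted set after execve() of a set-user-ID-root program. */
static uint64_t permitted_after_setuid_root_exec(const struct cap_sets *p)
{
    /* For root, the file inheritable and permitted sets count as all ones. */
    const uint64_t f_inheritable = CAP_ALL;
    const uint64_t f_permitted = CAP_ALL;

    assert(p != NULL);
    return (p->inheritable & f_inheritable) | (f_permitted & p->bounding);
}
```

With an unrestricted bounding set the result includes the payload bit, and it disappears only if something earlier in the process tree removed it from the bounding (and inheritable) set, which is the extra management burden referred to above.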

- Ted

2018-01-08 01:33:53

by Casey Schaufler

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On 1/6/2018 11:33 AM, Avi Kivity wrote:
> Meltdown and Spectre mitigations focus on protecting the kernel from a hostile userspace. However, it's not a given that the kernel is the most important target in the system. It is common in server workloads that a single userspace application contains the valuable data on a system, and if it were hostile, the game would already be over, without the need to compromise the kernel.
>
>
> In these workloads, a single application performs most system calls, and so it pays the cost of protection, without benefiting from it directly (since it is the target, rather than the kernel).
>
>
> I propose to create a new capability, CAP_PAYLOAD, that allows the system administrator to designate an application as the main workload in that system. Other processes (like sshd or monitoring daemons) exist to support it, and so it makes sense to protect the rest of the system from their being compromised.

As one of the authors of the POSIX 1003.1e/2c WITHDRAWN DRAFT capability
specification I emphatically NAK this proposal. This is nothing like what
capabilities are for. It doesn't fit the use model and it isn't in any way
related to enforcing system security policy.

> When the kernel switches to user mode of a CAP_PAYLOAD process, it doesn't switch page tables and instead leaves the kernel mapped into the adddress space (still with supervisor protection, of course). This reduces context switch cost, and will also reduce interrupt costs if the interrupt happens while that process executes. Since a CAP_PAYLOAD process is likely to consume the majority of CPU time, the costs associated with Meltdown mitigation are almost completely nullified.

This is a horrible hack. The potential for exploitable edge cases
is enormous.

> CAP_PAYLOAD has potential to be abused;

Yet another really, really good reason not to implement it.

> every software vendor will be absolutely certain that their application is the reason the universe (let alone that server) exists and they will turn it on, so init systems will have to work to make it doesn't get turned on without administrator opt-in. It's also not perfect, since if there is a payload application compromise, in addition to stealing the application's data ssh keys can be stolen too. But I think it's better than having to choose between significantly reduced performance and security. You get performance for your important application, and protection against the possibility that a remote exploit against a supporting process turns into a remote exploit against that important application.

This is just not a viable use of the capability mechanism.
I am not at liberty to comment on any aspect of the exploits
du jour, so suggesting alternatives is not something I can
do just now.

2018-01-18 22:51:27

by Pavel Machek

Subject: Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

On Sat 2018-01-06 21:33:28, Avi Kivity wrote:
> Meltdown and Spectre mitigations focus on protecting the kernel from a
> hostile userspace. However, it's not a given that the kernel is the most
> important target in the system. It is common in server workloads that a
> single userspace application contains the valuable data on a system, and if
> it were hostile, the game would already be over, without the need to
> compromise the kernel.
>
>
> In these workloads, a single application performs most system calls, and so
> it pays the cost of protection, without benefiting from it directly (since
> it is the target, rather than the kernel).
>
>
> I propose to create a new capability, CAP_PAYLOAD, that allows the system
> administrator to designate an application as the main workload in that
> system. Other processes (like sshd or monitoring daemons) exist to support
> it, and so it makes sense to protect the rest of the system from their being
> compromised.

prctl(I_AM_PAYLOAD) may do the trick. CAP_PAYLOAD is a bad idea.

prctl() should require some pretty heavy capabilities, similar to
iopl() / ioperm() syscalls on x86, maybe CAP_SYS_RAWIO. Maybe it can
depend on some other capability.

But merely having the capability should definitely not change system
behaviour.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

