2019-11-18 17:05:58

by Prakash Sangappa

Subject: [RESEND RFC PATCH 0/1] CAP_SYS_NICE inside user namespace

Some of the capabilities(7) that affect system-wide resources are ineffective
inside user namespaces. This restriction applies even to the root user (uid 0)
from the init namespace when mapped into the user namespace. One such
capability is CAP_SYS_NICE, which is required to change process priority. As a
result, the root user cannot perform operations such as increasing a process's
priority with a negative nice value or setting an RT priority on processes
inside the user namespace. A workaround is to have a process / daemon running
outside the user namespace change the process priority, which is an
inconvenience.
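
To observe the restriction described above from user space, a minimal Python sketch (Linux only; the function name `try_raise_priority` is made up for illustration) can attempt both operations and report whether the kernel allows them. The outcome depends on the environment: RLIMIT_NICE / RLIMIT_RTPRIO settings, or effective CAP_SYS_NICE in the init namespace, can make either call succeed.

```python
import os


def try_raise_priority():
    """Attempt a lower nice value and an RT policy on the calling process.

    Returns a dict mapping each attempt to "allowed" or "denied (<reason>)".
    Any change that succeeds is reverted before returning.
    """
    results = {}

    # 1) Raise priority by lowering the nice value by one step.
    current_nice = os.getpriority(os.PRIO_PROCESS, 0)
    try:
        os.setpriority(os.PRIO_PROCESS, 0, current_nice - 1)
        os.setpriority(os.PRIO_PROCESS, 0, current_nice)  # restore
        results["negative_nice"] = "allowed"
    except OSError as exc:
        results["negative_nice"] = f"denied ({exc.strerror})"

    # 2) Switch to the SCHED_FIFO real-time policy at its lowest priority.
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    rt_param = os.sched_param(os.sched_get_priority_min(os.SCHED_FIFO))
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, rt_param)
        os.sched_setscheduler(0, old_policy, old_param)  # restore
        results["rt_fifo"] = "allowed"
    except OSError as exc:
        results["rt_fifo"] = f"denied ({exc.strerror})"

    return results


if __name__ == "__main__":
    for attempt, outcome in try_raise_priority().items():
        print(f"{attempt}: {outcome}")
```

Run as uid 0 inside a user namespace, both attempts would be expected to report "denied", which is the behaviour this cover letter proposes to relax.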

We could allow these restricted capabilities to take effect only for the root
user from the init namespace mapped inside a user namespace, and limit the
effect with cgroups. It seems reasonable to address each of these restricted
capabilities on a case-by-case basis. This patch concerns the CAP_SYS_NICE
capability. The proposal is to selectively allow CAP_SYS_NICE to take effect
inside a user namespace only for a root user mapped from the init namespace.

Which user ID gets to map the root user (uid 0) from the init namespace into
its user namespaces is authorized through /etc/subuid and /etc/subgid entries.
Only the system admin / root user on the system can add these entries.
Therefore an ordinary user cannot simply map the root user (uid 0) into the
user namespaces it creates. The necessary cgroup bandwidth control can be used
to limit CPU usage for such user namespaces.
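
As a rough illustration of the authorization step, the sketch below parses subuid(5)-style lines (`owner:start:count`) and checks whether a given owner has been delegated a particular host ID. The sample entries, the owner names `alice` and `svc`, and the helper names are hypothetical; the actual check is done by the tools that write the namespace ID maps, not by code like this.

```python
from typing import List, NamedTuple


class SubIdRange(NamedTuple):
    owner: str   # login name (or numeric ID) the range is delegated to
    start: int   # first host ID in the delegated range
    count: int   # number of consecutive IDs in the range


def parse_subid(text: str) -> List[SubIdRange]:
    """Parse 'owner:start:count' lines, skipping blank lines."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        owner, start, count = line.split(":")
        ranges.append(SubIdRange(owner, int(start), int(count)))
    return ranges


def may_map(ranges: List[SubIdRange], owner: str, host_id: int) -> bool:
    """True if some range delegated to `owner` contains `host_id`."""
    return any(
        r.owner == owner and r.start <= host_id < r.start + r.count
        for r in ranges
    )


# Hypothetical file contents: only 'svc' has been granted host uid 0.
SAMPLE = """\
alice:100000:65536
svc:0:1
"""

if __name__ == "__main__":
    ranges = parse_subid(SAMPLE)
    print("svc may map uid 0:", may_map(ranges, "svc", 0))
    print("alice may map uid 0:", may_map(ranges, "alice", 0))
```

With these sample entries only `svc` is authorized for host uid 0, mirroring the point that an ordinary user cannot map it without an entry added by the admin.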

The capabilities(7) manpage lists all the operations / system calls that are
subject to the CAP_SYS_NICE capability check. This patch currently allows
CAP_SYS_NICE to take effect inside a user namespace only for system calls
affecting process priority. For completeness' sake, should the memory
operations (migrate_pages(2), move_pages(2), mbind(2)) mentioned in the
manpage also be permitted? There are no cgroup controls to limit the effect
of these memory operations.

Looking for feedback on this approach.

Prakash Sangappa (1):
  Selectively allow CAP_SYS_NICE capability inside user namespaces

 kernel/sched/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--
2.7.4


2019-11-18 19:38:45

by Jann Horn

Subject: Re: [RESEND RFC PATCH 0/1] CAP_SYS_NICE inside user namespace

On Mon, Nov 18, 2019 at 6:04 PM Prakash Sangappa wrote:
> Some of the capabilities(7) which affect system wide resources, are ineffective
> inside user namespaces. This restriction applies even to root user( uid 0)
> from init namespace mapped into the user namespace. One such capability
> is CAP_SYS_NICE which is required to change process priority. As a result of
> which the root user cannot perform operations like increase a process priority
> using -ve nice value or set RT priority on processes inside the user namespace.
> A workaround to deal with this restriction is to use the help of a process /
> daemon running outside the user namespace to change process priority, which is
> a an inconvenience.

What is the goal here, in the big picture? Is your goal to allow
container admins to control the priorities of their tasks *relative to
each other*, or do you actually explicitly want container A to be able
to decide that its current workload is more timing-sensitive than
container B's?

2019-11-18 20:37:06

by Prakash Sangappa

Subject: Re: [RESEND RFC PATCH 0/1] CAP_SYS_NICE inside user namespace



On 11/18/19 11:36 AM, Jann Horn wrote:
> On Mon, Nov 18, 2019 at 6:04 PM Prakash Sangappa wrote:
>> Some of the capabilities(7) which affect system wide resources, are ineffective
>> inside user namespaces. This restriction applies even to root user( uid 0)
>> from init namespace mapped into the user namespace. One such capability
>> is CAP_SYS_NICE which is required to change process priority. As a result of
>> which the root user cannot perform operations like increase a process priority
>> using -ve nice value or set RT priority on processes inside the user namespace.
>> A workaround to deal with this restriction is to use the help of a process /
>> daemon running outside the user namespace to change process priority, which is
>> a an inconvenience.
> What is the goal here, in the big picture? Is your goal to allow
> container admins to control the priorities of their tasks *relative to
> each other*, or do you actually explicitly want container A to be able
> to decide that its current workload is more timing-sensitive than
> container B's?

It is more the latter. The admin should be able to explicitly decide that
container A's workload is to be given priority over other containers.

by Enrico Weigelt, metux IT consult

Subject: Re: [RESEND RFC PATCH 0/1] CAP_SYS_NICE inside user namespace

On 18.11.19 21:34, Prakash Sangappa wrote:

> It is more the latter. Admin should be able to explicitly decide that
> container A
> workload is to be given priority over other containers.

I guess you're talking about the host's admin, correct?

Shouldn't this already be possible by tweaking the container's cgroups?
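
For context, the host-side cgroup tuning this question alludes to could look roughly like the sketch below. It assumes the cgroup v2 CPU controller files `cpu.max` ("quota period", in microseconds) and `cpu.weight`; the thread does not say which cgroup version is in use, and the helper names and container labels are invented for illustration. The code only computes the values an admin might write, it does not touch any cgroup files.

```python
from typing import Dict


def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Return a cgroup v2 cpu.max value capping a group at `cpus` CPUs' worth of time."""
    if cpus <= 0 or period_us <= 0:
        raise ValueError("cpus and period_us must be positive")
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def cpu_share(weights: Dict[str, int], name: str) -> float:
    """Fraction of CPU time `name` gets when all sibling groups are busy.

    Assumes time is split in proportion to each group's cpu.weight.
    """
    return weights[name] / sum(weights.values())


if __name__ == "__main__":
    # Hypothetical setup: container A is favoured 3:1 over container B.
    weights = {"A": 300, "B": 100}
    print("A share under contention:", cpu_share(weights, "A"))
    print("cpu.max for 1.5 CPUs:", cpu_max(1.5))
```

Settings like these let the host admin favour one container over another from the outside; they do not by themselves let a process inside the container change its own nice value or scheduling policy.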


--mtx

--
Urgent notice: due to the existential threat posed by "Emotet", you
should *never* accept or open MS Office documents sent via e-mail,
even if they appear to come from supposedly trustworthy senders.
Otherwise there is a risk of total loss.
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2019-11-22 01:56:08

by Prakash Sangappa

Subject: Re: [RESEND RFC PATCH 0/1] CAP_SYS_NICE inside user namespace



On 11/21/19 10:33 AM, Enrico Weigelt, metux IT consult wrote:
> On 18.11.19 21:34, Prakash Sangappa wrote:
>
>> It is more the latter. Admin should be able to explicitly decide that container A
>> workload is to be given priority over other containers.
> I guess, you're talking about the host's admin, correct ?

Yes. Specifically, the host's admin decides which container gets the
privilege to increase the priority of processes inside that container.

>
> Shouldn't this already be possibly by tweaking the container's cgroups ?

Don't think so. The use case is that the admin/user inside the container
needs to be able to increase the priority of some of the critical processes
running in the container.

>
>
> --mtx
>