LinuxLists.cc - [PATCH] userns: honour no_new_privs for cap

2017-12-21 21:06:36

Subject: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

From: Maciej Żenczykowski <[email protected]>

This allows locking down user namespaces tighter,
and it could even be considered a security fix.

Signed-off-by: Maciej Żenczykowski <[email protected]>
---
kernel/user_namespace.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..2354f7ade78a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -50,11 +50,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
* anything as the capabilities are bound to the new user namespace.
*/
cred->securebits = SECUREBITS_DEFAULT;
+ cred->cap_bset = task_no_new_privs(current) ? current_cred()->cap_bset
+ : CAP_FULL_SET;
cred->cap_inheritable = CAP_EMPTY_SET;
- cred->cap_permitted = CAP_FULL_SET;
- cred->cap_effective = CAP_FULL_SET;
+ cred->cap_permitted = cred->cap_bset;
+ cred->cap_effective = cred->cap_bset;
cred->cap_ambient = CAP_EMPTY_SET;
- cred->cap_bset = CAP_FULL_SET;
#ifdef CONFIG_KEYS
key_put(cred->request_key_auth);
cred->request_key_auth = NULL;
--
2.15.1.620.gb9897f4670-goog

2017-12-21 21:45:28

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

Maciej Żenczykowski <[email protected]> writes:

> From: Maciej Żenczykowski <[email protected]>
>
> This allows locking down user namespaces tighter,
> and it could even be considered a security fix.

No. This makes no logical sense.

A task that enters a user namespace loses all capabilities to everything
outside of the user namespace. Capabilities inside a user namespace are
only valid for objects created inside that user namespace.

So limiting capabilities inside a user namespace when the capability
bounding set is already fully honored by not giving the processes any of
those capabilities makes no logical sense.

If the concern is kernel attack surface versus logical permissions we
can look at ways to reduce the attack surface but that needs to be fully
discussed in the change log.

> Signed-off-by: Maciej Żenczykowski <[email protected]>
> ---
> kernel/user_namespace.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4ce5c7..2354f7ade78a 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -50,11 +50,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
> * anything as the capabilities are bound to the new user namespace.
> */
> cred->securebits = SECUREBITS_DEFAULT;
> + cred->cap_bset = task_no_new_privs(current) ? current_cred()->cap_bset
> + : CAP_FULL_SET;
> cred->cap_inheritable = CAP_EMPTY_SET;
> - cred->cap_permitted = CAP_FULL_SET;
> - cred->cap_effective = CAP_FULL_SET;
> + cred->cap_permitted = cred->cap_bset;
> + cred->cap_effective = cred->cap_bset;
> cred->cap_ambient = CAP_EMPTY_SET;
> - cred->cap_bset = CAP_FULL_SET;
> #ifdef CONFIG_KEYS
> key_put(cred->request_key_auth);
> cred->request_key_auth = NULL;

2017-12-22 01:03:40

by Maciej Żenczykowski

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

On Thu, Dec 21, 2017 at 10:44 PM, Eric W. Biederman
<[email protected]> wrote:
> No. This makes no logical sense.
>
> A task that enters a user namespace loses all capabilities to everything
> outside of the user namespace. Capabilities inside a user namespace are
> only valid for objects created inside that user namespace.
>
> So limiting capabilities inside a user namespace when the capability
> bounding set is already fully honored by not giving the processes any of
> those capabilities makes no logical sense.
>
> If the concern is kernel attack surface versus logical permissions we
> can look at ways to reduce the attack surface but that needs to be fully
> discussed in the change log.

Here's an example of using user namespaces to read a file you
shouldn't be able to.

lpk19:~# uname -r
4.15.0-smp-d1ce8ceb8ba8

(we start as true global root)
lpk19:~# id
uid=0(root) gid=0(root) groups=0(root)

(cleanup after previous run)
lpk19:~# cd /; chattr -i /immu; rm -f /immu/log; rmdir /immu

(now we create an append only logfile owned by target user:group)
lpk19:~# cd /; mkdir /immu; touch /immu/log; chown produser:prod
/immu/log; chmod a-rwx,u+w /immu/log; chattr +a /immu/log

(let's show what things look like)
lpk19:~# chattr +i /immu; ls -ld / /immu /immu/log; lsattr -d / /immu /immu/log
drwxr-xr-x 22 root root 4096 Dec 21 16:33 /
drwxr-xr-x 2 root root 4096 Dec 21 16:23 /immu
--w------- 1 produser prod 0 Dec 21 16:23 /immu/log
-----------I--e---- /
----i---------e---- /immu
-----a--------e---- /immu/log

(the immutable bit prevents us from changing permissions on the file)
lpk19:/# chmod a+rwx /immu/log
chmod: changing permissions of '/immu/log': Operation not permitted

(the append only bit prevents us from simply overwriting the file)
lpk19:/# echo log1 > /immu/log
-bash: /immu/log: Operation not permitted

(but we can append to it)
lpk19:/# echo log1 >> /immu/log

(we're global root with CAP_DAC_OVERRIDE, so we can *still* read it)
lpk19:/# cat /immu/log
log1

(let's transition to target user)
lpk19:/# su - produser

produser@lpk19:~$ id
uid=2080(produser) gid=620(prod) groups=620(prod)

(we can't overwrite it)
produser@lpk19:~$ echo log2 > /immu/log
-su: /immu/log: Operation not permitted

(but we can log to it: as intended)
produser@lpk19:~$ echo log2 >> /immu/log

(we can't change its permissions, cause it's in an immutable directory)
produser@lpk19:~$ chmod u+r /immu/log
chmod: changing permissions of '/immu/log': Operation not permitted

(we can't dump the file, cause we don't have CAP_DAC_OVERRIDE)
produser@lpk19:~$ cat /immu/log
cat: /immu/log: Permission denied

(or can we?)
produser@lpk19:~$ unshare -U -r cat /immu/log
log1
log2

----

Now, of course, the above patch doesn't actually fix this on it's own,
since 'su' doesn't (yet?) know to restrict bset or to set
no_new_privs.

But: it allows the sandbox equivalent of su to drop CAP_DAC_OVERRIDE
from it's inh/eff/perm/ambient/bset, and set no_new_privs.
Now the unshare won't gain CAP_DAC_OVERRIDE and won't be able to cat
the non-readable append-only log file.

IMHO the point of having a capability bounding set and/or no_new_privs
is to never be able to regain capabilities.
Note also that 'no_new_privs' isn't cleared across a
unshare(CLONE_NEWUSER) [presumably also applies to setns()].

We can of course argue the implementation details (for example instead
of using the existing no_new_privs flag, add a new
keep_bset_across_userns_transitions securebits flag)... but
*something* has to be done.

- Maciej

2017-12-22 01:18:35

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

Maciej Żenczykowski <[email protected]> writes:

> On Thu, Dec 21, 2017 at 10:44 PM, Eric W. Biederman
> <[email protected]> wrote:
>> No. This makes no logical sense.
>>
>> A task that enters a user namespace loses all capabilities to everything
>> outside of the user namespace. Capabilities inside a user namespace are
>> only valid for objects created inside that user namespace.
>>
>> So limiting capabilities inside a user namespace when the capability
>> bounding set is already fully honored by not giving the processes any of
>> those capabilities makes no logical sense.
>>
>> If the concern is kernel attack surface versus logical permissions we
>> can look at ways to reduce the attack surface but that needs to be fully
>> discussed in the change log.
>
> Here's an example of using user namespaces to read a file you
> shouldn't be able to.
>
> lpk19:~# uname -r
> 4.15.0-smp-d1ce8ceb8ba8
>
> (we start as true global root)
> lpk19:~# id
> uid=0(root) gid=0(root) groups=0(root)
>
> (cleanup after previous run)
> lpk19:~# cd /; chattr -i /immu; rm -f /immu/log; rmdir /immu
>
> (now we create an append only logfile owned by target user:group)
> lpk19:~# cd /; mkdir /immu; touch /immu/log; chown produser:prod
> /immu/log; chmod a-rwx,u+w /immu/log; chattr +a /immu/log
>
> (let's show what things look like)
> lpk19:~# chattr +i /immu; ls -ld / /immu /immu/log; lsattr -d / /immu /immu/log
> drwxr-xr-x 22 root root 4096 Dec 21 16:33 /
> drwxr-xr-x 2 root root 4096 Dec 21 16:23 /immu
> --w------- 1 produser prod 0 Dec 21 16:23 /immu/log
> -----------I--e---- /
> ----i---------e---- /immu
> -----a--------e---- /immu/log
>
> (the immutable bit prevents us from changing permissions on the file)
> lpk19:/# chmod a+rwx /immu/log
> chmod: changing permissions of '/immu/log': Operation not permitted
>
> (the append only bit prevents us from simply overwriting the file)
> lpk19:/# echo log1 > /immu/log
> -bash: /immu/log: Operation not permitted
>
> (but we can append to it)
> lpk19:/# echo log1 >> /immu/log
>
> (we're global root with CAP_DAC_OVERRIDE, so we can *still* read it)
> lpk19:/# cat /immu/log
> log1
>
> (let's transition to target user)
> lpk19:/# su - produser
>
> produser@lpk19:~$ id
> uid=2080(produser) gid=620(prod) groups=620(prod)
>
> (we can't overwrite it)
> produser@lpk19:~$ echo log2 > /immu/log
> -su: /immu/log: Operation not permitted
>
> (but we can log to it: as intended)
> produser@lpk19:~$ echo log2 >> /immu/log
>
> (we can't change its permissions, cause it's in an immutable directory)
> produser@lpk19:~$ chmod u+r /immu/log
> chmod: changing permissions of '/immu/log': Operation not permitted
>
> (we can't dump the file, cause we don't have CAP_DAC_OVERRIDE)
> produser@lpk19:~$ cat /immu/log
> cat: /immu/log: Permission denied
>
> (or can we?)
> produser@lpk19:~$ unshare -U -r cat /immu/log
> log1
> log2
>
> ----
>
> Now, of course, the above patch doesn't actually fix this on it's own,
> since 'su' doesn't (yet?) know to restrict bset or to set
> no_new_privs.
>
> But: it allows the sandbox equivalent of su to drop CAP_DAC_OVERRIDE
> from it's inh/eff/perm/ambient/bset, and set no_new_privs.
> Now the unshare won't gain CAP_DAC_OVERRIDE and won't be able to cat
> the non-readable append-only log file.
>
> IMHO the point of having a capability bounding set and/or no_new_privs
> is to never be able to regain capabilities.
> Note also that 'no_new_privs' isn't cleared across a
> unshare(CLONE_NEWUSER) [presumably also applies to setns()].
>
> We can of course argue the implementation details (for example instead
> of using the existing no_new_privs flag, add a new
> keep_bset_across_userns_transitions securebits flag)... but
> *something* has to be done.

Good point about CAP_DAC_OVERRIDE on files you own.

I think there is an argument that you are playing dangerous games with
the permission system there, as it isn't effectively a file you own if
you can't read it, and you can't change it's permissions.

Given little things like that I can completely see no_new_privs meaning
you can't create a user namespace. That seems consistent with the
meaning and philosophy of no_new_privs. So simple it is hard to get
wrong.

We could do more clever things like plug this whole in user namespaces,
and that would not hurt my feelings. However unless that is our only
choice to avoid badly breaking userspace I would have to have to depend
on user namespaces being perfect for no_new_privs to be a proper jail.

As a general rule user namespaces are where we tackle the subtle scary
things that should work, and no_new_privs is where we implement a simple
hard to get wrong jail. Most of the time the effect is the same to an
outside observer (bounded permissions), but there is a real difference
in difficulty of implementation.

Eric

2017-12-22 01:51:54

by Maciej Żenczykowski

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

> Good point about CAP_DAC_OVERRIDE on files you own.
>
> I think there is an argument that you are playing dangerous games with
> the permission system there, as it isn't effectively a file you own if
> you can't read it, and you can't change it's permissions.

Append-only files are useful - particularly for logging.
It could also simply be a non-readable file on a R/O filesystem.

> Given little things like that I can completely see no_new_privs meaning
> you can't create a user namespace. That seems consistent with the
> meaning and philosophy of no_new_privs. So simple it is hard to get
> wrong.

Yes, I could totally buy the argument that no_new_privs should prevent
creating a user ns.

However, there's also setns() and that's a fair bit harder to reason about.
Entirely deny it? But that actually seems potentially useful...
Allow it but cap it? That's what this does...

> We could do more clever things like plug this whole in user namespaces,
> and that would not hurt my feelings.

Sure, this particular one wouldn't be all that easy I think... and how
many such holes are there?
I found this particular one *after* your first reply in this thread.

> However unless that is our only
> choice to avoid badly breaking userspace I would have to have to depend
> on user namespaces being perfect for no_new_privs to be a proper jail.

This stuff is ridiculously complex to get right from userspace. :-(

> As a general rule user namespaces are where we tackle the subtle scary
> things that should work, and no_new_privs is where we implement a simple
> hard to get wrong jail. Most of the time the effect is the same to an
> outside observer (bounded permissions), but there is a real difference
> in difficulty of implementation.

So, where to now...

Would you accept patches that:

- make no_new_priv block user ns creation?

- make no_new_priv block user ns transition?

Or perhaps we can assume that lack of create privs is sufficient, and
if there's a pre-existing user ns for you to enter, then that's
acceptable...
Although this implies you probably always want to combine no_new_privs
with a leaf user ns, or no_new_privs isn't all that useful for root in
root ns...
This added complexity, probably means it should be blocked...

- inherits bset across user ns creation/transition based on X?
[this is the one we care about, because there are simply too many bugs
in the kernel wrt. certain caps]
X could be:
- a new flag similar to no_new_priv
- a new securebit flag (w/lockbit) [provided securebits survive a
userns transition, haven't checked]
- or perhaps a new capability
- something else?

How do we make forward progress?

2017-12-22 02:17:44

by Aleksa Sarai

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

On 2017-12-21, Eric W. Biederman <[email protected]> wrote:
> Good point about CAP_DAC_OVERRIDE on files you own.
>
> I think there is an argument that you are playing dangerous games with
> the permission system there, as it isn't effectively a file you own if
> you can't read it, and you can't change it's permissions.

This problem reminds me of the whole "unmapped group" problem. If you
have access to a file through an unmapped group you can still access a
file -- which to me is wrong. I understand the need for checking
unmapped groups in order to fix the "chmod 707" problem, but I think
that unmapped groups should only *block* access and never *grant* it.

I was working on a patch for that issue a while ago but it touched more
VFS than I was comfortable with. Eric, is that a fix you would be
interested in?

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Attachments:

(No filename) (925.00 B)
signature.asc (833.00 B)
Download all attachments

2017-12-22 14:08:37

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

Maciej Żenczykowski <[email protected]> writes:

>> Good point about CAP_DAC_OVERRIDE on files you own.
>>
>> I think there is an argument that you are playing dangerous games with
>> the permission system there, as it isn't effectively a file you own if
>> you can't read it, and you can't change it's permissions.
>
> Append-only files are useful - particularly for logging.
> It could also simply be a non-readable file on a R/O filesystem.
>
>> Given little things like that I can completely see no_new_privs meaning
>> you can't create a user namespace. That seems consistent with the
>> meaning and philosophy of no_new_privs. So simple it is hard to get
>> wrong.
>
> Yes, I could totally buy the argument that no_new_privs should prevent
> creating a user ns.
>
> However, there's also setns() and that's a fair bit harder to reason about.
> Entirely deny it? But that actually seems potentially useful...
> Allow it but cap it? That's what this does...
>
>> We could do more clever things like plug this whole in user namespaces,
>> and that would not hurt my feelings.
>
> Sure, this particular one wouldn't be all that easy I think... and how
> many such holes are there?
> I found this particular one *after* your first reply in this thread.
>
>> However unless that is our only
>> choice to avoid badly breaking userspace I would have to have to depend
>> on user namespaces being perfect for no_new_privs to be a proper jail.
>
> This stuff is ridiculously complex to get right from userspace. :-(

>> As a general rule user namespaces are where we tackle the subtle scary
>> things that should work, and no_new_privs is where we implement a simple
>> hard to get wrong jail. Most of the time the effect is the same to an
>> outside observer (bounded permissions), but there is a real difference
>> in difficulty of implementation.
>
> So, where to now...
>
> Would you accept patches that:
>
> - make no_new_priv block user ns creation?
>
> - make no_new_priv block user ns transition?

Yes.

The approach will need to be rethought if there is anything deliberately
combining user namespaces and no_new_privs. As regressions are a no-no.
So we need wide spread testing, to avoid that.

But as much as possible I want no_new_privs to be simple and doing it's
job.

I will also take and encourage patches that close this minor privilege
escalation from the user namespace side. As ideally creating a user
namespace should be as safe as no_new_privs.

> Or perhaps we can assume that lack of create privs is sufficient, and
> if there's a pre-existing user ns for you to enter, then that's
> acceptable...
> Although this implies you probably always want to combine no_new_privs
> with a leaf user ns, or no_new_privs isn't all that useful for root in
> root ns...
> This added complexity, probably means it should be blocked...

Yes.

> - inherits bset across user ns creation/transition based on X?
> [this is the one we care about, because there are simply too many bugs
> in the kernel wrt. certain caps]

That was my suspicion, and attack surface reduction is a different
discussion. Would no_new_privs preventing a userns transition be enough
for the cases you care about?

Otherwise this is a different conversation because it is not about
semantics but about making the code safer to use. In general if code is
simply not safe to user in a user namespace I would prefer to tighten
the permission checks, and just not allow that code.

Mostly what I have seen in previous conversations is simply concerns
about code that is not used or needed, being a problem.

> X could be:
> - a new flag similar to no_new_priv
> - a new securebit flag (w/lockbit) [provided securebits survive a
> userns transition, haven't checked]
> - or perhaps a new capability
> - something else?
>
> How do we make forward progress?

We start by causing no_new_privs to block userns creation and entering.

Eric

2017-12-22 14:22:09

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

Aleksa Sarai <[email protected]> writes:

> On 2017-12-21, Eric W. Biederman <[email protected]> wrote:
>> Good point about CAP_DAC_OVERRIDE on files you own.
>>
>> I think there is an argument that you are playing dangerous games with
>> the permission system there, as it isn't effectively a file you own if
>> you can't read it, and you can't change it's permissions.
>
> This problem reminds me of the whole "unmapped group" problem. If you
> have access to a file through an unmapped group you can still access a
> file -- which to me is wrong. I understand the need for checking
> unmapped groups in order to fix the "chmod 707" problem, but I think
> that unmapped groups should only *block* access and never *grant* it.
>
> I was working on a patch for that issue a while ago but it touched more
> VFS than I was comfortable with. Eric, is that a fix you would be
> interested in?

I am not certain. I don't see how there is a problem with an unmapped
group granting permissions. You are talking about a scenario where a
more privileged login program set your groups, and uid and gid. The
process despite being a user namespace does not have permission to
transition them. As such I don't see the harm.

But spell it out for me, and deal with ensuring we don't have user space
regressions and I will take a patch that improves the security of user
namespaces.

I think the issue that raised all of this is that dropping a group can
in rare instances increase permissions. I have heard people grumble at
me that the way I handle it with /etc/subuid might break things on some
people's systems. AKA don't allow it by default but allow root to
configure a way for people using user namespaces to do that. I have yet
to see someone come forward and say that is a problem in the real world.
If it actually is a problem I want to hear about it.

Eric

2018-01-03 11:23:27

by Christian Brauner

[permalink] [raw]

Subject: Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

On Fri, Dec 22, 2017 at 08:08:04AM -0600, Eric W. Biederman wrote:
> Maciej Żenczykowski <[email protected]> writes:
>
> >> Good point about CAP_DAC_OVERRIDE on files you own.
> >>
> >> I think there is an argument that you are playing dangerous games with
> >> the permission system there, as it isn't effectively a file you own if
> >> you can't read it, and you can't change it's permissions.
> >
> > Append-only files are useful - particularly for logging.
> > It could also simply be a non-readable file on a R/O filesystem.
> >
> >> Given little things like that I can completely see no_new_privs meaning
> >> you can't create a user namespace. That seems consistent with the
> >> meaning and philosophy of no_new_privs. So simple it is hard to get
> >> wrong.
> >
> > Yes, I could totally buy the argument that no_new_privs should prevent
> > creating a user ns.
> >
> > However, there's also setns() and that's a fair bit harder to reason about.
> > Entirely deny it? But that actually seems potentially useful...
> > Allow it but cap it? That's what this does...
> >
> >> We could do more clever things like plug this whole in user namespaces,
> >> and that would not hurt my feelings.
> >
> > Sure, this particular one wouldn't be all that easy I think... and how
> > many such holes are there?
> > I found this particular one *after* your first reply in this thread.
> >
> >> However unless that is our only
> >> choice to avoid badly breaking userspace I would have to have to depend
> >> on user namespaces being perfect for no_new_privs to be a proper jail.
> >
> > This stuff is ridiculously complex to get right from userspace. :-(
>
> >> As a general rule user namespaces are where we tackle the subtle scary
> >> things that should work, and no_new_privs is where we implement a simple
> >> hard to get wrong jail. Most of the time the effect is the same to an
> >> outside observer (bounded permissions), but there is a real difference
> >> in difficulty of implementation.
> >
> > So, where to now...
> >
> > Would you accept patches that:
> >
> > - make no_new_priv block user ns creation?
> >
> > - make no_new_priv block user ns transition?
>
> Yes.
>
> The approach will need to be rethought if there is anything deliberately
> combining user namespaces and no_new_privs. As regressions are a no-no.
> So we need wide spread testing, to avoid that.

Just to be clear: as soon as you set no_new_privs you should not be
able to create new user namespace anymore and you shouldn't be able to
attach to user namespaces anymore.
This should work and not cause regressions for e.g. liblxc were we
always set no_new_privs as one of the last steps and especially after we
have clone(CLONE_NEWUSER) the container and - for the attach case -
after we have already done the setns(fd, CLONE_NEWUSER). These use-cases
should be preserved. But it seems to me that they will be. But I'd like
to test this once patches are floating around.

>
> But as much as possible I want no_new_privs to be simple and doing it's
> job.
>
> I will also take and encourage patches that close this minor privilege
> escalation from the user namespace side. As ideally creating a user
> namespace should be as safe as no_new_privs.
>
>
> > Or perhaps we can assume that lack of create privs is sufficient, and
> > if there's a pre-existing user ns for you to enter, then that's
> > acceptable...
> > Although this implies you probably always want to combine no_new_privs
> > with a leaf user ns, or no_new_privs isn't all that useful for root in
> > root ns...
> > This added complexity, probably means it should be blocked...
>
> Yes.
>
> > - inherits bset across user ns creation/transition based on X?
> > [this is the one we care about, because there are simply too many bugs
> > in the kernel wrt. certain caps]
>
> That was my suspicion, and attack surface reduction is a different
> discussion. Would no_new_privs preventing a userns transition be enough
> for the cases you care about?
>
> Otherwise this is a different conversation because it is not about
> semantics but about making the code safer to use. In general if code is
> simply not safe to user in a user namespace I would prefer to tighten
> the permission checks, and just not allow that code.
>
> Mostly what I have seen in previous conversations is simply concerns
> about code that is not used or needed, being a problem.
>
> > X could be:
> > - a new flag similar to no_new_priv
> > - a new securebit flag (w/lockbit) [provided securebits survive a
> > userns transition, haven't checked]
> > - or perhaps a new capability
> > - something else?
> >
> > How do we make forward progress?
>
> We start by causing no_new_privs to block userns creation and entering.

That sounds reasonable. One thing I thought about is what the
interaction between no_new_privs and /proc/sys/user/max_user_namespaces
should look like after this change. Both will basically have the same
effect, right? So wouldn't it make sense to have no_new_privs imply that
/proc/sys/user/max_user_namespaces is 0 and becomes read-only in the
in the calling process' user namespace (e.g. returning EINVAL would make
sense)? I worry that otherwise we might confuse users when they see that
/proc/sys/user/max_user_namespaces is not 0 but they still aren't able
to create user namespaces. I know that having multiple security options
block the same thing independent of each other is not unprecedented
(e.g. in addition to /proc/sys/user/max_user_namespaces CentOS and RHEL
have an additional boot parameter to {dis,en}able user namespace
creation which often confuses the heck out of users) but I'd rather not
make this a common scenario when we can establish a sensible interaction
between two security settings.

Christian