2021-04-16 05:56:29

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

(Eric - this patch (v3) is a cleaned up version of the previous approach.
v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
and is the approach you suggested. I can send it also as a separate patch
if you like)

A process running as uid 0 but without cap_setfcap currently can simply
unshare a new user namespace with uid 0 mapped to 0. While this task
will not have new capabilities against the parent namespace, there is
a loophole due to the way namespaced file capabilities work. File
capabilities valid in userns 1 are distinguised from file capabilities
valid in userns 2 by the kuid which underlies uid 0. Therefore
the restricted root process can unshare a new self-mapping namespace,
add a namespaced file capability onto a file, then use that file
capability in the parent namespace.

To prevent that, do not allow mapping uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

A further wrinkle: a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid. In this
case we do not have the credential from before unshare, which was
potentially more restricted. So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP. Then we can use that
during map_write().

With this patch:

1. unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Signed-off-by: Serge Hallyn <[email protected]>

Changelog:
* fix logic in the case of writing to another task's uid_map
* rename 'ns' to 'map_ns', and make a file_ns local variable
* use /* comments */
* update the CAP_SETFCAP comment in capability.h
* rename parent_unpriv to parent_can_setfcap (and reverse the
logic)
* remove printks
* clarify (i hope) the code comments
* update capability.h comment
* renamed parent_can_setfcap to parent_could_setfcap
* made the check its own disallowed_0_mapping() fn
* moved the check into new_idmap_permitted
---
include/linux/user_namespace.h | 3 ++
include/uapi/linux/capability.h | 3 +-
kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++--
3 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
kgid_t group;
struct ns_common ns;
unsigned long flags;
+ /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+ * in its effective capability set at the child ns creation time. */
+ bool parent_could_setfcap;

#ifdef CONFIG_KEYS
/* List of joinable keyrings in this namespace. Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_CONTROL 30

-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+ Map uid=0 into a child user namespace. */

#define CAP_SETFCAP 31

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..8c75028a9aae 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
if (!ns)
goto fail_dec;

+ ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
ret = ns_alloc_inum(&ns->ns);
if (ret)
goto fail_free;
@@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
return 0;
}

+/*
+ * If mapping uid 0, then file capabilities created by the new namespace will
+ * be effective in the parent namespace. Adding file capabilities requires
+ * CAP_SETFCAP, which the child namespace will have, so creating such a
+ * mapping requires CAP_SETFCAP in the parent namespace.
+ */
+static bool disallowed_0_mapping(const struct file *file,
+ struct user_namespace *map_ns,
+ struct uid_gid_map *new_map)
+{
+ int idx;
+ bool zeromapping = false;
+ const struct user_namespace *file_ns = file->f_cred->user_ns;
+
+ for (idx = 0; idx < new_map->nr_extents; idx++) {
+ struct uid_gid_extent *e;
+ u32 lower_first;
+
+ if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+ e = &new_map->extent[idx];
+ else
+ e = &new_map->forward[idx];
+ if (e->lower_first == 0) {
+ zeromapping = true;
+ break;
+ }
+ }
+
+ if (!zeromapping)
+ return false;
+
+ if (map_ns == file_ns) {
+ /* The user unshared first and is writing to
+ * /proc/self/uid_map. User already has full
+ * capabilites in the new namespace, so verify
+ * that the parent has CAP_SETFCAP. */
+ if (!file_ns->parent_could_setfcap)
+ return true;
+ } else {
+ /* Process p1 is writing to uid_map of p2, who
+ * is in a child user namespace to p1's. So
+ * we verify that p1 has CAP_SETFCAP to its
+ * own namespace */
+ if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+ return true;
+ }
+
+ return false;
+}
+
static ssize_t map_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos,
int cap_setid,
@@ -848,7 +899,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
struct uid_gid_map *parent_map)
{
struct seq_file *seq = file->private_data;
- struct user_namespace *ns = seq->private;
+ struct user_namespace *map_ns = seq->private;
struct uid_gid_map new_map;
unsigned idx;
struct uid_gid_extent extent;
@@ -895,7 +946,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
/*
* Adjusting namespace settings requires capabilities on the target.
*/
- if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+ if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
goto out;

/* Parse the user data */
@@ -965,7 +1016,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,

ret = -EPERM;
/* Validate the user is allowed to use user id's mapped to. */
- if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+ if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
goto out;

ret = -EPERM;
@@ -1086,6 +1137,10 @@ static bool new_idmap_permitted(const struct file *file,
struct uid_gid_map *new_map)
{
const struct cred *cred = file->f_cred;
+
+ if (cap_setid == CAP_SETUID && disallowed_0_mapping(file, ns, new_map))
+ return false;
+
/* Don't allow mappings that would allow anything that wouldn't
* be allowed without the establishment of unprivileged mappings.
*/
--
2.17.1


2021-04-16 15:08:51

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> (Eric - this patch (v3) is a cleaned up version of the previous approach.
> v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> and is the approach you suggested. I can send it also as a separate patch
> if you like)
>
> A process running as uid 0 but without cap_setfcap currently can simply
> unshare a new user namespace with uid 0 mapped to 0. While this task
> will not have new capabilities against the parent namespace, there is
> a loophole due to the way namespaced file capabilities work. File
> capabilities valid in userns 1 are distinguised from file capabilities
> valid in userns 2 by the kuid which underlies uid 0. Therefore
> the restricted root process can unshare a new self-mapping namespace,
> add a namespaced file capability onto a file, then use that file
> capability in the parent namespace.
>
> To prevent that, do not allow mapping uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> A further wrinkle: a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid. In this
> case we do not have the credential from before unshare, which was
> potentially more restricted. So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP. Then we can use that
> during map_write().
>
> With this patch:
>
> 1. unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Signed-off-by: Serge Hallyn <[email protected]>
>
> Changelog:
> * fix logic in the case of writing to another task's uid_map
> * rename 'ns' to 'map_ns', and make a file_ns local variable
> * use /* comments */
> * update the CAP_SETFCAP comment in capability.h
> * rename parent_unpriv to parent_can_setfcap (and reverse the
> logic)
> * remove printks
> * clarify (i hope) the code comments
> * update capability.h comment
> * renamed parent_can_setfcap to parent_could_setfcap
> * made the check its own disallowed_0_mapping() fn
> * moved the check into new_idmap_permitted
> ---

Thank you for working on this fix!

I do prefer your approach of doing the check at user namespace creation
time instead of moving it into the setxattr() codepath.

Let me reiterate that the ability to write through fscaps is a valid
usecase and this should continue to work but that for locked down user
namespace as Andrew wants to use them your patch provides a clean
solution.
We've are using identity mappings in quite a few scenarios partially
when performing tests but also to write through fscaps.
We also had reports of users that use identity mappings. They create
their rootfs by running image extraction in an identity mapped userns
where fscaps are written through.
Podman has use-cases for this feature as well and has been affected by
the regression of the first fix.

> include/linux/user_namespace.h | 3 ++
> include/uapi/linux/capability.h | 3 +-
> kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++--
> 3 files changed, 63 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 64cf8ebdc4ec..f6c5f784be5a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -63,6 +63,9 @@ struct user_namespace {
> kgid_t group;
> struct ns_common ns;
> unsigned long flags;
> + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> + * in its effective capability set at the child ns creation time. */
> + bool parent_could_setfcap;
>
> #ifdef CONFIG_KEYS
> /* List of joinable keyrings in this namespace. Modification access of
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c6ca33034147..2ddb4226cd23 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
>
> #define CAP_AUDIT_CONTROL 30
>
> -/* Set or remove capabilities on files */
> +/* Set or remove capabilities on files.
> + Map uid=0 into a child user namespace. */
>
> #define CAP_SETFCAP 31
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index af612945a4d0..8c75028a9aae 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
> if (!ns)
> goto fail_dec;
>
> + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
> ret = ns_alloc_inum(&ns->ns);
> if (ret)
> goto fail_free;
> @@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
> return 0;
> }
>
> +/*
> + * If mapping uid 0, then file capabilities created by the new namespace will
> + * be effective in the parent namespace. Adding file capabilities requires
> + * CAP_SETFCAP, which the child namespace will have, so creating such a
> + * mapping requires CAP_SETFCAP in the parent namespace.
> + */
> +static bool disallowed_0_mapping(const struct file *file,
> + struct user_namespace *map_ns,
> + struct uid_gid_map *new_map)
> +{
> + int idx;
> + bool zeromapping = false;
> + const struct user_namespace *file_ns = file->f_cred->user_ns;
> +
> + for (idx = 0; idx < new_map->nr_extents; idx++) {

I think having that loop is acceptable here since it's only called once
at map creation time even though the forward array is not yet sorted.

> + struct uid_gid_extent *e;
> + u32 lower_first;
> +
> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> + e = &new_map->extent[idx];
> + else
> + e = &new_map->forward[idx];
> + if (e->lower_first == 0) {
> + zeromapping = true;
> + break;
> + }
> + }
> +
> + if (!zeromapping)
> + return false;
> +
> + if (map_ns == file_ns) {
> + /* The user unshared first and is writing to
> + * /proc/self/uid_map. User already has full
> + * capabilites in the new namespace, so verify
> + * that the parent has CAP_SETFCAP. */
> + if (!file_ns->parent_could_setfcap)
> + return true;
> + } else {
> + /* Process p1 is writing to uid_map of p2, who
> + * is in a child user namespace to p1's. So
> + * we verify that p1 has CAP_SETFCAP to its
> + * own namespace */
> + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> + return true;
> + }
> +
> + return false;
> +}

Maybe we can tweak this a tiny bit to get rid of the "zeromapping"?:

static bool disallowed_0_mapping(const struct file *file,
struct user_namespace *map_ns,
struct uid_gid_map *new_map)
{
int idx;
const struct user_namespace *file_ns = file->f_cred->user_ns;
struct uid_gid_extent *extent0 = NULL;

for (idx = 0; idx < new_map->nr_extents; idx++) {
u32 lower_first;

if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
extent0 = &new_map->extent[idx];
else
extent0 = &new_map->forward[idx];
if (extent0->lower_first == 0)
break;

extent0 = NULL;
}

if (!extent0)
return false;

if (map_ns == file_ns) {
/*
* The user unshared first and is writing to
* /proc/self/uid_map. User already has full
* capabilites in the new namespace, so verify
* that the parent has CAP_SETFCAP.
*/
if (!file_ns->parent_could_setfcap)
return true;
} else {
/*
* Process p1 is writing to uid_map of p2, who
* is in a child user namespace to p1's. So
* we verify that p1 has CAP_SETFCAP to its
* own namespace.
*/
if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
return true;
}

return false;
}

In addition I would think that expressing the logic the other way around
is more legible. I'm not too keen on having negations in function names.
We should probably also tweak the comment a bit and make it kernel-doc
clean:

/**
* verify_root_map() - check the uid 0 mapping
* @file: idmapping file
* @map_ns: user namespace of the target process
* @new_map: requested idmap
*
* If a process requested a mapping for uid 0 onto
* uid 0 verify that the process writing the map had the CAP_SETFCAP
* capability as the target process will be able to
* write fscaps that are valid in ancestor user namespaces.
*
* Return: true if the mapping is allow, false if not.
*/
static bool verify_root_map()

2021-04-16 22:10:06

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > and is the approach you suggested. I can send it also as a separate patch
> > if you like)
> >
> > A process running as uid 0 but without cap_setfcap currently can simply
> > unshare a new user namespace with uid 0 mapped to 0. While this task
> > will not have new capabilities against the parent namespace, there is
> > a loophole due to the way namespaced file capabilities work. File
> > capabilities valid in userns 1 are distinguised from file capabilities
> > valid in userns 2 by the kuid which underlies uid 0. Therefore
> > the restricted root process can unshare a new self-mapping namespace,
> > add a namespaced file capability onto a file, then use that file
> > capability in the parent namespace.
> >
> > To prevent that, do not allow mapping uid 0 if the process which
> > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > for setting file capabilities.
> >
> > A further wrinkle: a task can unshare its user namespace, then
> > open its uid_map file itself, and map (only) its own uid. In this
> > case we do not have the credential from before unshare, which was
> > potentially more restricted. So, when creating a user namespace, we
> > record whether the creator had CAP_SETFCAP. Then we can use that
> > during map_write().
> >
> > With this patch:
> >
> > 1. unprivileged user can still unshare -Ur
> >
> > ubuntu@caps:~$ unshare -Ur
> > root@caps:~# logout
> >
> > 2. root user can still unshare -Ur
> >
> > ubuntu@caps:~$ sudo bash
> > root@caps:/home/ubuntu# unshare -Ur
> > root@caps:/home/ubuntu# logout
> >
> > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> >
> > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > root@caps:/home/ubuntu# unshare -Ur
> > unshare: write failed /proc/self/uid_map: Operation not permitted
> >
> > Signed-off-by: Serge Hallyn <[email protected]>
> >
> > Changelog:
> > * fix logic in the case of writing to another task's uid_map
> > * rename 'ns' to 'map_ns', and make a file_ns local variable
> > * use /* comments */
> > * update the CAP_SETFCAP comment in capability.h
> > * rename parent_unpriv to parent_can_setfcap (and reverse the
> > logic)
> > * remove printks
> > * clarify (i hope) the code comments
> > * update capability.h comment
> > * renamed parent_can_setfcap to parent_could_setfcap
> > * made the check its own disallowed_0_mapping() fn
> > * moved the check into new_idmap_permitted
> > ---
>
> Thank you for working on this fix!
>
> I do prefer your approach of doing the check at user namespace creation
> time instead of moving it into the setxattr() codepath.
>
> Let me reiterate that the ability to write through fscaps is a valid
> usecase and this should continue to work but that for locked down user
> namespace as Andrew wants to use them your patch provides a clean
> solution.
> We've are using identity mappings in quite a few scenarios partially
> when performing tests but also to write through fscaps.
> We also had reports of users that use identity mappings. They create
> their rootfs by running image extraction in an identity mapped userns
> where fscaps are written through.
> Podman has use-cases for this feature as well and has been affected by
> the regression of the first fix.

Thanks for reviewing.

I'm not sure what your point above is, so just to make sure - the
alternative implementation also does allow fscaps for cases where
root uid is remapped, only disallowing it if it would violate the
ancestor's lack of cap_setfcap.


> > include/linux/user_namespace.h | 3 ++
> > include/uapi/linux/capability.h | 3 +-
> > kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++--
> > 3 files changed, 63 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -63,6 +63,9 @@ struct user_namespace {
> > kgid_t group;
> > struct ns_common ns;
> > unsigned long flags;
> > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > + * in its effective capability set at the child ns creation time. */
> > + bool parent_could_setfcap;
> >
> > #ifdef CONFIG_KEYS
> > /* List of joinable keyrings in this namespace. Modification access of
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index c6ca33034147..2ddb4226cd23 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
> >
> > #define CAP_AUDIT_CONTROL 30
> >
> > -/* Set or remove capabilities on files */
> > +/* Set or remove capabilities on files.
> > + Map uid=0 into a child user namespace. */
> >
> > #define CAP_SETFCAP 31
> >
> > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > index af612945a4d0..8c75028a9aae 100644
> > --- a/kernel/user_namespace.c
> > +++ b/kernel/user_namespace.c
> > @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
> > if (!ns)
> > goto fail_dec;
> >
> > + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
> > ret = ns_alloc_inum(&ns->ns);
> > if (ret)
> > goto fail_free;
> > @@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
> > return 0;
> > }
> >
> > +/*
> > + * If mapping uid 0, then file capabilities created by the new namespace will
> > + * be effective in the parent namespace. Adding file capabilities requires
> > + * CAP_SETFCAP, which the child namespace will have, so creating such a
> > + * mapping requires CAP_SETFCAP in the parent namespace.
> > + */
> > +static bool disallowed_0_mapping(const struct file *file,
> > + struct user_namespace *map_ns,
> > + struct uid_gid_map *new_map)
> > +{
> > + int idx;
> > + bool zeromapping = false;
> > + const struct user_namespace *file_ns = file->f_cred->user_ns;
> > +
> > + for (idx = 0; idx < new_map->nr_extents; idx++) {
>
> I think having that loop is acceptable here since it's only called once
> at map creation time even though the forward array is not yet sorted.
>
> > + struct uid_gid_extent *e;
> > + u32 lower_first;
> > +
> > + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > + e = &new_map->extent[idx];
> > + else
> > + e = &new_map->forward[idx];
> > + if (e->lower_first == 0) {
> > + zeromapping = true;
> > + break;
> > + }
> > + }
> > +
> > + if (!zeromapping)
> > + return false;
> > +
> > + if (map_ns == file_ns) {
> > + /* The user unshared first and is writing to
> > + * /proc/self/uid_map. User already has full
> > + * capabilites in the new namespace, so verify
> > + * that the parent has CAP_SETFCAP. */
> > + if (!file_ns->parent_could_setfcap)
> > + return true;
> > + } else {
> > + /* Process p1 is writing to uid_map of p2, who
> > + * is in a child user namespace to p1's. So
> > + * we verify that p1 has CAP_SETFCAP to its
> > + * own namespace */
> > + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
> Maybe we can tweak this a tiny bit to get rid of the "zeromapping"?:
>
> static bool disallowed_0_mapping(const struct file *file,
> struct user_namespace *map_ns,
> struct uid_gid_map *new_map)
> {
> int idx;
> const struct user_namespace *file_ns = file->f_cred->user_ns;
> struct uid_gid_extent *extent0 = NULL;
>
> for (idx = 0; idx < new_map->nr_extents; idx++) {
> u32 lower_first;
>
> if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> extent0 = &new_map->extent[idx];
> else
> extent0 = &new_map->forward[idx];
> if (extent0->lower_first == 0)
> break;
>
> extent0 = NULL;
> }
>
> if (!extent0)
> return false;

Feels a little less clear to me, but that's probably just me, so I'll
switch it over, thanks.

>
> if (map_ns == file_ns) {
> /*
> * The user unshared first and is writing to
> * /proc/self/uid_map. User already has full
> * capabilites in the new namespace, so verify
> * that the parent has CAP_SETFCAP.
> */
> if (!file_ns->parent_could_setfcap)
> return true;
> } else {
> /*
> * Process p1 is writing to uid_map of p2, who
> * is in a child user namespace to p1's. So
> * we verify that p1 has CAP_SETFCAP to its
> * own namespace.
> */
> if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> return true;
> }
>
> return false;
> }
>
> In addition I would think that expressing the logic the other way around
> is more legible. I'm not too keen on having negations in function names.
> We should probably also tweak the comment a bit and make it kernel-doc
> clean:
>
> /**
> * verify_root_map() - check the uid 0 mapping

Hm. restrict_root_map() ? "verify" sounds like we should sometimes reject
it.

> * @file: idmapping file
> * @map_ns: user namespace of the target process
> * @new_map: requested idmap
> *
> * If a process requested a mapping for uid 0 onto
> * uid 0 verify that the process writing the map had the CAP_SETFCAP
> * capability as the target process will be able to
> * write fscaps that are valid in ancestor user namespaces.
> *
> * Return: true if the mapping is allow, false if not.
> */
> static bool verify_root_map()

2021-04-17 02:26:15

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

On Fri, Apr 16, 2021 at 04:34:53PM -0500, Serge E. Hallyn wrote:
> On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> > On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > > v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > > and is the approach you suggested. I can send it also as a separate patch
> > > if you like)
> > >
> > > A process running as uid 0 but without cap_setfcap currently can simply
> > > unshare a new user namespace with uid 0 mapped to 0. While this task
> > > will not have new capabilities against the parent namespace, there is
> > > a loophole due to the way namespaced file capabilities work. File
> > > capabilities valid in userns 1 are distinguised from file capabilities
> > > valid in userns 2 by the kuid which underlies uid 0. Therefore
> > > the restricted root process can unshare a new self-mapping namespace,
> > > add a namespaced file capability onto a file, then use that file
> > > capability in the parent namespace.
> > >
> > > To prevent that, do not allow mapping uid 0 if the process which
> > > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > > for setting file capabilities.
> > >
> > > A further wrinkle: a task can unshare its user namespace, then
> > > open its uid_map file itself, and map (only) its own uid. In this
> > > case we do not have the credential from before unshare, which was
> > > potentially more restricted. So, when creating a user namespace, we
> > > record whether the creator had CAP_SETFCAP. Then we can use that
> > > during map_write().
> > >
> > > With this patch:
> > >
> > > 1. unprivileged user can still unshare -Ur
> > >
> > > ubuntu@caps:~$ unshare -Ur
> > > root@caps:~# logout
> > >
> > > 2. root user can still unshare -Ur
> > >
> > > ubuntu@caps:~$ sudo bash
> > > root@caps:/home/ubuntu# unshare -Ur
> > > root@caps:/home/ubuntu# logout
> > >
> > > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> > >
> > > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > > root@caps:/home/ubuntu# unshare -Ur
> > > unshare: write failed /proc/self/uid_map: Operation not permitted
> > >
> > > Signed-off-by: Serge Hallyn <[email protected]>
> > >
> > > Changelog:
> > > * fix logic in the case of writing to another task's uid_map
> > > * rename 'ns' to 'map_ns', and make a file_ns local variable
> > > * use /* comments */
> > > * update the CAP_SETFCAP comment in capability.h
> > > * rename parent_unpriv to parent_can_setfcap (and reverse the
> > > logic)
> > > * remove printks
> > > * clarify (i hope) the code comments
> > > * update capability.h comment
> > > * renamed parent_can_setfcap to parent_could_setfcap
> > > * made the check its own disallowed_0_mapping() fn
> > > * moved the check into new_idmap_permitted
> > > ---
> >
> > Thank you for working on this fix!
> >
> > I do prefer your approach of doing the check at user namespace creation
> > time instead of moving it into the setxattr() codepath.
> >
> > Let me reiterate that the ability to write through fscaps is a valid
> > usecase and this should continue to work but that for locked down user
> > namespace as Andrew wants to use them your patch provides a clean
> > solution.
> > We've are using identity mappings in quite a few scenarios partially
> > when performing tests but also to write through fscaps.
> > We also had reports of users that use identity mappings. They create
> > their rootfs by running image extraction in an identity mapped userns
> > where fscaps are written through.
> > Podman has use-cases for this feature as well and has been affected by
> > the regression of the first fix.
>
> Thanks for reviewing.
>
> I'm not sure what your point above is, so just to make sure - the
> alternative implementation also does allow fscaps for cases where
> root uid is remapped, only disallowing it if it would violate the
> ancestor's lack of cap_setfcap.
>
>
> > > include/linux/user_namespace.h | 3 ++
> > > include/uapi/linux/capability.h | 3 +-
> > > kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++--
> > > 3 files changed, 63 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > > --- a/include/linux/user_namespace.h
> > > +++ b/include/linux/user_namespace.h
> > > @@ -63,6 +63,9 @@ struct user_namespace {
> > > kgid_t group;
> > > struct ns_common ns;
> > > unsigned long flags;
> > > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > > + * in its effective capability set at the child ns creation time. */
> > > + bool parent_could_setfcap;
> > >
> > > #ifdef CONFIG_KEYS
> > > /* List of joinable keyrings in this namespace. Modification access of
> > > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > > index c6ca33034147..2ddb4226cd23 100644
> > > --- a/include/uapi/linux/capability.h
> > > +++ b/include/uapi/linux/capability.h
> > > @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
> > >
> > > #define CAP_AUDIT_CONTROL 30
> > >
> > > -/* Set or remove capabilities on files */
> > > +/* Set or remove capabilities on files.
> > > + Map uid=0 into a child user namespace. */
> > >
> > > #define CAP_SETFCAP 31
> > >
> > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > > index af612945a4d0..8c75028a9aae 100644
> > > --- a/kernel/user_namespace.c
> > > +++ b/kernel/user_namespace.c
> > > @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
> > > if (!ns)
> > > goto fail_dec;
> > >
> > > + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
> > > ret = ns_alloc_inum(&ns->ns);
> > > if (ret)
> > > goto fail_free;
> > > @@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * If mapping uid 0, then file capabilities created by the new namespace will
> > > + * be effective in the parent namespace. Adding file capabilities requires
> > > + * CAP_SETFCAP, which the child namespace will have, so creating such a
> > > + * mapping requires CAP_SETFCAP in the parent namespace.
> > > + */
> > > +static bool disallowed_0_mapping(const struct file *file,
> > > + struct user_namespace *map_ns,
> > > + struct uid_gid_map *new_map)
> > > +{
> > > + int idx;
> > > + bool zeromapping = false;
> > > + const struct user_namespace *file_ns = file->f_cred->user_ns;
> > > +
> > > + for (idx = 0; idx < new_map->nr_extents; idx++) {
> >
> > I think having that loop is acceptable here since it's only called once
> > at map creation time even though the forward array is not yet sorted.
> >
> > > + struct uid_gid_extent *e;
> > > + u32 lower_first;
> > > +
> > > + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > > + e = &new_map->extent[idx];
> > > + else
> > > + e = &new_map->forward[idx];
> > > + if (e->lower_first == 0) {
> > > + zeromapping = true;
> > > + break;
> > > + }
> > > + }
> > > +
> > > + if (!zeromapping)
> > > + return false;
> > > +
> > > + if (map_ns == file_ns) {
> > > + /* The user unshared first and is writing to
> > > + * /proc/self/uid_map. User already has full
> > > + * capabilites in the new namespace, so verify
> > > + * that the parent has CAP_SETFCAP. */
> > > + if (!file_ns->parent_could_setfcap)
> > > + return true;
> > > + } else {
> > > + /* Process p1 is writing to uid_map of p2, who
> > > + * is in a child user namespace to p1's. So
> > > + * we verify that p1 has CAP_SETFCAP to its
> > > + * own namespace */
> > > + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > > + return true;
> > > + }
> > > +
> > > + return false;
> > > +}
> >
> > Maybe we can tweak this a tiny bit to get rid of the "zeromapping"?:
> >
> > static bool disallowed_0_mapping(const struct file *file,
> > struct user_namespace *map_ns,
> > struct uid_gid_map *new_map)
> > {
> > int idx;
> > const struct user_namespace *file_ns = file->f_cred->user_ns;
> > struct uid_gid_extent *extent0 = NULL;
> >
> > for (idx = 0; idx < new_map->nr_extents; idx++) {
> > u32 lower_first;
> >
> > if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > extent0 = &new_map->extent[idx];
> > else
> > extent0 = &new_map->forward[idx];
> > if (extent0->lower_first == 0)
> > break;
> >
> > extent0 = NULL;
> > }
> >
> > if (!extent0)
> > return false;
>
> Feels a little less clear to me, but that's probably just me, so I'll
> switch it over, thanks.
>
> >
> > if (map_ns == file_ns) {
> > /*
> > * The user unshared first and is writing to
> > * /proc/self/uid_map. User already has full
> > * capabilites in the new namespace, so verify
> > * that the parent has CAP_SETFCAP.
> > */
> > if (!file_ns->parent_could_setfcap)
> > return true;
> > } else {
> > /*
> > * Process p1 is writing to uid_map of p2, who
> > * is in a child user namespace to p1's. So
> > * we verify that p1 has CAP_SETFCAP to its
> > * own namespace.
> > */
> > if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > return true;
> > }
> >
> > return false;
> > }
> >
> > In addition I would think that expressing the logic the other way around
> > is more legible. I'm not too keen on having negations in function names.
> > We should probably also tweak the comment a bit and make it kernel-doc
> > clean:
> >
> > /**
> > * verify_root_map() - check the uid 0 mapping
>
> Hm. restrict_root_map() ? "verify" sounds like we should sometimes reject
> it.

yes please ignore that :)

2021-04-17 20:05:49

by Serge E. Hallyn

[permalink] [raw]
Subject: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

A process running as uid 0 but without cap_setfcap currently can simply
unshare a new user namespace with uid 0 mapped to 0. While this task
will not have new capabilities against the parent namespace, there is
a loophole due to the way namespaced file capabilities work. File
capabilities valid in userns 1 are distinguised from file capabilities
valid in userns 2 by the kuid which underlies uid 0. Therefore
the restricted root process can unshare a new self-mapping namespace,
add a namespaced file capability onto a file, then use that file
capability in the parent namespace.

To prevent that, do not allow mapping uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

A further wrinkle: a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid. In this
case we do not have the credential from before unshare, which was
potentially more restricted. So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP. Then we can use that
during map_write().

With this patch:

1. unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Signed-off-by: Serge Hallyn <[email protected]>

Changelog:
* fix logic in the case of writing to another task's uid_map
* rename 'ns' to 'map_ns', and make a file_ns local variable
* use /* comments */
* update the CAP_SETFCAP comment in capability.h
* rename parent_unpriv to parent_can_setfcap (and reverse the
logic)
* remove printks
* clarify (i hope) the code comments
* update capability.h comment
* renamed parent_can_setfcap to parent_could_setfcap
* made the check its own disallowed_0_mapping() fn
* moved the check into new_idmap_permitted
* rename disallowed_0_mapping to verify_root_mapping
* change verify_root_mapping to Christian's suggested flow
---
include/linux/user_namespace.h | 3 ++
include/uapi/linux/capability.h | 3 +-
kernel/user_namespace.c | 66 +++++++++++++++++++++++++++++++--
3 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
kgid_t group;
struct ns_common ns;
unsigned long flags;
+ /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+ * in its effective capability set at the child ns creation time. */
+ bool parent_could_setfcap;

#ifdef CONFIG_KEYS
/* List of joinable keyrings in this namespace. Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_CONTROL 30

-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+ Map uid=0 into a child user namespace. */

#define CAP_SETFCAP 31

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..2ead291177b0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
if (!ns)
goto fail_dec;

+ ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
ret = ns_alloc_inum(&ns->ns);
if (ret)
goto fail_free;
@@ -841,6 +842,61 @@ static int sort_idmaps(struct uid_gid_map *map)
return 0;
}

+/**
+ * verify_root_map() - check the uid 0 mapping
+ * @file: idmapping file
+ * @map_ns: user namespace of the target process
+ * @new_map: requested idmap
+ *
+ * If a process requested a mapping for uid 0 onto uid 0, verify that the
+ * process writing the map had the CAP_SETFCAP capability as the target process
+ * will be able to write fscaps that are valid in ancestor user namespaces.
+ *
+ * Return: true if the mapping is allowed, false if not.
+ */
+static bool verify_root_map(const struct file *file,
+ struct user_namespace *map_ns,
+ struct uid_gid_map *new_map)
+{
+ int idx;
+ const struct user_namespace *file_ns = file->f_cred->user_ns;
+ struct uid_gid_extent *extent0 = NULL;
+
+ for (idx = 0; idx < new_map->nr_extents; idx++) {
+ u32 lower_first;
+
+ if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+ extent0 = &new_map->extent[idx];
+ else
+ extent0 = &new_map->forward[idx];
+ if (extent0->lower_first == 0)
+ break;
+
+ extent0 = NULL;
+ }
+
+ if (!extent0)
+ return true;
+
+ if (map_ns == file_ns) {
+ /* The user unshared first and is writing to
+ * /proc/self/uid_map. User already has full
+ * capabilites in the new namespace, so verify
+ * that the parent has CAP_SETFCAP. */
+ if (!file_ns->parent_could_setfcap)
+ return false;
+ } else {
+ /* Process p1 is writing to uid_map of p2, who
+ * is in a child user namespace to p1's. So
+ * we verify that p1 has CAP_SETFCAP to its
+ * own namespace */
+ if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+ return false;
+ }
+
+ return true;
+}
+
static ssize_t map_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos,
int cap_setid,
@@ -848,7 +904,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
struct uid_gid_map *parent_map)
{
struct seq_file *seq = file->private_data;
- struct user_namespace *ns = seq->private;
+ struct user_namespace *map_ns = seq->private;
struct uid_gid_map new_map;
unsigned idx;
struct uid_gid_extent extent;
@@ -895,7 +951,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
/*
* Adjusting namespace settings requires capabilities on the target.
*/
- if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+ if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
goto out;

/* Parse the user data */
@@ -965,7 +1021,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,

ret = -EPERM;
/* Validate the user is allowed to use user id's mapped to. */
- if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+ if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
goto out;

ret = -EPERM;
@@ -1086,6 +1142,10 @@ static bool new_idmap_permitted(const struct file *file,
struct uid_gid_map *new_map)
{
const struct cred *cred = file->f_cred;
+
+ if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
+ return false;
+
/* Don't allow mappings that would allow anything that wouldn't
* be allowed without the establishment of unprivileged mappings.
*/
--
2.25.1

2021-04-18 17:22:46

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

On Sat, Apr 17, 2021 at 03:04:34PM -0500, Serge Hallyn wrote:
> A process running as uid 0 but without cap_setfcap currently can simply
> unshare a new user namespace with uid 0 mapped to 0. While this task
> will not have new capabilities against the parent namespace, there is
> a loophole due to the way namespaced file capabilities work. File
> capabilities valid in userns 1 are distinguised from file capabilities
> valid in userns 2 by the kuid which underlies uid 0. Therefore
> the restricted root process can unshare a new self-mapping namespace,
> add a namespaced file capability onto a file, then use that file
> capability in the parent namespace.
>
> To prevent that, do not allow mapping uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> A further wrinkle: a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid. In this
> case we do not have the credential from before unshare, which was
> potentially more restricted. So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP. Then we can use that
> during map_write().
>
> With this patch:
>
> 1. unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Signed-off-by: Serge Hallyn <[email protected]>
>
> Changelog:
> * fix logic in the case of writing to another task's uid_map
> * rename 'ns' to 'map_ns', and make a file_ns local variable
> * use /* comments */
> * update the CAP_SETFCAP comment in capability.h
> * rename parent_unpriv to parent_can_setfcap (and reverse the
> logic)
> * remove printks
> * clarify (i hope) the code comments
> * update capability.h comment
> * renamed parent_can_setfcap to parent_could_setfcap
> * made the check its own disallowed_0_mapping() fn
> * moved the check into new_idmap_permitted
> * rename disallowed_0_mapping to verify_root_mapping
> * change verify_root_mapping to Christian's suggested flow
> ---

Thank you. This looks good. I tested this with:

- fstests
- LXD testsuite
- Podman testsuite
- libcap testsuite

Tested-by: Christian Brauner <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>

2021-04-18 21:23:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)


Guiseppe can you take a look at this?

This is a second attempt at tightening up the semantics of writing to
file capabilities from a user namespace.

The first attempt was reverted with 3b0c2d3eaa83 ("Revert 95ebabde382c
("capabilities: Don't allow writing ambiguous v3 file capabilities")"),
which corrected the issue reported in:
https://github.com/containers/buildah/issues/3071

There is a report the podman testsuite passes. While different this
looks in many ways much more strict than the code that was reverted. So
while I can imagine this change doesn't cause problems as is, I will be
surprised.

Eric



"Serge E. Hallyn" <[email protected]> writes:

> A process running as uid 0 but without cap_setfcap currently can simply
> unshare a new user namespace with uid 0 mapped to 0. While this task
> will not have new capabilities against the parent namespace, there is
> a loophole due to the way namespaced file capabilities work. File
> capabilities valid in userns 1 are distinguised from file capabilities
> valid in userns 2 by the kuid which underlies uid 0. Therefore
> the restricted root process can unshare a new self-mapping namespace,
> add a namespaced file capability onto a file, then use that file
> capability in the parent namespace.
>
> To prevent that, do not allow mapping uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> A further wrinkle: a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid. In this
> case we do not have the credential from before unshare, which was
> potentially more restricted. So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP. Then we can use that
> during map_write().
>
> With this patch:
>
> 1. unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Signed-off-by: Serge Hallyn <[email protected]>
>
> Changelog:
> * fix logic in the case of writing to another task's uid_map
> * rename 'ns' to 'map_ns', and make a file_ns local variable
> * use /* comments */
> * update the CAP_SETFCAP comment in capability.h
> * rename parent_unpriv to parent_can_setfcap (and reverse the
> logic)
> * remove printks
> * clarify (i hope) the code comments
> * update capability.h comment
> * renamed parent_can_setfcap to parent_could_setfcap
> * made the check its own disallowed_0_mapping() fn
> * moved the check into new_idmap_permitted
> * rename disallowed_0_mapping to verify_root_mapping
> * change verify_root_mapping to Christian's suggested flow
> ---
> include/linux/user_namespace.h | 3 ++
> include/uapi/linux/capability.h | 3 +-
> kernel/user_namespace.c | 66 +++++++++++++++++++++++++++++++--
> 3 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 64cf8ebdc4ec..f6c5f784be5a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -63,6 +63,9 @@ struct user_namespace {
> kgid_t group;
> struct ns_common ns;
> unsigned long flags;
> + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> + * in its effective capability set at the child ns creation time. */
> + bool parent_could_setfcap;
>
> #ifdef CONFIG_KEYS
> /* List of joinable keyrings in this namespace. Modification access of
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c6ca33034147..2ddb4226cd23 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
>
> #define CAP_AUDIT_CONTROL 30
>
> -/* Set or remove capabilities on files */
> +/* Set or remove capabilities on files.
> + Map uid=0 into a child user namespace. */
>
> #define CAP_SETFCAP 31
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index af612945a4d0..2ead291177b0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
> if (!ns)
> goto fail_dec;
>
> + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
> ret = ns_alloc_inum(&ns->ns);
> if (ret)
> goto fail_free;
> @@ -841,6 +842,61 @@ static int sort_idmaps(struct uid_gid_map *map)
> return 0;
> }
>
> +/**
> + * verify_root_map() - check the uid 0 mapping
> + * @file: idmapping file
> + * @map_ns: user namespace of the target process
> + * @new_map: requested idmap
> + *
> + * If a process requested a mapping for uid 0 onto uid 0, verify that the
> + * process writing the map had the CAP_SETFCAP capability as the target process
> + * will be able to write fscaps that are valid in ancestor user namespaces.
> + *
> + * Return: true if the mapping is allowed, false if not.
> + */
> +static bool verify_root_map(const struct file *file,
> + struct user_namespace *map_ns,
> + struct uid_gid_map *new_map)
> +{
> + int idx;
> + const struct user_namespace *file_ns = file->f_cred->user_ns;
> + struct uid_gid_extent *extent0 = NULL;
> +
> + for (idx = 0; idx < new_map->nr_extents; idx++) {
> + u32 lower_first;
> +
> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> + extent0 = &new_map->extent[idx];
> + else
> + extent0 = &new_map->forward[idx];
> + if (extent0->lower_first == 0)
> + break;
> +
> + extent0 = NULL;
> + }
> +
> + if (!extent0)
> + return true;
> +
> + if (map_ns == file_ns) {
> + /* The user unshared first and is writing to
> + * /proc/self/uid_map. User already has full
> + * capabilites in the new namespace, so verify
> + * that the parent has CAP_SETFCAP. */
> + if (!file_ns->parent_could_setfcap)
> + return false;
> + } else {
> + /* Process p1 is writing to uid_map of p2, who
> + * is in a child user namespace to p1's. So
> + * we verify that p1 has CAP_SETFCAP to its
> + * own namespace */
> + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> + return false;
> + }
> +
> + return true;
> +}
> +
> static ssize_t map_write(struct file *file, const char __user *buf,
> size_t count, loff_t *ppos,
> int cap_setid,
> @@ -848,7 +904,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
> struct uid_gid_map *parent_map)
> {
> struct seq_file *seq = file->private_data;
> - struct user_namespace *ns = seq->private;
> + struct user_namespace *map_ns = seq->private;
> struct uid_gid_map new_map;
> unsigned idx;
> struct uid_gid_extent extent;
> @@ -895,7 +951,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
> /*
> * Adjusting namespace settings requires capabilities on the target.
> */
> - if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
> + if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
> goto out;
>
> /* Parse the user data */
> @@ -965,7 +1021,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>
> ret = -EPERM;
> /* Validate the user is allowed to use user id's mapped to. */
> - if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
> + if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
> goto out;
>
> ret = -EPERM;
> @@ -1086,6 +1142,10 @@ static bool new_idmap_permitted(const struct file *file,
> struct uid_gid_map *new_map)
> {
> const struct cred *cred = file->f_cred;
> +
> + if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
> + return false;
> +
> /* Don't allow mappings that would allow anything that wouldn't
> * be allowed without the establishment of unprivileged mappings.
> */

2021-04-19 12:27:14

by Serge E. Hallyn

[permalink] [raw]
Subject: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.3)

cap_setfcap is required to create file capabilities.

Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
process running as uid 0 but without cap_setfcap is able to work around
this as follows: unshare a new user namespace which maps parent uid 0
into the child namespace. While this task will not have new
capabilities against the parent namespace, there is a loophole due to
the way namespaced file capabilities are represented as xattrs. File
capabilities valid in userns 1 are distinguished from file capabilities
valid in userns 2 by the kuid which underlies uid 0. Therefore the
restricted root process can unshare a new self-mapping namespace, add a
namespaced file capability onto a file, then use that file capability in
the parent namespace.

To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

As a further wrinkle: a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid. In this
case we do not have the credential from before unshare, which was
potentially more restricted. So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP. Then we can use that
during map_write().

With this patch:

1. Unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. Root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. Root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities. This approach can be seen here:
https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4

Signed-off-by: Serge Hallyn <[email protected]>
Reviewed-by: Andrew G. Morgan <[email protected]>
Tested-by: Christian Brauner <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>

Changelog:
* fix logic in the case of writing to another task's uid_map
* rename 'ns' to 'map_ns', and make a file_ns local variable
* use /* comments */
* update the CAP_SETFCAP comment in capability.h
* rename parent_unpriv to parent_can_setfcap (and reverse the
logic)
* remove printks
* clarify (i hope) the code comments
* update capability.h comment
* renamed parent_can_setfcap to parent_could_setfcap
* made the check its own disallowed_0_mapping() fn
* moved the check into new_idmap_permitted
* rename disallowed_0_mapping to verify_root_mapping
* change verify_root_mapping to Christian's suggested flow
* correct+clarify comments: parent uid 0 mapping to any
child uid is a problem.
---
include/linux/user_namespace.h | 3 ++
include/uapi/linux/capability.h | 3 +-
kernel/user_namespace.c | 67 +++++++++++++++++++++++++++++++--
3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
kgid_t group;
struct ns_common ns;
unsigned long flags;
+ /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+ * in its effective capability set at the child ns creation time. */
+ bool parent_could_setfcap;

#ifdef CONFIG_KEYS
/* List of joinable keyrings in this namespace. Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_CONTROL 30

-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+ Map uid=0 into a child user namespace. */

#define CAP_SETFCAP 31

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..609a729a9879 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
if (!ns)
goto fail_dec;

+ ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
ret = ns_alloc_inum(&ns->ns);
if (ret)
goto fail_free;
@@ -841,6 +842,62 @@ static int sort_idmaps(struct uid_gid_map *map)
return 0;
}

+/**
+ * verify_root_map() - check the uid 0 mapping
+ * @file: idmapping file
+ * @map_ns: user namespace of the target process
+ * @new_map: requested idmap
+ *
+ * If a process requests mapping parent uid 0 into the new ns, verify that the
+ * process writing the map had the CAP_SETFCAP capability as the target process
+ * will be able to write fscaps that are valid in ancestor user namespaces.
+ *
+ * Return: true if the mapping is allowed, false if not.
+ */
+static bool verify_root_map(const struct file *file,
+ struct user_namespace *map_ns,
+ struct uid_gid_map *new_map)
+{
+ int idx;
+ const struct user_namespace *file_ns = file->f_cred->user_ns;
+ struct uid_gid_extent *extent0 = NULL;
+
+ for (idx = 0; idx < new_map->nr_extents; idx++) {
+ u32 lower_first;
+
+ if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+ extent0 = &new_map->extent[idx];
+ else
+ extent0 = &new_map->forward[idx];
+ if (extent0->lower_first == 0)
+ break;
+
+ extent0 = NULL;
+ }
+
+ if (!extent0)
+ return true;
+
+ if (map_ns == file_ns) {
+ /* The process unshared its ns and is writing to its own
+ * /proc/self/uid_map. User already has full capabilites in
+ * the new namespace. Verify that the parent had CAP_SETFCAP
+ * when it unshared.
+ * */
+ if (!file_ns->parent_could_setfcap)
+ return false;
+ } else {
+ /* Process p1 is writing to uid_map of p2, who is in a child
+ * user namespace to p1's. Verify that the opener of the map
+ * file has CAP_SETFCAP against the parent of the new map
+ * namespace */
+ if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+ return false;
+ }
+
+ return true;
+}
+
static ssize_t map_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos,
int cap_setid,
@@ -848,7 +905,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
struct uid_gid_map *parent_map)
{
struct seq_file *seq = file->private_data;
- struct user_namespace *ns = seq->private;
+ struct user_namespace *map_ns = seq->private;
struct uid_gid_map new_map;
unsigned idx;
struct uid_gid_extent extent;
@@ -895,7 +952,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
/*
* Adjusting namespace settings requires capabilities on the target.
*/
- if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+ if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
goto out;

/* Parse the user data */
@@ -965,7 +1022,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,

ret = -EPERM;
/* Validate the user is allowed to use user id's mapped to. */
- if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+ if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
goto out;

ret = -EPERM;
@@ -1086,6 +1143,10 @@ static bool new_idmap_permitted(const struct file *file,
struct uid_gid_map *new_map)
{
const struct cred *cred = file->f_cred;
+
+ if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
+ return false;
+
/* Don't allow mappings that would allow anything that wouldn't
* be allowed without the establishment of unprivileged mappings.
*/
--
2.25.1

2021-04-19 16:17:58

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

[email protected] (Eric W. Biederman) writes:

> Guiseppe can you take a look at this?
>
> This is a second attempt at tightening up the semantics of writing to
> file capabilities from a user namespace.
>
> The first attempt was reverted with 3b0c2d3eaa83 ("Revert 95ebabde382c
> ("capabilities: Don't allow writing ambiguous v3 file capabilities")"),
> which corrected the issue reported in:
> https://github.com/containers/buildah/issues/3071
>
> There is a report the podman testsuite passes. While different this
> looks in many ways much more strict than the code that was reverted. So
> while I can imagine this change doesn't cause problems as is, I will be
> surprised.

thanks for pulling me in the discussion.

I've tested the patch with several cases similar to the issue we had in
the past and the patch seems to work well.

Podman creates all the user namespaces within the same parent user
namespace. In the parent user namespace all the capabilities are kept
and AFAIK Docker does the same. I'd expect a change in behavior only
for nested user namespaces in containers where CAP_SETFCAP is not
granted, but that is not a common configuration given that CAP_SETFCAP
is added by default.


> "Serge E. Hallyn" <[email protected]> writes:
>
>> +/**
>> + * verify_root_map() - check the uid 0 mapping
>> + * @file: idmapping file
>> + * @map_ns: user namespace of the target process
>> + * @new_map: requested idmap
>> + *
>> + * If a process requested a mapping for uid 0 onto uid 0, verify that the
>> + * process writing the map had the CAP_SETFCAP capability as the target process
>> + * will be able to write fscaps that are valid in ancestor user namespaces.
>> + *
>> + * Return: true if the mapping is allowed, false if not.
>> + */
>> +static bool verify_root_map(const struct file *file,
>> + struct user_namespace *map_ns,
>> + struct uid_gid_map *new_map)
>> +{
>> + int idx;
>> + const struct user_namespace *file_ns = file->f_cred->user_ns;
>> + struct uid_gid_extent *extent0 = NULL;
>> +
>> + for (idx = 0; idx < new_map->nr_extents; idx++) {
>> + u32 lower_first;

nit: lower_first seems unused?

>> +
>> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>> + extent0 = &new_map->extent[idx];
>> + else
>> + extent0 = &new_map->forward[idx];
>> + if (extent0->lower_first == 0)
>> + break;
>> +
>> + extent0 = NULL;
>> + }

Tested-by: Giuseppe Scrivano <[email protected]>

Regards,
Giuseppe

2021-04-19 16:24:49

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

On Mon, Apr 19, 2021 at 05:52:39PM +0200, Giuseppe Scrivano wrote:
> [email protected] (Eric W. Biederman) writes:
>
> > Guiseppe can you take a look at this?
> >
> > This is a second attempt at tightening up the semantics of writing to
> > file capabilities from a user namespace.
> >
> > The first attempt was reverted with 3b0c2d3eaa83 ("Revert 95ebabde382c
> > ("capabilities: Don't allow writing ambiguous v3 file capabilities")"),
> > which corrected the issue reported in:
> > https://github.com/containers/buildah/issues/3071
> >
> > There is a report the podman testsuite passes. While different this
> > looks in many ways much more strict than the code that was reverted. So
> > while I can imagine this change doesn't cause problems as is, I will be
> > surprised.
>
> thanks for pulling me in the discussion.
>
> I've tested the patch with several cases similar to the issue we had in
> the past and the patch seems to work well.
>
> Podman creates all the user namespaces within the same parent user
> namespace. In the parent user namespace all the capabilities are kept
> and AFAIK Docker does the same. I'd expect a change in behavior only
> for nested user namespaces in containers where CAP_SETFCAP is not
> granted, but that is not a common configuration given that CAP_SETFCAP
> is added by default.

Same for us and we do have extensive nested container workloads with
other runtimes running containers too.

>
>
> > "Serge E. Hallyn" <[email protected]> writes:
> >
> >> +/**
> >> + * verify_root_map() - check the uid 0 mapping
> >> + * @file: idmapping file
> >> + * @map_ns: user namespace of the target process
> >> + * @new_map: requested idmap
> >> + *
> >> + * If a process requested a mapping for uid 0 onto uid 0, verify that the
> >> + * process writing the map had the CAP_SETFCAP capability as the target process
> >> + * will be able to write fscaps that are valid in ancestor user namespaces.
> >> + *
> >> + * Return: true if the mapping is allowed, false if not.
> >> + */
> >> +static bool verify_root_map(const struct file *file,
> >> + struct user_namespace *map_ns,
> >> + struct uid_gid_map *new_map)
> >> +{
> >> + int idx;
> >> + const struct user_namespace *file_ns = file->f_cred->user_ns;
> >> + struct uid_gid_extent *extent0 = NULL;
> >> +
> >> + for (idx = 0; idx < new_map->nr_extents; idx++) {
> >> + u32 lower_first;
>
> nit: lower_first seems unused?
>
> >> +
> >> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> >> + extent0 = &new_map->extent[idx];
> >> + else
> >> + extent0 = &new_map->forward[idx];
> >> + if (extent0->lower_first == 0)
> >> + break;
> >> +
> >> + extent0 = NULL;
> >> + }
>
> Tested-by: Giuseppe Scrivano <[email protected]>

Thanks for running the tests and confirming my results!
Christian

2021-04-19 19:42:51

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.3)

On Mon, Apr 19, 2021 at 07:25:14AM -0500, Serge Hallyn wrote:
> cap_setfcap is required to create file capabilities.
>
> Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
> process running as uid 0 but without cap_setfcap is able to work around
> this as follows: unshare a new user namespace which maps parent uid 0
> into the child namespace. While this task will not have new
> capabilities against the parent namespace, there is a loophole due to
> the way namespaced file capabilities are represented as xattrs. File
> capabilities valid in userns 1 are distinguished from file capabilities
> valid in userns 2 by the kuid which underlies uid 0. Therefore the
> restricted root process can unshare a new self-mapping namespace, add a
> namespaced file capability onto a file, then use that file capability in
> the parent namespace.
>
> To prevent that, do not allow mapping parent uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> As a further wrinkle: a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid. In this
> case we do not have the credential from before unshare, which was
> potentially more restricted. So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP. Then we can use that
> during map_write().
>
> With this patch:
>
> 1. Unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. Root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. Root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Note: an alternative solution would be to allow uid 0 mappings by
> processes without CAP_SETFCAP, but to prevent such a namespace from
> writing any file capabilities. This approach can be seen here:
> https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
>

Ah, can you link to the previous fix and its revert, please? I think
that was mentioned in the formerly private thread as well but we forgot:

commit 95ebabde382c371572297915b104e55403674e73
Author: Eric W. Biederman <[email protected]>
Date: Thu Dec 17 09:42:00 2020 -0600

capabilities: Don't allow writing ambiguous v3 file capabilities

commit 3b0c2d3eaa83da259d7726192cf55a137769012f
Author: Eric W. Biederman <[email protected]>
Date: Fri Mar 12 15:07:09 2021 -0600

Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")

> Signed-off-by: Serge Hallyn <[email protected]>
> Reviewed-by: Andrew G. Morgan <[email protected]>
> Tested-by: Christian Brauner <[email protected]>
> Reviewed-by: Christian Brauner <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>

2021-04-20 03:43:25

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.3)

On Mon, Apr 19, 2021 at 06:09:11PM +0200, Christian Brauner wrote:
> On Mon, Apr 19, 2021 at 07:25:14AM -0500, Serge Hallyn wrote:
> > cap_setfcap is required to create file capabilities.
> >
> > Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
> > process running as uid 0 but without cap_setfcap is able to work around
> > this as follows: unshare a new user namespace which maps parent uid 0
> > into the child namespace. While this task will not have new
> > capabilities against the parent namespace, there is a loophole due to
> > the way namespaced file capabilities are represented as xattrs. File
> > capabilities valid in userns 1 are distinguished from file capabilities
> > valid in userns 2 by the kuid which underlies uid 0. Therefore the
> > restricted root process can unshare a new self-mapping namespace, add a
> > namespaced file capability onto a file, then use that file capability in
> > the parent namespace.
> >
> > To prevent that, do not allow mapping parent uid 0 if the process which
> > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > for setting file capabilities.
> >
> > As a further wrinkle: a task can unshare its user namespace, then
> > open its uid_map file itself, and map (only) its own uid. In this
> > case we do not have the credential from before unshare, which was
> > potentially more restricted. So, when creating a user namespace, we
> > record whether the creator had CAP_SETFCAP. Then we can use that
> > during map_write().
> >
> > With this patch:
> >
> > 1. Unprivileged user can still unshare -Ur
> >
> > ubuntu@caps:~$ unshare -Ur
> > root@caps:~# logout
> >
> > 2. Root user can still unshare -Ur
> >
> > ubuntu@caps:~$ sudo bash
> > root@caps:/home/ubuntu# unshare -Ur
> > root@caps:/home/ubuntu# logout
> >
> > 3. Root user without CAP_SETFCAP cannot unshare -Ur:
> >
> > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > root@caps:/home/ubuntu# unshare -Ur
> > unshare: write failed /proc/self/uid_map: Operation not permitted
> >
> > Note: an alternative solution would be to allow uid 0 mappings by
> > processes without CAP_SETFCAP, but to prevent such a namespace from
> > writing any file capabilities. This approach can be seen here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> >
>
> Ah, can you link to the previous fix and its revert, please? I think
> that was mentioned in the formerly private thread as well but we forgot:
>
> commit 95ebabde382c371572297915b104e55403674e73
> Author: Eric W. Biederman <[email protected]>
> Date: Thu Dec 17 09:42:00 2020 -0600
>
> capabilities: Don't allow writing ambiguous v3 file capabilities
>
> commit 3b0c2d3eaa83da259d7726192cf55a137769012f
> Author: Eric W. Biederman <[email protected]>
> Date: Fri Mar 12 15:07:09 2021 -0600
>
> Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")

Sure.

Is there a tag for that kind of thing or do I just mention it at the end
of the description?

2021-04-20 08:32:21

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.3)

On Mon, Apr 19, 2021 at 10:42:08PM -0500, Serge Hallyn wrote:
> On Mon, Apr 19, 2021 at 06:09:11PM +0200, Christian Brauner wrote:
> > On Mon, Apr 19, 2021 at 07:25:14AM -0500, Serge Hallyn wrote:
> > > cap_setfcap is required to create file capabilities.
> > >
> > > Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
> > > process running as uid 0 but without cap_setfcap is able to work around
> > > this as follows: unshare a new user namespace which maps parent uid 0
> > > into the child namespace. While this task will not have new
> > > capabilities against the parent namespace, there is a loophole due to
> > > the way namespaced file capabilities are represented as xattrs. File
> > > capabilities valid in userns 1 are distinguished from file capabilities
> > > valid in userns 2 by the kuid which underlies uid 0. Therefore the
> > > restricted root process can unshare a new self-mapping namespace, add a
> > > namespaced file capability onto a file, then use that file capability in
> > > the parent namespace.
> > >
> > > To prevent that, do not allow mapping parent uid 0 if the process which
> > > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > > for setting file capabilities.
> > >
> > > As a further wrinkle: a task can unshare its user namespace, then
> > > open its uid_map file itself, and map (only) its own uid. In this
> > > case we do not have the credential from before unshare, which was
> > > potentially more restricted. So, when creating a user namespace, we
> > > record whether the creator had CAP_SETFCAP. Then we can use that
> > > during map_write().
> > >
> > > With this patch:
> > >
> > > 1. Unprivileged user can still unshare -Ur
> > >
> > > ubuntu@caps:~$ unshare -Ur
> > > root@caps:~# logout
> > >
> > > 2. Root user can still unshare -Ur
> > >
> > > ubuntu@caps:~$ sudo bash
> > > root@caps:/home/ubuntu# unshare -Ur
> > > root@caps:/home/ubuntu# logout
> > >
> > > 3. Root user without CAP_SETFCAP cannot unshare -Ur:
> > >
> > > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > > root@caps:/home/ubuntu# unshare -Ur
> > > unshare: write failed /proc/self/uid_map: Operation not permitted
> > >
> > > Note: an alternative solution would be to allow uid 0 mappings by
> > > processes without CAP_SETFCAP, but to prevent such a namespace from
> > > writing any file capabilities. This approach can be seen here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > >
> >
> > Ah, can you link to the previous fix and its revert, please? I think
> > that was mentioned in the formerly private thread as well but we forgot:
> >
> > commit 95ebabde382c371572297915b104e55403674e73
> > Author: Eric W. Biederman <[email protected]>
> > Date: Thu Dec 17 09:42:00 2020 -0600
> >
> > capabilities: Don't allow writing ambiguous v3 file capabilities
> >
> > commit 3b0c2d3eaa83da259d7726192cf55a137769012f
> > Author: Eric W. Biederman <[email protected]>
> > Date: Fri Mar 12 15:07:09 2021 -0600
> >
> > Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")
>
> Sure.
>
> Is there a tag for that kind of thing or do I just mention it at the end
> of the description?

In this case it might make sense to just have a little paragraph that
explains the regression. How do you feel about adding?:

Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
capabilities") tried to fix the issue by preventing v3 fscaps to be
written to disk when the root uid would map to the same uid in nested
user namespaces. This led to regressions for various workloads. For
example, see [1]. Ultimately this is a valid use-case we have to support
meaning we had to revert this change in 3b0c2d3eaa83 ("Revert
95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file
capabilities")").

[1]: https://github.com/containers/buildah/issues/3071

2021-04-20 13:42:10

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3.2)

On Mon, Apr 19, 2021 at 05:52:39PM +0200, Giuseppe Scrivano wrote:
> [email protected] (Eric W. Biederman) writes:
>
> > Guiseppe can you take a look at this?
> >
> > This is a second attempt at tightening up the semantics of writing to
> > file capabilities from a user namespace.
> >
> > The first attempt was reverted with 3b0c2d3eaa83 ("Revert 95ebabde382c
> > ("capabilities: Don't allow writing ambiguous v3 file capabilities")"),
> > which corrected the issue reported in:
> > https://github.com/containers/buildah/issues/3071
> >
> > There is a report the podman testsuite passes. While different this
> > looks in many ways much more strict than the code that was reverted. So
> > while I can imagine this change doesn't cause problems as is, I will be
> > surprised.
>
> thanks for pulling me in the discussion.
>
> I've tested the patch with several cases similar to the issue we had in
> the past and the patch seems to work well.
>
> Podman creates all the user namespaces within the same parent user
> namespace. In the parent user namespace all the capabilities are kept
> and AFAIK Docker does the same. I'd expect a change in behavior only
> for nested user namespaces in containers where CAP_SETFCAP is not
> granted, but that is not a common configuration given that CAP_SETFCAP
> is added by default.
>
>
> > "Serge E. Hallyn" <[email protected]> writes:
> >
> >> +/**
> >> + * verify_root_map() - check the uid 0 mapping
> >> + * @file: idmapping file
> >> + * @map_ns: user namespace of the target process
> >> + * @new_map: requested idmap
> >> + *
> >> + * If a process requested a mapping for uid 0 onto uid 0, verify that the
> >> + * process writing the map had the CAP_SETFCAP capability as the target process
> >> + * will be able to write fscaps that are valid in ancestor user namespaces.
> >> + *
> >> + * Return: true if the mapping is allowed, false if not.
> >> + */
> >> +static bool verify_root_map(const struct file *file,
> >> + struct user_namespace *map_ns,
> >> + struct uid_gid_map *new_map)
> >> +{
> >> + int idx;
> >> + const struct user_namespace *file_ns = file->f_cred->user_ns;
> >> + struct uid_gid_extent *extent0 = NULL;
> >> +
> >> + for (idx = 0; idx < new_map->nr_extents; idx++) {
> >> + u32 lower_first;
>
> nit: lower_first seems unused?

Drat - I noticed that Sunday or Monday and forgot to remove it, thanks.

> >> +
> >> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> >> + extent0 = &new_map->extent[idx];
> >> + else
> >> + extent0 = &new_map->forward[idx];
> >> + if (extent0->lower_first == 0)
> >> + break;
> >> +
> >> + extent0 = NULL;
> >> + }
>
> Tested-by: Giuseppe Scrivano <[email protected]>

Awesome - thanks for testing.

2021-04-20 13:44:37

by Serge E. Hallyn

[permalink] [raw]
Subject: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

cap_setfcap is required to create file capabilities.

Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
process running as uid 0 but without cap_setfcap is able to work around
this as follows: unshare a new user namespace which maps parent uid 0
into the child namespace. While this task will not have new
capabilities against the parent namespace, there is a loophole due to
the way namespaced file capabilities are represented as xattrs. File
capabilities valid in userns 1 are distinguished from file capabilities
valid in userns 2 by the kuid which underlies uid 0. Therefore the
restricted root process can unshare a new self-mapping namespace, add a
namespaced file capability onto a file, then use that file capability in
the parent namespace.

To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

As a further wrinkle: a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid. In this
case we do not have the credential from before unshare, which was
potentially more restricted. So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP. Then we can use that
during map_write().

With this patch:

1. Unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. Root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. Root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities. This approach can be seen here:
https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4

History:

Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
capabilities") tried to fix the issue by preventing v3 fscaps to be
written to disk when the root uid would map to the same uid in nested
user namespaces. This led to regressions for various workloads. For
example, see [1]. Ultimately this is a valid use-case we have to support
meaning we had to revert this change in 3b0c2d3eaa83 ("Revert
95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file
capabilities")").

[1]: https://github.com/containers/buildah/issues/3071

Signed-off-by: Serge Hallyn <[email protected]>
Reviewed-by: Andrew G. Morgan <[email protected]>
Tested-by: Christian Brauner <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Tested-by: Giuseppe Scrivano <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>

Changelog:
* fix logic in the case of writing to another task's uid_map
* rename 'ns' to 'map_ns', and make a file_ns local variable
* use /* comments */
* update the CAP_SETFCAP comment in capability.h
* rename parent_unpriv to parent_can_setfcap (and reverse the
logic)
* remove printks
* clarify (i hope) the code comments
* update capability.h comment
* renamed parent_can_setfcap to parent_could_setfcap
* made the check its own disallowed_0_mapping() fn
* moved the check into new_idmap_permitted
* rename disallowed_0_mapping to verify_root_mapping
* change verify_root_mapping to Christian's suggested flow
* correct+clarify comments: parent uid 0 mapping to any
child uid is a problem.
* remove unused lower_first variable.
---
include/linux/user_namespace.h | 3 ++
include/uapi/linux/capability.h | 3 +-
kernel/user_namespace.c | 65 +++++++++++++++++++++++++++++++--
3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
kgid_t group;
struct ns_common ns;
unsigned long flags;
+ /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+ * in its effective capability set at the child ns creation time. */
+ bool parent_could_setfcap;

#ifdef CONFIG_KEYS
/* List of joinable keyrings in this namespace. Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_CONTROL 30

-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+ Map uid=0 into a child user namespace. */

#define CAP_SETFCAP 31

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..9a4b980d695b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
if (!ns)
goto fail_dec;

+ ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
ret = ns_alloc_inum(&ns->ns);
if (ret)
goto fail_free;
@@ -841,6 +842,60 @@ static int sort_idmaps(struct uid_gid_map *map)
return 0;
}

+/**
+ * verify_root_map() - check the uid 0 mapping
+ * @file: idmapping file
+ * @map_ns: user namespace of the target process
+ * @new_map: requested idmap
+ *
+ * If a process requests mapping parent uid 0 into the new ns, verify that the
+ * process writing the map had the CAP_SETFCAP capability as the target process
+ * will be able to write fscaps that are valid in ancestor user namespaces.
+ *
+ * Return: true if the mapping is allowed, false if not.
+ */
+static bool verify_root_map(const struct file *file,
+ struct user_namespace *map_ns,
+ struct uid_gid_map *new_map)
+{
+ int idx;
+ const struct user_namespace *file_ns = file->f_cred->user_ns;
+ struct uid_gid_extent *extent0 = NULL;
+
+ for (idx = 0; idx < new_map->nr_extents; idx++) {
+ if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+ extent0 = &new_map->extent[idx];
+ else
+ extent0 = &new_map->forward[idx];
+ if (extent0->lower_first == 0)
+ break;
+
+ extent0 = NULL;
+ }
+
+ if (!extent0)
+ return true;
+
+ if (map_ns == file_ns) {
+ /* The process unshared its ns and is writing to its own
+ * /proc/self/uid_map. User already has full capabilites in
+ * the new namespace. Verify that the parent had CAP_SETFCAP
+ * when it unshared.
+ * */
+ if (!file_ns->parent_could_setfcap)
+ return false;
+ } else {
+ /* Process p1 is writing to uid_map of p2, who is in a child
+ * user namespace to p1's. Verify that the opener of the map
+ * file has CAP_SETFCAP against the parent of the new map
+ * namespace */
+ if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+ return false;
+ }
+
+ return true;
+}
+
static ssize_t map_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos,
int cap_setid,
@@ -848,7 +903,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
struct uid_gid_map *parent_map)
{
struct seq_file *seq = file->private_data;
- struct user_namespace *ns = seq->private;
+ struct user_namespace *map_ns = seq->private;
struct uid_gid_map new_map;
unsigned idx;
struct uid_gid_extent extent;
@@ -895,7 +950,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
/*
* Adjusting namespace settings requires capabilities on the target.
*/
- if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+ if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
goto out;

/* Parse the user data */
@@ -965,7 +1020,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,

ret = -EPERM;
/* Validate the user is allowed to use user id's mapped to. */
- if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+ if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
goto out;

ret = -EPERM;
@@ -1086,6 +1141,10 @@ static bool new_idmap_permitted(const struct file *file,
struct uid_gid_map *new_map)
{
const struct cred *cred = file->f_cred;
+
+ if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
+ return false;
+
/* Don't allow mappings that would allow anything that wouldn't
* be allowed without the establishment of unprivileged mappings.
*/
--
2.25.1

2021-04-20 16:50:56

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

On Tue, Apr 20, 2021 at 08:43:34AM -0500, Serge Hallyn wrote:
> cap_setfcap is required to create file capabilities.
>
> Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
> process running as uid 0 but without cap_setfcap is able to work around
> this as follows: unshare a new user namespace which maps parent uid 0
> into the child namespace. While this task will not have new
> capabilities against the parent namespace, there is a loophole due to
> the way namespaced file capabilities are represented as xattrs. File
> capabilities valid in userns 1 are distinguished from file capabilities
> valid in userns 2 by the kuid which underlies uid 0. Therefore the
> restricted root process can unshare a new self-mapping namespace, add a
> namespaced file capability onto a file, then use that file capability in
> the parent namespace.
>
> To prevent that, do not allow mapping parent uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
>
> As a further wrinkle: a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid. In this
> case we do not have the credential from before unshare, which was
> potentially more restricted. So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP. Then we can use that
> during map_write().
>
> With this patch:
>
> 1. Unprivileged user can still unshare -Ur
>
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
>
> 2. Root user can still unshare -Ur
>
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
>
> 3. Root user without CAP_SETFCAP cannot unshare -Ur:
>
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
>
> Note: an alternative solution would be to allow uid 0 mappings by
> processes without CAP_SETFCAP, but to prevent such a namespace from
> writing any file capabilities. This approach can be seen here:
> https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
>
> History:
>
> Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
> capabilities") tried to fix the issue by preventing v3 fscaps to be
> written to disk when the root uid would map to the same uid in nested
> user namespaces. This led to regressions for various workloads. For
> example, see [1]. Ultimately this is a valid use-case we have to support
> meaning we had to revert this change in 3b0c2d3eaa83 ("Revert
> 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file
> capabilities")").
>
> [1]: https://github.com/containers/buildah/issues/3071
>
> Signed-off-by: Serge Hallyn <[email protected]>
> Reviewed-by: Andrew G. Morgan <[email protected]>
> Tested-by: Christian Brauner <[email protected]>
> Reviewed-by: Christian Brauner <[email protected]>
> Tested-by: Giuseppe Scrivano <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>

If there's no objections then Linus can probably just pick up the single
patch here directly:
https://lore.kernel.org/lkml/[email protected]

I'm not sure it's worth waiting and releasing another kernel with this
bug. This tigthens the semantics nicely and makes for a simple check at
userns creation time instead of repeatedly checking at setxattr(). With
all the testing done we can be quite confident the risk of regressions
is way lower than the old patch and even if we see one I think this
version of the fix is actually worth the risk.

Christian

2021-04-20 17:35:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

On Tue, Apr 20, 2021 at 9:47 AM Christian Brauner
<[email protected]> wrote:
>
> If there's no objections then Linus can probably just pick up the single
> patch here directly:

I'm ok with that assuming I don't hear any last-minute concerns..

I'll plan on appling it later today, so anybody with concerns please
holler quickly.

I don't want to leave it to much later in the week, and I may or may
not be functional tomorrow (getting my second vaccine shot, some
people react more strongly to it, so I'm leaving the possibility open
of not getting out of bed ;)

Linus

2021-04-21 09:58:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

On Tue, Apr 20, 2021 at 10:33:54AM -0700, Linus Torvalds wrote:
> On Tue, Apr 20, 2021 at 9:47 AM Christian Brauner
> <[email protected]> wrote:
> >
> > If there's no objections then Linus can probably just pick up the single
> > patch here directly:
>
> I'm ok with that assuming I don't hear any last-minute concerns..
>
> I'll plan on appling it later today, so anybody with concerns please
> holler quickly.
>
> I don't want to leave it to much later in the week, and I may or may
> not be functional tomorrow (getting my second vaccine shot, some
> people react more strongly to it, so I'm leaving the possibility open
> of not getting out of bed ;)

We're a few short months away from that pleasure here... :)
At least the general plan is to still have 75% of the population
vaccinated by end of July (That's assuming supply holds up.).
And we're probably looking at another booster shot soon enough. With the
EU having cancelled J&J and AZ contracts for next year that's going to
be interesting as well. Let's hope they won't fumble it again.

Christian

2021-04-22 01:21:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

"Serge E. Hallyn" <[email protected]> writes:

> +/**
> + * verify_root_map() - check the uid 0 mapping
> + * @file: idmapping file
> + * @map_ns: user namespace of the target process
> + * @new_map: requested idmap
> + *
> + * If a process requests mapping parent uid 0 into the new ns, verify that the
> + * process writing the map had the CAP_SETFCAP capability as the target process
> + * will be able to write fscaps that are valid in ancestor user namespaces.
> + *
> + * Return: true if the mapping is allowed, false if not.
> + */
> +static bool verify_root_map(const struct file *file,
> + struct user_namespace *map_ns,
> + struct uid_gid_map *new_map)
> +{
> + int idx;
> + const struct user_namespace *file_ns = file->f_cred->user_ns;
> + struct uid_gid_extent *extent0 = NULL;
> +
> + for (idx = 0; idx < new_map->nr_extents; idx++) {
> + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> + extent0 = &new_map->extent[idx];
> + else
> + extent0 = &new_map->forward[idx];
> + if (extent0->lower_first == 0)
> + break;
> +
> + extent0 = NULL;
> + }
> +
> + if (!extent0)
> + return true;
> +
> + if (map_ns == file_ns) {
> + /* The process unshared its ns and is writing to its own
> + * /proc/self/uid_map. User already has full capabilites in
> + * the new namespace. Verify that the parent had CAP_SETFCAP
> + * when it unshared.
> + * */
> + if (!file_ns->parent_could_setfcap)
> + return false;
> + } else {
> + /* Process p1 is writing to uid_map of p2, who is in a child
> + * user namespace to p1's. Verify that the opener of the map
> + * file has CAP_SETFCAP against the parent of the new map
> + * namespace */
> + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> + return false;
> + }

Is there any reason this permission check is not simply:

return map_ns->parent_could_setfcap ||
file_ns_capable(file, map_ns->parent, CAP_SETFCAP);

That is why don't we allow any mapping (that is otherwise valid) in user
namespaces whose creator had the permission to call CAP_SETFCAP?

Why limit the case of using the creators permissions to only the case of
mapping just a single uid (that happens to be the current euid) in the
user namespace?

I don't see any safety reasons for the map_ns == file_ns test.



Is the file_ns_capable check for CAP_SETFCAP actually needed? AKA could
the permission check be simplified to:

return map_ns->parent_could_setfcap;

That would be a much easier rule to explain to people.

I seem to remember distributions at least trying to make newuidmap have
just CAP_SETUID and newgidmap have just CAP_SETGID. Such a simplified
check would facilitate that.

Eric

2021-04-22 13:22:11

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

On Wed, Apr 21, 2021 at 02:16:34PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> > +/**
> > + * verify_root_map() - check the uid 0 mapping
> > + * @file: idmapping file
> > + * @map_ns: user namespace of the target process
> > + * @new_map: requested idmap
> > + *
> > + * If a process requests mapping parent uid 0 into the new ns, verify that the
> > + * process writing the map had the CAP_SETFCAP capability as the target process
> > + * will be able to write fscaps that are valid in ancestor user namespaces.
> > + *
> > + * Return: true if the mapping is allowed, false if not.
> > + */
> > +static bool verify_root_map(const struct file *file,
> > + struct user_namespace *map_ns,
> > + struct uid_gid_map *new_map)
> > +{
> > + int idx;
> > + const struct user_namespace *file_ns = file->f_cred->user_ns;
> > + struct uid_gid_extent *extent0 = NULL;
> > +
> > + for (idx = 0; idx < new_map->nr_extents; idx++) {
> > + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > + extent0 = &new_map->extent[idx];
> > + else
> > + extent0 = &new_map->forward[idx];
> > + if (extent0->lower_first == 0)
> > + break;
> > +
> > + extent0 = NULL;
> > + }
> > +
> > + if (!extent0)
> > + return true;
> > +
> > + if (map_ns == file_ns) {
> > + /* The process unshared its ns and is writing to its own
> > + * /proc/self/uid_map. User already has full capabilites in
> > + * the new namespace. Verify that the parent had CAP_SETFCAP
> > + * when it unshared.
> > + * */
> > + if (!file_ns->parent_could_setfcap)
> > + return false;
> > + } else {
> > + /* Process p1 is writing to uid_map of p2, who is in a child
> > + * user namespace to p1's. Verify that the opener of the map
> > + * file has CAP_SETFCAP against the parent of the new map
> > + * namespace */
> > + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > + return false;
> > + }
>
> Is there any reason this permission check is not simply:
>
> return map_ns->parent_could_setfcap ||
> file_ns_capable(file, map_ns->parent, CAP_SETFCAP);
>
> That is why don't we allow any mapping (that is otherwise valid) in user
> namespaces whose creator had the permission to call CAP_SETFCAP?

Well I guess the question is exactly who has to have the privilege.

If task X does the unshare and its sibling task Y writes the mapping
(technically, opens the map file), do we want to say that it suffices
for *either* X or Y to have CAP_SETFCAP? I was thinking we want to
require task Y to have the privilege. Task X having it would not
suffice.

> Why limit the case of using the creators permissions to only the case of
> mapping just a single uid (that happens to be the current euid) in the
> user namespace?
>
> I don't see any safety reasons for the map_ns == file_ns test.
>
>
>
> Is the file_ns_capable check for CAP_SETFCAP actually needed? AKA could
> the permission check be simplified to:
>
> return map_ns->parent_could_setfcap;

Currently uid 1000 can create a new user namespace, then have a fully privileged
root process in uid 1000's peer userns write a 0 mapping. With just this
check, that would not be possible.

> That would be a much easier rule to explain to people.
>
> I seem to remember distributions at least trying to make newuidmap have
> just CAP_SETUID and newgidmap have just CAP_SETGID. Such a simplified
> check would facilitate that.

Yes they would have to add an additional CAP_SETFCAP.