Received: by 2002:a05:6a10:17d3:0:0:0:0 with SMTP id hz19csp661738pxb; Fri, 16 Apr 2021 15:10:06 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwFPmIh/hDC8SCXUX6+2Y2FbhWpBHI5HTKdXRsD3IxjcY4TtkTQ079Jv2xntuYbSZxQCVey X-Received: by 2002:a05:6402:142:: with SMTP id s2mr12404537edu.2.1618611005898; Fri, 16 Apr 2021 15:10:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1618611005; cv=none; d=google.com; s=arc-20160816; b=sCvq9dkC3wKFvsLYUjcipwRjWjDVPcPgrucN8ioU4eRfQ+eDKJZuGzIssgDChQjX1t PAyRYmE6d1arVZvCn4e0YLmBjspt0IVM91q4E+gtQVLo4nhdg902Tmw2iMgNjOBoV+r2 uK/pMf1SoX9XcLxgOvbhNA9SwT9iQxE46IGXLusppm1gNgEjgzUQpnkjS8b2XE2ZfyUH mbHdtS9hVSd7Vyg2/0pdyfdmGp8Movj056g0IF17AmnEpCNKUH+qRL8SaRCiGo1XGIgQ 86i2RIF68PlON+maoDimr3Y2+UY7DzmzpQpBcyJFw9GwXFyNk02j/ZX4YeH7PnJ8fWCA BgwA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=ju6/Qu/q/0bGr5IFdgDfvCn7ZFTKw6PKuWIzldr/SXw=; b=hME22YKmoJMyKPi3M8JaEnVMFg7TjgNNwGYCvnqTHgDzPLhVsWu7Ox7yqiBZag87yf QXG+kZA22RvdfFD1GLV+D1e4223OxoiWS1xScjvTxfsxdfOZXo3hAR2Em3uKr6AqXAkh Kgsux3HgUry3fz46MGIelixJD6n0DEjoiMVZQlskpHHMEwm7I+9ZKK7Uz1Zx/HwISNjL wVYOdTPtZbKdOe/ATe94+ajLOja1YEeZk6GVLBW92zs9gqI8sOhHEvWmtZGzJD57AjSV sTWGti6mkaPLSmHdjq0TjXD0TqG4WZi/xle4qDKuxU+8aP14hE+6k30MnZQFBUCY+T33 bN4Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id h15si5678623ejf.2.2021.04.16.15.09.41; Fri, 16 Apr 2021 15:10:05 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242810AbhDPVfU (ORCPT + 99 others); Fri, 16 Apr 2021 17:35:20 -0400 Received: from mail.hallyn.com ([178.63.66.53]:42506 "EHLO mail.hallyn.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235714AbhDPVfU (ORCPT ); Fri, 16 Apr 2021 17:35:20 -0400 Received: by mail.hallyn.com (Postfix, from userid 1001) id CE2A2471; Fri, 16 Apr 2021 16:34:53 -0500 (CDT) Date: Fri, 16 Apr 2021 16:34:53 -0500 From: "Serge E. Hallyn" To: Christian Brauner Cc: "Serge E. Hallyn" , lkml , Linus Torvalds , Kees Cook , "Andrew G. Morgan" , "Eric W. Biederman" , Greg Kroah-Hartman , security@kernel.org, Tycho Andersen , Andy Lutomirski Subject: Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3) Message-ID: <20210416213453.GA29094@mail.hallyn.com> References: <20210416045851.GA13811@mail.hallyn.com> <20210416150501.zam55gschpn2w56i@wittgenstein> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210416150501.zam55gschpn2w56i@wittgenstein> User-Agent: Mutt/1.9.4 (2018-02-28) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote: > On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote: > > (Eric - this patch (v3) is a cleaned up version of the previous approach. > > v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 > > and is the approach you suggested. I can send it also as a separate patch > > if you like) > > > > A process running as uid 0 but without cap_setfcap currently can simply > > unshare a new user namespace with uid 0 mapped to 0. While this task > > will not have new capabilities against the parent namespace, there is > > a loophole due to the way namespaced file capabilities work. File > > capabilities valid in userns 1 are distinguised from file capabilities > > valid in userns 2 by the kuid which underlies uid 0. Therefore > > the restricted root process can unshare a new self-mapping namespace, > > add a namespaced file capability onto a file, then use that file > > capability in the parent namespace. > > > > To prevent that, do not allow mapping uid 0 if the process which > > opened the uid_map file does not have CAP_SETFCAP, which is the capability > > for setting file capabilities. > > > > A further wrinkle: a task can unshare its user namespace, then > > open its uid_map file itself, and map (only) its own uid. In this > > case we do not have the credential from before unshare, which was > > potentially more restricted. So, when creating a user namespace, we > > record whether the creator had CAP_SETFCAP. Then we can use that > > during map_write(). > > > > With this patch: > > > > 1. unprivileged user can still unshare -Ur > > > > ubuntu@caps:~$ unshare -Ur > > root@caps:~# logout > > > > 2. root user can still unshare -Ur > > > > ubuntu@caps:~$ sudo bash > > root@caps:/home/ubuntu# unshare -Ur > > root@caps:/home/ubuntu# logout > > > > 3. root user without CAP_SETFCAP cannot unshare -Ur: > > > > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap -- > > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap > > unable to set CAP_SETFCAP effective capability: Operation not permitted > > root@caps:/home/ubuntu# unshare -Ur > > unshare: write failed /proc/self/uid_map: Operation not permitted > > > > Signed-off-by: Serge Hallyn > > > > Changelog: > > * fix logic in the case of writing to another task's uid_map > > * rename 'ns' to 'map_ns', and make a file_ns local variable > > * use /* comments */ > > * update the CAP_SETFCAP comment in capability.h > > * rename parent_unpriv to parent_can_setfcap (and reverse the > > logic) > > * remove printks > > * clarify (i hope) the code comments > > * update capability.h comment > > * renamed parent_can_setfcap to parent_could_setfcap > > * made the check its own disallowed_0_mapping() fn > > * moved the check into new_idmap_permitted > > --- > > Thank you for working on this fix! > > I do prefer your approach of doing the check at user namespace creation > time instead of moving it into the setxattr() codepath. > > Let me reiterate that the ability to write through fscaps is a valid > usecase and this should continue to work but that for locked down user > namespace as Andrew wants to use them your patch provides a clean > solution. > We've are using identity mappings in quite a few scenarios partially > when performing tests but also to write through fscaps. > We also had reports of users that use identity mappings. They create > their rootfs by running image extraction in an identity mapped userns > where fscaps are written through. > Podman has use-cases for this feature as well and has been affected by > the regression of the first fix. Thanks for reviewing. I'm not sure what your point above is, so just to make sure - the alternative implementation also does allow fscaps for cases where root uid is remapped, only disallowing it if it would violate the ancestor's lack of cap_setfcap. > > include/linux/user_namespace.h | 3 ++ > > include/uapi/linux/capability.h | 3 +- > > kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++-- > > 3 files changed, 63 insertions(+), 4 deletions(-) > > > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h > > index 64cf8ebdc4ec..f6c5f784be5a 100644 > > --- a/include/linux/user_namespace.h > > +++ b/include/linux/user_namespace.h > > @@ -63,6 +63,9 @@ struct user_namespace { > > kgid_t group; > > struct ns_common ns; > > unsigned long flags; > > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP > > + * in its effective capability set at the child ns creation time. */ > > + bool parent_could_setfcap; > > > > #ifdef CONFIG_KEYS > > /* List of joinable keyrings in this namespace. Modification access of > > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h > > index c6ca33034147..2ddb4226cd23 100644 > > --- a/include/uapi/linux/capability.h > > +++ b/include/uapi/linux/capability.h > > @@ -335,7 +335,8 @@ struct vfs_ns_cap_data { > > > > #define CAP_AUDIT_CONTROL 30 > > > > -/* Set or remove capabilities on files */ > > +/* Set or remove capabilities on files. > > + Map uid=0 into a child user namespace. */ > > > > #define CAP_SETFCAP 31 > > > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > > index af612945a4d0..8c75028a9aae 100644 > > --- a/kernel/user_namespace.c > > +++ b/kernel/user_namespace.c > > @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new) > > if (!ns) > > goto fail_dec; > > > > + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP); > > ret = ns_alloc_inum(&ns->ns); > > if (ret) > > goto fail_free; > > @@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map) > > return 0; > > } > > > > +/* > > + * If mapping uid 0, then file capabilities created by the new namespace will > > + * be effective in the parent namespace. Adding file capabilities requires > > + * CAP_SETFCAP, which the child namespace will have, so creating such a > > + * mapping requires CAP_SETFCAP in the parent namespace. > > + */ > > +static bool disallowed_0_mapping(const struct file *file, > > + struct user_namespace *map_ns, > > + struct uid_gid_map *new_map) > > +{ > > + int idx; > > + bool zeromapping = false; > > + const struct user_namespace *file_ns = file->f_cred->user_ns; > > + > > + for (idx = 0; idx < new_map->nr_extents; idx++) { > > I think having that loop is acceptable here since it's only called once > at map creation time even though the forward array is not yet sorted. > > > + struct uid_gid_extent *e; > > + u32 lower_first; > > + > > + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > > + e = &new_map->extent[idx]; > > + else > > + e = &new_map->forward[idx]; > > + if (e->lower_first == 0) { > > + zeromapping = true; > > + break; > > + } > > + } > > + > > + if (!zeromapping) > > + return false; > > + > > + if (map_ns == file_ns) { > > + /* The user unshared first and is writing to > > + * /proc/self/uid_map. User already has full > > + * capabilites in the new namespace, so verify > > + * that the parent has CAP_SETFCAP. */ > > + if (!file_ns->parent_could_setfcap) > > + return true; > > + } else { > > + /* Process p1 is writing to uid_map of p2, who > > + * is in a child user namespace to p1's. So > > + * we verify that p1 has CAP_SETFCAP to its > > + * own namespace */ > > + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP)) > > + return true; > > + } > > + > > + return false; > > +} > > Maybe we can tweak this a tiny bit to get rid of the "zeromapping"?: > > static bool disallowed_0_mapping(const struct file *file, > struct user_namespace *map_ns, > struct uid_gid_map *new_map) > { > int idx; > const struct user_namespace *file_ns = file->f_cred->user_ns; > struct uid_gid_extent *extent0 = NULL; > > for (idx = 0; idx < new_map->nr_extents; idx++) { > u32 lower_first; > > if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > extent0 = &new_map->extent[idx]; > else > extent0 = &new_map->forward[idx]; > if (extent0->lower_first == 0) > break; > > extent0 = NULL; > } > > if (!extent0) > return false; Feels a little less clear to me, but that's probably just me, so I'll switch it over, thanks. > > if (map_ns == file_ns) { > /* > * The user unshared first and is writing to > * /proc/self/uid_map. User already has full > * capabilites in the new namespace, so verify > * that the parent has CAP_SETFCAP. > */ > if (!file_ns->parent_could_setfcap) > return true; > } else { > /* > * Process p1 is writing to uid_map of p2, who > * is in a child user namespace to p1's. So > * we verify that p1 has CAP_SETFCAP to its > * own namespace. > */ > if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP)) > return true; > } > > return false; > } > > In addition I would think that expressing the logic the other way around > is more legible. I'm not too keen on having negations in function names. > We should probably also tweak the comment a bit and make it kernel-doc > clean: > > /** > * verify_root_map() - check the uid 0 mapping Hm. restrict_root_map() ? "verify" sounds like we should sometimes reject it. > * @file: idmapping file > * @map_ns: user namespace of the target process > * @new_map: requested idmap > * > * If a process requested a mapping for uid 0 onto > * uid 0 verify that the process writing the map had the CAP_SETFCAP > * capability as the target process will be able to > * write fscaps that are valid in ancestor user namespaces. > * > * Return: true if the mapping is allow, false if not. > */ > static bool verify_root_map()