Date: Tue, 20 Apr 2021 08:43:34 -0500
From: "Serge E. Hallyn"
To: Christian Brauner
Cc: "Serge E. Hallyn", lkml, Linus Torvalds, Kees Cook, "Andrew G. Morgan",
    "Eric W. Biederman", Greg Kroah-Hartman, security@kernel.org,
    Tycho Andersen, Andy Lutomirski
Subject: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0
Message-ID: <20210420134334.GA11582@mail.hallyn.com>
In-Reply-To: <20210420083129.exyn7ptahx2fg72e@wittgenstein>

cap_setfcap is required to create file capabilities.

Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a process
running as uid 0 but without cap_setfcap is able to work around this as
follows: unshare a new user namespace which maps parent uid 0 into the child
namespace.

While this task will not have new capabilities against the parent namespace,
there is a loophole due to the way namespaced file capabilities are
represented as xattrs. File capabilities valid in userns 1 are distinguished
from file capabilities valid in userns 2 by the kuid which underlies uid 0.
Therefore the restricted root process can unshare a new self-mapping
namespace, add a namespaced file capability onto a file, then use that file
capability in the parent namespace.

To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

As a further wrinkle: a task can unshare its user namespace, then open its
uid_map file itself, and map (only) its own uid.
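To make the loophole concrete, here is a simplified user-space model in C. It is a sketch under stated assumptions, not kernel code: the model_ns type and the fscap_rootid_written_in() and fscap_honored_in() helpers are hypothetical, each namespace tracks only the kuid underlying its uid 0, and, following our reading of the v3 fscap handling in security/commoncap.c, a namespaced file capability is treated as honored in a namespace if its recorded root kuid is uid 0 there or in any ancestor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model of a user namespace (NOT the
 * kernel's struct user_namespace): it only records its parent and the
 * kernel uid (kuid) that uid 0 inside it maps to. */
struct model_ns {
	const struct model_ns *parent; /* NULL for the initial namespace */
	long root_kuid;                /* kuid underlying uid 0 in this ns */
};

/* Model: a namespaced (v3) file capability written by root in ns is
 * tagged with the kuid underlying uid 0 of ns. */
static long fscap_rootid_written_in(const struct model_ns *ns)
{
	assert(ns != NULL);
	return ns->root_kuid;
}

/* Model: the file capability is honored in ns if its rootid is uid 0
 * in ns or in any ancestor of ns. */
static bool fscap_honored_in(long rootid, const struct model_ns *ns)
{
	for (; ns != NULL; ns = ns->parent) {
		if (ns->root_kuid == rootid)
			return true;
	}
	return false;
}
```

With a self-mapping child (uid 0 -> kuid 0), a capability written in the child carries rootid 0 and is therefore also honored in the initial namespace, which is the loophole. With a shifted mapping (uid 0 -> kuid 100000), it is honored only in the child.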

When a task unshares and then writes its own uid_map, we do not have the
credential from before unshare, which was potentially more restricted. So,
when creating a user namespace, we record whether the creator had
CAP_SETFCAP. Then we can use that during map_write().

With this patch:

1. Unprivileged user can still unshare -Ur

   ubuntu@caps:~$ unshare -Ur
   root@caps:~# logout

2. Root user can still unshare -Ur

   ubuntu@caps:~$ sudo bash
   root@caps:/home/ubuntu# unshare -Ur
   root@caps:/home/ubuntu# logout

3. Root user without CAP_SETFCAP cannot unshare -Ur:

   root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
   root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
   unable to set CAP_SETFCAP effective capability: Operation not permitted
   root@caps:/home/ubuntu# unshare -Ur
   unshare: write failed /proc/self/uid_map: Operation not permitted

Note: an alternative solution would be to allow uid 0 mappings by processes
without CAP_SETFCAP, but to prevent such a namespace from writing any file
capabilities. This approach can be seen here:
    https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4

History:

Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
capabilities") tried to fix the issue by preventing v3 fscaps from being
written to disk when the root uid would map to the same uid in nested user
namespaces. This led to regressions for various workloads; for example, see
[1]. Ultimately this is a valid use case that we have to support, which meant
we had to revert this change in 3b0c2d3eaa83 ("Revert 95ebabde382c
("capabilities: Don't allow writing ambiguous v3 file capabilities")").

[1]: https://github.com/containers/buildah/issues/3071

Signed-off-by: Serge Hallyn
Reviewed-by: Andrew G. Morgan
Tested-by: Christian Brauner
Reviewed-by: Christian Brauner
Tested-by: Giuseppe Scrivano
Cc: "Eric W. Biederman"

Changelog:
   * fix logic in the case of writing to another task's uid_map
   * rename 'ns' to 'map_ns', and make a file_ns local variable
   * use /* comments */
   * update the CAP_SETFCAP comment in capability.h
   * rename parent_unpriv to parent_can_setfcap (and reverse the logic)
   * remove printks
   * clarify (I hope) the code comments
   * update capability.h comment
   * renamed parent_can_setfcap to parent_could_setfcap
   * made the check its own disallowed_0_mapping() fn
   * moved the check into new_idmap_permitted
   * rename disallowed_0_mapping to verify_root_mapping
   * change verify_root_mapping to Christian's suggested flow
   * correct+clarify comments: parent uid 0 mapping to any child uid is a
     problem.
   * remove unused lower_first variable.
---
 include/linux/user_namespace.h  |  3 ++
 include/uapi/linux/capability.h |  3 +-
 kernel/user_namespace.c         | 65 +++++++++++++++++++++++++++++++--
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
 	kgid_t			group;
 	struct ns_common	ns;
 	unsigned long		flags;
+	/* parent_could_setfcap: true if the creator of this ns had CAP_SETFCAP
+	 * in its effective capability set at the child ns creation time. */
+	bool			parent_could_setfcap;
 
 #ifdef CONFIG_KEYS
 	/* List of joinable keyrings in this namespace.  Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_CONTROL    30
 
-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+   Map uid=0 into a child user namespace. */
 
 #define CAP_SETFCAP	     31
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..9a4b980d695b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
 	if (!ns)
 		goto fail_dec;
 
+	ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
 	ret = ns_alloc_inum(&ns->ns);
 	if (ret)
 		goto fail_free;
@@ -841,6 +842,60 @@ static int sort_idmaps(struct uid_gid_map *map)
 	return 0;
 }
 
+/**
+ * verify_root_map() - check the uid 0 mapping
+ * @file: idmapping file
+ * @map_ns: user namespace of the target process
+ * @new_map: requested idmap
+ *
+ * If a process requests mapping parent uid 0 into the new ns, verify that the
+ * process writing the map had the CAP_SETFCAP capability as the target process
+ * will be able to write fscaps that are valid in ancestor user namespaces.
+ *
+ * Return: true if the mapping is allowed, false if not.
+ */
+static bool verify_root_map(const struct file *file,
+			    struct user_namespace *map_ns,
+			    struct uid_gid_map *new_map)
+{
+	int idx;
+	const struct user_namespace *file_ns = file->f_cred->user_ns;
+	struct uid_gid_extent *extent0 = NULL;
+
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			extent0 = &new_map->extent[idx];
+		else
+			extent0 = &new_map->forward[idx];
+		if (extent0->lower_first == 0)
+			break;
+
+		extent0 = NULL;
+	}
+
+	if (!extent0)
+		return true;
+
+	if (map_ns == file_ns) {
+		/* The process unshared its ns and is writing to its own
+		 * /proc/self/uid_map.  User already has full capabilities in
+		 * the new namespace.  Verify that the parent had CAP_SETFCAP
+		 * when it unshared.
+		 */
+		if (!file_ns->parent_could_setfcap)
+			return false;
+	} else {
+		/* Process p1 is writing to uid_map of p2, who is in a child
+		 * user namespace to p1's.  Verify that the opener of the map
+		 * file has CAP_SETFCAP against the parent of the new map
+		 * namespace */
+		if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+			return false;
+	}
+
+	return true;
+}
+
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
@@ -848,7 +903,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 			 struct uid_gid_map *parent_map)
 {
 	struct seq_file *seq = file->private_data;
-	struct user_namespace *ns = seq->private;
+	struct user_namespace *map_ns = seq->private;
 	struct uid_gid_map new_map;
 	unsigned idx;
 	struct uid_gid_extent extent;
@@ -895,7 +950,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	/*
 	 * Adjusting namespace settings requires capabilities on the target.
 	 */
-	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+	if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
 		goto out;
 
 	/* Parse the user data */
@@ -965,7 +1020,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 
 	ret = -EPERM;
 	/* Validate the user is allowed to use user id's mapped to. */
-	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+	if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
 		goto out;
 
 	ret = -EPERM;
@@ -1086,6 +1141,10 @@ static bool new_idmap_permitted(const struct file *file,
 				struct uid_gid_map *new_map)
 {
 	const struct cred *cred = file->f_cred;
+
+	if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
+		return false;
+
 	/* Don't allow mappings that would allow anything that wouldn't
 	 * be allowed without the establishment of unprivileged mappings.
 	 */
-- 
2.25.1
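
The decision that verify_root_map() adds to new_idmap_permitted() can also be modeled in user space. This is a hedged sketch, not the kernel function: model_extent, model_writer and model_verify_root_map() are hypothetical names, and the three booleans stand in for the facts the patch consults (whether the map file was opened from map_ns itself, the parent_could_setfcap flag recorded when the namespace was created, and file_ns_capable(file, map_ns->parent, CAP_SETFCAP)). As in the patch, only extents whose lower_first is 0 trigger the extra check, and in the patch the check runs only for uid_map writes (cap_setid == CAP_SETUID).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct uid_gid_extent: "first" is the id inside
 * the target namespace, "lower_first" the id in its parent namespace. */
struct model_extent {
	unsigned int first;
	unsigned int lower_first;
	unsigned int count;
};

/* Hypothetical summary of the facts verify_root_map() looks at. */
struct model_writer {
	bool writes_own_map;        /* map_ns == file_ns */
	bool parent_could_setfcap;  /* recorded when the ns was created */
	bool setfcap_in_map_parent; /* CAP_SETFCAP against map_ns->parent */
};

/* Returns true if the requested uid map may be written. */
static bool model_verify_root_map(const struct model_extent *extents,
				  size_t nr_extents,
				  const struct model_writer *w)
{
	bool maps_parent_root = false;

	assert(w != NULL);
	assert(extents != NULL || nr_extents == 0);

	/* Does any extent start at parent uid 0? */
	for (size_t i = 0; i < nr_extents; i++) {
		if (extents[i].lower_first == 0) {
			maps_parent_root = true;
			break;
		}
	}
	if (!maps_parent_root)
		return true; /* no parent uid 0 mapping: nothing extra to check */

	if (w->writes_own_map)
		return w->parent_could_setfcap; /* unshared, then wrote own map */

	return w->setfcap_in_map_parent; /* another task wrote the map */
}
```

For example, a task that unshared without CAP_SETFCAP and then writes "0 0 1" to its own uid_map is refused, matching example 3 in the commit message, while a shifted map such as "0 100000 65536" is still allowed.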