From: Michael Weiß
Date: Mon, 14 Aug 2023 16:26:12 +0200
Subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup
 device guard
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Message-Id: <20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de>
References: <20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de>
In-Reply-To: <20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de>
To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
 Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo,
 Jiri Olsa, Quentin Monnet, Alexander Viro
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, gyroidos@aisec.fraunhofer.de,
 Michael Weiß
X-Mailer: b4 0.12.3
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod anymore.
Thus, user space applications may map devices on different locations in
the file system by using mknod() inside the container.

One use case, which we also rely on in GyroidOS, is running virsh for
VMs inside an unprivileged container. virsh creates device nodes, e.g.,
"/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails in a
non-initial userns even if a cgroup device whitelist with the
corresponding major and minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre-populated device nodes under /dev are not
sufficient.

To circumvent this limitation, we allow mknod() in fs/namei.c if a bpf
cgroup device guard is enabled for the current task, as reported by
devcgroup_task_is_guarded(), and check CAP_MKNOD in the current user
namespace with ns_capable() instead of the global CAP_MKNOD.

To avoid unusable device nodes on file systems mounted in a non-initial
user namespace, may_open_dev() ignores SB_I_NODEV for cgroup device
guarded tasks.

Signed-off-by: Michael Weiß
---
 fs/namei.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
 
 bool may_open_dev(const struct path *path)
 {
+	if (devcgroup_task_is_guarded(current))
+		return !(path->mnt->mnt_flags & MNT_NODEV);
+
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
-		return -EPERM;
+	/*
+	 * In case of a device cgroup restriction allow mknod in user
+	 * namespace. Otherwise just check global capability; thus,
+	 * mknod is also disabled for user namespace other than the
+	 * initial one.
+	 */
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+		if (devcgroup_task_is_guarded(current)) {
+			if (!ns_capable(current_user_ns(), CAP_MKNOD))
+				return -EPERM;
+		} else if (!capable(CAP_MKNOD))
+			return -EPERM;
+	}
 
 	if (!dir->i_op->mknod)
 		return -EPERM;
-- 
2.30.2
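
For review purposes, the two decisions changed in vfs_mknod() and
may_open_dev() can be modelled in plain user-space C. This is only a
rough sketch and not part of the patch: struct task_model, struct
mount_model and both model_*() functions are hypothetical names, and
the booleans merely stand in for the results of
devcgroup_task_is_guarded(), capable(), ns_capable() and the
MNT_NODEV / SB_I_NODEV flag tests.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the kernel reads from `current`. */
struct task_model {
	bool devcg_guarded;     /* devcgroup_task_is_guarded(current) */
	bool cap_mknod_init_ns; /* capable(CAP_MKNOD) */
	bool cap_mknod_own_ns;  /* ns_capable(current_user_ns(), CAP_MKNOD) */
};

/* Hypothetical stand-in for the flags may_open_dev() inspects. */
struct mount_model {
	bool mnt_nodev;  /* MNT_NODEV set in path->mnt->mnt_flags */
	bool sb_i_nodev; /* SB_I_NODEV set in mnt_sb->s_iflags */
};

/*
 * Models the patched check for char/block device nodes in vfs_mknod():
 * a guarded task only needs CAP_MKNOD in its own user namespace,
 * any other task still needs the global capability.
 */
static int model_mknod_dev_check(const struct task_model *t)
{
	if (t->devcg_guarded)
		return t->cap_mknod_own_ns ? 0 : -EPERM;

	return t->cap_mknod_init_ns ? 0 : -EPERM;
}

/*
 * Models the patched may_open_dev(): SB_I_NODEV is ignored for guarded
 * tasks, while MNT_NODEV is honoured in both cases.
 */
static bool model_may_open_dev(const struct task_model *t,
			       const struct mount_model *m)
{
	if (t->devcg_guarded)
		return !m->mnt_nodev;

	return !m->mnt_nodev && !m->sb_i_nodev;
}
```

In this model, a task with CAP_MKNOD only in its own user namespace
passes the mknod check when it is guarded and gets -EPERM when it is
not. The same holds for opening a device node on a superblock that
carries SB_I_NODEV, as long as the mount itself is not MNT_NODEV.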