From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Christian Brauner, "Eric W. Biederman", Seth Forshee, Serge Hallyn, Linus Torvalds
Subject: [PATCH 4.19 02/46] Revert "vfs: Allow userns root to call mknod on owned filesystems."
Date: Fri, 28 Dec 2018 12:51:56 +0100
Message-Id: <20181228113125.063988618@linuxfoundation.org>
In-Reply-To: <20181228113124.971620049@linuxfoundation.org>
References: <20181228113124.971620049@linuxfoundation.org>

4.19-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Christian Brauner

commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.

This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.

commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned
filesystems.") enabled mknod() in user namespaces for userns root if
CAP_MKNOD is available. However, these device nodes are useless since
any filesystem mounted from a non-initial user namespace will set the
SB_I_NODEV flag on the filesystem. Now, when a device node is created in
a non-initial user namespace, a call to open() on said device node will
fail due to:

	bool may_open_dev(const struct path *path)
	{
		return !(path->mnt->mnt_flags & MNT_NODEV) &&
			!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
	}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user
namespaces. In particular, it has the consequence that as of the
aforementioned commit open() will be more privileged with respect to
device nodes than mknod(). Before it was the other way around.
Specifically, if mknod() succeeded then it was transparent for any
userspace application that a fatal error must have occurred when open()
failed.

All of this breaks multiple userspace workloads and a widespread
assumption about how to handle mknod().
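
To make the asymmetry concrete, here is a small userspace model of the two
checks involved. It is only a sketch: the struct and function names are
hypothetical and do not exist in the kernel, and the booleans stand in for
the real capability and flag tests. It assumes that capable(CAP_MKNOD)
tests the capability against the initial user namespace, while the
reverted code tested it against the user namespace owning the superblock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
struct model_caller {
	bool cap_mknod_in_init_userns; /* what capable(CAP_MKNOD) tests */
	bool cap_mknod_in_sb_userns;   /* what ns_capable(sb->s_user_ns, CAP_MKNOD) tests */
};

struct model_mount {
	bool mnt_nodev;  /* stands in for MNT_NODEV in mnt_flags */
	bool sb_i_nodev; /* stands in for SB_I_NODEV in s_iflags */
};

/* Device-node check in vfs_mknod() with commit 55956b59df33 applied. */
bool model_mknod_with_55956b59(const struct model_caller *c)
{
	return c->cap_mknod_in_sb_userns;
}

/* Device-node check in vfs_mknod() after the revert. */
bool model_mknod_after_revert(const struct model_caller *c)
{
	return c->cap_mknod_in_init_userns;
}

/* Mirrors the may_open_dev() logic quoted in the changelog. */
bool model_may_open_dev(const struct model_mount *m)
{
	return !m->mnt_nodev && !m->sb_i_nodev;
}

/* A node is "partially functional" if it can be created but not opened. */
bool model_partial_node(bool mknod_ok, bool open_ok)
{
	return mknod_ok && !open_ok;
}

/* Walk through the case the changelog describes. */
void model_selfcheck(void)
{
	/* userns root on a filesystem mounted from its own user namespace */
	struct model_caller userns_root = {
		.cap_mknod_in_init_userns = false,
		.cap_mknod_in_sb_userns = true,
	};
	struct model_mount userns_fs = { .mnt_nodev = false, .sb_i_nodev = true };
	bool open_ok = model_may_open_dev(&userns_fs);

	/* With 55956b59df33: mknod() succeeds, open() fails. */
	assert(model_partial_node(model_mknod_with_55956b59(&userns_root), open_ok));
	/* After the revert: mknod() already fails, so no partial node. */
	assert(!model_partial_node(model_mknod_after_revert(&userns_root), open_ok));
}
```

In this model the revert does not make device nodes usable in a
non-initial user namespace. It only moves the failure back to mknod(),
where userspace already expects it.
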
Basically, all container runtimes and systemd live by the slogan "ask
for forgiveness not permission" when running user namespace workloads.
For mknod() the assumption is that if the syscall succeeds the device
nodes are usable irrespective of whether it succeeds in a non-initial
user namespace or not. This logic was chosen explicitly to allow for the
glorious day when mknod() will actually be able to create fully
functional device nodes in user namespaces.

A specific problem people are already running into when running 4.18 rc
kernels is failing systemd services. For any distro that is run in a
container, systemd services started with the PrivateDevices= property
set will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any
filesystem mounted with MS_NODEV set will allow mknod() to succeed but
will not allow open() to succeed. The difference from the case here is
that the MS_NODEV case is transparent to userspace since it is an
explicitly set mount option, while the SB_I_NODEV case is an implicit
property enforced by the kernel and hence opaque to userspace.

[1]: https://github.com/systemd/systemd/pull/9483

Signed-off-by: Christian Brauner
Cc: "Eric W. Biederman"
Cc: Seth Forshee
Cc: Serge Hallyn
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
---
 fs/namei.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3701,8 +3701,7 @@ int vfs_mknod(struct inode *dir, struct
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
-	    !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)
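
For reference, the "ask for forgiveness not permission" pattern described
in the changelog might look roughly like the following in userspace. This
is a minimal sketch, not code taken from systemd or any container
runtime. The enum, the function names and the idea of probing with a
private copy of /dev/null (character device 1:3) are assumptions made
purely for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
enum node_state {
	NODE_USABLE,                 /* mknod() and open() both succeeded */
	NODE_NOT_CREATED,            /* mknod() failed: fall back, e.g. bind mount */
	NODE_CREATED_BUT_UNOPENABLE, /* the "partially functional" case */
};

/* Pure helper: map the two syscall results onto a state. */
enum node_state classify_node(int mknod_rc, int open_fd)
{
	if (mknod_rc != 0)
		return NODE_NOT_CREATED;
	return open_fd >= 0 ? NODE_USABLE : NODE_CREATED_BUT_UNOPENABLE;
}

/* Try to create and open a private copy of /dev/null at path. */
enum node_state probe_null_node(const char *path)
{
	enum node_state state;
	int rc, fd = -1;

	assert(path != NULL);

	/* Ask for forgiveness: just try, and look at the result. */
	rc = mknod(path, S_IFCHR | 0600, makedev(1, 3));
	if (rc == 0)
		fd = open(path, O_RDWR);

	state = classify_node(rc, fd);

	if (fd >= 0)
		close(fd);
	if (rc == 0)
		unlink(path); /* remove the probe node again */

	return state;
}
```

Callers that treat a successful mknod() as "the node is usable" never
expect NODE_CREATED_BUT_UNOPENABLE. With the revert applied, userns root
should get NODE_NOT_CREATED again and can take its fallback path.
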