From: Aleksa Sarai
To: Al Viro, Jeff Layton, "J. Bruce Fields", Arnd Bergmann, David Howells
Cc: Aleksa Sarai, Eric Biederman, Christian Brauner,
	linux-api@vger.kernel.org, Andy Lutomirski, Jann Horn,
	David Drysdale, Aleksa Sarai, containers@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org
Subject: [PATCH v4 3/4] namei: O_THISROOT: chroot-like path resolution
Date: Tue, 13 Nov 2018 01:26:53 +1100
Message-Id: <20181112142654.341-4-cyphar@cyphar.com>
In-Reply-To: <20181112142654.341-1-cyphar@cyphar.com>
References: <20181112142654.341-1-cyphar@cyphar.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing O_XDEV
and O_NOMAGICLINKS[**] help defend against other potential attacks in a
malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "magic link" jumping are disallowed for the
    same reason it is disabled for O_BENEATH -- currently it is not safe
    to allow it. Future patches may enable it unconditionally once we
    have resolved the possible races (for "..") and semantics (for
    "magic link" jumping).

The most significant openat(2) semantic change with O_THISROOT is that
absolute pathnames no longer cause dirfd to be ignored completely. The
rationale is that O_THISROOT must necessarily chroot-scope symlinks with
absolute paths to dirfd, and so doing it for the base path seems to be
the most consistent behaviour (and also avoids foot-gunning users who
want to scope paths that are absolute).

Currently this is only enabled for openat(2), and similar to O_BENEATH
and family requires more discussion about extending it to more *at(2)
syscalls as well as extending AT_EMPTY_PATH support.

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Eric Biederman
Cc: Christian Brauner
Cc:
Signed-off-by: Aleksa Sarai
---
 fs/fcntl.c                       | 2 +-
 fs/namei.c                       | 6 +++---
 fs/open.c                        | 4 +++-
 include/linux/fcntl.h            | 2 +-
 include/linux/namei.h            | 1 +
 include/uapi/asm-generic/fcntl.h | 3 +++
 6 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index e343618736f7..4c36c5b9fdb9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index b8d2bee89b78..459faea5b832 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1097,7 +1097,7 @@ const char *get_link(struct nameidata *nd)
 		if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
 			return ERR_PTR(-ELOOP);
 		/* Not currently safe. */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 			return ERR_PTR(-EXDEV);
 	}
 	if (IS_ERR_OR_NULL(res))
@@ -1746,7 +1746,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
 		 * cause our parent to have moved outside of the root and us to skip
 		 * over it.
 		 */
-		if (unlikely(nd->flags & LOOKUP_BENEATH))
+		if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
 			return -EXDEV;
 		if (!nd->root.mnt)
 			set_root(nd);
@@ -2297,7 +2297,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 
-	if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+	if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
 		error = dirfd_path_init(nd);
 		if (unlikely(error))
 			return ERR_PTR(error);
diff --git a/fs/open.c b/fs/open.c
index 3e73f940f56e..4ba44b07f3ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -960,7 +960,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * cannot have anything other than the below set of flags
 		 */
 		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH |
-			 O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS;
+			 O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS | O_THISROOT;
 		acc_mode = 0;
 	}
 
@@ -997,6 +997,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_NO_MAGICLINKS;
 	if (flags & O_NOSYMLINKS)
 		lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & O_THISROOT)
+		lookup_flags |= LOOKUP_CHROOT;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 864399c2fdd2..46c92bbfce4a 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
-	 O_NOMAGICLINKS | O_NOSYMLINKS)
+	 O_NOMAGICLINKS | O_NOSYMLINKS | O_THISROOT)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 82b5039d27a6..b6865eda86d5 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 #define LOOKUP_NO_SYMLINKS	0x080000 /* No symlink crossing *at all*.
 					    Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_CHROOT		0x100000 /* Treat dirfd as %current->fs->root. */
 
 extern int path_pts(struct path *path);
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index b2d3811843e7..194f5de9ba51 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -113,6 +113,9 @@
 #ifndef O_NOSYMLINKS
 #define O_NOSYMLINKS	01000000000
 #endif
+#ifndef O_THISROOT
+#define O_THISROOT	02000000000
+#endif
 
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
-- 
2.19.1
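
For readers who want a concrete picture of the two semantics described in
the commit message (absolute pathnames being re-anchored at dirfd, and
".." being refused with -EXDEV for now), here is a rough userspace model.
It is only a sketch under stated assumptions: scoped_join() is a
hypothetical helper that is not part of this series, it works purely on
strings, it never touches the filesystem, and it does not model symlink
resolution at all, which is the part the kernel-side change is really
about.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative userspace model only, NOT the kernel implementation.
 * scoped_join() is a hypothetical helper: given the path of the
 * directory that dirfd refers to ("root") and the pathname passed to
 * openat(2), it produces the string a chroot-scoped lookup would
 * start from. Symlinks are not considered.
 */
int scoped_join(const char *root, const char *path,
		char *out, size_t outlen)
{
	const char *p = path;
	int n;

	assert(root && path && out);

	/* Refuse any ".." component, mirroring the -EXDEV in handle_dots(). */
	while (*p) {
		size_t len;

		while (*p == '/')
			p++;
		len = strcspn(p, "/");
		if (len == 2 && p[0] == '.' && p[1] == '.')
			return -EXDEV;
		p += len;
	}

	/* An absolute pathname is re-anchored at root, not at "/". */
	while (*path == '/')
		path++;

	n = snprintf(out, outlen, "%s/%s", root, path);
	if (n < 0 || (size_t)n >= outlen)
		return -ENAMETOOLONG;
	return 0;
}
```

With this model, scoped_join("/r", "/etc/passwd", ...) and
scoped_join("/r", "etc/passwd", ...) both yield "/r/etc/passwd", while
any pathname containing a ".." component is rejected. A purely lexical
join like this is exactly the kind of userspace resolution the commit
message calls racy, so it is meant to illustrate the semantics and
should not be read as a substitute for the flag.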