Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp792147ybl; Fri, 6 Dec 2019 06:20:33 -0800 (PST) X-Google-Smtp-Source: APXvYqzyV6bAtnDhTHqxSrJCy9XqnddePZOZznQGPJoxf7OPWPnV7VsfXaPAl/6R2WcEeXG8Vt1O X-Received: by 2002:a9d:7593:: with SMTP id s19mr10473297otk.219.1575642033388; Fri, 06 Dec 2019 06:20:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1575642033; cv=none; d=google.com; s=arc-20160816; b=FgN1nlJ9zRWKVlr/jd+LJgoTsxH+Jh89b1LTY0XYdvW8g7HvJPIupfAVhtuMwpk/J2 P2wsz37mo8Rlz0iicBN2GP7uFPlBfAYjeAEo6hN7Z7OQzHYVi8todZIB/05D9ktsymAh H33xqlIMy5A5TVay5MzPX0GMnqjl7vu717zH5lUsPg5GbtJxefXjCoTRskoNDBuFCul/ LKlTQQPDOwmx85Kom14OqOG51ZBJpVhdnPvkM808vQm7aWtAFp0MEp87ZKI9T7EQ33Ls BY2nuWQN2uSHz3I5q8pnLRjhgiCwAeMoDZL27ATdYFBsD5G0MJhGtHZGACglv5uxL/8d oPdQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=Cq3qTc/nIriQslrKp+C4WNf8xyqGDHB6kUD3ma+DPto=; b=aDFykEkLVhhO0KS4D8oTI1KOmp25h6x2900xBFSoRYNHmVTRaNcfZdYLuPC2lC1BjZ BhgKxGpYu+UVm4zAL1dWrKk0neQfoXF9Bknqq8JP5hL/m/XRjrLLs26ScHWSUqzcBjsB p9pFJCJBriBzJ4YFED3IUOKrb9ayUAYwtbGUrunRx13oJQd2GXbijcmK5m/YmigEjitS TS+Mb1FBFD1f7vn/xrj7VXve8x2kc5ARMNDgbIHVPxg7Su8lMwLvzMhqMV9t8jLFdqSF 7rbkw53zahx9osQK1pbxLQ/K6TR0PxAip9+QtzTYjRhE73F9TFzCoQ4TGijh2d0PZHfT wyYA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c12si7235629otn.7.2019.12.06.06.20.20; Fri, 06 Dec 2019 06:20:33 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726720AbfLFOSu (ORCPT + 99 others); Fri, 6 Dec 2019 09:18:50 -0500 Received: from mout-p-101.mailbox.org ([80.241.56.151]:22222 "EHLO mout-p-101.mailbox.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726244AbfLFOSs (ORCPT ); Fri, 6 Dec 2019 09:18:48 -0500 Received: from smtp2.mailbox.org (smtp2.mailbox.org [80.241.60.241]) (using TLSv1.2 with cipher ECDHE-RSA-CHACHA20-POLY1305 (256/256 bits)) (No client certificate requested) by mout-p-101.mailbox.org (Postfix) with ESMTPS id 47TvpD3cQ0zKmSy; Fri, 6 Dec 2019 15:18:44 +0100 (CET) X-Virus-Scanned: amavisd-new at heinlein-support.de Received: from smtp2.mailbox.org ([80.241.60.241]) by hefe.heinlein-support.de (hefe.heinlein-support.de [91.198.250.172]) (amavisd-new, port 10030) with ESMTP id qxNN57LwiH2T; Fri, 6 Dec 2019 15:18:39 +0100 (CET) From: Aleksa Sarai To: Al Viro , Jeff Layton , "J. Bruce Fields" , Arnd Bergmann , David Howells , Shuah Khan , Shuah Khan , Ingo Molnar , Peter Zijlstra , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau , Song Liu , Yonghong Song , Andrii Nakryiko , Jonathan Corbet Cc: Aleksa Sarai , Christian Brauner , Jann Horn , Linus Torvalds , Eric Biederman , Andy Lutomirski , Andrew Morton , Kees Cook , Tycho Andersen , David Drysdale , Chanho Min , Oleg Nesterov , Rasmus Villemoes , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Christian Brauner , Aleksa Sarai , dev@opencontainers.org, containers@lists.linux-foundation.org, bpf@vger.kernel.org, netdev@vger.kernel.org, linux-alpha@vger.kernel.org, linux-api@vger.kernel.org, libc-alpha@sourceware.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-sh@vger.kernel.org, linux-xtensa@linux-xtensa.org, sparclinux@vger.kernel.org Subject: [PATCH v18 10/13] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution Date: Sat, 7 Dec 2019 01:13:35 +1100 Message-Id: <20191206141338.23338-11-cyphar@cyphar.com> In-Reply-To: <20191206141338.23338-1-cyphar@cyphar.com> References: <20191206141338.23338-1-cyphar@cyphar.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution (in the case of LOOKUP_BENEATH the resolution will still fail if ".." resolution would resolve a path outside of the root -- while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are still disallowed entirely[*]. As Jann explains[1,2], the need for this patch (and the original no-".." restriction) is explained by observing there is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root. thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat2(dirb, "b/c/../../etc/shadow", { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } ); With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar (though somewhat more privileged) attack using MS_MOVE. With this patch, such cases will be detected *during* ".." resolution and will return -EAGAIN for userspace to decide to either retry or abort the lookup. It should be noted that ".." is the weak point of chroot(2) -- walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root (except through a bind-mount or magic-link). There is also no other way for a directory's parent to change (which is the primary worry with ".." resolution here) other than a rename or MS_MOVE. The primary reason for deferring to userspace with -EAGAIN is that an in-kernel retry loop (or doing a path_is_under() check after re-taking the relevant seqlocks) can become unreasonably expensive on machines with lots of VFS activity (nfsd can cause lots of rename_lock updates). Thus it should be up to userspace how many times they wish to retry the lookup -- the selftests for this attack indicate that there is a ~35% chance of the lookup succeeding on the first try even with an attacker thrashing rename_lock. A variant of the above attack is included in the selftests for openat2(2) later in this patch series. I've run this test on several machines for several days and no instances of a breakout were detected. While this is not concrete proof that this is safe, when combined with the above argument it should lend some trustworthiness to this construction. [*] It may be acceptable in the future to do a path_is_under() check for magic-links after they are resolved. However this seems unlikely to be a feature that people *really* need -- it can be added later if it turns out a lot of people want it. [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/ [2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/ Cc: Christian Brauner Suggested-by: Jann Horn Suggested-by: Linus Torvalds Signed-off-by: Aleksa Sarai --- fs/namei.c | 43 +++++++++++++++++++++++++++---------------- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 69cd0b296e8e..9f60c3f49b8b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -491,7 +491,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags; - unsigned seq, m_seq; + unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -1791,22 +1791,31 @@ static inline int handle_dots(struct nameidata *nd, int type) if (type == LAST_DOTDOT) { int error = 0; - /* - * Scoped-lookup flags resolving ".." is not currently safe -- - * races can cause our parent to have moved outside of the root - * and us to skip over it. - */ - if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) - return -EXDEV; if (!nd->root.mnt) { error = set_root(nd); if (error) return error; } - if (nd->flags & LOOKUP_RCU) { - return follow_dotdot_rcu(nd); - } else - return follow_dotdot(nd); + if (nd->flags & LOOKUP_RCU) + error = follow_dotdot_rcu(nd); + else + error = follow_dotdot(nd); + if (error) + return error; + + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) { + /* + * If there was a racing rename or mount along our + * path, then we can't be sure that ".." hasn't jumped + * above nd->root (and so userspace should retry or use + * some fallback). + */ + smp_rmb(); + if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq))) + return -EAGAIN; + if (unlikely(__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq))) + return -EAGAIN; + } } return 0; } @@ -2276,6 +2285,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0; + + nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount); + nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount); + smp_rmb(); + if (flags & LOOKUP_ROOT) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; @@ -2284,9 +2298,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path = nd->root; nd->inode = inode; if (flags & LOOKUP_RCU) { - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); nd->root_seq = nd->seq; - nd->m_seq = read_seqbegin(&mount_lock); } else { path_get(&nd->path); } @@ -2297,8 +2310,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.mnt = NULL; nd->path.dentry = NULL; - nd->m_seq = read_seqbegin(&mount_lock); - /* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */ if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) { error = nd_jump_root(nd); -- 2.24.0