This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v18:
* Further fixups from Al Viro:
- Don't WARN_ON in complete_walk() check since it can be trivially
triggered by userspace. Also, improve the comment so the purpose of the
check is more clear.
- Avoid duplicate smp_rmb() when in handle_dots() by doing
__read_seqcount_retry().
- Drop vestigial UPGRADE_NO* flag definitions in uapi.
* Update non-zero __padding test to include all bytes of the padding.
v17: <https://lore.kernel.org/lkml/[email protected]/>
<https://lore.kernel.org/lkml/[email protected]/>
v16: <https://lore.kernel.org/lkml/[email protected]/>
v15: <https://lore.kernel.org/lkml/[email protected]/>
v14: <https://lore.kernel.org/lkml/[email protected]/>
<https://lore.kernel.org/lkml/[email protected]>
v13: <https://lore.kernel.org/lkml/[email protected]/>
v12: <https://lore.kernel.org/lkml/[email protected]/>
v11: <https://lore.kernel.org/lkml/[email protected]/>
<https://lore.kernel.org/lkml/[email protected]/>
v10: <https://lore.kernel.org/lkml/[email protected]/>
v09: <https://lore.kernel.org/lkml/[email protected]/>
v08: <https://lore.kernel.org/lkml/[email protected]/>
v07: <https://lore.kernel.org/lkml/[email protected]/>
v06: <https://lore.kernel.org/lkml/[email protected]/>
v05: <https://lore.kernel.org/lkml/[email protected]/>
v04: <https://lore.kernel.org/lkml/[email protected]/>
v03: <https://lore.kernel.org/lkml/[email protected]/>
v02: <https://lore.kernel.org/lkml/[email protected]/>
v01: <https://lore.kernel.org/lkml/[email protected]/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
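As a small illustration of the O_TMPFILE trick described above, the following userspace sketch checks that the flag value exported by the libc headers contains O_DIRECTORY. The helper name is hypothetical, and the sketch assumes a Linux libc that defines O_TMPFILE when _GNU_SOURCE is set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reports whether the libc definition of O_TMPFILE
 * carries the O_DIRECTORY bit, as the cover letter describes.
 */
static bool tmpfile_implies_directory(void)
{
	return (O_TMPFILE & O_DIRECTORY) == O_DIRECTORY;
}
```

A kernel that ignores the unknown bit still sees O_DIRECTORY combined with a write access mode, which fails with EISDIR on a directory and ENOTDIR on anything else, so the caller gets an error rather than a silently ignored flag.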
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional work to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
magic-links can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have been mitigated by having LOOKUP_IN_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
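The difference between LOOKUP_BENEATH and LOOKUP_IN_ROOT described above can be sketched with a toy userspace model. This is only a lexical walk under stated assumptions: it ignores symlinks, mounts, and races entirely, and the names scope_mode and scoped_depth() are hypothetical rather than anything in the patches.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT };

/*
 * Toy model: walk a path lexically and track how many levels below the
 * starting directory we are. Returns the final depth, or -EXDEV when a
 * BENEATH-style lookup would leave the starting directory.
 */
static int scoped_depth(const char *path, enum scope_mode mode)
{
	const char *p = path;
	int depth = 0;

	assert(path != NULL);

	/* Absolute paths: rejected by BENEATH, re-rooted by IN_ROOT. */
	if (*p == '/' && mode == SCOPE_BENEATH)
		return -EXDEV;

	while (*p) {
		size_t len;

		while (*p == '/')
			p++;
		len = strcspn(p, "/");
		if (len == 0)
			break;

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (depth > 0)
				depth--;
			else if (mode == SCOPE_BENEATH)
				return -EXDEV;
			/* IN_ROOT: ".." at the root stays at the root. */
		} else if (!(len == 1 && p[0] == '.')) {
			depth++;
		}
		p += len;
	}
	return depth;
}
```

For "a/../../etc", the BENEATH model reports an escape while the IN_ROOT model clamps the second ".." and ends up one level below the starting point, which mirrors the "block" versus "scope" wording used for the two flags.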
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
through stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://lore.kernel.org/lkml/[email protected]
[5]: https://lore.kernel.org/lkml/[email protected]
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file does not exist, it may
optionally (if O_CREAT is specified in how.flags) be created by openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the directory referred to by
the file descriptor dirfd (or the current working directory of the calling process, if dirfd is the special
value AT_FDCWD.) If pathname is absolute, then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT,
in which case pathname is resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its functionality. Rather
than taking a single flag argument, an extensible structure (how) is passed instead to allow for future exten-
sions. size must be set to sizeof(struct open_how), to facilitate future extensions (see the "Extensibility"
section of the NOTES for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of the flag and mode
arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above structure (or
through reuse of pre-existing padding space), with the zero value of the new fields acting as though the ex-
tension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the O_* flags defined for
openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to openat(2). However,
unlike openat(2), it is an error to provide openat2() with a mode which contains bits other than
07777.
It is an error to provide openat2() a non-zero mode if flags does not contain O_CREAT or O_TMP-
FILE.
resolve
Change how the components of pathname will be resolved (see path_resolution(7) for background
information.) The primary use case for these flags is to allow trusted programs to restrict how
untrusted paths (or paths inside untrusted directories) are resolved. The full list of resolve
flags is given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including all bind mounts).
Users of this flag are encouraged to make its use configurable (unless it is used for a
specific security purpose), as bind mounts are very widely used by end-users. Setting
this flag indiscriminately for all uses of openat2() may result in spurious errors on
previously-functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This option implies RE-
SOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both O_PATH and O_NOFOL-
LOW, then an O_PATH file descriptor referencing the symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (unless it is used for a
specific security purpose), as symbolic links are very widely used by end-users. Setting
this flag indiscriminately for all uses of openat2() may result in spurious errors on
previously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both O_PATH and O_NOFOLLOW,
then an O_PATH file descriptor referencing the magic link will be returned.
Magic links are symbolic link-like objects that are most notably found in proc(5) (examples
include /proc/[pid]/exe and /proc/[pid]/fd/*.) Due to the potential danger of unknowingly
opening these magic links, it may be preferable for users to disable their resolution
entirely (see symlink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the resolution is not a
descendant of the directory indicated by dirfd. This causes absolute symbolic links
(and absolute values of pathname) to be rejected.
Currently, this flag also disables magic link resolution. However, this may change in
the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though the user called ch-
root(2) with dirfd as the argument.) Absolute symbolic links and ".." path components
will be scoped to dirfd. If pathname is an absolute path, it is also treated relative to
dirfd.
However, unlike chroot(2) (which changes the filesystem root permanently for a process),
RESOLVE_IN_ROOT allows a program to efficiently restrict path resolution for only certain
operations. It also has several hardening features (such as detecting escape attempts
during ".." resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However, this may change in
the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2), as well as the fol-
lowing additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see the "Extensibility"
section of the NOTES for more detail on how extensions are handled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could not ensure that a ".."
component didn't escape (due to a race condition or potential attack.) Callers may choose to retry the
openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the root during path
resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic link (or magic
link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic link.
VERSIONS
openat2() first appeared in Linux 5.6.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using syscall(2).
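A minimal sketch of calling openat2() through syscall(2) might look like the following. Several things here are assumptions rather than part of this draft: the struct and wrapper names are hypothetical, the structure is redeclared locally with the layout shown above in case the installed headers predate the syscall, the fallback syscall number 437 is the value used on most architectures once the syscall was merged, and 0x08 is the value RESOLVE_BENEATH has in the merged uapi header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437			/* assumption: not from this draft */
#endif

#define SKETCH_RESOLVE_BENEATH 0x08	/* assumption: merged uapi value */

/* Local copy of the layout given in the draft above (24 bytes). */
struct open_how_sketch {
	_Alignas(8) uint64_t flags;	/* O_* flags */
	uint16_t mode;			/* mode for O_CREAT / O_TMPFILE */
	uint16_t padding[3];		/* must be zeroed */
	_Alignas(8) uint64_t resolve;	/* RESOLVE_* flags */
};

/* Hypothetical wrapper: returns a new fd, or -errno on failure. */
static int openat2_sketch(int dirfd, const char *path,
			  uint64_t flags, uint64_t resolve)
{
	struct open_how_sketch how;
	long fd;

	assert(path != NULL);

	/* Zero everything first so padding and unused fields are clean. */
	memset(&how, 0, sizeof(how));
	how.flags = flags;
	how.resolve = resolve;

	fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
	return fd < 0 ? -errno : (int)fd;
}
```

A caller would typically treat -ENOSYS as "kernel too old" and fall back to a userspace resolver, which is the approach the cover letter attributes to libpathrs.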
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2() requires userspace
to specify the size of the struct open_how structure it is passing. By providing this information, it is
possible for openat2() to provide both forwards- and backwards-compatibility, with size acting as an implicit
version number (because new extension fields will always be appended, the size will always increase.) This
extensibility design is very similar to other system calls such as sched_setattr(2), perf_event_open(2), and
clone3(2).
If we let usize be the size of the structure according to userspace and ksize be the size of the structure
which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used verbatim.
* If ksize is larger than usize, then there are some extensions the kernel supports which the
userspace program is unaware of. Because all extensions must have their zero values be a no-op, the
kernel treats all of the extension fields not set by userspace to have zero values. This provides
backwards-compatibility.
* If ksize is smaller than usize, then there are some extensions which the userspace program is aware
of but the kernel does not support. Because all extensions must have their zero values be a no-op,
the kernel can safely ignore the unsupported extension fields if they are all-zero. If any unsup-
ported extension fields are non-zero, then -1 is returned and errno is set to E2BIG. This provides
forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of extensions. However, if a
userspace program wishes to determine what extensions the running kernel supports, it may conduct a binary
search on size (to find the largest value which doesn't produce an error of E2BIG.)
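The three size cases above can be modelled in userspace roughly as follows. This is a sketch of the rules as stated in this section, not the kernel's copy-in code; the function name and the SKETCH_SIZE_VER0 constant (24 bytes, the size of the structure shown earlier) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SIZE_VER0 24	/* size of the first struct open_how */

/*
 * Copy a usize-byte structure from "userspace" (src) into a ksize-byte
 * "kernel" structure (dst), following the rules described above.
 * Returns 0 on success or a negative errno.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	const unsigned char *bytes = src;
	size_t common = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Smaller than any known version of the structure. */
	if (usize < SKETCH_SIZE_VER0)
		return -EINVAL;

	/* ksize < usize: unknown trailing fields must all be zero. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* ksize > usize: fields userspace did not set read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, common);
	return 0;
}
```

Because zero always means "extension not used", the same binary keeps working when either side of the interface is upgraded independently.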
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 199 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 35 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 320 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1697 insertions(+), 115 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 219d54332a09e8d8741c1e1982f5eae56099de85
--
2.24.0
In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
ability for nd_jump_link() to return an error which the corresponding
get_link() caller must propagate back up to the VFS.
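The callers converted in this patch end with "return ERR_PTR(error)", which covers both outcomes because ERR_PTR(0) is simply NULL. The following userspace sketch re-creates that idiom with stand-in helpers; err_ptr(), is_err(), ptr_err(), jump_link_sketch() and get_link_sketch() are hypothetical names that only mimic the kernel macros and callbacks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's ERR_PTR() family. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Stand-in for nd_jump_link(): 0 on success, negative errno if refused. */
static int jump_link_sketch(bool blocked)
{
	return blocked ? -ELOOP : 0;
}

/* Shape of a converted ->get_link() implementation. */
static const char *get_link_sketch(bool blocked)
{
	int error = jump_link_sketch(blocked);

	/* NULL when error == 0 (jump done), an error pointer otherwise. */
	return err_ptr(error);
}
```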
Suggested-by: Al Viro <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 3 ++-
fs/proc/base.c | 3 +--
fs/proc/namespaces.c | 14 +++++++++-----
include/linux/namei.h | 2 +-
security/apparmor/apparmorfs.c | 6 ++++--
5 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 5a47d9c09581..1024a641f075 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd)
* Helper to directly jump to a known parsed path from ->get_link,
* caller must have taken a reference to path beforehand.
*/
-void nd_jump_link(struct path *path)
+int nd_jump_link(struct path *path)
{
struct nameidata *nd = current->nameidata;
path_put(&nd->path);
@@ -867,6 +867,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
+ return 0;
}
static inline void put_link(struct nameidata *nd)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ee97dd322f3e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry *dentry,
if (error)
goto out;
- nd_jump_link(&path);
- return NULL;
+ error = nd_jump_link(&path);
out:
return ERR_PTR(error);
}
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 08dd94df1a66..a8cca516f1a9 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -51,11 +51,15 @@ static const char *proc_ns_get_link(struct dentry *dentry,
if (!task)
return ERR_PTR(-EACCES);
- if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
- error = ns_get_path(&ns_path, task, ns_ops);
- if (!error)
- nd_jump_link(&ns_path);
- }
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+ goto out;
+
+ error = ns_get_path(&ns_path, task, ns_ops);
+ if (error)
+ goto out;
+
+ error = nd_jump_link(&ns_path);
+out:
put_task_struct(task);
return ERR_PTR(error);
}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..758e9b47db6f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,7 +68,7 @@ extern int follow_up(struct path *);
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
-extern void nd_jump_link(struct path *path);
+extern int __must_check nd_jump_link(struct path *path);
static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
{
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 45d13b6462aa..0b7d6dce6291 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2455,16 +2455,18 @@ static const char *policy_get_link(struct dentry *dentry,
{
struct aa_ns *ns;
struct path path;
+ int error;
if (!dentry)
return ERR_PTR(-ECHILD);
+
ns = aa_get_current_ns();
path.mnt = mntget(aafs_mnt);
path.dentry = dget(ns_dir(ns));
- nd_jump_link(&path);
+ error = nd_jump_link(&path);
aa_put_ns(ns);
- return NULL;
+ return ERR_PTR(error);
}
static int policy_readlink(struct dentry *dentry, char __user *buffer,
--
2.24.0
For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
set_root() is never called, and thus (for hardening purposes) it should
return an error rather than permit a breakout from the root. In
addition, move all of the repetitive set_root() calls to nd_jump_root().
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1024a641f075..74574a69a614 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -798,7 +798,7 @@ static int complete_walk(struct nameidata *nd)
return status;
}
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
{
struct fs_struct *fs = current->fs;
@@ -814,6 +814,7 @@ static void set_root(struct nameidata *nd)
get_fs_root(fs, &nd->root);
nd->flags |= LOOKUP_ROOT_GRABBED;
}
+ return 0;
}
static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -837,6 +838,11 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (!nd->root.mnt) {
+ int error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1080,10 +1086,9 @@ const char *get_link(struct nameidata *nd)
return res;
}
if (*res == '/') {
- if (!nd->root.mnt)
- set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1698,8 +1703,13 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
- if (!nd->root.mnt)
- set_root(nd);
+ int error = 0;
+
+ if (!nd->root.mnt) {
+ error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
return follow_dotdot_rcu(nd);
} else
@@ -2162,6 +2172,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
+ int error;
const char *s = nd->name->name;
if (!*s)
@@ -2194,11 +2205,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
+
+ /* Figure out the starting path and root (if needed). */
if (*s == '/') {
- set_root(nd);
- if (likely(!nd_jump_root(nd)))
- return s;
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ return s;
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
--
2.24.0
/* Background. */
Userspace cannot easily resolve a path without resolving symlinks, and
would have to manually resolve each path component with O_PATH and
O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
up (resulting in possible security bugs). Linus has mentioned that Git
has a particular need for this kind of flag[1]. It also resolves a
fairly long-standing perceived deficiency in O_NOFOLLOW -- that it only
blocks the opening of trailing symlinks.
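For comparison, the manual userspace walk mentioned above could be sketched as follows. The helper name is hypothetical, and the sketch makes simplifying assumptions: it accepts relative paths only, does nothing special for "..", and (unlike the flag proposed here) also rejects a trailing symlink.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns an O_PATH fd for path, or -errno. */
static int open_path_nosymlinks(int dirfd, const char *path)
{
	char *copy, *save = NULL, *name;
	int cur, ret;

	assert(path != NULL);
	if (path[0] == '/')
		return -EINVAL;		/* sketch: relative paths only */

	copy = strdup(path);
	if (!copy)
		return -ENOMEM;

	/* Take a private handle on the starting directory. */
	cur = openat(dirfd, ".", O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (cur < 0) {
		ret = -errno;
		goto out;
	}

	for (name = strtok_r(copy, "/", &save); name;
	     name = strtok_r(NULL, "/", &save)) {
		struct stat st;
		int next = openat(cur, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);

		if (next < 0) {
			ret = -errno;
			close(cur);
			goto out;
		}
		close(cur);
		cur = next;

		/* O_NOFOLLOW gave us the link itself; refuse to go on. */
		if (fstat(cur, &st) < 0) {
			ret = -errno;
			close(cur);
			goto out;
		}
		if (S_ISLNK(st.st_mode)) {
			ret = -ELOOP;
			close(cur);
			goto out;
		}
	}
	ret = cur;
out:
	free(copy);
	return ret;
}
```

Each component costs at least two system calls here, which is the inefficiency the commit message refers to.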
This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
variation on David Drysdale's O_BENEATH patchset[3], which in turn was
based on the Capsicum project[4]).
/* Userspace API. */
LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_SYMLINKS applies to all components of the path.
With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
path resolution will yield -ELOOP. If the trailing component is a
symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
will not error out and will instead provide a handle to the trailing
symlink -- without resolving it.
/* Testing. */
LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/
Cc: Christian Brauner <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 3 +++
include/linux/namei.h | 3 +++
2 files changed, 6 insertions(+)
diff --git a/fs/namei.c b/fs/namei.c
index 74574a69a614..303731935eb2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1052,6 +1052,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 758e9b47db6f..0d86e75c04a7 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_ROOT 0x2000
#define LOOKUP_ROOT_GRABBED 0x0008
+/* Scoping flags for lookup. */
+#define LOOKUP_NO_SYMLINKS 0x010000 /* No symlink crossing. */
+
extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
--
2.24.0
/* Background. */
There has always been a special class of symlink-like objects in procfs
(and a few other pseudo-filesystems) which allow for non-lexical
resolution of paths using nd_jump_link(). These "magic-links" do not
follow traditional mount namespace boundaries, and have been used
consistently in container escape attacks because they can be used to
trick unsuspecting privileged processes into resolving unexpected paths.
It is also non-trivial for userspace to unambiguously avoid resolving
magic-links, because they do not have a reliable indication that they
are a magic-link (in order to verify them you'd have to manually open
the path given by readlink(2) and then verify that the two file
descriptors reference the same underlying file, which is plagued with
possible race conditions or supplementary attack scenarios).
It would therefore be very helpful for userspace to be able to avoid
these symlinks easily, thus hopefully removing a tool from attackers'
toolboxes.
This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).
/* Userspace API. */
LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_MAGICLINKS applies to all components of the path.
With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
/* Testing. */
LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/[email protected]/
Cc: Christian Brauner <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 10 +++++++++-
include/linux/namei.h | 1 +
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 303731935eb2..415a897729c8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -867,13 +867,21 @@ static int nd_jump_root(struct nameidata *nd)
*/
int nd_jump_link(struct path *path)
{
+ int error = -ELOOP;
struct nameidata *nd = current->nameidata;
- path_put(&nd->path);
+ if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+ goto err;
+
+ path_put(&nd->path);
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
return 0;
+
+err:
+ path_put(path);
+ return error;
}
static inline void put_link(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 0d86e75c04a7..1573b8493d98 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -41,6 +41,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
/* Scoping flags for lookup. */
#define LOOKUP_NO_SYMLINKS 0x010000 /* No symlink crossing. */
+#define LOOKUP_NO_MAGICLINKS 0x020000 /* No nd_jump_link() crossing. */
extern int path_pts(struct path *path);
--
2.24.0
/* Background. */
The need to contain path operations within a mountpoint has been a
long-standing usecase that userspace has historically implemented
manually with liberal usage of stat(). Programs such as find, rsync,
tar, and many others implement these semantics -- but it'd be much
simpler to have a fool-proof way of refusing to open a path if it
crosses a mountpoint.
This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).
/* Userspace API. */
LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_XDEV applies to all components of the path.
With LOOKUP_NO_XDEV, any path component which crosses a mount-point
during path resolution (including "..") will yield an -EXDEV. Absolute
paths, absolute symlinks, and magic-links will only yield an -EXDEV if
the jump involved changing mount-points.
/* Testing. */
LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/[email protected]/
Cc: Christian Brauner <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 31 +++++++++++++++++++++++++++----
include/linux/namei.h | 1 +
2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 415a897729c8..321c8ad5d6b3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -838,6 +838,11 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+ /* Absolute path arguments to path_init() are allowed. */
+ if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+ return -EXDEV;
+ }
if (!nd->root.mnt) {
int error = set_root(nd);
if (error)
@@ -873,6 +878,12 @@ int nd_jump_link(struct path *path)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
goto err;
+ error = -EXDEV;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+ if (nd->path.mnt != path->mnt)
+ goto err;
+ }
+
path_put(&nd->path);
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
@@ -1280,12 +1291,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}
- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1342,6 +1357,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1388,6 +1405,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -ECHILD;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1402,6 +1421,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -ECHILD;
nd->path.mnt = &mounted->mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1500,6 +1521,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 1573b8493d98..25ee88c4acb1 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -42,6 +42,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
/* Scoping flags for lookup. */
#define LOOKUP_NO_SYMLINKS 0x010000 /* No symlink crossing. */
#define LOOKUP_NO_MAGICLINKS 0x020000 /* No nd_jump_link() crossing. */
+#define LOOKUP_NO_XDEV 0x040000 /* No mountpoint crossing. */
extern int path_pts(struct path *path);
--
2.24.0
/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.
This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).
/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.
With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. Regardless, userspace
should pass LOOKUP_NO_MAGICLINKS if it wants to ensure that magic-link
crossing is entirely disabled.
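The purely lexical part of these rules can be sketched in userspace as
follows. This is a hypothetical model, not kernel code:
beneath_lexical_check() is an invented name, and it only covers absolute
paths and ".." components. Symlinks and magic-links cannot be modelled
without touching the filesystem, so they are out of scope here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical model: returns 0 if @path contains nothing that
 * LOOKUP_BENEATH rejects lexically in this patch, or -EXDEV otherwise.
 */
int beneath_lexical_check(const char *path)
{
	const char *p = path;

	assert(path != NULL);

	/* Absolute paths jump to the root, which nd_jump_root() refuses. */
	if (*p == '/')
		return -EXDEV;

	while (*p) {
		size_t len = strcspn(p, "/");

		/* Every ".." component is refused by handle_dots(). */
		if (len == 2 && p[0] == '.' && p[1] == '.')
			return -EXDEV;

		p += len;
		p += strspn(p, "/");	/* skip one or more separators */
	}
	return 0;
}
```

Note that a name such as "..b" is an ordinary component and passes the
check; only a component that is exactly ".." is refused.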
/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 80 +++++++++++++++++++++++++++++++++++++++----
include/linux/namei.h | 4 +++
2 files changed, 78 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 321c8ad5d6b3..f0c15d2ace54 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -641,6 +641,14 @@ static bool legitimize_links(struct nameidata *nd)
static bool legitimize_root(struct nameidata *nd)
{
+ /*
+ * For scoped-lookups (where nd->root has been zeroed), we need to
+ * restart the whole lookup from scratch -- because set_root() is wrong
+ * for these lookups (nd->dfd is the root, not the filesystem root).
+ */
+ if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
+ return false;
+ /* Nothing to do if nd->root is zero or is managed by the VFS user. */
if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT))
return true;
nd->flags |= LOOKUP_ROOT_GRABBED;
@@ -776,12 +784,37 @@ static int complete_walk(struct nameidata *nd)
int status;
if (nd->flags & LOOKUP_RCU) {
- if (!(nd->flags & LOOKUP_ROOT))
+ /*
+ * We don't want to zero nd->root for scoped-lookups or
+ * externally-managed nd->root.
+ */
+ if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
nd->root.mnt = NULL;
if (unlikely(unlazy_walk(nd)))
return -ECHILD;
}
+ if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+ /*
+ * While the guarantee of LOOKUP_IS_SCOPED is (roughly) "don't
+ * ever step outside the root during lookup" and should already
+ * be guaranteed by the rest of namei, we want to avoid a namei
+ * BUG resulting in userspace being given a path that was not
+ * scoped within the root at some point during the lookup.
+ *
+ * So, do a final sanity-check to make sure that in the
+ * worst-case scenario (a complete bypass of LOOKUP_IS_SCOPED)
+ * we won't silently return an fd completely outside of the
+ * requested root to userspace.
+ *
+ * Userspace could move the path outside the root after this
+ * check, but as discussed elsewhere this is not a concern (the
+ * resolved file was inside the root at some point).
+ */
+ if (!path_is_under(&nd->path, &nd->root))
+ return -EXDEV;
+ }
+
if (likely(!(nd->flags & LOOKUP_JUMPED)))
return 0;
@@ -802,6 +835,14 @@ static int set_root(struct nameidata *nd)
{
struct fs_struct *fs = current->fs;
+ /*
+ * Jumping to the real root in a scoped-lookup is a BUG in namei, but we
+ * still have to ensure it doesn't happen because it will cause a breakout
+ * from the dirfd.
+ */
+ if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED))
+ return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
@@ -838,6 +879,8 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
/* Absolute path arguments to path_init() are allowed. */
if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
@@ -883,6 +926,9 @@ int nd_jump_link(struct path *path)
if (nd->path.mnt != path->mnt)
goto err;
}
+ /* Not currently safe for scoped-lookups. */
+ if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
+ goto err;
path_put(&nd->path);
nd->path = *path;
@@ -1379,8 +1425,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;
while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -ECHILD;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1510,9 +1559,12 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
- while(1) {
- if (path_equal(&nd->path, &nd->root))
+ while (1) {
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1739,6 +1791,13 @@ static inline int handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
int error = 0;
+ /*
+ * Scoped-lookup flags resolving ".." is not currently safe --
+ * races can cause our parent to have moved outside of the root
+ * and us to skip over it.
+ */
+ if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
+ return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
@@ -2261,7 +2320,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
- return s;
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
@@ -2286,8 +2344,18 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
- return s;
}
+ /* For scoped-lookups we need to set the root to the dirfd as well. */
+ if (flags & LOOKUP_IS_SCOPED) {
+ nd->root = nd->path;
+ if (flags & LOOKUP_RCU) {
+ nd->root_seq = nd->seq;
+ } else {
+ path_get(&nd->root);
+ nd->flags |= LOOKUP_ROOT_GRABBED;
+ }
+ }
+ return s;
}
static const char *trailing_symlink(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 25ee88c4acb1..93dad378f1e8 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_NAMEI_H
#define _LINUX_NAMEI_H
+#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/path.h>
#include <linux/fcntl.h>
@@ -43,6 +44,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_SYMLINKS 0x010000 /* No symlink crossing. */
#define LOOKUP_NO_MAGICLINKS 0x020000 /* No nd_jump_link() crossing. */
#define LOOKUP_NO_XDEV 0x040000 /* No mountpoint crossing. */
+#define LOOKUP_BENEATH 0x080000 /* No escaping from starting point. */
+/* LOOKUP_* flags which do scope-related checks based on the dirfd. */
+#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
extern int path_pts(struct path *path);
--
2.24.0
/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.
Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.
Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root).
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).
More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.
/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_IN_ROOT applies to all components of the path.
With LOOKUP_IN_ROOT, any path component which attempts to cross the
starting point of the pathname lookup (the dirfd passed to openat) will
remain at the starting point. Thus, all absolute paths and symlinks will
be scoped within the starting point.
There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).
As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
will be lifted in a future patch, but requires more work to ensure that
permitting ".." is done safely.
Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. Regardless, userspace
should pass LOOKUP_NO_MAGICLINKS if it wants to ensure that magic-link
crossing is entirely disabled.
/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 10 +++++++---
include/linux/namei.h | 3 ++-
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f0c15d2ace54..69cd0b296e8e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2299,13 +2299,16 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
- /* Figure out the starting path and root (if needed). */
- if (*s == '/') {
+ /* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
+ if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
return s;
- } else if (nd->dfd == AT_FDCWD) {
+ }
+
+ /* Relative pathname -- get the starting-point it is relative to. */
+ if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
@@ -2345,6 +2348,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
}
fdput(f);
}
+
/* For scoped-lookups we need to set the root to the dirfd as well. */
if (flags & LOOKUP_IS_SCOPED) {
nd->root = nd->path;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 93dad378f1e8..93151e47ec47 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -45,8 +45,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_MAGICLINKS 0x020000 /* No nd_jump_link() crossing. */
#define LOOKUP_NO_XDEV 0x040000 /* No mountpoint crossing. */
#define LOOKUP_BENEATH 0x080000 /* No escaping from starting point. */
+#define LOOKUP_IN_ROOT 0x100000 /* Treat dirfd as fs root. */
/* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
+#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
extern int path_pts(struct path *path);
--
2.24.0
Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.
The main things these self-tests are enforcing are:
* The struct+usize ABI for openat2(2) and copy_struct_from_user() to
ensure that upgrades will be handled gracefully (in addition,
ensuring that misaligned structures are also handled correctly).
* The -EINVAL checks for openat2(2) are all correctly handled to avoid
userspace passing unknown or conflicting flag sets (most
importantly, ensuring that invalid flag combinations are checked).
* All of the RESOLVE_* semantics (including errno values) are
correctly handled with various combinations of paths and flags.
* RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
attack that has been responsible for several CVEs (and likely will
be responsible for several more).
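The size handling that the struct+usize tests expect can be summarised by
the following userspace model. It is a hypothetical sketch of the expected
return values only, not the kernel's copy_struct_from_user()
implementation: sketch_copy_struct() is an invented name, and
SKETCH_SIZE_VER0 mirrors OPEN_HOW_SIZE_VER0 from the selftest helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SIZE_VER0 24	/* OPEN_HOW_SIZE_VER0 in the selftests */

/*
 * Hypothetical model: copy a @usize-byte "userspace" struct at @src into a
 * @ksize-byte "kernel" struct at @dst, zero-filling anything not supplied.
 */
int sketch_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t n = usize < ksize ? usize : ksize;

	assert(dst != NULL && src != NULL);

	/* Smaller than the first published struct. */
	if (usize < SKETCH_SIZE_VER0)
		return -EINVAL;

	/* Trailing bytes the "kernel" does not understand must be zero. */
	for (size_t i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, s, n);
	return 0;
}
```

The design point is that an older kernel accepts a newer, larger struct as
long as the fields it does not know about are left zeroed.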
Cc: Shuah Khan <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 320 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
8 files changed, 1229 insertions(+)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cdbae6f4e61..28996856ed5e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
+TARGETS += openat2
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+ return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+ int ret = syscall(__NR_openat2, dfd, path, how, size);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+ return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+ int ret = openat(dfd, path, how->flags, how->mode);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+ int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+ int fd = openat(dfd, path, O_CREAT, 0700);
+ if (fd >= 0)
+ close(fd);
+ return fd;
+}
+
+char *fdreadlink(int fd)
+{
+ char *target, *tmp;
+
+ E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+ target = malloc(PATH_MAX);
+ if (!target)
+ ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+ memset(target, 0, PATH_MAX);
+
+ E_readlink(tmp, target, PATH_MAX);
+ free(tmp);
+ return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+ char *fdpath, *dfdpath, *other;
+ bool cmp;
+
+ fdpath = fdreadlink(fd);
+ dfdpath = fdreadlink(dfd);
+
+ if (!path)
+ E_asprintf(&other, "%s", dfdpath);
+ else if (*path == '/')
+ E_asprintf(&other, "%s", path);
+ else
+ E_asprintf(&other, "%s/%s", dfdpath, path);
+
+ cmp = !strcmp(fdpath, other);
+
+ free(fdpath);
+ free(dfdpath);
+ free(other);
+ return cmp;
+}
+
+bool openat2_supported = false;
+
+void __attribute__((constructor)) init(void)
+{
+ struct open_how how = {};
+ int fd;
+
+ BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+
+ /* Check openat2(2) support. */
+ fd = sys_openat2(AT_FDCWD, ".", &how);
+ openat2_supported = (fd >= 0);
+
+ if (fd >= 0)
+ close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..43ca5ceab6e3
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <errno.h>
+#include <linux/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); })))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+ __aligned_u64 flags;
+ __u16 mode;
+ __u16 __padding[3]; /* must be zeroed */
+ __aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+
+bool needs_openat2(const struct open_how *how);
+
+#ifndef RESOLVE_IN_ROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#define E_func(func, ...) \
+ do { \
+ if (func(__VA_ARGS__) < 0) \
+ ksft_exit_fail_msg("%s:%d %s failed\n", \
+ __FILE__, __LINE__, #func);\
+ } while (0)
+
+#define E_asprintf(...) E_func(asprintf, __VA_ARGS__)
+#define E_chmod(...) E_func(chmod, __VA_ARGS__)
+#define E_dup2(...) E_func(dup2, __VA_ARGS__)
+#define E_fchdir(...) E_func(fchdir, __VA_ARGS__)
+#define E_fstatat(...) E_func(fstatat, __VA_ARGS__)
+#define E_kill(...) E_func(kill, __VA_ARGS__)
+#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__)
+#define E_mount(...) E_func(mount, __VA_ARGS__)
+#define E_prctl(...) E_func(prctl, __VA_ARGS__)
+#define E_readlink(...) E_func(readlink, __VA_ARGS__)
+#define E_setresuid(...) E_func(setresuid, __VA_ARGS__)
+#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__)
+#define E_touchat(...) E_func(touchat, __VA_ARGS__)
+#define E_unshare(...) E_func(unshare, __VA_ARGS__)
+
+#define E_assert(expr, msg, ...) \
+ do { \
+ if (!(expr)) \
+ ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+ __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+ } while (0)
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size);
+int sys_openat2(int dfd, const char *path, struct open_how *how);
+int sys_openat(int dfd, const char *path, struct open_how *how);
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+extern bool openat2_supported;
+
+#endif /* __RESOLVEAT_H__ */
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
new file mode 100644
index 000000000000..0b64fedc008b
--- /dev/null
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * O_LARGEFILE is set to 0 by glibc.
+ * XXX: This is wrong on {mips, parisc, powerpc, sparc}.
+ */
+#undef O_LARGEFILE
+#define O_LARGEFILE 0x8000
+
+struct open_how_ext {
+ struct open_how inner;
+ uint32_t extra1;
+ char pad1[128];
+ uint32_t extra2;
+ char pad2[128];
+ uint32_t extra3;
+};
+
+struct struct_test {
+ const char *name;
+ struct open_how_ext arg;
+ size_t size;
+ int err;
+};
+
+#define NUM_OPENAT2_STRUCT_TESTS 10
+#define NUM_OPENAT2_STRUCT_VARIATIONS 13
+
+void test_openat2_struct(void)
+{
+ int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 };
+
+ struct struct_test tests[] = {
+ /* Normal struct. */
+ { .name = "normal struct",
+ .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how) },
+ /* Bigger struct, with zeroed out end. */
+ { .name = "bigger struct (zeroed out)",
+ .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how_ext) },
+
+ /* Normal struct with broken padding. */
+ { .name = "normal struct (non-zero padding[0])",
+ .arg.inner.flags = O_RDONLY,
+ .arg.inner.__padding = {0xa0, 0x00, 0x00},
+ .size = sizeof(struct open_how_ext), .err = -EINVAL },
+ { .name = "normal struct (non-zero padding[1])",
+ .arg.inner.flags = O_RDONLY,
+ .arg.inner.__padding = {0x00, 0x1a, 0x00},
+ .size = sizeof(struct open_how_ext), .err = -EINVAL },
+ { .name = "normal struct (non-zero padding[2])",
+ .arg.inner.flags = O_RDONLY,
+ .arg.inner.__padding = {0x00, 0x00, 0xef},
+ .size = sizeof(struct open_how_ext), .err = -EINVAL },
+
+ /* TODO: Once expanded, check zero-padding. */
+
+ /* Smaller than version-0 struct. */
+ { .name = "zero-sized 'struct'",
+ .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL },
+ { .name = "smaller-than-v0 struct",
+ .arg.inner.flags = O_RDONLY,
+ .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL },
+
+ /* Bigger struct, with non-zero trailing bytes. */
+ { .name = "bigger struct (non-zero data in first 'future field')",
+ .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .name = "bigger struct (non-zero data in middle of 'future fields')",
+ .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .name = "bigger struct (non-zero data at end of 'future fields')",
+ .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS);
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ struct struct_test *test = &tests[i];
+ struct open_how_ext how_ext = test->arg;
+
+ for (int j = 0; j < ARRAY_LEN(misalignments); j++) {
+ int fd, misalign = misalignments[j];
+ char *fdpath = NULL;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ void *copy = NULL, *how_copy = &how_ext;
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ if (misalign) {
+ /*
+ * Explicitly misalign the structure by copying it with the given
+ * (mis)alignment offset. The other data is set to be non-zero to
+ * make sure that non-zero bytes outside the struct aren't checked.
+ *
+ * This is effectively to check that is_zeroed_user() works.
+ */
+ copy = malloc(misalign + sizeof(how_ext));
+ how_copy = copy + misalign;
+ memset(copy, 0xff, misalign);
+ memcpy(how_copy, &how_ext, sizeof(how_ext));
+ }
+
+ fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size);
+ if (test->err >= 0)
+ failed = (fd < 0);
+ else
+ failed = (fd != test->err);
+ if (fd >= 0) {
+ fdpath = fdreadlink(fd);
+ close(fd);
+ }
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s']\n", fd, fdpath);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->err >= 0)
+ resultfn("openat2 with %s argument [misalign=%d] succeeds\n",
+ test->name, misalign);
+ else
+ resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n",
+ test->name, misalign, test->err,
+ strerror(-test->err));
+
+ free(copy);
+ free(fdpath);
+ fflush(stdout);
+ }
+ }
+}
+
+struct flag_test {
+ const char *name;
+ struct open_how how;
+ int err;
+};
+
+#define NUM_OPENAT2_FLAG_TESTS 21
+
+void test_openat2_flags(void)
+{
+ struct flag_test tests[] = {
+ /* O_TMPFILE is incompatible with O_PATH and O_CREAT. */
+ { .name = "incompatible flags (O_TMPFILE | O_PATH)",
+ .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL },
+ { .name = "incompatible flags (O_TMPFILE | O_CREAT)",
+ .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL },
+
+ /* O_PATH only permits certain other flags to be set ... */
+ { .name = "compatible flags (O_PATH | O_CLOEXEC)",
+ .how.flags = O_PATH | O_CLOEXEC },
+ { .name = "compatible flags (O_PATH | O_DIRECTORY)",
+ .how.flags = O_PATH | O_DIRECTORY },
+ { .name = "compatible flags (O_PATH | O_NOFOLLOW)",
+ .how.flags = O_PATH | O_NOFOLLOW },
+ /* ... and others are absolutely not permitted. */
+ { .name = "incompatible flags (O_PATH | O_RDWR)",
+ .how.flags = O_PATH | O_RDWR, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_CREAT)",
+ .how.flags = O_PATH | O_CREAT, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_EXCL)",
+ .how.flags = O_PATH | O_EXCL, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_NOCTTY)",
+ .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_DIRECT)",
+ .how.flags = O_PATH | O_DIRECT, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_LARGEFILE)",
+ .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL },
+
+ /* ->mode must only be set with O_{CREAT,TMPFILE}. */
+ { .name = "non-zero how.mode and O_RDONLY",
+ .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL },
+ { .name = "non-zero how.mode and O_PATH",
+ .how.flags = O_PATH, .how.mode = 0600, .err = -EINVAL },
+ { .name = "valid how.mode and O_CREAT",
+ .how.flags = O_CREAT, .how.mode = 0600 },
+ { .name = "valid how.mode and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 },
+ /* ->mode must only contain 07777 bits. */
+ { .name = "invalid how.mode and O_CREAT",
+ .how.flags = O_CREAT,
+ .how.mode = 0xFFFF, .err = -EINVAL },
+ { .name = "invalid how.mode and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR,
+ .how.mode = 0x1337, .err = -EINVAL },
+
+ /* ->resolve must only contain RESOLVE_* flags. */
+ { .name = "invalid how.resolve and O_RDONLY",
+ .how.flags = O_RDONLY,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_CREAT",
+ .how.flags = O_CREAT,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_PATH",
+ .how.flags = O_PATH,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int fd, fdflags = -1;
+ char *path, *fdpath = NULL;
+ bool failed = false;
+ struct flag_test *test = &tests[i];
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : ".";
+ unlink(path);
+
+ fd = sys_openat2(AT_FDCWD, path, &test->how);
+ if (test->err >= 0)
+ failed = (fd < 0);
+ else
+ failed = (fd != test->err);
+ if (fd >= 0) {
+ int otherflags;
+
+ fdpath = fdreadlink(fd);
+ fdflags = fcntl(fd, F_GETFL);
+ otherflags = fcntl(fd, F_GETFD);
+ close(fd);
+
+ E_assert(fdflags >= 0, "fcntl F_GETFL of new fd");
+ E_assert(otherflags >= 0, "fcntl F_GETFD of new fd");
+
+ /* O_CLOEXEC isn't shown in F_GETFL. */
+ if (otherflags & FD_CLOEXEC)
+ fdflags |= O_CLOEXEC;
+ /* O_CREAT is hidden from F_GETFL. */
+ if (test->how.flags & O_CREAT)
+ fdflags |= O_CREAT;
+ if (!(test->how.flags & O_LARGEFILE))
+ fdflags &= ~O_LARGEFILE;
+ failed |= (fdflags != test->how.flags);
+ }
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s'] with %X (!= %X)\n",
+ fd, fdpath, fdflags,
+ test->how.flags);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->err >= 0)
+ resultfn("openat2 with %s succeeds\n", test->name);
+ else
+ resultfn("openat2 with %s fails with %d (%s)\n",
+ test->name, test->err, strerror(-test->err));
+
+ free(fdpath);
+ fflush(stdout);
+ }
+}
+
+#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
+ NUM_OPENAT2_FLAG_TESTS)
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ test_openat2_struct();
+ test_openat2_flags();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..0a770728b436
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * | `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+ int dfd;
+ char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ E_mkdirat(dfd, "a", 0755);
+ E_mkdirat(dfd, "b", 0755);
+ E_mkdirat(dfd, "a/c", 0755);
+
+ return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+ pid_t child = fork();
+ if (child != 0)
+ return child;
+
+ /* If the parent (the test process) dies, kill ourselves too. */
+ E_prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+ /* Swap @a and @b. */
+ for (;;)
+ renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+ exit(1);
+}
+
+#define NUM_RENAME_TESTS 2
+#define ROUNDS 400000
+
+const char *flagname(int resolve)
+{
+ switch (resolve) {
+ case RESOLVE_IN_ROOT:
+ return "RESOLVE_IN_ROOT";
+ case RESOLVE_BENEATH:
+ return "RESOLVE_BENEATH";
+ }
+ return "(unknown)";
+}
+
+void test_rename_attack(int resolve)
+{
+ int dfd, afd;
+ pid_t child;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0;
+
+ struct open_how how = {
+ .flags = O_PATH,
+ .resolve = resolve,
+ };
+
+ if (!openat2_supported) {
+ how.resolve = 0;
+ ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n");
+ }
+
+ dfd = setup_testdir();
+ afd = openat(dfd, "a", O_PATH);
+ if (afd < 0)
+ ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+ child = spawn_attack(dfd, "a/c", "b");
+
+ for (int i = 0; i < ROUNDS; i++) {
+ int fd;
+ char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+ if (openat2_supported)
+ fd = sys_openat2(afd, victim_path, &how);
+ else
+ fd = sys_openat(afd, victim_path, &how);
+
+ if (fd < 0) {
+ if (fd == -EAGAIN)
+ eagains++;
+ else if (fd == -EXDEV)
+ exdevs++;
+ else if (fd == -ENOENT)
+ escapes++; /* escaped outside and got ENOENT... */
+ else
+ other_errs++; /* unexpected error */
+ } else {
+ if (fdequal(fd, afd, NULL))
+ successes++;
+ else
+ escapes++; /* we got an unexpected fd */
+ }
+ close(fd);
+ }
+
+ if (escapes > 0)
+ resultfn = ksft_test_result_fail;
+ ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n",
+ eagains, exdevs, other_errs, successes);
+ resultfn("rename attack with %s (%d runs, got %d escapes)\n",
+ flagname(resolve), ROUNDS, escapes);
+
+ /* Should be killed anyway, but might as well make sure. */
+ E_kill(child, SIGKILL);
+}
+
+#define NUM_TESTS NUM_RENAME_TESTS
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ test_rename_attack(RESOLVE_BENEATH);
+ test_rename_attack(RESOLVE_IN_ROOT);
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..7a94b1da8e7b
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,523 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * | |-- self -> ../mnt/
+ * | `-- absself -> /mnt/
+ * |-- etc/
+ * | `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- reletc -> etc/
+ * |-- relsym -> etc/passwd
+ * |-- absetc -> /etc/
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ * |-- absself -> /
+ * |-- self -> ../../root/
+ * |-- garbageself -> /../../root/
+ * |-- passwd -> ../cheeky/../etc/../etc/passwd
+ * |-- abspasswd -> /../cheeky/../etc/../etc/passwd
+ * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+ int dfd, tmpfd;
+ char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+ /* Unshare the mount namespace and make /tmp a private mount. */
+ E_unshare(CLONE_NEWNS);
+ E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ /* A sub-directory which is actually used for tests. */
+ E_mkdirat(dfd, "root", 0755);
+ tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+ if (tmpfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open 'root' subdir\n");
+ close(dfd);
+ dfd = tmpfd;
+
+ E_symlinkat("/proc/self/exe", dfd, "procexe");
+ E_symlinkat("/proc/self/root", dfd, "procroot");
+ E_mkdirat(dfd, "root", 0755);
+
+ /* There is no mountat(2), so use chdir. */
+ E_mkdirat(dfd, "mnt", 0755);
+ E_fchdir(dfd);
+ E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+ E_symlinkat("../mnt/", dfd, "mnt/self");
+ E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+ E_mkdirat(dfd, "etc", 0755);
+ E_touchat(dfd, "etc/passwd");
+
+ E_symlinkat("/newfile3", dfd, "creatlink");
+ E_symlinkat("etc/", dfd, "reletc");
+ E_symlinkat("etc/passwd", dfd, "relsym");
+ E_symlinkat("/etc/", dfd, "absetc");
+ E_symlinkat("/etc/passwd", dfd, "abssym");
+ E_symlinkat("/cheeky", dfd, "abscheeky");
+
+ E_mkdirat(dfd, "cheeky", 0755);
+
+ E_symlinkat("/", dfd, "cheeky/absself");
+ E_symlinkat("../../root/", dfd, "cheeky/self");
+ E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+ E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+ E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+ E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/dotdotlink");
+ E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/garbagelink");
+
+ return dfd;
+}
+
+struct basic_test {
+ const char *name;
+ const char *dir;
+ const char *path;
+ struct open_how how;
+ bool pass;
+ union {
+ int err;
+ const char *path;
+ } out;
+};
+
+#define NUM_OPENAT2_OPATH_TESTS 88
+
+void test_openat2_opath_tests(void)
+{
+ int rootfd, hardcoded_fd;
+ char *procselfexe, *hardcoded_fdpath;
+
+ E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+ rootfd = setup_testdir();
+
+ hardcoded_fd = open("/dev/null", O_RDONLY);
+ E_assert(hardcoded_fd >= 0, "open fd to hardcode");
+ E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd);
+
+ struct basic_test tests[] = {
+ /** RESOLVE_BENEATH **/
+ /* Attempts to cross dirfd should be blocked. */
+ { .name = "[beneath] jump to /",
+ .path = "/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute link to $root",
+ .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained absolute links to $root",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] jump outside $root",
+ .path = "..", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] temporary jump outside $root",
+ .path = "../root/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] symlink temporary jump outside $root",
+ .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained symlink temporary jump outside $root",
+ .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] garbage links to $root",
+ .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained garbage links to $root",
+ .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Only relative paths that stay inside dirfd should work. */
+ { .name = "[beneath] ordinary path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "root", .pass = true },
+ { .name = "[beneath] ordinary path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc", .pass = true },
+ { .name = "[beneath] ordinary path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] relative symlink inside $root",
+ .path = "relsym", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] chained-'..' relative symlink inside $root",
+ .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] absolute symlink component outside $root",
+ .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute symlink target outside $root",
+ .path = "abssym", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute path outside $root",
+ .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] cheeky absolute path outside $root",
+ .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained cheeky absolute path outside $root",
+ .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Tricky paths should fail. */
+ { .name = "[beneath] tricky '..'-chained symlink outside $root",
+ .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky absolute + '..'-chained symlink outside $root",
+ .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky garbage link outside $root",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky absolute + garbage link outside $root",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_IN_ROOT **/
+ /* All attempts to cross the dirfd will be scoped-to-root. */
+ { .name = "[in_root] jump to /",
+ .path = "/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] absolute symlink to /root",
+ .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] chained absolute symlinks to /root",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] '..' at root",
+ .path = "..", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] '../root' at root",
+ .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative symlink containing '..' above root",
+ .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] garbage link to /root",
+ .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] chained garbage links to /root",
+ .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc", .pass = true },
+ { .name = "[in_root] relative path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] relative symlink to 'etc/passwd'",
+ .path = "relsym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'",
+ .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained-'..' absolute + relative symlink to 'etc/passwd'",
+ .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] absolute symlink to 'etc/passwd'",
+ .path = "abssym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] absolute path 'etc/passwd'",
+ .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] cheeky absolute path 'etc/passwd'",
+ .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained cheeky absolute path 'etc/passwd'",
+ .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky '..'-chained symlink outside $root",
+ .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute + '..'-chained symlink outside $root",
+ .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root",
+ .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky garbage link outside $root",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute + garbage link outside $root",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute path + absolute + garbage link outside $root",
+ .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ /* O_CREAT should handle trailing symlinks correctly. */
+ { .name = "[in_root] O_CREAT of relative path inside $root",
+ .path = "newfile1", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile1", .pass = true },
+ { .name = "[in_root] O_CREAT of absolute path",
+ .path = "/newfile2", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile2", .pass = true },
+ { .name = "[in_root] O_CREAT of tricky symlink outside root",
+ .path = "/creatlink", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile3", .pass = true },
+
+ /** RESOLVE_NO_XDEV **/
+ /* Crossing *down* into a mountpoint is disallowed. */
+ { .name = "[no_xdev] cross into $mnt",
+ .path = "mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross into $mnt/",
+ .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross into $mnt/.",
+ .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Crossing *up* out of a mountpoint is disallowed. */
+ { .name = "[no_xdev] goto mountpoint root",
+ .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "mnt", .pass = true },
+ { .name = "[no_xdev] cross up through '..'",
+ .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary cross up through '..'",
+ .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary relative symlink cross up",
+ .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary absolute symlink cross up",
+ .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Jumping to "/" is ok, but later components cannot cross. */
+ { .name = "[no_xdev] jump to / directly",
+ .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .name = "[no_xdev] jump to / (from /) directly",
+ .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .name = "[no_xdev] jump to / then proc",
+ .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] jump to / then tmp",
+ .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Magic-links are blocked since they can switch vfsmounts. */
+ { .name = "[no_xdev] cross through magic-link to self/root",
+ .dir = "/proc", .path = "self/root", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross through magic-link to self/cwd",
+ .dir = "/proc", .path = "self/cwd", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Except magic-link jumps inside the same vfsmount. */
+ { .name = "[no_xdev] jump through magic-link to same procfs",
+ .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/proc", .pass = true, },
+
+ /** RESOLVE_NO_MAGICLINKS **/
+ /* Regular symlinks should work. */
+ { .name = "[no_magiclinks] ordinary relative symlink",
+ .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Magic-links should not work. */
+ { .name = "[no_magiclinks] symlink to magic-link",
+ .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] normal path to magic-link",
+ .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW",
+ .path = "/proc/self/exe", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = procselfexe, .pass = true },
+ { .name = "[no_magiclinks] symlink to magic-link path component",
+ .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] magic-link path component",
+ .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW",
+ .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+
+ /** RESOLVE_NO_SYMLINKS **/
+ /* Normal paths should work. */
+ { .name = "[no_symlinks] ordinary path to '.'",
+ .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = NULL, .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "root", .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc", .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Regular symlinks are blocked. */
+ { .name = "[no_symlinks] relative symlink target",
+ .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] relative symlink component",
+ .path = "reletc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] absolute symlink target",
+ .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] absolute symlink component",
+ .path = "absetc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky garbage link",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky absolute + garbage link",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky absolute + absolute symlink",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ /* Trailing symlinks with O_NOFOLLOW. */
+ { .name = "[no_symlinks] relative symlink with O_NOFOLLOW",
+ .path = "relsym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "relsym", .pass = true },
+ { .name = "[no_symlinks] absolute symlink with O_NOFOLLOW",
+ .path = "abssym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "abssym", .pass = true },
+ { .name = "[no_symlinks] trailing symlink with O_NOFOLLOW",
+ .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "cheeky/garbagelink", .pass = true },
+ { .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW",
+ .path = "abscheeky/absself", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW",
+ .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int dfd, fd;
+ char *fdpath = NULL;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct basic_test *test = &tests[i];
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ /* Auto-set O_PATH. */
+ if (!(test->how.flags & O_CREAT))
+ test->how.flags |= O_PATH;
+
+ if (test->dir)
+ dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+ else
+ dfd = dup(rootfd);
+ E_assert(dfd >= 0, "failed to openat root '%s': %m", test->dir);
+
+ E_dup2(dfd, hardcoded_fd);
+
+ fd = sys_openat2(dfd, test->path, &test->how);
+ if (test->pass)
+ failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+ else
+ failed = (fd != test->out.err);
+ if (fd >= 0) {
+ fdpath = fdreadlink(fd);
+ close(fd);
+ }
+ close(dfd);
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s']\n", fd, fdpath);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->pass)
+ resultfn("%s gives path '%s'\n", test->name,
+ test->out.path ?: ".");
+ else
+ resultfn("%s fails with %d (%s)\n", test->name,
+ test->out.err, strerror(-test->out.err));
+
+ fflush(stdout);
+ free(fdpath);
+ }
+
+ free(procselfexe);
+ close(rootfd);
+
+ free(hardcoded_fdpath);
+ close(hardcoded_fd);
+}
+
+#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ /* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+ if (geteuid() != 0)
+ ksft_exit_skip("all tests require euid == 0\n");
+
+ test_openat2_opath_tests();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
--
2.24.0
Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(in the case of LOOKUP_BENEATH the resolution will still fail if ".."
resolution would resolve a path outside of the root -- while
LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
still disallowed entirely[*].
As Jann explains[1,2], this patch (and the original no-".." restriction)
is needed because there is a fairly easy-to-exploit race condition with
chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH if
".." is allowed) where a rename(2) of a path can be used to "skip over"
nd->root and thus escape to the filesystem above nd->root.
  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.
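As a rough illustration of the check described above, here is a minimal
userspace sketch of the idea: sample two sequence counters when the walk
starts, and after each ".." step report -EAGAIN if either has moved.
Every name in it (model_seqcount, model_walk, model_check_dotdot and
friends) is hypothetical and it uses C11 atomics; it is a model of the
technique, not the kernel's seqcount implementation or the code in this
patch.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel seqcount (not the kernel API). */
struct model_seqcount {
	atomic_uint sequence;
};

/* Writer side: bump the counter before and after a rename/mount-style update. */
static void model_write(struct model_seqcount *s)
{
	/* Counter becomes odd: an update is in progress. */
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
	/* ... the directory tree would be modified here ... */
	/* Counter becomes even again: the update has finished. */
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
}

/* Reader side: remember the counter value at the start of the walk. */
static unsigned model_read_begin(struct model_seqcount *s)
{
	return atomic_load_explicit(&s->sequence, memory_order_acquire);
}

/* Reader side: true if any writer ran since @start was sampled. */
static bool model_read_retry(struct model_seqcount *s, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

/* Per-walk state, loosely mirroring the m_seq/r_seq pair in nameidata. */
struct model_walk {
	struct model_seqcount *mounts, *renames;
	unsigned m_seq, r_seq;
};

/* Sample both counters once, before any path component is resolved. */
static void model_walk_init(struct model_walk *w,
			    struct model_seqcount *mounts,
			    struct model_seqcount *renames)
{
	w->mounts = mounts;
	w->renames = renames;
	w->m_seq = model_read_begin(mounts);
	w->r_seq = model_read_begin(renames);
}

/* To be called after each ".." step of a scoped walk. */
static int model_check_dotdot(const struct model_walk *w)
{
	if (model_read_retry(w->mounts, w->m_seq))
		return -EAGAIN;
	if (model_read_retry(w->renames, w->r_seq))
		return -EAGAIN;
	return 0;
}
```

Note that a walk which sees -EAGAIN in this model has only learned that
some rename or mount happened since it started, not that its own path
was affected, so unrelated updates also force a retry.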
The primary reason for deferring to userspace with -EAGAIN is that an
in-kernel retry loop (or doing a path_is_under() check after re-taking
the relevant seqlocks) can become unreasonably expensive on machines
with lots of VFS activity (nfsd can cause lots of rename_lock updates).
Thus it should be up to userspace how many times they wish to retry the
lookup -- the selftests for this attack indicate that there is a ~35%
chance of the lookup succeeding on the first try even with an attacker
thrashing rename_lock.
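One way userspace might apply such a retry policy is sketched below,
assuming a bounded number of attempts is acceptable. The helper and
callback names (lookup_with_retries, lookup_fn, flaky_lookup) are made
up for this example; a real caller would pass a callback that wraps
openat2(2) and returns a negative errno on failure, whereas flaky_lookup
only simulates a lookup that loses the race a few times.

```c
#include <errno.h>

/* Hypothetical callback type: returns an fd (>= 0) or a negative errno. */
typedef int (*lookup_fn)(void *ctx);

/*
 * Call @lookup up to @max_tries times, retrying only on -EAGAIN.
 * Any other result (success or a different error) is returned at once.
 */
static int lookup_with_retries(lookup_fn lookup, void *ctx, int max_tries)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_tries; i++) {
		ret = lookup(ctx);
		if (ret != -EAGAIN)
			break;
	}
	return ret;
}

/*
 * Stand-in for a raced lookup: fails with -EAGAIN as many times as the
 * int pointed to by @ctx says, then pretends to return fd 7.
 */
static int flaky_lookup(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 7;
}
```

Capping the number of attempts keeps the caller from spinning forever
when an attacker is continuously thrashing rename_lock; what to do once
the cap is hit (abort, or fall back to some other strategy) is left to
the caller.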
A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.
[*] It may be acceptable in the future to do a path_is_under() check for
magic-links after they are resolved. However this seems unlikely to
be a feature that people *really* need -- it can be added later if
it turns out a lot of people want it.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
Cc: Christian Brauner <[email protected]>
Suggested-by: Jann Horn <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 43 +++++++++++++++++++++++++++----------------
1 file changed, 27 insertions(+), 16 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 69cd0b296e8e..9f60c3f49b8b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1791,22 +1791,31 @@ static inline int handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
int error = 0;
- /*
- * Scoped-lookup flags resolving ".." is not currently safe --
- * races can cause our parent to have moved outside of the root
- * and us to skip over it.
- */
- if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
- return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
- if (nd->flags & LOOKUP_RCU) {
- return follow_dotdot_rcu(nd);
- } else
- return follow_dotdot(nd);
+ if (nd->flags & LOOKUP_RCU)
+ error = follow_dotdot_rcu(nd);
+ else
+ error = follow_dotdot(nd);
+ if (error)
+ return error;
+
+ if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+ /*
+ * If there was a racing rename or mount along our
+ * path, then we can't be sure that ".." hasn't jumped
+ * above nd->root (and so userspace should retry or use
+ * some fallback).
+ */
+ smp_rmb();
+ if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq)))
+ return -EAGAIN;
+ if (unlikely(__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq)))
+ return -EAGAIN;
+ }
}
return 0;
}
@@ -2276,6 +2285,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+
+ nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
+ nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
+ smp_rmb();
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2284,9 +2298,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path = nd->root;
nd->inode = inode;
if (flags & LOOKUP_RCU) {
- nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+ nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
nd->root_seq = nd->seq;
- nd->m_seq = read_seqbegin(&mount_lock);
} else {
path_get(&nd->path);
}
@@ -2297,8 +2310,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock);
-
/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
error = nd_jump_root(nd);
--
2.24.0
Now that we have new LOOKUP flags, we should document them in the
relevant path-walking documentation. And now that we've settled on a
common name for nd_jump_link() style symlinks ("magic links"), use that
term where magic-link semantics are described.
Signed-off-by: Aleksa Sarai <[email protected]>
---
Documentation/filesystems/path-lookup.rst | 68 +++++++++++++++++++++--
1 file changed, 62 insertions(+), 6 deletions(-)
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a3216979298b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -13,6 +13,7 @@ It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
+- ``openat2()`` resolution restriction flags.
Introduction to pathname lookup
===============================
@@ -235,6 +236,13 @@ renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
+``rename_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``rename_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
inode->i_rwsem
~~~~~~~~~~~~~~
@@ -348,6 +356,13 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
+``mount_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``mount_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
RCU
~~~
@@ -405,6 +420,10 @@ is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
with a ``chroot()`` system call.
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
The root is needed when either of two conditions holds: (1) either the
pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1168,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
the stack frame discarded.
The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
$ ls -l /proc/self/fd/1
lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1286,7 +1305,9 @@ A few flags
A suitable way to wrap up this tour of pathname walking is to list
the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
-component, others reflect the current state of the pathname lookup.
+component, others reflect the current state of the pathname lookup, and some
+apply restrictions to all path components encountered in the path lookup.
+
And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
@@ -1310,13 +1331,48 @@ longer needed.
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
+Resolution-restriction flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to allow userspace to protect itself against certain race conditions
+and attack scenarios involving changing path components, a series of flags are
+available which apply restrictions to all path components encountered during
+path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
+
+``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
+This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
+relates to restricting the following of trailing symlinks.
+
+``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
+ensure that they return errors from ``nd_jump_link()``, because that is how
+``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
+
+``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
+bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
+lookup is determined by the first mountpoint the path lookup reaches --
+absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
+with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
+``vfsmount`` of the path is unchanged.
+
+``LOOKUP_BENEATH`` blocks any path components which resolve outside the
+starting point of the resolution. This is done by blocking ``nd_jump_root()``
+as well as blocking ".." if it would jump outside the starting point.
+``rename_lock`` and ``mount_lock`` are used to detect attacks against the
+resolution of "..". Magic-links are also blocked.
+
+``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to
+the starting point, and ".." at the starting point will act as a no-op. As with
+``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
+attacks against ".." resolution. Magic-links are also blocked.
+
Final-component flags
~~~~~~~~~~~~~~~~~~~~~
--
2.24.0
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64 bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
__padding
Must be set to all zeroes.
resolve
Restricts path resolution (unlike O_* flags, these affect all
path components). The current set of flags is as follows (at the
moment, each RESOLVE_ flag is implemented by simply passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
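The field rules above can be approximated in userspace as follows. This
is only a sketch of the validation described here, not the kernel's
build_open_flags(): open_how_sketch, check_how_sketch() and the
*_SKETCH masks are hypothetical names, the flag mask is assembled from
the O_* constants that libc exposes, and O_TMPFILE's extra requirements
(such as write access) are omitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for O_PATH, O_TMPFILE, O_DIRECT, O_NOATIME */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Local mirror of struct open_how as laid out in this patch. */
struct open_how_sketch {
	uint64_t flags;
	uint16_t mode;
	uint16_t padding[3];
	uint64_t resolve;
};

/* The five RESOLVE_* bits defined by this patch (0x01 through 0x10). */
#define RESOLVE_MASK_SKETCH	0x1fULL
#define MODE_MASK_SKETCH	07777
/* The bit of O_TMPFILE that is not O_DIRECTORY. */
#define TMPFILE_BIT_SKETCH	(O_TMPFILE & ~O_DIRECTORY)
#define OPEN_MASK_SKETCH \
	(O_ACCMODE | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | O_APPEND | \
	 O_NONBLOCK | O_SYNC | O_DSYNC | O_ASYNC | O_DIRECT | O_DIRECTORY | \
	 O_NOFOLLOW | O_NOATIME | O_CLOEXEC | O_PATH | O_TMPFILE)
#define PATH_MASK_SKETCH	(O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)

/* Returns 0 if the arguments look acceptable, -EINVAL otherwise. */
int check_how_sketch(const struct open_how_sketch *how)
{
	assert(how);
	/* Unknown flag or resolve bits are rejected, never ignored. */
	if (how->flags & ~(uint64_t)OPEN_MASK_SKETCH)
		return -EINVAL;
	if (how->resolve & ~RESOLVE_MASK_SKETCH)
		return -EINVAL;
	/* Padding must be all zeroes. */
	for (int i = 0; i < 3; i++)
		if (how->padding[i] != 0)
			return -EINVAL;
	/* A mode is only meaningful when a file may be created. */
	if (how->flags & (O_CREAT | TMPFILE_BIT_SKETCH)) {
		if (how->mode & ~MODE_MASK_SKETCH)
			return -EINVAL;
	} else if (how->mode != 0) {
		return -EINVAL;
	}
	/* O_PATH only permits a small set of companion flags. */
	if ((how->flags & O_PATH) &&
	    (how->flags & ~(uint64_t)PATH_MASK_SKETCH))
		return -EINVAL;
	return 0;
}
```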
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it).
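The size handling relies on copy_struct_from_user(). Its behaviour can
be modelled in userspace roughly as below; copy_struct_sketch() is a
hypothetical name, both buffers are ordinary memory here, and the
separate "usize smaller than the first published size" check done by
openat2(2) itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the extensible-struct copy:
 *  - older userspace (usize < ksize): the missing tail is zero-filled,
 *    so new fields take their no-op default;
 *  - newer userspace (usize > ksize): accepted only if every byte the
 *    kernel does not know about is zero, otherwise -E2BIG.
 */
int copy_struct_sketch(void *dst, size_t ksize,
		       const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t size = usize < ksize ? usize : ksize;

	assert(dst && src);
	for (size_t i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

One way userspace might discover the kernel's struct size without an
embedded size field is to pass a larger buffer with non-zero trailing
bytes and observe whether -E2BIG comes back.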
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
CREDITS | 4 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 149 +++++++++++++++-----
include/linux/fcntl.h | 12 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 35 +++++
24 files changed, 190 insertions(+), 38 deletions(-)
diff --git a/CREDITS b/CREDITS
index 031605d46b4d..a048e001d726 100644
--- a/CREDITS
+++ b/CREDITS
@@ -3301,7 +3301,9 @@ S: France
N: Aleksa Sarai
E: [email protected]
W: https://www.cyphar.com/
-D: `pids` cgroup subsystem
+D: /sys/fs/cgroup/pids
+D: openat2(2)
+S: Sydney, Australia
N: Dipankar Sarma
E: [email protected]
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+547 common openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 438
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..57f6f592d460 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..8d36f2e2dc89 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..2559925f1924 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..c04385e60833 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..68c9ec06851f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
+437 n32 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..42a72d010050 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
+437 n64 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..f114c4aed0ed 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
+437 o32 openat2 sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..b550ae9a7fea 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+437 common openat2 sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..a8b5ecb5b602 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
+437 common openat2 sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..16b571c06161 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
+437 common openat2 sys_openat2 sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..a7185cc18626 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..b11c19552022 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3fe02546aed3..e5c022e9a5c4 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+437 i386 openat2 sys_openat2 __ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..9035647ef236 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+437 common openat2 __x64_sys_openat2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..f0a68013c038 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..50a46501bcc9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -955,48 +955,86 @@ struct file *open_with_fake_path(const struct path *path, int flags,
}
EXPORT_SYMBOL(open_with_fake_path);
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE))
+#define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
+
+static inline struct open_how build_open_how(int flags, umode_t mode)
+{
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = mode & S_IALLUGO,
+ };
+
+ /* O_PATH beats everything else. */
+ if (how.flags & O_PATH)
+ how.flags &= O_PATH_FLAGS;
+ /* Modes should only be set for create-like flags. */
+ if (!WILL_CREATE(how.flags))
+ how.mode = 0;
+ return how;
+}
+
+static inline int build_open_flags(const struct open_how *how,
+ struct open_flags *op)
{
+ int flags = how->flags;
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
+ /* Must never be set by userspace */
+ flags &= ~(FMODE_NONOTIFY | O_CLOEXEC);
+
/*
- * Clear out all open flags we don't know about so that we don't report
- * them in fcntl(F_GETFD) or similar interfaces.
+ * Older syscalls implicitly clear all of the invalid flags or argument
+ * values before calling build_open_flags(), but openat2(2) checks all
+ * of its arguments.
*/
- flags &= VALID_OPEN_FLAGS;
+ if (flags & ~VALID_OPEN_FLAGS)
+ return -EINVAL;
+ if (how->resolve & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;
+ if (memchr_inv(how->__padding, 0, sizeof(how->__padding)))
+ return -EINVAL;
- if (flags & (O_CREAT | __O_TMPFILE))
- op->mode = (mode & S_IALLUGO) | S_IFREG;
- else
+ /* Deal with the mode. */
+ if (WILL_CREATE(flags)) {
+ if (how->mode & ~S_IALLUGO)
+ return -EINVAL;
+ op->mode = how->mode | S_IFREG;
+ } else {
+ if (how->mode != 0)
+ return -EINVAL;
op->mode = 0;
-
- /* Must never be set by userspace */
- flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
+ }
/*
- * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
- * check for O_DSYNC if the need any syncing at all we enforce it's
- * always set instead of having to deal with possibly weird behaviour
- * for malicious applications setting only __O_SYNC.
+ * In order to ensure programs get explicit errors when trying to use
+ * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
+ * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
+ * have to require userspace to explicitly set it.
*/
- if (flags & __O_SYNC)
- flags |= O_DSYNC;
-
if (flags & __O_TMPFILE) {
if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
- } else if (flags & O_PATH) {
- /*
- * If we have O_PATH in the open flag. Then we
- * cannot have anything other than the below set of flags
- */
- flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+ }
+ if (flags & O_PATH) {
+ /* O_PATH only permits certain other flags to be set. */
+ if (flags & ~O_PATH_FLAGS)
+ return -EINVAL;
acc_mode = 0;
}
+ /*
+ * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
+ * check for O_DSYNC if the need any syncing at all we enforce it's
+ * always set instead of having to deal with possibly weird behaviour
+ * for malicious applications setting only __O_SYNC.
+ */
+ if (flags & __O_SYNC)
+ flags |= O_DSYNC;
+
op->open_flag = flags;
/* O_TRUNC implies we need access checks for write permissions */
@@ -1022,6 +1060,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+
+ if (how->resolve & RESOLVE_NO_XDEV)
+ lookup_flags |= LOOKUP_NO_XDEV;
+ if (how->resolve & RESOLVE_NO_MAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (how->resolve & RESOLVE_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (how->resolve & RESOLVE_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (how->resolve & RESOLVE_IN_ROOT)
+ lookup_flags |= LOOKUP_IN_ROOT;
+
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1040,8 +1090,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
struct file *file_open_name(struct filename *name, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
- return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+ struct open_how how = build_open_how(flags, mode);
+ int err = build_open_flags(&how, &op);
+ if (err)
+ return ERR_PTR(err);
+ return do_filp_open(AT_FDCWD, name, &op);
}
/**
@@ -1072,17 +1125,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
const char *filename, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
+ struct open_how how = build_open_how(flags, mode);
+ int err = build_open_flags(&how, &op);
if (err)
return ERR_PTR(err);
return do_file_open_root(dentry, mnt, filename, &op);
}
EXPORT_SYMBOL(file_open_root);
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+static long do_sys_openat2(int dfd, const char __user *filename,
+ struct open_how *how)
{
struct open_flags op;
- int fd = build_open_flags(flags, mode, &op);
+ int fd = build_open_flags(how, &op);
struct filename *tmp;
if (fd)
@@ -1092,7 +1147,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
if (IS_ERR(tmp))
return PTR_ERR(tmp);
- fd = get_unused_fd_flags(flags);
+ fd = get_unused_fd_flags(how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
@@ -1107,12 +1162,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
return fd;
}
-SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
- if (force_o_largefile())
- flags |= O_LARGEFILE;
+ struct open_how how = build_open_how(flags, mode);
+ return do_sys_openat2(dfd, filename, &how);
+}
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+
+SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+{
+ return ksys_open(filename, flags, mode);
}
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
@@ -1120,10 +1179,32 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
{
if (force_o_largefile())
flags |= O_LARGEFILE;
-
return do_sys_open(dfd, filename, flags, mode);
}
+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
+ struct open_how __user *, how, size_t, usize)
+{
+ int err;
+ struct open_how tmp;
+
+ BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST);
+
+ if (unlikely(usize < OPEN_HOW_SIZE_VER0))
+ return -EINVAL;
+
+ err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
+ if (err)
+ return err;
+
+ /* O_LARGEFILE is only allowed for non-O_PATH. */
+ if (!(tmp.flags & O_PATH) && force_o_largefile())
+ tmp.flags |= O_LARGEFILE;
+
+ return do_sys_openat2(dfd, filename, &tmp);
+}
+
#ifdef CONFIG_COMPAT
/*
* Exactly like sys_open(), except that it doesn't set the
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..f2eb05bd3af3 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -2,15 +2,25 @@
#ifndef _LINUX_FCNTL_H
#define _LINUX_FCNTL_H
+#include <linux/stat.h>
#include <uapi/linux/fcntl.h>
-/* list of all valid flags for the open/openat flags argument: */
+/* List of all valid flags for the open/openat flags argument: */
#define VALID_OPEN_FLAGS \
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+ (UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+ RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f7c561c4dcdd..808f103b7a62 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct rseq;
union bpf_attr;
struct io_uring_params;
struct clone_args;
+struct open_how;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+ struct open_how *how, size_t size);
asmlinkage long sys_close(unsigned int fd);
asmlinkage long sys_vhangup(void);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d4122c091472 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..58c3a0e543c6 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,40 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+ __aligned_u64 flags;
+ __u16 mode;
+ __u16 __padding[3]; /* must be zeroed */
+ __aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS). */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.24.0
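For reviewers who want to poke at the new ABI from userspace, the definitions
in the patch above are enough to call the syscall directly. The following is a
minimal sketch, not part of the patchset. Everything prefixed with sketch_ or
SKETCH_ is a hypothetical name local to the example; the structure is a
hand-written mirror of struct open_how using <stdint.h> types (so the example
builds without the patched uapi headers), and the fallback syscall number is
the asm-generic value assigned in this patch, which is an assumption on
architectures that number syscalls differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437	/* assumption: asm-generic number from this patch */
#endif

/* Hypothetical local mirror of struct open_how from the uapi header. */
struct sketch_open_how {
	uint64_t flags;		/* O_* flags */
	uint16_t mode;		/* only with O_CREAT / O_TMPFILE */
	uint16_t padding[3];	/* must be zeroed */
	uint64_t resolve;	/* RESOLVE_* flags */
};

/* Values copied from the how->resolve definitions in the patch. */
#define SKETCH_RESOLVE_NO_MAGICLINKS	0x02
#define SKETCH_RESOLVE_BENEATH		0x08

/* The mirror must have the layout implied by OPEN_HOW_SIZE_VER0. */
static_assert(sizeof(struct sketch_open_how) == 24, "size mismatch");
static_assert(offsetof(struct sketch_open_how, resolve) == 16, "layout mismatch");

/* Thin wrapper; returns a file descriptor, or -1 with errno set. */
static int sketch_openat2(int dirfd, const char *path,
			  const struct sketch_open_how *how, size_t size)
{
	return (int)syscall(__NR_openat2, dirfd, path, how, size);
}

/* Open path read-only, refusing any resolution that leaves dirfd. */
static int sketch_open_beneath(int dirfd, const char *path)
{
	struct sketch_open_how how;
	int fd = -1;
	int tries;

	/* Zero everything first so that the padding is valid. */
	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = SKETCH_RESOLVE_BENEATH | SKETCH_RESOLVE_NO_MAGICLINKS;

	/* EAGAIN means a ".." race was detected; retry a few times. */
	for (tries = 0; tries < 3; tries++) {
		fd = sketch_openat2(dirfd, path, &how, sizeof(how));
		if (fd >= 0 || errno != EAGAIN)
			break;
	}
	return fd;
}
```

Calling sketch_open_beneath(AT_FDCWD, "/etc/passwd") should fail: an absolute
path is rejected under RESOLVE_BENEATH, and on a kernel without this patch the
syscall does not exist at all, so the wrapper returns -1 in both cases.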
Rather than trying to merge the new syscall documentation into open.2
(which would probably result in the man-page being incomprehensible),
instead the new syscall gets its own dedicated page with links between
open(2) and openat2(2) to avoid duplicating information such as the list
of O_* flags or common errors.
Signed-off-by: Aleksa Sarai <[email protected]>
---
man2/open.2 | 17 ++
man2/openat2.2 | 435 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 452 insertions(+)
create mode 100644 man2/openat2.2
diff --git a/man2/open.2 b/man2/open.2
index b0f485b41589..2a721c991a20 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
.BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
.BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
.fi
.PP
.in -4n
@@ -933,6 +937,15 @@ If
is absolute, then
.I dirfd
is ignored.
+.SS openat2(2)
+The
+.BR openat2 (2)
+system call is an extension of
+.BR openat (),
+with a superset of features. To avoid making this man page too long, the
+description of
+.BR openat2 (2)
+and its features is documented in a separate man page.
.SH RETURN VALUE
.BR open (),
.BR openat (),
@@ -1220,6 +1233,9 @@ SVr4, 4.3BSD, POSIX.1-2001, POSIX.1-2008.
.BR openat ():
POSIX.1-2008.
.PP
+.BR openat2 (2)
+is Linux-specific.
+.PP
The
.BR O_DIRECT ,
.BR O_NOATIME ,
@@ -1778,6 +1794,7 @@ is ignored).
.BR mknod (2),
.BR mmap (2),
.BR mount (2),
+.BR openat2 (2),
.BR open_by_handle_at (2),
.BR read (2),
.BR socket (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..8588f0018657
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,435 @@
+.\" Copyright (C) 2019 Aleksa Sarai <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-11-05 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call opens the file specified by
+.IR pathname .
+If the specified file does not exist, it may optionally (if
+.B O_CREAT
+is specified in
+.IR how.flags )
+be created by
+.BR openat2 ().
+.PP
+As with
+.BR openat (2),
+if
+.I pathname
+is relative, then it is interpreted relative to the
+directory referred to by the file descriptor
+.I dirfd
+(or the current working directory of the calling process, if
+.I dirfd
+is the special value
+.BR AT_FDCWD .)
+If
+.I pathname
+is absolute, then
+.I dirfd
+is ignored (unless
+.I how.resolve
+contains
+.BR RESOLVE_IN_ROOT ,
+in which case
+.I pathname
+is resolved relative to
+.IR dirfd .)
+.PP
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality.
+Rather than taking a single
+.I flags
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+future extensions.
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+.B NOTES
+for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flags " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+ __aligned_u64 flags; /* O_* flags. */
+ __u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
+ __u16 __padding[3]; /* Must be zeroed. */
+ __aligned_u64 resolve; /* RESOLVE_* flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure (or through
+reuse of pre-existing padding space), with the zero value of the new fields
+acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation.
+All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+
+Unlike
+.BR openat (2),
+it is an error to provide
+.BR openat2 ()
+unknown or conflicting flags in
+.IR flags .
+.RE
+
+.IR mode
+.RS
+File mode for the new file, with identical semantics to the
+.I mode
+argument to
+.BR openat (2).
+However, unlike
+.BR openat (2),
+it is an error to provide
+.BR openat2 ()
+with a
+.I mode
+which contains bits other than
+.IR 0777 .
+
+It is an error to provide
+.BR openat2 ()
+a non-zero
+.IR mode " if " flags
+does not contain
+.BR O_CREAT " or " O_TMPFILE .
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.)
+The primary use case for these flags is to allow trusted programs to restrict
+how untrusted paths (or paths inside untrusted directories) are resolved.
+The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow traversal of mount points during path resolution (including all bind
+mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind mounts are very widely used by
+end-users.
+Setting this flag indiscriminately for all uses of
+.BR openat2 ()
+may result in spurious errors on previously-functional systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow resolution of symbolic links during path resolution.
+This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+If the trailing component is a symbolic link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symbolic link will be returned.
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symbolic links are very widely used
+by end-users.
+Setting this flag indiscriminately for all uses of
+.BR openat2 ()
+may result in spurious errors on previously-functional systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic link resolution during path resolution.
+
+If the trailing component is a magic link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic link will be returned.
+
+Magic-links are symbolic link-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This causes absolute symbolic links (and absolute values of
+.IR pathname )
+to be rejected.
+
+Currently, this flag also disables magic link resolution.
+However, this may change in the future.
+The caller should explicitly specify
+.B RESOLVE_NO_MAGICLINKS
+to ensure that magic links are not resolved.
+
+.TP
+.B RESOLVE_IN_ROOT
+Treat
+.I dirfd
+as the root directory while resolving
+.I pathname
+(as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.)
+Absolute symbolic links and ".." path components will be scoped to
+.IR dirfd .
+If
+.I pathname
+is an absolute path, it is also treated relative to
+.IR dirfd .
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root permanently for a process),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations.
+It also has several hardening features (such as detecting escape attempts during
+.I ".."
+resolution) which
+.BR chroot (2)
+does not.
+
+Currently, this flag also disables magic link resolution.
+However, this may change in the future.
+The caller should explicitly specify
+.B RESOLVE_NO_MAGICLINKS
+to ensure that magic links are not resolved.
+.PP
+It is an error to provide
+.BR openat2 ()
+unknown flags in
+.IR resolve .
+.RE
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I mode
+is non-zero, but
+.I flags
+does not contain
+.BR O_CREAT " or " O_TMPFILE .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the
+.B NOTES
+for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack.)
+Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and an escape from the root during path resolution was detected.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symbolic link (or magic link).
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic link.
+
+.SH VERSIONS
+.BR openat2 ()
+first appeared in Linux 5.6.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify the size of the
+.I struct open_how
+structure it is passing.
+By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.)
+This extensibility design is very similar to other system calls such as
+.BR sched_setattr "(2), " perf_event_open "(2), and " clone3 (2).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of.
+Because all extensions must have their zero values be a no-op, the kernel
+treats all of the extension fields not set by userspace to have zero values.
+This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support.
+Because all extensions must have their zero values be a no-op, the kernel can
+safely ignore the unsupported extension fields if they are all-zero.
+If any unsupported extension fields are non-zero, then -1 is returned and
+.I errno
+is set to
+.BR E2BIG .
+This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions.
+However, if a userspace program wishes to determine what extensions the running
+kernel supports, it may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error of
+.BR E2BIG .)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
--
2.24.0
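The three usize/ksize cases in the "Extensibility" section of the man page can
be modelled in plain userspace C. The following is a sketch of the rules as
the man page states them, not the kernel's implementation; the function name
sketch_copy_struct() and the SKETCH_SIZE_VER0 constant are hypothetical and
exist only for this example. It assumes the smallest valid size is the 24-byte
first published layout (OPEN_HOW_SIZE_VER0 in the uapi header).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SIZE_VER0 24	/* mirrors OPEN_HOW_SIZE_VER0 from the patch */

/*
 * Model of the size negotiation: copy a usize-byte structure supplied by
 * "userspace" into a ksize-byte structure understood by the "kernel".
 * Returns 0 on success or a negative errno value.
 */
static int sketch_copy_struct(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t copy = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Smaller than any known version of the structure. */
	if (usize < SKETCH_SIZE_VER0)
		return -EINVAL;

	/*
	 * usize > ksize: userspace knows about extensions we do not.
	 * That is only acceptable if every unknown byte is zero.
	 */
	for (i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	/*
	 * usize < ksize: fields userspace did not set read as zero,
	 * which by convention means "extension not requested".
	 */
	memset(dst, 0, ksize);
	memcpy(dst, src, copy);
	return 0;
}
```

With this model, a 32-byte structure whose last 8 bytes are zero is accepted
by a "kernel" that only knows the 24-byte layout, while the same structure
with any of those bytes set is rejected with E2BIG, which is the behaviour the
binary-search probing described in the man page relies on.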