2018-11-12 14:28:07

by Aleksa Sarai

[permalink] [raw]
Subject: [PATCH v4 0/4] namei: O_* flags to restrict path resolution

Sorry for not sending a series earlier, I've been busy with assignments.

Patch changelog:
v4:
* Remove AT_* flag reservations, as they require more discussion.
* Switch to path_is_under() over __d_path() for breakout checking.
* Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
is now ignored for absolute paths to match other flags.
* Improve the dirfd_path_init() refactor and move it to a separate
commit.
* Remove reference to Linux-capsicum.
* Switch "proclink" name to "magic link".
v3: [resend]
v2:
* Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
some semi-aggressive __d_path checking (see patch 3).
* Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
hopes they can be re-enabled once safe.
* Removed the selftests as they will be reimplemented as xfstests.
* Removed stat(2) support, since you can already get it through
O_PATH and fstatat(2).

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags:

* O_XDEV blocks all mountpoint crossings (upwards, downwards, or
through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.

* O_NOMAGICLINKS blocks resolution through /proc/$pid/fd-style links.
This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.

It should be noted that this is different to the scope of O_NOFOLLOW
in that it applies to all path components. However, you can do
open(O_NOFOLLOW|O_NOMAGICLINKS|O_PATH) on a "magic link" and it will
*not* fail (assuming that no parent component was a "magic link"),
and you will have an fd for the "magic link".

* O_BENEATH disallows escapes to outside the starting dirfd's tree,
using techniques such as ".." or absolute links. Absolute paths in
openat(2) are also disallowed. Conceptually this flag is to ensure
you "stay below" a certain point in the filesystem tree -- but this
requires some additional to protect against various races that would
allow escape using ".." (see patch 4 for more detail).

Currently O_BENEATH implies O_NOMAGICLINKS, because it can trivially
beam you around the filesystem (breaking the protection). In future,
there might be similar safety checks as in patch 4, but that
requires more discussion.

In addition, two new flags were added that expand on the above ideas:

* O_NOSYMLINKS does what it says on the tin. No symlink resolution is
allowed at all, including "magic links". Just as with O_NOMAGICLINKS
this can still be used with (O_PATH|O_NOFOLLOW) to open an fd for
the symlink as long as no parent path had a symlink component.

* O_THISROOT is an extension of O_BENEATH that, rather than blocking
attempts to move past the root, forces all such movements to be
scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.

If a race is detected (as with O_BENEATH) then an error is
generated, and similar to O_BENEATH it is not permitted to cross
"magic links" with O_THISROOT.

The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having O_THISROOT (such as
CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, to name a few).

Cc: Al Viro <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Drysdale <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (4):
namei: split out nd->dfd handling to dirfd_path_init
namei: O_BENEATH-style path resolution flags
namei: O_THISROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution

fs/fcntl.c | 2 +-
fs/namei.c | 205 ++++++++++++++++++++++---------
fs/open.c | 13 +-
include/linux/fcntl.h | 3 +-
include/linux/namei.h | 8 ++
include/uapi/asm-generic/fcntl.h | 20 +++
6 files changed, 189 insertions(+), 62 deletions(-)

--
2.19.1



2018-11-12 14:28:18

by Aleksa Sarai

[permalink] [raw]
Subject: [PATCH v4 1/4] namei: split out nd->dfd handling to dirfd_path_init

Previously, path_init's handling of *at(dfd, ...) was only done once,
but with O_BENEATH (and O_THISROOT) we have to parse the initial
nd->path at different times (before or after absolute path handling)
depending on whether we have been asked to scope resolution within a
root.

Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 103 ++++++++++++++++++++++++++++++-----------------------
1 file changed, 59 insertions(+), 44 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..faefca58348d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2168,9 +2168,59 @@ static int link_path_walk(const char *name, struct nameidata *nd)
}
}

+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+ if (nd->dfd == AT_FDCWD) {
+ if (nd->flags & LOOKUP_RCU) {
+ struct fs_struct *fs = current->fs;
+ unsigned seq;
+
+ do {
+ seq = read_seqcount_begin(&fs->seq);
+ nd->path = fs->pwd;
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+ } while (read_seqcount_retry(&fs->seq, seq));
+ } else {
+ get_fs_pwd(current->fs, &nd->path);
+ nd->inode = nd->path.dentry->d_inode;
+ }
+ } else {
+ /* Caller must check execute permissions on the starting path component */
+ struct fd f = fdget_raw(nd->dfd);
+ struct dentry *dentry;
+
+ if (!f.file)
+ return -EBADF;
+
+ dentry = f.file->f_path.dentry;
+
+ if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+ fdput(f);
+ return -ENOTDIR;
+ }
+
+ nd->path = f.file->f_path;
+ if (nd->flags & LOOKUP_RCU) {
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+ } else {
+ path_get(&nd->path);
+ nd->inode = nd->path.dentry->d_inode;
+ }
+ fdput(f);
+ }
+ return 0;
+}
+
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
+ int error;
const char *s = nd->name->name;

if (!*s)
@@ -2204,52 +2254,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

nd->m_seq = read_seqbegin(&mount_lock);
if (*s == '/') {
- set_root(nd);
- if (likely(!nd_jump_root(nd)))
- return s;
- return ERR_PTR(-ECHILD);
- } else if (nd->dfd == AT_FDCWD) {
- if (flags & LOOKUP_RCU) {
- struct fs_struct *fs = current->fs;
- unsigned seq;
-
- do {
- seq = read_seqcount_begin(&fs->seq);
- nd->path = fs->pwd;
- nd->inode = nd->path.dentry->d_inode;
- nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
- } while (read_seqcount_retry(&fs->seq, seq));
- } else {
- get_fs_pwd(current->fs, &nd->path);
- nd->inode = nd->path.dentry->d_inode;
- }
- return s;
- } else {
- /* Caller must check execute permissions on the starting path component */
- struct fd f = fdget_raw(nd->dfd);
- struct dentry *dentry;
-
- if (!f.file)
- return ERR_PTR(-EBADF);
-
- dentry = f.file->f_path.dentry;
-
- if (*s && unlikely(!d_can_lookup(dentry))) {
- fdput(f);
- return ERR_PTR(-ENOTDIR);
- }
-
- nd->path = f.file->f_path;
- if (flags & LOOKUP_RCU) {
- nd->inode = nd->path.dentry->d_inode;
- nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
- } else {
- path_get(&nd->path);
- nd->inode = nd->path.dentry->d_inode;
- }
- fdput(f);
+ if (likely(!nd->root.mnt))
+ set_root(nd);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ s = ERR_PTR(error);
return s;
}
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ return s;
}

static const char *trailing_symlink(struct nameidata *nd)
--
2.19.1


2018-11-12 14:28:24

by Aleksa Sarai

[permalink] [raw]
Subject: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.

* O_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
from one). The primary "scoping" use is to blocking resolution that
crosses a bind-mount, which has a similar property to a symlink (in
the way that it allows for escape from the starting-point). Since it
is not possible to differentiate bind-mounts However since
bind-mounting requires privileges (in ways symlinks don't) this has
been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
(though find(1) doesn't walk upwards, the semantics seem obvious).

* O_NOMAGICLINKS: Disallows ->get_link "symlink" jumping. This is a very
specific restriction, and it exists because /proc/$pid/fd/...
"symlinks" allow for access outside nd->root and pose risk to
container runtimes that don't want to be tricked into accessing a host
path (but do want to allow no-funny-business symlink resolution).

* O_NOSYMLINKS: Disallows symlink jumping *of any kind*. Implies
O_NOMAGICLINKS (obviously).

* O_BENEATH: Disallow "escapes" from the starting point of the
filesystem tree during resolution (you must stay "beneath" the
starting point at all times). Currently this is done by disallowing
".." and absolute paths (either in the given path or found during
symlink resolution) entirely, as well as all "magic link" jumping.

The wholesale banning of ".." is because it is currently not safe to
allow ".." resolution (races can cause the path to be moved outside of
the root -- this is conceptually similar to historical chroot(2)
escape attacks). Future patches in this series will address this, and
will re-enable ".." resolution once it is safe. With those patches,
".." resolution will only be allowed if it remains in the root
throughout resolution (such as "a/../b" not "a/../../outside/b").

The banning of "magic link" jumping is done because it is not clear
whether semantically they should be allowed -- while some "magic
links" are safe there are many that can cause escapes (and once a
resolution is outside of the root, O_BENEATH will no longer detect
it). Future patches may re-enable "magic link" jumping when such jumps
would remain inside the root.

The O_NO*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for openat(2). In future it may be necessary to enable
these for *at(2) flags, as well as expanding support for AT_EMPTY_PATH.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/fcntl.c | 2 +-
fs/namei.c | 76 +++++++++++++++++++++++++++-----
fs/open.c | 11 ++++-
include/linux/fcntl.h | 3 +-
include/linux/namei.h | 7 +++
include/uapi/asm-generic/fcntl.h | 17 +++++++
6 files changed, 102 insertions(+), 14 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index faefca58348d..b8d2bee89b78 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,6 +845,12 @@ static inline void path_to_nameidata(const struct path *path,

static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
+ if (unlikely(nd->flags & LOOKUP_XDEV)) {
+ if (nd->path.mnt != nd->root.mnt)
+ return -EXDEV;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1053,6 +1059,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;

+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
@@ -1083,14 +1092,23 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
+ /* If we just jumped it was because of a procfs-style link. */
+ if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+ if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+ return ERR_PTR(-ELOOP);
+ /* Not currently safe. */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return ERR_PTR(-EXDEV);
+ }
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
if (!nd->root.mnt)
set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1271,12 +1289,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}

- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1333,6 +1355,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1377,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;

while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1379,6 +1406,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1422,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
nd->path.mnt = &mounted->mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1481,8 +1512,11 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1491,6 +1525,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
@@ -1705,6 +1741,13 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ /*
+ * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+ * cause our parent to have moved outside of the root and us to skip
+ * over it.
+ */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
if (nd->flags & LOOKUP_RCU) {
@@ -2253,6 +2296,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;

nd->m_seq = read_seqbegin(&mount_lock);
+
+ if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ nd->root = nd->path;
+ if (!(nd->flags & LOOKUP_RCU))
+ path_get(&nd->root);
+ }
if (*s == '/') {
if (likely(!nd->root.mnt))
set_root(nd);
@@ -2261,9 +2313,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
s = ERR_PTR(error);
return s;
}
- error = dirfd_path_init(nd);
- if (unlikely(error))
- return ERR_PTR(error);
+ if (likely(!nd->path.mnt)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
return s;
}

diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..3e73f940f56e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -959,7 +959,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
* If we have O_PATH in the open flag. Then we
* cannot have anything other than the below set of flags
*/
- flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+ flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH |
+ O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS;
acc_mode = 0;
}

@@ -988,6 +989,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & O_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (flags & O_XDEV)
+ lookup_flags |= LOOKUP_XDEV;
+ if (flags & O_NOMAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (flags & O_NOSYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
op->lookup_flags = lookup_flags;
return 0;
}
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..864399c2fdd2 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+ O_NOMAGICLINKS | O_NOSYMLINKS)

#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..82b5039d27a6 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_EMPTY 0x4000
#define LOOKUP_DOWN 0x8000

+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
+ Implies LOOKUP_NO_MAGICLINKS. */
+
extern int path_pts(struct path *path);

extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..b2d3811843e7 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
#define O_NDELAY O_NONBLOCK
#endif

+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH 00040000000
+#endif
+#ifndef O_XDEV
+#define O_XDEV 00100000000
+#endif
+#ifndef O_NOMAGICLINKS
+#define O_NOMAGICLINKS 00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS 01000000000
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--
2.19.1


2018-11-12 14:29:53

by Aleksa Sarai

[permalink] [raw]
Subject: [PATCH v4 3/4] namei: O_THISROOT: chroot-like path resolution

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing O_XDEV
and O_NOMAGICLINKS[**] help defend against other potential attacks in a
malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and "magic link" jumping are disallowed for the
same reason it is disabled for O_BENEATH -- currently it is not safe
to allow it. Future patches may enable it unconditionally once we
have resolved the possible races (for "..") and semantics (for
"magic link" jumping).

The most significant openat(2) semantic change with O_THISROOT is that
absolute pathnames no longer cause dirfd to be ignored completely. The
rationale is that O_THISROOT must necessarily chroot-scope symlinks with
absolute paths to dirfd, and so doing it for the base path seems to be
the most consistent behaviour (and also avoids foot-gunning users who
want to scope paths that are absolute).

Currently this is only enabled for openat(2), and similar to O_BENEATH
and family requires more discussion about extending it to more *at(2)
syscalls as well as extending AT_EMPTY_PATH support.

[1]: https://github.com/cyphar/filepath-securejoin

Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/fcntl.c | 2 +-
fs/namei.c | 6 +++---
fs/open.c | 4 +++-
include/linux/fcntl.h | 2 +-
include/linux/namei.h | 1 +
include/uapi/asm-generic/fcntl.h | 3 +++
6 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index e343618736f7..4c36c5b9fdb9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index b8d2bee89b78..459faea5b832 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1097,7 +1097,7 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
@@ -1746,7 +1746,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
* cause our parent to have moved outside of the root and us to skip
* over it.
*/
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
@@ -2297,7 +2297,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

nd->m_seq = read_seqbegin(&mount_lock);

- if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
diff --git a/fs/open.c b/fs/open.c
index 3e73f940f56e..4ba44b07f3ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -960,7 +960,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
* cannot have anything other than the below set of flags
*/
flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH |
- O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS;
+ O_XDEV | O_NOSYMLINKS | O_NOMAGICLINKS | O_THISROOT;
acc_mode = 0;
}

@@ -997,6 +997,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_NO_MAGICLINKS;
if (flags & O_NOSYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (flags & O_THISROOT)
+ lookup_flags |= LOOKUP_CHROOT;
op->lookup_flags = lookup_flags;
return 0;
}
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 864399c2fdd2..46c92bbfce4a 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
- O_NOMAGICLINKS | O_NOSYMLINKS)
+ O_NOMAGICLINKS | O_NOSYMLINKS | O_THISROOT)

#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 82b5039d27a6..b6865eda86d5 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */

extern int path_pts(struct path *path);

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index b2d3811843e7..194f5de9ba51 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -113,6 +113,9 @@
#ifndef O_NOSYMLINKS
#define O_NOSYMLINKS 01000000000
#endif
+#ifndef O_THISROOT
+#define O_THISROOT 02000000000
+#endif

#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
--
2.19.1


2018-11-12 14:30:37

by Aleksa Sarai

[permalink] [raw]
Subject: [PATCH v4 4/4] namei: aggressively check for nd->root escape on ".." resolution

This patch allows for O_BENEATH and O_THISROOT to safely permit ".."
resolution (in the case of O_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.

thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
openat(dirb, "b/c/../../etc/shadow", O_THISROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of O_THISROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.

Cc: Al Viro <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 459faea5b832..b2c962f149c7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1741,19 +1741,35 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
- /*
- * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
- * cause our parent to have moved outside of the root and us to skip
- * over it.
- */
- if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
- return -EXDEV;
+ int error = 0;
+
if (!nd->root.mnt)
set_root(nd);
- if (nd->flags & LOOKUP_RCU) {
- return follow_dotdot_rcu(nd);
- } else
- return follow_dotdot(nd);
+ if (nd->flags & LOOKUP_RCU)
+ error = follow_dotdot_rcu(nd);
+ else
+ error = follow_dotdot(nd);
+ if (error)
+ return error;
+
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
+ bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+ bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+ /*
+ * Don't bother checking unless there's a racing
+ * rename(2) or MS_MOVE.
+ */
+ if (likely(!m_retry && !r_retry))
+ return 0;
+
+ if (m_retry && !(nd->flags & LOOKUP_RCU))
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (r_retry)
+ nd->r_seq = read_seqbegin(&rename_lock);
+ if (!path_is_under(&nd->path, &nd->root))
+ return -EXDEV;
+ }
}
return 0;
}
@@ -2274,6 +2290,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)))
+ nd->r_seq = read_seqbegin(&rename_lock);
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2284,7 +2305,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (flags & LOOKUP_RCU) {
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
nd->root_seq = nd->seq;
- nd->m_seq = read_seqbegin(&mount_lock);
} else {
path_get(&nd->path);
}
@@ -2295,8 +2315,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;

- nd->m_seq = read_seqbegin(&mount_lock);
-
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
error = dirfd_path_init(nd);
if (unlikely(error))
--
2.19.1


2018-11-24 08:31:26

by Jürg Billeter

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

Hi Aleksa,

On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
> * O_BENEATH: Disallow "escapes" from the starting point of the
> filesystem tree during resolution (you must stay "beneath" the
> starting point at all times). Currently this is done by disallowing
> ".." and absolute paths (either in the given path or found during
> symlink resolution) entirely, as well as all "magic link" jumping.

With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
necessary? As I understand it, O_BENEATH could be replaced by a much
simpler flag that only disallows absolute paths (incl. absolute
symlinks). And it would have the benefit that you can actually pass the
tree/directory fd to another process and escaping would not be possible
even if that other process doesn't use O_BENEATH (after calling
mount_setattr(2) to make sure it's locked down).

This approach would also make it easy to restrict writes via a cloned
tree/directory fd by marking it read-only via mount_setattr(2) (and
locking down the read-only flag). This would again be especially useful
when passing tree/directory fds across processes, or for voluntary
self-lockdown within a process for robustness against security bugs.

This wouldn't affect any of the other flags in this patch. And for full
equivalence to O_BENEATH you'd have to use O_NOMAGICLINKS in addition
to O_NOABSOLUTE, or whatever that new flag would be called.

Or is OPEN_TREE_CLONE too expensive for this use case? Or is there
anything else I'm missing?

Jürg


2018-11-24 08:41:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

> On Nov 23, 2018, at 5:10 AM, Jürg Billeter <[email protected]> wrote:
>
> Hi Aleksa,
>
>> On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
>> * O_BENEATH: Disallow "escapes" from the starting point of the
>> filesystem tree during resolution (you must stay "beneath" the
>> starting point at all times). Currently this is done by disallowing
>> ".." and absolute paths (either in the given path or found during
>> symlink resolution) entirely, as well as all "magic link" jumping.
>
> With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
> necessary?

This discussion reminds me of something I’m uncomfortable with in the
current patches: currently, most of the O_ flags determine some
property of the returned opened file. The new O_ flags you're adding
don't -- instead, they affect the lookup of the file. So O_BENEATH
doesn't return a descriptor that can only be used to loop up files
beneath it -- it just controls whether open(2) succeeds or fails. It
might be nice for the naming of the flags to reflect this. I also
don't love that we have some magic AT_ flags that work with some
syscalls and some magic O_ flags that work with others.

In this regard, I liked the AT_ naming better. Although I don't love
adding AT_ flags because the restrict our ability to usefully use the
high bits of the fd in the future.

--Andy

2018-11-24 08:43:30

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

On 2018-11-23, Andy Lutomirski <[email protected]> wrote:
> > On Nov 23, 2018, at 5:10 AM, Jürg Billeter <[email protected]> wrote:
> >
> > Hi Aleksa,
> >
> >> On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
> >> * O_BENEATH: Disallow "escapes" from the starting point of the
> >> filesystem tree during resolution (you must stay "beneath" the
> >> starting point at all times). Currently this is done by disallowing
> >> ".." and absolute paths (either in the given path or found during
> >> symlink resolution) entirely, as well as all "magic link" jumping.
> >
> > With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
> > necessary?
>
> This discussion reminds me of something I’m uncomfortable with in the
> current patches: currently, most of the O_ flags determine some
> property of the returned opened file. The new O_ flags you're adding
> don't -- instead, they affect the lookup of the file. So O_BENEATH
> doesn't return a descriptor that can only be used to loop up files
> beneath it -- it just controls whether open(2) succeeds or fails. It
> might be nice for the naming of the flags to reflect this.

I agree that there is something quite weird about having path resolution
flags in an *open* syscall. The main reason why it's linked to open is
because that's the only way to get O_PATH descriptors (which is what you
would use for most of the extended operations we need -- as well as
reading+writing to files which is what most users would do with this).

And I think O_PATH is another example of an open flag that is just odd
in how it changes the semantics of using open(2).

One of the ideas I pitched in an earlier mail was a hypothetical
resolveat(2) -- which would just be a new way of getting an O_PATH
descriptor. This way, we wouldn't be using up more O_* flag bits with
this feature and we'd be able to play with more radical semantic changes
in the future. I can outline these here if you like, but it's a bit of a
long discussion and I'd prefer not to derail things too much if
resolveat(2) is definitely out of the question.

> I also don't love that we have some magic AT_ flags that work with
> some syscalls and some magic O_ flags that work with others.

I also completely agree. I think that we should have a discussion about
the long-term plan of syscall flags because it's starting to get a
little bit crazy:

* Every "get an fd" syscall has its own brand of O_CLOEXEC. Thankfully,
many of them use the same value (except for memfd_create(2) and a few
other examples).
* AT_* was supposed to be generic across all *at(2) syscalls, but there
are several cases where this isn't really true anymore.

* renameat2(2) only supports RENAME_* flags.
* openat(2) supports only O_* flags.
* Most AT_* flags have O_* counterparts (or are even more of a mess
such as with {AT_SYMLINK_FOLLOW,AT_SYMLINK_NOFOLLOW,O_NOFOLLOW}).
* statx(2) added AT_STATX_* flags, making AT_* no longer generic.

(Also I just want to mention something I noticed while writing this
patch -- openat(2) violates one of the kernel "golden rules" -- that you
reject unknown flags. openat(2) will silently ignore unknown flag bits.
I'm sure there's a really good reason for this, but it's another flag
oddity that I felt fit here.)

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (3.37 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-24 08:44:05

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

On 2018-11-23, J?rg Billeter <[email protected]> wrote:
> On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
> > * O_BENEATH: Disallow "escapes" from the starting point of the
> > filesystem tree during resolution (you must stay "beneath" the
> > starting point at all times). Currently this is done by disallowing
> > ".." and absolute paths (either in the given path or found during
> > symlink resolution) entirely, as well as all "magic link" jumping.
>
> With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
> necessary? As I understand it, O_BENEATH could be replaced by a much
> simpler flag that only disallows absolute paths (incl. absolute
> symlinks). And it would have the benefit that you can actually pass the
> tree/directory fd to another process and escaping would not be possible
> even if that other process doesn't use O_BENEATH (after calling
> mount_setattr(2) to make sure it's locked down).
>
> This approach would also make it easy to restrict writes via a cloned
> tree/directory fd by marking it read-only via mount_setattr(2) (and
> locking down the read-only flag). This would again be especially useful
> when passing tree/directory fds across processes, or for voluntary
> self-lockdown within a process for robustness against security bugs.
>
> This wouldn't affect any of the other flags in this patch. And for full
> equivalence to O_BENEATH you'd have to use O_NOMAGICLINKS in addition
> to O_NOABSOLUTE, or whatever that new flag would be called.
>
> Or is OPEN_TREE_CLONE too expensive for this use case? Or is there
> anything else I'm missing?

OPEN_TREE_CLONE currently requires CAP_SYS_ADMIN in mnt_ns->user_ns,
which requires a fork and unshare -- or at least a vfork and some other
magic -- at which point we're back to just doing a pivot_root(2) for
most operations.

I think open_tree(2) -- which I really should sit down and play around
with -- would be an interesting way of doing O_BENEATH, but I think
you'd still need to have the same race protections we have in the
current O_BENEATH proposal to handle "..".

So really you'd be using open_tree(OPEN_TREE_CLONE) just so that you can
use the "path.mnt" setting code, which I'm not sure is the best way of
doing it (plus the other interesting ideas which you get with the other
mount API changes).

But I am quite hopeful for what cool things we'll be able to make using
the new mount API.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.50 kB)
signature.asc (849.00 B)
Download all attachments

2018-11-24 08:44:47

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] namei: O_BENEATH-style path resolution flags

On 2018-11-24, Aleksa Sarai <[email protected]> wrote:
> > >> On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
> > >> * O_BENEATH: Disallow "escapes" from the starting point of the
> > >> filesystem tree during resolution (you must stay "beneath" the
> > >> starting point at all times). Currently this is done by disallowing
> > >> ".." and absolute paths (either in the given path or found during
> > >> symlink resolution) entirely, as well as all "magic link" jumping.
> > >
> > > With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
> > > necessary?
> >
> > This discussion reminds me of something I’m uncomfortable with in the
> > current patches: currently, most of the O_ flags determine some
> > property of the returned opened file. The new O_ flags you're adding
> > don't -- instead, they affect the lookup of the file. So O_BENEATH
> > doesn't return a descriptor that can only be used to loop up files
> > beneath it -- it just controls whether open(2) succeeds or fails. It
> > might be nice for the naming of the flags to reflect this.
>
> I agree that there is something quite weird about having path resolution
> flags in an *open* syscall. The main reason why it's linked to open is
> because that's the only way to get O_PATH descriptors (which is what you
> would use for most of the extended operations we need -- as well as
> reading+writing to files which is what most users would do with this).
>
> And I think O_PATH is another example of an open flag that is just odd
> in how it changes the semantics of using open(2).
>
> One of the ideas I pitched in an earlier mail was a hypothetical
> resolveat(2) -- which would just be a new way of getting an O_PATH
> descriptor. This way, we wouldn't be using up more O_* flag bits with
> this feature and we'd be able to play with more radical semantic changes
> in the future. I can outline these here if you like, but it's a bit of a
> long discussion and I'd prefer not to derail things too much if
> resolveat(2) is definitely out of the question.

Sorry, one thing I forgot to mention about returning descriptors that
can only look up files beneath it -- while I think this would be very
useful, I'd be worried about jumping into chroot(2) territory where now
you are giving userspace the opportunity to try to create nested
"beneathfds" and so on. I do think it would be quite useful and
interesting though, but I'm not entirely sure how doable it would be
with the current namei infrastructure.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (2.58 kB)
signature.asc (849.00 B)
Download all attachments