The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed
in the original thread, along with a further split of AT_NO_PROCLINKS
which means that each individual property of AT_NO_JUMPS is now a
separate flag:
* Path-based escapes from the starting-point using "/" or ".." are
blocked by AT_BENEATH.
* Mountpoint crossings are blocked by AT_XDEV.
* /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
correctly it actually blocks any user of nd_jump_link() because it
allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
Linus' suggestion in the original thread, I've also implemented
AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
"proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
blocks this as well (feel free to ask me to remove it if you feel this
is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family.
I would hope we could enable it for basically every *at(2) syscall --
but many of them appear to not have a @flags argument and thus we'll
need to add several new syscalls to do this. I'm more than happy to send
those patches, but I'd prefer to know that this preliminary work is
acceptable before doing a bunch of copy-paste to add new sets of *at(2)
syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this
is probably going to be more contentious than the refresh of
AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
describes my reasoning, but the shortened version of the premise is that
continer runtimes need to have a way to resolve paths within a
potentially malicious rootfs. Container runtimes currently do this in
userspace[2] which has implicit race conditions that are not resolvable
in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
path resolution, which would be invaluable for us -- and the
implementation is basically identical to AT_BENEATH (except that we
don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether
they should live here or in xfstests (as far as I can tell there are no
other VFS tests in selftests, while there are some tests that look like
generic VFS tests in xfstests). If you'd prefer them to be included in
xfstests, let me know.
[1]: https://lore.kernel.org/patchwork/patch/784221/
[2]: https://github.com/cyphar/filepath-securejoin
Aleksa Sarai (3):
namei: implement O_BENEATH-style AT_* flags
namei: implement AT_THIS_ROOT chroot-like path resolution
selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +-
fs/namei.c | 158 ++++++++++++------
fs/open.c | 10 ++
fs/stat.c | 15 +-
include/linux/fcntl.h | 3 +-
include/linux/namei.h | 8 +
include/uapi/asm-generic/fcntl.h | 20 +++
include/uapi/linux/fcntl.h | 10 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/vfs/.gitignore | 1 +
tools/testing/selftests/vfs/Makefile | 13 ++
tools/testing/selftests/vfs/at_flags.h | 40 +++++
tools/testing/selftests/vfs/common.sh | 37 ++++
.../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
.../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
.../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
.../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
.../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
19 files changed, 707 insertions(+), 56 deletions(-)
create mode 100644 tools/testing/selftests/vfs/.gitignore
create mode 100644 tools/testing/selftests/vfs/Makefile
create mode 100644 tools/testing/selftests/vfs/at_flags.h
create mode 100644 tools/testing/selftests/vfs/common.sh
create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
--
2.19.0
Add the following flags for path resolution. The primary justification
for these flags is to allow for programs to be far more strict about how
they want path resolution to handle symlinks, mountpoint crossings, and
paths that escape the dirfd (through an absolute path or ".."
shenanigans).
This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init).
* AT_BENEATH: Disallow ".." or absolute paths (either in the path or
found during symlink resolution) to escape the starting point of name
resolution, though ".." is permitted in cases like "foo/../bar".
Relative symlinks are still allowed (as long as they don't escape the
starting point).
* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
from one). The primary "scoping" use is to blocking resolution that
crosses a bind-mount, which has a similar property to a symlink (in
the way that it allows for escape from the starting-point). Since it
is not possible to differentiate bind-mounts However since
bind-mounting requires privileges (in ways symlinks don't) this has
been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
(though find(1) doesn't walk upwards, the semantics seem obvious).
* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
specific restriction, and it exists because /proc/$pid/fd/...
"symlinks" allow for access outside nd->root and pose risk to
container runtimes that don't want to be tricked into accessing a host
path (but do want to allow no-funny-business symlink resolution).
* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
AT_NO_PROCLINK (obviously).
The AT_NO_*LINK flags return -ELOOP if path resolution would violates
their requirement, while the others all return -EXDEV. Currently these
are only enabled for the stat(2) family and the openat(2) family (the
latter has its own brand of O_* flags with the same semantics). Ideally
these flags would be supported by all *at(2) syscalls, but this will
require adding flags arguments to many of them (and will be done in a
separate patchset).
Cc: Andy Lutomirski <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/fcntl.c | 2 +-
fs/namei.c | 61 ++++++++++++++++++++++++++------
fs/open.c | 8 +++++
fs/stat.c | 13 +++++--
include/linux/fcntl.h | 3 +-
include/linux/namei.h | 7 ++++
include/uapi/asm-generic/fcntl.h | 17 +++++++++
include/uapi/linux/fcntl.h | 8 +++++
8 files changed, 104 insertions(+), 15 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..e343618736f7 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index fb913148d4d1..757dd783771c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,6 +859,8 @@ static int nd_jump_root(struct nameidata *nd)
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
nd->flags |= LOOKUP_JUMPED;
return 0;
}
@@ -1083,14 +1085,19 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
+ /* If we just jumped it was because of a procfs-style link. */
+ if (unlikely(nd->flags & LOOKUP_JUMPED) &&
+ unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
+ return ERR_PTR(-ELOOP);
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
if (!nd->root.mnt)
set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1271,12 +1278,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}
- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1333,6 +1344,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1353,8 +1366,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;
while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1379,6 +1395,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1481,8 +1499,11 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1491,6 +1512,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
@@ -1720,6 +1743,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
{
int error;
struct saved *last;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return -ELOOP;
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
return -ELOOP;
@@ -2175,6 +2200,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (!*s)
flags &= ~LOOKUP_RCU;
+ if (flags & LOOKUP_NO_SYMLINKS)
+ flags |= LOOKUP_NO_PROCLINKS;
if (flags & LOOKUP_RCU)
rcu_read_lock();
@@ -2204,10 +2231,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
if (*s == '/') {
+ int error;
set_root(nd);
- if (likely(!nd_jump_root(nd)))
- return s;
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ s = ERR_PTR(error);
+ return s;
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
@@ -2223,6 +2252,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
+ if (unlikely(flags & LOOKUP_BENEATH)) {
+ nd->root = nd->path;
+ if (!(flags & LOOKUP_RCU))
+ path_get(&nd->root);
+ }
return s;
} else {
/* Caller must check execute permissions on the starting path component */
@@ -2247,6 +2281,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
+ if (unlikely(flags & LOOKUP_BENEATH)) {
+ nd->root = nd->path;
+ if (!(flags & LOOKUP_RCU))
+ path_get(&nd->root);
+ }
fdput(f);
return s;
}
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..80f5f566a5ff 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & O_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (flags & O_XDEV)
+ lookup_flags |= LOOKUP_XDEV;
+ if (flags & O_NOPROCLINKS)
+ lookup_flags |= LOOKUP_NO_PROCLINKS;
+ if (flags & O_NOSYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
op->lookup_flags = lookup_flags;
return 0;
}
diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..791e61b916ae 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -170,8 +170,9 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
int error = -EINVAL;
unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
- if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
- AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
+ if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
+ KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
+ AT_NO_PROCLINKS | AT_NO_SYMLINKS))
return -EINVAL;
if (flags & AT_SYMLINK_NOFOLLOW)
@@ -180,6 +181,14 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
lookup_flags &= ~LOOKUP_AUTOMOUNT;
if (flags & AT_EMPTY_PATH)
lookup_flags |= LOOKUP_EMPTY;
+ if (flags & AT_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (flags & AT_XDEV)
+ lookup_flags |= LOOKUP_XDEV;
+ if (flags & AT_NO_PROCLINKS)
+ lookup_flags |= LOOKUP_NO_PROCLINKS;
+ if (flags & AT_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
retry:
error = user_path_at(dfd, filename, lookup_flags, &path);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..ad5bba4b5b12 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
+ O_NOPROCLINKS | O_NOSYMLINKS)
#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e8e3df..5ff7f3362d1b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_EMPTY 0x4000
#define LOOKUP_DOWN 0x8000
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
+ Implies LOOKUP_NO_PROCLINKS. */
+
extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..c2bf5983e46a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,23 @@
#define O_NDELAY O_NONBLOCK
#endif
+/*
+ * These are identical to their AT_* counterparts (which affect the entireity
+ * of path resolution).
+ */
+#ifndef O_BENEATH
+#define O_BENEATH 00040000000 /* *Not* the same as capsicum's O_BENEATH! */
+#endif
+#ifndef O_XDEV
+#define O_XDEV 00100000000
+#endif
+#ifndef O_NOPROCLINKS
+#define O_NOPROCLINKS 00200000000
+#endif
+#ifndef O_NOSYMLINKS
+#define O_NOSYMLINKS 01000000000
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 594b85f7cb86..551a9e2166a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -92,5 +92,13 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/* Flags which affect path *resolution*, not just last-component handling. */
+#define AT_BENEATH 0x10000 /* No absolute paths or ".." escaping
+ (in-path or through symlinks) */
+#define AT_XDEV 0x20000 /* No mountpoint crossing. */
+#define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */
+#define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
+ Implies AT_NO_PROCLINKS. */
+
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.19.0
The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths. The already-existing AT_XDEV
and AT_NO_PROCLINKS help defend against other potential attacks in a
malicious rootfs scenario.
Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec which is *very* costly if necessary for every
filesystem operation involving a container.
The most significant change in semantics with AT_THIS_ROOT is that
*at(2) syscalls now no longer have the property that an absolute
pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is
specified). The reasoning behind this is that AT_THIS_ROOT necessarily
has to chroot-scope symlinks with absolute paths to dirfd, and so doing
it for the base path seems to be the most consistent behaviour (and also
avoids foot-gunning users who want to chroot-scope paths that might be
absolute).
Currently this is only enabled for the stat(2) and openat(2) family (the
latter has its own flag O_THISROOT with the same semantics). Ideally
this flag would be supported by all *at(2) syscalls, but this will
require adding flags arguments to many of them (and will be done in a
separate patchset).
[1]: https://github.com/cyphar/filepath-securejoin
Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/fcntl.c | 2 +-
fs/namei.c | 121 +++++++++++++++++--------------
fs/open.c | 2 +
fs/stat.c | 4 +-
include/linux/fcntl.h | 2 +-
include/linux/namei.h | 1 +
include/uapi/asm-generic/fcntl.h | 3 +
include/uapi/linux/fcntl.h | 2 +
8 files changed, 81 insertions(+), 56 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index e343618736f7..4c36c5b9fdb9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 757dd783771c..1b984f0dbbb4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd)
}
}
+/*
+ * Configure nd->path based on the nd->dfd. This is only used as part of
+ * path_init().
+ */
+static inline int dirfd_path_init(struct nameidata *nd)
+{
+ if (nd->dfd == AT_FDCWD) {
+ if (nd->flags & LOOKUP_RCU) {
+ struct fs_struct *fs = current->fs;
+ unsigned seq;
+
+ do {
+ seq = read_seqcount_begin(&fs->seq);
+ nd->path = fs->pwd;
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+ } while (read_seqcount_retry(&fs->seq, seq));
+ } else {
+ get_fs_pwd(current->fs, &nd->path);
+ nd->inode = nd->path.dentry->d_inode;
+ }
+ } else {
+ /* Caller must check execute permissions on the starting path component */
+ struct fd f = fdget_raw(nd->dfd);
+ struct dentry *dentry;
+
+ if (!f.file)
+ return -EBADF;
+
+ dentry = f.file->f_path.dentry;
+
+ if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
+ fdput(f);
+ return -ENOTDIR;
+ }
+
+ nd->path = f.file->f_path;
+ if (nd->flags & LOOKUP_RCU) {
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+ } else {
+ path_get(&nd->path);
+ nd->inode = nd->path.dentry->d_inode;
+ }
+ fdput(f);
+ }
+ if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) {
+ nd->root = nd->path;
+ if (!(nd->flags & LOOKUP_RCU))
+ path_get(&nd->root);
+ }
+ return 0;
+}
+
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
+ int error;
const char *s = nd->name->name;
if (!*s)
@@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
+ if (unlikely(flags & LOOKUP_CHROOT)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
if (*s == '/') {
- int error;
- set_root(nd);
+ if (likely(!nd->root.mnt))
+ set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
s = ERR_PTR(error);
return s;
- } else if (nd->dfd == AT_FDCWD) {
- if (flags & LOOKUP_RCU) {
- struct fs_struct *fs = current->fs;
- unsigned seq;
-
- do {
- seq = read_seqcount_begin(&fs->seq);
- nd->path = fs->pwd;
- nd->inode = nd->path.dentry->d_inode;
- nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
- } while (read_seqcount_retry(&fs->seq, seq));
- } else {
- get_fs_pwd(current->fs, &nd->path);
- nd->inode = nd->path.dentry->d_inode;
- }
- if (unlikely(flags & LOOKUP_BENEATH)) {
- nd->root = nd->path;
- if (!(flags & LOOKUP_RCU))
- path_get(&nd->root);
- }
- return s;
- } else {
- /* Caller must check execute permissions on the starting path component */
- struct fd f = fdget_raw(nd->dfd);
- struct dentry *dentry;
-
- if (!f.file)
- return ERR_PTR(-EBADF);
-
- dentry = f.file->f_path.dentry;
-
- if (*s && unlikely(!d_can_lookup(dentry))) {
- fdput(f);
- return ERR_PTR(-ENOTDIR);
- }
-
- nd->path = f.file->f_path;
- if (flags & LOOKUP_RCU) {
- nd->inode = nd->path.dentry->d_inode;
- nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
- } else {
- path_get(&nd->path);
- nd->inode = nd->path.dentry->d_inode;
- }
- if (unlikely(flags & LOOKUP_BENEATH)) {
- nd->root = nd->path;
- if (!(flags & LOOKUP_RCU))
- path_get(&nd->root);
- }
- fdput(f);
- return s;
}
+ if (likely(!nd->path.mnt)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+ return s;
}
static const char *trailing_symlink(struct nameidata *nd)
diff --git a/fs/open.c b/fs/open.c
index 80f5f566a5ff..81d148f626cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_NO_PROCLINKS;
if (flags & O_NOSYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (flags & O_THISROOT)
+ lookup_flags |= LOOKUP_CHROOT;
op->lookup_flags = lookup_flags;
return 0;
}
diff --git a/fs/stat.c b/fs/stat.c
index 791e61b916ae..e8366e4812c3 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
- AT_NO_PROCLINKS | AT_NO_SYMLINKS))
+ AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT))
return -EINVAL;
if (flags & AT_SYMLINK_NOFOLLOW)
@@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
lookup_flags |= LOOKUP_NO_PROCLINKS;
if (flags & AT_NO_SYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (flags & AT_THIS_ROOT)
+ lookup_flags |= LOOKUP_CHROOT;
retry:
error = user_path_at(dfd, filename, lookup_flags, &path);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index ad5bba4b5b12..95480cd4c09d 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
- O_NOPROCLINKS | O_NOSYMLINKS)
+ O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5ff7f3362d1b..7ec9e2d84649 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_PROCLINKS. */
+#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index c2bf5983e46a..11206b0e927c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -113,6 +113,9 @@
#ifndef O_NOSYMLINKS
#define O_NOSYMLINKS 01000000000
#endif
+#ifndef O_THISROOT
+#define O_THISROOT 02000000000
+#endif
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 551a9e2166a8..ea978457b68f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -99,6 +99,8 @@
#define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */
#define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
Implies AT_NO_PROCLINKS. */
+#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though
+ it is chroot-ed into dirfd. */
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.19.0
With the addition of so many new scoping flags, it's necessary to have
some sort of validation that they really work. There were no vfs
self-tests in the past, so this also includes a basic framework that
future VFS tests can use.
Signed-off-by: Aleksa Sarai <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/vfs/.gitignore | 1 +
tools/testing/selftests/vfs/Makefile | 13 ++
tools/testing/selftests/vfs/at_flags.h | 40 +++++
tools/testing/selftests/vfs/common.sh | 37 +++++
.../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
.../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
.../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
.../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
.../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
tools/testing/selftests/vfs/vfs_helper.c | 154 ++++++++++++++++++
11 files changed, 537 insertions(+)
create mode 100644 tools/testing/selftests/vfs/.gitignore
create mode 100644 tools/testing/selftests/vfs/Makefile
create mode 100644 tools/testing/selftests/vfs/at_flags.h
create mode 100644 tools/testing/selftests/vfs/common.sh
create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index f1fe492c8e17..6f814e49071f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -43,6 +43,7 @@ ifneq (1, $(quicktest))
TARGETS += timers
endif
TARGETS += user
+TARGETS += vfs
TARGETS += vm
TARGETS += x86
TARGETS += zram
diff --git a/tools/testing/selftests/vfs/.gitignore b/tools/testing/selftests/vfs/.gitignore
new file mode 100644
index 000000000000..c57ebcba14c0
--- /dev/null
+++ b/tools/testing/selftests/vfs/.gitignore
@@ -0,0 +1 @@
+/vfs_helper
diff --git a/tools/testing/selftests/vfs/Makefile b/tools/testing/selftests/vfs/Makefile
new file mode 100644
index 000000000000..8ca3cef43dc3
--- /dev/null
+++ b/tools/testing/selftests/vfs/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+# Makefile for mount selftests.
+CFLAGS = -Wall \
+ -O2 \
+ -I../../../../usr/include/
+
+TEST_PROGS := $(wildcard tests/*.sh)
+TEST_GEN_FILES := vfs_helper
+
+include ../lib.mk
diff --git a/tools/testing/selftests/vfs/at_flags.h b/tools/testing/selftests/vfs/at_flags.h
new file mode 100644
index 000000000000..a8ca8f689753
--- /dev/null
+++ b/tools/testing/selftests/vfs/at_flags.h
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018 SUSE LLC.
+ */
+
+#ifndef __AT_FLAGS_H__
+#define __AT_FLAGS_H__
+
+/* These come from <uapi/asm-generic/linux/fcntl.h> */
+#ifndef O_BENEATH
+# define O_BENEATH 00040000000
+# define O_XDEV 00100000000
+# define O_NOPROCLINKS 00200000000
+# define O_NOSYMLINKS 01000000000
+# define O_THISROOT 02000000000
+#endif
+#ifndef AT_BENEATH
+# define AT_BENEATH 0x8000
+# define AT_XDEV 0x10000
+# define AT_NO_PROCLINKS 0x20000
+# define AT_NO_SYMLINKS 0x40000
+# define AT_THIS_ROOT 0x80000
+#endif
+
+struct flag {
+ const char *name;
+ unsigned int at_flag, open_flag;
+};
+
+struct flag AT_FLAGS[] = {
+ { .name = "beneath", .at_flag = AT_BENEATH, .open_flag = O_BENEATH },
+ { .name = "xdev", .at_flag = AT_XDEV, .open_flag = O_XDEV },
+ { .name = "no_proclinks", .at_flag = AT_NO_PROCLINKS, .open_flag = O_NOPROCLINKS },
+ { .name = "no_symlinks", .at_flag = AT_NO_SYMLINKS, .open_flag = O_NOSYMLINKS },
+ { .name = "this_root", .at_flag = AT_THIS_ROOT, .open_flag = O_THISROOT },
+ { 0 }, /* terminate */
+};
+
+#endif /* !defined(__AT_FLAGS_H__) */
diff --git a/tools/testing/selftests/vfs/common.sh b/tools/testing/selftests/vfs/common.sh
new file mode 100644
index 000000000000..82ac8ad2a5a5
--- /dev/null
+++ b/tools/testing/selftests/vfs/common.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+set -e -o pipefail
+
+tmpdir="$(mktemp -d --tmpdir vfs_test.XXXXXX)"
+trap "rm -rf $tmpdir" EXIT
+
+root="$tmpdir/root"
+mkdir -p "$root"
+
+function fail() {
+ echo "# not ok" "$@"
+ exit 1
+}
+
+ksft_skip=4
+function skip() {
+ echo "# skip" "$@"
+ exit "$ksft_skip"
+}
+
+function run() {
+ local old_flags="$-"
+ set +eET
+ output="$("$@" 2>&1)"
+ status="$?"
+ set "-$old_flags"
+}
+
+testrootdir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+function vfs_helper() {
+ run "$testrootdir/vfs_helper" "$@"
+}
+vfs_ops=( "open" "stat" "lstat" )
diff --git a/tools/testing/selftests/vfs/tests/0001_at_beneath.sh b/tools/testing/selftests/vfs/tests/0001_at_beneath.sh
new file mode 100755
index 000000000000..9a03b0953032
--- /dev/null
+++ b/tools/testing/selftests/vfs/tests/0001_at_beneath.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+source "$sourcedir/../common.sh"
+
+touch "$root/inside"
+ln -s / "$root/rootlink"
+ln -s .. "$root/dotdot"
+ln -s "/../../../../../../$root" "$root/badlink"
+
+mkdir -p "$root/subdir"
+ln -s ../inside "$root/subdir/dotdotinside"
+ln -s ../subdir "$root/subdir/dotdotsubdir"
+ln -s subdir "$root/subdirlink"
+ln -s ../subdirlink/../../inside "$root/subdir/complexlink"
+
+for op in "${vfs_ops[@]}"
+do
+ vfs_helper -o "$op" -F beneath -d "$root" ..
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/].."
+
+ vfs_helper -o "$op" -F beneath -d "$root" ../root
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]../root"
+
+ vfs_helper -o "$op" -F beneath -d "$root" dotdot/root
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]dotdot(=..)/root"
+
+ vfs_helper -o "$op" -F beneath -d "$root" "$root"
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]/root"
+
+ vfs_helper -o "$op" -F beneath -d "$root" rootlink
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op beneath [/root/]rootlink(=/)"
+ else
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)"
+ fi
+
+ vfs_helper -o "$op" -F beneath -d "$root" rootlink/
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)/"
+
+ vfs_helper -o "$op" -F beneath -d "$root" "rootlink/$root"
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)/root"
+
+ vfs_helper -o "$op" -F beneath -d "$root" badlink
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op beneath [/root/]badlink(=/../.../root)"
+ else
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]badlink(=/../.../root)"
+ fi
+
+ vfs_helper -o "$op" -F beneath -d "$root" subdir/../inside
+ [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/../inside"
+
+ vfs_helper -o "$op" -F beneath -d "$root" subdir/dotdotinside
+ [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/dotdotinside(=../inside)"
+
+ vfs_helper -o "$op" -F beneath -d "$root" subdir/dotdotsubdir/
+ [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/dotdotsubdir(=../subdir)/"
+
+ vfs_helper -o "$op" -F beneath -d "$root" subdir/complexlink
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op beneath [/root/]complexlink(=../subdirlink/../../inside)"
+ else
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]complexlink(=../subdirlink/../../inside)"
+ fi
+done
diff --git a/tools/testing/selftests/vfs/tests/0002_at_xdev.sh b/tools/testing/selftests/vfs/tests/0002_at_xdev.sh
new file mode 100755
index 000000000000..06be58a8ffe7
--- /dev/null
+++ b/tools/testing/selftests/vfs/tests/0002_at_xdev.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+source "$sourcedir/../common.sh"
+
+( mountpoint -q "/tmp" ) || skip "/tmp is not a mountpoint"
+
+touch /tmp/foo
+
+ln -s /tmp "$root/link_tmp"
+
+for op in "${vfs_ops[@]}"
+do
+ vfs_helper -o "$op" -F xdev -d / tmp/
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/]tmp/"
+
+ vfs_helper -o "$op" -F xdev -d / tmp/foo
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/]tmp/foo"
+
+ vfs_helper -o "$op" -F xdev -d "$root" /tmp
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp"
+
+ vfs_helper -o "$op" -F xdev -d "$root" /tmp/
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp/"
+
+ vfs_helper -o "$op" -F xdev -d "$root" /tmp/foo
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp/foo"
+
+ vfs_helper -o "$op" -F xdev -d /tmp foo
+ [[ "$status" = 0 ]] || fail "$op xdev [/tmp/]foo"
+
+ vfs_helper -o "$op" -F xdev -d /tmp ..
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/].."
+
+ vfs_helper -o "$op" -F xdev -d /tmp ../
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/]../"
+
+ vfs_helper -o "$op" -F xdev -d /tmp ../tmp
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/]../tmp"
+
+ vfs_helper -o "$op" -F xdev -d "$root" link_tmp
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op xdev [/root/]link_tmp(=/tmp)"
+ else
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]link_tmp(=/tmp)"
+ fi
+
+ vfs_helper -o "$op" -F xdev -d "$root" link_tmp/
+ [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]link_tmp(=/tmp)/"
+done
diff --git a/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh b/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
new file mode 100755
index 000000000000..41d9655a1e46
--- /dev/null
+++ b/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+source "$sourcedir/../common.sh"
+
+[ -e "/proc/$$/cwd" ] || skip "/proc/$$/cwd doesn't exist"
+
+ln -s / "$root/testlink"
+
+for op in "${vfs_ops[@]}"
+do
+ for flags in {no_proclinks,no_symlinks,"no_proclinks,no_symlinks"}
+ do
+ vfs_helper -o "$op" -F "$flags" "/proc/$$/stat"
+ [[ "$status" = 0 ]] || fail "$op $flags /proc/$$/stat"
+
+ vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd"
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op $flags /proc/$$/cwd"
+ else
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd"
+ fi
+
+ vfs_helper -o "$op" -F "$flags" -d "$root" "testlink/"
+ if [[ "$flags" == "no_proclinks" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op $flags [/root/]testlink/"
+ else
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags [/root/]testlink/"
+ fi
+
+ vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd/"
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd/"
+
+ vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd/$BASH_SOURCE"
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd/$BASH_SOURCE"
+
+ vfs_helper -o "$op" -F "$flags" -d "/proc/self" cwd
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op $flags [/proc/self/]cwd"
+ else
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags [/proc/self]/cwd"
+ fi
+ done
+done
diff --git a/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh b/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
new file mode 100755
index 000000000000..f7ec7e37f06a
--- /dev/null
+++ b/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+source "$sourcedir/../common.sh"
+
+mkdir -p "$root/dir"
+touch "$root/foo"
+
+ln -s . "$root/link_dot"
+ln -s .. "$root/link_dotdot"
+ln -s foo "$root/link_foo"
+
+for op in "${vfs_ops[@]}"
+do
+ vfs_helper -o "$op" -F no_symlinks -d "$root" foo
+ [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]foo"
+
+ vfs_helper -o "$op" -F no_symlinks -d "$root" ../root/foo
+ [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]../root/foo"
+
+ vfs_helper -o "$op" -F no_symlinks -d "$root" link_foo
+ if [[ "$op" == "lstat" ]]
+ then
+ [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]link_foo(=foo)"
+ else
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_foo(=foo)"
+ fi
+
+ vfs_helper -o "$op" -F no_proclinks -d "$root" link_foo
+ [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_foo(=foo)"
+
+ vfs_helper -o "$op" -F no_symlinks -d "$root" link_dotdot/
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_dotdot(=..)/"
+
+ vfs_helper -o "$op" -F no_proclinks -d "$root" link_dotdot/
+ [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_dotdot(=..)/"
+
+ vfs_helper -o "$op" -F no_symlinks -d "$root" link_dot/dir
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_dot(=.)/dir"
+
+ vfs_helper -o "$op" -F no_proclinks -d "$root" link_dot/dir
+ [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_dot(=.)/dir"
+
+ vfs_helper -o "$op" -F no_symlinks -d "$root" ../root/link_dot/link_dotdot/root/dir
+ [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]../root/link_dot(=.)/link_dotdot(=..)/root/dir"
+done
diff --git a/tools/testing/selftests/vfs/tests/0005_at_this_root.sh b/tools/testing/selftests/vfs/tests/0005_at_this_root.sh
new file mode 100755
index 000000000000..aba23c28a7b7
--- /dev/null
+++ b/tools/testing/selftests/vfs/tests/0005_at_this_root.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+# Author: Aleksa Sarai <[email protected]>
+# Copyright (C) 2018 SUSE LLC.
+
+sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")"
+source "$sourcedir/../common.sh"
+
+mkdir -p "$root/var" "$root/etc" "$root/usr/bin" "$root/usr/local/bin"
+ln -s bash "$root/usr/bin/sh"
+ln -s ../../bin/bash "$root/usr/local/bin/bash"
+ln -s /bin/sh "$root/usr/local/bin/sh"
+ln -s ../bin3 "$root/var/bin"
+ln -s /usr/bin "$root/bin"
+ln -s /usr/local/bin "$root/bin4"
+ln -s ../../../../../../../../../bin "$root/bin2"
+ln -s /../../../../../../../../../bin "$root/bin3"
+touch "$root/etc/passwd" "$root/usr/bin/bash"
+
+# How should each path be mapped to a host path, in the form
+# 'path:hostpath[:hostpath_trailing]'. Everything is assumed to be ${root}
+# prefixed.
+host_mappings=(
+ # Basic paths.
+ "..:."
+ "/:."
+ "/../../../../../../:."
+ "../var/../../../../../etc/passwd:etc/passwd"
+ "/var/../../../../../etc/passwd:etc/passwd"
+ "/../../../../../../var/../../../../../etc/passwd:etc/passwd"
+ "etc/passwd:etc/passwd"
+ "/etc/passwd:etc/passwd"
+
+ # Basic symlink paths.
+ "/bin/bash:usr/bin/bash"
+ "/bin/sh:usr/bin/bash:usr/bin/sh"
+ "/bin2/bash:usr/bin/bash"
+ "/bin2/sh:usr/bin/bash:usr/bin/sh"
+ "/bin3/sh:usr/bin/bash:usr/bin/sh"
+ "/bin3/bash:usr/bin/bash"
+
+ # More complicated symlink paths.
+ "/bin4/../../local/bin/bash:usr/bin/bash:usr/local/bin/bash"
+ "/bin4/../../local/bin/sh:usr/bin/bash:usr/local/bin/sh"
+ "/bin4/../../../../../../../../../../usr/local/bin/bash:usr/bin/bash:usr/local/bin/bash"
+ "/bin4/../../../../../../../../../../usr/local/bin/sh:usr/bin/bash:usr/local/bin/sh"
+ "/bin/../../bin4/../../local/bin/bash:usr/bin/bash:usr/local/bin/bash"
+ "/bin/../../bin4/../../local/bin/sh:usr/bin/bash:usr/local/bin/sh"
+)
+
+for op in "${vfs_ops[@]}"
+do
+ for mapping in "${host_mappings[@]}"
+ do
+ IFS=":" read path hostpath hostpath_trailing <<< "$mapping"
+ [[ "$hostpath_trailing" ]] || export hostpath_trailing="$hostpath"
+ [[ "$op" == "lstat" ]] && export hostpath="$hostpath_trailing"
+
+ # Compare with and without this_root...
+ vfs_helper -o "$op" -d "$root" "$hostpath"
+ old_status="$status" old_output="$output"
+ vfs_helper -o "$op" -F this_root -d "$root" "$path"
+ [[ "$status" = "$old_status" ]] || fail "$op this_root $path=$status neq $old_status"
+ [[ "$output" == "$old_output" ]] || fail "$op this_root $path=$output neq $old_output"
+ done
+done
diff --git a/tools/testing/selftests/vfs/vfs_helper.c b/tools/testing/selftests/vfs/vfs_helper.c
new file mode 100644
index 000000000000..d67ec74a3fca
--- /dev/null
+++ b/tools/testing/selftests/vfs/vfs_helper.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "at_flags.h"
+#include "../kselftest.h"
+
+#define bail(...) \
+ do { \
+ fprintf(stderr, __VA_ARGS__); \
+ fputs("\n", stderr); \
+ exit(1); \
+ } while (0)
+
+extern char *__progname;
+#define usage() \
+ bail("usage: %s -o {open|stat|lstat} [-d <dirfd-path>] " \
+ "[-F [<flag1>,...<flagN>] <path>", __progname)
+
+static unsigned int parse_at_flags(char *opts)
+{
+ char *opt, *saveptr = NULL;
+ unsigned int flags = 0;
+
+ opt = strtok_r(opts, ",", &saveptr);
+ do {
+ unsigned int found = 0;
+
+ if (!*opt)
+ continue;
+ for (struct flag *flag = AT_FLAGS; flag->name != NULL; flag++) {
+ if (!strcmp(opt, flag->name))
+ found |= flag->at_flag;
+ }
+ if (!found)
+ bail("unknown openat(2) flag: %s", opt);
+ flags |= found;
+ } while ((opt = strtok_r(NULL, ",", &saveptr)) != NULL);
+
+ return flags;
+}
+
+int stat_wrapper(int dirfd, const char *pathname, unsigned int flags)
+{
+ struct stat st = {0};
+ int err;
+
+ err = fstatat(dirfd, pathname, &st, flags);
+ if (err < 0)
+ return err;
+
+ printf("%lu:%lu\n", st.st_dev, st.st_ino);
+ return 0;
+}
+
+int lstat_wrapper(int dirfd, const char *pathname, unsigned int flags)
+{
+ return stat_wrapper(dirfd, pathname, flags | AT_SYMLINK_NOFOLLOW);
+}
+
+int openat_wrapper(int dirfd, const char *pathname, unsigned int flags)
+{
+ int fd;
+ char *fdpath = NULL, fullpath[PATH_MAX] = {0};
+
+ fd = openat(dirfd, pathname, flags);
+ if (fd < 0)
+ return fd;
+
+ /* Print the fully-qualified path using /proc/pid/fd/... */
+ if (asprintf(&fdpath, "/proc/self/fd/%d", fd) < 0)
+ bail("asprintf /proc/self/fd/%d: %m", fd);
+ if (readlink(fdpath, fullpath, PATH_MAX) < 0)
+ bail("readlink %s: %m", fdpath);
+ puts(fullpath);
+ return fd;
+}
+
+int main(int argc, char **argv)
+{
+ int opt, ret, dirfd;
+ unsigned int flags = 0;
+ char *opstr = NULL, *dir_path = NULL, *path = NULL;
+ int (*opfunc)(int dirfd, const char *pathname, unsigned int flags);
+
+ while ((opt = getopt(argc, argv, "o:d:F:")) != -1) {
+ switch (opt) {
+ case 'o':
+ opstr = optarg;
+ break;
+ case 'd':
+ dir_path = optarg;
+ break;
+ case 'F':
+ flags |= parse_at_flags(optarg);
+ break;
+ default:
+ usage();
+ }
+ }
+
+ argc -= optind;
+ argv += optind;
+
+ if (argc != 1)
+ usage();
+ path = argv[0];
+
+ if (!opstr)
+ usage();
+ else if (!strcmp(opstr, "stat"))
+ opfunc = stat_wrapper;
+ else if (!strcmp(opstr, "lstat"))
+ opfunc = lstat_wrapper;
+ else if (!strcmp(opstr, "open"))
+ opfunc = openat_wrapper;
+ else
+ usage();
+
+ if (opfunc == openat_wrapper) {
+ unsigned int open_flags = 0;
+
+ for (struct flag *flag = AT_FLAGS; flag->name != NULL; flag++) {
+ if (flags & flag->at_flag)
+ open_flags |= flag->open_flag;
+ }
+ flags = open_flags;
+ }
+
+ dirfd = AT_FDCWD;
+ if (dir_path) {
+ dirfd = open(dir_path, O_PATH|O_DIRECTORY);
+ if (dirfd < 0)
+ bail("cannot open dir_path: %m");
+ }
+
+ ret = opfunc(dirfd, path, flags);
+ if (ret < 0)
+ ret = -errno;
+ return (ret < 0) ? -ret : 0;
+}
--
2.19.0
> On Sep 29, 2018, at 3:34 AM, Aleksa Sarai <[email protected]> wrote:
>
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
> * Path-based escapes from the starting-point using "/" or ".." are
> blocked by AT_BENEATH.
Seems useful.
> * Mountpoint crossings are blocked by AT_XDEV.
Seems useful.
> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> correctly it actually blocks any user of nd_jump_link() because it
> allows out-of-VFS path resolution manipulation).
>
So how do I disable following symlinks? ISTM the most natural way would be to have AT_NO_SYMLINKS, and to have that flag disable proc links.
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
> * Path-based escapes from the starting-point using "/" or ".." are
> blocked by AT_BENEATH.
> * Mountpoint crossings are blocked by AT_XDEV.
> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> correctly it actually blocks any user of nd_jump_link() because it
> allows out-of-VFS path resolution manipulation).
>
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
>
> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).
Imho, these flags are very much needed and they all are pretty useful
not just for container runtimes but in general.
>
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.
We should really make sure that we can't make due with openat() alone
before adding a bunch of new syscalls. So there's no need to rush into
this. :)
>
> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
>
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
>
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
>
> Aleksa Sarai (3):
> namei: implement O_BENEATH-style AT_* flags
> namei: implement AT_THIS_ROOT chroot-like path resolution
> selftests: vfs: add AT_* path resolution tests
>
> fs/fcntl.c | 2 +-
> fs/namei.c | 158 ++++++++++++------
> fs/open.c | 10 ++
> fs/stat.c | 15 +-
> include/linux/fcntl.h | 3 +-
> include/linux/namei.h | 8 +
> include/uapi/asm-generic/fcntl.h | 20 +++
> include/uapi/linux/fcntl.h | 10 ++
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/vfs/.gitignore | 1 +
> tools/testing/selftests/vfs/Makefile | 13 ++
> tools/testing/selftests/vfs/at_flags.h | 40 +++++
> tools/testing/selftests/vfs/common.sh | 37 ++++
> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
> tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
> 19 files changed, 707 insertions(+), 56 deletions(-)
> create mode 100644 tools/testing/selftests/vfs/.gitignore
> create mode 100644 tools/testing/selftests/vfs/Makefile
> create mode 100644 tools/testing/selftests/vfs/at_flags.h
> create mode 100644 tools/testing/selftests/vfs/common.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>
> --
> 2.19.0
>
On Sat, Sep 29, 2018 at 08:34:51PM +1000, Aleksa Sarai wrote:
> Add the following flags for path resolution. The primary justification
> for these flags is to allow for programs to be far more strict about how
> they want path resolution to handle symlinks, mountpoint crossings, and
> paths that escape the dirfd (through an absolute path or ".."
> shenanigans).
>
> This is of particular concern to container runtimes that want to be very
> careful about malicious root filesystems that a container's init might
> have screwed around with (and there is no real way to protect against
> this in userspace if you consider potential races against a malicious
> container's init).
>
> * AT_BENEATH: Disallow ".." or absolute paths (either in the path or
> found during symlink resolution) to escape the starting point of name
> resolution, though ".." is permitted in cases like "foo/../bar".
> Relative symlinks are still allowed (as long as they don't escape the
> starting point).
>
> * AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
> from one). The primary "scoping" use is to blocking resolution that
> crosses a bind-mount, which has a similar property to a symlink (in
> the way that it allows for escape from the starting-point). Since it
> is not possible to differentiate bind-mounts However since
> bind-mounting requires privileges (in ways symlinks don't) this has
> been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
> (though find(1) doesn't walk upwards, the semantics seem obvious).
>
> * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> specific restriction, and it exists because /proc/$pid/fd/...
> "symlinks" allow for access outside nd->root and pose risk to
> container runtimes that don't want to be tricked into accessing a host
> path (but do want to allow no-funny-business symlink resolution).
>
> * AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
> AT_NO_PROCLINK (obviously).
>
> The AT_NO_*LINK flags return -ELOOP if path resolution would violates
> their requirement, while the others all return -EXDEV. Currently these
> are only enabled for the stat(2) family and the openat(2) family (the
> latter has its own brand of O_* flags with the same semantics). Ideally
> these flags would be supported by all *at(2) syscalls, but this will
> require adding flags arguments to many of them (and will be done in a
> separate patchset).
>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Eric Biederman <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
Not to be a stickler about protocol but given that this is based heavily
on ideas from prior patchsets and suggestions as you mentioned it might
make sense to cite them as links and also maybe add some Suggested-by
lines for some of the authors for the sake of posterity. :)
[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/723057/
> ---
> fs/fcntl.c | 2 +-
> fs/namei.c | 61 ++++++++++++++++++++++++++------
> fs/open.c | 8 +++++
> fs/stat.c | 13 +++++--
> include/linux/fcntl.h | 3 +-
> include/linux/namei.h | 7 ++++
> include/uapi/asm-generic/fcntl.h | 17 +++++++++
> include/uapi/linux/fcntl.h | 8 +++++
> 8 files changed, 104 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 4137d96534a6..e343618736f7 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> __FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/namei.c b/fs/namei.c
> index fb913148d4d1..757dd783771c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -859,6 +859,8 @@ static int nd_jump_root(struct nameidata *nd)
> path_get(&nd->path);
> nd->inode = nd->path.dentry->d_inode;
> }
> + if (unlikely(nd->flags & LOOKUP_BENEATH))
> + return -EXDEV;
> nd->flags |= LOOKUP_JUMPED;
> return 0;
> }
> @@ -1083,14 +1085,19 @@ const char *get_link(struct nameidata *nd)
> } else {
> res = get(dentry, inode, &last->done);
> }
> + /* If we just jumped it was because of a procfs-style link. */
> + if (unlikely(nd->flags & LOOKUP_JUMPED) &&
> + unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
> + return ERR_PTR(-ELOOP);
> if (IS_ERR_OR_NULL(res))
> return res;
> }
> if (*res == '/') {
> if (!nd->root.mnt)
> set_root(nd);
> - if (unlikely(nd_jump_root(nd)))
> - return ERR_PTR(-ECHILD);
> + error = nd_jump_root(nd);
> + if (unlikely(error))
> + return ERR_PTR(error);
> while (unlikely(*++res == '/'))
> ;
> }
> @@ -1271,12 +1278,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
> break;
> }
>
> - if (need_mntput && path->mnt == mnt)
> - mntput(path->mnt);
> + if (need_mntput) {
> + if (path->mnt == mnt)
> + mntput(path->mnt);
> + if (unlikely(nd->flags & LOOKUP_XDEV))
> + ret = -EXDEV;
> + else
> + nd->flags |= LOOKUP_JUMPED;
> + }
> if (ret == -EISDIR || !ret)
> ret = 1;
> - if (need_mntput)
> - nd->flags |= LOOKUP_JUMPED;
> if (unlikely(ret < 0))
> path_put_conditional(path, nd);
> return ret;
> @@ -1333,6 +1344,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
> mounted = __lookup_mnt(path->mnt, path->dentry);
> if (!mounted)
> break;
> + if (unlikely(nd->flags & LOOKUP_XDEV))
> + return false;
> path->mnt = &mounted->mnt;
> path->dentry = mounted->mnt.mnt_root;
> nd->flags |= LOOKUP_JUMPED;
> @@ -1353,8 +1366,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> struct inode *inode = nd->inode;
>
> while (1) {
> - if (path_equal(&nd->path, &nd->root))
> + if (path_equal(&nd->path, &nd->root)) {
> + if (unlikely(nd->flags & LOOKUP_BENEATH))
> + return -EXDEV;
> break;
> + }
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> struct dentry *old = nd->path.dentry;
> struct dentry *parent = old->d_parent;
> @@ -1379,6 +1395,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> return -ECHILD;
> if (&mparent->mnt == nd->path.mnt)
> break;
> + if (unlikely(nd->flags & LOOKUP_XDEV))
> + return -EXDEV;
> /* we know that mountpoint was pinned */
> nd->path.dentry = mountpoint;
> nd->path.mnt = &mparent->mnt;
> @@ -1481,8 +1499,11 @@ static int path_parent_directory(struct path *path)
> static int follow_dotdot(struct nameidata *nd)
> {
> while(1) {
> - if (path_equal(&nd->path, &nd->root))
> + if (path_equal(&nd->path, &nd->root)) {
> + if (unlikely(nd->flags & LOOKUP_BENEATH))
> + return -EXDEV;
> break;
> + }
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> int ret = path_parent_directory(&nd->path);
> if (ret)
> @@ -1491,6 +1512,8 @@ static int follow_dotdot(struct nameidata *nd)
> }
> if (!follow_up(&nd->path))
> break;
> + if (unlikely(nd->flags & LOOKUP_XDEV))
> + return -EXDEV;
> }
> follow_mount(&nd->path);
> nd->inode = nd->path.dentry->d_inode;
> @@ -1720,6 +1743,8 @@ static int pick_link(struct nameidata *nd, struct path *link,
> {
> int error;
> struct saved *last;
> + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
> + return -ELOOP;
> if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
> path_to_nameidata(link, nd);
> return -ELOOP;
> @@ -2175,6 +2200,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>
> if (!*s)
> flags &= ~LOOKUP_RCU;
> + if (flags & LOOKUP_NO_SYMLINKS)
> + flags |= LOOKUP_NO_PROCLINKS;
> if (flags & LOOKUP_RCU)
> rcu_read_lock();
>
> @@ -2204,10 +2231,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>
> nd->m_seq = read_seqbegin(&mount_lock);
> if (*s == '/') {
> + int error;
> set_root(nd);
> - if (likely(!nd_jump_root(nd)))
> - return s;
> - return ERR_PTR(-ECHILD);
> + error = nd_jump_root(nd);
> + if (unlikely(error))
> + s = ERR_PTR(error);
> + return s;
> } else if (nd->dfd == AT_FDCWD) {
> if (flags & LOOKUP_RCU) {
> struct fs_struct *fs = current->fs;
> @@ -2223,6 +2252,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> get_fs_pwd(current->fs, &nd->path);
> nd->inode = nd->path.dentry->d_inode;
> }
> + if (unlikely(flags & LOOKUP_BENEATH)) {
> + nd->root = nd->path;
> + if (!(flags & LOOKUP_RCU))
> + path_get(&nd->root);
> + }
> return s;
> } else {
> /* Caller must check execute permissions on the starting path component */
> @@ -2247,6 +2281,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> path_get(&nd->path);
> nd->inode = nd->path.dentry->d_inode;
> }
> + if (unlikely(flags & LOOKUP_BENEATH)) {
> + nd->root = nd->path;
> + if (!(flags & LOOKUP_RCU))
> + path_get(&nd->root);
> + }
> fdput(f);
> return s;
> }
> diff --git a/fs/open.c b/fs/open.c
> index 0285ce7dbd51..80f5f566a5ff 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
> lookup_flags |= LOOKUP_DIRECTORY;
> if (!(flags & O_NOFOLLOW))
> lookup_flags |= LOOKUP_FOLLOW;
> + if (flags & O_BENEATH)
> + lookup_flags |= LOOKUP_BENEATH;
> + if (flags & O_XDEV)
> + lookup_flags |= LOOKUP_XDEV;
> + if (flags & O_NOPROCLINKS)
> + lookup_flags |= LOOKUP_NO_PROCLINKS;
> + if (flags & O_NOSYMLINKS)
> + lookup_flags |= LOOKUP_NO_SYMLINKS;
> op->lookup_flags = lookup_flags;
> return 0;
> }
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..791e61b916ae 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -170,8 +170,9 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
> int error = -EINVAL;
> unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
>
> - if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
> - AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
> + if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
> + KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
> + AT_NO_PROCLINKS | AT_NO_SYMLINKS))
> return -EINVAL;
>
> if (flags & AT_SYMLINK_NOFOLLOW)
> @@ -180,6 +181,14 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
> lookup_flags &= ~LOOKUP_AUTOMOUNT;
> if (flags & AT_EMPTY_PATH)
> lookup_flags |= LOOKUP_EMPTY;
> + if (flags & AT_BENEATH)
> + lookup_flags |= LOOKUP_BENEATH;
> + if (flags & AT_XDEV)
> + lookup_flags |= LOOKUP_XDEV;
> + if (flags & AT_NO_PROCLINKS)
> + lookup_flags |= LOOKUP_NO_PROCLINKS;
> + if (flags & AT_NO_SYMLINKS)
> + lookup_flags |= LOOKUP_NO_SYMLINKS;
>
> retry:
> error = user_path_at(dfd, filename, lookup_flags, &path);
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 27dc7a60693e..ad5bba4b5b12 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -9,7 +9,8 @@
> (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
> + O_NOPROCLINKS | O_NOSYMLINKS)
>
> #ifndef force_o_largefile
> #define force_o_largefile() (BITS_PER_LONG != 32)
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index a78606e8e3df..5ff7f3362d1b 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
> #define LOOKUP_EMPTY 0x4000
> #define LOOKUP_DOWN 0x8000
>
> +/* Scoping flags for lookup. */
> +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */
> +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */
> +#define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
> +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
> + Implies LOOKUP_NO_PROCLINKS. */
> +
> extern int path_pts(struct path *path);
>
> extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..c2bf5983e46a 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -97,6 +97,23 @@
> #define O_NDELAY O_NONBLOCK
> #endif
>
> +/*
> + * These are identical to their AT_* counterparts (which affect the entireity
> + * of path resolution).
> + */
> +#ifndef O_BENEATH
> +#define O_BENEATH 00040000000 /* *Not* the same as capsicum's O_BENEATH! */
> +#endif
> +#ifndef O_XDEV
> +#define O_XDEV 00100000000
> +#endif
> +#ifndef O_NOPROCLINKS
> +#define O_NOPROCLINKS 00200000000
> +#endif
> +#ifndef O_NOSYMLINKS
> +#define O_NOSYMLINKS 01000000000
> +#endif
> +
> #define F_DUPFD 0 /* dup */
> #define F_GETFD 1 /* get close_on_exec */
> #define F_SETFD 2 /* set/clear close_on_exec */
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 594b85f7cb86..551a9e2166a8 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -92,5 +92,13 @@
>
> #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
>
> +/* Flags which affect path *resolution*, not just last-component handling. */
> +#define AT_BENEATH 0x10000 /* No absolute paths or ".." escaping
> + (in-path or through symlinks) */
> +#define AT_XDEV 0x20000 /* No mountpoint crossing. */
> +#define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */
> +#define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
> + Implies AT_NO_PROCLINKS. */
> +
>
> #endif /* _UAPI_LINUX_FCNTL_H */
> --
> 2.19.0
>
On 2018-09-29, Christian Brauner <[email protected]> wrote:
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Eric Biederman <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Signed-off-by: Aleksa Sarai <[email protected]>
>
> Not to be a stickler about protocol but given that this is based heavily
> on ideas from prior patchsets and suggestions as you mentioned it might
> make sense to cite them as links and also maybe add some Suggested-by
> lines for some of the authors for the sake of posterity. :)
Oops -- yup, I will fix that up.
> [1]: https://lwn.net/Articles/721443/
> [2]: https://lwn.net/Articles/723057/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2018-09-29, Andy Lutomirski <[email protected]> wrote:
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> >
> > * Path-based escapes from the starting-point using "/" or ".." are
> > blocked by AT_BENEATH.
>
> Seems useful.
>
> > * Mountpoint crossings are blocked by AT_XDEV.
>
> Seems useful.
>
> > * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> > correctly it actually blocks any user of nd_jump_link() because it
> > allows out-of-VFS path resolution manipulation).
> >
>
> So how do I disable following symlinks? ISTM the most natural way
> would be to have AT_NO_SYMLINKS, and to have that flag disable proc
> links.
So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.
* AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested
in the original thread[2] -- apparently this is something that would
be useful to git even if wouldn't violate AT_BENEATH). This implies
AT_NO_PROCLINKS.
* AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem
"symlinks" that call nd_jump_link() themselves -- currently only
procfs and nsfs).
The reason why we need AT_NO_PROCLINKS is that "proclinks"[*] allow for
breaking-out of nd->root without a trivial way of detecting it (since
the filesystem can manipulate nd->path almost arbitrarily outside of the
control of VFS). Al Viro's original patchset[1] also blocked these but
it was all included within AT_NO_JUMPS.
Requiring you to block *all* symlinks in order to block "proclinks"
seems to be a bit overkill to me (especially if consider that
AT_THIS_ROOT|AT_NO_PROCLINKS is definitely a usecase most container
runtimes would be _very_ interested in -- while AT_NO_SYMLINKS will
cause issues with most distribution images).
[*]: Sorry for the awful naming, I'm not sure what the correct name is
(I've called them "super symlinks" in the past) -- if you have a
better name please let me know!
[1]: https://lwn.net/Articles/721443/
[2]: https://marc.info/?l=linux-kernel&m=149394765324531&w=2
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> On Sep 29, 2018, at 8:45 AM, Aleksa Sarai <[email protected]> wrote:
>
> On 2018-09-29, Andy Lutomirski <[email protected]> wrote:
>>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
>>> in the original thread, along with a further split of AT_NO_PROCLINKS
>>> which means that each individual property of AT_NO_JUMPS is now a
>>> separate flag:
>>>
>>> * Path-based escapes from the starting-point using "/" or ".." are
>>> blocked by AT_BENEATH.
>>
>> Seems useful.
>>
>>> * Mountpoint crossings are blocked by AT_XDEV.
>>
>> Seems useful.
>>
>>> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>>> correctly it actually blocks any user of nd_jump_link() because it
>>> allows out-of-VFS path resolution manipulation).
>>>
>>
>> So how do I disable following symlinks? ISTM the most natural way
>> would be to have AT_NO_SYMLINKS, and to have that flag disable proc
>> links.
>
> So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.
And AT_THIS_ROOT, which is neat. Want to update your cover letter to include all of this? Or at I just reading the wrong thing?
>
> * AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested
> in the original thread[2] -- apparently this is something that would
> be useful to git even if wouldn't violate AT_BENEATH). This implies
> AT_NO_PROCLINKS.
>
> * AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem
> "symlinks" that call nd_jump_link() themselves -- currently only
> procfs and nsfs).
>
Hmm. I’m not sure that blocking nsfs links is always what the container runtime wants, but the overall concept sounds quite useful. Maybe call it AT_NO_TELEPORT? Or AT_NO_MAGIC_LINKS?
Also, as a perhaps-silly suggestion: if you end up adding a new syscall, I can see a use for a mode that does the path walk but, rather than failing on a disallowed link, stops early and indicates where it stopped. Then web servers, samba, etc can more efficiently implement custom behavior when links are encountered. And it may also be useful to have a variant of AT_THIS_ROOT where trying to escape is an error instead of having it just get stuck at the root.
+cc linux-api; please keep them in CC for future versions of the patch
On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <[email protected]> wrote:
> The primary motivation for the need for this flag is container runtimes
> which have to interact with malicious root filesystems in the host
> namespaces. One of the first requirements for a container runtime to be
> secure against a malicious rootfs is that they correctly scope symlinks
> (that is, they should be scoped as though they are chroot(2)ed into the
> container's rootfs) and ".."-style paths. The already-existing AT_XDEV
> and AT_NO_PROCLINKS help defend against other potential attacks in a
> malicious rootfs scenario.
So, I really like the concept for patch 1 of this series (but haven't
read the code yet); but I dislike this patch because of its footgun
potential.
If this patch landed as-is, the manpage would need some big warning
labels. chroot() basically provides no security guarantees at all; and
yes, that includes that if you do `chroot(...); chdir("/");
open(attacker_controlled_path, ...);`, you can potentially end up
opening a file outside the chroot. See
https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt
for an example, where Qubes OS did pretty much that, and ended up with
a potentially exploitable security bug because of that, where one VM,
while performing a file transfer into another VM, could write outside
of the transfer target directory.
The problem is what happens if a folder you are walking through is
concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B".
Something else concurrently moves /A/B/C to /A/C. This can result in
the following:
1. You start the path walk and reach /A/B/C.
2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
3. Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root,
so you don't notice.
4. Your path walk follows the second ".." up to /. Again, this is
outside the process root, but you don't notice.
5. Your path walk walks down to /etc/passwd, and the open completes
successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory,
this of course means that you lose instantly. If you point the root of
the walk at a directory out of which a process in the container
wouldn't be able to move the file, you're probably kinda mostly fine -
as long as you know, for certain, that nothing else on the system
would ever do that. But I still wouldn't feel good about that.
(Yes, this means that if you run an SFTP server with OpenSSH's
ChrootDirectory directive, you have to be very careful about these
things.)
I believe that the only way to robustly use this would be to point the
dirfd at a mount point, such that you know that being moved out of the
chroot is impossible because the mount point limits movement of
directories under it. (Well, technically, it doesn't, but it ensures
that if a directory does dangerously move away, the syscall fails.) It
might make sense to hardcode this constraint in the implementation of
AT_THIS_ROOT, to keep people from shooting themselves in the foot.
> Currently most container runtimes try to do this resolution in
> userspace[1], causing many potential race conditions. In addition, the
> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> requires a fork+exec which is *very* costly if necessary for every
> filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need
a full fork() either, clone() lets you do this with some process parts
shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
the file descriptor table shared. And why chroot()/pivot_root(),
wouldn't you want to use setns()? I think something like this should
work (except that you should add some error handling - omitted here
because I'm lazy), assuming that the container runtime does NOT have
CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of
course, this is entirely untested, and probably won't compile because
I screwed something up. :P But you should get the idea...
// Ensure that we are non-dumpable. Together with
// commit bfedb589252c, this ensures that container root
// can't trace our child once it enters the container.
// My patch
// https://lore.kernel.org/lkml/[email protected]/
// would make this unnecessary, but that patch didn't
// land because Eric nacked it (for political reasons,
// because people incorrectly claimed that this was a
// security fix):
// https://lore.kernel.org/lkml/[email protected]/
// Note that dumpability is per-mm, not per-process,
// so this hack has the unfortunate side effect of preventing
// unprivileged debugging of the container runtime.
// Oh well.
prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE);
// Inform gcc that this particular syscall will effectively
// return twice, just like vfork() or setjmp().
__attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall;
int result_fd = -1;
// CLONE_FILES means we don't need to do fd passing,
// we share the file descriptor table.
// CLONE_VM means we don't have the cost of duplicating
// our VMAs and page tables, and we don't have to mark
// all our pagetable entries as readonly for copy-on-write.
// CLONE_VFORK is a dirty hack to avoid having to
// allocate a child stack.
// Lack of SIGCHLD means we don't want to have to wait()
// for the child.
int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK,
0, 0, 0, 0);
if (child_pid == 0) {
// Enter the container's user namespace; this allows us
// to afterwards join its mount namespace even if we're
// not capable in the init namespace.
// (I believe that it should be possible to change the kernel
// such that this is not required if you have set the
// no-new-privs flag.)
setns(container_user_ns_fd, CLONE_NEWUSER);
// Entering the filesystem namespace automatically
// moves us to that namespace's filesystem root.
setns(container_fs_ns_fd, CLONE_NEWNS);
result_fd = open(untrusted_container_path, ...);
syscall(__NR_exit, 0);
}
> The most significant change in semantics with AT_THIS_ROOT is that
> *at(2) syscalls now no longer have the property that an absolute
> pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is
> specified). The reasoning behind this is that AT_THIS_ROOT necessarily
> has to chroot-scope symlinks with absolute paths to dirfd, and so doing
> it for the base path seems to be the most consistent behaviour (and also
> avoids foot-gunning users who want to chroot-scope paths that might be
> absolute).
>
> Currently this is only enabled for the stat(2) and openat(2) family (the
> latter has its own flag O_THISROOT with the same semantics). Ideally
> this flag would be supported by all *at(2) syscalls, but this will
> require adding flags arguments to many of them (and will be done in a
> separate patchset).
>
> [1]: https://github.com/cyphar/filepath-securejoin
>
> Cc: Eric Biederman <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> fs/fcntl.c | 2 +-
> fs/namei.c | 121 +++++++++++++++++--------------
> fs/open.c | 2 +
> fs/stat.c | 4 +-
> include/linux/fcntl.h | 2 +-
> include/linux/namei.h | 1 +
> include/uapi/asm-generic/fcntl.h | 3 +
> include/uapi/linux/fcntl.h | 2 +
> 8 files changed, 81 insertions(+), 56 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index e343618736f7..4c36c5b9fdb9 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> __FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/namei.c b/fs/namei.c
> index 757dd783771c..1b984f0dbbb4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd)
> }
> }
>
> +/*
> + * Configure nd->path based on the nd->dfd. This is only used as part of
> + * path_init().
> + */
> +static inline int dirfd_path_init(struct nameidata *nd)
> +{
> + if (nd->dfd == AT_FDCWD) {
> + if (nd->flags & LOOKUP_RCU) {
> + struct fs_struct *fs = current->fs;
> + unsigned seq;
> +
> + do {
> + seq = read_seqcount_begin(&fs->seq);
> + nd->path = fs->pwd;
> + nd->inode = nd->path.dentry->d_inode;
> + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> + } while (read_seqcount_retry(&fs->seq, seq));
> + } else {
> + get_fs_pwd(current->fs, &nd->path);
> + nd->inode = nd->path.dentry->d_inode;
> + }
> + } else {
> + /* Caller must check execute permissions on the starting path component */
> + struct fd f = fdget_raw(nd->dfd);
> + struct dentry *dentry;
> +
> + if (!f.file)
> + return -EBADF;
> +
> + dentry = f.file->f_path.dentry;
> +
> + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
> + fdput(f);
> + return -ENOTDIR;
> + }
> +
> + nd->path = f.file->f_path;
> + if (nd->flags & LOOKUP_RCU) {
> + nd->inode = nd->path.dentry->d_inode;
> + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> + } else {
> + path_get(&nd->path);
> + nd->inode = nd->path.dentry->d_inode;
> + }
> + fdput(f);
> + }
> + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) {
> + nd->root = nd->path;
> + if (!(nd->flags & LOOKUP_RCU))
> + path_get(&nd->root);
> + }
> + return 0;
> +}
> +
> /* must be paired with terminate_walk() */
> static const char *path_init(struct nameidata *nd, unsigned flags)
> {
> + int error;
> const char *s = nd->name->name;
>
> if (!*s)
> @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> nd->path.dentry = NULL;
>
> nd->m_seq = read_seqbegin(&mount_lock);
> + if (unlikely(flags & LOOKUP_CHROOT)) {
> + error = dirfd_path_init(nd);
> + if (unlikely(error))
> + return ERR_PTR(error);
> + }
> if (*s == '/') {
> - int error;
> - set_root(nd);
> + if (likely(!nd->root.mnt))
> + set_root(nd);
> error = nd_jump_root(nd);
> if (unlikely(error))
> s = ERR_PTR(error);
> return s;
> - } else if (nd->dfd == AT_FDCWD) {
> - if (flags & LOOKUP_RCU) {
> - struct fs_struct *fs = current->fs;
> - unsigned seq;
> -
> - do {
> - seq = read_seqcount_begin(&fs->seq);
> - nd->path = fs->pwd;
> - nd->inode = nd->path.dentry->d_inode;
> - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> - } while (read_seqcount_retry(&fs->seq, seq));
> - } else {
> - get_fs_pwd(current->fs, &nd->path);
> - nd->inode = nd->path.dentry->d_inode;
> - }
> - if (unlikely(flags & LOOKUP_BENEATH)) {
> - nd->root = nd->path;
> - if (!(flags & LOOKUP_RCU))
> - path_get(&nd->root);
> - }
> - return s;
> - } else {
> - /* Caller must check execute permissions on the starting path component */
> - struct fd f = fdget_raw(nd->dfd);
> - struct dentry *dentry;
> -
> - if (!f.file)
> - return ERR_PTR(-EBADF);
> -
> - dentry = f.file->f_path.dentry;
> -
> - if (*s && unlikely(!d_can_lookup(dentry))) {
> - fdput(f);
> - return ERR_PTR(-ENOTDIR);
> - }
> -
> - nd->path = f.file->f_path;
> - if (flags & LOOKUP_RCU) {
> - nd->inode = nd->path.dentry->d_inode;
> - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> - } else {
> - path_get(&nd->path);
> - nd->inode = nd->path.dentry->d_inode;
> - }
> - if (unlikely(flags & LOOKUP_BENEATH)) {
> - nd->root = nd->path;
> - if (!(flags & LOOKUP_RCU))
> - path_get(&nd->root);
> - }
> - fdput(f);
> - return s;
> }
> + if (likely(!nd->path.mnt)) {
> + error = dirfd_path_init(nd);
> + if (unlikely(error))
> + return ERR_PTR(error);
> + }
> + return s;
> }
>
> static const char *trailing_symlink(struct nameidata *nd)
> diff --git a/fs/open.c b/fs/open.c
> index 80f5f566a5ff..81d148f626cd 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
> lookup_flags |= LOOKUP_NO_PROCLINKS;
> if (flags & O_NOSYMLINKS)
> lookup_flags |= LOOKUP_NO_SYMLINKS;
> + if (flags & O_THISROOT)
> + lookup_flags |= LOOKUP_CHROOT;
> op->lookup_flags = lookup_flags;
> return 0;
> }
> diff --git a/fs/stat.c b/fs/stat.c
> index 791e61b916ae..e8366e4812c3 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
>
> if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
> KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
> - AT_NO_PROCLINKS | AT_NO_SYMLINKS))
> + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT))
> return -EINVAL;
>
> if (flags & AT_SYMLINK_NOFOLLOW)
> @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
> lookup_flags |= LOOKUP_NO_PROCLINKS;
> if (flags & AT_NO_SYMLINKS)
> lookup_flags |= LOOKUP_NO_SYMLINKS;
> + if (flags & AT_THIS_ROOT)
> + lookup_flags |= LOOKUP_CHROOT;
>
> retry:
> error = user_path_at(dfd, filename, lookup_flags, &path);
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index ad5bba4b5b12..95480cd4c09d 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
> O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
> - O_NOPROCLINKS | O_NOSYMLINKS)
> + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
>
> #ifndef force_o_largefile
> #define force_o_largefile() (BITS_PER_LONG != 32)
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 5ff7f3362d1b..7ec9e2d84649 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
> #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
> #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
> Implies LOOKUP_NO_PROCLINKS. */
> +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
>
> extern int path_pts(struct path *path);
>
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index c2bf5983e46a..11206b0e927c 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -113,6 +113,9 @@
> #ifndef O_NOSYMLINKS
> #define O_NOSYMLINKS 01000000000
> #endif
> +#ifndef O_THISROOT
> +#define O_THISROOT 02000000000
> +#endif
>
> #define F_DUPFD 0 /* dup */
> #define F_GETFD 1 /* get close_on_exec */
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 551a9e2166a8..ea978457b68f 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -99,6 +99,8 @@
> #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */
> #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
> Implies AT_NO_PROCLINKS. */
> +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though
> + it is chroot-ed into dirfd. */
>
>
> #endif /* _UAPI_LINUX_FCNTL_H */
> --
> 2.19.0
>
>
> On Sep 29, 2018, at 9:35 AM, Jann Horn <[email protected]> wrote:
>
> +cc linux-api; please keep them in CC for future versions of the patch
>
>> On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <[email protected]> wrote:
>> The primary motivation for the need for this flag is container runtimes
>> which have to interact with malicious root filesystems in the host
>> namespaces. One of the first requirements for a container runtime to be
>> secure against a malicious rootfs is that they correctly scope symlinks
>> (that is, they should be scoped as though they are chroot(2)ed into the
>> container's rootfs) and ".."-style paths. The already-existing AT_XDEV
>> and AT_NO_PROCLINKS help defend against other potential attacks in a
>> malicious rootfs scenario.
>
> So, I really like the concept for patch 1 of this series (but haven't
> read the code yet); but I dislike this patch because of its footgun
> potential.
>
The code could do it differently: do the path walk and then, before accepting the result, walk back up and make sure the result is under the starting point.
This is *not* a full solution, though, since a walk above the root gas side effects on timing, various caches, and possibly network traffic, so it’s open to Spectre-like attacks in which a malicious container could use a runtime-initiated AT_THIS_ROOT to infer the existence of directories outside the container.
But what’s the container usecase? Any sane container is based on pivot_root or similar, so the runtime can just do the walk in the container context. IOW I’m a bit confused as to the exact intended use of the whole series. Can you elaborate?
On Sat, Sep 29, 2018 at 09:34:24AM -0700, Andy Lutomirski wrote:
> Also, as a perhaps-silly suggestion: if you end up adding a new
> syscall, I can see a use for a mode that does the path walk but, rather
> than failing on a disallowed link, stops early and indicates where it
> stopped. Then web servers, samba, etc can more efficiently implement
> custom behavior when links are encountered. And it may also be useful
> to have a variant of AT_THIS_ROOT where trying to escape is an error
> instead of having it just get stuck at the root.
AT_USER_LINKS indicating that userspace wants to resolve symlinks
themselves?
On 2018-09-29, Aleksa Sarai <[email protected]> wrote:
> * AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
> from one). The primary "scoping" use is to blocking resolution that
> crosses a bind-mount, which has a similar property to a symlink (in
> the way that it allows for escape from the starting-point). Since it
> is not possible to differentiate bind-mounts However since
> bind-mounting requires privileges (in ways symlinks don't) this has
> been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
> (though find(1) doesn't walk upwards, the semantics seem obvious).
I've just noticed that the mountpoint-crossing code for AT_XDEV doesn't
detect things like:
% ln -s / /tmp/jumpup
% vfs_helper -o open -F xdev -d /tmp jumpup
/
I will fix that in v2.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2018-09-29, Christian Brauner <[email protected]> wrote:
> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
>
> We should really make sure that we can't make due with openat() alone
> before adding a bunch of new syscalls. So there's no need to rush into
> this. :)
Yeah, I think that we could (mostly) make do with openat(2). We might
need to have renameat(2) and a few others, but if we had more support
for AT_EMPTY_PATH you should be able to just O_PATH|O_{BENEATH,XDEV,...}
and then operate on the O_PATH fd.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai <[email protected]> wrote:
>
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
> * Path-based escapes from the starting-point using "/" or ".." are
> blocked by AT_BENEATH.
> * Mountpoint crossings are blocked by AT_XDEV.
> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> correctly it actually blocks any user of nd_jump_link() because it
> allows out-of-VFS path resolution manipulation).
>
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
It seems quite useful to me.
> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).
>
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.
What do you think of an equivalent feature AT_NO_SYMLINKS flag for mount()?
I guess that would have made the fix for CVE-2017-1002101 in
Kubernetes easier to write:
https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/
> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
>
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
>
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
>
> Aleksa Sarai (3):
> namei: implement O_BENEATH-style AT_* flags
> namei: implement AT_THIS_ROOT chroot-like path resolution
> selftests: vfs: add AT_* path resolution tests
>
> fs/fcntl.c | 2 +-
> fs/namei.c | 158 ++++++++++++------
> fs/open.c | 10 ++
> fs/stat.c | 15 +-
> include/linux/fcntl.h | 3 +-
> include/linux/namei.h | 8 +
> include/uapi/asm-generic/fcntl.h | 20 +++
> include/uapi/linux/fcntl.h | 10 ++
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/vfs/.gitignore | 1 +
> tools/testing/selftests/vfs/Makefile | 13 ++
> tools/testing/selftests/vfs/at_flags.h | 40 +++++
> tools/testing/selftests/vfs/common.sh | 37 ++++
> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
> tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
> 19 files changed, 707 insertions(+), 56 deletions(-)
> create mode 100644 tools/testing/selftests/vfs/.gitignore
> create mode 100644 tools/testing/selftests/vfs/Makefile
> create mode 100644 tools/testing/selftests/vfs/at_flags.h
> create mode 100644 tools/testing/selftests/vfs/common.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>
> --
> 2.19.0
On September 30, 2018 3:54:31 PM GMT+02:00, Alban Crequy <[email protected]> wrote:
>On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai <[email protected]>
>wrote:
>>
>> The need for some sort of control over VFS's path resolution (to
>avoid
>> malicious paths resulting in inadvertent breakouts) has been a very
>> long-standing desire of many userspace applications. This patchset is
>a
>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few
>additions.
>>
>> The most obvious change is that AT_NO_JUMPS has been split as
>dicussed
>> in the original thread, along with a further split of AT_NO_PROCLINKS
>> which means that each individual property of AT_NO_JUMPS is now a
>> separate flag:
>>
>> * Path-based escapes from the starting-point using "/" or ".." are
>> blocked by AT_BENEATH.
>> * Mountpoint crossings are blocked by AT_XDEV.
>> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>> correctly it actually blocks any user of nd_jump_link()
>because it
>> allows out-of-VFS path resolution manipulation).
>>
>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS).
>At
>> Linus' suggestion in the original thread, I've also implemented
>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
>> "proclink" resolution).
>
>It seems quite useful to me.
>
>> An additional improvement was made to AT_XDEV. The original
>AT_NO_JUMPS
>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
>> blocks this as well (feel free to ask me to remove it if you feel
>this
>> is not sane).
>>
>> Currently I've only enabled these for openat(2) and the stat(2)
>family.
>> I would hope we could enable it for basically every *at(2) syscall --
>> but many of them appear to not have a @flags argument and thus we'll
>> need to add several new syscalls to do this. I'm more than happy to
>send
>> those patches, but I'd prefer to know that this preliminary work is
>> acceptable before doing a bunch of copy-paste to add new sets of
>*at(2)
>> syscalls.
>
>What do you think of an equivalent feature AT_NO_SYMLINKS flag for
>mount()?
That's something we discussed but that would need to be part of the new mount API work by David. The current mount API doesn't take AT_* flags since it doesn't operate on fds and we're (sort of) out of mount flags.
>
>I guess that would have made the fix for CVE-2017-1002101 in
>Kubernetes easier to write:
>https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/
>
>> One additional feature I've implemented is AT_THIS_ROOT (I imagine
>this
>> is probably going to be more contentious than the refresh of
>> AT_NO_JUMPS, so I've included it in a separate patch). The patch
>itself
>> describes my reasoning, but the shortened version of the premise is
>that
>> continer runtimes need to have a way to resolve paths within a
>> potentially malicious rootfs. Container runtimes currently do this in
>> userspace[2] which has implicit race conditions that are not
>resolvable
>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics
>for
>> path resolution, which would be invaluable for us -- and the
>> implementation is basically identical to AT_BENEATH (except that we
>> don't return errors when someone actually hits the root).
>>
>> I've added some selftests for this, but it's not clear to me whether
>> they should live here or in xfstests (as far as I can tell there are
>no
>> other VFS tests in selftests, while there are some tests that look
>like
>> generic VFS tests in xfstests). If you'd prefer them to be included
>in
>> xfstests, let me know.
>>
>> [1]: https://lore.kernel.org/patchwork/patch/784221/
>> [2]: https://github.com/cyphar/filepath-securejoin
>>
>> Aleksa Sarai (3):
>> namei: implement O_BENEATH-style AT_* flags
>> namei: implement AT_THIS_ROOT chroot-like path resolution
>> selftests: vfs: add AT_* path resolution tests
>>
>> fs/fcntl.c | 2 +-
>> fs/namei.c | 158
>++++++++++++------
>> fs/open.c | 10 ++
>> fs/stat.c | 15 +-
>> include/linux/fcntl.h | 3 +-
>> include/linux/namei.h | 8 +
>> include/uapi/asm-generic/fcntl.h | 20 +++
>> include/uapi/linux/fcntl.h | 10 ++
>> tools/testing/selftests/Makefile | 1 +
>> tools/testing/selftests/vfs/.gitignore | 1 +
>> tools/testing/selftests/vfs/Makefile | 13 ++
>> tools/testing/selftests/vfs/at_flags.h | 40 +++++
>> tools/testing/selftests/vfs/common.sh | 37 ++++
>> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
>> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
>> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
>> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
>> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
>> tools/testing/selftests/vfs/vfs_helper.c | 154
>+++++++++++++++++
>> 19 files changed, 707 insertions(+), 56 deletions(-)
>> create mode 100644 tools/testing/selftests/vfs/.gitignore
>> create mode 100644 tools/testing/selftests/vfs/Makefile
>> create mode 100644 tools/testing/selftests/vfs/at_flags.h
>> create mode 100644 tools/testing/selftests/vfs/common.sh
>> create mode 100755
>tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>> create mode 100755
>tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>> create mode 100755
>tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>> create mode 100755
>tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>>
>> --
>> 2.19.0
As a side note, I'm still working on Landlock which can achieve the same
goal but in a more flexible and dynamic way: https://landlock.io
Regards,
Micka?l
On 9/29/18 12:34, Aleksa Sarai wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
> * Path-based escapes from the starting-point using "/" or ".." are
> blocked by AT_BENEATH.
> * Mountpoint crossings are blocked by AT_XDEV.
> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> correctly it actually blocks any user of nd_jump_link() because it
> allows out-of-VFS path resolution manipulation).
>
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
>
> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).
>
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.
>
> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
>
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
>
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
>
> Aleksa Sarai (3):
> namei: implement O_BENEATH-style AT_* flags
> namei: implement AT_THIS_ROOT chroot-like path resolution
> selftests: vfs: add AT_* path resolution tests
>
> fs/fcntl.c | 2 +-
> fs/namei.c | 158 ++++++++++++------
> fs/open.c | 10 ++
> fs/stat.c | 15 +-
> include/linux/fcntl.h | 3 +-
> include/linux/namei.h | 8 +
> include/uapi/asm-generic/fcntl.h | 20 +++
> include/uapi/linux/fcntl.h | 10 ++
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/vfs/.gitignore | 1 +
> tools/testing/selftests/vfs/Makefile | 13 ++
> tools/testing/selftests/vfs/at_flags.h | 40 +++++
> tools/testing/selftests/vfs/common.sh | 37 ++++
> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
> tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
> 19 files changed, 707 insertions(+), 56 deletions(-)
> create mode 100644 tools/testing/selftests/vfs/.gitignore
> create mode 100644 tools/testing/selftests/vfs/Makefile
> create mode 100644 tools/testing/selftests/vfs/at_flags.h
> create mode 100644 tools/testing/selftests/vfs/common.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>
On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün <[email protected]> wrote:
> As a side note, I'm still working on Landlock which can achieve the same
> goal but in a more flexible and dynamic way: https://landlock.io
Isn't Landlock mostly intended for userspace that wants to impose a
custom Mandatory Access Control policy on itself, restricting the
whole process?
As far as I can tell, a major usecase for AT_BENEATH are privileged
processes that do not want to restrict all filesystem operations they
perform, but want to sometimes impose limits on filesystem traversal
for the duration of a single system call. For example, a process might
want to first open a file from an untrusted filesystem area with
AT_BENEATH, and afterwards open a configuration file without
AT_BENEATH.
How would you do this in Landlock? Use a BPF map to store per-thread
filesystem restrictions, and then do bpf() calls before and after
every restricted filesystem access to set and unset the policy for the
current syscall?
> On 9/29/18 12:34, Aleksa Sarai wrote:
> > The need for some sort of control over VFS's path resolution (to avoid
> > malicious paths resulting in inadvertent breakouts) has been a very
> > long-standing desire of many userspace applications. This patchset is a
> > revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> >
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> >
> > * Path-based escapes from the starting-point using "/" or ".." are
> > blocked by AT_BENEATH.
> > * Mountpoint crossings are blocked by AT_XDEV.
> > * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> > correctly it actually blocks any user of nd_jump_link() because it
> > allows out-of-VFS path resolution manipulation).
> >
> > AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> > Linus' suggestion in the original thread, I've also implemented
> > AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> > "proclink" resolution).
> >
> > An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> > path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> > blocks this as well (feel free to ask me to remove it if you feel this
> > is not sane).
> >
> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
> >
> > One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> > is probably going to be more contentious than the refresh of
> > AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> > describes my reasoning, but the shortened version of the premise is that
> > continer runtimes need to have a way to resolve paths within a
> > potentially malicious rootfs. Container runtimes currently do this in
> > userspace[2] which has implicit race conditions that are not resolvable
> > in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> > inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> > path resolution, which would be invaluable for us -- and the
> > implementation is basically identical to AT_BENEATH (except that we
> > don't return errors when someone actually hits the root).
> >
> > I've added some selftests for this, but it's not clear to me whether
> > they should live here or in xfstests (as far as I can tell there are no
> > other VFS tests in selftests, while there are some tests that look like
> > generic VFS tests in xfstests). If you'd prefer them to be included in
> > xfstests, let me know.
> >
> > [1]: https://lore.kernel.org/patchwork/patch/784221/
> > [2]: https://github.com/cyphar/filepath-securejoin
> >
> > Aleksa Sarai (3):
> > namei: implement O_BENEATH-style AT_* flags
> > namei: implement AT_THIS_ROOT chroot-like path resolution
> > selftests: vfs: add AT_* path resolution tests
> >
> > fs/fcntl.c | 2 +-
> > fs/namei.c | 158 ++++++++++++------
> > fs/open.c | 10 ++
> > fs/stat.c | 15 +-
> > include/linux/fcntl.h | 3 +-
> > include/linux/namei.h | 8 +
> > include/uapi/asm-generic/fcntl.h | 20 +++
> > include/uapi/linux/fcntl.h | 10 ++
> > tools/testing/selftests/Makefile | 1 +
> > tools/testing/selftests/vfs/.gitignore | 1 +
> > tools/testing/selftests/vfs/Makefile | 13 ++
> > tools/testing/selftests/vfs/at_flags.h | 40 +++++
> > tools/testing/selftests/vfs/common.sh | 37 ++++
> > .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
> > .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
> > .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
> > .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
> > .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
> > tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
> > 19 files changed, 707 insertions(+), 56 deletions(-)
> > create mode 100644 tools/testing/selftests/vfs/.gitignore
> > create mode 100644 tools/testing/selftests/vfs/Makefile
> > create mode 100644 tools/testing/selftests/vfs/at_flags.h
> > create mode 100644 tools/testing/selftests/vfs/common.sh
> > create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> > create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> > create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> > create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> > create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> > create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
> >
>
On 9/30/18 23:46, Jann Horn wrote:
> On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün <[email protected]> wrote:
>> As a side note, I'm still working on Landlock which can achieve the same
>> goal but in a more flexible and dynamic way: https://landlock.io
>
> Isn't Landlock mostly intended for userspace that wants to impose a
> custom Mandatory Access Control policy on itself, restricting the
> whole process?
>
> As far as I can tell, a major usecase for AT_BENEATH are privileged
> processes that do not want to restrict all filesystem operations they
> perform, but want to sometimes impose limits on filesystem traversal
> for the duration of a single system call. For example, a process might
> want to first open a file from an untrusted filesystem area with
> AT_BENEATH, and afterwards open a configuration file without
> AT_BENEATH.
I didn't realized this was the main use case for AT_BENEATH. Landlock is
indeed dedicated to apply a security policy on a set of processes. This
set can be a process and its children (seccomp-like), or another set of
processes that may be identified with a cgroup.
>
> How would you do this in Landlock? Use a BPF map to store per-thread
> filesystem restrictions, and then do bpf() calls before and after
> every restricted filesystem access to set and unset the policy for the
> current syscall?
Another way to apply a security policy could be to tied it to a file
descriptor, similarly to Capsicum, which could enable to create
programmable (real) capabilities. This way, it would be possible to
"wrap" a file descriptor with a Landlock program and use it with
FD-based syscalls or pass it to other processes. This would not require
changes to the FS subsystem, but only the Landlock LSM code. This isn't
done yet but I plan to add this new way to restrict operations on file
descriptors.
Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be
simple to use and enough for now. We must be careful of the hardcoded
policy though.
>
>> On 9/29/18 12:34, Aleksa Sarai wrote:
>>> The need for some sort of control over VFS's path resolution (to avoid
>>> malicious paths resulting in inadvertent breakouts) has been a very
>>> long-standing desire of many userspace applications. This patchset is a
>>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>>>
>>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
>>> in the original thread, along with a further split of AT_NO_PROCLINKS
>>> which means that each individual property of AT_NO_JUMPS is now a
>>> separate flag:
>>>
>>> * Path-based escapes from the starting-point using "/" or ".." are
>>> blocked by AT_BENEATH.
>>> * Mountpoint crossings are blocked by AT_XDEV.
>>> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>>> correctly it actually blocks any user of nd_jump_link() because it
>>> allows out-of-VFS path resolution manipulation).
>>>
>>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
>>> Linus' suggestion in the original thread, I've also implemented
>>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
>>> "proclink" resolution).
>>>
>>> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
>>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
>>> blocks this as well (feel free to ask me to remove it if you feel this
>>> is not sane).
>>>
>>> Currently I've only enabled these for openat(2) and the stat(2) family.
>>> I would hope we could enable it for basically every *at(2) syscall --
>>> but many of them appear to not have a @flags argument and thus we'll
>>> need to add several new syscalls to do this. I'm more than happy to send
>>> those patches, but I'd prefer to know that this preliminary work is
>>> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
>>> syscalls.
>>>
>>> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
>>> is probably going to be more contentious than the refresh of
>>> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
>>> describes my reasoning, but the shortened version of the premise is that
>>> continer runtimes need to have a way to resolve paths within a
>>> potentially malicious rootfs. Container runtimes currently do this in
>>> userspace[2] which has implicit race conditions that are not resolvable
>>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
>>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
>>> path resolution, which would be invaluable for us -- and the
>>> implementation is basically identical to AT_BENEATH (except that we
>>> don't return errors when someone actually hits the root).
>>>
>>> I've added some selftests for this, but it's not clear to me whether
>>> they should live here or in xfstests (as far as I can tell there are no
>>> other VFS tests in selftests, while there are some tests that look like
>>> generic VFS tests in xfstests). If you'd prefer them to be included in
>>> xfstests, let me know.
>>>
>>> [1]: https://lore.kernel.org/patchwork/patch/784221/
>>> [2]: https://github.com/cyphar/filepath-securejoin
>>>
>>> Aleksa Sarai (3):
>>> namei: implement O_BENEATH-style AT_* flags
>>> namei: implement AT_THIS_ROOT chroot-like path resolution
>>> selftests: vfs: add AT_* path resolution tests
>>>
>>> fs/fcntl.c | 2 +-
>>> fs/namei.c | 158 ++++++++++++------
>>> fs/open.c | 10 ++
>>> fs/stat.c | 15 +-
>>> include/linux/fcntl.h | 3 +-
>>> include/linux/namei.h | 8 +
>>> include/uapi/asm-generic/fcntl.h | 20 +++
>>> include/uapi/linux/fcntl.h | 10 ++
>>> tools/testing/selftests/Makefile | 1 +
>>> tools/testing/selftests/vfs/.gitignore | 1 +
>>> tools/testing/selftests/vfs/Makefile | 13 ++
>>> tools/testing/selftests/vfs/at_flags.h | 40 +++++
>>> tools/testing/selftests/vfs/common.sh | 37 ++++
>>> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
>>> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
>>> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
>>> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
>>> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
>>> tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
>>> 19 files changed, 707 insertions(+), 56 deletions(-)
>>> create mode 100644 tools/testing/selftests/vfs/.gitignore
>>> create mode 100644 tools/testing/selftests/vfs/Makefile
>>> create mode 100644 tools/testing/selftests/vfs/at_flags.h
>>> create mode 100644 tools/testing/selftests/vfs/common.sh
>>> create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>>> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>>> create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>>> create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>>> create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>>> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>>>
>>
>
>
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
xfstests, please. That way the new functionality will get immediate
coverage by all the main filesystem development and distro QA
teams....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 2018-09-29, Jann Horn <[email protected]> wrote:
> The problem is what happens if a folder you are walking through is
> concurrently moved out of the chroot. Consider the following scenario:
>
> You attempt to open "C/../../etc/passwd" under the root "/A/B".
> Something else concurrently moves /A/B/C to /A/C. This can result in
> the following:
>
> 1. You start the path walk and reach /A/B/C.
> 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> 3. Your path walk follows the first ".." up into /A. This is outside
> the process root, but you never actually encountered the process root,
> so you don't notice.
> 4. Your path walk follows the second ".." up to /. Again, this is
> outside the process root, but you don't notice.
> 5. Your path walk walks down to /etc/passwd, and the open completes
> successfully. You now have an fd pointing outside your chroot.
>
> If the root of your walk is below an attacker-controlled directory,
> this of course means that you lose instantly. If you point the root of
> the walk at a directory out of which a process in the container
> wouldn't be able to move the file, you're probably kinda mostly fine -
> as long as you know, for certain, that nothing else on the system
> would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've
written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
-- or does that only handle if a particular path component changes
*while* it's being walked through? Is it possible for a path walk to
succeed after a path component was unmounted (obviously you can't delete
a directory path component since you'd get -ENOTEMPTY)?
If this is an issue for AT_THIS_ROOT, I believe this might also be an
issue for AT_BENEATH since they are effectively both using the same
nd->root trick (so you could similarly trick AT_BENEATH to not error
out). So we'd need to figure out how to solve this problem in order for
AT_BENEATH to be safe.
Speaking naively, doesn't it make sense to invalidate the walk if a path
component was modified? Or is this something that would be far too
costly with little benefit? What if we do more aggressive nd->root
checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
current->mnt_ns->root)?
Regarding chroot attacks, I was aware of the trivial
chroot-open-chroot-fchdir attack but I was not aware that there was a
rename attack for chroot. Thanks for bringing this up!
> I believe that the only way to robustly use this would be to point the
> dirfd at a mount point, such that you know that being moved out of the
> chroot is impossible because the mount point limits movement of
> directories under it. (Well, technically, it doesn't, but it ensures
> that if a directory does dangerously move away, the syscall fails.) It
> might make sense to hardcode this constraint in the implementation of
> AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a
mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
path component) -- or does MS_MOVE cause a rewalk in a way that rename
does not?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
thought that bind-mounts would be an issue but you also get -EXDEV when
trying to rename across bind-mounts even if they are on the same
underlying filesystem). But AT_BENEATH might be a more bitter pill to
swallow. I'm not sure.
In the usecase of container runtimes, we wouldn't generally be doing
resolution of attacker-controlled paths but it still definitely doesn't
hurt to consider this part of the threat model -- to avoid foot-gunning
as you've said. (There also might be some nested-container cases where
you might want to do that.)
> > Currently most container runtimes try to do this resolution in
> > userspace[1], causing many potential race conditions. In addition, the
> > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > requires a fork+exec which is *very* costly if necessary for every
> > filesystem operation involving a container.
>
> Wait. fork() I understand, but why exec? And actually, you don't need
> a full fork() either, clone() lets you do this with some process parts
> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> the file descriptor table shared. And why chroot()/pivot_root(),
> wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw
clone() or fork() (if you do it manually with RawSyscall you'll end with
broken runtime state). So you're forced to do fork+exec (which then
means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
for CLONE_VFORK.
(It should be noted that multi-threaded C runtimes have somewhat similar
issues -- AFAIK you can technically only use AS-Safe glibc functions
after a fork() but that's more of a theoretical concern here. If you
just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where
you're operating on a container's image without a running container
(think image extraction or snapshotting tools). In those cases, you
would need to set up a dummy container process in order to setns() into
its namespaces. You are right that setns() would be a better option if
you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious
container unless it's necessary but that's probably just needless
paranoia more than anything -- since you're not joining the pidns you
aren't trivially addressable by a malicious container.]
> // Ensure that we are non-dumpable. Together with
> // commit bfedb589252c, this ensures that container root
> // can't trace our child once it enters the container.
> // My patch
> // https://lore.kernel.org/lkml/[email protected]/
> // would make this unnecessary, but that patch didn't
> // land because Eric nacked it (for political reasons,
> // because people incorrectly claimed that this was a
> // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
user_ns owner to mm_struct and fix ptrace permission checks"). If you
join a user namespace then processes within that user namespace won't
have ptrace_may_access() permissions because your mm is owned by an
ancestor user namespace -- only after exec() will you be traceable.
We still use PR_SET_DUMPABLE in runc but that's because we support older
kernels (and people don't use user namespaces under Docker) but with
user namespaces this should not be required anymore.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2018-10-01, Dave Chinner <[email protected]> wrote:
> > I've added some selftests for this, but it's not clear to me whether
> > they should live here or in xfstests (as far as I can tell there are no
> > other VFS tests in selftests, while there are some tests that look like
> > generic VFS tests in xfstests). If you'd prefer them to be included in
> > xfstests, let me know.
>
> xfstests, please. That way the new functionality will get immediate
> coverage by all the main filesystem development and distro QA
> teams....
Sure, will do. Do you want me to submit them in parallel -- and what is
the correct ML to submit changes to xfstests? Sorry for the silly
questions. :P
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Mon, Oct 01, 2018 at 03:47:23PM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Dave Chinner <[email protected]> wrote:
> > > I've added some selftests for this, but it's not clear to me whether
> > > they should live here or in xfstests (as far as I can tell there are no
> > > other VFS tests in selftests, while there are some tests that look like
> > > generic VFS tests in xfstests). If you'd prefer them to be included in
> > > xfstests, let me know.
> >
> > xfstests, please. That way the new functionality will get immediate
> > coverage by all the main filesystem development and distro QA
> > teams....
>
> Sure, will do. Do you want me to submit them in parallel --
That's usually the way we do things, but we don't tend to commit the
fstests changes until the thing it is testing has landed upstream.
> and what is
> the correct ML to submit changes to xfstests?
[email protected]
> Sorry for the silly questions. :P
You're going to have many more of them when you start moving the
tests across to fstests :P
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 2018-09-29, Andy Lutomirski <[email protected]> wrote:
> >> On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <[email protected]> wrote:
> >> The primary motivation for the need for this flag is container runtimes
> >> which have to interact with malicious root filesystems in the host
> >> namespaces. One of the first requirements for a container runtime to be
> >> secure against a malicious rootfs is that they correctly scope symlinks
> >> (that is, they should be scoped as though they are chroot(2)ed into the
> >> container's rootfs) and ".."-style paths. The already-existing AT_XDEV
> >> and AT_NO_PROCLINKS help defend against other potential attacks in a
> >> malicious rootfs scenario.
> >
> > So, I really like the concept for patch 1 of this series (but haven't
> > read the code yet); but I dislike this patch because of its footgun
> > potential.
> >
>
> The code could do it differently: do the path walk and then, before
> accepting the result, walk back up and make sure the result is under
> the starting point.
>
> This is *not* a full solution, though, since a walk above the root gas
> side effects on timing, various caches, and possibly network traffic,
> so it’s open to Spectre-like attacks in which a malicious container
> could use a runtime-initiated AT_THIS_ROOT to infer the existence of
> directories outside the container.
I think that one way to solve this problem might be to have more strict
checks on nd->root in follow_dotdot(). The problem here (as far as I can
tell) is that ".." could end up skipping past the root because of a
rename, however walking *down* into a path shouldn't be a problem (even
absolute symlinks shouldn't be a problem because they will nd_jump_root
and will land back in the root).
However, I'm not entirely sure what happens to nd->root if it gets
renamed -- can you still safely do checks against it (we'd need to do
some sort of is_descendant() check on the current path before we handle
".." in follow_dotdot).
That way, we wouldn't shouldn't have the spectre-like attack problem
(since the attack would be halted at the ".." stage -- before the path
walk can proceed into host paths). Would this be sufficient or is there
a more serious issue I'm missing?
> But what’s the container usecase? Any sane container is based on
> pivot_root or similar, so the runtime can just do the walk in the
> container context. IOW I’m a bit confused as to the exact intended use
> of the whole series. Can you elaborate?
I went into this in my response to Jann.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Mon, Oct 1, 2018 at 7:44 AM Aleksa Sarai <[email protected]> wrote:
> On 2018-09-29, Jann Horn <[email protected]> wrote:
> > The problem is what happens if a folder you are walking through is
> > concurrently moved out of the chroot. Consider the following scenario:
> >
> > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > Something else concurrently moves /A/B/C to /A/C. This can result in
> > the following:
> >
> > 1. You start the path walk and reach /A/B/C.
> > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > 3. Your path walk follows the first ".." up into /A. This is outside
> > the process root, but you never actually encountered the process root,
> > so you don't notice.
> > 4. Your path walk follows the second ".." up to /. Again, this is
> > outside the process root, but you don't notice.
> > 5. Your path walk walks down to /etc/passwd, and the open completes
> > successfully. You now have an fd pointing outside your chroot.
> >
> > If the root of your walk is below an attacker-controlled directory,
> > this of course means that you lose instantly. If you point the root of
> > the walk at a directory out of which a process in the container
> > wouldn't be able to move the file, you're probably kinda mostly fine -
> > as long as you know, for certain, that nothing else on the system
> > would ever do that. But I still wouldn't feel good about that.
>
> Please correct me if I'm wrong here (this is the first patch I've
> written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
> -- or does that only handle if a particular path component changes
> *while* it's being walked through?
Eric Biederman should be able to talk about all this much better than
me, but as far as I know, the LOOKUP_REVAL path is only for dealing
with some special filesystems like procfs.
> Is it possible for a path walk to
> succeed after a path component was unmounted (obviously you can't delete
> a directory path component since you'd get -ENOTEMPTY)?
I don't think so, but I'm not exactly a VFS expert.
> If this is an issue for AT_THIS_ROOT, I believe this might also be an
> issue for AT_BENEATH since they are effectively both using the same
> nd->root trick (so you could similarly trick AT_BENEATH to not error
> out). So we'd need to figure out how to solve this problem in order for
> AT_BENEATH to be safe.
Oh, wait, what? I think I didn't notice that the semantics of
AT_BENEATH changed like that since the original posting of David
Drysdale's O_BENEATH_ONLY patch
(https://lore.kernel.org/lkml/[email protected]/).
David's patch had nice, straightforward semantics, blocking any form
of upwards traversal. Why was that changed? Does anyone actually want
to use paths that contain ".." with AT_BENEATH? I would strongly
prefer something that blocks any use of "..".
@Al: It looks like this already changed back when you posted
https://lore.kernel.org/lkml/[email protected]/
?
> Speaking naively, doesn't it make sense to invalidate the walk if a path
> component was modified? Or is this something that would be far too
> costly with little benefit? What if we do more aggressive nd->root
> checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> current->mnt_ns->root)?
It seems to me like doing that would basically require looking at each
node in the path walk twice? And it'd be difficult to guarantee
forward progress unless you're willing to do some fairly heavy
locking.
> Regarding chroot attacks, I was aware of the trivial
> chroot-open-chroot-fchdir attack but I was not aware that there was a
> rename attack for chroot. Thanks for bringing this up!
>
> > I believe that the only way to robustly use this would be to point the
> > dirfd at a mount point, such that you know that being moved out of the
> > chroot is impossible because the mount point limits movement of
> > directories under it. (Well, technically, it doesn't, but it ensures
> > that if a directory does dangerously move away, the syscall fails.) It
> > might make sense to hardcode this constraint in the implementation of
> > AT_THIS_ROOT, to keep people from shooting themselves in the foot.
>
> Unless I'm missing something, would this not also affect using a
> mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
> path component) -- or does MS_MOVE cause a rewalk in a way that rename
> does not?
Hmm. Good point.
It looks to me like you probably wouldn't be able to walk up through a
mountpoint in RCU mode after the mount hierarchy has changed (see
follow_dotdot_rcu()), but it might work in refwalk mode.
Eric?
> I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
> thought that bind-mounts would be an issue but you also get -EXDEV when
> trying to rename across bind-mounts even if they are on the same
> underlying filesystem). But AT_BENEATH might be a more bitter pill to
> swallow. I'm not sure.
Which is part of why I strongly prefer the semantics from David
Drysdale's O_BENEATH_ONLY.
> In the usecase of container runtimes, we wouldn't generally be doing
> resolution of attacker-controlled paths but it still definitely doesn't
> hurt to consider this part of the threat model -- to avoid foot-gunning
> as you've said. (There also might be some nested-container cases where
> you might want to do that.)
>
> > > Currently most container runtimes try to do this resolution in
> > > userspace[1], causing many potential race conditions. In addition, the
> > > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > > requires a fork+exec which is *very* costly if necessary for every
> > > filesystem operation involving a container.
> >
> > Wait. fork() I understand, but why exec? And actually, you don't need
> > a full fork() either, clone() lets you do this with some process parts
> > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > the file descriptor table shared. And why chroot()/pivot_root(),
> > wouldn't you want to use setns()?
>
> You're right about this -- for C runtimes. In Go we cannot do a raw
> clone() or fork() (if you do it manually with RawSyscall you'll end with
> broken runtime state). So you're forced to do fork+exec (which then
> means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> for CLONE_VFORK.
If you insist on implementing every last bit of your code in Go, I guess.
> (It should be noted that multi-threaded C runtimes have somewhat similar
> issues -- AFAIK you can technically only use AS-Safe glibc functions
> after a fork() but that's more of a theoretical concern here. If you
> just use raw syscalls there isn't an issue.)
>
> As for why use setns() rather than pivot_root(), there are cases where
> you're operating on a container's image without a running container
> (think image extraction or snapshotting tools). In those cases, you
> would need to set up a dummy container process in order to setns() into
> its namespaces. You are right that setns() would be a better option if
> you want the truthful state of what mounts the container sees.
>
> [I also don't like the idea of joining the user namespace of a malicious
> container unless it's necessary but that's probably just needless
> paranoia more than anything -- since you're not joining the pidns you
> aren't trivially addressable by a malicious container.]
>
> > // Ensure that we are non-dumpable. Together with
> > // commit bfedb589252c, this ensures that container root
> > // can't trace our child once it enters the container.
> > // My patch
> > // https://lore.kernel.org/lkml/[email protected]/
> > // would make this unnecessary, but that patch didn't
> > // land because Eric nacked it (for political reasons,
> > // because people incorrectly claimed that this was a
> > // security fix):
>
> Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
> user_ns owner to mm_struct and fix ptrace permission checks"). If you
> join a user namespace then processes within that user namespace won't
> have ptrace_may_access() permissions because your mm is owned by an
> ancestor user namespace -- only after exec() will you be traceable.
Nope. The code added in bfedb589252c only applies if `get_dumpable(mm)
!= SUID_DUMP_USER`.
Looking at the current version, you can see that
`ptrace_has_cap(tcred->user_ns, mode)` is enough to ptrace a process
that is not nondumpable.
> We still use PR_SET_DUMPABLE in runc but that's because we support older
> kernels (and people don't use user namespaces under Docker) but with
> user namespaces this should not be required anymore.
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
> On 2018-09-29, Jann Horn <[email protected]> wrote:
> > The problem is what happens if a folder you are walking through is
> > concurrently moved out of the chroot. Consider the following scenario:
> >
> > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > Something else concurrently moves /A/B/C to /A/C. This can result in
> > the following:
> >
> > 1. You start the path walk and reach /A/B/C.
> > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > 3. Your path walk follows the first ".." up into /A. This is outside
> > the process root, but you never actually encountered the process root,
> > so you don't notice.
> > 4. Your path walk follows the second ".." up to /. Again, this is
> > outside the process root, but you don't notice.
> > 5. Your path walk walks down to /etc/passwd, and the open completes
> > successfully. You now have an fd pointing outside your chroot.
> >
> > If the root of your walk is below an attacker-controlled directory,
> > this of course means that you lose instantly. If you point the root of
> > the walk at a directory out of which a process in the container
> > wouldn't be able to move the file, you're probably kinda mostly fine -
> > as long as you know, for certain, that nothing else on the system
> > would ever do that. But I still wouldn't feel good about that.
>
> Please correct me if I'm wrong here (this is the first patch I've
> written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
> -- or does that only handle if a particular path component changes
> *while* it's being walked through? Is it possible for a path walk to
> succeed after a path component was unmounted (obviously you can't delete
> a directory path component since you'd get -ENOTEMPTY)?
>
> If this is an issue for AT_THIS_ROOT, I believe this might also be an
> issue for AT_BENEATH since they are effectively both using the same
> nd->root trick (so you could similarly trick AT_BENEATH to not error
> out). So we'd need to figure out how to solve this problem in order for
> AT_BENEATH to be safe.
>
> Speaking naively, doesn't it make sense to invalidate the walk if a path
> component was modified? Or is this something that would be far too
> costly with little benefit? What if we do more aggressive nd->root
> checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> current->mnt_ns->root)?
>
> Regarding chroot attacks, I was aware of the trivial
> chroot-open-chroot-fchdir attack but I was not aware that there was a
> rename attack for chroot. Thanks for bringing this up!
>
> > I believe that the only way to robustly use this would be to point the
> > dirfd at a mount point, such that you know that being moved out of the
> > chroot is impossible because the mount point limits movement of
> > directories under it. (Well, technically, it doesn't, but it ensures
> > that if a directory does dangerously move away, the syscall fails.) It
> > might make sense to hardcode this constraint in the implementation of
> > AT_THIS_ROOT, to keep people from shooting themselves in the foot.
>
> Unless I'm missing something, would this not also affect using a
> mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
> path component) -- or does MS_MOVE cause a rewalk in a way that rename
> does not?
>
> I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
> thought that bind-mounts would be an issue but you also get -EXDEV when
> trying to rename across bind-mounts even if they are on the same
> underlying filesystem). But AT_BENEATH might be a more bitter pill to
> swallow. I'm not sure.
>
> In the usecase of container runtimes, we wouldn't generally be doing
> resolution of attacker-controlled paths but it still definitely doesn't
> hurt to consider this part of the threat model -- to avoid foot-gunning
> as you've said. (There also might be some nested-container cases where
> you might want to do that.)
>
> > > Currently most container runtimes try to do this resolution in
> > > userspace[1], causing many potential race conditions. In addition, the
> > > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > > requires a fork+exec which is *very* costly if necessary for every
> > > filesystem operation involving a container.
> >
> > Wait. fork() I understand, but why exec? And actually, you don't need
> > a full fork() either, clone() lets you do this with some process parts
> > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > the file descriptor table shared. And why chroot()/pivot_root(),
> > wouldn't you want to use setns()?
>
> You're right about this -- for C runtimes. In Go we cannot do a raw
> clone() or fork() (if you do it manually with RawSyscall you'll end with
> broken runtime state). So you're forced to do fork+exec (which then
> means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> for CLONE_VFORK.
>
> (It should be noted that multi-threaded C runtimes have somewhat similar
> issues -- AFAIK you can technically only use AS-Safe glibc functions
> after a fork() but that's more of a theoretical concern here. If you
> just use raw syscalls there isn't an issue.)
>
> As for why use setns() rather than pivot_root(), there are cases where
> you're operating on a container's image without a running container
> (think image extraction or snapshotting tools). In those cases, you
> would need to set up a dummy container process in order to setns() into
> its namespaces. You are right that setns() would be a better option if
> you want the truthful state of what mounts the container sees.
>
> [I also don't like the idea of joining the user namespace of a malicious
> container unless it's necessary but that's probably just needless
> paranoia more than anything -- since you're not joining the pidns you
> aren't trivially addressable by a malicious container.]
>
> > // Ensure that we are non-dumpable. Together with
> > // commit bfedb589252c, this ensures that container root
> > // can't trace our child once it enters the container.
> > // My patch
> > // https://lore.kernel.org/lkml/[email protected]/
> > // would make this unnecessary, but that patch didn't
> > // land because Eric nacked it (for political reasons,
> > // because people incorrectly claimed that this was a
> > // security fix):
>
> Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
> user_ns owner to mm_struct and fix ptrace permission checks"). If you
> join a user namespace then processes within that user namespace won't
> have ptrace_may_access() permissions because your mm is owned by an
> ancestor user namespace -- only after exec() will you be traceable.
That is not _completely_ true.
Iirc (Please someone do yell at me if I'm wrong!), this is as follows.
You will in fact be dumpable as long as you don't set{g,u}id() to an
effective uid that is different from the effective uid of the process
that created the task. For example, if you clone(CLONE_NEWUSER) as an
unprivileged user with uid and euid 1000 you are in fact dumpable and
thus traceable *but* if you do a setuid(0) in the new task then you will
end up with old->euid = 1000 and new->euid = 0 at which point the kernel
will remove the dumpable flag and the creating process cannot trace you
anymore (which has funny consequences for lsm isolation and sending fds
around). Iiuc, The same logic applies when you do a setns() to another
user namespace.
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable); <<< suid_dumpable == 0 at this point
task->pdeath_signal = 0;
smp_wmb();
}
>
> We still use PR_SET_DUMPABLE in runc but that's because we support older
> kernels (and people don't use user namespaces under Docker) but with
> user namespaces this should not be required anymore.
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>
On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner <[email protected]> wrote:
> On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
> > On 2018-09-29, Jann Horn <[email protected]> wrote:
> > > The problem is what happens if a folder you are walking through is
> > > concurrently moved out of the chroot. Consider the following scenario:
> > >
> > > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > > Something else concurrently moves /A/B/C to /A/C. This can result in
> > > the following:
> > >
> > > 1. You start the path walk and reach /A/B/C.
> > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > > 3. Your path walk follows the first ".." up into /A. This is outside
> > > the process root, but you never actually encountered the process root,
> > > so you don't notice.
> > > 4. Your path walk follows the second ".." up to /. Again, this is
> > > outside the process root, but you don't notice.
> > > 5. Your path walk walks down to /etc/passwd, and the open completes
> > > successfully. You now have an fd pointing outside your chroot.
> > >
> > > If the root of your walk is below an attacker-controlled directory,
> > > this of course means that you lose instantly. If you point the root of
> > > the walk at a directory out of which a process in the container
> > > wouldn't be able to move the file, you're probably kinda mostly fine -
> > > as long as you know, for certain, that nothing else on the system
> > > would ever do that. But I still wouldn't feel good about that.
> >
> > Please correct me if I'm wrong here (this is the first patch I've
> > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
> > -- or does that only handle if a particular path component changes
> > *while* it's being walked through? Is it possible for a path walk to
> > succeed after a path component was unmounted (obviously you can't delete
> > a directory path component since you'd get -ENOTEMPTY)?
> >
> > If this is an issue for AT_THIS_ROOT, I believe this might also be an
> > issue for AT_BENEATH since they are effectively both using the same
> > nd->root trick (so you could similarly trick AT_BENEATH to not error
> > out). So we'd need to figure out how to solve this problem in order for
> > AT_BENEATH to be safe.
> >
> > Speaking naively, doesn't it make sense to invalidate the walk if a path
> > component was modified? Or is this something that would be far too
> > costly with little benefit? What if we do more aggressive nd->root
> > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> > current->mnt_ns->root)?
> >
> > Regarding chroot attacks, I was aware of the trivial
> > chroot-open-chroot-fchdir attack but I was not aware that there was a
> > rename attack for chroot. Thanks for bringing this up!
> >
> > > I believe that the only way to robustly use this would be to point the
> > > dirfd at a mount point, such that you know that being moved out of the
> > > chroot is impossible because the mount point limits movement of
> > > directories under it. (Well, technically, it doesn't, but it ensures
> > > that if a directory does dangerously move away, the syscall fails.) It
> > > might make sense to hardcode this constraint in the implementation of
> > > AT_THIS_ROOT, to keep people from shooting themselves in the foot.
> >
> > Unless I'm missing something, would this not also affect using a
> > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
> > path component) -- or does MS_MOVE cause a rewalk in a way that rename
> > does not?
> >
> > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
> > thought that bind-mounts would be an issue but you also get -EXDEV when
> > trying to rename across bind-mounts even if they are on the same
> > underlying filesystem). But AT_BENEATH might be a more bitter pill to
> > swallow. I'm not sure.
> >
> > In the usecase of container runtimes, we wouldn't generally be doing
> > resolution of attacker-controlled paths but it still definitely doesn't
> > hurt to consider this part of the threat model -- to avoid foot-gunning
> > as you've said. (There also might be some nested-container cases where
> > you might want to do that.)
> >
> > > > Currently most container runtimes try to do this resolution in
> > > > userspace[1], causing many potential race conditions. In addition, the
> > > > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > > > requires a fork+exec which is *very* costly if necessary for every
> > > > filesystem operation involving a container.
> > >
> > > Wait. fork() I understand, but why exec? And actually, you don't need
> > > a full fork() either, clone() lets you do this with some process parts
> > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > > the file descriptor table shared. And why chroot()/pivot_root(),
> > > wouldn't you want to use setns()?
> >
> > You're right about this -- for C runtimes. In Go we cannot do a raw
> > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > broken runtime state). So you're forced to do fork+exec (which then
> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > for CLONE_VFORK.
> >
> > (It should be noted that multi-threaded C runtimes have somewhat similar
> > issues -- AFAIK you can technically only use AS-Safe glibc functions
> > after a fork() but that's more of a theoretical concern here. If you
> > just use raw syscalls there isn't an issue.)
> >
> > As for why use setns() rather than pivot_root(), there are cases where
> > you're operating on a container's image without a running container
> > (think image extraction or snapshotting tools). In those cases, you
> > would need to set up a dummy container process in order to setns() into
> > its namespaces. You are right that setns() would be a better option if
> > you want the truthful state of what mounts the container sees.
> >
> > [I also don't like the idea of joining the user namespace of a malicious
> > container unless it's necessary but that's probably just needless
> > paranoia more than anything -- since you're not joining the pidns you
> > aren't trivially addressable by a malicious container.]
> >
> > > // Ensure that we are non-dumpable. Together with
> > > // commit bfedb589252c, this ensures that container root
> > > // can't trace our child once it enters the container.
> > > // My patch
> > > // https://lore.kernel.org/lkml/[email protected]/
> > > // would make this unnecessary, but that patch didn't
> > > // land because Eric nacked it (for political reasons,
> > > // because people incorrectly claimed that this was a
> > > // security fix):
> >
> > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
> > user_ns owner to mm_struct and fix ptrace permission checks"). If you
> > join a user namespace then processes within that user namespace won't
> > have ptrace_may_access() permissions because your mm is owned by an
> > ancestor user namespace -- only after exec() will you be traceable.
>
> That is not _completely_ true.
> Iirc (Please someone do yell at me if I'm wrong!), this is as follows.
> You will in fact be dumpable as long as you don't set{g,u}id() to an
> effective uid that is different from the effective uid of the process
> that created the task. For example, if you clone(CLONE_NEWUSER) as an
> unprivileged user with uid and euid 1000 you are in fact dumpable and
> thus traceable *but* if you do a setuid(0) in the new task then you will
> end up with old->euid = 1000 and new->euid = 0 at which point the kernel
> will remove the dumpable flag and the creating process cannot trace you
> anymore (which has funny consequences for lsm isolation and sending fds
> around). Iiuc, The same logic applies when you do a setns() to another
> user namespace.
(Note that this is only true if your un-namespaced UID actually
changes. If you create a user namespace and then write to its uid_map
such that your namespaced UID is zero, that won't trigger this logic.)
On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai <[email protected]> wrote:
> Add the following flags for path resolution. The primary justification
> for these flags is to allow for programs to be far more strict about how
> they want path resolution to handle symlinks, mountpoint crossings, and
> paths that escape the dirfd (through an absolute path or ".."
> shenanigans).
>
> This is of particular concern to container runtimes that want to be very
> careful about malicious root filesystems that a container's init might
> have screwed around with (and there is no real way to protect against
> this in userspace if you consider potential races against a malicious
> container's init).
>
> * AT_BENEATH: Disallow ".." or absolute paths (either in the path or
> found during symlink resolution) to escape the starting point of name
> resolution, though ".." is permitted in cases like "foo/../bar".
> Relative symlinks are still allowed (as long as they don't escape the
> starting point).
As I said on the other thread, I would strongly prefer an API that
behaves along the lines of David Drysdale's old patch
https://lore.kernel.org/lkml/[email protected]/
: Forbid any use of "..". This would also be more straightforward to
implement safely. If that doesn't work for you, I would like it if you
could at least make that an option. I would like it if this API could
mitigate straightforward directory traversal bugs such as
https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where
a confused deputy attempts to access a path like
"/mnt/media_rw/../../data" while intending to access a directory under
"/mnt/media_rw".
> * AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
> from one). The primary "scoping" use is to blocking resolution that
> crosses a bind-mount, which has a similar property to a symlink (in
> the way that it allows for escape from the starting-point). Since it
> is not possible to differentiate bind-mounts However since
> bind-mounting requires privileges (in ways symlinks don't) this has
> been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
> (though find(1) doesn't walk upwards, the semantics seem obvious).
>
> * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> specific restriction, and it exists because /proc/$pid/fd/...
> "symlinks" allow for access outside nd->root and pose risk to
> container runtimes that don't want to be tricked into accessing a host
> path (but do want to allow no-funny-business symlink resolution).
AT_BENEATH has to imply AT_NO_PROCLINK, right? Especially with the
semantics you picked for AT_BENEATH. With the original O_BENEATH_ONLY
semantics, it might be okay to not imply AT_NO_PROCLINK...
> * AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
> AT_NO_PROCLINK (obviously).
>
> The AT_NO_*LINK flags return -ELOOP if path resolution would violates
> their requirement, while the others all return -EXDEV. Currently these
> are only enabled for the stat(2) family and the openat(2) family (the
> latter has its own brand of O_* flags with the same semantics). Ideally
> these flags would be supported by all *at(2) syscalls, but this will
> require adding flags arguments to many of them (and will be done in a
> separate patchset).
On Mon, Oct 01, 2018 at 01:29:16PM +0200, Jann Horn wrote:
> On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner <[email protected]> wrote:
> > On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
> > > On 2018-09-29, Jann Horn <[email protected]> wrote:
> > > > The problem is what happens if a folder you are walking through is
> > > > concurrently moved out of the chroot. Consider the following scenario:
> > > >
> > > > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > > > Something else concurrently moves /A/B/C to /A/C. This can result in
> > > > the following:
> > > >
> > > > 1. You start the path walk and reach /A/B/C.
> > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > > > 3. Your path walk follows the first ".." up into /A. This is outside
> > > > the process root, but you never actually encountered the process root,
> > > > so you don't notice.
> > > > 4. Your path walk follows the second ".." up to /. Again, this is
> > > > outside the process root, but you don't notice.
> > > > 5. Your path walk walks down to /etc/passwd, and the open completes
> > > > successfully. You now have an fd pointing outside your chroot.
> > > >
> > > > If the root of your walk is below an attacker-controlled directory,
> > > > this of course means that you lose instantly. If you point the root of
> > > > the walk at a directory out of which a process in the container
> > > > wouldn't be able to move the file, you're probably kinda mostly fine -
> > > > as long as you know, for certain, that nothing else on the system
> > > > would ever do that. But I still wouldn't feel good about that.
> > >
> > > Please correct me if I'm wrong here (this is the first patch I've
> > > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
> > > -- or does that only handle if a particular path component changes
> > > *while* it's being walked through? Is it possible for a path walk to
> > > succeed after a path component was unmounted (obviously you can't delete
> > > a directory path component since you'd get -ENOTEMPTY)?
> > >
> > > If this is an issue for AT_THIS_ROOT, I believe this might also be an
> > > issue for AT_BENEATH since they are effectively both using the same
> > > nd->root trick (so you could similarly trick AT_BENEATH to not error
> > > out). So we'd need to figure out how to solve this problem in order for
> > > AT_BENEATH to be safe.
> > >
> > > Speaking naively, doesn't it make sense to invalidate the walk if a path
> > > component was modified? Or is this something that would be far too
> > > costly with little benefit? What if we do more aggressive nd->root
> > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> > > current->mnt_ns->root)?
> > >
> > > Regarding chroot attacks, I was aware of the trivial
> > > chroot-open-chroot-fchdir attack but I was not aware that there was a
> > > rename attack for chroot. Thanks for bringing this up!
> > >
> > > > I believe that the only way to robustly use this would be to point the
> > > > dirfd at a mount point, such that you know that being moved out of the
> > > > chroot is impossible because the mount point limits movement of
> > > > directories under it. (Well, technically, it doesn't, but it ensures
> > > > that if a directory does dangerously move away, the syscall fails.) It
> > > > might make sense to hardcode this constraint in the implementation of
> > > > AT_THIS_ROOT, to keep people from shooting themselves in the foot.
> > >
> > > Unless I'm missing something, would this not also affect using a
> > > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through
> > > path component) -- or does MS_MOVE cause a rewalk in a way that rename
> > > does not?
> > >
> > > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I
> > > thought that bind-mounts would be an issue but you also get -EXDEV when
> > > trying to rename across bind-mounts even if they are on the same
> > > underlying filesystem). But AT_BENEATH might be a more bitter pill to
> > > swallow. I'm not sure.
> > >
> > > In the usecase of container runtimes, we wouldn't generally be doing
> > > resolution of attacker-controlled paths but it still definitely doesn't
> > > hurt to consider this part of the threat model -- to avoid foot-gunning
> > > as you've said. (There also might be some nested-container cases where
> > > you might want to do that.)
> > >
> > > > > Currently most container runtimes try to do this resolution in
> > > > > userspace[1], causing many potential race conditions. In addition, the
> > > > > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > > > > requires a fork+exec which is *very* costly if necessary for every
> > > > > filesystem operation involving a container.
> > > >
> > > > Wait. fork() I understand, but why exec? And actually, you don't need
> > > > a full fork() either, clone() lets you do this with some process parts
> > > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > > > the file descriptor table shared. And why chroot()/pivot_root(),
> > > > wouldn't you want to use setns()?
> > >
> > > You're right about this -- for C runtimes. In Go we cannot do a raw
> > > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > > broken runtime state). So you're forced to do fork+exec (which then
> > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > > for CLONE_VFORK.
> > >
> > > (It should be noted that multi-threaded C runtimes have somewhat similar
> > > issues -- AFAIK you can technically only use AS-Safe glibc functions
> > > after a fork() but that's more of a theoretical concern here. If you
> > > just use raw syscalls there isn't an issue.)
> > >
> > > As for why use setns() rather than pivot_root(), there are cases where
> > > you're operating on a container's image without a running container
> > > (think image extraction or snapshotting tools). In those cases, you
> > > would need to set up a dummy container process in order to setns() into
> > > its namespaces. You are right that setns() would be a better option if
> > > you want the truthful state of what mounts the container sees.
> > >
> > > [I also don't like the idea of joining the user namespace of a malicious
> > > container unless it's necessary but that's probably just needless
> > > paranoia more than anything -- since you're not joining the pidns you
> > > aren't trivially addressable by a malicious container.]
> > >
> > > > // Ensure that we are non-dumpable. Together with
> > > > // commit bfedb589252c, this ensures that container root
> > > > // can't trace our child once it enters the container.
> > > > // My patch
> > > > // https://lore.kernel.org/lkml/[email protected]/
> > > > // would make this unnecessary, but that patch didn't
> > > > // land because Eric nacked it (for political reasons,
> > > > // because people incorrectly claimed that this was a
> > > > // security fix):
> > >
> > > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a
> > > user_ns owner to mm_struct and fix ptrace permission checks"). If you
> > > join a user namespace then processes within that user namespace won't
> > > have ptrace_may_access() permissions because your mm is owned by an
> > > ancestor user namespace -- only after exec() will you be traceable.
> >
> > That is not _completely_ true.
> > Iirc (Please someone do yell at me if I'm wrong!), this is as follows.
> > You will in fact be dumpable as long as you don't set{g,u}id() to an
> > effective uid that is different from the effective uid of the process
> > that created the task. For example, if you clone(CLONE_NEWUSER) as an
> > unprivileged user with uid and euid 1000 you are in fact dumpable and
> > thus traceable *but* if you do a setuid(0) in the new task then you will
> > end up with old->euid = 1000 and new->euid = 0 at which point the kernel
> > will remove the dumpable flag and the creating process cannot trace you
> > anymore (which has funny consequences for lsm isolation and sending fds
> > around). Iiuc, The same logic applies when you do a setns() to another
> > user namespace.
>
> (Note that this is only true if your un-namespaced UID actually
> changes. If you create a user namespace and then write to its uid_map
> such that your namespaced UID is zero, that won't trigger this logic.)
The way I figured this works is that it actually only applies to the
case where you're creating a user namespace as an unprivileged user. And
even in that case you will retain dumpability in two cases:
1. mapping userns id 0 to <my-unpriv-host-uid>
and any identity mapping, i.e.
2. mapping <my-unpriv-host-uid> to <my-unpriv-host-uid>
If you're creating a user namespace as root and set up mappings you
should always retain the dumpable flag (I might not be remembering all
corner-cases atm.) if you don't muck with capability sets and so on.
On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
> On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai <[email protected]> wrote:
> > Add the following flags for path resolution. The primary justification
> > for these flags is to allow for programs to be far more strict about how
> > they want path resolution to handle symlinks, mountpoint crossings, and
> > paths that escape the dirfd (through an absolute path or ".."
> > shenanigans).
> >
> > This is of particular concern to container runtimes that want to be very
> > careful about malicious root filesystems that a container's init might
> > have screwed around with (and there is no real way to protect against
> > this in userspace if you consider potential races against a malicious
> > container's init).
> >
> > * AT_BENEATH: Disallow ".." or absolute paths (either in the path or
> > found during symlink resolution) to escape the starting point of name
> > resolution, though ".." is permitted in cases like "foo/../bar".
> > Relative symlinks are still allowed (as long as they don't escape the
> > starting point).
>
> As I said on the other thread, I would strongly prefer an API that
> behaves along the lines of David Drysdale's old patch
> https://lore.kernel.org/lkml/[email protected]/
> : Forbid any use of "..". This would also be more straightforward to
> implement safely. If that doesn't work for you, I would like it if you
> could at least make that an option. I would like it if this API could
> mitigate straightforward directory traversal bugs such as
> https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where
> a confused deputy attempts to access a path like
> "/mnt/media_rw/../../data" while intending to access a directory under
> "/mnt/media_rw".
Oh, the semantics for this changed in this patchset, hah. I was still on
vacation so didn't get to look at it before it was sent out. From prior
discussion I remember that the original intention actual was what you
argue for. And the patchset should be as tight as possible. Having
special cases where ".." is allowed just sounds like an invitation for
userspace to get it wrong.
Aleksa, did you have a specific use-case in mind that made you change
this or was it already present in an earlier iteration of the patchset
by someone else?
>
> > * AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up*
> > from one). The primary "scoping" use is to blocking resolution that
> > crosses a bind-mount, which has a similar property to a symlink (in
> > the way that it allows for escape from the starting-point). Since it
> > is not possible to differentiate bind-mounts However since
> > bind-mounting requires privileges (in ways symlinks don't) this has
> > been split from LOOKUP_BENEATH. The naming is based on "find -xdev"
> > (though find(1) doesn't walk upwards, the semantics seem obvious).
> >
> > * AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very
> > specific restriction, and it exists because /proc/$pid/fd/...
> > "symlinks" allow for access outside nd->root and pose risk to
> > container runtimes that don't want to be tricked into accessing a host
> > path (but do want to allow no-funny-business symlink resolution).
>
> AT_BENEATH has to imply AT_NO_PROCLINK, right? Especially with the
> semantics you picked for AT_BENEATH. With the original O_BENEATH_ONLY
> semantics, it might be okay to not imply AT_NO_PROCLINK...
>
> > * AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies
> > AT_NO_PROCLINK (obviously).
> >
> > The AT_NO_*LINK flags return -ELOOP if path resolution would violates
> > their requirement, while the others all return -EXDEV. Currently these
> > are only enabled for the stat(2) family and the openat(2) family (the
> > latter has its own brand of O_* flags with the same semantics). Ideally
> > these flags would be supported by all *at(2) syscalls, but this will
> > require adding flags arguments to many of them (and will be done in a
> > separate patchset).
From: Aleksa Sarai
> Sent: 29 September 2018 11:35
>
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
> * Path-based escapes from the starting-point using "/" or ".." are
> blocked by AT_BENEATH.
You may need to allow absolute paths that refer to items inside
the controlled area.
(Even if done by a textual replacement based on the expected name
of the base directory.)
> * Mountpoint crossings are blocked by AT_XDEV.
You might want a mountpoint flag that allows crossing into the mounted
filesystem (you may need to get out in order to do pwd()).
> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> correctly it actually blocks any user of nd_jump_link() because it
> allows out-of-VFS path resolution manipulation).
Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than
being a symlink (although this might still let you get a directory vnode).
FWIW this is what NetBSD does - you can link the open file back into
the filesystem!
>
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
What about allowing 'trivial' symlinks?
...
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.
If you make the flags a property of the directory vnode (perhaps as
well as any syscall flags), and make it inherited by vnode lookup then
it can be used to stop library functions (or entire binaries) using
blocked paths.
You'd then only need to add an fcntl() call to set the flags (but never
clear them) to get the restriction applied to every lookup.
...
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
> On 2018-09-29, Jann Horn <[email protected]> wrote:
> > The problem is what happens if a folder you are walking through is
> > concurrently moved out of the chroot. Consider the following scenario:
> >
> > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > Something else concurrently moves /A/B/C to /A/C. This can result in
> > the following:
> >
> > 1. You start the path walk and reach /A/B/C.
> > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > 3. Your path walk follows the first ".." up into /A. This is outside
> > the process root, but you never actually encountered the process root,
> > so you don't notice.
> > 4. Your path walk follows the second ".." up to /. Again, this is
> > outside the process root, but you don't notice.
> > 5. Your path walk walks down to /etc/passwd, and the open completes
> > successfully. You now have an fd pointing outside your chroot.
> >
> > If the root of your walk is below an attacker-controlled directory,
> > this of course means that you lose instantly. If you point the root of
> > the walk at a directory out of which a process in the container
> > wouldn't be able to move the file, you're probably kinda mostly fine -
> > as long as you know, for certain, that nothing else on the system
> > would ever do that. But I still wouldn't feel good about that.
>
> Please correct me if I'm wrong here (this is the first patch I've
> written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
No.
...
> Speaking naively, doesn't it make sense to invalidate the walk if a path
> component was modified? Or is this something that would be far too
> costly with little benefit?
Lookups and renames can definitely proceed in parallel, and yes I
suspect it would be difficult to get good performance and guaranteed
forward progress if you required lookup of the full path to be atomic
with respect to renames.
--b.
On Sat, Sep 29, 2018 at 06:35:17PM +0200, Jann Horn wrote:
> +cc linux-api; please keep them in CC for future versions of the patch
>
> On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <[email protected]> wrote:
> > The primary motivation for the need for this flag is container runtimes
> > which have to interact with malicious root filesystems in the host
> > namespaces. One of the first requirements for a container runtime to be
> > secure against a malicious rootfs is that they correctly scope symlinks
> > (that is, they should be scoped as though they are chroot(2)ed into the
> > container's rootfs) and ".."-style paths. The already-existing AT_XDEV
> > and AT_NO_PROCLINKS help defend against other potential attacks in a
> > malicious rootfs scenario.
>
> So, I really like the concept for patch 1 of this series (but haven't
> read the code yet); but I dislike this patch because of its footgun
> potential.
>
> If this patch landed as-is, the manpage would need some big warning
> labels. chroot() basically provides no security guarantees at all; and
> yes, that includes that if you do `chroot(...); chdir("/");
> open(attacker_controlled_path, ...);`, you can potentially end up
> opening a file outside the chroot. See
> https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt
> for an example, where Qubes OS did pretty much that, and ended up with
> a potentially exploitable security bug because of that, where one VM,
> while performing a file transfer into another VM, could write outside
> of the transfer target directory.
> The problem is what happens if a folder you are walking through is
> concurrently moved out of the chroot. Consider the following scenario:
>
> You attempt to open "C/../../etc/passwd" under the root "/A/B".
> Something else concurrently moves /A/B/C to /A/C. This can result in
> the following:
>
> 1. You start the path walk and reach /A/B/C.
> 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> 3. Your path walk follows the first ".." up into /A. This is outside
> the process root, but you never actually encountered the process root,
> so you don't notice.
> 4. Your path walk follows the second ".." up to /. Again, this is
> outside the process root, but you don't notice.
> 5. Your path walk walks down to /etc/passwd, and the open completes
> successfully. You now have an fd pointing outside your chroot.
>
> If the root of your walk is below an attacker-controlled directory,
> this of course means that you lose instantly. If you point the root of
> the walk at a directory out of which a process in the container
> wouldn't be able to move the file, you're probably kinda mostly fine -
> as long as you know, for certain, that nothing else on the system
> would ever do that. But I still wouldn't feel good about that.
>
> (Yes, this means that if you run an SFTP server with OpenSSH's
> ChrootDirectory directive, you have to be very careful about these
> things.)
>
> I believe that the only way to robustly use this would be to point the
> dirfd at a mount point, such that you know that being moved out of the
> chroot is impossible because the mount point limits movement of
> directories under it. (Well, technically, it doesn't, but it ensures
> that if a directory does dangerously move away, the syscall fails.) It
> might make sense to hardcode this constraint in the implementation of
> AT_THIS_ROOT, to keep people from shooting themselves in the foot.
I'm very much in favor of dropping AT_THIS_ROOT from this patch series
at least for now and only land the first patch. The first patch is
something that we really want and that it seems we can find a good
design for.
If AT_THIS_ROOT is a feature that still makes sense we can revisit it.
>
> > Currently most container runtimes try to do this resolution in
> > userspace[1], causing many potential race conditions. In addition, the
> > "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > requires a fork+exec which is *very* costly if necessary for every
> > filesystem operation involving a container.
>
> Wait. fork() I understand, but why exec? And actually, you don't need
> a full fork() either, clone() lets you do this with some process parts
> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> the file descriptor table shared. And why chroot()/pivot_root(),
> wouldn't you want to use setns()? I think something like this should
> work (except that you should add some error handling - omitted here
> because I'm lazy), assuming that the container runtime does NOT have
> CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of
> course, this is entirely untested, and probably won't compile because
> I screwed something up. :P But you should get the idea...
>
> // Ensure that we are non-dumpable. Together with
> // commit bfedb589252c, this ensures that container root
> // can't trace our child once it enters the container.
> // My patch
> // https://lore.kernel.org/lkml/[email protected]/
> // would make this unnecessary, but that patch didn't
> // land because Eric nacked it (for political reasons,
> // because people incorrectly claimed that this was a
> // security fix):
> // https://lore.kernel.org/lkml/[email protected]/
> // Note that dumpability is per-mm, not per-process,
> // so this hack has the unfortunate side effect of preventing
> // unprivileged debugging of the container runtime.
> // Oh well.
> prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE);
> // Inform gcc that this particular syscall will effectively
> // return twice, just like vfork() or setjmp().
> __attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall;
> int result_fd = -1;
> // CLONE_FILES means we don't need to do fd passing,
> // we share the file descriptor table.
> // CLONE_VM means we don't have the cost of duplicating
> // our VMAs and page tables, and we don't have to mark
> // all our pagetable entries as readonly for copy-on-write.
> // CLONE_VFORK is a dirty hack to avoid having to
> // allocate a child stack.
> // Lack of SIGCHLD means we don't want to have to wait()
> // for the child.
> int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK,
> 0, 0, 0, 0);
> if (child_pid == 0) {
> // Enter the container's user namespace; this allows us
> // to afterwards join its mount namespace even if we're
> // not capable in the init namespace.
> // (I believe that it should be possible to change the kernel
> // such that this is not required if you have set the
> // no-new-privs flag.)
> setns(container_user_ns_fd, CLONE_NEWUSER);
> // Entering the filesystem namespace automatically
> // moves us to that namespace's filesystem root.
> setns(container_fs_ns_fd, CLONE_NEWNS);
> result_fd = open(untrusted_container_path, ...);
> syscall(__NR_exit, 0);
> }
>
> > The most significant change in semantics with AT_THIS_ROOT is that
> > *at(2) syscalls now no longer have the property that an absolute
> > pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is
> > specified). The reasoning behind this is that AT_THIS_ROOT necessarily
> > has to chroot-scope symlinks with absolute paths to dirfd, and so doing
> > it for the base path seems to be the most consistent behaviour (and also
> > avoids foot-gunning users who want to chroot-scope paths that might be
> > absolute).
> >
> > Currently this is only enabled for the stat(2) and openat(2) family (the
> > latter has its own flag O_THISROOT with the same semantics). Ideally
> > this flag would be supported by all *at(2) syscalls, but this will
> > require adding flags arguments to many of them (and will be done in a
> > separate patchset).
> >
> > [1]: https://github.com/cyphar/filepath-securejoin
> >
> > Cc: Eric Biederman <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Signed-off-by: Aleksa Sarai <[email protected]>
> > ---
> > fs/fcntl.c | 2 +-
> > fs/namei.c | 121 +++++++++++++++++--------------
> > fs/open.c | 2 +
> > fs/stat.c | 4 +-
> > include/linux/fcntl.h | 2 +-
> > include/linux/namei.h | 1 +
> > include/uapi/asm-generic/fcntl.h | 3 +
> > include/uapi/linux/fcntl.h | 2 +
> > 8 files changed, 81 insertions(+), 56 deletions(-)
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index e343618736f7..4c36c5b9fdb9 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
> > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > * is defined as O_NONBLOCK on some platforms and not on others.
> > */
> > - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
> > + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ !=
> > HWEIGHT32(
> > (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > __FMODE_EXEC | __FMODE_NONOTIFY));
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 757dd783771c..1b984f0dbbb4 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd)
> > }
> > }
> >
> > +/*
> > + * Configure nd->path based on the nd->dfd. This is only used as part of
> > + * path_init().
> > + */
> > +static inline int dirfd_path_init(struct nameidata *nd)
> > +{
> > + if (nd->dfd == AT_FDCWD) {
> > + if (nd->flags & LOOKUP_RCU) {
> > + struct fs_struct *fs = current->fs;
> > + unsigned seq;
> > +
> > + do {
> > + seq = read_seqcount_begin(&fs->seq);
> > + nd->path = fs->pwd;
> > + nd->inode = nd->path.dentry->d_inode;
> > + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> > + } while (read_seqcount_retry(&fs->seq, seq));
> > + } else {
> > + get_fs_pwd(current->fs, &nd->path);
> > + nd->inode = nd->path.dentry->d_inode;
> > + }
> > + } else {
> > + /* Caller must check execute permissions on the starting path component */
> > + struct fd f = fdget_raw(nd->dfd);
> > + struct dentry *dentry;
> > +
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + dentry = f.file->f_path.dentry;
> > +
> > + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
> > + fdput(f);
> > + return -ENOTDIR;
> > + }
> > +
> > + nd->path = f.file->f_path;
> > + if (nd->flags & LOOKUP_RCU) {
> > + nd->inode = nd->path.dentry->d_inode;
> > + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> > + } else {
> > + path_get(&nd->path);
> > + nd->inode = nd->path.dentry->d_inode;
> > + }
> > + fdput(f);
> > + }
> > + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) {
> > + nd->root = nd->path;
> > + if (!(nd->flags & LOOKUP_RCU))
> > + path_get(&nd->root);
> > + }
> > + return 0;
> > +}
> > +
> > /* must be paired with terminate_walk() */
> > static const char *path_init(struct nameidata *nd, unsigned flags)
> > {
> > + int error;
> > const char *s = nd->name->name;
> >
> > if (!*s)
> > @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > nd->path.dentry = NULL;
> >
> > nd->m_seq = read_seqbegin(&mount_lock);
> > + if (unlikely(flags & LOOKUP_CHROOT)) {
> > + error = dirfd_path_init(nd);
> > + if (unlikely(error))
> > + return ERR_PTR(error);
> > + }
> > if (*s == '/') {
> > - int error;
> > - set_root(nd);
> > + if (likely(!nd->root.mnt))
> > + set_root(nd);
> > error = nd_jump_root(nd);
> > if (unlikely(error))
> > s = ERR_PTR(error);
> > return s;
> > - } else if (nd->dfd == AT_FDCWD) {
> > - if (flags & LOOKUP_RCU) {
> > - struct fs_struct *fs = current->fs;
> > - unsigned seq;
> > -
> > - do {
> > - seq = read_seqcount_begin(&fs->seq);
> > - nd->path = fs->pwd;
> > - nd->inode = nd->path.dentry->d_inode;
> > - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
> > - } while (read_seqcount_retry(&fs->seq, seq));
> > - } else {
> > - get_fs_pwd(current->fs, &nd->path);
> > - nd->inode = nd->path.dentry->d_inode;
> > - }
> > - if (unlikely(flags & LOOKUP_BENEATH)) {
> > - nd->root = nd->path;
> > - if (!(flags & LOOKUP_RCU))
> > - path_get(&nd->root);
> > - }
> > - return s;
> > - } else {
> > - /* Caller must check execute permissions on the starting path component */
> > - struct fd f = fdget_raw(nd->dfd);
> > - struct dentry *dentry;
> > -
> > - if (!f.file)
> > - return ERR_PTR(-EBADF);
> > -
> > - dentry = f.file->f_path.dentry;
> > -
> > - if (*s && unlikely(!d_can_lookup(dentry))) {
> > - fdput(f);
> > - return ERR_PTR(-ENOTDIR);
> > - }
> > -
> > - nd->path = f.file->f_path;
> > - if (flags & LOOKUP_RCU) {
> > - nd->inode = nd->path.dentry->d_inode;
> > - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> > - } else {
> > - path_get(&nd->path);
> > - nd->inode = nd->path.dentry->d_inode;
> > - }
> > - if (unlikely(flags & LOOKUP_BENEATH)) {
> > - nd->root = nd->path;
> > - if (!(flags & LOOKUP_RCU))
> > - path_get(&nd->root);
> > - }
> > - fdput(f);
> > - return s;
> > }
> > + if (likely(!nd->path.mnt)) {
> > + error = dirfd_path_init(nd);
> > + if (unlikely(error))
> > + return ERR_PTR(error);
> > + }
> > + return s;
> > }
> >
> > static const char *trailing_symlink(struct nameidata *nd)
> > diff --git a/fs/open.c b/fs/open.c
> > index 80f5f566a5ff..81d148f626cd 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
> > lookup_flags |= LOOKUP_NO_PROCLINKS;
> > if (flags & O_NOSYMLINKS)
> > lookup_flags |= LOOKUP_NO_SYMLINKS;
> > + if (flags & O_THISROOT)
> > + lookup_flags |= LOOKUP_CHROOT;
> > op->lookup_flags = lookup_flags;
> > return 0;
> > }
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 791e61b916ae..e8366e4812c3 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
> >
> > if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
> > KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
> > - AT_NO_PROCLINKS | AT_NO_SYMLINKS))
> > + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT))
> > return -EINVAL;
> >
> > if (flags & AT_SYMLINK_NOFOLLOW)
> > @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
> > lookup_flags |= LOOKUP_NO_PROCLINKS;
> > if (flags & AT_NO_SYMLINKS)
> > lookup_flags |= LOOKUP_NO_SYMLINKS;
> > + if (flags & AT_THIS_ROOT)
> > + lookup_flags |= LOOKUP_CHROOT;
> >
> > retry:
> > error = user_path_at(dfd, filename, lookup_flags, &path);
> > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> > index ad5bba4b5b12..95480cd4c09d 100644
> > --- a/include/linux/fcntl.h
> > +++ b/include/linux/fcntl.h
> > @@ -10,7 +10,7 @@
> > O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> > FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> > O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
> > - O_NOPROCLINKS | O_NOSYMLINKS)
> > + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
> >
> > #ifndef force_o_largefile
> > #define force_o_largefile() (BITS_PER_LONG != 32)
> > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > index 5ff7f3362d1b..7ec9e2d84649 100644
> > --- a/include/linux/namei.h
> > +++ b/include/linux/namei.h
> > @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
> > #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
> > #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
> > Implies LOOKUP_NO_PROCLINKS. */
> > +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
> >
> > extern int path_pts(struct path *path);
> >
> > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > index c2bf5983e46a..11206b0e927c 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -113,6 +113,9 @@
> > #ifndef O_NOSYMLINKS
> > #define O_NOSYMLINKS 01000000000
> > #endif
> > +#ifndef O_THISROOT
> > +#define O_THISROOT 02000000000
> > +#endif
> >
> > #define F_DUPFD 0 /* dup */
> > #define F_GETFD 1 /* get close_on_exec */
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 551a9e2166a8..ea978457b68f 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -99,6 +99,8 @@
> > #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */
> > #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
> > Implies AT_NO_PROCLINKS. */
> > +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though
> > + it is chroot-ed into dirfd. */
> >
> >
> > #endif /* _UAPI_LINUX_FCNTL_H */
> > --
> > 2.19.0
> >
> >
>>> Currently most container runtimes try to do this resolution in
>>> userspace[1], causing many potential race conditions. In addition, the
>>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
>>> requires a fork+exec which is *very* costly if necessary for every
>>> filesystem operation involving a container.
>>
>> Wait. fork() I understand, but why exec? And actually, you don't need
>> a full fork() either, clone() lets you do this with some process parts
>> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
>> the file descriptor table shared. And why chroot()/pivot_root(),
>> wouldn't you want to use setns()?
>
> You're right about this -- for C runtimes. In Go we cannot do a raw
> clone() or fork() (if you do it manually with RawSyscall you'll end with
> broken runtime state). So you're forced to do fork+exec (which then
> means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Anyway, it occurs to me that the real problem is that setns() and chroot() are both overkill for this use case. What’s needed is to start your walk from /proc/pid-in-container/root, with two twists:
1. Do the walk as though rooted at a directory. This is basically just your AT_THIS_ROOT, but the footgun is avoided because the dirfd you use is from a foreign namespace, and, except for symlinks to absolute paths, no amount of .. racing is going to escape the *namespace*.
2. Avoid /proc. It’s not just the *links* — you really don’t want to walk into /proc/self. *Maybe* procfs is already careful enough when mounted in a namespace?
On 2018-10-01, Christian Brauner <[email protected]> wrote:
> On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
> > On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai <[email protected]> wrote:
> > > * AT_BENEATH: Disallow ".." or absolute paths (either in the path or
> > > found during symlink resolution) to escape the starting point of name
> > > resolution, though ".." is permitted in cases like "foo/../bar".
> > > Relative symlinks are still allowed (as long as they don't escape the
> > > starting point).
> >
> > As I said on the other thread, I would strongly prefer an API that
> > behaves along the lines of David Drysdale's old patch
> > https://lore.kernel.org/lkml/[email protected]/
> > : Forbid any use of "..". This would also be more straightforward to
> > implement safely. If that doesn't work for you, I would like it if you
> > could at least make that an option. I would like it if this API could
> > mitigate straightforward directory traversal bugs such as
> > https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where
> > a confused deputy attempts to access a path like
> > "/mnt/media_rw/../../data" while intending to access a directory under
> > "/mnt/media_rw".
>
> Oh, the semantics for this changed in this patchset, hah. I was still on
> vacation so didn't get to look at it before it was sent out. From prior
> discussion I remember that the original intention actual was what you
> argue for. And the patchset should be as tight as possible. Having
> special cases where ".." is allowed just sounds like an invitation for
> userspace to get it wrong.
> Aleksa, did you have a specific use-case in mind that made you change
> this or was it already present in an earlier iteration of the patchset
> by someone else?
Al's original patchset allowed "..". A quick survey of my machine shows
that there are 100k symlinks that contain ".." (~37% of all symlinks on
my machine). This indicates to me that you would be restricting a large
amount of reasonable resolutions because of this restriction.
I posted a proposed way to protect against ".." shenanigans. If it's
turns out this is not possible, I'm okay with disallowing ".." (assuming
Al is also okay with that).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2018-10-01, David Laight <[email protected]> wrote:
> > The need for some sort of control over VFS's path resolution (to avoid
> > malicious paths resulting in inadvertent breakouts) has been a very
> > long-standing desire of many userspace applications. This patchset is a
> > revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> >
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> >
> > * Path-based escapes from the starting-point using "/" or ".." are
> > blocked by AT_BENEATH.
>
> You may need to allow absolute paths that refer to items inside
> the controlled area.
> (Even if done by a textual replacement based on the expected name
> of the base directory.)
This is sort of what AT_THIS_ROOT does. I didn't want to include it for
AT_BENEATH because it would be just as contentious as AT_THIS_ROOT
currently is. :P
> > * Mountpoint crossings are blocked by AT_XDEV.
>
> You might want a mountpoint flag that allows crossing into the mounted
> filesystem (you may need to get out in order to do pwd()).
Like a mount flag? I'm not sure how I feel about that. The intention is
to allow for a process to have control over how path lookups are
handled, and tying it to a mount flag means that it's no longer entirely
up to the process.
> > * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> > correctly it actually blocks any user of nd_jump_link() because it
> > allows out-of-VFS path resolution manipulation).
>
> Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than
> being a symlink (although this might still let you get a directory vnode).
> FWIW this is what NetBSD does - you can link the open file back into
> the filesystem!
Isn't this how it works currently? The /proc/$pid/fd/$fd "symlinks" are
actually references to the underlying file (they can even escape a
pivot_root()) -- you can re-open them or do any number of other dodgy
things through /proc with them (we definitely abuse this in container
runtimes -- and I'm sure plenty of other people do as well).
> > AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> > Linus' suggestion in the original thread, I've also implemented
> > AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> > "proclink" resolution).
>
> What about allowing 'trivial' symlinks?
The use-case of AT_NO_SYMLINKS that Linus pitched[1] is that git wants
to have a unique name for every object and so allowing trivial symlinks
is a no-go. I assume "trivial" here means "no-'..' components"?
> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
>
> If you make the flags a property of the directory vnode (perhaps as
> well as any syscall flags), and make it inherited by vnode lookup then
> it can be used to stop library functions (or entire binaries) using
> blocked paths.
> You'd then only need to add an fcntl() call to set the flags (but never
> clear them) to get the restriction applied to every lookup.
This seems like it might be useful, but it could always be done as a
follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag
set. I'm also a little bit concerned that (because fd flags are set on
the 'struct file') if you start sharing fds then you can no longer use
the lookup scoping for security (a racing process could remove the
flags while the management process resolves through it).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2018-10-01, Jann Horn <[email protected]> wrote:
> > If this is an issue for AT_THIS_ROOT, I believe this might also be an
> > issue for AT_BENEATH since they are effectively both using the same
> > nd->root trick (so you could similarly trick AT_BENEATH to not error
> > out). So we'd need to figure out how to solve this problem in order for
> > AT_BENEATH to be safe.
>
> Oh, wait, what? I think I didn't notice that the semantics of
> AT_BENEATH changed like that since the original posting of David
> Drysdale's O_BENEATH_ONLY patch
> (https://lore.kernel.org/lkml/[email protected]/).
> David's patch had nice, straightforward semantics, blocking any form
> of upwards traversal. Why was that changed? Does anyone actually want
> to use paths that contain ".." with AT_BENEATH? I would strongly
> prefer something that blocks any use of "..".
>
> @Al: It looks like this already changed back when you posted
> https://lore.kernel.org/lkml/[email protected]/
> ?
Yes, I copied the semantics from Al's patchset. I don't know why he felt
strongly that this was the best idea, but in my opinion allowing paths
like "a/../b/../c" seems like it's quite useful because otherwise you
wouldn't be able to operate on most distribution root filesystems (many
have symlinks that have ".." components). Looking at my own (openSUSE)
machine there are something like 100k such symlinks (~37% of symlinks on
my machine).
While I do understand that the easiest way of solving this problem is to
disallow ".." entirely with AT_BENEATH, given that support ".." has
utility, I would like to know whether it's actually not possible to have
this work safely.
> > Speaking naively, doesn't it make sense to invalidate the walk if a path
> > component was modified? Or is this something that would be far too
> > costly with little benefit? What if we do more aggressive nd->root
> > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> > current->mnt_ns->root)?
>
> It seems to me like doing that would basically require looking at each
> node in the path walk twice? And it'd be difficult to guarantee
> forward progress unless you're willing to do some fairly heavy
> locking.
I had another idea since I wrote my previous mail -- since the issue (at
least the way I understand it) is that ".." can "skip" over nd->root
because of the rename, what if we had some sort of is_descendant() check
within follow_dotdot()? (FWIW, ".." already has some pretty interesting
"hand-over-hand" locking semantics.) This should be effectively similar
to how prepend_path() deals with a path that is not reachable from @root
(I'm not sure if the locking is acceptable for the namei path though).
Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most
common component type (and we only need to do these checks for those
flags), I would think that the extra cost would not be _that_ awful.
Would this work?
> > You're right about this -- for C runtimes. In Go we cannot do a raw
> > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > broken runtime state). So you're forced to do fork+exec (which then
> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > for CLONE_VFORK.
>
> If you insist on implementing every last bit of your code in Go, I guess.
Fair enough, though I believe this would affect most multi-threaded
programs as well (regardless of the language they're written in).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Mon, 1 Oct 2018, Mickaël Salaün wrote:
> Another way to apply a security policy could be to tied it to a file
> descriptor, similarly to Capsicum, which could enable to create
> programmable (real) capabilities. This way, it would be possible to
> "wrap" a file descriptor with a Landlock program and use it with
> FD-based syscalls or pass it to other processes. This would not require
> changes to the FS subsystem, but only the Landlock LSM code. This isn't
> done yet but I plan to add this new way to restrict operations on file
> descriptors.
Very interesting!
This could possibly be an LSM which stacks/integrates with other LSMs to
enforce MAC of object capabilities.
>
> Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be
> simple to use and enough for now. We must be careful of the hardcoded
> policy though.
>
>
> >
> >> On 9/29/18 12:34, Aleksa Sarai wrote:
> >>> The need for some sort of control over VFS's path resolution (to avoid
> >>> malicious paths resulting in inadvertent breakouts) has been a very
> >>> long-standing desire of many userspace applications. This patchset is a
> >>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> >>>
> >>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> >>> in the original thread, along with a further split of AT_NO_PROCLINKS
> >>> which means that each individual property of AT_NO_JUMPS is now a
> >>> separate flag:
> >>>
> >>> * Path-based escapes from the starting-point using "/" or ".." are
> >>> blocked by AT_BENEATH.
> >>> * Mountpoint crossings are blocked by AT_XDEV.
> >>> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> >>> correctly it actually blocks any user of nd_jump_link() because it
> >>> allows out-of-VFS path resolution manipulation).
> >>>
> >>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> >>> Linus' suggestion in the original thread, I've also implemented
> >>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> >>> "proclink" resolution).
> >>>
> >>> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> >>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> >>> blocks this as well (feel free to ask me to remove it if you feel this
> >>> is not sane).
> >>>
> >>> Currently I've only enabled these for openat(2) and the stat(2) family.
> >>> I would hope we could enable it for basically every *at(2) syscall --
> >>> but many of them appear to not have a @flags argument and thus we'll
> >>> need to add several new syscalls to do this. I'm more than happy to send
> >>> those patches, but I'd prefer to know that this preliminary work is
> >>> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> >>> syscalls.
> >>>
> >>> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> >>> is probably going to be more contentious than the refresh of
> >>> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> >>> describes my reasoning, but the shortened version of the premise is that
> >>> continer runtimes need to have a way to resolve paths within a
> >>> potentially malicious rootfs. Container runtimes currently do this in
> >>> userspace[2] which has implicit race conditions that are not resolvable
> >>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> >>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> >>> path resolution, which would be invaluable for us -- and the
> >>> implementation is basically identical to AT_BENEATH (except that we
> >>> don't return errors when someone actually hits the root).
> >>>
> >>> I've added some selftests for this, but it's not clear to me whether
> >>> they should live here or in xfstests (as far as I can tell there are no
> >>> other VFS tests in selftests, while there are some tests that look like
> >>> generic VFS tests in xfstests). If you'd prefer them to be included in
> >>> xfstests, let me know.
> >>>
> >>> [1]: https://lore.kernel.org/patchwork/patch/784221/
> >>> [2]: https://github.com/cyphar/filepath-securejoin
> >>>
> >>> Aleksa Sarai (3):
> >>> namei: implement O_BENEATH-style AT_* flags
> >>> namei: implement AT_THIS_ROOT chroot-like path resolution
> >>> selftests: vfs: add AT_* path resolution tests
> >>>
> >>> fs/fcntl.c | 2 +-
> >>> fs/namei.c | 158 ++++++++++++------
> >>> fs/open.c | 10 ++
> >>> fs/stat.c | 15 +-
> >>> include/linux/fcntl.h | 3 +-
> >>> include/linux/namei.h | 8 +
> >>> include/uapi/asm-generic/fcntl.h | 20 +++
> >>> include/uapi/linux/fcntl.h | 10 ++
> >>> tools/testing/selftests/Makefile | 1 +
> >>> tools/testing/selftests/vfs/.gitignore | 1 +
> >>> tools/testing/selftests/vfs/Makefile | 13 ++
> >>> tools/testing/selftests/vfs/at_flags.h | 40 +++++
> >>> tools/testing/selftests/vfs/common.sh | 37 ++++
> >>> .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++
> >>> .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++
> >>> .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++
> >>> .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++
> >>> .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++
> >>> tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++
> >>> 19 files changed, 707 insertions(+), 56 deletions(-)
> >>> create mode 100644 tools/testing/selftests/vfs/.gitignore
> >>> create mode 100644 tools/testing/selftests/vfs/Makefile
> >>> create mode 100644 tools/testing/selftests/vfs/at_flags.h
> >>> create mode 100644 tools/testing/selftests/vfs/common.sh
> >>> create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> >>> create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> >>> create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> >>> create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> >>> create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> >>> create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
> >>>
> >>
> >
> >
>
>
--
James Morris
<[email protected]>
On 2018-10-01, Andy Lutomirski <[email protected]> wrote:
> >>> Currently most container runtimes try to do this resolution in
> >>> userspace[1], causing many potential race conditions. In addition, the
> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> >>> requires a fork+exec which is *very* costly if necessary for every
> >>> filesystem operation involving a container.
> >>
> >> Wait. fork() I understand, but why exec? And actually, you don't need
> >> a full fork() either, clone() lets you do this with some process parts
> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> >> the file descriptor table shared. And why chroot()/pivot_root(),
> >> wouldn't you want to use setns()?
> >
> > You're right about this -- for C runtimes. In Go we cannot do a raw
> > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > broken runtime state). So you're forced to do fork+exec (which then
> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > for CLONE_VFORK.
>
> I must admit that I’m not very sympathetic to the argument that “Go’s
> runtime model is incompatible with the simpler solution.”
Multi-threaded programs have a similar issue (though with Go it's much
worse). If you fork a multi-threaded C program then you can only safely
use AS-Safe glibc functions (those that are safe within a signal
handler). But if you're just doing three syscalls this shouldn't be as
big of a problem as Go where you can't even do said syscalls.
> Anyway, it occurs to me that the real problem is that setns() and
> chroot() are both overkill for this use case.
I agree. My diversion to Go was to explain why it was particularly bad
for cri-o/rkt/runc/Docker/etc.
> What’s needed is to start your walk from /proc/pid-in-container/root,
> with two twists:
>
> 1. Do the walk as though rooted at a directory. This is basically just
> your AT_THIS_ROOT, but the footgun is avoided because the dirfd you
> use is from a foreign namespace, and, except for symlinks to absolute
> paths, no amount of .. racing is going to escape the *namespace*.
This is quite clever and I'll admit I hadn't thought of this. This
definitely fixes the ".." issue, but as you've said it won't handle
absolute symlinks (which means userspace has the same races that we
currently have even if you assume that you have a container process
already running -- CVE-2018-15664 is one of many examples of this).
(AT_THIS_ROOT using /proc/$container/root would in principle fix all of
the mentioned issues -- but as I said before I'd like to see whether
hardening ".." would be enough to solve the escape problem.)
> 2. Avoid /proc. It’s not just the *links* — you really don’t want to
> walk into /proc/self. *Maybe* procfs is already careful enough when
> mounted in a namespace?
I just tried it and /proc/self gives you -ENOENT. I believe this is
because it does a check against the pid namespace that the procfs mount
has pinned.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
From: Aleksa Sarai
> Sent: 01 October 2018 17:16
>
> On 2018-10-01, David Laight <[email protected]> wrote:
...
> > > * Mountpoint crossings are blocked by AT_XDEV.
> >
> > You might want a mountpoint flag that allows crossing into the mounted
> > filesystem (you may need to get out in order to do pwd()).
>
> Like a mount flag? I'm not sure how I feel about that. The intention is
> to allow for a process to have control over how path lookups are
> handled, and tying it to a mount flag means that it's no longer entirely
> up to the process.
Right, but you may have some mount points that you don't want to cross
and others that it is perfectly fine to cross.
For example you might want to be able to cross into a 'tmp' filesystem.
...
> > If you make the flags a property of the directory vnode (perhaps as
> > well as any syscall flags), and make it inherited by vnode lookup then
> > it can be used to stop library functions (or entire binaries) using
> > blocked paths.
> > You'd then only need to add an fcntl() call to set the flags (but never
> > clear them) to get the restriction applied to every lookup.
>
> This seems like it might be useful, but it could always be done as a
> follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag
> set. I'm also a little bit concerned that (because fd flags are set on
> the 'struct file') if you start sharing fds then you can no longer use
> the lookup scoping for security (a racing process could remove the
> flags while the management process resolves through it).
I was thinking that the flags would never be removable.
A management process might have to flip its cwd back and forth
in order to clear the flags (opendir(".") should give a different
struct file).
This all gets tied up with the slight requirement for per-thread cwd.
I had another thought that the crudentials structure used for a file
lookup could also be taken from the cwd (not sure how it would
get there - especially if you need the correct group list).
That would allow a 'management' process to open a file in the context
of the target user process.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Oct 2, 2018 at 12:32 AM Aleksa Sarai <[email protected]> wrote:
>
> On 2018-10-01, Andy Lutomirski <[email protected]> wrote:
> > >>> Currently most container runtimes try to do this resolution in
> > >>> userspace[1], causing many potential race conditions. In addition, the
> > >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > >>> requires a fork+exec which is *very* costly if necessary for every
> > >>> filesystem operation involving a container.
> > >>
> > >> Wait. fork() I understand, but why exec? And actually, you don't need
> > >> a full fork() either, clone() lets you do this with some process parts
> > >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > >> the file descriptor table shared. And why chroot()/pivot_root(),
> > >> wouldn't you want to use setns()?
> > >
> > > You're right about this -- for C runtimes. In Go we cannot do a raw
> > > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > > broken runtime state). So you're forced to do fork+exec (which then
> > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > > for CLONE_VFORK.
> >
> > I must admit that I’m not very sympathetic to the argument that “Go’s
> > runtime model is incompatible with the simpler solution.”
>
> Multi-threaded programs have a similar issue (though with Go it's much
> worse). If you fork a multi-threaded C program then you can only safely
> use AS-Safe glibc functions (those that are safe within a signal
> handler). But if you're just doing three syscalls this shouldn't be as
> big of a problem as Go where you can't even do said syscalls.
>
> > Anyway, it occurs to me that the real problem is that setns() and
> > chroot() are both overkill for this use case.
>
> I agree. My diversion to Go was to explain why it was particularly bad
> for cri-o/rkt/runc/Docker/etc.
>
> > What’s needed is to start your walk from /proc/pid-in-container/root,
> > with two twists:
> >
> > 1. Do the walk as though rooted at a directory. This is basically just
> > your AT_THIS_ROOT, but the footgun is avoided because the dirfd you
> > use is from a foreign namespace, and, except for symlinks to absolute
> > paths, no amount of .. racing is going to escape the *namespace*.
>
> This is quite clever and I'll admit I hadn't thought of this. This
> definitely fixes the ".." issue, but as you've said it won't handle
> absolute symlinks (which means userspace has the same races that we
> currently have even if you assume that you have a container process
> already running -- CVE-2018-15664 is one of many examples of this).
>
> (AT_THIS_ROOT using /proc/$container/root would in principle fix all of
> the mentioned issues -- but as I said before I'd like to see whether
> hardening ".." would be enough to solve the escape problem.)
Hmm. Good point.
On 2018-09-29, Jann Horn <[email protected]> wrote:
> You attempt to open "C/../../etc/passwd" under the root "/A/B".
> Something else concurrently moves /A/B/C to /A/C. This can result in
> the following:
>
> 1. You start the path walk and reach /A/B/C.
> 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> 3. Your path walk follows the first ".." up into /A. This is outside
> the process root, but you never actually encountered the process root,
> so you don't notice.
> 4. Your path walk follows the second ".." up to /. Again, this is
> outside the process root, but you don't notice.
> 5. Your path walk walks down to /etc/passwd, and the open completes
> successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which
according to my testing protects against attacks where ".." skips over
nd->root. It abuses __d_path to figure out if nd->path can be resolved
from nd->root (obviously a proper version of this patch would refactor
__d_path so it could be used like this -- and would not return
-EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant
breakouts before this patch and after it I didn't see a single breakout
after running it overnight. Obviously this is not conclusive, but I'm
hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm
not seeing?
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and
thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a
path can be used to "skip over" nd->root and thus escape to the
filesystem above nd->root.
thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases
will be detected during ".." resolution (which is the weak point of
chroot(2) -- since walking *into* a subdirectory tautologically cannot
result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a
path is moved from within the chroot to outside the chroot and we
incorrectly decide it is safe (because at that point we are still within
the set of files which were accessible at the beginning of resolution).
However, we can fail resolution on the next path component if it remains
outside of the root. A path which has always been outside nd->root
during resolution will never be resolveable from nd->root and thus will
always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case,
purely as a debugging measure (so that you can see that
the protection actually does something). Obviously in the
proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6f995e6de6b1..c8349693d47b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -53,8 +53,8 @@
* The new code replaces the old recursive symlink resolution with
* an iterative one (in case of non-nested symlink chains). It does
* this with calls to <fs>_follow_link().
- * As a side effect, dir_namei(), _namei() and follow_link() are now
- * replaced with a single function lookup_dentry() that can handle all
+ * As a side effect, dir_namei(), _namei() and follow_link() are now
+ * replaced with a single function lookup_dentry() that can handle all
* the special cases of the former code.
*
* With the new dcache, the pathname is stored at each inode, at least as
@@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -EXDEV;
break;
}
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
+ char *pathbuf, *pathptr;
+
+ pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!pathbuf)
+ return -ECHILD;
+ pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
+ kfree(pathbuf);
+ if (IS_ERR_OR_NULL(pathptr)) {
+ if (!pathptr)
+ pathptr = ERR_PTR(-EMULTIHOP);
+ return PTR_ERR(pathptr);
+ }
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd)
return -EXDEV;
break;
}
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
+ char *pathbuf, *pathptr;
+
+ pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!pathbuf)
+ return -ENOMEM;
+ pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
+ kfree(pathbuf);
+ if (IS_ERR_OR_NULL(pathptr)) {
+ if (!pathptr)
+ pathptr = ERR_PTR(-EMULTIHOP);
+ return PTR_ERR(pathptr);
+ }
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
--
2.19.0
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Tue, Oct 02, 2018 at 02:04:31AM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Christian Brauner <[email protected]> wrote:
> > On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
> > > On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai <[email protected]> wrote:
> > > > * AT_BENEATH: Disallow ".." or absolute paths (either in the path or
> > > > found during symlink resolution) to escape the starting point of name
> > > > resolution, though ".." is permitted in cases like "foo/../bar".
> > > > Relative symlinks are still allowed (as long as they don't escape the
> > > > starting point).
> > >
> > > As I said on the other thread, I would strongly prefer an API that
> > > behaves along the lines of David Drysdale's old patch
> > > https://lore.kernel.org/lkml/[email protected]/
> > > : Forbid any use of "..". This would also be more straightforward to
> > > implement safely. If that doesn't work for you, I would like it if you
> > > could at least make that an option. I would like it if this API could
> > > mitigate straightforward directory traversal bugs such as
> > > https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where
> > > a confused deputy attempts to access a path like
> > > "/mnt/media_rw/../../data" while intending to access a directory under
> > > "/mnt/media_rw".
> >
> > Oh, the semantics for this changed in this patchset, hah. I was still on
> > vacation so didn't get to look at it before it was sent out. From prior
> > discussion I remember that the original intention actual was what you
> > argue for. And the patchset should be as tight as possible. Having
> > special cases where ".." is allowed just sounds like an invitation for
> > userspace to get it wrong.
> > Aleksa, did you have a specific use-case in mind that made you change
> > this or was it already present in an earlier iteration of the patchset
> > by someone else?
>
> Al's original patchset allowed "..". A quick survey of my machine shows
> that there are 100k symlinks that contain ".." (~37% of all symlinks on
> my machine). This indicates to me that you would be restricting a large
> amount of reasonable resolutions because of this restriction.
>
> I posted a proposed way to protect against ".." shenanigans. If it's
> turns out this is not possible, I'm okay with disallowing ".." (assuming
> Al is also okay with that).
Sounds acceptable to me.
On Tue, Oct 02, 2018 at 02:18:33AM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Jann Horn <[email protected]> wrote:
> > > If this is an issue for AT_THIS_ROOT, I believe this might also be an
> > > issue for AT_BENEATH since they are effectively both using the same
> > > nd->root trick (so you could similarly trick AT_BENEATH to not error
> > > out). So we'd need to figure out how to solve this problem in order for
> > > AT_BENEATH to be safe.
> >
> > Oh, wait, what? I think I didn't notice that the semantics of
> > AT_BENEATH changed like that since the original posting of David
> > Drysdale's O_BENEATH_ONLY patch
> > (https://lore.kernel.org/lkml/[email protected]/).
> > David's patch had nice, straightforward semantics, blocking any form
> > of upwards traversal. Why was that changed? Does anyone actually want
> > to use paths that contain ".." with AT_BENEATH? I would strongly
> > prefer something that blocks any use of "..".
> >
> > @Al: It looks like this already changed back when you posted
> > https://lore.kernel.org/lkml/[email protected]/
> > ?
>
> Yes, I copied the semantics from Al's patchset. I don't know why he felt
> strongly that this was the best idea, but in my opinion allowing paths
> like "a/../b/../c" seems like it's quite useful because otherwise you
> wouldn't be able to operate on most distribution root filesystems (many
> have symlinks that have ".." components). Looking at my own (openSUSE)
> machine there are something like 100k such symlinks (~37% of symlinks on
> my machine).
>
> While I do understand that the easiest way of solving this problem is to
> disallow ".." entirely with AT_BENEATH, given that support ".." has
> utility, I would like to know whether it's actually not possible to have
> this work safely.
>
> > > Speaking naively, doesn't it make sense to invalidate the walk if a path
> > > component was modified? Or is this something that would be far too
> > > costly with little benefit? What if we do more aggressive nd->root
> > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root !=
> > > current->mnt_ns->root)?
> >
> > It seems to me like doing that would basically require looking at each
> > node in the path walk twice? And it'd be difficult to guarantee
> > forward progress unless you're willing to do some fairly heavy
> > locking.
>
> I had another idea since I wrote my previous mail -- since the issue (at
> least the way I understand it) is that ".." can "skip" over nd->root
> because of the rename, what if we had some sort of is_descendant() check
> within follow_dotdot()? (FWIW, ".." already has some pretty interesting
> "hand-over-hand" locking semantics.) This should be effectively similar
> to how prepend_path() deals with a path that is not reachable from @root
> (I'm not sure if the locking is acceptable for the namei path though).
>
> Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most
> common component type (and we only need to do these checks for those
> flags), I would think that the extra cost would not be _that_ awful.
>
> Would this work?
>
> > > You're right about this -- for C runtimes. In Go we cannot do a raw
> > > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > > broken runtime state). So you're forced to do fork+exec (which then
> > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > > for CLONE_VFORK.
> >
> > If you insist on implementing every last bit of your code in Go, I guess.
>
> Fair enough, though I believe this would affect most multi-threaded
> programs as well (regardless of the language they're written in).
(Depends on whether you do any explicit locking and have atfork handlers
for your locks and so on. If you do a clone syscall directly to avoid
having libc running any additional atfork handlers (flushing streams
etc.) it's doable though not ideal.)
On Fri, Oct 05, 2018 at 02:26:11AM +1000, Aleksa Sarai wrote:
> On 2018-09-29, Jann Horn <[email protected]> wrote:
> > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > Something else concurrently moves /A/B/C to /A/C. This can result in
> > the following:
> >
> > 1. You start the path walk and reach /A/B/C.
> > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > 3. Your path walk follows the first ".." up into /A. This is outside
> > the process root, but you never actually encountered the process root,
> > so you don't notice.
> > 4. Your path walk follows the second ".." up to /. Again, this is
> > outside the process root, but you don't notice.
> > 5. Your path walk walks down to /etc/passwd, and the open completes
> > successfully. You now have an fd pointing outside your chroot.
>
> I've been playing with this and I have the following patch, which
> according to my testing protects against attacks where ".." skips over
> nd->root. It abuses __d_path to figure out if nd->path can be resolved
> from nd->root (obviously a proper version of this patch would refactor
> __d_path so it could be used like this -- and would not return
> -EMULTIHOP).
>
> I've also attached my reproducer. With it, I was seeing fairly constant
> breakouts before this patch and after it I didn't see a single breakout
> after running it overnight. Obviously this is not conclusive, but I'm
> hoping that it can show what my idea for protecting against ".." was.
>
> Does this patch make sense? Or is there something wrong with it that I'm
> not seeing?
Interesting.
Apart from the abuse of __d_path() :) the question I'd have is whether
this just minimizes the race window or if you can provide a sound
argument that this actually can't happen anymore with this patch.
>
> --8<-------------------------------------------------------------------
>
> There is a fairly easy-to-exploit race condition with chroot(2) (and
> thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a
> path can be used to "skip over" nd->root and thus escape to the
> filesystem above nd->root.
>
> thread1 [attacker]:
> for (;;)
> renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> thread2 [victim]:
> for (;;)
> openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
>
> With fairly significant regularity, thread2 will resolve to
> "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases
> will be detected during ".." resolution (which is the weak point of
> chroot(2) -- since walking *into* a subdirectory tautologically cannot
> result in you walking *outside* nd->root).
>
> The use of __d_path here might seem suspect, however we don't mind if a
> path is moved from within the chroot to outside the chroot and we
> incorrectly decide it is safe (because at that point we are still within
> the set of files which were accessible at the beginning of resolution).
> However, we can fail resolution on the next path component if it remains
> outside of the root. A path which has always been outside nd->root
> during resolution will never be resolveable from nd->root and thus will
> always be blocked.
>
> DO NOT MERGE: Currently this code returns -EMULTIHOP in this case,
> purely as a debugging measure (so that you can see that
> the protection actually does something). Obviously in the
> proper patch this will return -EXDEV.
>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> fs/namei.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 6f995e6de6b1..c8349693d47b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -53,8 +53,8 @@
> * The new code replaces the old recursive symlink resolution with
> * an iterative one (in case of non-nested symlink chains). It does
> * this with calls to <fs>_follow_link().
> - * As a side effect, dir_namei(), _namei() and follow_link() are now
> - * replaced with a single function lookup_dentry() that can handle all
> + * As a side effect, dir_namei(), _namei() and follow_link() are now
> + * replaced with a single function lookup_dentry() that can handle all
> * the special cases of the former code.
> *
> * With the new dcache, the pathname is stored at each inode, at least as
> @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> return -EXDEV;
> break;
> }
> + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> + char *pathbuf, *pathptr;
> +
> + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> + if (!pathbuf)
> + return -ECHILD;
> + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> + kfree(pathbuf);
> + if (IS_ERR_OR_NULL(pathptr)) {
> + if (!pathptr)
> + pathptr = ERR_PTR(-EMULTIHOP);
> + return PTR_ERR(pathptr);
> + }
> + }
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> struct dentry *old = nd->path.dentry;
> struct dentry *parent = old->d_parent;
> @@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd)
> return -EXDEV;
> break;
> }
> + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> + char *pathbuf, *pathptr;
> +
> + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> + if (!pathbuf)
> + return -ENOMEM;
> + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> + kfree(pathbuf);
> + if (IS_ERR_OR_NULL(pathptr)) {
> + if (!pathptr)
> + pathptr = ERR_PTR(-EMULTIHOP);
> + return PTR_ERR(pathptr);
> + }
> + }
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> int ret = path_parent_directory(&nd->path);
> if (ret)
> --
> 2.19.0
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>
On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <[email protected]> wrote:
> On 2018-09-29, Jann Horn <[email protected]> wrote:
> > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > Something else concurrently moves /A/B/C to /A/C. This can result in
> > the following:
> >
> > 1. You start the path walk and reach /A/B/C.
> > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > 3. Your path walk follows the first ".." up into /A. This is outside
> > the process root, but you never actually encountered the process root,
> > so you don't notice.
> > 4. Your path walk follows the second ".." up to /. Again, this is
> > outside the process root, but you don't notice.
> > 5. Your path walk walks down to /etc/passwd, and the open completes
> > successfully. You now have an fd pointing outside your chroot.
>
> I've been playing with this and I have the following patch, which
> according to my testing protects against attacks where ".." skips over
> nd->root. It abuses __d_path to figure out if nd->path can be resolved
> from nd->root (obviously a proper version of this patch would refactor
> __d_path so it could be used like this -- and would not return
> -EMULTIHOP).
>
> I've also attached my reproducer. With it, I was seeing fairly constant
> breakouts before this patch and after it I didn't see a single breakout
> after running it overnight. Obviously this is not conclusive, but I'm
> hoping that it can show what my idea for protecting against ".." was.
>
> Does this patch make sense? Or is there something wrong with it that I'm
> not seeing?
>
> --8<-------------------------------------------------------------------
>
> There is a fairly easy-to-exploit race condition with chroot(2) (and
> thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a
> path can be used to "skip over" nd->root and thus escape to the
> filesystem above nd->root.
>
> thread1 [attacker]:
> for (;;)
> renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> thread2 [victim]:
> for (;;)
> openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
>
> With fairly significant regularity, thread2 will resolve to
> "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases
> will be detected during ".." resolution (which is the weak point of
> chroot(2) -- since walking *into* a subdirectory tautologically cannot
> result in you walking *outside* nd->root).
>
> The use of __d_path here might seem suspect, however we don't mind if a
> path is moved from within the chroot to outside the chroot and we
> incorrectly decide it is safe (because at that point we are still within
> the set of files which were accessible at the beginning of resolution).
> However, we can fail resolution on the next path component if it remains
> outside of the root. A path which has always been outside nd->root
> during resolution will never be resolveable from nd->root and thus will
> always be blocked.
>
> DO NOT MERGE: Currently this code returns -EMULTIHOP in this case,
> purely as a debugging measure (so that you can see that
> the protection actually does something). Obviously in the
> proper patch this will return -EXDEV.
>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> fs/namei.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 6f995e6de6b1..c8349693d47b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -53,8 +53,8 @@
> * The new code replaces the old recursive symlink resolution with
> * an iterative one (in case of non-nested symlink chains). It does
> * this with calls to <fs>_follow_link().
> - * As a side effect, dir_namei(), _namei() and follow_link() are now
> - * replaced with a single function lookup_dentry() that can handle all
> + * As a side effect, dir_namei(), _namei() and follow_link() are now
> + * replaced with a single function lookup_dentry() that can handle all
> * the special cases of the former code.
> *
> * With the new dcache, the pathname is stored at each inode, at least as
> @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> return -EXDEV;
> break;
> }
> + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> + char *pathbuf, *pathptr;
> +
> + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> + if (!pathbuf)
> + return -ECHILD;
> + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> + kfree(pathbuf);
> + if (IS_ERR_OR_NULL(pathptr)) {
> + if (!pathptr)
> + pathptr = ERR_PTR(-EMULTIHOP);
> + return PTR_ERR(pathptr);
> + }
> + }
One somewhat problematic thing about this approach is that if someone
tries to lookup
"a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some
reason, you'll have quadratic runtime: For each "..", you'll have to
walk up to the root.
On 2018-10-04, Jann Horn <[email protected]> wrote:
> On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <[email protected]> wrote:
> > On 2018-09-29, Jann Horn <[email protected]> wrote:
> > > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > > Something else concurrently moves /A/B/C to /A/C. This can result in
> > > the following:
> > >
> > > 1. You start the path walk and reach /A/B/C.
> > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > > 3. Your path walk follows the first ".." up into /A. This is outside
> > > the process root, but you never actually encountered the process root,
> > > so you don't notice.
> > > 4. Your path walk follows the second ".." up to /. Again, this is
> > > outside the process root, but you don't notice.
> > > 5. Your path walk walks down to /etc/passwd, and the open completes
> > > successfully. You now have an fd pointing outside your chroot.
> >
> > I've been playing with this and I have the following patch, which
> > according to my testing protects against attacks where ".." skips over
> > nd->root. It abuses __d_path to figure out if nd->path can be resolved
> > from nd->root (obviously a proper version of this patch would refactor
> > __d_path so it could be used like this -- and would not return
> > -EMULTIHOP).
> >
> > I've also attached my reproducer. With it, I was seeing fairly constant
> > breakouts before this patch and after it I didn't see a single breakout
> > after running it overnight. Obviously this is not conclusive, but I'm
> > hoping that it can show what my idea for protecting against ".." was.
> >
> > Does this patch make sense? Or is there something wrong with it that I'm
> > not seeing?
> >
> > --8<-------------------------------------------------------------------
> >
> > There is a fairly easy-to-exploit race condition with chroot(2) (and
> > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a
> > path can be used to "skip over" nd->root and thus escape to the
> > filesystem above nd->root.
> >
> > thread1 [attacker]:
> > for (;;)
> > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> > thread2 [victim]:
> > for (;;)
> > openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> >
> > With fairly significant regularity, thread2 will resolve to
> > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases
> > will be detected during ".." resolution (which is the weak point of
> > chroot(2) -- since walking *into* a subdirectory tautologically cannot
> > result in you walking *outside* nd->root).
> >
> > The use of __d_path here might seem suspect, however we don't mind if a
> > path is moved from within the chroot to outside the chroot and we
> > incorrectly decide it is safe (because at that point we are still within
> > the set of files which were accessible at the beginning of resolution).
> > However, we can fail resolution on the next path component if it remains
> > outside of the root. A path which has always been outside nd->root
> > during resolution will never be resolveable from nd->root and thus will
> > always be blocked.
> >
> > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case,
> > purely as a debugging measure (so that you can see that
> > the protection actually does something). Obviously in the
> > proper patch this will return -EXDEV.
> >
> > Signed-off-by: Aleksa Sarai <[email protected]>
> > ---
> > fs/namei.c | 32 ++++++++++++++++++++++++++++++--
> > 1 file changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 6f995e6de6b1..c8349693d47b 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -53,8 +53,8 @@
> > * The new code replaces the old recursive symlink resolution with
> > * an iterative one (in case of non-nested symlink chains). It does
> > * this with calls to <fs>_follow_link().
> > - * As a side effect, dir_namei(), _namei() and follow_link() are now
> > - * replaced with a single function lookup_dentry() that can handle all
> > + * As a side effect, dir_namei(), _namei() and follow_link() are now
> > + * replaced with a single function lookup_dentry() that can handle all
> > * the special cases of the former code.
> > *
> > * With the new dcache, the pathname is stored at each inode, at least as
> > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > return -EXDEV;
> > break;
> > }
> > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> > + char *pathbuf, *pathptr;
> > +
> > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > + if (!pathbuf)
> > + return -ECHILD;
> > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> > + kfree(pathbuf);
> > + if (IS_ERR_OR_NULL(pathptr)) {
> > + if (!pathptr)
> > + pathptr = ERR_PTR(-EMULTIHOP);
> > + return PTR_ERR(pathptr);
> > + }
> > + }
>
> One somewhat problematic thing about this approach is that if someone
> tries to lookup
> "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some
> reason, you'll have quadratic runtime: For each "..", you'll have to
> walk up to the root.
What if we took rename_lock (call it nd->r_seq) at the start of the
resolution, and then only tried the __d_path-style check
if (read_seqretry(&rename_lock, nd->r_seq) ||
read_seqretry(&mount_lock, nd->m_seq))
/* do the __d_path lookup. */
That way you would only hit the slow path if there were concurrent
renames or mounts *and* you are doing a path resolution with
AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does
this (and after some testing it also appears to work).
I'm not sure if there's a way to always avoid the quadratic lookup
without (significantly and probably unreasonably) changing how dcache
invalidation works. And obviously using this slow path if there was
_any_ rename on the _entire_ system is suboptimal, but I think it is a
significant improvement.
Another possibility is to expand on Andy's suggestion to use
/proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a
namespace as its dirfd (I'm not sure if there's a trivial way to detect
this though). This wouldn't help with AT_BENEATH, but it should protect
against ".." shenanigans without any ".." handling changes. (This is
less ideal because it requires a container process, but it is another
way of dealing with the issue.)
---
fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6f995e6de6b1..12c9be175cb4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -493,7 +493,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -EXDEV;
break;
}
+ if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
+ (read_seqretry(&rename_lock, nd->r_seq) ||
+ read_seqretry(&mount_lock, nd->m_seq)))) {
+ char *pathbuf, *pathptr;
+
+ nd->r_seq = read_seqbegin(&rename_lock);
+ /* Cannot take m_seq here. */
+
+ pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!pathbuf)
+ return -ECHILD;
+ pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
+ kfree(pathbuf);
+ if (IS_ERR_OR_NULL(pathptr)) {
+ int error = PTR_ERR_OR_ZERO(pathptr);
+
+ if (!error)
+ error = nd_jump_root(nd);
+ return error;
+ }
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd)
return -EXDEV;
break;
}
+ if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
+ (read_seqretry(&rename_lock, nd->r_seq) ||
+ read_seqretry(&mount_lock, nd->m_seq)))) {
+ char *pathbuf, *pathptr;
+
+ nd->r_seq = read_seqbegin(&rename_lock);
+ /* Cannot take m_seq here. */
+
+ pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!pathbuf)
+ return -ENOMEM;
+ pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
+ kfree(pathbuf);
+ if (IS_ERR_OR_NULL(pathptr)) {
+ int error = PTR_ERR_OR_ZERO(pathptr);
+
+ if (!error)
+ error = nd_jump_root(nd);
+ return error;
+ }
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+ nd->m_seq = read_seqbegin(&mount_lock);
+ nd->r_seq = read_seqbegin(&rename_lock);
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2279,7 +2324,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (flags & LOOKUP_RCU) {
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
nd->root_seq = nd->seq;
- nd->m_seq = read_seqbegin(&mount_lock);
} else {
path_get(&nd->path);
}
@@ -2290,7 +2334,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock);
if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) {
error = dirfd_path_init(nd);
if (unlikely(error))
--
2.19.0
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Fri, Oct 5, 2018 at 5:07 PM Aleksa Sarai <[email protected]> wrote:
> On 2018-10-04, Jann Horn <[email protected]> wrote:
> > On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <[email protected]> wrote:
> > > On 2018-09-29, Jann Horn <[email protected]> wrote:
> > > > You attempt to open "C/../../etc/passwd" under the root "/A/B".
> > > > Something else concurrently moves /A/B/C to /A/C. This can result in
> > > > the following:
> > > >
> > > > 1. You start the path walk and reach /A/B/C.
> > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
> > > > 3. Your path walk follows the first ".." up into /A. This is outside
> > > > the process root, but you never actually encountered the process root,
> > > > so you don't notice.
> > > > 4. Your path walk follows the second ".." up to /. Again, this is
> > > > outside the process root, but you don't notice.
> > > > 5. Your path walk walks down to /etc/passwd, and the open completes
> > > > successfully. You now have an fd pointing outside your chroot.
> > >
> > > I've been playing with this and I have the following patch, which
> > > according to my testing protects against attacks where ".." skips over
> > > nd->root. It abuses __d_path to figure out if nd->path can be resolved
> > > from nd->root (obviously a proper version of this patch would refactor
> > > __d_path so it could be used like this -- and would not return
> > > -EMULTIHOP).
> > >
> > > I've also attached my reproducer. With it, I was seeing fairly constant
> > > breakouts before this patch and after it I didn't see a single breakout
> > > after running it overnight. Obviously this is not conclusive, but I'm
> > > hoping that it can show what my idea for protecting against ".." was.
> > >
> > > Does this patch make sense? Or is there something wrong with it that I'm
> > > not seeing?
> > >
> > > --8<-------------------------------------------------------------------
> > >
> > > There is a fairly easy-to-exploit race condition with chroot(2) (and
> > > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a
> > > path can be used to "skip over" nd->root and thus escape to the
> > > filesystem above nd->root.
> > >
> > > thread1 [attacker]:
> > > for (;;)
> > > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
> > > thread2 [victim]:
> > > for (;;)
> > > openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
> > >
> > > With fairly significant regularity, thread2 will resolve to
> > > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases
> > > will be detected during ".." resolution (which is the weak point of
> > > chroot(2) -- since walking *into* a subdirectory tautologically cannot
> > > result in you walking *outside* nd->root).
> > >
> > > The use of __d_path here might seem suspect, however we don't mind if a
> > > path is moved from within the chroot to outside the chroot and we
> > > incorrectly decide it is safe (because at that point we are still within
> > > the set of files which were accessible at the beginning of resolution).
> > > However, we can fail resolution on the next path component if it remains
> > > outside of the root. A path which has always been outside nd->root
> > > during resolution will never be resolveable from nd->root and thus will
> > > always be blocked.
> > >
> > > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case,
> > > purely as a debugging measure (so that you can see that
> > > the protection actually does something). Obviously in the
> > > proper patch this will return -EXDEV.
> > >
> > > Signed-off-by: Aleksa Sarai <[email protected]>
> > > ---
> > > fs/namei.c | 32 ++++++++++++++++++++++++++++++--
> > > 1 file changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 6f995e6de6b1..c8349693d47b 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -53,8 +53,8 @@
> > > * The new code replaces the old recursive symlink resolution with
> > > * an iterative one (in case of non-nested symlink chains). It does
> > > * this with calls to <fs>_follow_link().
> > > - * As a side effect, dir_namei(), _namei() and follow_link() are now
> > > - * replaced with a single function lookup_dentry() that can handle all
> > > + * As a side effect, dir_namei(), _namei() and follow_link() are now
> > > + * replaced with a single function lookup_dentry() that can handle all
> > > * the special cases of the former code.
> > > *
> > > * With the new dcache, the pathname is stored at each inode, at least as
> > > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > > return -EXDEV;
> > > break;
> > > }
> > > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
> > > + char *pathbuf, *pathptr;
> > > +
> > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > + if (!pathbuf)
> > > + return -ECHILD;
> > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> > > + kfree(pathbuf);
> > > + if (IS_ERR_OR_NULL(pathptr)) {
> > > + if (!pathptr)
> > > + pathptr = ERR_PTR(-EMULTIHOP);
> > > + return PTR_ERR(pathptr);
> > > + }
> > > + }
> >
> > One somewhat problematic thing about this approach is that if someone
> > tries to lookup
> > "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some
> > reason, you'll have quadratic runtime: For each "..", you'll have to
> > walk up to the root.
>
> What if we took rename_lock (call it nd->r_seq) at the start of the
> resolution, and then only tried the __d_path-style check
>
> if (read_seqretry(&rename_lock, nd->r_seq) ||
> read_seqretry(&mount_lock, nd->m_seq))
> /* do the __d_path lookup. */
>
> That way you would only hit the slow path if there were concurrent
> renames or mounts *and* you are doing a path resolution with
> AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does
> this (and after some testing it also appears to work).
Yeah, I think that might do the job.
> I'm not sure if there's a way to always avoid the quadratic lookup
> without (significantly and probably unreasonably) changing how dcache
> invalidation works. And obviously using this slow path if there was
> _any_ rename on the _entire_ system is suboptimal, but I think it is a
> significant improvement.
Yeah, I think this is much better.
> Another possibility is to expand on Andy's suggestion to use
> /proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a
> namespace as its dirfd (I'm not sure if there's a trivial way to detect
> this though). This wouldn't help with AT_BENEATH, but it should protect
> against ".." shenanigans without any ".." handling changes. (This is
> less ideal because it requires a container process, but it is another
> way of dealing with the issue.)
(For container usecases, but not for a web server that uses AT_BENEATH.)
> ---
> fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 6f995e6de6b1..12c9be175cb4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -493,7 +493,7 @@ struct nameidata {
> struct path root;
> struct inode *inode; /* path.dentry.d_inode */
> unsigned int flags;
> - unsigned seq, m_seq;
> + unsigned seq, m_seq, r_seq;
> int last_type;
> unsigned depth;
> int total_link_count;
> @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> return -EXDEV;
> break;
> }
> + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
> + (read_seqretry(&rename_lock, nd->r_seq) ||
> + read_seqretry(&mount_lock, nd->m_seq)))) {
> + char *pathbuf, *pathptr;
> +
> + nd->r_seq = read_seqbegin(&rename_lock);
> + /* Cannot take m_seq here. */
> +
> + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> + if (!pathbuf)
> + return -ECHILD;
> + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> + kfree(pathbuf);
You're doing this check before actually looking up the parent, right?
So as long as I don't trigger the "path_equal(&nd->path, &nd->root)"
check that you do for O_BENEATH, escaping up by one level is possible,
right? You should probably move this check so that it happens after
following "..".
(Also: I assume that you're going to get rid of that memory allocation
in a future version.)
> + if (IS_ERR_OR_NULL(pathptr)) {
> + int error = PTR_ERR_OR_ZERO(pathptr);
> +
> + if (!error)
> + error = nd_jump_root(nd);
> + return error;
> + }
> + }
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> struct dentry *old = nd->path.dentry;
> struct dentry *parent = old->d_parent;
> @@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd)
> return -EXDEV;
> break;
> }
> + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
> + (read_seqretry(&rename_lock, nd->r_seq) ||
> + read_seqretry(&mount_lock, nd->m_seq)))) {
> + char *pathbuf, *pathptr;
> +
> + nd->r_seq = read_seqbegin(&rename_lock);
> + /* Cannot take m_seq here. */
> +
> + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
> + if (!pathbuf)
> + return -ENOMEM;
> + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> + kfree(pathbuf);
> + if (IS_ERR_OR_NULL(pathptr)) {
> + int error = PTR_ERR_OR_ZERO(pathptr);
> +
> + if (!error)
> + error = nd_jump_root(nd);
> + return error;
> + }
> + }
Same problem as in the RCU case above.
> if (nd->path.dentry != nd->path.mnt->mnt_root) {
> int ret = path_parent_directory(&nd->path);
> if (ret)
> @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> nd->last_type = LAST_ROOT; /* if there are only slashes... */
> nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> nd->depth = 0;
> + nd->m_seq = read_seqbegin(&mount_lock);
> + nd->r_seq = read_seqbegin(&rename_lock);
This means that now, attempting to perform a lookup while something is
holding the rename_lock will spin on the lock. I don't know whether
that's a problem in practice though. Does anyone on this thread know
whether this is problematic?
On 2018-10-05, Jann Horn <[email protected]> wrote:
> > What if we took rename_lock (call it nd->r_seq) at the start of the
> > resolution, and then only tried the __d_path-style check
> >
> > if (read_seqretry(&rename_lock, nd->r_seq) ||
> > read_seqretry(&mount_lock, nd->m_seq))
> > /* do the __d_path lookup. */
> >
> > That way you would only hit the slow path if there were concurrent
> > renames or mounts *and* you are doing a path resolution with
> > AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does
> > this (and after some testing it also appears to work).
>
> Yeah, I think that might do the job.
*phew* I was all out of other ideas. :P
> > ---
> > fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 6f995e6de6b1..12c9be175cb4 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -493,7 +493,7 @@ struct nameidata {
> > struct path root;
> > struct inode *inode; /* path.dentry.d_inode */
> > unsigned int flags;
> > - unsigned seq, m_seq;
> > + unsigned seq, m_seq, r_seq;
> > int last_type;
> > unsigned depth;
> > int total_link_count;
> > @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > return -EXDEV;
> > break;
> > }
> > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
> > + (read_seqretry(&rename_lock, nd->r_seq) ||
> > + read_seqretry(&mount_lock, nd->m_seq)))) {
> > + char *pathbuf, *pathptr;
> > +
> > + nd->r_seq = read_seqbegin(&rename_lock);
> > + /* Cannot take m_seq here. */
> > +
> > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > + if (!pathbuf)
> > + return -ECHILD;
> > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> > + kfree(pathbuf);
>
> You're doing this check before actually looking up the parent, right?
> So as long as I don't trigger the "path_equal(&nd->path, &nd->root)"
> check that you do for O_BENEATH, escaping up by one level is possible,
> right? You should probably move this check so that it happens after
> following "..".
Yup, you're right. I'll do that.
> (Also: I assume that you're going to get rid of that memory allocation
> in a future version.)
Sure. Would you prefer adding some scratch space in nameidata, or that I
change __d_path so it accepts NULL as the buffer (and thus it doesn't
actually do any string operations)?
> > if (nd->path.dentry != nd->path.mnt->mnt_root) {
> > int ret = path_parent_directory(&nd->path);
> > if (ret)
> > @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > nd->last_type = LAST_ROOT; /* if there are only slashes... */
> > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> > nd->depth = 0;
> > + nd->m_seq = read_seqbegin(&mount_lock);
> > + nd->r_seq = read_seqbegin(&rename_lock);
>
> This means that now, attempting to perform a lookup while something is
> holding the rename_lock will spin on the lock. I don't know whether
> that's a problem in practice though. Does anyone on this thread know
> whether this is problematic?
I could make it so that we only take &rename_lock
if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)),
since it's not used outside of that path.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Aleksa Sarai:
> On 2018-10-01, Andy Lutomirski <[email protected]> wrote:
>> >>> Currently most container runtimes try to do this resolution in
>> >>> userspace[1], causing many potential race conditions. In addition, the
>> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
>> >>> requires a fork+exec which is *very* costly if necessary for every
>> >>> filesystem operation involving a container.
>> >>
>> >> Wait. fork() I understand, but why exec? And actually, you don't need
>> >> a full fork() either, clone() lets you do this with some process parts
>> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
>> >> the file descriptor table shared. And why chroot()/pivot_root(),
>> >> wouldn't you want to use setns()?
>> >
>> > You're right about this -- for C runtimes. In Go we cannot do a raw
>> > clone() or fork() (if you do it manually with RawSyscall you'll end with
>> > broken runtime state). So you're forced to do fork+exec (which then
>> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
>> > for CLONE_VFORK.
>>
>> I must admit that I’m not very sympathetic to the argument that “Go’s
>> runtime model is incompatible with the simpler solution.”
>
> Multi-threaded programs have a similar issue (though with Go it's much
> worse). If you fork a multi-threaded C program then you can only safely
> use AS-Safe glibc functions (those that are safe within a signal
> handler). But if you're just doing three syscalls this shouldn't be as
> big of a problem as Go where you can't even do said syscalls.
The situation is a bit more complicated. There are many programs out
there which use malloc and free (at least indirectly) after a fork,
and we cannot break them. In glibc, we have a couple of subsystems
which are put into a known state before calling the fork/clone system
call if the application calls fork. The price we pay for that is a
fork which is not POSIX-compliant because it is not async-signal-safe.
Admittedly, other libcs chose different trade-offs.
However, what is the same across libcs is this: You cannot call the
clone system call directly and get a fully working new process. Some
things break. For example, for recursive mutexes, we need to know the
TID of the current thread, and we cannot perform a system call to get
it for performance reasons. So everyone has a TID cache for that.
But the TID cache does not get reset when you bypass the fork
implementation in libc, so you end up with subtle corruption bugs on
TID reuse.
So I'd say that in most cases, the C situation is pretty much the same
as the Go situation. If I recall correctly, the problem for Go is
that it cannot call setns from Go code because it fails in the kernel
for multi-threaded processes, and Go processes are already
multi-threaded when user Go code runs.
On Sat, Oct 6, 2018 at 10:56 PM Florian Weimer <[email protected]> wrote:
>
> * Aleksa Sarai:
>
> > On 2018-10-01, Andy Lutomirski <[email protected]> wrote:
> >> >>> Currently most container runtimes try to do this resolution in
> >> >>> userspace[1], causing many potential race conditions. In addition, the
> >> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> >> >>> requires a fork+exec which is *very* costly if necessary for every
> >> >>> filesystem operation involving a container.
> >> >>
> >> >> Wait. fork() I understand, but why exec? And actually, you don't need
> >> >> a full fork() either, clone() lets you do this with some process parts
> >> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> >> >> the file descriptor table shared. And why chroot()/pivot_root(),
> >> >> wouldn't you want to use setns()?
> >> >
> >> > You're right about this -- for C runtimes. In Go we cannot do a raw
> >> > clone() or fork() (if you do it manually with RawSyscall you'll end with
> >> > broken runtime state). So you're forced to do fork+exec (which then
> >> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> >> > for CLONE_VFORK.
> >>
> >> I must admit that I’m not very sympathetic to the argument that “Go’s
> >> runtime model is incompatible with the simpler solution.”
> >
> > Multi-threaded programs have a similar issue (though with Go it's much
> > worse). If you fork a multi-threaded C program then you can only safely
> > use AS-Safe glibc functions (those that are safe within a signal
> > handler). But if you're just doing three syscalls this shouldn't be as
> > big of a problem as Go where you can't even do said syscalls.
>
> The situation is a bit more complicated. There are many programs out
> there which use malloc and free (at least indirectly) after a fork,
> and we cannot break them. In glibc, we have a couple of subsystems
> which are put into a known state before calling the fork/clone system
> call if the application calls fork. The price we pay for that is a
> fork which is not POSIX-compliant because it is not async-signal-safe.
> Admittedly, other libcs chose different trade-offs.
>
> However, what is the same across libcs is this: You cannot call the
> clone system call directly and get a fully working new process. Some
> things break. For example, for recursive mutexes, we need to know the
> TID of the current thread, and we cannot perform a system call to get
> it for performance reasons. So everyone has a TID cache for that.
> But the TID cache does not get reset when you bypass the fork
> implementation in libc, so you end up with subtle corruption bugs on
> TID reuse.
Sure, but recursive mutexes etc. are very specific use-case.
I'd even go so far to say that if you use mutexes + threads and then
also fork in those threads you're hosed anyway. If you don't things get a little
cleaner assuming you don't call library functions that use mutexes
internally.
Event then you might (sometimes at least) still get around most problems
with atfork handlers (thought I really don't like him). But you know more
about this then I do. :)
>
> So I'd say that in most cases, the C situation is pretty much the same
> as the Go situation. If I recall correctly, the problem for Go is
> that it cannot call setns from Go code because it fails in the kernel
> for multi-threaded processes, and Go processes are already
> multi-threaded when user Go code runs.
That is true for *some* namespaces (user, mount) but not for all.
For example, setns(CLONE_NEWNET) would be fine from go.
But the go runtime thinks it's clever to clone a new thread in between
entry and exit of a syscall. If you switch namespaces you might end
up with a new thread that belongs to the wrong namespace which is
very problematic.
So you can either rely on calling some go magic that locks
you to a specific os thread but that does only work in later go versions or
you go the constructor route, i.e. you e.g. implement a (dummy)
subcommand that you can call and that triggers the execution of a
C function that is marked with __attribute__((constructor)) that runs
before the go runtime and in which you can do setns(), fork() and
friends (somewhat) safely. This has very
bad performance and is a nasty hack but it's really unavoidable.
On Sat, Oct 6, 2018 at 4:10 AM Aleksa Sarai <[email protected]> wrote:
> On 2018-10-05, Jann Horn <[email protected]> wrote:
> > > What if we took rename_lock (call it nd->r_seq) at the start of the
> > > resolution, and then only tried the __d_path-style check
> > >
> > > if (read_seqretry(&rename_lock, nd->r_seq) ||
> > > read_seqretry(&mount_lock, nd->m_seq))
> > > /* do the __d_path lookup. */
> > >
> > > That way you would only hit the slow path if there were concurrent
> > > renames or mounts *and* you are doing a path resolution with
> > > AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does
> > > this (and after some testing it also appears to work).
> >
> > Yeah, I think that might do the job.
>
> *phew* I was all out of other ideas. :P
>
> > > ---
> > > fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> > > 1 file changed, 46 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 6f995e6de6b1..12c9be175cb4 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -493,7 +493,7 @@ struct nameidata {
> > > struct path root;
> > > struct inode *inode; /* path.dentry.d_inode */
> > > unsigned int flags;
> > > - unsigned seq, m_seq;
> > > + unsigned seq, m_seq, r_seq;
> > > int last_type;
> > > unsigned depth;
> > > int total_link_count;
> > > @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > > return -EXDEV;
> > > break;
> > > }
> > > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
> > > + (read_seqretry(&rename_lock, nd->r_seq) ||
> > > + read_seqretry(&mount_lock, nd->m_seq)))) {
> > > + char *pathbuf, *pathptr;
> > > +
> > > + nd->r_seq = read_seqbegin(&rename_lock);
> > > + /* Cannot take m_seq here. */
> > > +
> > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
> > > + if (!pathbuf)
> > > + return -ECHILD;
> > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
> > > + kfree(pathbuf);
> >
> > You're doing this check before actually looking up the parent, right?
> > So as long as I don't trigger the "path_equal(&nd->path, &nd->root)"
> > check that you do for O_BENEATH, escaping up by one level is possible,
> > right? You should probably move this check so that it happens after
> > following "..".
>
> Yup, you're right. I'll do that.
>
> > (Also: I assume that you're going to get rid of that memory allocation
> > in a future version.)
>
> Sure. Would you prefer adding some scratch space in nameidata, or that I
> change __d_path so it accepts NULL as the buffer (and thus it doesn't
> actually do any string operations)?
Well, I think accepting a NULL buffer would be much cleaner; but keep
in mind that I'm just someone making suggestions, Al Viro is the one
who has to like your code. :P
> > > if (nd->path.dentry != nd->path.mnt->mnt_root) {
> > > int ret = path_parent_directory(&nd->path);
> > > if (ret)
> > > @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > nd->last_type = LAST_ROOT; /* if there are only slashes... */
> > > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> > > nd->depth = 0;
> > > + nd->m_seq = read_seqbegin(&mount_lock);
> > > + nd->r_seq = read_seqbegin(&rename_lock);
> >
> > This means that now, attempting to perform a lookup while something is
> > holding the rename_lock will spin on the lock. I don't know whether
> > that's a problem in practice though. Does anyone on this thread know
> > whether this is problematic?
>
> I could make it so that we only take &rename_lock
> if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)),
> since it's not used outside of that path.
I think that might be a sensible change; but as I said, I don't
actually know whether it's necessary, and it would be very helpful if
someone who actually knows commented on this.