Patch changelog:
v6:
* Drop O_* flags API to the new LOOKUP_ path scoping bits and
instead introduce resolveat(2) as an alternative method of
obtaining an O_PATH. The justification for this is included in
patch 6 (though switching back to O_* flags is trivial).
v5:
* In response to CVE-2019-5736 (one of the vectors showed that
open(2)+fexec(3) cannot be used to scope binfmt_script's implicit
open_exec()), AT_* flags have been re-added and are now piped
through to binfmt_script (and other binfmt_* that use open_exec)
but are only supported for execveat(2) for now.
v4:
* Remove AT_* flag reservations, as they require more discussion.
* Switch to path_is_under() over __d_path() for breakout checking.
* Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd
is now ignored for absolute paths to match other flags.
* Improve the dirfd_path_init() refactor and move it to a separate
commit.
* Remove reference to Linux-capsicum.
* Switch "proclink" name to "magic link".
v3: [resend]
v2:
* Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with
some semi-aggressive __d_path checking (see patch 3).
* Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the
hopes they can be re-enabled once safe.
* Removed the selftests as they will be reimplemented as xfstests.
* Removed stat(2) support, since you can already get it through
O_PATH and fstatat(2).
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall
resolveat(2) which provides an alternative way to get an O_PATH file
descriptor (the reasoning for doing this is included in patch 6). The
following new LOOKUP_ (and corresponding uapi) flags are added:
* LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or
through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do resolveat(NOFOLLOW|NO_MAGICLINKS) on a "magic link" and
it will *not* fail (assuming that no parent component was a "magic
link"), and you will have an fd for the "magic link".
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using ".." (see patch 4 for more detail).
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks as in
patch 4, but that requires more discussion.
In addition, two new flags were added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including "magic links". Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
"magic links" with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having O_THISROOT (such as
CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In addition, a mirror set of AT_* flags have been added (though
currently these are only supported for execveat(2) -- and not for any
other syscall). The need for these is explained in patch 5 (it's
motivated by CVE-2019-5736).
Cc: Al Viro <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Drysdale <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
Aleksa Sarai (6):
namei: split out nd->dfd handling to dirfd_path_init
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
binfmt_*: scope path resolution of interpreters
namei: resolveat(2) syscall
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/binfmt_elf.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
fs/binfmt_em86.c | 4 +-
fs/binfmt_misc.c | 2 +-
fs/binfmt_script.c | 2 +-
fs/exec.c | 26 +-
fs/namei.c | 251 +++++++++++++++-----
include/linux/binfmts.h | 1 +
include/linux/fs.h | 9 +-
include/linux/namei.h | 8 +
include/uapi/linux/fcntl.h | 18 ++
27 files changed, 270 insertions(+), 71 deletions(-)
--
2.21.0
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2) (along with the required execveat(2) change
included in this series). However, there are a few reasons to not do
this:
* The new LOOKUP_* flags are intended to be security features, and
openat(2) will silently ignore all unknown flags. This means that
users would need to avoid foot-gunning themselves constantly when
using this interface if it were part of openat(2).
* Resolution scoping feels like a different operation to the existing
O_* flags. And since openat(2) has limited flag space, it seems to be
quite wasteful to clutter it with 5 flags that are all
resolution-related. Arguably O_NOFOLLOw is also a resolution flag but
its entire purpose is to error out if you encounter a trailing
symlink not to scope resolution.
* Other systems would be able to reimplement this syscall allowing for
cross-OS standardisation rather than being hidden amongst O_* flags
which may result in it not being used by all the parties that might
want to use it (file servers, web servers, container runtimes, etc).
* It gives us the opportunity to iterate on the O_PATH interface in the
future. There are some potential security improvements that can be
made to O_PATH (handling /proc/self/fd re-opening of file descriptors
much more sanely) which could be made even better with some other
bits (such as ACC_MODE bits which work for O_PATH).
To this end, we introduce the resolveat(2) syscall. At the moment it's
effectively another way of getting a bog-standard O_PATH descriptor but
with the ability to use the new LOOKUP_* flags.
Because resolveat(2) only provides the ability to get O_PATH
descriptors, users will need to get creative with /proc/self/fd in order
to get a usable file descriptor for other uses. However, in future we
can add O_EMPTYPATH support to openat(2) which would allow for
re-opening without procfs (though as mentioned above there are some
security improvements that should be made to the interfaces).
NOTE: This patch adds the syscall to all architectures using the new
unified syscall numbering, but several architectures are missing
newer (nr > 423) syscalls -- hence the uneven gaps in the syscall
tables.
Cc: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 46 +++++++++++++++++++++
include/uapi/linux/fcntl.h | 12 ++++++
18 files changed, 74 insertions(+)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 63ed39cbd3bd..72f431b1dc9c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -461,5 +461,6 @@
530 common getegid sys_getegid
531 common geteuid sys_geteuid
532 common getppid sys_getppid
+533 common resolveat sys_resolveat
# all other architectures have common numbers for new syscall, alpha
# is the exception.
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 9016f4081bb9..1bc0282a67f7 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -437,3 +437,4 @@
421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait
422 common futex_time64 sys_futex
423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 common resolveat sys_resolveat
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index ab9cda5f6136..d3ae73ffaf48 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -344,3 +344,4 @@
332 common pkey_free sys_pkey_free
333 common rseq sys_rseq
# 334 through 423 are reserved to sync up with other architectures
+428 common resolveat sys_resolveat
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 125c14178979..81b7389e9e58 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -423,3 +423,4 @@
421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait
422 common futex_time64 sys_futex
423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 common resolveat sys_resolveat
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 8ee3a8c18498..626aed10e2b5 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait
422 common futex_time64 sys_futex
423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 common resolveat sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 15f4117900ee..8cbd6032d6bf 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -362,3 +362,4 @@
421 n32 rt_sigtimedwait_time64 compat_sys_rt_sigtimedwait_time64
422 n32 futex_time64 sys_futex
423 n32 sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 n32 resolveat sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c85502e67b44..234923a1fc88 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -338,3 +338,4 @@
327 n64 rseq sys_rseq
328 n64 io_pgetevents sys_io_pgetevents
# 329 through 423 are reserved to sync up with other architectures
+428 n64 resolveat sys_resolveat
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 2e063d0f837e..7b4586acf35d 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -411,3 +411,4 @@
421 o32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64
422 o32 futex_time64 sys_futex sys_futex
423 o32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval
+428 o32 resolveat sys_resolveat sys_resolveat
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index b26766c6647d..19a9a92dc5f8 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -420,3 +420,4 @@
421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64
422 32 futex_time64 sys_futex sys_futex
423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval
+428 common resolveat sys_resolveat sys_resolveat
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index b18abb0c3dae..bfcd75b928de 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -505,3 +505,4 @@
421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64
422 32 futex_time64 sys_futex sys_futex
423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval
+428 common resolveat sys_resolveat sys_resolveat
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 02579f95f391..084e51f02e65 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
421 32 rt_sigtimedwait_time64 - compat_sys_rt_sigtimedwait_time64
422 32 futex_time64 - sys_futex
423 32 sched_rr_get_interval_time64 - sys_sched_rr_get_interval
+428 common resolveat sys_resolveat -
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bfda678576e4..e9115c5cec72 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait
422 common futex_time64 sys_futex
423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 common resolveat sys_resolveat
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index b9a5a04b2d2c..2d3fdd913d89 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
421 32 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64
422 32 futex_time64 sys_futex sys_futex
423 32 sched_rr_get_interval_time64 sys_sched_rr_get_interval sys_sched_rr_get_interval
+428 common resolveat sys_resolveat sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..101feef22473 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 resolveat sys_resolveat __ia32_sys_resolveat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c7c197bf07bc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common resolveat __x64_sys_resolveat
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 6af49929de85..86c44f15dfa4 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -394,3 +394,4 @@
421 common rt_sigtimedwait_time64 sys_rt_sigtimedwait
422 common futex_time64 sys_futex
423 common sched_rr_get_interval_time64 sys_sched_rr_get_interval
+428 common resolveat sys_resolveat
diff --git a/fs/namei.c b/fs/namei.c
index 2b6a1bf4e745..30db5833aac3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3656,6 +3656,52 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
return filp;
}
+SYSCALL_DEFINE3(resolveat, int, dfd, const char __user *, path,
+ unsigned long, flags)
+{
+ int fd;
+ struct filename *tmp;
+ struct open_flags op = {
+ .open_flag = O_PATH,
+ };
+
+ if (flags & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;
+
+ if (flags & RESOLVE_CLOEXEC)
+ op.open_flag |= O_CLOEXEC;
+ if (!(flags & RESOLVE_NOFOLLOW))
+ op.lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & RESOLVE_BENEATH)
+ op.lookup_flags |= LOOKUP_BENEATH;
+ if (flags & RESOLVE_XDEV)
+ op.lookup_flags |= LOOKUP_XDEV;
+ if (flags & RESOLVE_NO_MAGICLINKS)
+ op.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (flags & RESOLVE_NO_SYMLINKS)
+ op.lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (flags & RESOLVE_THIS_ROOT)
+ op.lookup_flags |= LOOKUP_IN_ROOT;
+
+ tmp = getname(path);
+ if (IS_ERR(tmp))
+ return PTR_ERR(tmp);
+
+ fd = get_unused_fd_flags(op.open_flag);
+ if (fd >= 0) {
+ struct file *f = do_filp_open(dfd, tmp, &op);
+ if (IS_ERR(f)) {
+ put_unused_fd(fd);
+ fd = PTR_ERR(f);
+ } else {
+ fsnotify_open(f);
+ fd_install(fd, f);
+ }
+ }
+ putname(tmp);
+ return fd;
+}
+
struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
const char *name, const struct open_flags *op)
{
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index bfca7f87cd2a..d5a2a97929c6 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -100,4 +100,16 @@
#define AT_NO_SYMLINKS 0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
+/* First two bits of RESOLVE_* are reserved for future ACC_MODE extensions. */
+#define RESOLVE_CLOEXEC 0x004 /* Set O_CLOEXEC on the returned fd. */
+#define RESOLVE_NOFOLLOW 0x008 /* Don't follow trailing symlinks. */
+#define RESOLVE_RESOLUTION_TYPE 0x1F0 /* Type of path-resolution scoping we are applying. */
+#define RESOLVE_BENEATH 0x010 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define RESOLVE_XDEV 0x020 /* - Block mount-point crossings (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x040 /* - Block procfs-style "magic" symlinks. */
+#define RESOLVE_NO_SYMLINKS 0x080 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define RESOLVE_THIS_ROOT 0x100 /* - Scope ".." resolution to dirfd (like chroot(2)). */
+
+#define VALID_RESOLVE_FLAGS (RESOLVE_CLOEXEC | RESOLVE_NOFOLLOW | RESOLVE_RESOLUTION_TYPE)
+
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.21.0
Add the following flags to allow various restrictions on path
resolution (these affect the *entire* resolution, rather than just the
final path component -- as is the case with most other AT_* flags).
The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web
servers, archive extraction tools, network file servers, and so on.
These flags are exposed to userspace in a later patchset.
* LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or
*up* from one). The primary "scoping" use is to blocking resolution
that crosses a bind-mount, which has a similar property to a symlink
(in the way that it allows for escape from the starting-point). Since
it is not possible to differentiate bind-mounts However since
bind-mounting requires privileges (in ways symlinks don't) this has
been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as
well as -EXDEV (though find(1) doesn't walk upwards, the semantics
seem obvious).
* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is
a very specific restriction, and it exists because /proc/$pid/fd/...
"symlinks" allow for access outside nd->root and pose risk to
container runtimes that don't want to be tricked into accessing a host
path (but do want to allow no-funny-business symlink resolution).
* LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies
LOOKUP_NO_MAGICLINKS (obviously).
* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
filesystem tree during resolution (you must stay "beneath" the
starting point at all times). Currently this is done by disallowing
".." and absolute paths (either in the given path or found during
symlink resolution) entirely, as well as all "magic link" jumping.
The wholesale banning of ".." is because it is currently not safe to
allow ".." resolution (races can cause the path to be moved outside of
the root -- this is conceptually similar to historical chroot(2)
escape attacks). Future patches in this series will address this, and
will re-enable ".." resolution once it is safe. With those patches,
".." resolution will only be allowed if it remains in the root
throughout resolution (such as "a/../b" not "a/../../outside/b").
The banning of "magic link" jumping is done because it is not clear
whether semantically they should be allowed -- while some "magic
links" are safe there are many that can cause escapes (and once a
resolution is outside of the root, O_BENEATH will no longer detect
it). Future patches may re-enable "magic link" jumping when such jumps
would remain inside the root.
The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.
This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.
[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/
Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 76 ++++++++++++++++++++++++++++++++++++-------
include/linux/namei.h | 7 ++++
2 files changed, 72 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 2a91b72aa5e9..e13a02720a9d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -843,6 +843,12 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
+ if (unlikely(nd->flags & LOOKUP_XDEV)) {
+ if (nd->path.mnt != nd->root.mnt)
+ return -EXDEV;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1051,6 +1057,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
@@ -1081,14 +1090,23 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
+ /* If we just jumped it was because of a procfs-style link. */
+ if (unlikely(nd->flags & LOOKUP_JUMPED)) {
+ if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+ return ERR_PTR(-ELOOP);
+ /* Not currently safe. */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return ERR_PTR(-EXDEV);
+ }
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
if (!nd->root.mnt)
set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1269,12 +1287,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}
- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1331,6 +1353,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1351,8 +1375,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;
while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1377,6 +1404,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1391,6 +1420,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
nd->path.mnt = &mounted->mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1479,8 +1510,11 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1489,6 +1523,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
@@ -1703,6 +1739,13 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ /*
+ * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+ * cause our parent to have moved outside of the root and us to skip
+ * over it.
+ */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
if (nd->flags & LOOKUP_RCU) {
@@ -2251,6 +2294,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
+
+ if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ nd->root = nd->path;
+ if (!(nd->flags & LOOKUP_RCU))
+ path_get(&nd->root);
+ }
if (*s == '/') {
if (likely(!nd->root.mnt))
set_root(nd);
@@ -2259,9 +2311,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
s = ERR_PTR(error);
return s;
}
- error = dirfd_path_init(nd);
- if (unlikely(error))
- return ERR_PTR(error);
+ if (likely(!nd->path.mnt)) {
+ error = dirfd_path_init(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
return s;
}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..7bc819ad0cd3 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -50,6 +50,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_EMPTY 0x4000
#define LOOKUP_DOWN 0x8000
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */
+#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
+ Implies LOOKUP_NO_MAGICLINKS. */
+
extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
--
2.21.0
The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing O_XDEV
and O_NOMAGICLINKS[**] help defend against other potential attacks in a
malicious rootfs scenario.
Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.
[*] At the moment, ".." and "magic link" jumping are disallowed for the
same reason it is disabled for LOOKUP_BENEATH -- currently it is not
safe to allow it. Future patches may enable it unconditionally once
we have resolved the possible races (for "..") and semantics (for
"magic link" jumping).
The most significant openat(2) semantic change with LOOKUP_THISROOT is
that absolute pathnames no longer cause dirfd to be ignored completely.
The rationale is that LOOKUP_THISROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).
[1]: https://github.com/cyphar/filepath-securejoin
Cc: Eric Biederman <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 6 +++---
include/linux/namei.h | 1 +
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e13a02720a9d..3a3cba593b85 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1095,7 +1095,7 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
@@ -1744,7 +1744,7 @@ static inline int handle_dots(struct nameidata *nd, int type)
* cause our parent to have moved outside of the root and us to skip
* over it.
*/
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
@@ -2295,7 +2295,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
- if (unlikely(nd->flags & LOOKUP_BENEATH)) {
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 7bc819ad0cd3..4b1ee717cb14 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -56,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
--
2.21.0
This patch allows for O_BENEATH and O_THISROOT to safely permit ".."
resolution (in the case of O_BENEATH the resolution will still fail if
".." resolution would resolve a path outside of the root -- while
O_THISROOT will chroot(2)-style scope it). "magic link" jumps are still
disallowed entirely because now they could result in inconsistent
behaviour if resolution encounters a subsequent "..".
The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
O_THISROOT and O_BENEATH) where a rename(2) of a path can be used to
"skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or "magic link"). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of O_THISROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).
In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
have occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.
The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential. If
path_is_under() occurs before the rename, then the path will be resolved
but since the path was originally inside the root there is no escape.
Subsequent ".." jumps are guaranteed to check path_is_under() (by
construction, &rename_lock or &mount_lock must have been taken by the
attacker after path_is_under() returned in the victim), and thus will
not be able to escape from the previously-inside-root path. Walking down
is still safe since the entire subtree was moved (either by rename(2) or
MS_MOVE) and because (as discussed above) walking down is safe.
Cc: Al Viro <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 48 +++++++++++++++++++++++++++++++++---------------
1 file changed, 33 insertions(+), 15 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3a3cba593b85..2b6a1bf4e745 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1739,19 +1739,35 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
- /*
- * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
- * cause our parent to have moved outside of the root and us to skip
- * over it.
- */
- if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
- return -EXDEV;
+ int error = 0;
+
if (!nd->root.mnt)
set_root(nd);
- if (nd->flags & LOOKUP_RCU) {
- return follow_dotdot_rcu(nd);
- } else
- return follow_dotdot(nd);
+ if (nd->flags & LOOKUP_RCU)
+ error = follow_dotdot_rcu(nd);
+ else
+ error = follow_dotdot(nd);
+ if (error)
+ return error;
+
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+ bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+ bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+ /*
+ * Don't bother checking unless there's a racing
+ * rename(2) or MS_MOVE.
+ */
+ if (likely(!m_retry && !r_retry))
+ return 0;
+
+ if (m_retry && !(nd->flags & LOOKUP_RCU))
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (r_retry)
+ nd->r_seq = read_seqbegin(&rename_lock);
+ if (!path_is_under(&nd->path, &nd->root))
+ return -EXDEV;
+ }
}
return 0;
}
@@ -2272,6 +2288,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
+ nd->r_seq = read_seqbegin(&rename_lock);
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2282,7 +2303,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (flags & LOOKUP_RCU) {
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
nd->root_seq = nd->seq;
- nd->m_seq = read_seqbegin(&mount_lock);
} else {
path_get(&nd->path);
}
@@ -2293,8 +2313,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock);
-
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
error = dirfd_path_init(nd);
if (unlikely(error))
--
2.21.0
The need to be able to scope path resolution of interpreters became
clear with one of the possible vectors used in CVE-2019-5736 (which
most major container runtimes were vulnerable to).
Naively, it might seem that openat(2) -- which supports path scoping --
can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
binary being executed. Unfortunately, a "bad binary" (usually a symlink)
could be written as a #!-style script with the symlink target as the
interpreter -- which would be completely missed by just scoping the
openat(2). An example of this being exploitable is CVE-2019-5736.
In order to get around this, we need to pass down to each binfmt_*
implementation the scoping flags requested in execveat(2). In order to
maintain backwards-compatibility we only pass the scoping AT_* flags.
To avoid breaking userspace (in the exceptionally rare cases where you
have #!-scripts with a relative path being execveat(2)-ed with dfd !=
AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
set in execveat(2).
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/binfmt_elf.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
fs/binfmt_em86.c | 4 ++--
fs/binfmt_misc.c | 2 +-
fs/binfmt_script.c | 2 +-
fs/exec.c | 26 ++++++++++++++++++++++----
include/linux/binfmts.h | 1 +
include/linux/fs.h | 9 +++++++--
include/uapi/linux/fcntl.h | 6 ++++++
9 files changed, 42 insertions(+), 12 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7d09d125f148..473420eed477 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -772,7 +772,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
goto out_free_interp;
- interpreter = open_exec(elf_interpreter);
+ interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags);
retval = PTR_ERR(interpreter);
if (IS_ERR(interpreter))
goto out_free_interp;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b53bb3729ac1..c463c6428f77 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -263,7 +263,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
kdebug("Using ELF interpreter %s", interpreter_name);
/* replace the program with the interpreter */
- interpreter = open_exec(interpreter_name);
+ interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags);
retval = PTR_ERR(interpreter);
if (IS_ERR(interpreter)) {
interpreter = NULL;
diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index dd2d3f0cd55d..3ee46b0dc0d4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -81,10 +81,10 @@ static int load_em86(struct linux_binprm *bprm)
/*
* OK, now restart the process with the interpreter's inode.
- * Note that we use open_exec() as the name is now in kernel
+ * Note that we use openat_exec() as the name is now in kernel
* space, and we don't need to copy it.
*/
- file = open_exec(interp);
+ file = openat_exec(binprm->dfd, interp, binprm->flags);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..573ef06ff5a1 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -209,7 +209,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
if (!IS_ERR(interp_file))
deny_write_access(interp_file);
} else {
- interp_file = open_exec(fmt->interpreter);
+ interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags);
}
retval = PTR_ERR(interp_file);
if (IS_ERR(interp_file))
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index e996174cbfc0..aaff12d12bfd 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -137,7 +137,7 @@ static int load_script(struct linux_binprm *bprm)
/*
* OK, now restart the process with the interpreter's dentry.
*/
- file = open_exec(i_name);
+ file = openat_exec(bprm->dfd, i_name, bprm->flags);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..d0f244d9a541 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -846,12 +846,24 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
.lookup_flags = LOOKUP_FOLLOW,
};
- if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+ if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH |
+ AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+ AT_THIS_ROOT)) != 0)
return ERR_PTR(-EINVAL);
if (flags & AT_SYMLINK_NOFOLLOW)
open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
if (flags & AT_EMPTY_PATH)
open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+ if (flags & AT_BENEATH)
+ open_exec_flags.lookup_flags |= LOOKUP_BENEATH;
+ if (flags & AT_XDEV)
+ open_exec_flags.lookup_flags |= LOOKUP_XDEV;
+ if (flags & AT_NO_MAGICLINKS)
+ open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (flags & AT_NO_SYMLINKS)
+ open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (flags & AT_THIS_ROOT)
+ open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT;
file = do_filp_open(fd, name, &open_exec_flags);
if (IS_ERR(file))
@@ -879,18 +891,18 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
return ERR_PTR(err);
}
-struct file *open_exec(const char *name)
+struct file *openat_exec(int dfd, const char *name, int flags)
{
struct filename *filename = getname_kernel(name);
struct file *f = ERR_CAST(filename);
if (!IS_ERR(filename)) {
- f = do_open_execat(AT_FDCWD, filename, 0);
+ f = do_open_execat(dfd, filename, flags);
putname(filename);
}
return f;
}
-EXPORT_SYMBOL(open_exec);
+EXPORT_SYMBOL(openat_exec);
int kernel_read_file(struct file *file, void **buf, loff_t *size,
loff_t max_size, enum kernel_read_file_id id)
@@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
sched_exec();
+ bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+ AT_THIS_ROOT);
+ bprm->dfd = AT_FDCWD;
+ if (bprm->flags)
+ bprm->dfd = fd;
+
bprm->file = file;
if (!filename) {
bprm->filename = "none";
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 688ab0de7810..e4da2d36e97f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -50,6 +50,7 @@ struct linux_binprm {
unsigned int taso:1;
#endif
unsigned int recursion_depth; /* only for search_binary_handler() */
+ int dfd, flags; /* passed down to execat_open() */
struct file * file;
struct cred *cred; /* new credentials */
int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fe94c48481a6..cdcd2e021101 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2945,8 +2945,13 @@ extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t,
extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
-extern struct file * open_exec(const char *);
-
+
+extern struct file *openat_exec(int, const char *, int);
+static inline struct file *open_exec(const char *name)
+{
+ return openat_exec(AT_FDCWD, name, 0);
+}
+
/* fs/dcache.c -- generic fs support functions */
extern bool is_subdir(struct dentry *, struct dentry *);
extern bool path_is_under(const struct path *, const struct path *);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..bfca7f87cd2a 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,11 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+#define AT_RESOLUTION_TYPE 0x1F0000 /* Type of path-resolution scoping we are applying. */
+#define AT_BENEATH 0x010000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define AT_XDEV 0x020000 /* - Block mount-point crossings (includes bind-mounts). */
+#define AT_NO_MAGICLINKS 0x040000 /* - Block procfs-style "magic" symlinks. */
+#define AT_NO_SYMLINKS 0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.21.0
On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <[email protected]> wrote:
> The need to be able to scope path resolution of interpreters became
> clear with one of the possible vectors used in CVE-2019-5736 (which
> most major container runtimes were vulnerable to).
>
> Naively, it might seem that openat(2) -- which supports path scoping --
> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> could be written as a #!-style script with the symlink target as the
> interpreter -- which would be completely missed by just scoping the
> openat(2). An example of this being exploitable is CVE-2019-5736.
>
> In order to get around this, we need to pass down to each binfmt_*
> implementation the scoping flags requested in execveat(2). In order to
> maintain backwards-compatibility we only pass the scoping AT_* flags.
>
> To avoid breaking userspace (in the exceptionally rare cases where you
> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> set in execveat(2).
This seems extremely dangerous. I like the overall series, but not this patch.
> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>
> sched_exec();
>
> + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> + AT_THIS_ROOT);
[...]
> +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
So now what happens if there is a setuid root ELF binary with program
interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
going to let the unprivileged user decide which interpreter the
setuid-root process should use? From a high-level perspective, opening
the interpreter should be controlled by the program that is being
loaded, not by the program that invoked it.
In my opinion, CVE-2019-5736 points out two different problems:
The big problem: The __ptrace_may_access() logic has a special-case
short-circuit for "introspection" that you can't opt out of; this
makes it possible to open things in procfs that are related to the
current process even if the credentials of the process wouldn't permit
accessing another process like it. I think the proper fix to deal with
this would be to add a prctl() flag for "set whether introspection is
allowed for this process", and if userspace has manually un-set that
flag, any introspection special-case logic would be skipped.
An additional problem: /proc/*/exe can be used to open a file for
writing; I think it may have been Andy Lutomirski who pointed out some
time ago that it would be nice if you couldn't use /proc/*/fd/* to
re-open files with more privileges, which is sort of the same thing.
On 2019-05-06, Jann Horn <[email protected]> wrote:
> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <[email protected]> wrote:
> > The need to be able to scope path resolution of interpreters became
> > clear with one of the possible vectors used in CVE-2019-5736 (which
> > most major container runtimes were vulnerable to).
> >
> > Naively, it might seem that openat(2) -- which supports path scoping --
> > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> > binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> > could be written as a #!-style script with the symlink target as the
> > interpreter -- which would be completely missed by just scoping the
> > openat(2). An example of this being exploitable is CVE-2019-5736.
> >
> > In order to get around this, we need to pass down to each binfmt_*
> > implementation the scoping flags requested in execveat(2). In order to
> > maintain backwards-compatibility we only pass the scoping AT_* flags.
> >
> > To avoid breaking userspace (in the exceptionally rare cases where you
> > have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> > set in execveat(2).
>
> This seems extremely dangerous. I like the overall series, but not this patch.
>
> > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
> >
> > sched_exec();
> >
> > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> > + AT_THIS_ROOT);
> [...]
> > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
>
> So now what happens if there is a setuid root ELF binary with program
> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
> going to let the unprivileged user decide which interpreter the
> setuid-root process should use? From a high-level perspective, opening
> the interpreter should be controlled by the program that is being
> loaded, not by the program that invoked it.
I went a bit nuts with openat_exec(), and I did end up adding it to the
ELF interpreter lookup (and you're completely right that this is a bad
idea -- I will drop it from this patch if it's included in the next
series).
The proposed solutions you give below are much nicer than this patch so
I can drop it and work on fixing those issues separately.
> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of; this
> makes it possible to open things in procfs that are related to the
> current process even if the credentials of the process wouldn't permit
> accessing another process like it. I think the proper fix to deal with
> this would be to add a prctl() flag for "set whether introspection is
> allowed for this process", and if userspace has manually un-set that
> flag, any introspection special-case logic would be skipped.
We could do PR_SET_DUMPABLE=3 for this, I guess?
> An additional problem: /proc/*/exe can be used to open a file for
> writing; I think it may have been Andy Lutomirski who pointed out some
> time ago that it would be nice if you couldn't use /proc/*/fd/* to
> re-open files with more privileges, which is sort of the same thing.
This is something I'm currently working on a series for, which would
boil down to some restrictions on how re-opening of file descriptors
works through procfs.
However, execveat() of a procfs magiclink is a bit hard to block --
there is no way for userspace to to represent a file being "open for
execute" so they are all "open for execute" by default and blocking it
outright seems a bit extreme (though I actually hope to eventually add
the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
why I've reserved some bits).
(Thinking more about it, there is an argument that I should include the
above patch into this series so that we can block re-opening of fds
opened through resolveat(2) without explicit flags from the outset.)
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> On May 6, 2019, at 12:17 PM, Aleksa Sarai <[email protected]> wrote:
>
>> On 2019-05-06, Jann Horn <[email protected]> wrote:
>>> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <[email protected]> wrote:
>>> The need to be able to scope path resolution of interpreters became
>>> clear with one of the possible vectors used in CVE-2019-5736 (which
>>> most major container runtimes were vulnerable to).
>>>
>>> Naively, it might seem that openat(2) -- which supports path scoping --
>>> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
>>> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
>>> could be written as a #!-style script with the symlink target as the
>>> interpreter -- which would be completely missed by just scoping the
>>> openat(2). An example of this being exploitable is CVE-2019-5736.
>>>
>>> In order to get around this, we need to pass down to each binfmt_*
>>> implementation the scoping flags requested in execveat(2). In order to
>>> maintain backwards-compatibility we only pass the scoping AT_* flags.
>>>
>>> To avoid breaking userspace (in the exceptionally rare cases where you
>>> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
>>> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
>>> set in execveat(2).
>>
>> This seems extremely dangerous. I like the overall series, but not this patch.
>>
>>> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>>>
>>> sched_exec();
>>>
>>> + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
>>> + AT_THIS_ROOT);
>> [...]
>>> +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
>>
>> So now what happens if there is a setuid root ELF binary with program
>> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
>> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
>> going to let the unprivileged user decide which interpreter the
>> setuid-root process should use? From a high-level perspective, opening
>> the interpreter should be controlled by the program that is being
>> loaded, not by the program that invoked it.
>
> I went a bit nuts with openat_exec(), and I did end up adding it to the
> ELF interpreter lookup (and you're completely right that this is a bad
> idea -- I will drop it from this patch if it's included in the next
> series).
>
> The proposed solutions you give below are much nicer than this patch so
> I can drop it and work on fixing those issues separately.
>
>> In my opinion, CVE-2019-5736 points out two different problems:
>>
>> The big problem: The __ptrace_may_access() logic has a special-case
>> short-circuit for "introspection" that you can't opt out of; this
>> makes it possible to open things in procfs that are related to the
>> current process even if the credentials of the process wouldn't permit
>> accessing another process like it. I think the proper fix to deal with
>> this would be to add a prctl() flag for "set whether introspection is
>> allowed for this process", and if userspace has manually un-set that
>> flag, any introspection special-case logic would be skipped.
>
> We could do PR_SET_DUMPABLE=3 for this, I guess?
>
>> An additional problem: /proc/*/exe can be used to open a file for
>> writing; I think it may have been Andy Lutomirski who pointed out some
>> time ago that it would be nice if you couldn't use /proc/*/fd/* to
>> re-open files with more privileges, which is sort of the same thing.
>
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.
>
> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).
There’s an O_MAYEXEC series floating around.
>
> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>
Jann Horn <[email protected]> writes:
>
> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of;
Once upon a time in a galaxy far far away I fixed a bug where we missing
ptrace_may_access checks on various proc files and systems using selinux
stopped working. At the time selinux did not allow ptrace like access
to yourself. The "introspection" special case was the quick and simple
work-around.
There is nothing fundamental in having the "introspection" special case
except that various lsms have probably grown to depend upon it being
there. I expect without difficulty we could move the check down
into the various lsms. Which would get that check out of the core
kernel code.
Then the special case would the lsms challenge to keep or remove.
Eric
On 2019-05-07, Aleksa Sarai <[email protected]> wrote:
> On 2019-05-06, Jann Horn <[email protected]> wrote:
> > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <[email protected]> wrote:
> > > The need to be able to scope path resolution of interpreters became
> > > clear with one of the possible vectors used in CVE-2019-5736 (which
> > > most major container runtimes were vulnerable to).
> > >
> > > Naively, it might seem that openat(2) -- which supports path scoping --
> > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> > > binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> > > could be written as a #!-style script with the symlink target as the
> > > interpreter -- which would be completely missed by just scoping the
> > > openat(2). An example of this being exploitable is CVE-2019-5736.
> > >
> > > In order to get around this, we need to pass down to each binfmt_*
> > > implementation the scoping flags requested in execveat(2). In order to
> > > maintain backwards-compatibility we only pass the scoping AT_* flags.
> > >
> > > To avoid breaking userspace (in the exceptionally rare cases where you
> > > have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> > > set in execveat(2).
> >
> > This seems extremely dangerous. I like the overall series, but not this patch.
> >
> > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
> > >
> > > sched_exec();
> > >
> > > + bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> > > + AT_THIS_ROOT);
> > [...]
> > > +#define AT_THIS_ROOT 0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
> >
> > So now what happens if there is a setuid root ELF binary with program
> > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
> > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
> > going to let the unprivileged user decide which interpreter the
> > setuid-root process should use? From a high-level perspective, opening
> > the interpreter should be controlled by the program that is being
> > loaded, not by the program that invoked it.
>
> I went a bit nuts with openat_exec(), and I did end up adding it to the
> ELF interpreter lookup (and you're completely right that this is a bad
> idea -- I will drop it from this patch if it's included in the next
> series).
>
> The proposed solutions you give below are much nicer than this patch so
> I can drop it and work on fixing those issues separately.
Another possible solution would be to only allow (for instance)
AT_NO_MAGICLINKS for execveat(2). That way you cannot scope the
resolution but you can block the most concerning cases -- those
involving /proc-related access.
I've posted a v7 with this patch dropped (because we can always add AT_*
flags later in time), but I think having at least NO_MAGICLINKS would be
useful.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Tue, May 07, 2019 at 07:38:58PM -0500, Eric W. Biederman wrote:
> Jann Horn <[email protected]> writes:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of;
>
> Once upon a time in a galaxy far far away I fixed a bug where we missing
> ptrace_may_access checks on various proc files and systems using selinux
> stopped working. At the time selinux did not allow ptrace like access
> to yourself. The "introspection" special case was the quick and simple
> work-around.
>
> There is nothing fundamental in having the "introspection" special case
> except that various lsms have probably grown to depend upon it being
> there. I expect without difficulty we could move the check down
> into the various lsms. Which would get that check out of the core
> kernel code.
Oh, if that's an option, that would be great, I think.
But this means, for example, that a non-root, non-dumpable process can't
open /proc/self/maps anymore, or open /proc/self/fd/*, and things like
that, without making itself dumpable. I would be surprised if there is
no code out there that relies on that.
From what I can tell, without the introspection special case,
introspection would fail in the following cases (assuming that the
process is not capable and isn't using sys_setfs[ug]id()):
- ruid/euid/suid are not all the same
- rgid/egid/sgid are not all the same
- process is not dumpable
I think that there probably should be some way for a non-dumpable
process to look at its own procfs entries? If we could start from a
clean slate, I'd propose an opt-in flag to openat() for that, but
since we don't have a clean slate, I'd be afraid of breaking things
with that. But maybe I'm just being overly careful here?
On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> On 2019-05-06, Jann Horn <[email protected]> wrote:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of; this
> > makes it possible to open things in procfs that are related to the
> > current process even if the credentials of the process wouldn't permit
> > accessing another process like it. I think the proper fix to deal with
> > this would be to add a prctl() flag for "set whether introspection is
> > allowed for this process", and if userspace has manually un-set that
> > flag, any introspection special-case logic would be skipped.
>
> We could do PR_SET_DUMPABLE=3 for this, I guess?
Hmm... I'd make it a new prctl() command, since introspection is
somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
think the introspection flag should be per-thread.
> > An additional problem: /proc/*/exe can be used to open a file for
> > writing; I think it may have been Andy Lutomirski who pointed out some
> > time ago that it would be nice if you couldn't use /proc/*/fd/* to
> > re-open files with more privileges, which is sort of the same thing.
>
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.
Ah, nice!
> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).
(For what it's worth, I'm mostly concerned about read vs write, not
really about execute, since execute really is just another form of
reading in my opinion.)
> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
On Fri, May 10, 2019 at 1:41 PM Jann Horn <[email protected]> wrote:
>
> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > On 2019-05-06, Jann Horn <[email protected]> wrote:
> > > In my opinion, CVE-2019-5736 points out two different problems:
> > >
> > > The big problem: The __ptrace_may_access() logic has a special-case
> > > short-circuit for "introspection" that you can't opt out of; this
> > > makes it possible to open things in procfs that are related to the
> > > current process even if the credentials of the process wouldn't permit
> > > accessing another process like it. I think the proper fix to deal with
> > > this would be to add a prctl() flag for "set whether introspection is
> > > allowed for this process", and if userspace has manually un-set that
> > > flag, any introspection special-case logic would be skipped.
> >
> > We could do PR_SET_DUMPABLE=3 for this, I guess?
>
> Hmm... I'd make it a new prctl() command, since introspection is
> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> think the introspection flag should be per-thread.
I've lost track of the context here, but it seems to me that
mitigating attacks involving accidental following of /proc links
shouldn't depend on dumpability. What's the actual problem this is
trying to solve again?
On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> On Fri, May 10, 2019 at 1:41 PM Jann Horn <[email protected]> wrote:
> >
> > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > On 2019-05-06, Jann Horn <[email protected]> wrote:
> > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > >
> > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > short-circuit for "introspection" that you can't opt out of; this
> > > > makes it possible to open things in procfs that are related to the
> > > > current process even if the credentials of the process wouldn't permit
> > > > accessing another process like it. I think the proper fix to deal with
> > > > this would be to add a prctl() flag for "set whether introspection is
> > > > allowed for this process", and if userspace has manually un-set that
> > > > flag, any introspection special-case logic would be skipped.
> > >
> > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> >
> > Hmm... I'd make it a new prctl() command, since introspection is
> > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > think the introspection flag should be per-thread.
>
> I've lost track of the context here, but it seems to me that
> mitigating attacks involving accidental following of /proc links
> shouldn't depend on dumpability. What's the actual problem this is
> trying to solve again?
The one actual security problem that I've seen related to this is
CVE-2019-5736. There is a write-up of it at
<https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
under "Successful approach", but it goes more or less as follows:
A container is running that doesn't use user namespaces (because for
some reason I don't understand, apparently some people still do that).
An evil process is running inside the container with UID 0 (as in,
GLOBAL_ROOT_UID); so if the evil process inside the container was able
to reach root-owned files on the host filesystem, it could write into
them.
The container engine wants to spawn a new process inside the container.
It forks off a child that joins the container's namespaces (including
PID and mount namespaces), and then the child calls execve() on some
path in the container.
The attacker replaces the executable in the container with a symlink
to /proc/self/exe and replaces a library inside the container with a
malicious one.
When the container engine calls execve(), intending to run an executable
inside the container, it instead goes through ptrace_may_access() using
the introspection short-circuit and re-executes its own executable
through the jumped symlink /proc/self/exe (which is normally unreachable
for the container). After the execve(), the process loads an evil
library from inside the container and is under the control of the
container.
Now the container controls a process whose /proc/self/exe is a jumped
symlink to a host executable, and the container can write into it.
Some container engines are now using an extremely ugly hack to work
around this - whenever they want to enter a container, they copy the
host binary into a new memfd and execute that to avoid exposing the
original host binary to containers:
<https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
In my opinion, the problems here are:
- Apparently some people run untrusted containers without user
namespaces. It would be really nice if people could not do that.
(Probably the biggest problem here.)
- ptrace_may_access() has a short-circuit that permits a process to
unintentionally look at itself even if it has dropped privileges -
here, it permits the execve("/proc/self/exe", ...) that would
normally be blocked by the check for CAP_SYS_PTRACE if the process
is nondumpable.
- You can use /proc/*/exe to get a writable fd.
On Sat, May 11, 2019 at 12:55 AM Jann Horn <[email protected]> wrote:
>
> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> > On Fri, May 10, 2019 at 1:41 PM Jann Horn <[email protected]> wrote:
> > >
> > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > > On 2019-05-06, Jann Horn <[email protected]> wrote:
> > > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > > >
> > > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > > short-circuit for "introspection" that you can't opt out of; this
> > > > > makes it possible to open things in procfs that are related to the
> > > > > current process even if the credentials of the process wouldn't permit
> > > > > accessing another process like it. I think the proper fix to deal with
> > > > > this would be to add a prctl() flag for "set whether introspection is
> > > > > allowed for this process", and if userspace has manually un-set that
> > > > > flag, any introspection special-case logic would be skipped.
> > > >
> > > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> > >
> > > Hmm... I'd make it a new prctl() command, since introspection is
> > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > > think the introspection flag should be per-thread.
> >
> > I've lost track of the context here, but it seems to me that
> > mitigating attacks involving accidental following of /proc links
> > shouldn't depend on dumpability. What's the actual problem this is
> > trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
> - Apparently some people run untrusted containers without user
> namespaces. It would be really nice if people could not do that.
> (Probably the biggest problem here.)
I know I sound like a broken record since I've been going on about this
forever together with a lot of other people but honestly,
the fact that people are running untrusted workloads in privileged containers
is the real issue here.
Aleksa is a good friend of mine and we have discussed this a lot so I hope
he doesn't hate me for saying this again: it is crazy that there are container
runtimes out there that promise (or at least do not state the opposite)
containers without user namespaces or containers with user namespaces
that allow to map the host root id to anything can be safe. They cannot.
Even if this /proc/*/exe thing is somehow blocked there
are other ways of escaping from a privileged container.
We (i.e. LXC) literally do not accept CVEs for privileged containers
because we do not consider them safe by design.
It seems to me to be heading in the wrong direction to keep up the
illusion that with enough effort we can make this all nice and safe.
Yes, the userspace memfd hack we came up with is as ugly as a security
patch can be but if you make promises you can't keep you better be
prepared to pay the price when things start to fall apart.
So if this part of the patch is just needed to handle this do we really
want to do all that tricky work or is there more to gain from this that
makes it worth it.
Christian
On 2019-05-11, Christian Brauner <[email protected]> wrote:
> > In my opinion, the problems here are:
> >
> > - Apparently some people run untrusted containers without user
> > namespaces. It would be really nice if people could not do that.
> > (Probably the biggest problem here.)
>
> I know I sound like a broken record since I've been going on about this
> forever together with a lot of other people but honestly,
> the fact that people are running untrusted workloads in privileged containers
> is the real issue here.
I completely agree. It's a shit-show, and it's caused by bad defaults in
Docker and (now) podman. To be fair, they both now support rootless
containers but the default is still privileged containers.
They do support user namespaces (though it should be noted that LXD's
support is much nicer from a security standpoint) but unless it's the
default the support is almost pointless. In the case of Docker it can
lead to some usability issues when you enable it (which I believe is the
main justification for it not being the default).
> Aleksa is a good friend of mine and we have discussed this a lot so I hope
> he doesn't hate me for saying this again: it is crazy that there are container
> runtimes out there that promise (or at least do not state the opposite)
> containers without user namespaces or containers with user namespaces
> that allow to map the host root id to anything can be safe. They cannot.
Yeah, the fact that we (runc) don't scream from the rooftops that this
setup is insecure is definitely a problem. I have mentioned this
whenever I've had a chance, but the fact that the most popular runtimes
(which use runc) don't use user namespaces compounds the issue. I'm
willing to bet that >90% of users of runc-based runtimes don't use user
namespaces at all, and this is all down to bad defaults.
There are also some other misfeatures we have in runc that we're
basically forced to support because some users use them, and we can't
really break entire projects (even though it's the projects' fault they
have an insecure setup).
> It seems to me to be heading in the wrong direction to keep up the
> illusion that with enough effort we can make this all nice and safe.
> Yes, the userspace memfd hack we came up with is as ugly as a security
> patch can be but if you make promises you can't keep you better be
> prepared to pay the price when things start to fall apart.
> So if this part of the patch is just needed to handle this do we really
> want to do all that tricky work or is there more to gain from this that
> makes it worth it.
I dropped this patch in v7, I don't think it's required for the
overarching feature. Looking back on it, it doesn't make much sense
given the context that privileged containers are unsafe in the first
place.
I do think that being able to block introspection might be a useful
hardening feature though. During attachment it would be nice to be sure
that nothing will be able to touch the attaching process's /proc/$pid --
even itself.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> On May 10, 2019, at 3:55 PM, Jann Horn <[email protected]> wrote:
>
>> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
>>> On Fri, May 10, 2019 at 1:41 PM Jann Horn <[email protected]> wrote:
>>>
>>>> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
>>>>> On 2019-05-06, Jann Horn <[email protected]> wrote:
>>>>> In my opinion, CVE-2019-5736 points out two different problems:
>>>>>
>>>>> The big problem: The __ptrace_may_access() logic has a special-case
>>>>> short-circuit for "introspection" that you can't opt out of; this
>>>>> makes it possible to open things in procfs that are related to the
>>>>> current process even if the credentials of the process wouldn't permit
>>>>> accessing another process like it. I think the proper fix to deal with
>>>>> this would be to add a prctl() flag for "set whether introspection is
>>>>> allowed for this process", and if userspace has manually un-set that
>>>>> flag, any introspection special-case logic would be skipped.
>>>>
>>>> We could do PR_SET_DUMPABLE=3 for this, I guess?
>>>
>>> Hmm... I'd make it a new prctl() command, since introspection is
>>> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
>>> think the introspection flag should be per-thread.
>>
>> I've lost track of the context here, but it seems to me that
>> mitigating attacks involving accidental following of /proc links
>> shouldn't depend on dumpability. What's the actual problem this is
>> trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
I think that, at this point, the task should be considered owned by the container. Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid. In fact, the process is arguably pwned even *before* execve.
A better “spawn” API should fix this. In the mean time, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
Cute.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
> - Apparently some people run untrusted containers without user
> namespaces. It would be really nice if people could not do that.
> (Probably the biggest problem here.)
> - ptrace_may_access() has a short-circuit that permits a process to
> unintentionally look at itself even if it has dropped privileges -
> here, it permits the execve("/proc/self/exe", ...) that would
> normally be blocked by the check for CAP_SYS_PTRACE if the process
> is nondumpable.
I don’t see this as a problem. Dumpable is about protecting a task from others, not about protecting a task against itself.
> - You can use /proc/*/exe to get a writable fd.
This is IMO the real bug.
On 2019-05-11, Andy Lutomirski <[email protected]> wrote:
> >> I've lost track of the context here, but it seems to me that
> >> mitigating attacks involving accidental following of /proc links
> >> shouldn't depend on dumpability. What's the actual problem this is
> >> trying to solve again?
> >
> > The one actual security problem that I've seen related to this is
> > CVE-2019-5736. There is a write-up of it at
> > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> > under "Successful approach", but it goes more or less as follows:
> >
> > A container is running that doesn't use user namespaces (because for
> > some reason I don't understand, apparently some people still do that).
> > An evil process is running inside the container with UID 0 (as in,
> > GLOBAL_ROOT_UID); so if the evil process inside the container was able
> > to reach root-owned files on the host filesystem, it could write into
> > them.
> >
> > The container engine wants to spawn a new process inside the container.
> > It forks off a child that joins the container's namespaces (including
> > PID and mount namespaces), and then the child calls execve() on some
> > path in the container.
>
> I think that, at this point, the task should be considered owned by
> the container. Maybe we should have a better API than execve() to
> execute a program in a safer way, but fiddling with dumpability seems
> like a band-aid. In fact, the process is arguably pwned even *before*
> execve.
Yeah, execve is just the vector (though in this case it's done in order
to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar
but was attacking a dirfd that runc had open into the container (LXC had
a very similar bug too) -- setting !mm->dumpable was one of the
workarounds we had for this.
> A better “spawn” API should fix this. In the mean time, I think it
> should be assumed that, if you join a container’s namespaces, you are
> at its mercy.
This is generally how we treat containers as runtime authors, but it's
not a trivial thing to get right. In many cases the kernel APIs are
working against you -- Christian and myself have written a fair few
patches to fix holes in the kernel APIs so we can avoid these kinds of
assumptions.
But yes, one of the most risky parts of a container runtime is when
you're attaching to a running container because all of the helpful
introspection APIs in /proc/ suddenly become a security nightmare. A
better "spawn a process in these namespaces" API might help improve the
situation (or at least, I hope it would).
> > - You can use /proc/*/exe to get a writable fd.
>
> This is IMO the real bug.
I will try to send an RFC of the patchset I have for this next week or
so. Funnily enough, currently /proc/*/exe has the write bit set in its
"mode" (my series fixes this).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <[email protected]> wrote:
>
> A better “spawn” API should fix this.
Andy, stop with the "spawn would be better".
Spawn is garbage. It's garbage because it's fundamentally too
inflexible, and it's garbage because it is quite complex to try to
work around the inflexibility by having those complex "action pointer
arrays" to make up for its failings.
And spawn() would fundamentally have all the same permission issues
that you now point to execve() as having, so it doesn't even *solve*
anything.
You've said this whole "spawn would fix things" thing before, and it's
wrong. Spawn isn't better. Really. If fixes absolutely zero things,
and the only reason for spawn existing is because VMS and NT had that
broken and inflexible model.
There's at least one paper from some MS people about how "spawn()" is
wonderful, and maybe you bought into the garbage from that. But that
paper is about how they hate fork(), not because of execve(). And if
you hate fork, use "vfork()" instead (preferably with an immediate
call to a non-returning function in the child to avoid the stack
re-use issue that makes it so simple to screw up vfork() in hard to
debug ways).
execve() is a _fine_ model. That's not the problem in this whole issue
at all - never was, and never will be.
The problem in this discussion is (a) having privileges you shouldn't
have and (b) having other interfaces that make it easyish to change
the filesystem layout to confuse those entities with privileges.
So the reason the open flags can be problematic is exactly because
they effectively change filesystem layout. And no, it's not just
AT_THIS_ROOT, although that's the obvious one. Things like "you can't
follow symlinks" can also effectively change the layout: imagine if
you have a PATH-like lookup model, and you end up having symlinks as
part of the standard filesystem layout.. Now a "don't follow symlinks"
can turn the *standard* executable into something that isn't found,
and then you might end up executing something else instead (think root
having '.' as the last entry in path, which some people used to
suggest as the fix for the completely bad "first entry" case)..
Notice? None of the real problems are about execve or would be solved
by any spawn API. You just think that because you've apparently been
talking to too many MS people that think fork (and thus indirectly
execve()) is bad process management.
Linus
On 2019-05-11, Linus Torvalds <[email protected]> wrote:
> On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Notice? None of the real problems are about execve or would be solved
> > by any spawn API. You just think that because you've apparently been
> > talking to too many MS people that think fork (and thus indirectly
> > execve()) is bad process management.
>
> Side note: a good policy has been (and remains) to make suid binaries
> not be dynamically linked. And in the absence of that, the dynamic
> linker at least resets the library path when it notices itself being
> dynamic, and it certainly doesn't inherit any open flags from the
> non-trusted environment.
>
> And by the same logic, a suid interpreter must *definitely* should not
> inherit any execve() flags from the non-trusted environment. So I
> think Aleksa's patch to use the passed-in open flags is *exactly* the
> wrong thing to do for security reasons. It doesn't close holes, it
> opens them.
Yup, I've dropped the patch for the next version. (To be honest, I'm not
sure why I included any of the other flags -- the only one that would've
been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
<[email protected]> wrote:
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
Side note: a good policy has been (and remains) to make suid binaries
not be dynamically linked. And in the absence of that, the dynamic
linker at least resets the library path when it notices itself being
dynamic, and it certainly doesn't inherit any open flags from the
non-trusted environment.
And by the same logic, a suid interpreter must *definitely* should not
inherit any execve() flags from the non-trusted environment. So I
think Aleksa's patch to use the passed-in open flags is *exactly* the
wrong thing to do for security reasons. It doesn't close holes, it
opens them.
Linus
On May 11, 2019 7:43:44 PM GMT+02:00, Linus Torvalds <[email protected]> wrote:
>On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <[email protected]> wrote:
>>
>> Yup, I've dropped the patch for the next version. (To be honest, I'm
>not
>> sure why I included any of the other flags -- the only one that
>would've
>> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
>I do wonder if we could try to just set AT_NO_MAGICLINKS
>unconditionally for execve() (and certainly for the suid case).
>
>I'd rather try to do these things across the board, than have "suid
>binaries are treated specially" if at all possible.
>
>The main use case for having /proc/<pid>/exe thing is for finding open
>file descriptors, and for 'ps' kind of use, or to find the startup
>directory when people don't populate the execve() environment fully
>(ie "readlink(/proc/self/exe)" is afaik pretty common.
>
>Sadly, googling for
>
> execve /proc/self/exe
>
>does actually find hits, including one that implies that chrome does
>exactly that. So it might not be possible.
>
>Somewhat odd, but it does just confirm the whole "users will at some
>point do everything in their power to use every odd special case,
>intended or not".
>
> Linus
Sadly I have to admit that we are using this.
Also, execveat on glibc is implemented via
/proc/self/fd/<nr> on kernels that do not
have a proper execveat.
See fexecve...
Christian
On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <[email protected]> wrote:
>
> Yup, I've dropped the patch for the next version. (To be honest, I'm not
> sure why I included any of the other flags -- the only one that would've
> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
I do wonder if we could try to just set AT_NO_MAGICLINKS
unconditionally for execve() (and certainly for the suid case).
I'd rather try to do these things across the board, than have "suid
binaries are treated specially" if at all possible.
The main use case for having /proc/<pid>/exe thing is for finding open
file descriptors, and for 'ps' kind of use, or to find the startup
directory when people don't populate the execve() environment fully
(ie "readlink(/proc/self/exe)" is afaik pretty common.
Sadly, googling for
execve /proc/self/exe
does actually find hits, including one that implies that chrome does
exactly that. So it might not be possible.
Somewhat odd, but it does just confirm the whole "users will at some
point do everything in their power to use every odd special case,
intended or not".
Linus
On 2019-05-11, Linus Torvalds <[email protected]> wrote:
> On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <[email protected]> wrote:
> > Yup, I've dropped the patch for the next version. (To be honest, I'm not
> > sure why I included any of the other flags -- the only one that would've
> > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
> I do wonder if we could try to just set AT_NO_MAGICLINKS
> unconditionally for execve() (and certainly for the suid case).
>
> I'd rather try to do these things across the board, than have "suid
> binaries are treated specially" if at all possible.
>
> The main use case for having /proc/<pid>/exe thing is for finding open
> file descriptors, and for 'ps' kind of use, or to find the startup
> directory when people don't populate the execve() environment fully
> (ie "readlink(/proc/self/exe)" is afaik pretty common.
>
> Sadly, googling for
>
> execve /proc/self/exe
>
> does actually find hits, including one that implies that chrome does
> exactly that. So it might not be possible.
>
> Somewhat odd, but it does just confirm the whole "users will at some
> point do everything in their power to use every odd special case,
> intended or not".
*sheepishly* Actually we use this in runc very liberally.
It's done because we need to run namespace-related code but runc is
written in Go so (long story short) we re-exec ourselves in order to
run some __attribute__((constructor)) code which sets up the namespaces
and then lets the Go runtime boot.
I suspect just writing everything in C would've been orders of magnitude
simpler, but I wasn't around when that decision was made. :P
Also as Christian mentioned, fexecve(3) in glibc is implemented using
/proc/self/fd on old kernels (then again, if we change the behaviour on
new kernels it won't matter because glibc uses execveat(AT_EMPTY_PATH)
if it's available).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> On May 11, 2019, at 10:21 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <[email protected]> wrote:
>>
>> A better “spawn” API should fix this.
>
> Andy, stop with the "spawn would be better".
It doesn’t have to be spawn per se. But the current situation sucks.
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
>
>
I’ve literally never spoken to an MS person about it.
What container managers and init systems *want* is a way to drop
privileges, change namespaces, etc and then run something in a
controlled way so that the intermediate states aren’t dangerous. An
API for this could be spawn-like or exec-like — that particular
distinction is beside the point. Having personally written code that
mucks with namepsaces, I've wanted two particular abilities that are
both quite awkward:
a) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary in the container's
filesystem, all atomically enough that I don't need to worry about
accidentally leaking privileges into the container. A
super-duper-non-dumpable mode would kind of allow this, but I'd worry
that there's some other hole besides ptrace() and /proc/self.
b) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary that is *not* in the
container's filesystem. This happens, for example, if the container's
mount namespace has no exec mounts at all. We don't have a fantastic
way to do this at all right now due to /proc/self/exe.
Regardless, the actual CVE at hand would have been nicely avoided if
writing to /proc/self/exe didn’t work, and I see no reason we can’t
make that happen.
I suppose we could also consider a change to disable /proc/self/exe if
it's not reachable from /proc/self/root. By "disable", I mean that
readlink() should maybe still work, but actually trying to open it
could probably fail safely.
On Sat, May 11, 2019 at 07:08:49PM -0400, Linus Torvalds wrote:
> [ on mobile again, power is off and the wifi doesn't work, so I'm reading
> email on my phone and apologizing once more for horrible html email.. ]
>
> On Sat, May 11, 2019, 18:40 Andy Lutomirski <[email protected]> wrote:
>
> >
> > a) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary in the container's
For the namespace part, an idea I had and presented at LPC a while ago
was to make setns() interpret the nstype argument as a flag argument and
to introduce an fd that can refer to multiple namespaces at the same
time. This way you could do:
setns(namespaces_fd, CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWPID)
that could still be done. But I since have come to think that there's a
better way now that we have CLONE_PIDFD. We should just make setns()
take a pidfd as argument and then be able to call:
setns(pidfd, 0);
which would cause the calling thread to join all of the namespaces of
the process referred to by pidfd at the same time. That really shouldn't
be hard to do. I could probably get this going rather quickly and it
would really help out container managers a lot.
> > filesystem, all atomically enough that I don't need to worry about
> > accidentally leaking privileges into the container. A
> > super-duper-non-dumpable mode would kind of allow this, but I'd worry
> > that there's some other hole besides ptrace() and /proc/self.
> >
>
> So I would expect that you'd want to do this even *without* doing an execve
> at all, which is why I still don't think this is actually about
> spawn/execve at all.
>
> But I agree that what you that what you want sounds reasonable. But I think
I have pitched an api like that a while ago (see [1]) - when I first
started this fd for processes thing - that would allow you to do things
like that. The gist was:
1. int pidfd_create aka CLONE_PIDFD now
will return an fd that creates a process context. The fd returned by
does not refer to any existing process and has not actually been
started yet. So non-configuration operations on it or trying to
interact with it would fail with e.g. ESRCH/EINVAL.
We essentially have this now with CLONE_PIDFD. The bit that is missing
is an "unscheduled" process.
2. int pidfd_config
takes a CLONE_PIDFD and can be used to configure a process context
before it is alive.
Some things that I would like to be able to do with this syscall are:
- configure signals
- set clone flags
- write idmappings if the process runs in a new user namespace
- configure what happens when all procfds referring to the process are gone
- ...
3. int pidfd_info
4. int pidfd_manage
Parts of that are captured in pidfd_send_signal().
Just to get a very rough feel for this without detailing parameters right now:
/* process should have own mountns */
pidfd_config(fd, PROC_SET_NAMESPACE, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWUSR, <potentially additional arguments>)
/* process should get SIGKILL when all procfds are closed */
pidfd_config(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>)
After the caller is done configuring the process there would be a final step:
pidfd_config(fd, PROC_CREATE, 0, <potentially additional arguments>)
which would create the process and (either as return value or through a
parameter) return the pid of the newly created process.
I had more thoughts on this and had started prototyping some of it but
then there wasn't much interest it seemed. Maybe because it's crazy.
[1]: https://lore.kernel.org/lkml/[email protected]/
> the "dumpable" flag has always been a complete hack, and very unreliable
> with random meanings (and random ways to then get around it).
>
> We have actually had lots of people wanting to run "lists of system calls"
> kinds of things. Sometimes for performance reasons, sometimes for other
> random reasons Maybe that "atomicity" angle would be another one, although
> we would have to make the setuid etc things be something that happens at
> the end.
>
> So rather than "spawn", is much rather see people think about ways to just
> batch operations in general, rather than think it is about batching things
> just before a process change.
>
> b) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary that is *not* in the
> > container's filesystem.
> >
>
> Hey, you could probably do something very close to that by opening the
> executable you want to run first, making it O_CLOEXEC, then doing all the
> other things (which now makes the executable inaccessible), and finally
> doing execveat() on the file descriptor..
I think that's somewhat similar to what I've suggested above.
>
> I say "something very close" because even though it's O_CLOEXEC, only the
> file would be closed, and /proc/self/exe would still exist.
>
> But we really could make that execveat of a O_CLOEXEC file thing also
> disable access to /proc/*/exe, and it sounds like a sane and simple
> extension in general to do..
>
> Linus
>
> >
On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <[email protected]> wrote:
>
> I bet this will break something that already exists. An execveat() flag to turn off /proc/self/exe would do the trick, though.
Thinking more about it, I suspect it is (once again) wrong to let the
thing that does the execve() control that bit.
Generally, the less we allow people to affect the lifetime and
environment of a suid executable, the better off we are.
But maybe we could limit /proc/*/exe to at least not honor suid'ness
of the target? Or does chrome/runc depend on that too?
Linus
On 2019-05-12, Linus Torvalds <[email protected]> wrote:
> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <[email protected]> wrote:
> > I bet this will break something that already exists. An execveat()
> > flag to turn off /proc/self/exe would do the trick, though.
>
> Thinking more about it, I suspect it is (once again) wrong to let the
> thing that does the execve() control that bit.
>
> Generally, the less we allow people to affect the lifetime and
> environment of a suid executable, the better off we are.
>
> But maybe we could limit /proc/*/exe to at least not honor suid'ness
> of the target? Or does chrome/runc depend on that too?
Speaking on the runc side, we don't depend on this. It's possible
someone depends on this for fexecve(3) -- but as mentioned before in
newer kernels glibc uses execve(AT_EMPTY_PATH).
I would like to point out though that I'm a little bit cautious about
/proc/self/exe-specific restrictions -- because a trivial way to get
around them would be to just open it with O_PATH (and you end up with a
/proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec
on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of
setuid descriptors.
The patches I mentioned (which Andy and I discussed off-list) would
effectively make the magiclink modes in /proc/ affect how you can
operate on the path (no write bit in the mode, cannot re-open it write).
One aspect of this is how to handle O_PATH and in particular how do we
handle an O_PATH re-open of an already-restricted magiclink.
Maybe we could make it so that setuid is disallowed if you are dealing
with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you
get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a
magiclink this fmode gets dropped and when the fd is given to
execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not
allowed.
[I assume in this discussion "setuid" means "setuid + setcap", right?]
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2019-05-12, Aleksa Sarai <[email protected]> wrote:
> On 2019-05-12, Linus Torvalds <[email protected]> wrote:
> > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <[email protected]> wrote:
> > > I bet this will break something that already exists. An execveat()
> > > flag to turn off /proc/self/exe would do the trick, though.
> >
> > Thinking more about it, I suspect it is (once again) wrong to let the
> > thing that does the execve() control that bit.
> >
> > Generally, the less we allow people to affect the lifetime and
> > environment of a suid executable, the better off we are.
> >
> > But maybe we could limit /proc/*/exe to at least not honor suid'ness
> > of the target? Or does chrome/runc depend on that too?
>
> Speaking on the runc side, we don't depend on this. It's possible
> someone depends on this for fexecve(3) -- but as mentioned before in
> newer kernels glibc uses execve(AT_EMPTY_PATH).
>
> I would like to point out though that I'm a little bit cautious about
> /proc/self/exe-specific restrictions -- because a trivial way to get
> around them would be to just open it with O_PATH (and you end up with a
> /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec
> on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of
> setuid descriptors.
>
> The patches I mentioned (which Andy and I discussed off-list) would
> effectively make the magiclink modes in /proc/ affect how you can
> operate on the path (no write bit in the mode, cannot re-open it write).
> One aspect of this is how to handle O_PATH and in particular how do we
> handle an O_PATH re-open of an already-restricted magiclink.
>
> Maybe we could make it so that setuid is disallowed if you are dealing
> with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you
> get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a
> magiclink this fmode gets dropped and when the fd is given to
> execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not
> allowed.
... and obviously /proc/self/exe would have an fmode
~FMODE_SETUID_EXEC_ALLOWED from the outset. The reason for this slightly
odd semantic would be to continue to allow O_PATH setuid-exec as long as
the O_PATH was opened from an actual path rather than a magiclink.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
> On May 12, 2019, at 6:35 AM, Aleksa Sarai <[email protected]> wrote:
>
>> On 2019-05-12, Linus Torvalds <[email protected]> wrote:
>>> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <[email protected]> wrote:
>>> I bet this will break something that already exists. An execveat()
>>> flag to turn off /proc/self/exe would do the trick, though.
>>
>> Thinking more about it, I suspect it is (once again) wrong to let the
>> thing that does the execve() control that bit.
>>
>> Generally, the less we allow people to affect the lifetime and
>> environment of a suid executable, the better off we are.
>>
>> But maybe we could limit /proc/*/exe to at least not honor suid'ness
>> of the target? Or does chrome/runc depend on that too?
>
> Speaking on the runc side, we don't depend on this. It's possible
> someone depends on this for fexecve(3) -- but as mentioned before in
> newer kernels glibc uses execve(AT_EMPTY_PATH).
Why are we concerned about suid? Don’t we already block suid if the path being execed doesn’t come from the current mountns? That should mostly cover the things we care about, no?
I suppose we could also block suid for deleted files, so that deleting a known-buggy suid binary becomes more reliable. But every sensible package tool should already be chmoding the suid away before unlinking.