This patchset is being developed here:
<https://github.com/cyphar/linux/tree/resolveat/master>
Patch changelog:
v12:
* Remove @how->reserved field from openat2(2), and instead use the
(struct, size) design for syscall extensions.
* Implement copy_struct_{to,from}_user() to unify (struct, size)
syscall extension designs (as well as make them slightly more
efficient by using memchr_inv() as well as using buffers and
avoiding repeated access_ok() checks for trailing byte operations).
* Port sched_setattr(), perf_event_attr(), and clone3() to use the
new helpers.
v11: <https://lore.kernel.org/lkml/[email protected]/>
<https://lore.kernel.org/lkml/[email protected]/>
v10: <https://lore.kernel.org/lkml/[email protected]/>
v09: <https://lore.kernel.org/lkml/[email protected]/>
v08: <https://lore.kernel.org/lkml/[email protected]/>
v07: <https://lore.kernel.org/lkml/[email protected]/>
v06: <https://lore.kernel.org/lkml/[email protected]/>
v05: <https://lore.kernel.org/lkml/[email protected]/>
v04: <https://lore.kernel.org/lkml/[email protected]/>
v03: <https://lore.kernel.org/lkml/[email protected]/>
v02: <https://lore.kernel.org/lkml/[email protected]/>
v01: <https://lore.kernel.org/lkml/[email protected]/>
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides several other improvements to the openat(2) interface (see the
patch description for more details). The following new LOOKUP_* flags are
added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptor's
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Cc: Al Viro <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Drysdale <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (12):
lib: introduce copy_struct_{to,from}_user helpers
clone3: switch to copy_struct_from_user()
sched_setattr: switch to copy_struct_{to,from}_user()
perf_event_open: switch to copy_struct_from_user()
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
open: openat2(2) syscall
selftests: add openat2(2) selftests
Documentation/filesystems/path-lookup.rst | 12 +-
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 270 ++++++++++--
fs/open.c | 100 ++++-
fs/proc/base.c | 20 +-
fs/proc/fd.c | 23 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 21 +-
include/linux/fs.h | 8 +-
include/linux/namei.h | 9 +
include/linux/syscalls.h | 14 +-
include/linux/uaccess.h | 5 +
include/uapi/asm-generic/fcntl.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 42 ++
include/uapi/linux/sched.h | 2 +
kernel/events/core.c | 45 +-
kernel/fork.c | 34 +-
kernel/sched/core.c | 85 +---
lib/Makefile | 2 +-
lib/struct_user.c | 182 ++++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 167 ++++++++
tools/testing/selftests/openat2/helpers.h | 118 +++++
.../testing/selftests/openat2/linkmode_test.c | 333 +++++++++++++++
.../testing/selftests/openat2/openat2_test.c | 106 +++++
.../selftests/openat2/rename_attack_test.c | 127 ++++++
.../testing/selftests/openat2/resolve_test.c | 402 ++++++++++++++++++
53 files changed, 1971 insertions(+), 248 deletions(-)
create mode 100644 lib/struct_user.c
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.23.0
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases). This is done
in both directions -- hence two helpers -- though it's more common to
have to copy user space structs into kernel space.
Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[1]). A future
patch replaces all of the common uses of this pattern to use the new
copy_struct_{to,from}_user() helpers.
[1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
similar checks to copy_struct_from_user() while rt_sigprocmask(2)
always rejects differently-sized struct arguments.
Suggested-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
include/linux/uaccess.h | 5 ++
lib/Makefile | 2 +-
lib/struct_user.c | 182 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 188 insertions(+), 1 deletion(-)
create mode 100644 lib/struct_user.c
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 34a038563d97..0ad9544a1aee 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -230,6 +230,11 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
#endif /* ARCH_HAS_NOCACHE_UACCESS */
+extern int copy_struct_to_user(void __user *dst, size_t usize,
+ const void *src, size_t ksize);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+ const void __user *src, size_t usize);
+
/*
* probe_kernel_read(): safely attempt to read from a location
* @dst: pointer to the buffer that shall take the data
diff --git a/lib/Makefile b/lib/Makefile
index 29c02a924973..d86c71feaf0a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ endif
CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
endif
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o timerqueue.o xarray.o \
idr.o extable.o \
sha1.o chacha.o irq_regs.o argv_split.o \
diff --git a/lib/struct_user.c b/lib/struct_user.c
new file mode 100644
index 000000000000..7301ab1bbe98
--- /dev/null
+++ b/lib/struct_user.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 SUSE LLC
+ * Copyright (C) 2019 Aleksa Sarai <[email protected]>
+ */
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/uaccess.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+#define BUFFER_SIZE 64
+
+/*
+ * "memset(p, 0, size)" but for user space buffers. Caller must have already
+ * checked access_ok(p, size).
+ */
+static int __memzero_user(void __user *p, size_t s)
+{
+ const char zeros[BUFFER_SIZE] = {};
+ while (s > 0) {
+ size_t n = min(s, sizeof(zeros));
+
+ if (__copy_to_user(p, zeros, n))
+ return -EFAULT;
+
+ p += n;
+ s -= n;
+ }
+ return 0;
+}
+
+/**
+ * copy_struct_to_user: copy a struct to user space
+ * @dst: Destination address, in user space.
+ * @usize: Size of @dst struct.
+ * @src: Source address, in kernel space.
+ * @ksize: Size of @src struct.
+ *
+ * Copies a struct from kernel space to user space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
+ * The recommended usage is something like the following:
+ *
+ * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
+ * {
+ * int err;
+ * struct foo karg = {};
+ *
+ * // do something with karg
+ *
+ * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
+ * if (err)
+ * return err;
+ *
+ * // ...
+ * }
+ *
+ * There are three cases to consider:
+ * * If @usize == @ksize, then it's copied verbatim.
+ * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
+ * older user space. In order to avoid user space getting incomplete
+ * information (new fields might be important), all trailing bytes in @src
+ * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned.
+ * * If @usize > @ksize, then the kernel is "returning" an older struct to a
+ * newer user space. The trailing bytes in @dst (@usize - @ksize) will be
+ * zero-filled.
+ *
+ * Returns (in all cases, some data may have been copied):
+ * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
+ * * -EFAULT: access to user space failed.
+ */
+int copy_struct_to_user(void __user *dst, size_t usize,
+ const void *src, size_t ksize)
+{
+ size_t size = min(ksize, usize);
+ size_t rest = abs(ksize - usize);
+
+ if (unlikely(usize > PAGE_SIZE))
+ return -EFAULT;
+ if (unlikely(!access_ok(dst, usize)))
+ return -EFAULT;
+
+ /* Deal with trailing bytes. */
+ if (usize < ksize) {
+ if (memchr_inv(src + size, 0, rest))
+ return -EFBIG;
+ } else if (usize > ksize) {
+ if (__memzero_user(dst + size, rest))
+ return -EFAULT;
+ }
+ /* Copy the interoperable parts of the struct. */
+ if (__copy_to_user(dst, src, size))
+ return -EFAULT;
+ return 0;
+}
+EXPORT_SYMBOL(copy_struct_to_user);
+
+/**
+ * copy_struct_from_user: copy a struct from user space
+ * @dst: Destination address, in kernel space. This buffer must be @ksize
+ * bytes long.
+ * @ksize: Size of @dst struct.
+ * @src: Source address, in user space.
+ * @usize: (Alleged) size of @src struct.
+ *
+ * Copies a struct from user space to kernel space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
+ * The recommended usage is something like the following:
+ *
+ * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
+ * {
+ * int err;
+ * struct foo karg = {};
+ *
+ * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
+ * if (err)
+ * return err;
+ *
+ * // ...
+ * }
+ *
+ * There are three cases to consider:
+ * * If @usize == @ksize, then it's copied verbatim.
+ * * If @usize < @ksize, then the user space has passed an old struct to a
+ * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
+ * are to be zero-filled.
+ * * If @usize > @ksize, then the user space has passed a new struct to an
+ * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
+ * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
+ *
+ * Returns (in all cases, some data may have been copied):
+ * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
+ * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
+ * * -EFAULT: access to user space failed.
+ */
+int copy_struct_from_user(void *dst, size_t ksize,
+ const void __user *src, size_t usize)
+{
+ size_t size = min(ksize, usize);
+ size_t rest = abs(ksize - usize);
+
+ if (unlikely(usize > PAGE_SIZE))
+ return -EFAULT;
+ if (unlikely(!access_ok(src, usize)))
+ return -EFAULT;
+
+ /* Deal with trailing bytes. */
+ if (usize < ksize)
+ memset(dst + size, 0, rest);
+ else if (usize > ksize) {
+ const void __user *addr = src + size;
+ char buffer[BUFFER_SIZE] = {};
+
+ while (rest > 0) {
+ size_t bufsize = min(rest, sizeof(buffer));
+
+ if (__copy_from_user(buffer, addr, bufsize))
+ return -EFAULT;
+ if (memchr_inv(buffer, 0, bufsize))
+ return -E2BIG;
+
+ addr += bufsize;
+ rest -= bufsize;
+ }
+ }
+ /* Copy the interoperable parts of the struct. */
+ if (__copy_from_user(dst, src, size))
+ return -EFAULT;
+ return 0;
+}
+EXPORT_SYMBOL(copy_struct_from_user);
--
2.23.0
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_from_user() -- that the
memchr_inv() check is done on a buffer instead of one-at-at-time with
get_user().
Additionally, explicitly define CLONE_ARGS_SIZE_VER0 to match the other
users of the struct-extension pattern.
Cc: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
include/uapi/linux/sched.h | 2 ++
kernel/fork.c | 34 ++++++----------------------------
2 files changed, 8 insertions(+), 28 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
};
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
/*
* Scheduling policies
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..70c10d9b429a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2528,39 +2528,17 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
#ifdef __ARCH_WANT_SYS_CLONE3
noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
{
+ int err;
struct clone_args args;
- if (unlikely(size > PAGE_SIZE))
- return -E2BIG;
-
- if (unlikely(size < sizeof(struct clone_args)))
+ if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
- if (unlikely(!access_ok(uargs, size)))
- return -EFAULT;
-
- if (size > sizeof(struct clone_args)) {
- unsigned char __user *addr;
- unsigned char __user *end;
- unsigned char val;
-
- addr = (void __user *)uargs + sizeof(struct clone_args);
- end = (void __user *)uargs + size;
-
- for (; addr < end; addr++) {
- if (get_user(val, addr))
- return -EFAULT;
- if (val)
- return -E2BIG;
- }
-
- size = sizeof(struct clone_args);
- }
-
- if (copy_from_user(&args, uargs, size))
- return -EFAULT;
+ err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+ if (err)
+ return err;
*kargs = (struct kernel_clone_args){
.flags = args.flags,
--
2.23.0
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_{to,from}_user() -- that
the memchr_inv() check is done on a buffer instead of one-at-at-time
with get_user() or put_user().
Signed-off-by: Aleksa Sarai <[email protected]>
---
kernel/sched/core.c | 85 ++++++---------------------------------------
1 file changed, 10 insertions(+), 75 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..2f58b07d3468 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4900,9 +4900,6 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
u32 size;
int ret;
- if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
- return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
@@ -4910,45 +4907,19 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return ret;
- /* Bail out on silly large: */
- if (size > PAGE_SIZE)
- goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
if (size < SCHED_ATTR_SIZE_VER0)
goto err_size;
- /*
- * If we're handed a bigger struct than we know of,
- * ensure all the unknown bits are 0 - i.e. new
- * user-space does not rely on any kernel feature
- * extensions we dont know about yet.
- */
- if (size > sizeof(*attr)) {
- unsigned char __user *addr;
- unsigned char __user *end;
- unsigned char val;
-
- addr = (void __user *)uattr + sizeof(*attr);
- end = (void __user *)uattr + size;
-
- for (; addr < end; addr++) {
- ret = get_user(val, addr);
- if (ret)
- return ret;
- if (val)
- goto err_size;
- }
- size = sizeof(*attr);
+ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+ if (ret) {
+ if (ret == -E2BIG)
+ goto err_size;
+ return ret;
}
- ret = copy_from_user(attr, uattr, size);
- if (ret)
- return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5105,51 +5076,15 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
return retval;
}
-static int sched_read_attr(struct sched_attr __user *uattr,
- struct sched_attr *attr,
- unsigned int usize)
-{
- int ret;
-
- if (!access_ok(uattr, usize))
- return -EFAULT;
-
- /*
- * If we're handed a smaller struct than we know of,
- * ensure all the unknown bits are 0 - i.e. old
- * user-space does not get uncomplete information.
- */
- if (usize < sizeof(*attr)) {
- unsigned char *addr;
- unsigned char *end;
-
- addr = (void *)attr + usize;
- end = (void *)attr + sizeof(*attr);
-
- for (; addr < end; addr++) {
- if (*addr)
- return -EFBIG;
- }
-
- attr->size = usize;
- }
-
- ret = copy_to_user(uattr, attr, attr->size);
- if (ret)
- return -EFAULT;
-
- return 0;
-}
-
/**
* sys_sched_getattr - similar to sched_getparam, but with sched_attr
* @pid: the pid in question.
* @uattr: structure containing the extended parameters.
- * @size: sizeof(attr) for fwd/bwd comp.
+ * @usize: sizeof(attr) for fwd/bwd comp.
* @flags: for future extension.
*/
SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
- unsigned int, size, unsigned int, flags)
+ unsigned int, usize, unsigned int, flags)
{
struct sched_attr attr = {
.size = sizeof(struct sched_attr),
@@ -5157,8 +5092,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
struct task_struct *p;
int retval;
- if (!uattr || pid < 0 || size > PAGE_SIZE ||
- size < SCHED_ATTR_SIZE_VER0 || flags)
+ if (!uattr || pid < 0 || usize > PAGE_SIZE ||
+ usize < SCHED_ATTR_SIZE_VER0 || flags)
return -EINVAL;
rcu_read_lock();
@@ -5188,7 +5123,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_unlock();
- retval = sched_read_attr(uattr, &attr, size);
+ retval = copy_struct_to_user(uattr, usize, &attr, sizeof(attr));
return retval;
out_unlock:
--
2.23.0
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_from_user() -- that the
memchr_inv() check is done on a buffer instead of one-at-at-time with
get_user().
Signed-off-by: Aleksa Sarai <[email protected]>
---
kernel/events/core.c | 45 ++++++++------------------------------------
1 file changed, 8 insertions(+), 37 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..fe5f58443ba6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10498,55 +10498,26 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
u32 size;
int ret;
- if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
- return -EFAULT;
-
- /*
- * zero the full structure, so that a short copy will be nice.
- */
+ /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
ret = get_user(size, &uattr->size);
if (ret)
return ret;
- if (size > PAGE_SIZE) /* silly large */
- goto err_size;
-
- if (!size) /* abi compat */
+ /* ABI compatibility quirk: */
+ if (!size)
size = PERF_ATTR_SIZE_VER0;
-
if (size < PERF_ATTR_SIZE_VER0)
goto err_size;
- /*
- * If we're handed a bigger struct than we know of,
- * ensure all the unknown bits are 0 - i.e. new
- * user-space does not rely on any kernel feature
- * extensions we dont know about yet.
- */
- if (size > sizeof(*attr)) {
- unsigned char __user *addr;
- unsigned char __user *end;
- unsigned char val;
-
- addr = (void __user *)uattr + sizeof(*attr);
- end = (void __user *)uattr + size;
-
- for (; addr < end; addr++) {
- ret = get_user(val, addr);
- if (ret)
- return ret;
- if (val)
- goto err_size;
- }
- size = sizeof(*attr);
+ ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+ if (ret) {
+ if (ret == -E2BIG)
+ goto err_size;
+ return ret;
}
- ret = copy_from_user(attr, uattr, size);
- if (ret)
- return -EFAULT;
-
attr->size = size;
if (attr->__reserved_1)
--
2.23.0
Add the following flags to allow various restrictions on path resolution
(these affect the *entire* resolution, rather than just the final path
component -- as is the case with LOOKUP_FOLLOW).
The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web servers,
archive extraction tools, network file servers, and so on.
These flags are exposed to userspace through openat2(2) in a later
patchset.
* LOOKUP_NO_XDEV: Disallow mount-point crossing (both *down* into one,
or *up* from one). Both bind-mounts and cross-filesystem mounts are
blocked by this flag. The naming is based on "find -xdev" as well as
-EXDEV (though find(1) doesn't walk upwards, the semantics seem
obvious).
* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" (or rather,
magic-link) jumping. This is a very specific restriction, and it
exists because /proc/$pid/fd/... "symlinks" allow for access outside
nd->root and pose risk to container runtimes that don't want to be
tricked into accessing a host path (but do want to allow
no-funny-business symlink resolution).
* LOOKUP_NO_SYMLINKS: Disallows resolution through symlinks of any kind
(including magic-links).
* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
filesystem tree during resolution (you must stay "beneath" the
starting point at all times). Currently this is done by disallowing
".." and absolute paths (either in the given path or found during
symlink resolution) entirely, as well as all magic-link jumping.
The wholesale banning of ".." is because it is currently not safe to
allow ".." resolution (races can cause the path to be moved outside of
the root -- this is conceptually similar to historical chroot(2)
escape attacks). Future patches in this series will address this, and
will re-enable ".." resolution once it is safe. With those patches,
".." resolution will only be allowed if it remains in the root
throughout resolution (such as "a/../b" not "a/../../outside/b").
The banning of magic-link jumping is done because it is not clear
whether semantically they should be allowed -- while some magic-links
are safe there are many that can cause escapes (and once a
resolution is outside of the root, O_BENEATH will no longer detect
it). Future patches may re-enable magic-link jumping when such jumps
would remain inside the root.
The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.
This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.
[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/
Cc: Christian Brauner <[email protected]>
Suggested-by: David Drysdale <[email protected]>
Suggested-by: Al Viro <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++-------
include/linux/namei.h | 7 ++++
2 files changed, 78 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e39b573fcc4d..2e18ce5a313e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -674,7 +674,11 @@ static int unlazy_walk(struct nameidata *nd)
goto out2;
if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
goto out1;
- if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) {
+ if (!nd->root.mnt) {
+ /* Restart from path_init() if nd->root was cleared. */
+ if (nd->flags & LOOKUP_BENEATH)
+ goto out;
+ } else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
goto out;
}
@@ -843,6 +847,13 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+ /* Absolute path arguments to path_init() are allowed. */
+ if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+ return -EXDEV;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1051,6 +1062,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
@@ -1082,14 +1096,22 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
+ if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
+ if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+ return ERR_PTR(-ELOOP);
+ /* Not currently safe. */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return ERR_PTR(-EXDEV);
+ }
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
if (!nd->root.mnt)
set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1270,12 +1292,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}
- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1332,6 +1358,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1352,8 +1380,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;
while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1378,6 +1409,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1392,6 +1425,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
nd->path.mnt = &mounted->mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1480,8 +1515,11 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1490,6 +1528,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
@@ -1704,6 +1744,13 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ /*
+ * LOOKUP_BENEATH resolving ".." is not currently safe -- races
+ * can cause our parent to have moved outside of the root and
+ * us to skip over it.
+ */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
if (nd->flags & LOOKUP_RCU) {
@@ -2170,6 +2217,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
+ int error;
const char *s = nd->name->name;
if (!*s)
@@ -2202,11 +2250,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
+
+ /* Figure out the starting path and root (if needed). */
if (*s == '/') {
set_root(nd);
- if (likely(!nd_jump_root(nd)))
- return s;
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
@@ -2222,7 +2272,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
- return s;
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
@@ -2247,8 +2296,16 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
- return s;
}
+ /* For scoped-lookups we need to set the root to the dirfd as well. */
+ if (flags & LOOKUP_BENEATH) {
+ nd->root = nd->path;
+ if (flags & LOOKUP_RCU)
+ nd->root_seq = nd->seq;
+ else
+ path_get(&nd->root);
+ }
+ return s;
}
static const char *trailing_symlink(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index bd6d3eb7764d..be407415c28a 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -51,6 +51,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_DOWN 0x8000
#define LOOKUP_MAGICLINK_JUMPED 0x10000
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH 0x020000 /* No escaping from starting point. */
+#define LOOKUP_NO_XDEV 0x040000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS 0x080000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS 0x100000 /* No symlink crossing *at all*.
+ Implies LOOKUP_NO_MAGICLINKS. */
+
extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
--
2.23.0
The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
LOOKUP_NO_XDEV and LOOKUP_NO_MAGICLINKS help defend against other
potential attacks in a malicious rootfs scenario.
Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.
[*] At the moment, ".." and magic-link jumping are disallowed for the
same reason it is disabled for LOOKUP_BENEATH -- currently it is not
safe to allow it. Future patches may enable it unconditionally once
we have resolved the possible races (for "..") and semantics (for
magic-link jumping).
The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that
absolute pathnames no longer cause the dirfd to be ignored completely.
The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).
[1]: https://github.com/cyphar/filepath-securejoin
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 41 +++++++++++++++++++++++++++++++----------
include/linux/namei.h | 1 +
2 files changed, 32 insertions(+), 10 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 2e18ce5a313e..0352d275bd13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -676,7 +676,7 @@ static int unlazy_walk(struct nameidata *nd)
goto out1;
if (!nd->root.mnt) {
/* Restart from path_init() if nd->root was cleared. */
- if (nd->flags & LOOKUP_BENEATH)
+ if (nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
goto out;
} else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
@@ -809,10 +809,18 @@ static int complete_walk(struct nameidata *nd)
return status;
}
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
{
struct fs_struct *fs = current->fs;
+ /*
+ * Jumping to the real root as part of LOOKUP_IN_ROOT is a BUG in namei,
+ * but we still have to ensure it doesn't happen because it will cause a
+ * breakout from the dirfd.
+ */
+ if (WARN_ON(nd->flags & LOOKUP_IN_ROOT))
+ return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
@@ -824,6 +832,7 @@ static void set_root(struct nameidata *nd)
} else {
get_fs_root(fs, &nd->root);
}
+ return 0;
}
static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -854,6 +863,11 @@ static int nd_jump_root(struct nameidata *nd)
if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
return -EXDEV;
}
+ if (!nd->root.mnt) {
+ int error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1100,15 +1114,13 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
- if (!nd->root.mnt)
- set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -1744,15 +1756,20 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ int error = 0;
+
/*
* LOOKUP_BENEATH resolving ".." is not currently safe -- races
* can cause our parent to have moved outside of the root and
* us to skip over it.
*/
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return -EXDEV;
- if (!nd->root.mnt)
- set_root(nd);
+ if (!nd->root.mnt) {
+ error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
return follow_dotdot_rcu(nd);
} else
@@ -2251,9 +2268,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
+ /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
+ if (flags & LOOKUP_IN_ROOT)
+ while (*s == '/')
+ s++;
+
/* Figure out the starting path and root (if needed). */
if (*s == '/') {
- set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -2298,7 +2319,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
fdput(f);
}
/* For scoped-lookups we need to set the root to the dirfd as well. */
- if (flags & LOOKUP_BENEATH) {
+ if (flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)) {
nd->root = nd->path;
if (flags & LOOKUP_RCU)
nd->root_seq = nd->seq;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index be407415c28a..ec2c6c588ea7 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -57,6 +57,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_MAGICLINKS 0x080000 /* No /proc/$pid/fd/ "symlink" crossing. */
#define LOOKUP_NO_SYMLINKS 0x100000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT 0x200000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
--
2.23.0
The ability for userspace to "re-open" file descriptors through
/proc/self/fd has been a very useful tool for all sorts of usecases
(container runtimes are one common example). However, the current
interface for doing this has resulted in some pretty subtle security
holes. Userspace can re-open a file descriptor with more permissions
than the original, which can result in cases such as /proc/$pid/exe
being re-opened O_RDWR at a later date even though (by definition)
/proc/$pid/exe cannot be opened for writing. When combined with O_PATH
the results can get even more confusing.
We cannot block this outright. Aside from userspace already depending on
it, it's a useful feature which can actually increase the security of
userspace. For instance, LXC keeps an O_PATH of the container's
/dev/pts/ptmx that gets re-opened to create new ptys and then uses
TIOCGPTPEER to get the slave end. This allows for pty allocation without
resolving paths inside an (untrusted) container's rootfs. There isn't a
trivial way of doing this that is as straight-forward and safe as O_PATH
re-opening.
Instead we have to restrict it in such a way that it doesn't break
(good) users but does block potential attackers. The solution applied in
this patch is to restrict *re-opening* (not resolution through)
magic-links by requiring that mode of the link be obeyed. Normal
symlinks have modes of a+rwx but magic-links have other modes. These
magic-link modes were historically ignored during path resolution, but
they've now been re-purposed for more useful ends.
It is also necessary to define semantics for the mode of an O_PATH
descriptor, since re-opening a magic-link through an O_PATH needs to be
just as restricted as the corresponding magic-link -- otherwise the
above protection can be bypassed. There are two distinct cases:
1. The target is a regular file (not a magic-link). Userspace depends
on being able to re-open the O_PATH of a regular file, so we must
define the mode to be a+rwx.
2. The target is a magic-link. In this case, we simply copy the mode of
the magic-link. This results in an O_PATH of a magic-link
effectively acting as a no-op in terms of how much re-opening
privileges a process has.
CAP_DAC_OVERRIDE can be used to override all of these restrictions, but
we only permit &init_userns's capabilities to affect these semantics.
The reason for this is that there isn't a clear way to track what
user_ns is the original owner of a given O_PATH chain -- thus an
unprivileged user could create a new userns and O_PATH the file
descriptor, owning it. All signs would indicate that the user really
does have CAP_DAC_OVERRIDE over the new descriptor and the protection
would be bypassed. We thus opt for the more conservative approach.
I have run this patch on several machines for several days. So far, the
only processes which have hit this case ("loadkeys" and "kbd_mode" from
the kbd package[1]) gracefully handle the permission error and do not
cause any user-visible problems. In order to give users a heads-up, a
warning is output to dmesg whenever may_open_magiclink() refuses access.
[1]: http://git.altlinux.org/people/legion/packages/kbd.git
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
Documentation/filesystems/path-lookup.rst | 12 +--
fs/internal.h | 1 +
fs/namei.c | 105 +++++++++++++++++++---
fs/open.c | 3 +-
fs/proc/fd.c | 23 ++++-
include/linux/fs.h | 4 +
include/linux/namei.h | 1 +
7 files changed, 130 insertions(+), 19 deletions(-)
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a57d78ec8bee 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1310,12 +1310,14 @@ longer needed.
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
Final-component flags
~~~~~~~~~~~~~~~~~~~~~
diff --git a/fs/internal.h b/fs/internal.h
index 315fcd8d237c..f48449a43626 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -119,6 +119,7 @@ struct open_flags {
int acc_mode;
int intent;
int lookup_flags;
+ fmode_t opath_mask;
};
extern struct file *do_filp_open(int dfd, struct filename *pathname,
const struct open_flags *op);
diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..54d57dad0f91 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
- nd->flags |= LOOKUP_JUMPED;
+ nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
}
static inline void put_link(struct nameidata *nd)
@@ -1066,6 +1066,7 @@ const char *get_link(struct nameidata *nd)
return ERR_PTR(error);
nd->last_type = LAST_BIND;
+ nd->flags &= ~LOOKUP_MAGICLINK_JUMPED;
res = READ_ONCE(inode->i_link);
if (!res) {
const char * (*get)(struct dentry *, struct inode *,
@@ -3501,16 +3502,73 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
return error;
}
-static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
+/**
+ * may_reopen_magiclink - Check permissions for opening a trailing magic-link
+ * @upgrade_mask: the upgrade-mask of the magic-link
+ * @acc_mode: ACC_MODE which the user is attempting
+ *
+ * We block magic-link re-opening if the @upgrade_mask is more strict than the
+ * @acc_mode being requested, unless the user is capable(CAP_DAC_OVERRIDE).
+ *
+ * Returns 0 if successful, -EACCES on error.
+ */
+static int may_open_magiclink(fmode_t upgrade_mask, int acc_mode)
{
- struct path path;
- int error = path_lookupat(nd, flags, &path);
- if (!error) {
- audit_inode(nd->name, path.dentry, 0);
- error = vfs_open(&path, file);
- path_put(&path);
- }
- return error;
+ /*
+ * We only allow for init_userns to be able to override magic-links.
+ * This is done to avoid cases where an unprivileged userns could take
+ * an O_PATH of the fd, resulting in it being very unclear whether
+ * CAP_DAC_OVERRIDE should work on the new O_PATH fd (given that it
+ * pipes through to the underlying file).
+ */
+ if (capable(CAP_DAC_OVERRIDE))
+ return 0;
+
+ if ((acc_mode & MAY_READ) &&
+ !(upgrade_mask & (FMODE_READ | FMODE_PATH_READ)))
+ goto err;
+ if ((acc_mode & MAY_WRITE) &&
+ !(upgrade_mask & (FMODE_WRITE | FMODE_PATH_WRITE)))
+ goto err;
+
+ return 0;
+
+err:
+ pr_warn_ratelimited("%s[%d]: magic-link re-open blocked ('%s%s%s' requested with an upgrade-mask of '%s%s%s%s')",
+ current->comm, task_pid_nr(current),
+ (acc_mode & MAY_READ) ? "r" : "",
+ (acc_mode & MAY_WRITE) ? "w" : "",
+ (acc_mode & MAY_EXEC) ? "x" : "",
+ (upgrade_mask & FMODE_READ) ? "r" : "",
+ (upgrade_mask & FMODE_PATH_READ) ? "R" : "",
+ (upgrade_mask & FMODE_WRITE) ? "w" : "",
+ (upgrade_mask & FMODE_PATH_WRITE) ? "W" : "");
+ return -EACCES;
+}
+
+static int trailing_magiclink(struct nameidata *nd, int acc_mode,
+ fmode_t *opath_mask)
+{
+ struct inode *inode = nd->link_inode;
+ fmode_t upgrade_mask = 0;
+
+ /* Was the trailing_symlink() a magic-link? */
+ if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
+ return 0;
+
+ /*
+ * Figure out the upgrade-mask of the link_inode. Since these aren't
+ * strictly POSIX semantics we don't do an acl_permission_check() here,
+ * so we only care that at least one bit is set for each upgrade-mode.
+ */
+ if (inode->i_mode & S_IRUGO)
+ upgrade_mask |= FMODE_PATH_READ;
+ if (inode->i_mode & S_IWUGO)
+ upgrade_mask |= FMODE_PATH_WRITE;
+ /* Restrict the O_PATH upgrade-mask of the caller. */
+ if (opath_mask)
+ *opath_mask &= upgrade_mask;
+ return may_open_magiclink(upgrade_mask, acc_mode);
}
static struct file *path_openat(struct nameidata *nd,
@@ -3526,13 +3584,38 @@ static struct file *path_openat(struct nameidata *nd,
if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file);
} else if (unlikely(file->f_flags & O_PATH)) {
- error = do_o_path(nd, flags, file);
+ /* Inlined path_lookupat() with a trailing_magiclink() check. */
+ fmode_t opath_mask = op->opath_mask;
+ const char *s = path_init(nd, flags);
+
+ while (!(error = link_path_walk(s, nd))
+ && ((error = lookup_last(nd)) > 0)) {
+ s = trailing_symlink(nd);
+ error = trailing_magiclink(nd, op->acc_mode, &opath_mask);
+ if (error)
+ s = ERR_PTR(error);
+ }
+ if (!error)
+ error = complete_walk(nd);
+
+ if (!error && nd->flags & LOOKUP_DIRECTORY)
+ if (!d_can_lookup(nd->path.dentry))
+ error = -ENOTDIR;
+ if (!error) {
+ audit_inode(nd->name, nd->path.dentry, 0);
+ error = vfs_open(&nd->path, file);
+ file->f_mode |= opath_mask;
+ }
+ terminate_walk(nd);
} else {
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
(error = do_last(nd, file, op)) > 0) {
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
s = trailing_symlink(nd);
+ error = trailing_magiclink(nd, op->acc_mode, NULL);
+ if (error)
+ s = ERR_PTR(error);
}
terminate_walk(nd);
}
diff --git a/fs/open.c b/fs/open.c
index a59abe3c669a..806a75d685e1 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1001,8 +1001,9 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
-
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
+ /* For O_PATH backwards-compatibility we default to an all-set mask. */
+ op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..9b7d8becb002 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -104,11 +104,30 @@ static void tid_fd_update_inode(struct task_struct *task, struct inode *inode,
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
if (S_ISLNK(inode->i_mode)) {
+ /*
+ * Always set +x (depending on the fmode type), since there
+ * currently aren't FMODE_PATH_EXEC restrictions and there is
+ * no O_MAYEXEC yet. This might change in the future, in which
+ * case we will restrict +x.
+ */
unsigned i_mode = S_IFLNK;
+ if (f_mode & FMODE_PATH)
+ i_mode |= S_IXGRP;
+ else
+ i_mode |= S_IXUSR;
+ /*
+ * Construct the mode bits based on the open-mode. The u+rwx
+ * bits are for "ordinary" open modes while g+rwx are for
+ * O_PATH modes.
+ */
if (f_mode & FMODE_READ)
- i_mode |= S_IRUSR | S_IXUSR;
+ i_mode |= S_IRUSR;
if (f_mode & FMODE_WRITE)
- i_mode |= S_IWUSR | S_IXUSR;
+ i_mode |= S_IWUSR;
+ if (f_mode & FMODE_PATH_READ)
+ i_mode |= S_IRGRP;
+ if (f_mode & FMODE_PATH_WRITE)
+ i_mode |= S_IWGRP;
inode->i_mode = i_mode;
}
security_task_to_inode(task, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..a9ad596b28e2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,6 +173,10 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
+/* File is an O_PATH descriptor which can be upgraded to (read, write). */
+#define FMODE_PATH_READ ((__force fmode_t)0x40000000)
+#define FMODE_PATH_WRITE ((__force fmode_t)0x80000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..bd6d3eb7764d 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -49,6 +49,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_ROOT 0x2000
#define LOOKUP_EMPTY 0x4000
#define LOOKUP_DOWN 0x8000
+#define LOOKUP_MAGICLINK_JUMPED 0x10000
extern int path_pts(struct path *path);
--
2.23.0
Now that magic-link modes are obeyed for file re-opening purposes, some
of the pre-existing magic-link modes need to be adjusted to be more
semantically correct.
The most blatant example of this is /proc/self/exe, which had a mode of
a+rwx even though tautologically the file could never be opened for
writing (because it is the current->mm of a live process).
With the new O_PATH restrictions, changing the default mode of these
magic-links allows us to avoid delayed-access attacks such as we saw in
CVE-2019-5736.
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/proc/base.c | 20 ++++++++++----------
fs/proc/namespaces.c | 2 +-
2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..297242174402 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -133,9 +133,9 @@ struct pid_entry {
#define DIR(NAME, MODE, iops, fops) \
NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
-#define LNK(NAME, get_link) \
- NOD(NAME, (S_IFLNK|S_IRWXUGO), \
- &proc_pid_link_inode_operations, NULL, \
+#define LNK(NAME, MODE, get_link) \
+ NOD(NAME, (S_IFLNK|(MODE)), \
+ &proc_pid_link_inode_operations, NULL, \
{ .proc_get_link = get_link } )
#define REG(NAME, MODE, fops) \
NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
@@ -3028,9 +3028,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
- LNK("cwd", proc_cwd_link),
- LNK("root", proc_root_link),
- LNK("exe", proc_exe_link),
+ LNK("cwd", S_IRWXUGO, proc_cwd_link),
+ LNK("root", S_IRWXUGO, proc_root_link),
+ LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
@@ -3429,11 +3429,11 @@ static const struct pid_entry tid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
- LNK("cwd", proc_cwd_link),
- LNK("root", proc_root_link),
- LNK("exe", proc_exe_link),
+ LNK("cwd", S_IRWXUGO, proc_cwd_link),
+ LNK("root", S_IRWXUGO, proc_root_link),
+ LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
- REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
+ REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..cd1e130913f7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry *dentry,
struct inode *inode;
struct proc_inode *ei;
- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO);
if (!inode)
return ERR_PTR(-ENOENT);
--
2.23.0
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely because now they could result in
inconsistent behaviour if resolution encounters a subsequent ".."[*].
The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2)
of a path can be used to "skip over" nd->root and thus escape to the
filesystem above nd->root.
thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
openat2(dirb, "b/c/../../etc/shadow",
{ .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or magic-link). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).
In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
has occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.
The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential:
* If path_is_under() occurs before the rename, then the path will be
resolved -- however the path was originally inside the root and thus
there is no escape (and to userspace it'd look like the rename
occurred after the path was resolved). If path_is_under() occurs
afterwards, the resolution is blocked.
* Subsequent ".." jumps are guaranteed to check path_is_under() -- by
construction, &rename_lock or &mount_lock must have been taken by
the attacker after path_is_under() returned in the victim. Thus ".."
will not be able to escape from the previously-inside-root path.
* Walking down in the moved path is still safe since the entire
subtree was moved (either by rename(2) or MS_MOVE) and because (as
discussed above) walking down is safe.
A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.
[*] It may be acceptable in the future to do a path_is_under() check
after resolving a magic-link and permit resolution if the
nd_jump_link() result is still within the dirfd. However this seems
unlikely to be a feature that people *really* need* -- it can be
added later if it turns out a lot of people want it.
Cc: Al Viro <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namei.c | 45 +++++++++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 0352d275bd13..fd1eb5ce8baa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1758,22 +1758,36 @@ static inline int handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
int error = 0;
- /*
- * LOOKUP_BENEATH resolving ".." is not currently safe -- races
- * can cause our parent to have moved outside of the root and
- * us to skip over it.
- */
- if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
- return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
- if (nd->flags & LOOKUP_RCU) {
- return follow_dotdot_rcu(nd);
- } else
- return follow_dotdot(nd);
+ if (nd->flags & LOOKUP_RCU)
+ error = follow_dotdot_rcu(nd);
+ else
+ error = follow_dotdot(nd);
+ if (error)
+ return error;
+
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+ bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+ bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+ /*
+ * Don't bother checking unless there's a racing
+ * rename(2) or MS_MOVE.
+ */
+ if (likely(!m_retry && !r_retry))
+ return 0;
+
+ if (m_retry && !(nd->flags & LOOKUP_RCU))
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (r_retry)
+ nd->r_seq = read_seqbegin(&rename_lock);
+ if (!path_is_under(&nd->path, &nd->root))
+ return -EXDEV;
+ }
}
return 0;
}
@@ -2245,6 +2259,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
+ nd->r_seq = read_seqbegin(&rename_lock);
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2266,8 +2285,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock);
-
/* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
if (flags & LOOKUP_IN_ROOT)
while (*s == '/')
--
2.23.0
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:
* The new LOOKUP_* flags are intended to be security features, and
openat(2) will silently ignore all unknown flags. This means that
users would need to avoid foot-gunning themselves constantly when
using this interface if it were part of openat(2). This can be fixed
by having userspace libraries handle this for users[1], but should be
avoided if possible.
* Resolution scoping feels like a different operation to the existing
O_* flags. And since openat(2) has limited flag space, it seems to be
quite wasteful to clutter it with 5 flags that are all
resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
its entire purpose is to error out if you encounter a trailing
symlink -- not to scope resolution.
* Other systems would be able to reimplement this syscall allowing for
cross-OS standardisation rather than being hidden amongst O_* flags
which may result in it not being used by all the parties that might
want to use it (file servers, web servers, container runtimes, etc).
* It gives us the opportunity to iterate on the O_PATH interface. In
particular, the new @how->upgrade_mask field for fd re-opening is
only possible because we have a clean slate without needing to re-use
the ACC_MODE flag design nor the existing openat(2) @mode semantics.
To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but also
also provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.
In order to allow for userspace to lock down their usage of file
descriptor re-opening, openat2(2) has the ability for users to disallow
certain re-opening modes through @how->upgrade_mask. At the moment,
there is no UPGRADE_NOEXEC.
[1]: https://github.com/openSUSE/libpathrs
Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 94 ++++++++++++++++-----
include/linux/fcntl.h | 19 ++++-
include/linux/fs.h | 4 +-
include/linux/syscalls.h | 14 ++-
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 42 +++++++++
24 files changed, 168 insertions(+), 30 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+547 common openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 438
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..57f6f592d460 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..8d36f2e2dc89 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..2559925f1924 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..c04385e60833 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index c9c879ec9b6d..ba06cae655c6 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 n32 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index bbce9159caa1..0f3de320ae51 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 n64 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9653591428ec..f108464d09a3 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 o32 openat2 sys_openat2 sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 670d1371aca1..45ddc4485844 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -432,3 +432,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+437 common openat2 sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..a8b5ecb5b602 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
+437 common openat2 sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..16b571c06161 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
+437 common openat2 sys_openat2 sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..a7185cc18626 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..b11c19552022 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c00019abd076..dfa1dc5c8587 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+437 i386 openat2 sys_openat2 __ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..9035647ef236 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+437 common openat2 __x64_sys_openat2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..f0a68013c038 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/fs/open.c b/fs/open.c
index 310b896eecf0..c33a927c9218 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -947,19 +947,27 @@ struct file *open_with_fake_path(const struct path *path, int flags,
}
EXPORT_SYMBOL(open_with_fake_path);
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+static inline int build_open_flags(const struct open_how *how,
+ struct open_flags *op)
{
+ int flags = how->flags;
int lookup_flags = 0;
+ int opath_mask = 0;
int acc_mode = ACC_MODE(flags);
/*
- * Clear out all open flags we don't know about so that we don't report
- * them in fcntl(F_GETFD) or similar interfaces.
+ * Older syscalls still clear these bits before calling
+ * build_open_flags(), but openat2(2) checks all its arguments.
*/
- flags &= VALID_OPEN_FLAGS;
+ if (flags & ~VALID_OPEN_FLAGS)
+ return -EINVAL;
+ if (how->resolve & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;
+ if (!(how->flags & (O_PATH | O_CREAT | __O_TMPFILE)) && how->mode != 0)
+ return -EINVAL;
if (flags & (O_CREAT | __O_TMPFILE))
- op->mode = (mode & S_IALLUGO) | S_IFREG;
+ op->mode = (how->mode & S_IALLUGO) | S_IFREG;
else
op->mode = 0;
@@ -987,6 +995,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
*/
flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
acc_mode = 0;
+
+ /* Allow userspace to restrict the re-opening of O_PATH fds. */
+ if (how->upgrade_mask & ~VALID_UPGRADE_FLAGS)
+ return -EINVAL;
+ if (!(how->upgrade_mask & UPGRADE_NOREAD))
+ opath_mask |= FMODE_PATH_READ;
+ if (!(how->upgrade_mask & UPGRADE_NOWRITE))
+ opath_mask |= FMODE_PATH_WRITE;
}
op->open_flag = flags;
@@ -1002,8 +1018,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
op->acc_mode = acc_mode;
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
- /* For O_PATH backwards-compatibility we default to an all-set mask. */
- op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
+ op->opath_mask = opath_mask;
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
@@ -1017,6 +1032,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_FOLLOW;
if (flags & O_EMPTYPATH)
lookup_flags |= LOOKUP_EMPTY;
+
+ if (how->resolve & RESOLVE_NO_XDEV)
+ lookup_flags |= LOOKUP_NO_XDEV;
+ if (how->resolve & RESOLVE_NO_MAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (how->resolve & RESOLVE_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (how->resolve & RESOLVE_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (how->resolve & RESOLVE_IN_ROOT)
+ lookup_flags |= LOOKUP_IN_ROOT;
+
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1035,8 +1062,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
struct file *file_open_name(struct filename *name, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
- return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+ int err = build_open_flags(&how, &op);
+ if (err)
+ return ERR_PTR(err);
+ return do_filp_open(AT_FDCWD, name, &op);
}
/**
@@ -1067,17 +1097,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
const char *filename, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+ int err = build_open_flags(&how, &op);
if (err)
return ERR_PTR(err);
return do_file_open_root(dentry, mnt, filename, &op);
}
EXPORT_SYMBOL(file_open_root);
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how)
{
struct open_flags op;
- int fd = build_open_flags(flags, mode, &op);
+ int fd = build_open_flags(how, &op);
int empty = 0;
struct filename *tmp;
@@ -1090,7 +1122,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
if (!empty)
op.open_flag &= ~O_EMPTYPATH;
- fd = get_unused_fd_flags(flags);
+ fd = get_unused_fd_flags(how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
@@ -1107,19 +1139,37 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
- if (force_o_largefile())
- flags |= O_LARGEFILE;
-
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ return ksys_open(filename, flags, mode);
}
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
umode_t, mode)
{
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+
+ if (force_o_largefile())
+ how.flags |= O_LARGEFILE;
+
+ return do_sys_open(dfd, filename, &how);
+}
+
+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
+ const struct open_how __user *, how, size_t, usize)
+{
+ int err;
+ struct open_how tmp;
+
+ if (unlikely(usize < OPEN_HOW_SIZE_VER0))
+ return -EINVAL;
+
+ err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
+ if (err)
+ return err;
+
if (force_o_largefile())
- flags |= O_LARGEFILE;
+ tmp.flags |= O_LARGEFILE;
- return do_sys_open(dfd, filename, flags, mode);
+ return do_sys_open(dfd, filename, &tmp);
}
#ifdef CONFIG_COMPAT
@@ -1129,7 +1179,8 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
*/
COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+ return do_sys_open(AT_FDCWD, filename, &how);
}
/*
@@ -1138,7 +1189,8 @@ COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t,
*/
COMPAT_SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, umode_t, mode)
{
- return do_sys_open(dfd, filename, flags, mode);
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+ return do_sys_open(dfd, filename, &how);
}
#endif
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 2868ae6c8fc1..66125211caba 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -4,13 +4,30 @@
#include <uapi/linux/fcntl.h>
-/* list of all valid flags for the open/openat flags argument: */
+/* Should open_how.mode be set for older syscalls wrappers? */
+#define OPEN_HOW_MODE(flags, mode) \
+ (((flags) & (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
+/* Convert older syscall (flags, mode) arguments to open_how. */
+#define OPEN_HOW_FROM(flags, mode) \
+ { .flags = (flags) & VALID_OPEN_FLAGS, \
+ .mode = OPEN_HOW_MODE((flags), (mode)) }
+
+/* List of all valid flags for the open/openat flags argument: */
#define VALID_OPEN_FLAGS \
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+ (UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+ RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a9ad596b28e2..135e4fa773fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2498,8 +2498,8 @@ extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
struct file *filp);
extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t len);
-extern long do_sys_open(int dfd, const char __user *filename, int flags,
- umode_t mode);
+extern long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how);
extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88145da7d140..a249bcb686bb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct rseq;
union bpf_attr;
struct io_uring_params;
struct clone_args;
+struct open_how;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+ const struct open_how *how, size_t size);
asmlinkage long sys_close(unsigned int fd);
asmlinkage long sys_vhangup(void);
@@ -1374,15 +1377,18 @@ static inline int ksys_close(unsigned int fd)
return __close_fd(current->files, fd);
}
-extern long do_sys_open(int dfd, const char __user *filename, int flags,
- umode_t mode);
+extern long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how);
static inline long ksys_open(const char __user *filename, int flags,
umode_t mode)
{
+ struct open_how how = OPEN_HOW_FROM(flags, mode);
+
if (force_o_largefile())
- flags |= O_LARGEFILE;
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ how.flags |= O_LARGEFILE;
+
+ return do_sys_open(AT_FDCWD, filename, &how);
}
extern long do_sys_truncate(const char __user *pathname, loff_t length);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1be0e798e362..b28c11b338ee 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..479baf2da10e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,47 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/**
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates identically to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
+ * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+ __u32 flags;
+ union {
+ __u16 mode;
+ __u16 upgrade_mask;
+ };
+ __u16 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 8 /* sizeof first published struct */
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+
+/* how->upgrade flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.23.0
Userspace has made use of /proc/self/fd very liberally to allow for
descriptors to be re-opened. There are a wide variety of uses for this
feature, but it has always required constructing a pathname and could
not be done without procfs mounted. The obvious solution for this is to
extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.
Now that descriptor re-opening has been made safe through the new
magic-link resolution restrictions, we can replicate these restrictions
for O_EMPTYPATH. In particular, we only allow "upgrading" the file
descriptor if the corresponding FMODE_PATH_* bit is set (or the
FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).
When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and
O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH
re-open an existing file descriptor, and so accommodating them at the
expense of further complicating O_PATH makes little sense. Ultimately,
if users ask for this we can always add RESOLVE_EMPTY_PATH to
resolveat(2) in the future.
Signed-off-by: Aleksa Sarai <[email protected]>
---
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 ++++++++++++++--------------
arch/sparc/include/uapi/asm/fcntl.h | 1 +
fs/fcntl.c | 2 +-
fs/namei.c | 20 ++++++++++++++
fs/open.c | 7 ++++-
include/linux/fcntl.h | 2 +-
include/uapi/asm-generic/fcntl.h | 4 +++
8 files changed, 54 insertions(+), 22 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..1f879bade68b 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
#define O_PATH 040000000
#define __O_TMPFILE 0100000000
+#define O_EMPTYPATH 0200000000
#define F_GETLK 7
#define F_SETLK 8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..5d709058a76f 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
#ifndef _PARISC_FCNTL_H
#define _PARISC_FCNTL_H
-#define O_APPEND 000000010
-#define O_BLKSEEK 000000100 /* HPUX only */
-#define O_CREAT 000000400 /* not fcntl */
-#define O_EXCL 000002000 /* not fcntl */
-#define O_LARGEFILE 000004000
-#define __O_SYNC 000100000
+#define O_APPEND 0000000010
+#define O_BLKSEEK 0000000100 /* HPUX only */
+#define O_CREAT 0000000400 /* not fcntl */
+#define O_EXCL 0000002000 /* not fcntl */
+#define O_LARGEFILE 0000004000
+#define __O_SYNC 0000100000
#define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY 000400000 /* not fcntl */
-#define O_DSYNC 001000000 /* HPUX only */
-#define O_RSYNC 002000000 /* HPUX only */
-#define O_NOATIME 004000000
-#define O_CLOEXEC 010000000 /* set close_on_exec */
-
-#define O_DIRECTORY 000010000 /* must be a directory */
-#define O_NOFOLLOW 000000200 /* don't follow links */
-#define O_INVISIBLE 004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 020000000
-#define __O_TMPFILE 040000000
+#define O_NONBLOCK 0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY 0000400000 /* not fcntl */
+#define O_DSYNC 0001000000 /* HPUX only */
+#define O_RSYNC 0002000000 /* HPUX only */
+#define O_NOATIME 0004000000
+#define O_CLOEXEC 0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY 0000010000 /* must be a directory */
+#define O_NOFOLLOW 0000000200 /* don't follow links */
+#define O_INVISIBLE 0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 0020000000
+#define __O_TMPFILE 0040000000
+#define O_EMPTYPATH 0100000000
#define F_GETLK64 8
#define F_SETLK64 9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..dc86c9eaf950 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
#define O_PATH 0x1000000
#define __O_TMPFILE 0x2000000
+#define O_EMPTYPATH 0x4000000
#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index 54d57dad0f91..e39b573fcc4d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3571,6 +3571,24 @@ static int trailing_magiclink(struct nameidata *nd, int acc_mode,
return may_open_magiclink(upgrade_mask, acc_mode);
}
+static int do_emptypath(struct nameidata *nd, const struct open_flags *op,
+ struct file *file)
+{
+ int error;
+ /* We don't support AT_FDCWD (since O_PATH is disallowed here). */
+ struct fd f = fdget_raw(nd->dfd);
+
+ if (!f.file)
+ return -EBADF;
+
+ /* Apply trailing_magiclink()-like restrictions. */
+ error = may_open_magiclink(f.file->f_mode, op->acc_mode);
+ if (!error)
+ error = vfs_open(&f.file->f_path, file);
+ fdput(f);
+ return error;
+}
+
static struct file *path_openat(struct nameidata *nd,
const struct open_flags *op, unsigned flags)
{
@@ -3583,6 +3601,8 @@ static struct file *path_openat(struct nameidata *nd,
if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file);
+ } else if (unlikely(file->f_flags & O_EMPTYPATH)) {
+ error = do_emptypath(nd, op, file);
} else if (unlikely(file->f_flags & O_PATH)) {
/* Inlined path_lookupat() with a trailing_magiclink() check. */
fmode_t opath_mask = op->opath_mask;
diff --git a/fs/open.c b/fs/open.c
index 806a75d685e1..310b896eecf0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1015,6 +1015,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & O_EMPTYPATH)
+ lookup_flags |= LOOKUP_EMPTY;
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1076,14 +1078,17 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
struct open_flags op;
int fd = build_open_flags(flags, mode, &op);
+ int empty = 0;
struct filename *tmp;
if (fd)
return fd;
- tmp = getname(filename);
+ tmp = getname_flags(filename, op.lookup_flags, &empty);
if (IS_ERR(tmp))
return PTR_ERR(tmp);
+ if (!empty)
+ op.open_flag &= ~O_EMPTYPATH;
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..2868ae6c8fc1 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..ae6862f69cc2 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
#define __O_TMPFILE 020000000
#endif
+#ifndef O_EMPTYPATH
+#define O_EMPTYPATH 040000000
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
--
2.23.0
Test all of the various openat2(2) flags, as well as how file
descriptor re-opening works. A small stress-test of a symlink-rename
attack is included to show that the protections against ".."-based
attacks are sufficient.
In addition, the memfd selftest is fixed to no longer depend on the
now-disallowed functionality of upgrading an O_RDONLY descriptor to
O_RDWR.
Signed-off-by: Aleksa Sarai <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 167 ++++++++
tools/testing/selftests/openat2/helpers.h | 118 +++++
.../testing/selftests/openat2/linkmode_test.c | 333 +++++++++++++++
.../testing/selftests/openat2/openat2_test.c | 106 +++++
.../selftests/openat2/rename_attack_test.c | 127 ++++++
.../testing/selftests/openat2/resolve_test.c | 402 ++++++++++++++++++
10 files changed, 1268 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 25b43a8c2b15..13c02e0d0efc 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
+TARGETS += openat2
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index c67d32eeb668..e71df3d3e55d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
*/
static void test_share_open(char *banner, char *b_suffix)
{
- int fd, fd2;
+ int procfd, fd, fd2;
printf("%s %s %s\n", memfd_str, banner, b_suffix);
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix)
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+ /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */
+ procfd = mfd_assert_open(fd2, O_PATH, 0);
close(fd2);
- fd2 = mfd_assert_open(fd, O_RDWR, 0);
+ fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
mfd_assert_add_seals(fd2, F_SEAL_SEAL);
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+ close(procfd);
close(fd2);
close(fd);
}
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..0b8d42ec4052
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := linkmode_test openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..def6f7720086
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+int raw_openat2(int dfd, const char *path, const void *how, size_t size)
+{
+ int ret = syscall(__NR_openat2, dfd, path, how, size);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, const struct open_how *how)
+{
+ return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, const struct open_how *how)
+{
+ int ret = openat(dfd, path, how->flags, how->mode);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+ int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+char *openat_flags(unsigned int flags)
+{
+ char *flagset, *accmode = "(none)";
+
+ switch (flags & 0x03) {
+ case O_RDWR:
+ accmode = "O_RDWR";
+ break;
+ case O_RDONLY:
+ accmode = "O_RDONLY";
+ break;
+ case O_WRONLY:
+ accmode = "O_WRONLY";
+ break;
+ }
+
+ E_asprintf(&flagset, "%s%s%s",
+ (flags & O_PATH) ? "O_PATH|" : "",
+ (flags & O_CREAT) ? "O_CREAT|" : "",
+ accmode);
+
+ return flagset;
+}
+
+char *openat2_flags(const struct open_how *how)
+{
+ char *p;
+ char *flags_set, *resolve_set, *acc_set, *set;
+
+ flags_set = openat_flags(how->flags);
+
+ E_asprintf(&resolve_set, "%s%s%s%s%s0",
+ (how->resolve & RESOLVE_NO_XDEV) ? "RESOLVE_NO_XDEV|" : "",
+ (how->resolve & RESOLVE_NO_MAGICLINKS) ? "RESOLVE_NO_MAGICLINKS|" : "",
+ (how->resolve & RESOLVE_NO_SYMLINKS) ? "RESOLVE_NO_SYMLINKS|" : "",
+ (how->resolve & RESOLVE_BENEATH) ? "RESOLVE_BENEATH|" : "",
+ (how->resolve & RESOLVE_IN_ROOT) ? "RESOLVE_IN_ROOT|" : "");
+
+ /* Remove trailing "|0". */
+ p = strstr(resolve_set, "|0");
+ if (p)
+ *p = '\0';
+
+ if (how->flags & O_PATH)
+ E_asprintf(&acc_set, ", upgrade_mask=%s%s0",
+ (how->upgrade_mask & UPGRADE_NOREAD) ? "UPGRADE_NOREAD|" : "",
+ (how->upgrade_mask & UPGRADE_NOWRITE) ? "UPGRADE_NOWRITE|" : "");
+ else if (how->flags & O_CREAT)
+ E_asprintf(&acc_set, ", mode=0%o", how->mode);
+ else
+ acc_set = strdup("");
+
+ /* Remove trailing "|0". */
+ p = strstr(acc_set, "|0");
+ if (p)
+ *p = '\0';
+
+ /* And now generate our flagset. */
+ E_asprintf(&set, "[flags=%s, resolve=%s%s]",
+ flags_set, resolve_set, acc_set);
+
+ free(flags_set);
+ free(resolve_set);
+ free(acc_set);
+ return set;
+}
+
+int touchat(int dfd, const char *path)
+{
+ int fd = openat(dfd, path, O_CREAT);
+ if (fd >= 0)
+ close(fd);
+ return fd;
+}
+
+char *fdreadlink(int fd)
+{
+ char *target, *tmp;
+
+ E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+ target = malloc(PATH_MAX);
+ if (!target)
+ ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+ memset(target, 0, PATH_MAX);
+
+ E_readlink(tmp, target, PATH_MAX);
+ free(tmp);
+ return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+ char *fdpath, *dfdpath, *other;
+ bool cmp;
+
+ fdpath = fdreadlink(fd);
+ dfdpath = fdreadlink(dfd);
+
+ if (!path)
+ E_asprintf(&other, "%s", dfdpath);
+ else if (*path == '/')
+ E_asprintf(&other, "%s", path);
+ else
+ E_asprintf(&other, "%s/%s", dfdpath, path);
+
+ cmp = !strcmp(fdpath, other);
+ if (!cmp)
+ ksft_print_msg("fdequal: expected '%s' but got '%s'\n", other, fdpath);
+
+ free(fdpath);
+ free(dfdpath);
+ free(other);
+ return cmp;
+}
+
+void test_openat2_supported(void)
+{
+ struct open_how how = {};
+ int fd = sys_openat2(AT_FDCWD, ".", &how);
+ if (fd == -ENOSYS)
+ ksft_exit_skip("openat2(2) unsupported on this kernel\n");
+ if (fd < 0)
+ ksft_exit_fail_msg("openat2(2) supported check failed: %s\n", strerror(-fd));
+ close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..eb40030664f9
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <errno.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); })))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/**
+ * Arguments for how openat2(2) should open the target path. If @extra is zero,
+ * then openat2 is identical to openat(2). Only one of @mode or @upgrade_mask
+ * may be set at any given time.
+ *
+ * @flags: O_* flags (unknown flags ignored).
+ * @mode: O_CREAT file mode (ignored otherwise).
+ * @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
+ * @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
+ * @reserved: reserved for future extensions, must be zeroed.
+ */
+struct open_how {
+ uint32_t flags;
+ union {
+ uint16_t mode;
+ uint16_t upgrade_mask;
+ };
+ uint16_t resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 8 /* sizeof first published struct */
+
+#ifndef RESOLVE_INROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies OEXT_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#ifndef UPGRADE_NOREAD
+/* how->upgrade flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
+#endif /* UPGRADE_NOREAD */
+
+#ifndef O_EMPTYPATH
+#define O_EMPTYPATH 040000000
+#endif /* O_EMPTYPATH */
+
+#define E_func(func, ...) \
+ do { \
+ if (func(__VA_ARGS__) < 0) \
+ ksft_exit_fail_msg("%s:%d %s failed\n", \
+ __FILE__, __LINE__, #func);\
+ } while (0)
+
+#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__)
+#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__)
+#define E_touchat(...) E_func(touchat, __VA_ARGS__)
+#define E_readlink(...) E_func(readlink, __VA_ARGS__)
+#define E_fstatat(...) E_func(fstatat, __VA_ARGS__)
+#define E_asprintf(...) E_func(asprintf, __VA_ARGS__)
+#define E_fchdir(...) E_func(fchdir, __VA_ARGS__)
+#define E_mount(...) E_func(mount, __VA_ARGS__)
+#define E_unshare(...) E_func(unshare, __VA_ARGS__)
+#define E_setresuid(...) E_func(setresuid, __VA_ARGS__)
+#define E_chmod(...) E_func(chmod, __VA_ARGS__)
+
+#define E_assert(expr, msg, ...) \
+ do { \
+ if (!(expr)) \
+ ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+ __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+ } while (0)
+
+typedef int (*openfunc_t)(int dfd, const char *path, const struct open_how *how);
+
+int raw_openat2(int dfd, const char *path, const void *how, size_t size);
+int sys_openat2(int dfd, const char *path, const struct open_how *how);
+char *openat2_flags(const struct open_how *how);
+
+int sys_openat(int dfd, const char *path, const struct open_how *how);
+char *openat_flags(unsigned int flags);
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+void test_openat2_supported(void);
+
+#endif /* __RESOLVEAT_H__ */
diff --git a/tools/testing/selftests/openat2/linkmode_test.c b/tools/testing/selftests/openat2/linkmode_test.c
new file mode 100644
index 000000000000..44fcba738686
--- /dev/null
+++ b/tools/testing/selftests/openat2/linkmode_test.c
@@ -0,0 +1,333 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+static mode_t fdmode(int fd)
+{
+ char *fdpath;
+ struct stat statbuf;
+ mode_t mode;
+
+ E_asprintf(&fdpath, "/proc/self/fd/%d", fd);
+ E_fstatat(AT_FDCWD, fdpath, &statbuf, AT_SYMLINK_NOFOLLOW);
+ mode = (statbuf.st_mode & ~S_IFMT);
+ free(fdpath);
+
+ return mode;
+}
+
+static int reopen_proc(int fd, unsigned int flags)
+{
+ int ret, saved_errno;
+ char *fdpath;
+
+ E_asprintf(&fdpath, "/proc/self/fd/%d", fd);
+ ret = open(fdpath, flags);
+ saved_errno = errno;
+ free(fdpath);
+
+ return ret >= 0 ? ret : -saved_errno;
+}
+
+static int reopen_oemptypath(int fd, unsigned int flags)
+{
+ int ret = openat(fd, "", O_EMPTYPATH | flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+struct reopen_test {
+ openfunc_t open;
+ mode_t chmod_mode;
+ struct {
+ struct open_how how;
+ mode_t mode;
+ int err;
+ } orig, new;
+};
+
+static bool reopen(int fd, struct reopen_test *test)
+{
+ int newfd;
+ mode_t proc_mode;
+ bool failed = false;
+
+ /* Check that the proc mode is correct. */
+ proc_mode = fdmode(fd);
+ if (proc_mode != test->orig.mode) {
+ ksft_print_msg("incorrect fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->orig.mode);
+ failed = true;
+ }
+
+ /* Re-open through /proc. */
+ newfd = reopen_proc(fd, test->new.how.flags);
+ if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) {
+ ksft_print_msg("/proc failure (%d != %d [%s])\n",
+ newfd, test->new.err, strerror(-test->new.err));
+ failed = true;
+ }
+ if (newfd >= 0) {
+ proc_mode = fdmode(newfd);
+ if (proc_mode != test->new.mode) {
+ ksft_print_msg("/proc wrong fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->new.mode);
+ failed = true;
+ }
+ close(newfd);
+ }
+
+ /* Re-open with O_EMPTYPATH. */
+ newfd = reopen_oemptypath(fd, test->new.how.flags);
+ if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) {
+ ksft_print_msg("O_EMPTYPATH failure (%d != %d [%s])\n",
+ newfd, test->new.err, strerror(-test->new.err));
+ failed = true;
+ }
+ if (newfd >= 0) {
+ proc_mode = fdmode(newfd);
+ if (proc_mode != test->new.mode) {
+ ksft_print_msg("O_EMPTYPATH wrong fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->new.mode);
+ failed = true;
+ }
+ close(newfd);
+ }
+
+ return failed;
+}
+
+#define NUM_REOPEN_TESTS 28
+
+void test_reopen_ordinary(bool privileged)
+{
+ int fd;
+ int err_access = privileged ? 0 : -EACCES;
+ char tmpfile[] = "/tmp/ksft-openat2-reopen-testfile.XXXXXX";
+
+ fd = mkstemp(tmpfile);
+ E_assert(fd >= 0, "mkstemp failed: %m\n");
+ close(fd);
+
+ struct reopen_test tests[] = {
+ /* Re-opening with the same mode should succeed. */
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ /*
+ * Re-opening with a different mode will always fail (with an obvious
+ * carve-out for privileged users).
+ */
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ /* Doubly so if they didn't even have permissions at open-time. */
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ /* O_PATH re-opens (of ordinary files) will always work. */
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+
+ /*
+ * openat2(2) UPGRADE_NO* flags. In the privileged case, the re-open
+ * will work but the mode will still be scoped to the mode (or'd with
+ * the open acc_mode).
+ */
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_REOPEN_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int fd;
+ char *orig_flagset, *new_flagset;
+ struct reopen_test *test = &tests[i];
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ E_chmod(tmpfile, test->chmod_mode);
+
+ fd = test->open(AT_FDCWD, tmpfile, &test->orig.how);
+ E_assert(fd >= 0, "open '%s' failed: %m\n", tmpfile);
+
+ /* Make sure that any EACCES we see is not from inode permissions. */
+ E_chmod(tmpfile, 0777);
+
+ if (reopen(fd, test))
+ resultfn = ksft_test_result_fail;
+
+ close(fd);
+
+ new_flagset = openat_flags(test->new.how.flags);
+ if (test->open == sys_openat)
+ orig_flagset = openat_flags(test->orig.how.flags);
+ else if (test->open == sys_openat2)
+ orig_flagset = openat2_flags(&test->orig.how);
+ else
+ ksft_exit_fail_msg("unknown test->open\n");
+
+ resultfn("%sordinary reopen of (orig[%s]=%s, new=%s) chmod=%.3o %s\n",
+ privileged ? "privileged " : "",
+ test->open == sys_openat ? "openat" : "openat2",
+ orig_flagset, new_flagset, test->chmod_mode,
+ test->new.err < 0 ? strerror(-test->new.err) : "works");
+ fflush(stdout);
+
+ free(new_flagset);
+ free(orig_flagset);
+ }
+
+ unlink(tmpfile);
+}
+
+#define NUM_CLOEXEC_TESTS 1
+
+void test_openat2_cloexec_test(void)
+{
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct open_how how = {
+ .flags = O_CLOEXEC | O_PATH | O_DIRECTORY,
+ };
+
+ int fd = sys_openat2(AT_FDCWD, ".", &how);
+ E_assert(fd >= 0, "open '.' failed: %m\n");
+
+ int flags = fcntl(fd, F_GETFD);
+ E_assert(flags >= 0, "F_GETFD failed: %m\n");
+
+ if (!(flags & FD_CLOEXEC))
+ resultfn = ksft_test_result_fail;
+
+ resultfn("openat2(O_CLOEXEC) works as expected\n");
+}
+
+int main(int argc, char **argv)
+{
+ bool privileged;
+
+ ksft_print_header();
+ ksft_set_plan(2 * NUM_REOPEN_TESTS + NUM_CLOEXEC_TESTS);
+ test_openat2_supported();
+
+ /*
+ * Technically we should be checking CAP_DAC_OVERRIDE, but it's easier to
+ * just assume that euid=0 has the full capability set.
+ */
+ privileged = (geteuid() == 0);
+ if (!privileged)
+ ksft_test_result_skip("privileged tests require euid == 0\n");
+ else {
+ test_reopen_ordinary(privileged);
+
+ E_setresuid(65534, 65534, 65534);
+ privileged = (geteuid() == 0);
+ }
+
+ test_reopen_ordinary(privileged);
+ test_openat2_cloexec_test();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
new file mode 100644
index 000000000000..a6950d91e014
--- /dev/null
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+#define NUM_OPENAT2_TESTS 7
+
+struct open_how_ext {
+ struct open_how inner;
+ uint32_t extra1;
+ char pad1[128];
+ uint32_t extra2;
+ char pad2[128];
+ uint32_t extra3;
+};
+
+struct struct_test {
+ struct open_how_ext arg;
+ size_t size;
+ int err;
+};
+
+void test_openat2_struct(void)
+{
+ struct struct_test tests[] = {
+ /* Normal struct. */
+ { .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how) },
+ /* Bigger struct, with zero padding. */
+ { .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how_ext) },
+
+ /* TODO: Once expanded, check zero-padding. */
+
+ /* Smaller than version-0 struct. */
+ { .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL },
+ { .arg.inner.flags = O_RDONLY,
+ .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL },
+ /* Bigger struct, with non-zero trailing bytes. */
+ { .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int fd;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct struct_test *test = &tests[i];
+
+ fd = raw_openat2(AT_FDCWD, ".", &test->arg, test->size);
+ if (test->err >= 0)
+ failed = (fd < 0);
+ else
+ failed = (fd != test->err);
+ if (fd >= 0)
+ close(fd);
+
+ if (failed)
+ resultfn = ksft_test_result_fail;
+
+ if (test->err >= 0)
+ resultfn("openat2([.], [struct], %ld [kernel:%ld]) ==> [.] [got:%s]\n",
+ test->size, sizeof(struct open_how),
+ (fd >= 0) ? "." : strerror(-fd));
+ else
+ resultfn("openat2([.], [struct], %ld [kernel:%ld]) ==> %s [got:%s]\n",
+ test->size, sizeof(struct open_how),
+ strerror(-test->err),
+ (fd >= 0) ? "." : strerror(-fd));
+ fflush(stdout);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_OPENAT2_TESTS);
+
+ test_openat2_supported();
+ test_openat2_struct();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..39b20ea185d5
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * | `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+ int dfd;
+ char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ E_mkdirat(dfd, "a", 0755);
+ E_mkdirat(dfd, "b", 0755);
+ E_mkdirat(dfd, "a/c", 0755);
+
+ return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+ pid_t child = fork();
+ if (child != 0)
+ return child;
+
+ /* If the parent (the test process) dies, kill ourselves too. */
+ prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+ /* Swap @a and @b. */
+ for (;;)
+ renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+ exit(1);
+}
+
+#define NUM_RENAME_TESTS 1
+#define ROUNDS 400000
+
+void test_rename_attack(void)
+{
+ int dfd, afd, escaped_count = 0;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ pid_t child;
+
+ dfd = setup_testdir();
+ afd = openat(dfd, "a", O_PATH);
+ if (afd < 0)
+ ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+ child = spawn_attack(dfd, "a/c", "b");
+
+ for (int i = 0; i < ROUNDS; i++) {
+ int fd;
+ bool failed;
+ struct open_how how = {
+ .flags = O_PATH,
+ .resolve = RESOLVE_IN_ROOT,
+ };
+ char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+ fd = sys_openat2(afd, victim_path, &how);
+ if (fd < 0)
+ failed = (fd != -EXDEV);
+ else
+ failed = !fdequal(fd, afd, NULL);
+
+ escaped_count += failed;
+ close(fd);
+ }
+
+ if (escaped_count > 0)
+ resultfn = ksft_test_result_fail;
+
+ resultfn("rename attack fails (expected 0 breakouts in %d runs, got %d)\n",
+ ROUNDS, escaped_count);
+
+ /* Should be killed anyway, but might as well make sure. */
+ kill(child, SIGKILL);
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_RENAME_TESTS);
+ test_openat2_supported();
+
+ test_rename_attack();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..8ef3dbb7edbe
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * | |-- self -> ../mnt/
+ * | `-- absself -> /mnt/
+ * |-- etc/
+ * | `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- relsym -> etc/passwd
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ * |-- absself -> /
+ * |-- self -> ../../root/
+ * |-- garbageself -> /../../root/
+ * |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd
+ * |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd
+ * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+ int dfd, tmpfd;
+ char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+ /* Unshare and make /tmp a new directory. */
+ E_unshare(CLONE_NEWNS);
+ E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ /* A sub-directory which is actually used for tests. */
+ E_mkdirat(dfd, "root", 0755);
+ tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+ if (tmpfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+ close(dfd);
+ dfd = tmpfd;
+
+ E_symlinkat("/proc/self/exe", dfd, "procexe");
+ E_symlinkat("/proc/self/root", dfd, "procroot");
+ E_mkdirat(dfd, "root", 0755);
+
+ /* There is no mountat(2), so use chdir. */
+ E_mkdirat(dfd, "mnt", 0755);
+ E_fchdir(dfd);
+ E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+ E_symlinkat("../mnt/", dfd, "mnt/self");
+ E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+ E_mkdirat(dfd, "etc", 0755);
+ E_touchat(dfd, "etc/passwd");
+
+ E_symlinkat("/newfile3", dfd, "creatlink");
+ E_symlinkat("etc/passwd", dfd, "relsym");
+ E_symlinkat("/etc/passwd", dfd, "abssym");
+ E_symlinkat("/cheeky", dfd, "abscheeky");
+
+ E_mkdirat(dfd, "cheeky", 0755);
+
+ E_symlinkat("/", dfd, "cheeky/absself");
+ E_symlinkat("../../root/", dfd, "cheeky/self");
+ E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+ E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+ E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+ E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/dotdotlink");
+ E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/garbagelink");
+
+ return dfd;
+}
+
+struct basic_test {
+ const char *dir;
+ const char *path;
+ struct open_how how;
+ bool pass;
+ union {
+ int err;
+ const char *path;
+ } out;
+};
+
+#define NUM_OPENAT2_OPATH_TESTS 84
+
+void test_openat2_opath_tests(void)
+{
+ int rootfd;
+ char *procselfexe;
+
+ E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+ rootfd = setup_testdir();
+
+ struct basic_test tests[] = {
+ /** RESOLVE_BENEATH **/
+ /* Attempts to cross dirfd should be blocked. */
+ { .path = "/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "..", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "../root/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Only relative paths that stay inside dirfd should work. */
+ { .path = "root", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "relsym", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abssym", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Tricky paths should fail. */
+ { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_IN_ROOT **/
+ /* All attempts to cross the dirfd will be scoped-to-root. */
+ { .path = "/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "..", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "root", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "relsym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abssym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ /* O_CREAT should handle trailing symlinks correctly. */
+ { .path = "newfile1", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile1", .pass = true },
+ { .path = "/newfile2", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile2", .pass = true },
+ { .path = "/creatlink", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile3", .pass = true },
+
+ /** RESOLVE_NO_XDEV **/
+ /* Crossing *down* into a mountpoint is disallowed. */
+ { .path = "mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Crossing *up* out of a mountpoint is disallowed. */
+ { .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "mnt", .pass = true },
+ { .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Jumping to "/" is ok, but later components cannot cross. */
+ { .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_NO_MAGICLINKS **/
+ /* Regular symlinks should work. */
+ { .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Magic-links should not work. */
+ { .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/exe", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = procselfexe, .pass = true },
+
+ /** RESOLVE_NO_SYMLINKS **/
+ /* Normal paths should work. */
+ { .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = NULL, .pass = true },
+ { .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Regular symlinks are blocked. */
+ { .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ /* Trailing symlinks with NO_FOLLOW. */
+ { .path = "relsym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "relsym", .pass = true },
+ { .path = "abssym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "abssym", .pass = true },
+ { .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "cheeky/garbagelink", .pass = true },
+ { .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/absself", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int dfd, fd;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct basic_test *test = &tests[i];
+ char *flagstr;
+
+ /* Auto-set O_PATH. */
+ if (!(test->how.flags & O_CREAT))
+ test->how.flags |= O_PATH;
+ flagstr = openat2_flags(&test->how);
+
+ if (test->dir)
+ dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+ else
+ dfd = dup(rootfd);
+ if (dfd < 0) {
+ resultfn = ksft_test_result_error;
+ goto next;
+ }
+
+ fd = sys_openat2(dfd, test->path, &test->how);
+ if (test->pass)
+ failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+ else
+ failed = (fd != test->out.err);
+ if (fd >= 0)
+ close(fd);
+ close(dfd);
+
+ if (failed)
+ resultfn = ksft_test_result_fail;
+
+next:
+ if (test->pass)
+ resultfn("openat2(root[%s], %s, %s) ==> %s\n",
+ test->dir ?: ".", test->path, flagstr,
+ test->out.path ?: ".");
+ else
+ resultfn("openat2(root[%s], %s, %s) ==> %d (%s)\n",
+ test->dir ?: ".", test->path, flagstr,
+ test->out.err, strerror(-test->out.err));
+ fflush(stdout);
+
+ free(flagstr);
+ }
+
+ free(procselfexe);
+ close(rootfd);
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_OPENAT2_OPATH_TESTS);
+ test_openat2_supported();
+
+ /* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+ if (geteuid() != 0)
+ ksft_exit_skip("openat2(2) tests require euid == 0\n");
+
+ test_openat2_opath_tests();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
--
2.23.0
On Wed, Sep 4, 2019 at 1:20 PM Aleksa Sarai <[email protected]> wrote:
>
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases).
Ack, this makes the whole series (and a few unrelated system calls) cleaner.
Linus
Hi,
just kernel-doc fixes:
On 9/4/19 1:19 PM, Aleksa Sarai wrote:
>
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index 000000000000..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +
> +#define BUFFER_SIZE 64
> +
> +
> +/**
> + * copy_struct_to_user: copy a struct to user space
use correct format:
* copy_struct_to_user - copy a struct to user space
> + * @dst: Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src: Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * // do something with karg
> + *
> + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> + * older user space. In order to avoid user space getting incomplete
> + * information (new fields might be important), all trailing bytes in @src
> + * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned.
> + * * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> + * zero-filled.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
> +
> +/**
same here:
> + * copy_struct_from_user: copy a struct from user space
* copy_struct_from_user - copy a struct from user space
> + * @dst: Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src: Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then the user space has passed an old struct to a
> + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + * are to be zero-filled.
> + * * If @usize > @ksize, then the user space has passed a new struct to an
> + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
> +
> + addr += bufsize;
> + rest -= bufsize;
> + }
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_from_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_from_user);
>
thanks.
--
~Randy
Hi,
just noisy nits here:
On 9/4/19 1:19 PM, Aleksa Sarai wrote:
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..479baf2da10e 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,47 @@
>
> #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
>
> +/**
/** means "the following is kernel-doc", but it's not, so please either make
it kernel-doc format or just use /* to begin the comment.
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates identically to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
> + * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __u32 flags;
> + union {
> + __u16 mode;
> + __u16 upgrade_mask;
> + };
> + __u16 resolve;
> +};
--
~Randy
On Wed, Sep 4, 2019 at 1:23 PM Aleksa Sarai <[email protected]> wrote:
>
> This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
> ".." resolution (in the case of LOOKUP_BENEATH the resolution will still
> fail if ".." resolution would resolve a path outside of the root --
> while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
> are still disallowed entirely because now they could result in
> inconsistent behaviour if resolution encounters a subsequent ".."[*].
This is the only patch in the series that makes me go "umm".
Why is it ok to re-initialize m_seq, which is used by other things
too? I think it's because we're out of RCU lookup, but there's no
comment about it, and it looks iffy to me. I'd rather have a separate
sequence count that doesn't have two users with different lifetime
rules.
But even apart from that, I think from a "patch continuity" standpoint
it would be better to introduce the sequence counts as just an error
condition first - iow, not have the "path_is_under()" check, but just
return -EXDEV if the sequence number doesn't match.
So you'd have three stages:
1) ".." always returns -EXDEV
2) ".." returns -EXDEV if there was a concurrent rename/mount
3) ".." returns -EXDEV if there was a concurrent rename/mount and we
reset the sequence numbers and check if you escaped.
becasue the sequence number reset really does make me go "hmm", plus I
get this nagging little feeling in the back of my head that you can
cause nasty O(n^2) lookup cost behavior with deep paths, lots of "..",
and repeated path_is_under() calls.
So (1) sounds safe. (2) sounds simple. And (3) is where I think subtle
things start happening.
Also, I'm not 100% convinced that (3) is needed at all. I think the
retry could be done in user space instead, which needs to have a
fallback anyway. Yes? No?
Linus
On Wed, Sep 4, 2019 at 2:09 PM Linus Torvalds
<[email protected]> wrote:
>
> So you'd have three stages:
>
> 1) ".." always returns -EXDEV
>
> 2) ".." returns -EXDEV if there was a concurrent rename/mount
>
> 3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> reset the sequence numbers and check if you escaped.
In fact, I wonder if this should return -EAGAIN instead - to say that
"retrying may work".
Because then:
> Also, I'm not 100% convinced that (3) is needed at all. I think the
> retry could be done in user space instead, which needs to have a
> fallback anyway. Yes? No?
Any user mode fallback would want to know whether it's a final error
or whether simply re-trying might make it work again.
I think that re-try case is valid for any of the possible "races
happened, we can't guarantee that it's safe", and retrying inside the
kernel (or doing that re-validation) could have latency issues.
Maybe ".." is the only such case. I can't think of any other ones in
your series, but at least conceptually they could happen. For example,
we've had people who wanted pathname lookup without any IO happening,
because if you have to wait for IO you could want to use another
thread etc if you're doing some server in user space..
Linus
On Wed, Sep 4, 2019 at 2:35 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Sep 4, 2019 at 2:09 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > So you'd have three stages:
> >
> > 1) ".." always returns -EXDEV
> >
> > 2) ".." returns -EXDEV if there was a concurrent rename/mount
> >
> > 3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> > reset the sequence numbers and check if you escaped.
>
> In fact, I wonder if this should return -EAGAIN instead - to say that
> "retrying may work".
And here "this" was meant to be "case 2" - I was moving the quoted
text around and didn't fix my wording, so now it is ambiguous or
implies #3, which would be crazy.
Sorry for the confusion,
Linus
On 2019-09-04, Linus Torvalds <[email protected]> wrote:
> On Wed, Sep 4, 2019 at 1:23 PM Aleksa Sarai <[email protected]> wrote:
> > This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
> > ".." resolution (in the case of LOOKUP_BENEATH the resolution will still
> > fail if ".." resolution would resolve a path outside of the root --
> > while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
> > are still disallowed entirely because now they could result in
> > inconsistent behaviour if resolution encounters a subsequent ".."[*].
>
> This is the only patch in the series that makes me go "umm".
>
> Why is it ok to re-initialize m_seq, which is used by other things
> too? I think it's because we're out of RCU lookup, but there's no
> comment about it, and it looks iffy to me. I'd rather have a separate
> sequence count that doesn't have two users with different lifetime
> rules.
Yeah, the reasoning was that it's because we're out of RCU lookup and if
we didn't re-grab ->m_seq we'd hit path_is_under() on every subsequent
".." (even though we've checked that it's safe). But yes, I should've
used a different field to avoid confusion (and stop it looking
unnecessarily dodgy). I will fix that.
> But even apart from that, I think from a "patch continuity" standpoint
> it would be better to introduce the sequence counts as just an error
> condition first - iow, not have the "path_is_under()" check, but just
> return -EXDEV if the sequence number doesn't match.
Ack, will do.
> So you'd have three stages:
>
> 1) ".." always returns -EXDEV
>
> 2) ".." returns -EXDEV if there was a concurrent rename/mount
>
> 3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> reset the sequence numbers and check if you escaped.
>
> becasue the sequence number reset really does make me go "hmm", plus I
> get this nagging little feeling in the back of my head that you can
> cause nasty O(n^2) lookup cost behavior with deep paths, lots of "..",
> and repeated path_is_under() calls.
The reason for doing the concurrent-{rename,mount} checks was to try to
avoid the O(n^2) in most cases, but you're right that if you have an
attacker that is spamming renames (or you're on a box with a lot of
renames and/or mounts going on *anywhere*) you will hit an O(n^2) here
(more pedantically, O(m*n) but who's counting?).
Unfortunately, I'm not sure what the best solution would be for this
one. If -EAGAIN retries are on the table, we could limit how many times
we're willing to do path_is_under() and then just return -EAGAIN.
> So (1) sounds safe. (2) sounds simple. And (3) is where I think subtle
> things start happening.
>
> Also, I'm not 100% convinced that (3) is needed at all. I think the
> retry could be done in user space instead, which needs to have a
> fallback anyway. Yes? No?
Hinting to userspace to do a retry (with -EAGAIN as you mention in your
other mail) wouldn't be a bad thing at all, though you'd almost
certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
global for the entire machine, after all.
But if the only significant roadblock is that (3) seems a bit too hairy,
I would be quite happy with landing (2) as a first step (with -EAGAIN).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Wed, Sep 4, 2019 at 2:49 PM Aleksa Sarai <[email protected]> wrote:
>
> Hinting to userspace to do a retry (with -EAGAIN as you mention in your
> other mail) wouldn't be a bad thing at all, though you'd almost
> certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
> global for the entire machine, after all.
I'd hope that we have some future (possibly very long-term)
alternative that is not quite system-global, but yes, right now they
are.
Which is one reason I'd rather see EAGAIN in user space - yes, it
probably makes it even easier to trigger, but it also means that user
space might be able to do something about it when it does trigger.
For example, maybe user space can first just use an untrusted path
as-is, and if it gets EAGAIN or EXDEV, it may be that user space can
simplify the path (ie turn "xyz/.../abc" into just "abc".
And even if user space doesn't do anything like that, I suspect a
performance problem is going to be a whole lot easier to debug and
report when somebody ends up seeing excessive retries happening. As a
developer you'll see it in profiles or in system call traces, rather
than it resulting in very odd possible slowdowns for the kernel.
And yeah, it would probably be best to then at least delay doing
option 3 indefinitely, just to make sure user space knows about and
actually has a test-case for that EAGAIN happening.
Linus
Linus Torvalds <[email protected]> wrote:
> > Hinting to userspace to do a retry (with -EAGAIN as you mention in your
> > other mail) wouldn't be a bad thing at all, though you'd almost
> > certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
> > global for the entire machine, after all.
>
> I'd hope that we have some future (possibly very long-term)
> alternative that is not quite system-global, but yes, right now they
> are.
It ought to be reasonably easy to make them per-sb at least, I think. We
don't allow cross-super rename, right?
David
On Wed, Sep 4, 2019 at 3:31 PM David Howells <[email protected]> wrote:
>
> It ought to be reasonably easy to make them per-sb at least, I think. We
> don't allow cross-super rename, right?
Right now the sequence count handling very much depends on it being a
global entity on the reader side, at least.
And while the rename sequence count could (and probably should) be
per-sb, the same is very much not true of the mount one.
So the rename seqcount is likely easier to fix than the mount one, but
neither of them are entirely trivial, afaik.
Linus
On Wed, Sep 04, 2019 at 03:38:20PM -0700, Linus Torvalds wrote:
> On Wed, Sep 4, 2019 at 3:31 PM David Howells <[email protected]> wrote:
> >
> > It ought to be reasonably easy to make them per-sb at least, I think. We
> > don't allow cross-super rename, right?
>
> Right now the sequence count handling very much depends on it being a
> global entity on the reader side, at least.
>
> And while the rename sequence count could (and probably should) be
> per-sb, the same is very much not true of the mount one.
Huh? That will cost us having to have a per-superblock dentry
hash table; recall that lockless lockup can give false negatives
if something gets moved from chain to chain, and rename_lock is
first and foremost used to catch those and retry. If we split
it on per-superblock basis, we can't have dentries from different
superblocks in the same chain anymore...
On Wed, Sep 4, 2019 at 4:29 PM Al Viro <[email protected]> wrote:
>
> On Wed, Sep 04, 2019 at 03:38:20PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 4, 2019 at 3:31 PM David Howells <[email protected]> wrote:
> > >
> > > It ought to be reasonably easy to make them per-sb at least, I think. We
> > > don't allow cross-super rename, right?
> >
> > Right now the sequence count handling very much depends on it being a
> > global entity on the reader side, at least.
> >
> > And while the rename sequence count could (and probably should) be
> > per-sb, the same is very much not true of the mount one.
>
> Huh? That will cost us having to have a per-superblock dentry
> hash table; recall that lockless lockup can give false negatives
> if something gets moved from chain to chain, and rename_lock is
> first and foremost used to catch those and retry. If we split
> it on per-superblock basis, we can't have dentries from different
> superblocks in the same chain anymore...
That's exactly the "very much depends on it being a global entity on
the reader side" thing.
I'm not convinced that's the _only_ way to handle things. Maybe a
combination of (wild handwaving) per-hashqueue sequence count and some
clever scheme for pathname handling could work.
I've not personally seen a load where the global rename lock has been
a problem (very few things really do a lot of renames), but
system-wide locks do make me nervous.
We have other (and worse) ones. tasklist_lock comes to mind.
Linus
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst: Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src: Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * // do something with karg
> + *
> + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> + * older user space. In order to avoid user space getting incomplete
> + * information (new fields might be important), all trailing bytes in @src
> + * (@ksize - @usize) must be zerored
s/zerored/zero/, right?
> , otherwise -EFBIG is returned.
'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> + * * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> + * zero-filled.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Not documented above. Implementation consistent with *from*, but see
below.
> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
> +
> +/**
> + * copy_struct_from_user: copy a struct from user space
> + * @dst: Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src: Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then the user space has passed an old struct to a
> + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + * are to be zero-filled.
> + * * If @usize > @ksize, then the user space has passed a new struct to an
> + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Documented above as returning -E2BIG.
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
Isn't that too big for on-stack?
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
> +
> + addr += bufsize;
> + rest -= bufsize;
> + }
The perf implementation uses get_user(); but if that is too slow, surely
we can do something with uaccess_try() here?
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_from_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_from_user);
And personally I'm not a big fan of EXPORT_SYMBOL().
On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/**
> > + * copy_struct_to_user: copy a struct to user space
> > + * @dst: Destination address, in user space.
> > + * @usize: Size of @dst struct.
> > + * @src: Source address, in kernel space.
> > + * @ksize: Size of @src struct.
> > + *
> > + * Copies a struct from kernel space to user space, in a way that guarantees
> > + * backwards-compatibility for struct syscall arguments (as long as future
> > + * struct extensions are made such that all new fields are *appended* to the
> > + * old struct, and zeroed-out new fields have the same meaning as the old
> > + * struct).
> > + *
> > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > + * The recommended usage is something like the following:
> > + *
> > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > + * {
> > + * int err;
> > + * struct foo karg = {};
> > + *
> > + * // do something with karg
> > + *
> > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > + * if (err)
> > + * return err;
> > + *
> > + * // ...
> > + * }
> > + *
> > + * There are three cases to consider:
> > + * * If @usize == @ksize, then it's copied verbatim.
> > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > + * older user space. In order to avoid user space getting incomplete
> > + * information (new fields might be important), all trailing bytes in @src
> > + * (@ksize - @usize) must be zerored
>
> s/zerored/zero/, right?
It should've been "zeroed".
> > , otherwise -EFBIG is returned.
>
> 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
a "too big" struct passed to the kernel, and EFBIG for a "too big"
struct passed to user-space. I would personally have preferred EMSGSIZE
instead of EFBIG, but felt using the existing error codes would be less
confusing.
>
> > + * * If @usize > @ksize, then the kernel is "returning" an older struct to a
> > + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> > + * zero-filled.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> > + * * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > + const void *src, size_t ksize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Not documented above. Implementation consistent with *from*, but see
> below.
Will update the kernel-doc.
> > + if (unlikely(!access_ok(dst, usize)))
> > + return -EFAULT;
> > +
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize) {
> > + if (memchr_inv(src + size, 0, rest))
> > + return -EFBIG;
> > + } else if (usize > ksize) {
> > + if (__memzero_user(dst + size, rest))
> > + return -EFAULT;
> > + }
> > + /* Copy the interoperable parts of the struct. */
> > + if (__copy_to_user(dst, src, size))
> > + return -EFAULT;
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(copy_struct_to_user);
> > +
> > +/**
> > + * copy_struct_from_user: copy a struct from user space
> > + * @dst: Destination address, in kernel space. This buffer must be @ksize
> > + * bytes long.
> > + * @ksize: Size of @dst struct.
> > + * @src: Source address, in user space.
> > + * @usize: (Alleged) size of @src struct.
> > + *
> > + * Copies a struct from user space to kernel space, in a way that guarantees
> > + * backwards-compatibility for struct syscall arguments (as long as future
> > + * struct extensions are made such that all new fields are *appended* to the
> > + * old struct, and zeroed-out new fields have the same meaning as the old
> > + * struct).
> > + *
> > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > + * The recommended usage is something like the following:
> > + *
> > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> > + * {
> > + * int err;
> > + * struct foo karg = {};
> > + *
> > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> > + * if (err)
> > + * return err;
> > + *
> > + * // ...
> > + * }
> > + *
> > + * There are three cases to consider:
> > + * * If @usize == @ksize, then it's copied verbatim.
> > + * * If @usize < @ksize, then the user space has passed an old struct to a
> > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> > + * are to be zero-filled.
> > + * * If @usize > @ksize, then the user space has passed a new struct to an
> > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> > + * * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_from_user(void *dst, size_t ksize,
> > + const void __user *src, size_t usize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Documented above as returning -E2BIG.
I will switch this (and to) back to -E2BIG -- I must've had a brain-fart
when doing some refactoring.
>
> > + if (unlikely(!access_ok(src, usize)))
> > + return -EFAULT;
> > +
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize)
> > + memset(dst + size, 0, rest);
> > + else if (usize > ksize) {
> > + const void __user *addr = src + size;
> > + char buffer[BUFFER_SIZE] = {};
>
> Isn't that too big for on-stack?
Is a 64-byte buffer too big? I picked the number "at random" to be the
size of a cache line, but I could shrink it down to 32 bytes if the size
is an issue (I wanted to avoid needless allocations -- hence it being
on-stack).
> > +
> > + while (rest > 0) {
> > + size_t bufsize = min(rest, sizeof(buffer));
> > +
> > + if (__copy_from_user(buffer, addr, bufsize))
> > + return -EFAULT;
> > + if (memchr_inv(buffer, 0, bufsize))
> > + return -E2BIG;
> > +
> > + addr += bufsize;
> > + rest -= bufsize;
> > + }
>
> The perf implementation uses get_user(); but if that is too slow, surely
> we can do something with uaccess_try() here?
Is there a non-x86-specific way to do that (unless I'm mistaken only x86
has uaccess_try() or the other *_try() wrappers)? The main "performance
improvement" (if you can even call it that) is that we use memchr_inv()
which finds non-matching characters more efficiently than just doing a
loop.
> > + }
> > + /* Copy the interoperable parts of the struct. */
> > + if (__copy_from_user(dst, src, size))
> > + return -EFAULT;
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(copy_struct_from_user);
>
> And personally I'm not a big fan of EXPORT_SYMBOL().
I don't have much of an opinion (after all, it only really makes sense a
lot of sense for syscalls) -- though out-of-tree modules that define
ioctl()s wouldn't be able to make use of them.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > +/**
> > > + * copy_struct_to_user: copy a struct to user space
> > > + * @dst: Destination address, in user space.
> > > + * @usize: Size of @dst struct.
> > > + * @src: Source address, in kernel space.
> > > + * @ksize: Size of @src struct.
> > > + *
> > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > + * struct extensions are made such that all new fields are *appended* to the
> > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > + * struct).
> > > + *
> > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > + * The recommended usage is something like the following:
> > > + *
> > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > + * {
> > > + * int err;
> > > + * struct foo karg = {};
> > > + *
> > > + * // do something with karg
> > > + *
> > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > + * if (err)
> > > + * return err;
> > > + *
> > > + * // ...
> > > + * }
> > > + *
> > > + * There are three cases to consider:
> > > + * * If @usize == @ksize, then it's copied verbatim.
> > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > + * older user space. In order to avoid user space getting incomplete
> > > + * information (new fields might be important), all trailing bytes in @src
> > > + * (@ksize - @usize) must be zerored
> >
> > s/zerored/zero/, right?
>
> It should've been "zeroed".
That reads wrong to me; that way it reads like this function must take
that action and zero out the 'rest'; which is just wrong.
This function must verify those bytes are zero, not make them zero.
> > > , otherwise -EFBIG is returned.
> >
> > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
>
> This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> a "too big" struct passed to the kernel, and EFBIG for a "too big"
> struct passed to user-space. I would personally have preferred EMSGSIZE
> instead of EFBIG, but felt using the existing error codes would be less
> confusing.
Sadly a recent commit:
1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
Made the situation even 'worse'.
> > > + if (unlikely(!access_ok(src, usize)))
> > > + return -EFAULT;
> > > +
> > > + /* Deal with trailing bytes. */
> > > + if (usize < ksize)
> > > + memset(dst + size, 0, rest);
> > > + else if (usize > ksize) {
> > > + const void __user *addr = src + size;
> > > + char buffer[BUFFER_SIZE] = {};
> >
> > Isn't that too big for on-stack?
>
> Is a 64-byte buffer too big? I picked the number "at random" to be the
> size of a cache line, but I could shrink it down to 32 bytes if the size
> is an issue (I wanted to avoid needless allocations -- hence it being
> on-stack).
Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose
64 should be OK.
> > > +
> > > + while (rest > 0) {
> > > + size_t bufsize = min(rest, sizeof(buffer));
> > > +
> > > + if (__copy_from_user(buffer, addr, bufsize))
> > > + return -EFAULT;
> > > + if (memchr_inv(buffer, 0, bufsize))
> > > + return -E2BIG;
> > > +
> > > + addr += bufsize;
> > > + rest -= bufsize;
> > > + }
> >
> > The perf implementation uses get_user(); but if that is too slow, surely
> > we can do something with uaccess_try() here?
>
> Is there a non-x86-specific way to do that (unless I'm mistaken only x86
> has uaccess_try() or the other *_try() wrappers)? The main "performance
> improvement" (if you can even call it that) is that we use memchr_inv()
> which finds non-matching characters more efficiently than just doing a
> loop.
Oh, you're right, that's x86 only :/
On 2019-09-05, Rasmus Villemoes <[email protected]> wrote:
> On 04/09/2019 22.19, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> >
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> >
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > always rejects differently-sized struct arguments.
> >
> > Suggested-by: Rasmus Villemoes <[email protected]>
> > Signed-off-by: Aleksa Sarai <[email protected]>
> > ---
> > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > new file mode 100644
> > index 000000000000..7301ab1bbe98
> > --- /dev/null
> > +++ b/lib/struct_user.c
> > @@ -0,0 +1,182 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (C) 2019 SUSE LLC
> > + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/export.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/kernel.h>
> > +#include <linux/string.h>
> > +
> > +#define BUFFER_SIZE 64
> > +
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
>
> Isn't this __clear_user() exactly (perhaps except for the return value)?
> Perhaps not every arch has that?
I didn't know about clear_user() -- I will switch to it.
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > + const char zeros[BUFFER_SIZE] = {};
> > + while (s > 0) {
> > + size_t n = min(s, sizeof(zeros));
> > +
> > + if (__copy_to_user(p, zeros, n))
> > + return -EFAULT;
> > +
> > + p += n;
> > + s -= n;
> > + }
> > + return 0;
> > +}
> > +
> > +/**
> > + * copy_struct_to_user: copy a struct to user space
> > + * @dst: Destination address, in user space.
> > + * @usize: Size of @dst struct.
> > + * @src: Source address, in kernel space.
> > + * @ksize: Size of @src struct.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> > + * * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > + const void *src, size_t ksize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
>
> Eh, I'd avoid abs() here due to the funkiness of the implicit type
> conversions - ksize-usize has type size_t, then that's coerced to an int
> (or a long maybe?), the abs is applied which return an int/long (or
> unsigned versions?). Something like "rest = max(ksize, usize) - size;"
> is more obviously correct and doesn't fall into any
> narrowing/widening/sign extending traps.
Yeah, I originally used "max(ksize, usize) - size" for that reason but
was worried it looked too funky (and some quick tests showed that abs()
gives the right results in most cases -- though I just realised it would
probably not give the right results around SIZE_MAX). I'll switch back.
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Please don't. That is a restriction on all future extensions - once a
> kernel is shipped with a syscall using this helper with that arbitrary
> restriction in place, that syscall is forever prevented from extending
> its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
> it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
> weren't enough for everybody?
>
> This is only for future compatibility, and if someone runs an app
> compiled against 7.3 headers on a 5.4 kernel, they probably don't care
> about performance, but they would like their app to run.
I'm not sure I agree that the limit is in place *forever* -- it's
generally not a break in compatibility to convert an error into a
success (though, there are counterexamples such as mknod(2) -- but that
was a very specific case).
You're right that it would mean that some very new code won't run on
very ancient kernels (assuming we ever pass around structs that
massive), but there should be a reasonable trade-off here IMHO.
If we allow very large sizes, a program could probably DoS the kernel by
allocating a moderately-large block of memory and then spawning a bunch
of threads that all cause the kernel to re-check that the same 1GB block
of memory is zeroed. I haven't tried, but it seems like it's best to
avoid the possibility altogether.
> > + }
> > + /* Copy the interoperable parts of the struct. */
> > + if (__copy_to_user(dst, src, size))
> > + return -EFAULT;
>
> I think I understand why you put this last instead of handling the
> buffer in the "natural" order. However,
> I'm wondering whether we should actually do this copy before checking
> that the extra kernel bytes are 0 - the user will still be told that
> there was some extra information via the -EFBIG/-E2BIG return, but maybe
> in some cases the part he understands is good enough. But I also guess
> we have to look to existing users to see whether that would prevent them
> from being converted to using this helper.
>
> linux-api folks, WDYT?
Regarding the order, I just copied what sched and perf already do. I
wouldn't mind doing it the other way around -- though I am a little
cautious about implicitly making guarantees like that. The syscall that
uses copy_struct_to_user() might not want to make that guarantee (it
might not make sense for them), and there are some -E2BIG returns that
won't result in data being copied (usize > PAGE_SIZE).
As for feedback, this is syscall-dependent at the moment. The sched and
perf users explicitly return the size of the kernel structure (by
overwriting uattr->size if -E2BIG is returned) for copies in either
direction. So users arguably already have some kind of feedback about
size issues. clone3() on the other hand doesn't do that (though it
doesn't copy anything to user-space so this isn't relevant to this
particular question).
Effectively, I'd like to see someone argue that this is something that
they would personally want (before we do it).
> > + return 0;
>
> Maybe more useful to "return size;", some users might want to know/pass
> on how much was actually copied.
Even though it is "just" min(ksize, usize), I don't see any harm in
returning it. Will do.
> > +}
> > +EXPORT_SYMBOL(copy_struct_to_user);
>
> Can't we wait with this until a modular user shows up? The primary users
> are syscalls, which can't be modular AFAIK.
Yeah, I'll drop it. You could use them for ioctl()s but we can always
add EXPORT_SYMBOL() later.
> > +/**
> > + * copy_struct_from_user: copy a struct from user space
> > + * @dst: Destination address, in kernel space. This buffer must be @ksize
> > + * bytes long.
> > + * @ksize: Size of @dst struct.
> > + * @src: Source address, in user space.
> > + * @usize: (Alleged) size of @src struct.
> > + *
> > + * Copies a struct from user space to kernel space, in a way that guarantees
> > + * backwards-compatibility for struct syscall arguments (as long as future
> > + * struct extensions are made such that all new fields are *appended* to the
> > + * old struct, and zeroed-out new fields have the same meaning as the old
> > + * struct).
> > + *
> > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > + * The recommended usage is something like the following:
> > + *
> > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> > + * {
> > + * int err;
> > + * struct foo karg = {};
> > + *
> > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> > + * if (err)
> > + * return err;
> > + *
> > + * // ...
> > + * }
> > + *
> > + * There are three cases to consider:
> > + * * If @usize == @ksize, then it's copied verbatim.
> > + * * If @usize < @ksize, then the user space has passed an old struct to a
> > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> > + * are to be zero-filled.
> > + * * If @usize > @ksize, then the user space has passed a new struct to an
> > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> > + * * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_from_user(void *dst, size_t ksize,
> > + const void __user *src, size_t usize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
>
> As above.
>
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> As above.
>
> > + if (unlikely(!access_ok(src, usize)))
> > + return -EFAULT;
> > +
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize)
> > + memset(dst + size, 0, rest);
> > + else if (usize > ksize) {
> > + const void __user *addr = src + size;
> > + char buffer[BUFFER_SIZE] = {};
> > +
> > + while (rest > 0) {
> > + size_t bufsize = min(rest, sizeof(buffer));
> > +
> > + if (__copy_from_user(buffer, addr, bufsize))
> > + return -EFAULT;
> > + if (memchr_inv(buffer, 0, bufsize))
> > + return -E2BIG;
> > +
> > + addr += bufsize;
> > + rest -= bufsize;
> > + }
>
> I'd create a __user_is_zero() helper for this - that way the two
> branches in the two helpers become nicely symmetric, each just calling a
> single helper that deals appropriately with the tail. And we can discuss
> how to implement __user_is_zero() in another bikeshed.
Will do.
>
> > + }
> > + /* Copy the interoperable parts of the struct. */
> > + if (__copy_from_user(dst, src, size))
> > + return -EFAULT;
>
> If you do move up the __copy_to_user(), please move this as well - on
> the kernel side, we certainly don't care that we copied some bytes to a
> local buffer which we then ignore because the user had a non-zero tail.
> But if __copy_to_user() is kept last in copy_struct_to_user(), this
> should stay for symmetry.
I will keep that in mind.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +/**
> > > > + * copy_struct_to_user: copy a struct to user space
> > > > + * @dst: Destination address, in user space.
> > > > + * @usize: Size of @dst struct.
> > > > + * @src: Source address, in kernel space.
> > > > + * @ksize: Size of @src struct.
> > > > + *
> > > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > > + * struct extensions are made such that all new fields are *appended* to the
> > > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > > + * struct).
> > > > + *
> > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > > + * The recommended usage is something like the following:
> > > > + *
> > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > + * {
> > > > + * int err;
> > > > + * struct foo karg = {};
> > > > + *
> > > > + * // do something with karg
> > > > + *
> > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > + * if (err)
> > > > + * return err;
> > > > + *
> > > > + * // ...
> > > > + * }
> > > > + *
> > > > + * There are three cases to consider:
> > > > + * * If @usize == @ksize, then it's copied verbatim.
> > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > > + * older user space. In order to avoid user space getting incomplete
> > > > + * information (new fields might be important), all trailing bytes in @src
> > > > + * (@ksize - @usize) must be zerored
> > >
> > > s/zerored/zero/, right?
> >
> > It should've been "zeroed".
>
> That reads wrong to me; that way it reads like this function must take
> that action and zero out the 'rest'; which is just wrong.
>
> This function must verify those bytes are zero, not make them zero.
>
> > > > , otherwise -EFBIG is returned.
> > >
> > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> >
> > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > struct passed to user-space. I would personally have preferred EMSGSIZE
> > instead of EFBIG, but felt using the existing error codes would be less
> > confusing.
>
> Sadly a recent commit:
>
> 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
>
> Made the situation even 'worse'.
And thinking more about things; I'm not convinced the above patch is
actually right.
Do we really want to simply truncate all the attributes of the task?
And should we not at least set sched_flags when there are non-default
clamp values applied?
See; that is I think the primary bug that had chrt failing; we tried to
publish the default clamp values as !0.
On 04/09/2019 22.19, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
>
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
>
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> always rejects differently-sized struct arguments.
>
> Suggested-by: Rasmus Villemoes <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index 000000000000..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
Isn't this __clear_user() exactly (perhaps except for the return value)?
Perhaps not every arch has that?
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
> +
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst: Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src: Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
Eh, I'd avoid abs() here due to the funkiness of the implicit type
conversions - ksize-usize has type size_t, then that's coerced to an int
(or a long maybe?), the abs is applied which return an int/long (or
unsigned versions?). Something like "rest = max(ksize, usize) - size;"
is more obviously correct and doesn't fall into any
narrowing/widening/sign extending traps.
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Please don't. That is a restriction on all future extensions - once a
kernel is shipped with a syscall using this helper with that arbitrary
restriction in place, that syscall is forever prevented from extending
its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
weren't enough for everybody?
This is only for future compatibility, and if someone runs an app
compiled against 7.3 headers on a 5.4 kernel, they probably don't care
about performance, but they would like their app to run.
[If we ever create such a large ABI struct that doesn't fit on stack,
we'd have to extend our API a little to create a dup_struct_from_user()
that does the kmalloc() for us and then calls copy_struct_from_user() -
but we might want that long before we hit PAGE_SIZE structs].
> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
I think that could simply be __clear_user().
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
I think I understand why you put this last instead of handling the
buffer in the "natural" order. However,
I'm wondering whether we should actually do this copy before checking
that the extra kernel bytes are 0 - the user will still be told that
there was some extra information via the -EFBIG/-E2BIG return, but maybe
in some cases the part he understands is good enough. But I also guess
we have to look to existing users to see whether that would prevent them
from being converted to using this helper.
linux-api folks, WDYT?
> + return 0;
Maybe more useful to "return size;", some users might want to know/pass
on how much was actually copied.
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
Can't we wait with this until a modular user shows up? The primary users
are syscalls, which can't be modular AFAIK.
> +/**
> + * copy_struct_from_user: copy a struct from user space
> + * @dst: Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src: Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then the user space has passed an old struct to a
> + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + * are to be zero-filled.
> + * * If @usize > @ksize, then the user space has passed a new struct to an
> + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
As above.
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
As above.
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
> +
> + addr += bufsize;
> + rest -= bufsize;
> + }
I'd create a __user_is_zero() helper for this - that way the two
branches in the two helpers become nicely symmetric, each just calling a
single helper that deals appropriately with the tail. And we can discuss
how to implement __user_is_zero() in another bikeshed.
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_from_user(dst, src, size))
> + return -EFAULT;
If you do move up the __copy_to_user(), please move this as well - on
the kernel side, we certainly don't care that we copied some bytes to a
local buffer which we then ignore because the user had a non-zero tail.
But if __copy_to_user() is kept last in copy_struct_to_user(), this
should stay for symmetry.
> + return 0;
As above.
> +}
> +EXPORT_SYMBOL(copy_struct_from_user);
As above.
Rasmus
On Sep 05 2019, Aleksa Sarai <[email protected]> wrote:
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index 000000000000..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
Perhaps make that static?
Andreas.
--
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
On Thu, Sep 05, 2019 at 11:09:35AM +0200, Andreas Schwab wrote:
> On Sep 05 2019, Aleksa Sarai <[email protected]> wrote:
>
> > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > new file mode 100644
> > index 000000000000..7301ab1bbe98
> > --- /dev/null
> > +++ b/lib/struct_user.c
> > @@ -0,0 +1,182 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (C) 2019 SUSE LLC
> > + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/export.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/kernel.h>
> > +#include <linux/string.h>
> > +
> > +#define BUFFER_SIZE 64
> > +
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > + const char zeros[BUFFER_SIZE] = {};
>
> Perhaps make that static?
On SMP? It should at least be per cpu, and I'm not even sure
with preemption.
Gabriel
>
> Andreas.
>
> --
> Andreas Schwab, [email protected]
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
On Thu, Sep 05, 2019 at 07:50:26PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Rasmus Villemoes <[email protected]> wrote:
> > On 04/09/2019 22.19, Aleksa Sarai wrote:
> > > A common pattern for syscall extensions is increasing the size of a
> > > struct passed from userspace, such that the zero-value of the new fields
> > > result in the old kernel behaviour (allowing for a mix of userspace and
> > > kernel vintages to operate on one another in most cases). This is done
> > > in both directions -- hence two helpers -- though it's more common to
> > > have to copy user space structs into kernel space.
> > >
> > > Previously there was no common lib/ function that implemented
> > > the necessary extension-checking semantics (and different syscalls
> > > implemented them slightly differently or incompletely[1]). A future
> > > patch replaces all of the common uses of this pattern to use the new
> > > copy_struct_{to,from}_user() helpers.
> > >
> > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > > always rejects differently-sized struct arguments.
> > >
> > > Suggested-by: Rasmus Villemoes <[email protected]>
> > > Signed-off-by: Aleksa Sarai <[email protected]>
> > > ---
> > > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > > new file mode 100644
> > > index 000000000000..7301ab1bbe98
> > > --- /dev/null
> > > +++ b/lib/struct_user.c
> > > @@ -0,0 +1,182 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * Copyright (C) 2019 SUSE LLC
> > > + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> > > + */
> > > +
> > > +#include <linux/types.h>
> > > +#include <linux/export.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/string.h>
> > > +
> > > +#define BUFFER_SIZE 64
> > > +
> > > +/*
> > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > > + * checked access_ok(p, size).
> > > + */
> >
> > Isn't this __clear_user() exactly (perhaps except for the return value)?
> > Perhaps not every arch has that?
>
> I didn't know about clear_user() -- I will switch to it.
>
> > > +static int __memzero_user(void __user *p, size_t s)
> > > +{
> > > + const char zeros[BUFFER_SIZE] = {};
> > > + while (s > 0) {
> > > + size_t n = min(s, sizeof(zeros));
> > > +
> > > + if (__copy_to_user(p, zeros, n))
> > > + return -EFAULT;
> > > +
> > > + p += n;
> > > + s -= n;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * copy_struct_to_user: copy a struct to user space
> > > + * @dst: Destination address, in user space.
> > > + * @usize: Size of @dst struct.
> > > + * @src: Source address, in kernel space.
> > > + * @ksize: Size of @src struct.
> > > + *
> > > + * Returns (in all cases, some data may have been copied):
> > > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> > > + * * -EFAULT: access to user space failed.
> > > + */
> > > +int copy_struct_to_user(void __user *dst, size_t usize,
> > > + const void *src, size_t ksize)
> > > +{
> > > + size_t size = min(ksize, usize);
> > > + size_t rest = abs(ksize - usize);
> >
> > Eh, I'd avoid abs() here due to the funkiness of the implicit type
> > conversions - ksize-usize has type size_t, then that's coerced to an int
> > (or a long maybe?), the abs is applied which return an int/long (or
> > unsigned versions?). Something like "rest = max(ksize, usize) - size;"
> > is more obviously correct and doesn't fall into any
> > narrowing/widening/sign extending traps.
>
> Yeah, I originally used "max(ksize, usize) - size" for that reason but
> was worried it looked too funky (and some quick tests showed that abs()
> gives the right results in most cases -- though I just realised it would
> probably not give the right results around SIZE_MAX). I'll switch back.
>
> > > + if (unlikely(usize > PAGE_SIZE))
> > > + return -EFAULT;
> >
> > Please don't. That is a restriction on all future extensions - once a
> > kernel is shipped with a syscall using this helper with that arbitrary
> > restriction in place, that syscall is forever prevented from extending
> > its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
> > it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
> > weren't enough for everybody?
> >
> > This is only for future compatibility, and if someone runs an app
> > compiled against 7.3 headers on a 5.4 kernel, they probably don't care
> > about performance, but they would like their app to run.
>
> I'm not sure I agree that the limit is in place *forever* -- it's
> generally not a break in compatibility to convert an error into a
> success (though, there are counterexamples such as mknod(2) -- but that
> was a very specific case).
>
> You're right that it would mean that some very new code won't run on
> very ancient kernels (assuming we ever pass around structs that
> massive), but there should be a reasonable trade-off here IMHO.
Passing a struct larger than a PAGE_SIZE right now (at least for all
those calls that would make use of this helper at the moment) is to be
considered a bug.
The PAGE_SIZE check is a reasonable heuristic. It's an assumption that
is pretty common in the kernel in other places as well. Plus the
possibility of DoS.
>
> If we allow very large sizes, a program could probably DoS the kernel by
> allocating a moderately-large block of memory and then spawning a bunch
> of threads that all cause the kernel to re-check that the same 1GB block
> of memory is zeroed. I haven't tried, but it seems like it's best to
> avoid the possibility altogether.
>
> > > + }
> > > + /* Copy the interoperable parts of the struct. */
> > > + if (__copy_to_user(dst, src, size))
> > > + return -EFAULT;
> >
> > I think I understand why you put this last instead of handling the
> > buffer in the "natural" order. However,
> > I'm wondering whether we should actually do this copy before checking
> > that the extra kernel bytes are 0 - the user will still be told that
> > there was some extra information via the -EFBIG/-E2BIG return, but maybe
> > in some cases the part he understands is good enough. But I also guess
> > we have to look to existing users to see whether that would prevent them
> > from being converted to using this helper.
> >
> > linux-api folks, WDYT?
>
> Regarding the order, I just copied what sched and perf already do. I
> wouldn't mind doing it the other way around -- though I am a little
> cautious about implicitly making guarantees like that. The syscall that
> uses copy_struct_to_user() might not want to make that guarantee (it
> might not make sense for them), and there are some -E2BIG returns that
> won't result in data being copied (usize > PAGE_SIZE).
>
> As for feedback, this is syscall-dependent at the moment. The sched and
> perf users explicitly return the size of the kernel structure (by
> overwriting uattr->size if -E2BIG is returned) for copies in either
> direction. So users arguably already have some kind of feedback about
> size issues. clone3() on the other hand doesn't do that (though it
> doesn't copy anything to user-space so this isn't relevant to this
> particular question).
>
> Effectively, I'd like to see someone argue that this is something that
> they would personally want (before we do it).
I think the order you have right now is fine. I don't see the point of
doing work first before we have verified that things are sane.
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
>
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
>
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> always rejects differently-sized struct arguments.
>
> Suggested-by: Rasmus Villemoes <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> include/linux/uaccess.h | 5 ++
> lib/Makefile | 2 +-
> lib/struct_user.c | 182 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 188 insertions(+), 1 deletion(-)
> create mode 100644 lib/struct_user.c
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 34a038563d97..0ad9544a1aee 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -230,6 +230,11 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
>
> #endif /* ARCH_HAS_NOCACHE_UACCESS */
>
> +extern int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize);
> +extern int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize);
> +
> /*
> * probe_kernel_read(): safely attempt to read from a location
> * @dst: pointer to the buffer that shall take the data
> diff --git a/lib/Makefile b/lib/Makefile
> index 29c02a924973..d86c71feaf0a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -28,7 +28,7 @@ endif
> CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
> endif
>
> -lib-y := ctype.o string.o vsprintf.o cmdline.o \
> +lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \
> rbtree.o radix-tree.o timerqueue.o xarray.o \
> idr.o extable.o \
> sha1.o chacha.o irq_regs.o argv_split.o \
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index 000000000000..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai <[email protected]>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
> +
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst: Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src: Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * // do something with karg
> + *
> + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> + * older user space. In order to avoid user space getting incomplete
> + * information (new fields might be important), all trailing bytes in @src
> + * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned.
> + * * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> + * zero-filled.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src.
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Looks like this should be -EFBIG.
> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
Is zeroing that memory really our job? Seems to me we should just check
it is zeroed.
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
> +
> +/**
> + * copy_struct_from_user: copy a struct from user space
> + * @dst: Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src: Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> + * The recommended usage is something like the following:
> + *
> + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + * {
> + * int err;
> + * struct foo karg = {};
> + *
> + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> + * if (err)
> + * return err;
> + *
> + * // ...
> + * }
> + *
> + * There are three cases to consider:
> + * * If @usize == @ksize, then it's copied verbatim.
> + * * If @usize < @ksize, then the user space has passed an old struct to a
> + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + * are to be zero-filled.
> + * * If @usize > @ksize, then the user space has passed a new struct to an
> + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
> + * are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *
> + * Returns (in all cases, some data may have been copied):
> + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src.
> + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE).
> + * * -EFAULT: access to user space failed.
> + */
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
That should be -E2BIG.
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
I think kernel style mandates that if one branch in an if-else ladder
requires {} all other must use {} as well. So this should be:
if () {
// one line
} else {
// one line
// another line
}
That's a change in behavior for clone3() and sched at least, no? Unless
- which I guess you might have done - you have moved the "error out when
the struct is too small" part before the call to copy_struct_from_user()
for them.
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
> +
> + addr += bufsize;
> + rest -= bufsize;
> + }
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_from_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_from_user);
> --
> 2.23.0
>
On 2019-09-05, Christian Brauner <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> >
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> >
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > always rejects differently-sized struct arguments.
> >
> > Suggested-by: Rasmus Villemoes <[email protected]>
> > Signed-off-by: Aleksa Sarai <[email protected]>
>
> I would probably split this out into a separate patchset. It can very
> well go in before openat2(). Thoughts?
Yeah, I'll split this and the related patches out -- though I will admit
I'm not sure how you're supposed to deal with multiple independent
patchsets that depend on each other. How will folks reviewing openat2(2)
know to include the lib/struct_user.c changes?
Also, whose tree should it go through?
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Thu, Sep 05, 2019 at 01:17:38PM +0200, Rasmus Villemoes wrote:
> On 05/09/2019 13.05, Christian Brauner wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
>
> >> + if (unlikely(!access_ok(dst, usize)))
> >> + return -EFAULT;
> >> +
> >> + /* Deal with trailing bytes. */
> >> + if (usize < ksize) {
> >> + if (memchr_inv(src + size, 0, rest))
> >> + return -EFBIG;
> >> + } else if (usize > ksize) {
> >> + if (__memzero_user(dst + size, rest))
> >> + return -EFAULT;
> >
> > Is zeroing that memory really our job? Seems to me we should just check
> > it is zeroed.
>
> Of course it is, otherwise you'd require userspace to clear the output
> buffer it gives us, which in the majority of cases is wasted work. It's
> much easier to reason about if we just say "the kernel populates [uaddr,
> uaddr + usize)".
I don't really mind either way so sure. :)
On Thu, Sep 05, 2019 at 09:27:18PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Christian Brauner <[email protected]> wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > A common pattern for syscall extensions is increasing the size of a
> > > struct passed from userspace, such that the zero-value of the new fields
> > > result in the old kernel behaviour (allowing for a mix of userspace and
> > > kernel vintages to operate on one another in most cases). This is done
> > > in both directions -- hence two helpers -- though it's more common to
> > > have to copy user space structs into kernel space.
> > >
> > > Previously there was no common lib/ function that implemented
> > > the necessary extension-checking semantics (and different syscalls
> > > implemented them slightly differently or incompletely[1]). A future
> > > patch replaces all of the common uses of this pattern to use the new
> > > copy_struct_{to,from}_user() helpers.
> > >
> > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > > always rejects differently-sized struct arguments.
> > >
> > > Suggested-by: Rasmus Villemoes <[email protected]>
> > > Signed-off-by: Aleksa Sarai <[email protected]>
> >
> > I would probably split this out into a separate patchset. It can very
> > well go in before openat2(). Thoughts?
>
> Yeah, I'll split this and the related patches out -- though I will admit
> I'm not sure how you're supposed to deal with multiple independent
> patchsets that depend on each other. How will folks reviewing openat2(2)
> know to include the lib/struct_user.c changes?
The way I usually deal with this is to make two branches. One with the
changes the other depends on and then merge this branch into the other
and put the changes on top. Then you can provide a complete branch that
people can test when you send the patchset out by just linking to it in
the cover letter.
(But if it's too much hazzle just leave it.)
>
> Also, whose tree should it go through?
If people think splitting it out makes sense and we can settle the
technical details I can take it and let it stew in linux-next at least
for a little while.
I have changes to clone3() in there that touch
copy_clone_args_from_user() anyway and there are tests for clone3()
struct copying so we'd catch regressions (for clone3() at least) pretty
quickly.
If we don't see any major issues in the next two weeks it might even be
ok to send for 5.4.
Christian
On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +/**
> > > > + * copy_struct_to_user: copy a struct to user space
> > > > + * @dst: Destination address, in user space.
> > > > + * @usize: Size of @dst struct.
> > > > + * @src: Source address, in kernel space.
> > > > + * @ksize: Size of @src struct.
> > > > + *
> > > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > > + * struct extensions are made such that all new fields are *appended* to the
> > > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > > + * struct).
> > > > + *
> > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > > + * The recommended usage is something like the following:
> > > > + *
> > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > + * {
> > > > + * int err;
> > > > + * struct foo karg = {};
> > > > + *
> > > > + * // do something with karg
> > > > + *
> > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > + * if (err)
> > > > + * return err;
> > > > + *
> > > > + * // ...
> > > > + * }
> > > > + *
> > > > + * There are three cases to consider:
> > > > + * * If @usize == @ksize, then it's copied verbatim.
> > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > > + * older user space. In order to avoid user space getting incomplete
> > > > + * information (new fields might be important), all trailing bytes in @src
> > > > + * (@ksize - @usize) must be zerored
> > >
> > > s/zerored/zero/, right?
> >
> > It should've been "zeroed".
>
> That reads wrong to me; that way it reads like this function must take
> that action and zero out the 'rest'; which is just wrong.
>
> This function must verify those bytes are zero, not make them zero.
Right, in my head I was thinking "must have been zeroed" which isn't
what it says. I'll switch to "zero".
> > > > , otherwise -EFBIG is returned.
> > >
> > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> >
> > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > struct passed to user-space. I would personally have preferred EMSGSIZE
> > instead of EFBIG, but felt using the existing error codes would be less
> > confusing.
>
> Sadly a recent commit:
>
> 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
>
> Made the situation even 'worse'.
I hadn't seen this patch before, and I have a few questions taking a
look at it:
* An error code for a particular behaviour was changed (EFBIG ->
E2BIG). Is this not a userspace breakage (I know Linus went ballistic
about something similar a while ago[1]), or did I misunderstand what
the issue was in [1]?
* At the risk of bike-shedding -- of we are changing it, wouldn't
-EMSGSIZE be more appropriate? To be fair, picking errno values has
always been more of an art than a science, but to my ears "Argument
list too long" doesn't make too much sense in the context of
"returning" a struct back to userspace (and the cause of the error
is that the argument passed by user space *isn't big enough*). If
there was an E2SMALL that would also work. ;)
* Do you want me to write a patch based on that, to switch it to
copy_struct_to_user()?
* That patch removes the "are there non-zero bytes in the tail that
userspace won't know about" check (which I have included in mine). I
understand that this caused issues specifically with sched_getattr(2)
due to the default value not being zero -- how should we rectify that
(given that we'd hopefully want to port everyone who uses that
interface to copy_struct_{to,from}_user())?
* Given that the [uk]attr->size construct is pretty important to the
usability of the sched and perf interfaces, should we require (or
encourage) it for all struct-extension syscall setups?
> > > > + if (unlikely(!access_ok(src, usize)))
> > > > + return -EFAULT;
> > > > +
> > > > + /* Deal with trailing bytes. */
> > > > + if (usize < ksize)
> > > > + memset(dst + size, 0, rest);
> > > > + else if (usize > ksize) {
> > > > + const void __user *addr = src + size;
> > > > + char buffer[BUFFER_SIZE] = {};
> > >
> > > Isn't that too big for on-stack?
> >
> > Is a 64-byte buffer too big? I picked the number "at random" to be the
> > size of a cache line, but I could shrink it down to 32 bytes if the size
> > is an issue (I wanted to avoid needless allocations -- hence it being
> > on-stack).
>
> Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose
> 64 should be OK.
Good to know, though I'll rename it to avoid confusion.
[1]: https://lore.kernel.org/lkml/CA+55aFy98A+LJK4+GWMcbzaa1zsPBRo76q+ioEjbx-uaMKH6Uw@mail.gmail.com/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
>
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
>
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> always rejects differently-sized struct arguments.
>
> Suggested-by: Rasmus Villemoes <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
I would probably split this out into a separate patchset. It can very
well go in before openat2(). Thoughts?
Christian
On 05/09/2019 13.05, Christian Brauner wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
>> + if (unlikely(!access_ok(dst, usize)))
>> + return -EFAULT;
>> +
>> + /* Deal with trailing bytes. */
>> + if (usize < ksize) {
>> + if (memchr_inv(src + size, 0, rest))
>> + return -EFBIG;
>> + } else if (usize > ksize) {
>> + if (__memzero_user(dst + size, rest))
>> + return -EFAULT;
>
> Is zeroing that memory really our job? Seems to me we should just check
> it is zeroed.
Of course it is, otherwise you'd require userspace to clear the output
buffer it gives us, which in the majority of cases is wasted work. It's
much easier to reason about if we just say "the kernel populates [uaddr,
uaddr + usize)".
It's completely symmetric to copy_struct_from_user doing a memset() of
the tail of the kernel buffer in case of ksize>usize - you wouldn't want
to require the kernel callers to pass a zeroed buffer to
copy_struct_from_user() - it's just that when we memset(__user*),
there's an error check to do.
Rasmus
On 2019-09-05, Christian Brauner <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> >
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> >
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > always rejects differently-sized struct arguments.
> >
> > Suggested-by: Rasmus Villemoes <[email protected]>
> > Signed-off-by: Aleksa Sarai <[email protected]>
[...]
> > + if (unlikely(!access_ok(src, usize)))
> > + return -EFAULT;
> > +
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize)
> > + memset(dst + size, 0, rest);
[...]
> That's a change in behavior for clone3() and sched at least, no? Unless
> - which I guess you might have done - you have moved the "error out when
> the struct is too small" part before the call to copy_struct_from_user()
> for them.
Yes, I've put the minimum size check to the callers in all of the
cases (in the case of clone3() I've #define'd a CLONE_ARGS_SIZE_VER0 to
match the others -- see patch 2 of the series).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +
> > > > + while (rest > 0) {
> > > > + size_t bufsize = min(rest, sizeof(buffer));
> > > > +
> > > > + if (__copy_from_user(buffer, addr, bufsize))
> > > > + return -EFAULT;
> > > > + if (memchr_inv(buffer, 0, bufsize))
> > > > + return -E2BIG;
> > > > +
> > > > + addr += bufsize;
> > > > + rest -= bufsize;
> > > > + }
> > >
> > > The perf implementation uses get_user(); but if that is too slow, surely
> > > we can do something with uaccess_try() here?
> >
> > Is there a non-x86-specific way to do that (unless I'm mistaken only x86
> > has uaccess_try() or the other *_try() wrappers)? The main "performance
> > improvement" (if you can even call it that) is that we use memchr_inv()
> > which finds non-matching characters more efficiently than just doing a
> > loop.
>
> Oh, you're right, that's x86 only :/
Though, I just had an idea -- am I wrong to think that the following
would work just as well (without the need for an intermediate buffer)?
if (memchr_inv((const char __force *) src + size, 0, rest))
return -E2BIG;
Or is this type of thing very much frowned upon? What if it was a
separate memchr_inv_user() instead -- I feel as though there's not a
strong argument for needing to use a buffer when we're single-passing
the __user buffer and doing a basic boolean check.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
That's called clear_user().
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Why?
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
Why not simply clear_user() and copy_to_user()?
> +int copy_struct_from_user(void *dst, size_t ksize,
> + const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
Cute, but... you would be just as well without that 'rest' thing.
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
Again, why?
> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;
Why not simply copy_from_user() here?
> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;
Frankly, that looks like a candidate for is_all_zeroes_user().
With the loop like above serving as a dumb default. And on
badly alighed address it _will_ be dumb. Probably too much
so - something like
if ((unsigned long)addr & 1) {
u8 v;
if (get_user(v, (__u8 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr++;
}
if ((unsigned long)addr & 2) {
u16 v;
if (get_user(v, (__u16 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr +=2;
}
if ((unsigned long)addr & 4) {
u32 v;
if (get_user(v, (__u32 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
}
<read the rest like you currently do>
would be saner, and things like x86 could trivially add an
asm variant - it's not hard. Incidentally, memchr_inv() is
an overkill in this case...
On Thu, Sep 05, 2019 at 07:07:50PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > + const char zeros[BUFFER_SIZE] = {};
> > + while (s > 0) {
> > + size_t n = min(s, sizeof(zeros));
> > +
> > + if (__copy_to_user(p, zeros, n))
> > + return -EFAULT;
> > +
> > + p += n;
> > + s -= n;
> > + }
> > + return 0;
> > +}
>
> That's called clear_user().
>
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > + const void *src, size_t ksize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Why?
Because every caller of that function right now has that limit set
anyway iirc. So we can either remove it from here and place it back for
the individual callers or leave it in the helper.
Also, I'm really asking, why not? Is it unreasonable to have an upper
bound on the size (for a long time probably) or are you disagreeing with
PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
bpf, and clone3 and in a few other places.
On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> Because every caller of that function right now has that limit set
> anyway iirc. So we can either remove it from here and place it back for
> the individual callers or leave it in the helper.
> Also, I'm really asking, why not? Is it unreasonable to have an upper
> bound on the size (for a long time probably) or are you disagreeing with
> PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> bpf, and clone3 and in a few other places.
For a primitive that can be safely used with any size (OK, any within
the usual 2Gb limit)? Why push the random policy into the place where
it doesn't belong?
Seriously, what's the point? If they want to have a large chunk of
userland memory zeroed or checked for non-zeroes - why would that
be a problem?
On Thu, Sep 05, 2019 at 07:28:01PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
>
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
>
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)? Why push the random policy into the place where
> it doesn't belong?
Ah, the "not in the helper part" makes sense.
As long as leave the check for the callers themselves.
>
> Seriously, what's the point? If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?
On 2019-09-05, Al Viro <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
>
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
>
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)? Why push the random policy into the place where
> it doesn't belong?
>
> Seriously, what's the point? If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?
Thinking about it some more, there isn't really any r/w amplification --
so there isn't much to gain by passing giant structs. Though, if we are
going to permit 2GB buffers, isn't that also an argument to use
memchr_inv()? :P
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <[email protected]> wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> >
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> >
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)? Why push the random policy into the place where
> > it doesn't belong?
> >
> > Seriously, what's the point? If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
>
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P
I'm not sure I understand the last bit. If you look at what copy_from_user()
does on misaligned source/destination, especially on architectures that
really, really do not like unaligned access...
Case in point: alpha (and it's not unusual in that respect). What it boils
down to is
copy bytes until the destination is aligned
if source and destination are both aligned
copy word by word
else
read word by word, storing the mix of two adjacent words
copy the rest byte by byte
The unpleasant case (to and from having different remainders modulo 8) is
basically
if (count >= 8) {
u64 *aligned = (u64 *)(from & ~7);
u64 *dest = (u64 *)to;
int bitshift = (from & 7) * 8;
u64 prev, next;
prev = aligned[0];
do {
next = aligned[1];
prev <<= bitshift;
prev |= next >> (64 - bitshift);
*dest++ = prev;
aligned++;
prev = next;
from += 8;
to += 8;
count -= 8;
} while (count >= 8);
}
Now, mix that with "... and do memchr_inv() on the copy to find if we'd
copied any non-zeroes, nevermind where" and it starts looking really
ridiculous.
We should just read the fscking source, aligned down to word boundary
and check each word being read. The first and the last ones - masked.
All there is to it. On almost all architectures that'll work well
enough; s390 might want something more elaborate (there even word-by-word
copies are costly, but I'd suggest talking to them for details).
Something like bool all_zeroes_user(const void __user *p, size_t count)
would probably be a sane API...
On 2019-09-05, Al Viro <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > + const char zeros[BUFFER_SIZE] = {};
> > + while (s > 0) {
> > + size_t n = min(s, sizeof(zeros));
> > +
> > + if (__copy_to_user(p, zeros, n))
> > + return -EFAULT;
> > +
> > + p += n;
> > + s -= n;
> > + }
> > + return 0;
> > +}
>
> That's called clear_user().
Already switched, I didn't know about clear_user() -- I assumed it
would've been called bzero_user() or memzero_user() and didn't find it
when looking.
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > + const void *src, size_t ksize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Why?
>
> > + } else if (usize > ksize) {
> > + if (__memzero_user(dst + size, rest))
> > + return -EFAULT;
> > + }
> > + /* Copy the interoperable parts of the struct. */
> > + if (__copy_to_user(dst, src, size))
> > + return -EFAULT;
>
> Why not simply clear_user() and copy_to_user()?
I'm not sure I understand what you mean -- are you asking why we need to
do memchr_inv(src + size, 0, rest) earlier?
>
> > +int copy_struct_from_user(void *dst, size_t ksize,
> > + const void __user *src, size_t usize)
> > +{
> > + size_t size = min(ksize, usize);
> > + size_t rest = abs(ksize - usize);
>
> Cute, but... you would be just as well without that 'rest' thing.
I would argue it's harder to mess up using "rest" compared to getting
"ksize - usize" and "usize - ksize" mixed up (and it's a bit more
readable).
> > +
> > + if (unlikely(usize > PAGE_SIZE))
> > + return -EFAULT;
>
> Again, why?
As discussed in a sister thread, I will leave this in the callers
(though I would argue callers should always do some kind of sanity check
like this).
>
> > + if (unlikely(!access_ok(src, usize)))
> > + return -EFAULT;
>
> Why not simply copy_from_user() here?
>
> > + /* Deal with trailing bytes. */
> > + if (usize < ksize)
> > + memset(dst + size, 0, rest);
> > + else if (usize > ksize) {
> > + const void __user *addr = src + size;
> > + char buffer[BUFFER_SIZE] = {};
> > +
> > + while (rest > 0) {
> > + size_t bufsize = min(rest, sizeof(buffer));
> > +
> > + if (__copy_from_user(buffer, addr, bufsize))
> > + return -EFAULT;
> > + if (memchr_inv(buffer, 0, bufsize))
> > + return -E2BIG;
>
> Frankly, that looks like a candidate for is_all_zeroes_user().
> With the loop like above serving as a dumb default. And on
> badly alighed address it _will_ be dumb. Probably too much
> so - something like
> if ((unsigned long)addr & 1) {
> u8 v;
> if (get_user(v, (__u8 __user *)addr))
> return -EFAULT;
> if (v)
> return -E2BIG;
> addr++;
> }
> if ((unsigned long)addr & 2) {
> u16 v;
> if (get_user(v, (__u16 __user *)addr))
> return -EFAULT;
> if (v)
> return -E2BIG;
> addr +=2;
> }
> if ((unsigned long)addr & 4) {
> u32 v;
> if (get_user(v, (__u32 __user *)addr))
> return -EFAULT;
> if (v)
> return -E2BIG;
> }
> <read the rest like you currently do>
> would be saner, and things like x86 could trivially add an
> asm variant - it's not hard. Incidentally, memchr_inv() is
> an overkill in this case...
Why is memchr_inv() overkill?
But yes, breaking this out to an asm-generic is_all_zeroes_user()
wouldn't hurt -- and I'll put a cleaned-up version of the alignment
handling there too. Should I drop it in asm-generic/uaccess.h, or
somewhere else?
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2019-09-06, Al Viro <[email protected]> wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > + return -EFAULT;
> > > > + }
> > > > + /* Copy the interoperable parts of the struct. */
> > > > + if (__copy_to_user(dst, src, size))
> > > > + return -EFAULT;
> > >
> > > Why not simply clear_user() and copy_to_user()?
> >
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
>
> I'm asking why bother with __ and separate access_ok().
Ah right, it was a dumb "optimisation" (since we need to do access_ok()
anyway since we should early -EFAULT in that case). I've dropped the __
usages in my working copy.
> > > if ((unsigned long)addr & 1) {
> > > u8 v;
> > > if (get_user(v, (__u8 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > addr++;
> > > }
> > > if ((unsigned long)addr & 2) {
> > > u16 v;
> > > if (get_user(v, (__u16 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > addr +=2;
> > > }
> > > if ((unsigned long)addr & 4) {
> > > u32 v;
> > > if (get_user(v, (__u32 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > }
> > > <read the rest like you currently do>
>
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down. Then mask the
> bytes you don't want to check out and see if there's anything
> left.
>
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
>
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard. Incidentally, memchr_inv() is
> > > an overkill in this case...
> >
> > Why is memchr_inv() overkill?
>
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be. All you need is the same logics
> as in "from userland" case
> if (!count)
> return true;
> offset = (unsigned long)from & 7
> p = (u64 *)(from - offset);
> v = *p++;
> if (offset) { // unaligned
> count += offset;
> v &= ~aligned_byte_mask(offset); // see strnlen_user.c
> }
> while (count > 8) {
> if (v)
> return false;
> v = *p++;
> count -= 8;
> }
> if (count != 8)
> v &= aligned_byte_mask(count);
> return v == 0;
>
> All there is to it...
Alright, will do (for some reason I hadn't made the connection that
memchr_inv() is doing effectively the same word-by-word comparison but
also detecting where the first byte is).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > + return -EFAULT;
> > > + }
> > > + /* Copy the interoperable parts of the struct. */
> > > + if (__copy_to_user(dst, src, size))
> > > + return -EFAULT;
> >
> > Why not simply clear_user() and copy_to_user()?
>
> I'm not sure I understand what you mean -- are you asking why we need to
> do memchr_inv(src + size, 0, rest) earlier?
I'm asking why bother with __ and separate access_ok().
> > if ((unsigned long)addr & 1) {
> > u8 v;
> > if (get_user(v, (__u8 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > addr++;
> > }
> > if ((unsigned long)addr & 2) {
> > u16 v;
> > if (get_user(v, (__u16 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > addr +=2;
> > }
> > if ((unsigned long)addr & 4) {
> > u32 v;
> > if (get_user(v, (__u32 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > }
> > <read the rest like you currently do>
Actually, this is a dumb way to do it - page size on anything
is going to be a multiple of 8, so you could just as well
read 8 bytes from an address aligned down. Then mask the
bytes you don't want to check out and see if there's anything
left.
You can have readability boundaries inside a page - it's either
the entire page (let alone a single word) being readable, or
it's EFAULT for all parts.
> > would be saner, and things like x86 could trivially add an
> > asm variant - it's not hard. Incidentally, memchr_inv() is
> > an overkill in this case...
>
> Why is memchr_inv() overkill?
Look at its implementation; you only care if there are
non-zeroes, you don't give a damn where in the buffer
the first one would be. All you need is the same logics
as in "from userland" case
if (!count)
return true;
offset = (unsigned long)from & 7
p = (u64 *)(from - offset);
v = *p++;
if (offset) { // unaligned
count += offset;
v &= ~aligned_byte_mask(offset); // see strnlen_user.c
}
while (count > 8) {
if (v)
return false;
v = *p++;
count -= 8;
}
if (count != 8)
v &= aligned_byte_mask(count);
return v == 0;
All there is to it...
On Fri, Sep 06, 2019 at 12:49:44AM +0100, Al Viro wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > + return -EFAULT;
> > > > + }
> > > > + /* Copy the interoperable parts of the struct. */
> > > > + if (__copy_to_user(dst, src, size))
> > > > + return -EFAULT;
> > >
> > > Why not simply clear_user() and copy_to_user()?
> >
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
>
> I'm asking why bother with __ and separate access_ok().
>
> > > if ((unsigned long)addr & 1) {
> > > u8 v;
> > > if (get_user(v, (__u8 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > addr++;
> > > }
> > > if ((unsigned long)addr & 2) {
> > > u16 v;
> > > if (get_user(v, (__u16 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > addr +=2;
> > > }
> > > if ((unsigned long)addr & 4) {
> > > u32 v;
> > > if (get_user(v, (__u32 __user *)addr))
> > > return -EFAULT;
> > > if (v)
> > > return -E2BIG;
> > > }
> > > <read the rest like you currently do>
>
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down. Then mask the
> bytes you don't want to check out and see if there's anything
> left.
>
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
>
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard. Incidentally, memchr_inv() is
> > > an overkill in this case...
> >
> > Why is memchr_inv() overkill?
>
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be. All you need is the same logics
> as in "from userland" case
> if (!count)
> return true;
> offset = (unsigned long)from & 7
> p = (u64 *)(from - offset);
> v = *p++;
> if (offset) { // unaligned
> count += offset;
> v &= ~aligned_byte_mask(offset); // see strnlen_user.c
> }
> while (count > 8) {
> if (v)
> return false;
> v = *p++;
> count -= 8;
> }
> if (count != 8)
> v &= aligned_byte_mask(count);
> return v == 0;
>
> All there is to it...
... and __user case would be pretty much this with
if (user_access_begin(from, count)) {
....
user_access_end();
}
wrapped around the damn thing - again, see strnlen_user.c, with
unsafe_get_user(v, p++, efault);
instead of those
v = *p++;
Calling conventions might need some thinking - it might be
* all read, all zeroes
* non-zero found
* read failed
so we probably want to map the "all zeroes" case to 0,
"read failed" to -EFAULT and "non-zero found" to something
else. Might be positive, might be some other -E.... - not
sure if E2BIG (or EFBIG) makes much sense here. Need to
look at the users...
On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <[email protected]> wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> >
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> >
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)? Why push the random policy into the place where
> > it doesn't belong?
> >
> > Seriously, what's the point? If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
>
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P
I think we should just do a really dumb, easy to understand minimal
thing for now. It could even just be what every caller is doing right
now anyway with the get_user() loop.
The only way to settle memchr_inv() vs all the other clever ways
suggested here is to benchmark it and see if it matters *for the current
users* of this helper. If it does, great we can do it. If it doesn't why
bother having that argument right now?
Once we somehow end up in a possible world where we apparently have
decided it's a great idea to copy 2GB argument structures for a syscall
into or from the kernel we can start optimizing the hell out of this.
Before that and especially with current callers I honestly doubt it
matters whether we use memchr_inv() or while() {get_user()} loops.
Christian
On Thu, 2019-09-05 at 06:19 +1000, Aleksa Sarai wrote:
> The most obvious syscall to add support for the new LOOKUP_* scoping
> flags would be openat(2). However, there are a few reasons why this is
> not the best course of action:
>
> * The new LOOKUP_* flags are intended to be security features, and
> openat(2) will silently ignore all unknown flags. This means that
> users would need to avoid foot-gunning themselves constantly when
> using this interface if it were part of openat(2). This can be fixed
> by having userspace libraries handle this for users[1], but should be
> avoided if possible.
>
> * Resolution scoping feels like a different operation to the existing
> O_* flags. And since openat(2) has limited flag space, it seems to be
> quite wasteful to clutter it with 5 flags that are all
> resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
> its entire purpose is to error out if you encounter a trailing
> symlink -- not to scope resolution.
>
> * Other systems would be able to reimplement this syscall allowing for
> cross-OS standardisation rather than being hidden amongst O_* flags
> which may result in it not being used by all the parties that might
> want to use it (file servers, web servers, container runtimes, etc).
>
> * It gives us the opportunity to iterate on the O_PATH interface. In
> particular, the new @how->upgrade_mask field for fd re-opening is
> only possible because we have a clean slate without needing to re-use
> the ACC_MODE flag design nor the existing openat(2) @mode semantics.
>
> To this end, we introduce the openat2(2) syscall. It provides all of the
> features of openat(2) through the @how->flags argument, but also
> also provides a new @how->resolve argument which exposes RESOLVE_* flags
> that map to our new LOOKUP_* flags. It also eliminates the long-standing
> ugliness of variadic-open(2) by embedding it in a struct.
>
> In order to allow for userspace to lock down their usage of file
> descriptor re-opening, openat2(2) has the ability for users to disallow
> certain re-opening modes through @how->upgrade_mask. At the moment,
> there is no UPGRADE_NOEXEC.
>
> [1]: https://github.com/openSUSE/libpathrs
>
> Suggested-by: Christian Brauner <[email protected]>
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/include/asm/unistd.h | 2 +-
> arch/arm64/include/asm/unistd32.h | 2 +
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> fs/open.c | 94 ++++++++++++++++-----
> include/linux/fcntl.h | 19 ++++-
> include/linux/fs.h | 4 +-
> include/linux/syscalls.h | 14 ++-
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/fcntl.h | 42 +++++++++
> 24 files changed, 168 insertions(+), 30 deletions(-)
>
[...]
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..479baf2da10e 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,47 @@
>
> #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
>
> +/**
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates identically to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
> + * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
> + *
After thinking about this a bit, I wonder if we might be better served
with a new set of OA2_* flags instead of repurposing the O_* flags?
Yes, those flags are familiar, but this is an entirely new syscall. We
have a chance to make a fresh start. Does something like O_LARGEFILE
have any real place in openat2? I'd argue no.
Also, once you want to add a new flag, then we get into the mess of how
to document whether open/openat also support it. It'd be good to freeze
changes on those syscalls and aim to only introduce new functionality in
openat2.
That would also allow us to drop some flags from openat2 that we really
don't need, and maybe expand the flag space to 64 bits initially, to
allow for expansion into the future.
Thoughts?
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __u32 flags;
> + union {
> + __u16 mode;
> + __u16 upgrade_mask;
> + };
> + __u16 resolve;
> +};
> +
> +#define OPEN_HOW_SIZE_VER0 8 /* sizeof first published struct */
> +
Hmm, there is no version field. When you want to expand this in the
future, what is the plan? Add a new flag to indicate that it's some
length?
> +/* how->resolve flags for openat2(2). */
> +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
> + (includes bind-mounts). */
> +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
> + "magic-links". */
> +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
> + (implies OEXT_NO_MAGICLINKS) */
> +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
> + "..", symlinks, and absolute
> + paths which escape the dirfd. */
> +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
> + be scoped inside the dirfd
> + (similar to chroot(2)). */
> +
> +/* how->upgrade flags for openat2(2). */
> +/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
> +#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */
> +#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
>
> #endif /* _UAPI_LINUX_FCNTL_H */
--
Jeff Layton <[email protected]>
On Sat, Sep 7, 2019 at 5:40 AM Jeff Layton <[email protected]> wrote:
>
> After thinking about this a bit, I wonder if we might be better served
> with a new set of OA2_* flags instead of repurposing the O_* flags?
I'd hate to have yet _another_ set of translation functions, and
another chance of people just getting it wrong either in user space or
the kernel.
So no. Let's not make another set of flags that has no sane way to
have type-safety to avoid more confusion.
The new flags that _only_ work with openat2() might be named with a
prefix/suffix to mark that, but I'm not sure it's a huge deal.
Linus
On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski <[email protected]> wrote:
>
> Linus, you rejected resolveat() because you wanted a *nice* API
No. I rejected resoveat() because it was a completely broken garbage
API that couldn't do even basic stuff right (like O_CREAT).
We have a ton of flag space in the new openat2() model, we might as
well leave the old flags alone that people are (a) used to and (b) we
have code to support _anyway_.
Making up a new flag namespace is only going to cause us - and users -
more work, and more confusion. For no actual advantage. It's not going
to be "cleaner". It's just going to be worse.
Linus
> On Sep 7, 2019, at 9:58 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Sat, Sep 7, 2019 at 5:40 AM Jeff Layton <[email protected]> wrote:
>>
>> After thinking about this a bit, I wonder if we might be better served
>> with a new set of OA2_* flags instead of repurposing the O_* flags?
>
> I'd hate to have yet _another_ set of translation functions, and
> another chance of people just getting it wrong either in user space or
> the kernel.
>
> So no. Let's not make another set of flags that has no sane way to
> have type-safety to avoid more confusion.
>
> The new flags that _only_ work with openat2() might be named with a
> prefix/suffix to mark that, but I'm not sure it's a huge deal.
>
>
I agree with the philosophy, but I think it doesn’t apply in this case. Here are the flags:
O_RDONLY, O_WRONLY, O_RDWR: not even a proper bitmask. The kernel already has the FMODE_ bits to make this make sense. How about we make the openat2 permission bits consistent with the internal representation and let the O_ permission bits remain as an awful translation. The kernel already translates like this, and it already sucks.
O_CREAT, O_TMPFILE, O_NOCTTY, O_TRUNC: not modes on the fd at all. These affect the meaning of open(). Heck, for openat2, NOCTTY should be this default.
O_EXCL: hopelessly overloaded.
O_APPEND, O_DIRECT, O_SYNC, O_DSYNC, O_LARGEFILE, O_NOATIME, O_PATH, O_NONBLOCK: genuine mode bits
O_CLOEXEC: special because it affects the fd, not the struct file.
Linus, you rejected resolveat() because you wanted a *nice* API that people would use and that might even be adopted by other OSes. Let’s please not make openat2() be a giant pile of crap in the name of consistency with open(). open(), frankly, is horrible.
> On Sep 7, 2019, at 10:45 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski <[email protected]> wrote:
>>
>> Linus, you rejected resolveat() because you wanted a *nice* API
>
> No. I rejected resoveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
>
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
>
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.
>
>
If we keep all the flag bits in the same mask with the same values, then we’re stuck with O_RDONLY=0 and everything that implies. We’ll have UPGRADE_READ that works differently from the missing plain-old-READ bit, and we can’t express execute-only-no-read-or-write. This sucks.
Can we at least split the permission bits into their own mask and make bits 0 and 1 illegal in the main set of flags in openat2?
There’s another thread going on right now about adding a bit along the lines of “MAYEXEC”, and one of the conclusions was that it should wait for openat2 so that it can have same semantics. If we’re stuck with O_RDONLY and friends, then MAYEXEC is doomed to being at least a bit nonsensical.
As an analogy, AMD64 introduced bigger PTEs but kept the same nonsense encoding of read and write permission. And then we got NX, and now we’re getting little holes in the encoding stolen by CET to mean new silly things. I don’t know if you’ve been following the various rounds of patches, but it is truly horrible. The mapping from meaning to the actual bits is *shit*, and AMD64 should have made a clean break instead.
open()’s permission bits are basically the same situation. And the kernel *already* has a non-type-safe translation layer. Please, please let openat2() at least get rid of the turd in open()’s bits 0 and 1.
On 2019-09-07, Jeff Layton <[email protected]> wrote:
> On Thu, 2019-09-05 at 06:19 +1000, Aleksa Sarai wrote:
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > + __u32 flags;
> > + union {
> > + __u16 mode;
> > + __u16 upgrade_mask;
> > + };
> > + __u16 resolve;
> > +};
> > +
> > +#define OPEN_HOW_SIZE_VER0 8 /* sizeof first published struct */
> > +
>
> Hmm, there is no version field. When you want to expand this in the
> future, what is the plan? Add a new flag to indicate that it's some
> length?
The "version number" is the size of the struct. Any extensions we make
are appended to the struct (openat2 now takes a size_t argument), and
the new copy_struct_{to,from}_user() helpers handle all of the
permutations of {old,new} kernel and {old,new} user space.
This is how clone3(), sched_[gs]etattr() and perf_event_open() all
operate (all of the sigset syscalls operate similarly but don't
gracefully handle different kernel vintages -- you just get -EINVAL).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Linus Torvalds <[email protected]> wrote:
> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Linus, you rejected resolveat() because you wanted a *nice* API
>
> No. I rejected resoveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
>
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
>
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.
I suspect there is a "add a clean new flags namespace" analogy to the
classic "add a clean new standard" XKCD:
https://xkcd.com/927/
Thanks,
Ingo
On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > > On 2019-09-05, Peter Zijlstra <[email protected]> wrote:
> > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > > +/**
> > > > > + * copy_struct_to_user: copy a struct to user space
> > > > > + * @dst: Destination address, in user space.
> > > > > + * @usize: Size of @dst struct.
> > > > > + * @src: Source address, in kernel space.
> > > > > + * @ksize: Size of @src struct.
> > > > > + *
> > > > > + * Copies a struct from kernel space to user space, in a way that guarantees
> > > > > + * backwards-compatibility for struct syscall arguments (as long as future
> > > > > + * struct extensions are made such that all new fields are *appended* to the
> > > > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > > > + * struct).
> > > > > + *
> > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
> > > > > + * The recommended usage is something like the following:
> > > > > + *
> > > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > > + * {
> > > > > + * int err;
> > > > > + * struct foo karg = {};
> > > > > + *
> > > > > + * // do something with karg
> > > > > + *
> > > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > > + * if (err)
> > > > > + * return err;
> > > > > + *
> > > > > + * // ...
> > > > > + * }
> > > > > + *
> > > > > + * There are three cases to consider:
> > > > > + * * If @usize == @ksize, then it's copied verbatim.
> > > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an
> > > > > + * older user space. In order to avoid user space getting incomplete
> > > > > + * information (new fields might be important), all trailing bytes in @src
> > > > > + * (@ksize - @usize) must be zerored
> > > >
> > > > s/zerored/zero/, right?
> > >
> > > It should've been "zeroed".
> >
> > That reads wrong to me; that way it reads like this function must take
> > that action and zero out the 'rest'; which is just wrong.
> >
> > This function must verify those bytes are zero, not make them zero.
> >
> > > > > , otherwise -EFBIG is returned.
> > > >
> > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> > >
> > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > > struct passed to user-space. I would personally have preferred EMSGSIZE
> > > instead of EFBIG, but felt using the existing error codes would be less
> > > confusing.
> >
> > Sadly a recent commit:
> >
> > 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code")
> >
> > Made the situation even 'worse'.
>
> And thinking more about things; I'm not convinced the above patch is
> actually right.
>
> Do we really want to simply truncate all the attributes of the task?
>
> And should we not at least set sched_flags when there are non-default
> clamp values applied?
>
> See; that is I think the primary bug that had chrt failing; we tried to
> publish the default clamp values as !0.
I just saw this patch in -rc8 -- should I even attempt to port
sched_getattr(2) to copy_struct_to_user()? I agree that publishing a
default non-zero value is a mistake -- once you do that, old user space
will either get confused or lose information.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai <[email protected]> wrote:
> The ability for userspace to "re-open" file descriptors through
> /proc/self/fd has been a very useful tool for all sorts of usecases
> (container runtimes are one common example). However, the current
> interface for doing this has resulted in some pretty subtle security
> holes. Userspace can re-open a file descriptor with more permissions
> than the original, which can result in cases such as /proc/$pid/exe
> being re-opened O_RDWR at a later date even though (by definition)
> /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> the results can get even more confusing.
[...]
> Instead we have to restrict it in such a way that it doesn't break
> (good) users but does block potential attackers. The solution applied in
> this patch is to restrict *re-opening* (not resolution through)
> magic-links by requiring that mode of the link be obeyed. Normal
> symlinks have modes of a+rwx but magic-links have other modes. These
> magic-link modes were historically ignored during path resolution, but
> they've now been re-purposed for more useful ends.
Thanks for dealing with this issue!
[...]
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..54d57dad0f91 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
>
> nd->path = *path;
> nd->inode = nd->path.dentry->d_inode;
> - nd->flags |= LOOKUP_JUMPED;
> + nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> }
[...]
> +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> + fmode_t *opath_mask)
> +{
> + struct inode *inode = nd->link_inode;
> + fmode_t upgrade_mask = 0;
> +
> + /* Was the trailing_symlink() a magic-link? */
> + if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> + return 0;
> +
> + /*
> + * Figure out the upgrade-mask of the link_inode. Since these aren't
> + * strictly POSIX semantics we don't do an acl_permission_check() here,
> + * so we only care that at least one bit is set for each upgrade-mode.
> + */
> + if (inode->i_mode & S_IRUGO)
> + upgrade_mask |= FMODE_PATH_READ;
> + if (inode->i_mode & S_IWUGO)
> + upgrade_mask |= FMODE_PATH_WRITE;
> + /* Restrict the O_PATH upgrade-mask of the caller. */
> + if (opath_mask)
> + *opath_mask &= upgrade_mask;
> + return may_open_magiclink(upgrade_mask, acc_mode);
> }
This looks racy because entries in the file descriptor table can be
switched out as long as task->files->file_lock isn't held. Unless I'm
missing something, something like the following (untested) would
bypass this restriction:
int readonly_fd = ...; /* some read-only fd we want to reopen as writable */
int writable_fd = open("/dev/null", O_RDWR);
int flippy_fd = dup(writable_fd);
char flippy_fd_path[100];
sprintf(flippy_fd_path, "/proc/%d/fd/%d", getpid(), flippy_fd);
if (fork() == 0) {
while (1) {
int reopened_fd = open(flippy_fd_path, O_RDWR);
if (reopened_fd == -1) continue;
char reopened_fd_path[100];
sprintf(reopened_fd_path, "/proc/self/fd/%d", reopened_fd);
char reopened_fd_target[1000];
int target_len = readlink(reopened_fd_path, reopened_fd_target,
sizeof(reopened_fd_target)-1);
reopened_fd_target[target_len] = 0;
if (strcmp(reopened_fd_target, "/dev/null"))
printf("managed to reopen as writable\n");
close(reopened_fd);
}
} else {
while (1) {
dup2(readonly_fd, flippy_fd);
dup2(writable_fd, flippy_fd);
}
}
Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
link_mode through from an out-argument of .proc_get_link()? Then
proc_fd_link() could grab the proper mode in a race-free manner. And
nd_jump_link() could stash the mode in the nameidata.
A sketch of how I imagine that:
===============================
diff --git a/fs/namei.c b/fs/namei.c
index 6b936038319b..14c6790203c7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -506,6 +506,7 @@ struct nameidata {
struct inode *link_inode;
unsigned root_seq;
int dfd;
+ umode_t last_link_mode;
} __randomize_layout;
static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -890,7 +891,7 @@ static int nd_jump_root(struct nameidata *nd)
* Helper to directly jump to a known parsed path from ->get_link,
* caller must have taken a reference to path beforehand.
*/
-void nd_jump_link(struct path *path)
+void nd_jump_link(struct path *path, umode_t link_mode)
{
struct nameidata *nd = current->nameidata;
path_put(&nd->path);
@@ -898,6 +899,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
+ nd->last_link_mode = link_mode;
}
static inline void put_link(struct nameidata *nd)
@@ -3654,9 +3656,9 @@ static int trailing_magiclink(struct nameidata
*nd, int acc_mode,
* strictly POSIX semantics we don't do an acl_permission_check() here,
* so we only care that at least one bit is set for each upgrade-mode.
*/
- if (inode->i_mode & S_IRUGO)
+ if (nd->last_link_mode & S_IRUGO)
upgrade_mask |= FMODE_PATH_READ;
- if (inode->i_mode & S_IWUGO)
+ if (nd->last_link_mode & S_IWUGO)
upgrade_mask |= FMODE_PATH_WRITE;
/* Restrict the O_PATH upgrade-mask of the caller. */
if (opath_mask)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 297242174402..af0218447571 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1614,6 +1614,7 @@ static const char *proc_pid_get_link(struct
dentry *dentry,
{
struct path path;
int error = -EACCES;
+ umode_t link_mode;
if (!dentry)
return ERR_PTR(-ECHILD);
@@ -1622,11 +1623,11 @@ static const char *proc_pid_get_link(struct
dentry *dentry,
if (!proc_fd_access_allowed(inode))
goto out;
- error = PROC_I(inode)->op.proc_get_link(dentry, &path);
+ error = PROC_I(inode)->op.proc_get_link(dentry, &path, &link_mode);
if (error)
goto out;
- nd_jump_link(&path);
+ nd_jump_link(&path, link_mode);
return NULL;
out:
return ERR_PTR(error);
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 9b7d8becb002..9c1d247177b1 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -163,7 +163,8 @@ static const struct dentry_operations
tid_fd_dentry_operations = {
.d_delete = pid_delete_dentry,
};
-static int proc_fd_link(struct dentry *dentry, struct path *path)
+static int proc_fd_link(struct dentry *dentry, struct path *path,
+ umode_t *link_mode)
{
struct files_struct *files = NULL;
struct task_struct *task;
@@ -184,6 +185,7 @@ static int proc_fd_link(struct dentry *dentry,
struct path *path)
if (fd_file) {
*path = fd_file->f_path;
path_get(&fd_file->f_path);
+ *link_mode = /* something based on fd_file->f_mode */;
ret = 0;
}
spin_unlock(&files->file_lock);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..a090fff984ed 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -74,7 +74,7 @@ extern struct kmem_cache *proc_dir_entry_cache;
void pde_free(struct proc_dir_entry *pde);
union proc_op {
- int (*proc_get_link)(struct dentry *, struct path *);
+ int (*proc_get_link)(struct dentry *, struct path *, umode_t *);
int (*proc_show)(struct seq_file *m,
struct pid_namespace *ns, struct pid *pid,
struct task_struct *task);
===============================
On 2019-09-17, Jann Horn <[email protected]> wrote:
> On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai <[email protected]> wrote:
> > The ability for userspace to "re-open" file descriptors through
> > /proc/self/fd has been a very useful tool for all sorts of usecases
> > (container runtimes are one common example). However, the current
> > interface for doing this has resulted in some pretty subtle security
> > holes. Userspace can re-open a file descriptor with more permissions
> > than the original, which can result in cases such as /proc/$pid/exe
> > being re-opened O_RDWR at a later date even though (by definition)
> > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > the results can get even more confusing.
> [...]
> > Instead we have to restrict it in such a way that it doesn't break
> > (good) users but does block potential attackers. The solution applied in
> > this patch is to restrict *re-opening* (not resolution through)
> > magic-links by requiring that mode of the link be obeyed. Normal
> > symlinks have modes of a+rwx but magic-links have other modes. These
> > magic-link modes were historically ignored during path resolution, but
> > they've now been re-purposed for more useful ends.
>
> Thanks for dealing with this issue!
>
> [...]
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..54d57dad0f91 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> >
> > nd->path = *path;
> > nd->inode = nd->path.dentry->d_inode;
> > - nd->flags |= LOOKUP_JUMPED;
> > + nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> > }
> [...]
> > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > + fmode_t *opath_mask)
> > +{
> > + struct inode *inode = nd->link_inode;
> > + fmode_t upgrade_mask = 0;
> > +
> > + /* Was the trailing_symlink() a magic-link? */
> > + if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > + return 0;
> > +
> > + /*
> > + * Figure out the upgrade-mask of the link_inode. Since these aren't
> > + * strictly POSIX semantics we don't do an acl_permission_check() here,
> > + * so we only care that at least one bit is set for each upgrade-mode.
> > + */
> > + if (inode->i_mode & S_IRUGO)
> > + upgrade_mask |= FMODE_PATH_READ;
> > + if (inode->i_mode & S_IWUGO)
> > + upgrade_mask |= FMODE_PATH_WRITE;
> > + /* Restrict the O_PATH upgrade-mask of the caller. */
> > + if (opath_mask)
> > + *opath_mask &= upgrade_mask;
> > + return may_open_magiclink(upgrade_mask, acc_mode);
> > }
>
> This looks racy because entries in the file descriptor table can be
> switched out as long as task->files->file_lock isn't held. Unless I'm
> missing something, something like the following (untested) would
> bypass this restriction:
You're absolutely right -- good catch!
> Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> link_mode through from an out-argument of .proc_get_link()? Then
> proc_fd_link() could grab the proper mode in a race-free manner. And
> nd_jump_link() could stash the mode in the nameidata.
This indeed does appear to be the simplest solution -- I'm currently
testing a variation of the patch you proposed (with a few extra bits to
deal with nd_jump_link and proc_get_link being used elsewhere).
I'll include this change (assuming it fixes the flaw you found) in the
v13 series I'll send around next week. Thanks, Jann!
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On 2019-09-18, Aleksa Sarai <[email protected]> wrote:
> On 2019-09-17, Jann Horn <[email protected]> wrote:
> > On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai <[email protected]> wrote:
> > > The ability for userspace to "re-open" file descriptors through
> > > /proc/self/fd has been a very useful tool for all sorts of usecases
> > > (container runtimes are one common example). However, the current
> > > interface for doing this has resulted in some pretty subtle security
> > > holes. Userspace can re-open a file descriptor with more permissions
> > > than the original, which can result in cases such as /proc/$pid/exe
> > > being re-opened O_RDWR at a later date even though (by definition)
> > > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > > the results can get even more confusing.
> > [...]
> > > Instead we have to restrict it in such a way that it doesn't break
> > > (good) users but does block potential attackers. The solution applied in
> > > this patch is to restrict *re-opening* (not resolution through)
> > > magic-links by requiring that mode of the link be obeyed. Normal
> > > symlinks have modes of a+rwx but magic-links have other modes. These
> > > magic-link modes were historically ignored during path resolution, but
> > > they've now been re-purposed for more useful ends.
> >
> > Thanks for dealing with this issue!
> >
> > [...]
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 209c51a5226c..54d57dad0f91 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> > >
> > > nd->path = *path;
> > > nd->inode = nd->path.dentry->d_inode;
> > > - nd->flags |= LOOKUP_JUMPED;
> > > + nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> > > }
> > [...]
> > > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > > + fmode_t *opath_mask)
> > > +{
> > > + struct inode *inode = nd->link_inode;
> > > + fmode_t upgrade_mask = 0;
> > > +
> > > + /* Was the trailing_symlink() a magic-link? */
> > > + if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > > + return 0;
> > > +
> > > + /*
> > > + * Figure out the upgrade-mask of the link_inode. Since these aren't
> > > + * strictly POSIX semantics we don't do an acl_permission_check() here,
> > > + * so we only care that at least one bit is set for each upgrade-mode.
> > > + */
> > > + if (inode->i_mode & S_IRUGO)
> > > + upgrade_mask |= FMODE_PATH_READ;
> > > + if (inode->i_mode & S_IWUGO)
> > > + upgrade_mask |= FMODE_PATH_WRITE;
> > > + /* Restrict the O_PATH upgrade-mask of the caller. */
> > > + if (opath_mask)
> > > + *opath_mask &= upgrade_mask;
> > > + return may_open_magiclink(upgrade_mask, acc_mode);
> > > }
> >
> > This looks racy because entries in the file descriptor table can be
> > switched out as long as task->files->file_lock isn't held. Unless I'm
> > missing something, something like the following (untested) would
> > bypass this restriction:
>
> You're absolutely right -- good catch!
>
> > Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> > path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> > link_mode through from an out-argument of .proc_get_link()? Then
> > proc_fd_link() could grab the proper mode in a race-free manner. And
> > nd_jump_link() could stash the mode in the nameidata.
>
> This indeed does appear to be the simplest solution -- I'm currently
> testing a variation of the patch you proposed (with a few extra bits to
> deal with nd_jump_link and proc_get_link being used elsewhere).
>
> I'll include this change (assuming it fixes the flaw you found) in the
> v13 series I'll send around next week. Thanks, Jann!
In case you're interested -- I've also included a selftest based on this
attack in my series (though it uses CLONE_FILES so that we could also
test O_EMPTYPATH, which wasn't affected because it didn't go through
procfs and thus couldn't hit the "outdated inode->i_mode" problem).
The attack script succeeds around 20% of the time on the original
patchset, and with the updated patchset it doesn't succeed in several
hundred thousand attempts (which I've repeated a few times).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>