2021-12-08 04:48:19

by Peter Collingbourne

Subject: [PATCH v3 0/6] kernel: introduce uaccess logging

This patch series introduces a kernel feature known as uaccess
logging, which allows userspace programs to observe the address
and size of the uaccesses performed by the kernel while servicing
a syscall. More details on the motivation for and interface to
this feature are available in the file
Documentation/admin-guide/uaccess-logging.rst added by the final
patch in the series.
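
To give a flavour of the interface, below is a minimal userspace
sketch, condensed from the example in the documentation patch. It
assumes the uapi headers added by this series are installed
(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR via <sys/prctl.h>, and the
descriptor/entry structs from <linux/uaccess-buffer.h>):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/utsname.h>
  #include <linux/uaccess-buffer.h>

  int main(void)
  {
      struct uaccess_buffer_entry entries[64];
      struct uaccess_descriptor desc;
      uint64_t desc_addr = 0;

      /* Tell the kernel where to find the uaccess descriptor address. */
      prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);

      /* Arm logging for the next syscall. */
      desc.addr = (uint64_t)entries;
      desc.size = 64;
      desc_addr = (uint64_t)&desc;

      struct utsname un;
      uname(&un);

      /*
       * The kernel cleared desc_addr and updated desc to describe the
       * unused tail of the buffer, so the entries logged for the uname()
       * syscall are [entries, desc.addr).
       */
      struct uaccess_buffer_entry *end =
          (struct uaccess_buffer_entry *)(unsigned long)desc.addr;
      for (struct uaccess_buffer_entry *e = entries; e != end; ++e)
          printf("%s at 0x%lx size 0x%lx\n",
                 e->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
                 (unsigned long)e->addr, (unsigned long)e->size);
      return 0;
  }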

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64
and architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

The proposed interface is the result of numerous iterations and
prototyping and is based on a proposal by Dmitry Vyukov. The interface
preserves the correspondence between uaccess log identity and syscall
identity while tolerating incoming asynchronous signals in the interval
between setting up the logging and the actual syscall. We considered
a number of alternative designs but rejected them for various reasons:

- The design from v1 of this patch [1] proposed notifying the kernel
of the address and size of the uaccess buffer via a prctl that
would also automatically mask and unmask asynchronous signals as
needed, but this would require multiple syscalls per "real" syscall,
harming performance.

- We considered extending the syscall calling convention to
designate currently-unused registers to be used to pass the
location of the uaccess buffer, but this was rejected for being
architecture-specific.

- One idea that we considered involved using the stack pointer address
as a unique identifier for the syscall, but this would currently
need to be arch-specific, as we do not appear to have an
arch-generic way of retrieving the stack pointer; the userspace
side would also need some arch-specific code for this to work. It's
also possible that a longjmp() past the signal handler would make
the stack pointer address not unique enough for this purpose.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
to trace per-syscall. This is so that we can exclude syscalls that
should not be traced, such as syscalls that make up part of the
sanitizer implementation (to avoid infinite recursion when e.g. printing
an error report).

- Tracing would need to be synchronous in order to produce useful
stack traces. For example, this could be achieved using the new
SIGTRAP-on-perf-events mechanism. However, this would require logging
each access to the stack (in the form of a sigcontext), which is more
likely to overflow the stack because a sigcontext is much larger than
a uaccess buffer entry and the number of accesses is unbounded, in
contrast to the bounded buffer size passed to prctl(). An approach
based on signal handlers is also likely to fall foul of the
asynchronous signal issues mentioned previously, and would need
sigreturn to be handled specially (because it copies a sigcontext
from userspace), otherwise we could never return from the signal
handler. Furthermore, arguments to the trace events are not available
to SIGTRAP. (This on its own wouldn't be insurmountable though -- we
could add the arguments as fields to siginfo.)

- The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
-- e.g. trace_pipe_raw -- gives access to the internal ring buffer,
but I don't think it's usable here because the buffer is per-CPU and
not per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs may
only be loaded as root, among other potential headaches.

[1] https://lore.kernel.org/all/[email protected]/

Peter Collingbourne (6):
include: split out uaccess instrumentation into a separate header
uaccess-buffer: add core code
fs: use copy_from_user_nolog() to copy mount() data
uaccess-buffer: add CONFIG_GENERIC_ENTRY support
arm64: add support for uaccess logging
Documentation: document uaccess logging

Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/uaccess-logging.rst | 151 ++++++++++++++++++
arch/Kconfig | 6 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/thread_info.h | 7 +-
arch/arm64/kernel/ptrace.c | 7 +
arch/arm64/kernel/signal.c | 5 +
arch/arm64/kernel/syscall.c | 1 +
fs/exec.c | 3 +
fs/namespace.c | 8 +-
include/linux/entry-common.h | 2 +
include/linux/instrumented-uaccess.h | 53 ++++++
include/linux/instrumented.h | 34 ----
include/linux/sched.h | 5 +
include/linux/thread_info.h | 4 +
include/linux/uaccess-buffer-info.h | 46 ++++++
include/linux/uaccess-buffer.h | 112 +++++++++++++
include/linux/uaccess.h | 2 +-
include/uapi/linux/prctl.h | 3 +
include/uapi/linux/uaccess-buffer.h | 27 ++++
kernel/Makefile | 1 +
kernel/bpf/helpers.c | 7 +-
kernel/entry/common.c | 10 ++
kernel/fork.c | 4 +
kernel/signal.c | 4 +-
kernel/sys.c | 6 +
kernel/uaccess-buffer.c | 129 +++++++++++++++
lib/iov_iter.c | 2 +-
lib/usercopy.c | 2 +-
29 files changed, 602 insertions(+), 41 deletions(-)
create mode 100644 Documentation/admin-guide/uaccess-logging.rst
create mode 100644 include/linux/instrumented-uaccess.h
create mode 100644 include/linux/uaccess-buffer-info.h
create mode 100644 include/linux/uaccess-buffer.h
create mode 100644 include/uapi/linux/uaccess-buffer.h
create mode 100644 kernel/uaccess-buffer.c

--
2.34.1.173.g76aa8bc2d0-goog



2021-12-08 04:48:27

by Peter Collingbourne

Subject: [PATCH v3 1/6] include: split out uaccess instrumentation into a separate header

In an upcoming change we are going to add uaccess instrumentation
that uses inline access to struct task_struct from the
instrumentation routines. Because instrumentation.h is included
from many places including (recursively) from sched.h this would
otherwise lead to a circular dependency. Break the dependency by
moving uaccess instrumentation routines into a separate header,
instrumented-uaccess.h.

Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
Signed-off-by: Peter Collingbourne <[email protected]>
---
include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
include/linux/instrumented.h | 34 -------------------
include/linux/uaccess.h | 2 +-
lib/iov_iter.c | 2 +-
lib/usercopy.c | 2 +-
5 files changed, 52 insertions(+), 37 deletions(-)
create mode 100644 include/linux/instrumented-uaccess.h

diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
new file mode 100644
index 000000000000..ece549088e50
--- /dev/null
+++ b/include/linux/instrumented-uaccess.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This header provides generic wrappers for memory access instrumentation for
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ */
+#ifndef _LINUX_INSTRUMENTED_UACCESS_H
+#define _LINUX_INSTRUMENTED_UACCESS_H
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kcsan-checks.h>
+#include <linux/types.h>
+
+/**
+ * instrument_copy_to_user - instrument reads of copy_to_user
+ *
+ * Instrument reads from kernel memory, that are due to copy_to_user (and
+ * variants). The instrumentation must be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
+{
+ kasan_check_read(from, n);
+ kcsan_check_read(from, n);
+}
+
+/**
+ * instrument_copy_from_user - instrument writes of copy_from_user
+ *
+ * Instrument writes to kernel memory, that are due to copy_from_user (and
+ * variants). The instrumentation should be inserted before the accesses.
+ *
+ * @to destination address
+ * @from source address
+ * @n number of bytes to copy
+ */
+static __always_inline void
+instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
+{
+ kasan_check_write(to, n);
+ kcsan_check_write(to, n);
+}
+
+#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
index 42faebbaa202..b68f415510c7 100644
--- a/include/linux/instrumented.h
+++ b/include/linux/instrumented.h
@@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
kcsan_check_atomic_read_write(v, size);
}

-/**
- * instrument_copy_to_user - instrument reads of copy_to_user
- *
- * Instrument reads from kernel memory, that are due to copy_to_user (and
- * variants). The instrumentation must be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
-{
- kasan_check_read(from, n);
- kcsan_check_read(from, n);
-}
-
-/**
- * instrument_copy_from_user - instrument writes of copy_from_user
- *
- * Instrument writes to kernel memory, that are due to copy_from_user (and
- * variants). The instrumentation should be inserted before the accesses.
- *
- * @to destination address
- * @from source address
- * @n number of bytes to copy
- */
-static __always_inline void
-instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
-{
- kasan_check_write(to, n);
- kcsan_check_write(to, n);
-}
-
#endif /* _LINUX_INSTRUMENTED_H */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac0394087f7d..c0c467e39657 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -3,7 +3,7 @@
#define __LINUX_UACCESS_H__

#include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
#include <linux/minmax.h>
#include <linux/sched.h>
#include <linux/thread_info.h>
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..3f9dc6df7102 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -12,7 +12,7 @@
#include <linux/compat.h>
#include <net/checksum.h>
#include <linux/scatterlist.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>

#define PIPE_PARANOIA /* for now */

diff --git a/lib/usercopy.c b/lib/usercopy.c
index 7413dd300516..1cd188e62d06 100644
--- a/lib/usercopy.c
+++ b/lib/usercopy.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/bitops.h>
#include <linux/fault-inject-usercopy.h>
-#include <linux/instrumented.h>
+#include <linux/instrumented-uaccess.h>
#include <linux/uaccess.h>

/* out-of-line parts */
--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 04:48:29

by Peter Collingbourne

Subject: [PATCH v3 2/6] uaccess-buffer: add core code

Add the core code to support uaccess logging. Subsequent patches will
hook this up to the arch-specific kernel entry and exit code for
certain architectures.

Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
Signed-off-by: Peter Collingbourne <[email protected]>
---
v3:
- performance optimizations for entry/exit code
- don't use kcur == NULL to mean overflow
- fix potential double free in clone()
- don't allocate a new kernel-side uaccess buffer for each syscall
- fix uaccess buffer leak on exit
- fix some sparse warnings

v2:
- New interface that avoids multiple syscalls per real syscall and
is arch-generic
- Avoid logging uaccesses done by BPF programs
- Add documentation
- Split up into multiple patches
- Various code moves, renames etc as requested by Marco

fs/exec.c | 3 +
include/linux/instrumented-uaccess.h | 6 +-
include/linux/sched.h | 5 ++
include/linux/uaccess-buffer-info.h | 46 ++++++++++
include/linux/uaccess-buffer.h | 112 +++++++++++++++++++++++
include/uapi/linux/prctl.h | 3 +
include/uapi/linux/uaccess-buffer.h | 27 ++++++
kernel/Makefile | 1 +
kernel/bpf/helpers.c | 7 +-
kernel/fork.c | 4 +
kernel/signal.c | 4 +-
kernel/sys.c | 6 ++
kernel/uaccess-buffer.c | 129 +++++++++++++++++++++++++++
13 files changed, 350 insertions(+), 3 deletions(-)
create mode 100644 include/linux/uaccess-buffer-info.h
create mode 100644 include/linux/uaccess-buffer.h
create mode 100644 include/uapi/linux/uaccess-buffer.h
create mode 100644 kernel/uaccess-buffer.c

diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..c9975e790f30 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
#include <linux/vmalloc.h>
#include <linux/io_uring.h>
#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>

#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
me->personality &= ~bprm->per_clear;

clear_syscall_work_syscall_user_dispatch(me);
+ uaccess_buffer_set_descriptor_addr_addr(0);
+ uaccess_buffer_free(current);

/*
* We have to apply CLOEXEC before we change whether the process is
diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
index ece549088e50..b967f4436d15 100644
--- a/include/linux/instrumented-uaccess.h
+++ b/include/linux/instrumented-uaccess.h
@@ -2,7 +2,8 @@

/*
* This header provides generic wrappers for memory access instrumentation for
- * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
+ * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
+ * uaccess buffers.
*/
#ifndef _LINUX_INSTRUMENTED_UACCESS_H
#define _LINUX_INSTRUMENTED_UACCESS_H
@@ -11,6 +12,7 @@
#include <linux/kasan-checks.h>
#include <linux/kcsan-checks.h>
#include <linux/types.h>
+#include <linux/uaccess-buffer.h>

/**
* instrument_copy_to_user - instrument reads of copy_to_user
@@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
{
kasan_check_read(from, n);
kcsan_check_read(from, n);
+ uaccess_buffer_log_write(to, n);
}

/**
@@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
{
kasan_check_write(to, n);
kcsan_check_write(to, n);
+ uaccess_buffer_log_read(from, n);
}

#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..7c5278d7b57d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
#include <linux/rseq.h>
#include <linux/seqlock.h>
#include <linux/kcsan.h>
+#include <linux/uaccess-buffer-info.h>
#include <asm/kmap_size.h>

/* task_struct member predeclarations (sorted alphabetically): */
@@ -1484,6 +1485,10 @@ struct task_struct {
struct callback_head l1d_flush_kill;
#endif

+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+ struct uaccess_buffer_info uaccess_buffer;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
new file mode 100644
index 000000000000..15e2d8f7c074
--- /dev/null
+++ b/include/linux/uaccess-buffer-info.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_INFO_H
+#define _LINUX_UACCESS_BUFFER_INFO_H
+
+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+
+struct uaccess_buffer_info {
+ /*
+ * The pointer to pointer to struct uaccess_descriptor. This is the
+ * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+ */
+ struct uaccess_descriptor __user *__user *desc_ptr_ptr;
+
+ /*
+ * The pointer to struct uaccess_descriptor read at syscall entry time.
+ */
+ struct uaccess_descriptor __user *desc_ptr;
+
+ /*
+ * A pointer to the kernel's temporary copy of the uaccess log for the
+ * current syscall. We log to a kernel buffer in order to avoid leaking
+ * timing information to userspace.
+ */
+ struct uaccess_buffer_entry *kbegin;
+
+ /*
+ * The position of the next uaccess buffer entry for the current
+ * syscall, or NULL if we are not logging the current syscall.
+ */
+ struct uaccess_buffer_entry *kcur;
+
+ /*
+ * A pointer to the end of the kernel's uaccess log.
+ */
+ struct uaccess_buffer_entry *kend;
+
+ /*
+ * The pointer to the userspace uaccess log, as read from the
+ * struct uaccess_descriptor.
+ */
+ struct uaccess_buffer_entry __user *ubegin;
+};
+
+#endif
+
+#endif /* _LINUX_UACCESS_BUFFER_INFO_H */
diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..f2f46db274f3
--- /dev/null
+++ b/include/linux/uaccess-buffer.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_H
+#define _LINUX_UACCESS_BUFFER_H
+
+#include <linux/sched.h>
+#include <uapi/linux/uaccess-buffer.h>
+
+#include <asm-generic/errno-base.h>
+
+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+ return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
+}
+
+void __uaccess_buffer_syscall_entry(void);
+static inline void uaccess_buffer_syscall_entry(void)
+{
+ __uaccess_buffer_syscall_entry();
+}
+
+void __uaccess_buffer_syscall_exit(void);
+static inline void uaccess_buffer_syscall_exit(void)
+{
+ __uaccess_buffer_syscall_exit();
+}
+
+bool __uaccess_buffer_pre_exit_loop(void);
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+ if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
+ return false;
+ return __uaccess_buffer_pre_exit_loop();
+}
+
+void __uaccess_buffer_post_exit_loop(void);
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+ if (pending)
+ __uaccess_buffer_post_exit_loop();
+}
+
+static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+ current->uaccess_buffer.desc_ptr_ptr =
+ (struct uaccess_descriptor __user * __user *)addr;
+ if (addr)
+ set_syscall_work(UACCESS_BUFFER_ENTRY);
+ else
+ clear_syscall_work(UACCESS_BUFFER_ENTRY);
+ return 0;
+}
+
+size_t copy_from_user_nolog(void *to, const void __user *from, size_t len);
+
+void uaccess_buffer_free(struct task_struct *tsk);
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+ if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+ __uaccess_buffer_log_read(from, n);
+}
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+ if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
+ __uaccess_buffer_log_write(to, n);
+}
+
+#else
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+ return false;
+}
+static inline void uaccess_buffer_syscall_entry(void)
+{
+}
+static inline void uaccess_buffer_syscall_exit(void)
+{
+}
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+ return false;
+}
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+}
+static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+ return -EINVAL;
+}
+static inline void uaccess_buffer_free(struct task_struct *tsk)
+{
+}
+
+#define copy_from_user_nolog(to, from, len) copy_from_user(to, from, len)
+
+static inline void uaccess_buffer_log_read(const void __user *from,
+ unsigned long n)
+{
+}
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+}
+
+#endif
+
+#endif /* _LINUX_UACCESS_BUFFER_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index bb73e9a0b24f..74b37469c7b3 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -272,4 +272,7 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2

+/* Configure uaccess logging feature */
+#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..bf10f7c78857
--- /dev/null
+++ b/include/uapi/linux/uaccess-buffer.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
+#define _UAPI_LINUX_UACCESS_BUFFER_H
+
+#include <linux/types.h>
+
+/* Location of the uaccess log. */
+struct uaccess_descriptor {
+ /* Address of the uaccess_buffer_entry array. */
+ __u64 addr;
+ /* Size of the uaccess_buffer_entry array in number of elements. */
+ __u64 size;
+};
+
+/* Format of the entries in the uaccess log. */
+struct uaccess_buffer_entry {
+ /* Address being accessed. */
+ __u64 addr;
+ /* Number of bytes that were accessed. */
+ __u64 size;
+ /* UACCESS_BUFFER_* flags. */
+ __u64 flags;
+};
+
+#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
+
+#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 186c49582f45..d4d9be5146c3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
obj-$(CONFIG_CFI_CLANG) += cfi.o
+obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 649f07623df6..ab6520a633ef 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -15,6 +15,7 @@
#include <linux/pid_namespace.h>
#include <linux/proc_ns.h>
#include <linux/security.h>
+#include <linux/uaccess-buffer.h>

#include "../../lib/kstrtox.h"

@@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
const void __user *, user_ptr)
{
- int ret = copy_from_user(dst, user_ptr, size);
+ /*
+ * Avoid logging uaccesses here as the BPF program may not be following
+ * the uaccess log rules.
+ */
+ int ret = copy_from_user_nolog(dst, user_ptr, size);

if (unlikely(ret)) {
memset(dst, 0, size);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..8be2ca528a65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -96,6 +96,7 @@
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
+#include <linux/uaccess-buffer.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
+ uaccess_buffer_free(tsk);

if (!profile_handoff_task(tsk))
free_task(tsk);
@@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
if (memcg_charge_kernel_stack(tsk))
goto free_stack;

+ uaccess_buffer_free(orig);
+
stack_vm_area = task_stack_vm_area(tsk);

err = arch_dup_task_struct(tsk, orig);
diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..69bf21518bd0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -45,6 +45,7 @@
#include <linux/posix-timers.h>
#include <linux/cgroup.h>
#include <linux/audit.h>
+#include <linux/uaccess-buffer.h>

#define CREATE_TRACE_POINTS
#include <trace/events/signal.h>
@@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
if (sig_fatal(p, sig) &&
!(signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
- (sig == SIGKILL || !p->ptrace)) {
+ (sig == SIGKILL ||
+ !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
/*
* This signal will be fatal to the whole group.
*/
diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..c71a9a9c0f68 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/version.h>
#include <linux/ctype.h>
#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = uaccess_buffer_set_descriptor_addr_addr(arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
new file mode 100644
index 000000000000..088e43f7611c
--- /dev/null
+++ b/kernel/uaccess-buffer.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Support for uaccess logging via uaccess buffers.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#include <linux/compat.h>
+#include <linux/mm.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uaccess-buffer.h>
+
+static void uaccess_buffer_log(unsigned long addr, unsigned long size,
+ unsigned long flags)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_buffer_entry *entry = buf->kcur;
+
+ if (entry == buf->kend || unlikely(uaccess_kernel()))
+ return;
+ entry->addr = addr;
+ entry->size = size;
+ entry->flags = flags;
+
+ ++buf->kcur;
+}
+
+void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)from, n, 0);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_read);
+
+void __uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
+}
+EXPORT_SYMBOL(__uaccess_buffer_log_write);
+
+bool __uaccess_buffer_pre_exit_loop(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_descriptor __user *desc_ptr;
+ sigset_t tmp_mask;
+
+ if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
+ return false;
+
+ current->real_blocked = current->blocked;
+ sigfillset(&tmp_mask);
+ set_current_blocked(&tmp_mask);
+ return true;
+}
+
+void __uaccess_buffer_post_exit_loop(void)
+{
+ spin_lock_irq(&current->sighand->siglock);
+ current->blocked = current->real_blocked;
+ recalc_sigpending();
+ spin_unlock_irq(&current->sighand->siglock);
+}
+
+void uaccess_buffer_free(struct task_struct *tsk)
+{
+ struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
+
+ kfree(buf->kbegin);
+ clear_syscall_work(UACCESS_BUFFER_EXIT);
+ buf->kbegin = buf->kcur = buf->kend = NULL;
+}
+
+void __uaccess_buffer_syscall_entry(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_descriptor desc;
+
+ if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
+ put_user(0, buf->desc_ptr_ptr) ||
+ copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
+ return;
+
+ if (desc.size > 1024)
+ desc.size = 1024;
+
+ if (buf->kend - buf->kbegin != desc.size)
+ buf->kbegin =
+ krealloc_array(buf->kbegin, desc.size,
+ sizeof(struct uaccess_buffer_entry),
+ GFP_KERNEL);
+ if (!buf->kbegin)
+ return;
+
+ set_syscall_work(UACCESS_BUFFER_EXIT);
+ buf->kcur = buf->kbegin;
+ buf->kend = buf->kbegin + desc.size;
+ buf->ubegin =
+ (struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
+}
+
+void __uaccess_buffer_syscall_exit(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ u64 num_entries = buf->kcur - buf->kbegin;
+ struct uaccess_descriptor desc;
+
+ clear_syscall_work(UACCESS_BUFFER_EXIT);
+ desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
+ desc.size = buf->kend - buf->kcur;
+ buf->kcur = NULL;
+ if (copy_to_user(buf->ubegin, buf->kbegin,
+ num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
+ (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
+}
+
+size_t copy_from_user_nolog(void *to, const void __user *from, size_t len)
+{
+ size_t retval;
+
+ clear_syscall_work(UACCESS_BUFFER_EXIT);
+ retval = copy_from_user(to, from, len);
+ if (current->uaccess_buffer.kcur)
+ set_syscall_work(UACCESS_BUFFER_EXIT);
+ return retval;
+}
--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 04:48:36

by Peter Collingbourne

Subject: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

With uaccess logging the contract is that the kernel must not report
accessing more data than necessary, as this can lead to false positive
reports in downstream consumers. This generally works out of the box
when instrumenting copy_{from,to}_user(), but with the data argument
to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
much as we can, if the PAGE_SIZE sized access failed) and figure out
later how much we actually need.
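
To make that concrete, here is a purely illustrative userspace
fragment (the allocation size and mount arguments are hypothetical)
showing how the over-read would surface as a false positive in a
downstream consumer such as HWASan:

  #include <stdlib.h>
  #include <string.h>
  #include <sys/mount.h>

  void example(void)
  {
      /* A small heap allocation passed as the mount() data argument. */
      char *opts = malloc(100);
      strcpy(opts, "size=1M");
      mount("tmpfs", "/mnt", "tmpfs", 0, opts);
      /*
       * copy_mount_options() copies up to PAGE_SIZE bytes from opts with
       * copy_from_user(), so if that access were logged verbatim the
       * uaccess buffer would report a PAGE_SIZE-sized read of a 100-byte
       * allocation, which a sanitizer checking logged ranges against
       * allocation bounds would flag as out-of-bounds.
       */
      free(opts);
  }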

To prevent this from leading to a false positive report, use
copy_from_user_nolog(), which will prevent the access from being logged.
Recall that it is valid for the kernel to report accessing less
data than it actually accessed, as uaccess logging is a best-effort
mechanism for reporting uaccesses.

Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
Signed-off-by: Peter Collingbourne <[email protected]>
---
fs/namespace.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 659a8f39c61a..8f5f2aaca64e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -31,6 +31,7 @@
#include <uapi/linux/mount.h>
#include <linux/fs_context.h>
#include <linux/shmem_fs.h>
+#include <linux/uaccess-buffer.h>

#include "pnode.h"
#include "internal.h"
@@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
if (!copy)
return ERR_PTR(-ENOMEM);

- left = copy_from_user(copy, data, PAGE_SIZE);
+ /*
+ * Use copy_from_user_nolog to avoid reporting overly large accesses in
+ * the uaccess buffer, as this can lead to false positive reports in
+ * downstream consumers.
+ */
+ left = copy_from_user_nolog(copy, data, PAGE_SIZE);

/*
* Not all architectures have an exact copy_from_user(). Resort to
--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 04:48:39

by Peter Collingbourne

Subject: [PATCH v3 4/6] uaccess-buffer: add CONFIG_GENERIC_ENTRY support

Add uaccess logging support on architectures that use
CONFIG_GENERIC_ENTRY (currently only s390 and x86).

Link: https://linux-review.googlesource.com/id/I3c5eb19a7e4a1dbe6095f6971f7826c4b0663f7d
Signed-off-by: Peter Collingbourne <[email protected]>
---
arch/Kconfig | 6 ++++++
include/linux/entry-common.h | 2 ++
include/linux/thread_info.h | 4 ++++
kernel/entry/common.c | 10 ++++++++++
4 files changed, 22 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index d3c4ab249e9c..c4dcab5279ac 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -31,6 +31,7 @@ config HOTPLUG_SMT
bool

config GENERIC_ENTRY
+ select HAVE_ARCH_UACCESS_BUFFER
bool

config KPROBES
@@ -1312,6 +1313,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
config DYNAMIC_SIGFRAME
bool

+config HAVE_ARCH_UACCESS_BUFFER
+ bool
+ help
+ Select if the architecture's syscall entry/exit code supports uaccess buffers.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..973fcd1d48a3 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -42,12 +42,14 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_UACCESS_BUFFER_ENTRY | \
ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
+ SYSCALL_WORK_UACCESS_BUFFER_EXIT | \
ARCH_SYSCALL_WORK_EXIT)

/*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ad0c4e041030..b0f8ea86967f 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,8 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY,
+ SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT,
};

#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +57,8 @@ enum syscall_work_bit {
#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_UACCESS_BUFFER_ENTRY BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY)
+#define SYSCALL_WORK_UACCESS_BUFFER_EXIT BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT)
#endif

#include <asm/thread_info.h>
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..57c4bb01a554 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
#include <linux/livepatch.h>
#include <linux/audit.h>
#include <linux/tick.h>
+#include <linux/uaccess-buffer.h>

#include "common.h"

@@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret;
}

+ if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
+ uaccess_buffer_syscall_entry();
+
/* Either of the above might have changed the syscall number */
syscall = syscall_get_nr(current, regs);

@@ -197,14 +201,17 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+ bool uaccess_buffer_pending;

lockdep_assert_irqs_disabled();

/* Flush pending rcuog wakeup before the last need_resched() check */
tick_nohz_user_enter_prepare();

+ uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
+ uaccess_buffer_post_exit_loop(uaccess_buffer_pending);

arch_exit_to_user_mode_prepare(regs, ti_work);

@@ -247,6 +254,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)

audit_syscall_exit(regs);

+ if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
+ uaccess_buffer_syscall_exit();
+
if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
trace_sys_exit(regs, syscall_get_return_value(current, regs));

--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 04:48:42

by Peter Collingbourne

Subject: [PATCH v3 5/6] arm64: add support for uaccess logging

arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
uaccess logging directly to the architecture.

Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
Signed-off-by: Peter Collingbourne <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/thread_info.h | 7 ++++++-
arch/arm64/kernel/ptrace.c | 7 +++++++
arch/arm64/kernel/signal.c | 5 +++++
arch/arm64/kernel/syscall.c | 1 +
5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..6023946abe4a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -161,6 +161,7 @@ config ARM64
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ select HAVE_ARCH_UACCESS_BUFFER
select HAVE_ARCH_VMAP_STACK
select HAVE_ARM_SMCCC
select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index e1317b7c4525..0461b36251ea 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,8 @@ int arch_dup_task_struct(struct task_struct *dst,
#define TIF_SVE_VL_INHERIT 24 /* Inherit SVE vl_onexec across exec */
#define TIF_SSBD 25 /* Wants SSB mitigation */
#define TIF_TAGGED_ADDR 26 /* Allow tagged user addresses */
+#define TIF_UACCESS_BUFFER_ENTRY 27 /* thread has non-zero uaccess_desc_addr_addr */
+#define TIF_UACCESS_BUFFER_EXIT 28 /* thread has non-zero kcur */

#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
@@ -98,6 +100,8 @@ int arch_dup_task_struct(struct task_struct *dst,
#define _TIF_SVE (1 << TIF_SVE)
#define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT)
#define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_UACCESS_BUFFER_ENTRY (1 << TIF_UACCESS_BUFFER_ENTRY)
+#define _TIF_UACCESS_BUFFER_EXIT (1 << TIF_UACCESS_BUFFER_EXIT)

#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
@@ -106,7 +110,8 @@ int arch_dup_task_struct(struct task_struct *dst,

#define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
- _TIF_SYSCALL_EMU)
+ _TIF_SYSCALL_EMU | _TIF_UACCESS_BUFFER_ENTRY | \
+ _TIF_UACCESS_BUFFER_EXIT)

#ifdef CONFIG_SHADOW_CALL_STACK
#define INIT_SCS \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 88a9034fb9b5..283372eccaeb 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/uaccess-buffer.h>

#include <asm/compat.h>
#include <asm/cpufeature.h>
@@ -1854,6 +1855,9 @@ int syscall_trace_enter(struct pt_regs *regs)
if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
trace_sys_enter(regs, regs->syscallno);

+ if (flags & _TIF_UACCESS_BUFFER_ENTRY)
+ uaccess_buffer_syscall_entry();
+
audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
regs->regs[2], regs->regs[3]);

@@ -1866,6 +1870,9 @@ void syscall_trace_exit(struct pt_regs *regs)

audit_syscall_exit(regs);

+ if (flags & _TIF_UACCESS_BUFFER_EXIT)
+ uaccess_buffer_syscall_exit();
+
if (flags & _TIF_SYSCALL_TRACEPOINT)
trace_sys_exit(regs, syscall_get_return_value(current, regs));

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 8f6372b44b65..5bbd98e5c257 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/tracehook.h>
#include <linux/ratelimit.h>
#include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>

#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
@@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)

void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
{
+ bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
do {
if (thread_flags & _TIF_NEED_RESCHED) {
/* Unmask Debug and SError for the next task */
@@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
local_daif_mask();
thread_flags = READ_ONCE(current_thread_info()->flags);
} while (thread_flags & _TIF_WORK_MASK);
+
+ uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
}

unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 50a0f1a38e84..d59022b594f2 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -7,6 +7,7 @@
#include <linux/ptrace.h>
#include <linux/randomize_kstack.h>
#include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>

#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 04:48:45

by Peter Collingbourne

Subject: [PATCH v3 6/6] Documentation: document uaccess logging

Add documentation for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
Signed-off-by: Peter Collingbourne <[email protected]>
---
v3:
- document what happens if passing NULL to prctl
- be explicit about meaning of addr and size

Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/uaccess-logging.rst | 151 ++++++++++++++++++
2 files changed, 152 insertions(+)
create mode 100644 Documentation/admin-guide/uaccess-logging.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..4f6ee447ab2f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -54,6 +54,7 @@ ABI will be found here.
:maxdepth: 1

sysfs-rules
+ uaccess-logging

The rest of this manual consists of various unordered guides on how to
configure specific aspects of kernel behavior to your liking.
diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
new file mode 100644
index 000000000000..24def38bbdf8
--- /dev/null
+++ b/Documentation/admin-guide/uaccess-logging.rst
@@ -0,0 +1,151 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Uaccess Logging
+===============
+
+Background
+----------
+
+Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
+making use of the ARM Memory Tagging Extension (MTE) need to
+monitor all memory accesses in a program so that they can detect
+memory errors. Furthermore, fuzzing tools such as syzkaller need to
+monitor all memory accesses so that they know which parts of memory
+to fuzz. For accesses made purely in userspace, this is achieved
+via compiler instrumentation, or for MTE, via direct hardware
+support. However, accesses made by the kernel on behalf of the user
+program via syscalls (i.e. uaccesses) are normally invisible to
+these tools.
+
+Traditionally, the sanitizers have handled this by interposing the libc
+syscall stubs with a wrapper that checks the memory based on what we
+believe the uaccesses will be. However, this creates a maintenance
+burden: each syscall must be annotated with its uaccesses in order
+to be recognized by the sanitizer, and these annotations must be
+continuously updated as the kernel changes.
+
+The kernel's uaccess logging feature provides userspace tools with
+the address and size of each userspace access, thereby allowing these
+tools to report memory errors involving these accesses without needing
+annotations for every syscall.
+
+By relying on the kernel's actual uaccesses, rather than a
+reimplementation of them, the userspace memory safety tools may
+play a dual role of verifying the validity of kernel accesses. Even
+a sanitizer whose syscall wrappers have complete knowledge of the
+kernel's intended API may vary from the kernel's actual uaccesses due
+to kernel bugs. A sanitizer with knowledge of the kernel's actual
+uaccesses may produce more accurate error reports that reveal such
+bugs. For example, a kernel that accesses more memory than expected
+by the userspace program could indicate that either userspace or the
+kernel has the wrong idea about which kernel functionality is being
+requested -- either way, there is a bug.
+
+Interface
+---------
+
+The feature may be used via the following prctl:
+
+.. code-block:: c
+
+ uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
+ prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
+
+Supplying a non-zero address as the second argument to ``prctl``
+will cause the kernel to read an address (referred to as the *uaccess
+descriptor address*) from that address on each kernel entry. Specifying
+an address of NULL as the second argument will restore the kernel's
+default behavior, i.e. no uaccess descriptor address is read.
+
+When entering the kernel with a non-zero uaccess descriptor address
+to handle a syscall, the kernel will read a data structure of type
+``struct uaccess_descriptor`` from the uaccess descriptor address,
+which is defined as follows:
+
+.. code-block:: c
+
+ struct uaccess_descriptor {
+ uint64_t addr, size;
+ };
+
+This data structure contains the address and size (in array elements)
+of a *uaccess buffer*, which is an array of data structures of type
+``struct uaccess_buffer_entry``. Before returning to userspace, the
+kernel will log information about uaccesses to sequential entries
+in the uaccess buffer. It will also store ``NULL`` to the uaccess
+descriptor address, and store the address and size of the unused
+portion of the uaccess buffer to the uaccess descriptor.
+
+The format of a uaccess buffer entry is defined as follows:
+
+.. code-block:: c
+
+ struct uaccess_buffer_entry {
+ uint64_t addr, size, flags;
+ };
+
+``addr`` and ``size`` contain the address and size of the user memory
+access. On arm64, tag bits are preserved in the ``addr`` field. There
+is currently one flag bit assignment for the ``flags`` field:
+
+.. code-block:: c
+
+ #define UACCESS_BUFFER_FLAG_WRITE 1
+
+This flag is set if the access was a write, or clear if it was a
+read. The meaning of all other flag bits is reserved.
+
+When entering the kernel with a non-zero uaccess descriptor
+address for a reason other than a syscall (for example, when
+IPI'd due to an incoming asynchronous signal), any signals other
+than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
+``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
+initialized with ``sigfillset(set)``. This is to prevent incoming
+signals from interfering with uaccess logging.
+
+Example
+-------
+
+Here is an example of a code snippet that will enumerate the accesses
+performed by a ``uname(2)`` syscall:
+
+.. code-block:: c
+
+ struct uaccess_buffer_entry entries[64];
+ struct uaccess_descriptor desc;
+ uint64_t desc_addr = 0;
+ prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
+
+ desc.addr = (uint64_t)&entries;
+ desc.size = 64;
+ desc_addr = (uint64_t)&desc;
+
+ struct utsname un;
+ uname(&un);
+
+ struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
+ for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
+ printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
+ (unsigned long)entry->addr, (unsigned long)entry->size);
+ }
+
+Limitations
+-----------
+
+This feature is currently only supported on the arm64, s390 and x86
+architectures.
+
+Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
+course, not all of the accesses may fit in the buffer, but aside from
+that, not all internal kernel APIs that access userspace memory are
+covered. Therefore, userspace programs should tolerate unreported
+accesses.
+
+On the other hand, the kernel guarantees that it will not
+(intentionally) report accessing more data than it is specified
+to read. For example, if the kernel implements a syscall that is
+specified to read a data structure of size ``N`` bytes by first
+reading a page's worth of data and then only using the first ``N``
+bytes from it, the kernel will either report reading ``N`` bytes or
+not report the access at all.
--
2.34.1.173.g76aa8bc2d0-goog


2021-12-08 09:28:10

by Dmitry Vyukov

Subject: Re: [PATCH v3 1/6] include: split out uaccess instrumentation into a separate header

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
>
> In an upcoming change we are going to add uaccess instrumentation
> that uses inline access to struct task_struct from the
> instrumentation routines. Because instrumentation.h is included
> from many places including (recursively) from sched.h this would
> otherwise lead to a circular dependency. Break the dependency by
> moving uaccess instrumentation routines into a separate header,
> instrumented-uaccess.h.
>
> Link: https://linux-review.googlesource.com/id/I625728db0c8db374e13e4ebc54985ac5c79ace7d
> Signed-off-by: Peter Collingbourne <[email protected]>

Acked-by: Dmitry Vyukov <[email protected]>

> ---
> include/linux/instrumented-uaccess.h | 49 ++++++++++++++++++++++++++++
> include/linux/instrumented.h | 34 -------------------
> include/linux/uaccess.h | 2 +-
> lib/iov_iter.c | 2 +-
> lib/usercopy.c | 2 +-
> 5 files changed, 52 insertions(+), 37 deletions(-)
> create mode 100644 include/linux/instrumented-uaccess.h
>
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> new file mode 100644
> index 000000000000..ece549088e50
> --- /dev/null
> +++ b/include/linux/instrumented-uaccess.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * This header provides generic wrappers for memory access instrumentation for
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + */
> +#ifndef _LINUX_INSTRUMENTED_UACCESS_H
> +#define _LINUX_INSTRUMENTED_UACCESS_H
> +
> +#include <linux/compiler.h>
> +#include <linux/kasan-checks.h>
> +#include <linux/kcsan-checks.h>
> +#include <linux/types.h>
> +
> +/**
> + * instrument_copy_to_user - instrument reads of copy_to_user
> + *
> + * Instrument reads from kernel memory, that are due to copy_to_user (and
> + * variants). The instrumentation must be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> + kasan_check_read(from, n);
> + kcsan_check_read(from, n);
> +}
> +
> +/**
> + * instrument_copy_from_user - instrument writes of copy_from_user
> + *
> + * Instrument writes to kernel memory, that are due to copy_from_user (and
> + * variants). The instrumentation should be inserted before the accesses.
> + *
> + * @to destination address
> + * @from source address
> + * @n number of bytes to copy
> + */
> +static __always_inline void
> +instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> +{
> + kasan_check_write(to, n);
> + kcsan_check_write(to, n);
> +}
> +
> +#endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..b68f415510c7 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -102,38 +102,4 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v,
> kcsan_check_atomic_read_write(v, size);
> }
>
> -/**
> - * instrument_copy_to_user - instrument reads of copy_to_user
> - *
> - * Instrument reads from kernel memory, that are due to copy_to_user (and
> - * variants). The instrumentation must be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> -{
> - kasan_check_read(from, n);
> - kcsan_check_read(from, n);
> -}
> -
> -/**
> - * instrument_copy_from_user - instrument writes of copy_from_user
> - *
> - * Instrument writes to kernel memory, that are due to copy_from_user (and
> - * variants). The instrumentation should be inserted before the accesses.
> - *
> - * @to destination address
> - * @from source address
> - * @n number of bytes to copy
> - */
> -static __always_inline void
> -instrument_copy_from_user(const void *to, const void __user *from, unsigned long n)
> -{
> - kasan_check_write(to, n);
> - kcsan_check_write(to, n);
> -}
> -
> #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index ac0394087f7d..c0c467e39657 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -3,7 +3,7 @@
> #define __LINUX_UACCESS_H__
>
> #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
> #include <linux/minmax.h>
> #include <linux/sched.h>
> #include <linux/thread_info.h>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 66a740e6e153..3f9dc6df7102 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -12,7 +12,7 @@
> #include <linux/compat.h>
> #include <net/checksum.h>
> #include <linux/scatterlist.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
>
> #define PIPE_PARANOIA /* for now */
>
> diff --git a/lib/usercopy.c b/lib/usercopy.c
> index 7413dd300516..1cd188e62d06 100644
> --- a/lib/usercopy.c
> +++ b/lib/usercopy.c
> @@ -1,7 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/bitops.h>
> #include <linux/fault-inject-usercopy.h>
> -#include <linux/instrumented.h>
> +#include <linux/instrumented-uaccess.h>
> #include <linux/uaccess.h>
>
> /* out-of-line parts */
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

2021-12-08 09:35:07

by Dmitry Vyukov

Subject: Re: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
>
> With uaccess logging the contract is that the kernel must not report
> accessing more data than necessary, as this can lead to false positive
> reports in downstream consumers. This generally works out of the box
> when instrumenting copy_{from,to}_user(), but with the data argument
> to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> much as we can, if the PAGE_SIZE sized access failed) and figure out
> later how much we actually need.
>
> To prevent this from leading to a false positive report, use
> copy_from_user_nolog(), which will prevent the access from being logged.
> Recall that it is valid for the kernel to report accessing less
> data than it actually accessed, as uaccess logging is a best-effort
> mechanism for reporting uaccesses.
>
> Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> fs/namespace.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 659a8f39c61a..8f5f2aaca64e 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -31,6 +31,7 @@
> #include <uapi/linux/mount.h>
> #include <linux/fs_context.h>
> #include <linux/shmem_fs.h>
> +#include <linux/uaccess-buffer.h>
>
> #include "pnode.h"
> #include "internal.h"
> @@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
> if (!copy)
> return ERR_PTR(-ENOMEM);
>
> - left = copy_from_user(copy, data, PAGE_SIZE);
> + /*
> + * Use copy_from_user_nolog to avoid reporting overly large accesses in
> + * the uaccess buffer, as this can lead to false positive reports in
> + * downstream consumers.
> + */
> + left = copy_from_user_nolog(copy, data, PAGE_SIZE);

A late idea...
Maybe it's better to log these accesses with a new
UACCESS_BUFFER_FLAG_OVERREAD flag. That would be better for
user-space: it can at least detect UAFs by checking the first byte.
And it's a more logical kernel annotation (maybe it will be used by
some other tools, or if we ever check user tags in the kernel).

Probably not too important today since we use this in only 2 places,
but it may be better in the longer term.

Btw, what's the story with BPF accesses? Can we log them theoretically?

Previously the comment said:

+ /*
+ * Avoid copy_from_user() here as it may leak information about the BPF
+ * program to userspace via the uaccess buffer.
+ */

but now it says something very generic:

/*
* Avoid logging uaccesses here as the BPF program may not be following
* the uaccess log rules.
*/






> /*
> * Not all architectures have an exact copy_from_user(). Resort to
> --
> 2.34.1.173.g76aa8bc2d0-goog

2021-12-08 09:44:09

by Dmitry Vyukov

Subject: Re: [PATCH v3 4/6] uaccess-buffer: add CONFIG_GENERIC_ENTRY support

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
>
> Add uaccess logging support on architectures that use
> CONFIG_GENERIC_ENTRY (currently only s390 and x86).
>
> Link: https://linux-review.googlesource.com/id/I3c5eb19a7e4a1dbe6095f6971f7826c4b0663f7d
> Signed-off-by: Peter Collingbourne <[email protected]>

Acked-by: Dmitry Vyukov <[email protected]>

> ---
> arch/Kconfig | 6 ++++++
> include/linux/entry-common.h | 2 ++
> include/linux/thread_info.h | 4 ++++
> kernel/entry/common.c | 10 ++++++++++
> 4 files changed, 22 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d3c4ab249e9c..c4dcab5279ac 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
> bool
>
> config GENERIC_ENTRY
> + select HAVE_ARCH_UACCESS_BUFFER
> bool
>
> config KPROBES
> @@ -1312,6 +1313,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
> config DYNAMIC_SIGFRAME
> bool
>
> +config HAVE_ARCH_UACCESS_BUFFER
> + bool
> + help
> + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 2e2b8d6140ed..973fcd1d48a3 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -42,12 +42,14 @@
> SYSCALL_WORK_SYSCALL_EMU | \
> SYSCALL_WORK_SYSCALL_AUDIT | \
> SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
> + SYSCALL_WORK_UACCESS_BUFFER_ENTRY | \
> ARCH_SYSCALL_WORK_ENTER)
> #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
> SYSCALL_WORK_SYSCALL_TRACE | \
> SYSCALL_WORK_SYSCALL_AUDIT | \
> SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
> SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
> + SYSCALL_WORK_UACCESS_BUFFER_EXIT | \
> ARCH_SYSCALL_WORK_EXIT)
>
> /*
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index ad0c4e041030..b0f8ea86967f 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -46,6 +46,8 @@ enum syscall_work_bit {
> SYSCALL_WORK_BIT_SYSCALL_AUDIT,
> SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
> SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
> + SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY,
> + SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT,
> };
>
> #define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
> @@ -55,6 +57,8 @@ enum syscall_work_bit {
> #define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
> #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
> #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
> +#define SYSCALL_WORK_UACCESS_BUFFER_ENTRY BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_ENTRY)
> +#define SYSCALL_WORK_UACCESS_BUFFER_EXIT BIT(SYSCALL_WORK_BIT_UACCESS_BUFFER_EXIT)
> #endif
>
> #include <asm/thread_info.h>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index d5a61d565ad5..57c4bb01a554 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
> #include <linux/livepatch.h>
> #include <linux/audit.h>
> #include <linux/tick.h>
> +#include <linux/uaccess-buffer.h>
>
> #include "common.h"
>
> @@ -70,6 +71,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> return ret;
> }
>
> + if (work & SYSCALL_WORK_UACCESS_BUFFER_ENTRY)
> + uaccess_buffer_syscall_entry();
> +
> /* Either of the above might have changed the syscall number */
> syscall = syscall_get_nr(current, regs);
>
> @@ -197,14 +201,17 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> static void exit_to_user_mode_prepare(struct pt_regs *regs)
> {
> unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> + bool uaccess_buffer_pending;
>
> lockdep_assert_irqs_disabled();
>
> /* Flush pending rcuog wakeup before the last need_resched() check */
> tick_nohz_user_enter_prepare();
>
> + uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> ti_work = exit_to_user_mode_loop(regs, ti_work);
> + uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
>
> arch_exit_to_user_mode_prepare(regs, ti_work);
>
> @@ -247,6 +254,9 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
>
> audit_syscall_exit(regs);
>
> + if (work & SYSCALL_WORK_UACCESS_BUFFER_EXIT)
> + uaccess_buffer_syscall_exit();
> +
> if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
> trace_sys_exit(regs, syscall_get_return_value(current, regs));
>
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

2021-12-08 09:49:05

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] arm64: add support for uaccess logging

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
>
> arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
> uaccess logging directly to the architecture.
>
> Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/thread_info.h | 7 ++++++-
> arch/arm64/kernel/ptrace.c | 7 +++++++
> arch/arm64/kernel/signal.c | 5 +++++
> arch/arm64/kernel/syscall.c | 1 +
> 5 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c4207cf9bb17..6023946abe4a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -161,6 +161,7 @@ config ARM64
> select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> select HAVE_ARCH_TRACEHOOK
> select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> + select HAVE_ARCH_UACCESS_BUFFER
> select HAVE_ARCH_VMAP_STACK
> select HAVE_ARM_SMCCC
> select HAVE_ASM_MODVERSIONS
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index e1317b7c4525..0461b36251ea 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -82,6 +82,8 @@ int arch_dup_task_struct(struct task_struct *dst,
> #define TIF_SVE_VL_INHERIT 24 /* Inherit SVE vl_onexec across exec */
> #define TIF_SSBD 25 /* Wants SSB mitigation */
> #define TIF_TAGGED_ADDR 26 /* Allow tagged user addresses */
> +#define TIF_UACCESS_BUFFER_ENTRY 27 /* thread has non-zero uaccess_desc_addr_addr */
> +#define TIF_UACCESS_BUFFER_EXIT 28 /* thread has non-zero kcur */
>
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> @@ -98,6 +100,8 @@ int arch_dup_task_struct(struct task_struct *dst,
> #define _TIF_SVE (1 << TIF_SVE)
> #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT)
> #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
> +#define _TIF_UACCESS_BUFFER_ENTRY (1 << TIF_UACCESS_BUFFER_ENTRY)
> +#define _TIF_UACCESS_BUFFER_EXIT (1 << TIF_UACCESS_BUFFER_EXIT)
>
> #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> @@ -106,7 +110,8 @@ int arch_dup_task_struct(struct task_struct *dst,
>
> #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
> _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
> - _TIF_SYSCALL_EMU)
> + _TIF_SYSCALL_EMU | _TIF_UACCESS_BUFFER_ENTRY | \
> + _TIF_UACCESS_BUFFER_EXIT)
>
> #ifdef CONFIG_SHADOW_CALL_STACK
> #define INIT_SCS \
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index 88a9034fb9b5..283372eccaeb 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -29,6 +29,7 @@
> #include <linux/regset.h>
> #include <linux/tracehook.h>
> #include <linux/elf.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <asm/compat.h>
> #include <asm/cpufeature.h>
> @@ -1854,6 +1855,9 @@ int syscall_trace_enter(struct pt_regs *regs)
> if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
> trace_sys_enter(regs, regs->syscallno);
>
> + if (flags & _TIF_UACCESS_BUFFER_ENTRY)
> + uaccess_buffer_syscall_entry();
> +
> audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
> regs->regs[2], regs->regs[3]);
>
> @@ -1866,6 +1870,9 @@ void syscall_trace_exit(struct pt_regs *regs)
>
> audit_syscall_exit(regs);
>
> + if (flags & _TIF_UACCESS_BUFFER_EXIT)
> + uaccess_buffer_syscall_exit();
> +
> if (flags & _TIF_SYSCALL_TRACEPOINT)
> trace_sys_exit(regs, syscall_get_return_value(current, regs));
>
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 8f6372b44b65..5bbd98e5c257 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -20,6 +20,7 @@
> #include <linux/tracehook.h>
> #include <linux/ratelimit.h>
> #include <linux/syscalls.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <asm/daifflags.h>
> #include <asm/debug-monitors.h>
> @@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)
>
> void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
> {
> + bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> +
> do {
> if (thread_flags & _TIF_NEED_RESCHED) {
> /* Unmask Debug and SError for the next task */
> @@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
> local_daif_mask();
> thread_flags = READ_ONCE(current_thread_info()->flags);
> } while (thread_flags & _TIF_WORK_MASK);
> +
> + uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
> }
>
> unsigned long __ro_after_init signal_minsigstksz;
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..d59022b594f2 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -7,6 +7,7 @@
> #include <linux/ptrace.h>
> #include <linux/randomize_kstack.h>
> #include <linux/syscalls.h>
> +#include <linux/uaccess-buffer.h>

This looks strange... Does some other header miss this include?

>
> #include <asm/daifflags.h>
> #include <asm/debug-monitors.h>
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

2021-12-08 10:21:22

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] uaccess-buffer: add core code

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
>
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
>
> v2:
> - New interface that avoids multiple syscalls per real syscall and
> is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
>
> fs/exec.c | 3 +
> include/linux/instrumented-uaccess.h | 6 +-
> include/linux/sched.h | 5 ++
> include/linux/uaccess-buffer-info.h | 46 ++++++++++
> include/linux/uaccess-buffer.h | 112 +++++++++++++++++++++++
> include/uapi/linux/prctl.h | 3 +
> include/uapi/linux/uaccess-buffer.h | 27 ++++++
> kernel/Makefile | 1 +
> kernel/bpf/helpers.c | 7 +-
> kernel/fork.c | 4 +
> kernel/signal.c | 4 +-
> kernel/sys.c | 6 ++
> kernel/uaccess-buffer.c | 129 +++++++++++++++++++++++++++
> 13 files changed, 350 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/uaccess-buffer-info.h
> create mode 100644 include/linux/uaccess-buffer.h
> create mode 100644 include/uapi/linux/uaccess-buffer.h
> create mode 100644 kernel/uaccess-buffer.c
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
> #include <linux/vmalloc.h>
> #include <linux/io_uring.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
> me->personality &= ~bprm->per_clear;
>
> clear_syscall_work_syscall_user_dispatch(me);
> + uaccess_buffer_set_descriptor_addr_addr(0);
> + uaccess_buffer_free(current);
>
> /*
> * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_UACCESS_H
> #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>
> /**
> * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..7c5278d7b57d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> + struct uaccess_buffer_info uaccess_buffer;
> +#endif
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..15e2d8f7c074
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> + /*
> + * The pointer to pointer to struct uaccess_descriptor. This is the
> + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> + /*
> + * The pointer to struct uaccess_descriptor read at syscall entry time.
> + */
> + struct uaccess_descriptor __user *desc_ptr;
> +
> + /*
> + * A pointer to the kernel's temporary copy of the uaccess log for the
> + * current syscall. We log to a kernel buffer in order to avoid leaking
> + * timing information to userspace.
> + */
> + struct uaccess_buffer_entry *kbegin;
> +
> + /*
> + * The position of the next uaccess buffer entry for the current
> + * syscall, or NULL if we are not logging the current syscall.
> + */
> + struct uaccess_buffer_entry *kcur;
> +
> + /*
> + * A pointer to the end of the kernel's uaccess log.
> + */
> + struct uaccess_buffer_entry *kend;
> +
> + /*
> + * The pointer to the userspace uaccess log, as read from the
> + * struct uaccess_descriptor.
> + */
> + struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..f2f46db274f3
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,112 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> + __uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> + __uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> + return false;
> + return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> + if (pending)
> + __uaccess_buffer_post_exit_loop();
> +}
> +
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)

I would move the implementation to the .c file. It's a rarely-taken path.

> +{
> + current->uaccess_buffer.desc_ptr_ptr =
> + (struct uaccess_descriptor __user * __user *)addr;
> + if (addr)
> + set_syscall_work(UACCESS_BUFFER_ENTRY);
> + else
> + clear_syscall_work(UACCESS_BUFFER_ENTRY);
> + return 0;
> +}
> +
> +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len);
> +
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))

UACCESS_BUFFER_EXIT is only defined in later patches in the series, so
this won't compile at this point.

> + __uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> + __uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> + return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog(to, from, len) copy_from_user(to, from, len)
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
>
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> + /* Address of the uaccess_buffer_entry array. */
> + __u64 addr;
> + /* Size of the uaccess_buffer_entry array in number of elements. */
> + __u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + /* Address being accessed. */
> + __u64 addr;
> + /* Number of bytes that were accessed. */
> + __u64 size;
> + /* UACCESS_BUFFER_* flags. */
> + __u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..d4d9be5146c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
> #include <linux/pid_namespace.h>
> #include <linux/proc_ns.h>
> #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>
> #include "../../lib/kstrtox.h"
>
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> const void __user *, user_ptr)
> {
> - int ret = copy_from_user(dst, user_ptr, size);
> + /*
> + * Avoid logging uaccesses here as the BPF program may not be following
> + * the uaccess log rules.
> + */
> + int ret = copy_from_user_nolog(dst, user_ptr, size);
>
> if (unlikely(ret)) {
> memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
> #include <linux/scs.h>
> #include <linux/io_uring.h>
> #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
> delayacct_tsk_free(tsk);
> put_signal_struct(tsk->signal);
> sched_core_free(tsk);
> + uaccess_buffer_free(tsk);
>
> if (!profile_handoff_task(tsk))
> free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> if (memcg_charge_kernel_stack(tsk))
> goto free_stack;
>
> + uaccess_buffer_free(orig);
> +
> stack_vm_area = task_stack_vm_area(tsk);
>
> err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..69bf21518bd0 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
> #include <linux/posix-timers.h>
> #include <linux/cgroup.h>
> #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> - (sig == SIGKILL || !p->ptrace)) {
> + (sig == SIGKILL ||
> + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
> /*
> * This signal will be fatal to the whole group.
> */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_descriptor_addr_addr(arg2);

Is this missing a uaccess_buffer_free() call for the 0 case?
It seems that whenever we set the address to 0 we also want to free the
buffer (e.g. in exec). I wonder if the freeing should be done by
uaccess_buffer_set_descriptor_addr_addr() itself. Both
uaccess_buffer_set_descriptor_addr_addr() and uaccess_buffer_free()
currently reset the syscall work flags, which works but is somewhat
suboptimal logically. Task exit could then simply call
uaccess_buffer_set_descriptor_addr_addr(0).
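
A minimal sketch of that shape (illustrative only, and assuming the
function moves out of the header as suggested earlier):

int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
{
        current->uaccess_buffer.desc_ptr_ptr =
                (struct uaccess_descriptor __user *__user *)addr;
        if (addr) {
                set_syscall_work(UACCESS_BUFFER_ENTRY);
        } else {
                clear_syscall_work(UACCESS_BUFFER_ENTRY);
                /* Setting the address to 0 also tears down the kernel buffer. */
                uaccess_buffer_free(current);
        }
        return 0;
}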


> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..088e43f7611c
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_buffer_entry *entry = buf->kcur;
> +
> + if (entry == buf->kend || unlikely(uaccess_kernel()))
> + return;
> + entry->addr = addr;
> + entry->size = size;
> + entry->flags = flags;
> +
> + ++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor __user *desc_ptr;
> + sigset_t tmp_mask;
> +
> + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> + return false;
> +
> + current->real_blocked = current->blocked;
> + sigfillset(&tmp_mask);

This and __uaccess_buffer_post_exit_loop() run only when we get a
signal/timer interrupt between setting the descriptor address in
userspace and entering the next syscall, right?
I just want to make sure this code is not executed during normal uaccess
tracing, for performance reasons.

> + set_current_blocked(&tmp_mask);
> + return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> + spin_lock_irq(&current->sighand->siglock);
> + current->blocked = current->real_blocked;
> + recalc_sigpending();
> + spin_unlock_irq(&current->sighand->siglock);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> + kfree(buf->kbegin);
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor desc;
> +
> + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> + put_user(0, buf->desc_ptr_ptr) ||
> + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> + return;
> +
> + if (desc.size > 1024)
> + desc.size = 1024;
> +
> + if (buf->kend - buf->kbegin != desc.size)
> + buf->kbegin =
> + krealloc_array(buf->kbegin, desc.size,
> + sizeof(struct uaccess_buffer_entry),
> + GFP_KERNEL);
> + if (!buf->kbegin)

I think we also need to set at least buf->kend to NULL here.
I am not sure what can go wrong right now, but it's a strange state: on
the next syscall entry we will compute "buf->kend - buf->kbegin" with
kend being a dangling pointer and kbegin being NULL.
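
As a sketch of the suggested fix (leaving the buffer in a consistent
all-NULL state if the resize fails):

        if (buf->kend - buf->kbegin != desc.size)
                buf->kbegin =
                        krealloc_array(buf->kbegin, desc.size,
                                       sizeof(struct uaccess_buffer_entry),
                                       GFP_KERNEL);
        if (!buf->kbegin) {
                buf->kcur = buf->kend = NULL;
                return;
        }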

> + return;
> +
> + set_syscall_work(UACCESS_BUFFER_EXIT);
> + buf->kcur = buf->kbegin;
> + buf->kend = buf->kbegin + desc.size;
> + buf->ubegin =
> + (struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + u64 num_entries = buf->kcur - buf->kbegin;
> + struct uaccess_descriptor desc;
> +
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> + desc.size = buf->kend - buf->kcur;
> + buf->kcur = NULL;
> + if (copy_to_user(buf->ubegin, buf->kbegin,
> + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len)
> +{
> + size_t retval;
> +
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + retval = copy_from_user(to, from, len);
> + if (current->uaccess_buffer.kcur)
> + set_syscall_work(UACCESS_BUFFER_EXIT);
> + return retval;
> +}
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

2021-12-08 15:33:43

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] kernel: introduce uaccess logging

On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
[...]
> Peter Collingbourne (6):
> include: split out uaccess instrumentation into a separate header
> uaccess-buffer: add core code
> fs: use copy_from_user_nolog() to copy mount() data
> uaccess-buffer: add CONFIG_GENERIC_ENTRY support
> arm64: add support for uaccess logging
> Documentation: document uaccess logging

I think it needs to be possible to disable the feature via a Kconfig
option. Not all systems want or could even tolerate the additional
overheads -- even though you say they are minimal elsewhere. For
example, some embedded systems most likely have absolutely no use for
this feature, and the increase in .text might be unacceptable. Certain
features that we usually take for granted are no different (see
init/Kconfig: FUTEX, EPOLL, .. etc). If you'd like it enabled by
default, given the overheads are small enough, it can do "default y"
and be configurable only "if EXPERT".

Is it possible to add a kselftest-style test to
tools/testing/selftests? In addition to the basic tests, can certain
non-trivial properties, like masking of signals, also be tested? I
think that'd be extremely valuable, because I'm sure we'd have to
backport this to several older kernels.

Thanks,
-- Marco

2021-12-08 16:46:22

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] uaccess-buffer: add core code

On Tue, Dec 07, 2021 at 08:48PM -0800, Peter Collingbourne wrote:
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> v3:
> - performance optimizations for entry/exit code
> - don't use kcur == NULL to mean overflow
> - fix potential double free in clone()
> - don't allocate a new kernel-side uaccess buffer for each syscall
> - fix uaccess buffer leak on exit
> - fix some sparse warnings
>
> v2:
> - New interface that avoids multiple syscalls per real syscall and
> is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
>
> fs/exec.c | 3 +
> include/linux/instrumented-uaccess.h | 6 +-
> include/linux/sched.h | 5 ++
> include/linux/uaccess-buffer-info.h | 46 ++++++++++
> include/linux/uaccess-buffer.h | 112 +++++++++++++++++++++++
> include/uapi/linux/prctl.h | 3 +
> include/uapi/linux/uaccess-buffer.h | 27 ++++++
> kernel/Makefile | 1 +
> kernel/bpf/helpers.c | 7 +-
> kernel/fork.c | 4 +
> kernel/signal.c | 4 +-
> kernel/sys.c | 6 ++
> kernel/uaccess-buffer.c | 129 +++++++++++++++++++++++++++
> 13 files changed, 350 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/uaccess-buffer-info.h
> create mode 100644 include/linux/uaccess-buffer.h
> create mode 100644 include/uapi/linux/uaccess-buffer.h
> create mode 100644 kernel/uaccess-buffer.c
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..c9975e790f30 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
> #include <linux/vmalloc.h>
> #include <linux/io_uring.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
> me->personality &= ~bprm->per_clear;
>
> clear_syscall_work_syscall_user_dispatch(me);
> + uaccess_buffer_set_descriptor_addr_addr(0);
> + uaccess_buffer_free(current);
>
> /*
> * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> index ece549088e50..b967f4436d15 100644
> --- a/include/linux/instrumented-uaccess.h
> +++ b/include/linux/instrumented-uaccess.h
> @@ -2,7 +2,8 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation for
> - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> + * uaccess buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_UACCESS_H
> #define _LINUX_INSTRUMENTED_UACCESS_H
> @@ -11,6 +12,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess-buffer.h>
>
> /**
> * instrument_copy_to_user - instrument reads of copy_to_user
> @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..7c5278d7b57d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess-buffer-info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1484,6 +1485,10 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> + struct uaccess_buffer_info uaccess_buffer;
> +#endif
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> new file mode 100644
> index 000000000000..15e2d8f7c074
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-info.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> +#define _LINUX_UACCESS_BUFFER_INFO_H
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> + /*
> + * The pointer to pointer to struct uaccess_descriptor. This is the
> + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> + /*
> + * The pointer to struct uaccess_descriptor read at syscall entry time.
> + */
> + struct uaccess_descriptor __user *desc_ptr;
> +
> + /*
> + * A pointer to the kernel's temporary copy of the uaccess log for the
> + * current syscall. We log to a kernel buffer in order to avoid leaking
> + * timing information to userspace.
> + */
> + struct uaccess_buffer_entry *kbegin;
> +
> + /*
> + * The position of the next uaccess buffer entry for the current
> + * syscall, or NULL if we are not logging the current syscall.
> + */
> + struct uaccess_buffer_entry *kcur;
> +
> + /*
> + * A pointer to the end of the kernel's uaccess log.
> + */
> + struct uaccess_buffer_entry *kend;
> +
> + /*
> + * The pointer to the userspace uaccess log, as read from the
> + * struct uaccess_descriptor.
> + */
> + struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_INFO_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..f2f46db274f3
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,112 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER

Kernel-doc comments for each of the below would be useful (if __ prefixed
functions are implementation details, they can be left as-is).
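
For example, a kernel-doc block for uaccess_buffer_pre_exit_loop() might
look roughly like this (the wording is only a suggestion):

/**
 * uaccess_buffer_pre_exit_loop - prepare uaccess logging before the
 *      exit-to-user-mode work loop
 *
 * If the task has armed a uaccess descriptor for its next syscall, block
 * asynchronous signals so that a signal delivered before that syscall
 * cannot consume the log intended for it.
 *
 * Return: true if signals were blocked, in which case the caller must pass
 * true to uaccess_buffer_post_exit_loop() after the work loop.
 */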

> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> + __uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> + __uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> + return false;
> + return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> + if (pending)
> + __uaccess_buffer_post_exit_loop();
> +}
> +
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> + current->uaccess_buffer.desc_ptr_ptr =
> + (struct uaccess_descriptor __user * __user *)addr;
> + if (addr)
> + set_syscall_work(UACCESS_BUFFER_ENTRY);
> + else
> + clear_syscall_work(UACCESS_BUFFER_ENTRY);
> + return 0;
> +}
> +
> +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len);

copy_from_user() uses unsigned long for both @len and its return type.
I'd make them match.
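
I.e., something like this (sketch):

unsigned long copy_from_user_nolog(void *to, const void __user *from,
                                   unsigned long len);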

> +
> +void uaccess_buffer_free(struct task_struct *tsk);
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> + __uaccess_buffer_log_read(from, n);
> +}
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> + __uaccess_buffer_log_write(to, n);
> +}
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return false;
> +}
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> + return -EINVAL;
> +}
> +static inline void uaccess_buffer_free(struct task_struct *tsk)
> +{
> +}
> +
> +#define copy_from_user_nolog(to, from, len) copy_from_user(to, from, len)

"#define copy_from_user_nolog copy_from_user" ?

> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
>
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..bf10f7c78857
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/types.h>
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> + /* Address of the uaccess_buffer_entry array. */
> + __u64 addr;
> + /* Size of the uaccess_buffer_entry array in number of elements. */
> + __u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + /* Address being accessed. */
> + __u64 addr;
> + /* Number of bytes that were accessed. */
> + __u64 size;
> + /* UACCESS_BUFFER_* flags. */
> + __u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..d4d9be5146c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..ab6520a633ef 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -15,6 +15,7 @@
> #include <linux/pid_namespace.h>
> #include <linux/proc_ns.h>
> #include <linux/security.h>
> +#include <linux/uaccess-buffer.h>
>
> #include "../../lib/kstrtox.h"
>
> @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> const void __user *, user_ptr)
> {
> - int ret = copy_from_user(dst, user_ptr, size);
> + /*
> + * Avoid logging uaccesses here as the BPF program may not be following
> + * the uaccess log rules.
> + */
> + int ret = copy_from_user_nolog(dst, user_ptr, size);
>
> if (unlikely(ret)) {
> memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..8be2ca528a65 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
> #include <linux/scs.h>
> #include <linux/io_uring.h>
> #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
> delayacct_tsk_free(tsk);
> put_signal_struct(tsk->signal);
> sched_core_free(tsk);
> + uaccess_buffer_free(tsk);
>
> if (!profile_handoff_task(tsk))
> free_task(tsk);
> @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> if (memcg_charge_kernel_stack(tsk))
> goto free_stack;
>
> + uaccess_buffer_free(orig);
> +
> stack_vm_area = task_stack_vm_area(tsk);
>
> err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..69bf21518bd0 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
> #include <linux/posix-timers.h>
> #include <linux/cgroup.h>
> #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> - (sig == SIGKILL || !p->ptrace)) {
> + (sig == SIGKILL ||
> + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
> /*
> * This signal will be fatal to the whole group.
> */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..088e43f7611c
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_buffer_entry *entry = buf->kcur;
> +
> + if (entry == buf->kend || unlikely(uaccess_kernel()))
> + return;
> + entry->addr = addr;
> + entry->size = size;
> + entry->flags = flags;
> +
> + ++buf->kcur;
> +}
> +
> +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> +
> +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor __user *desc_ptr;
> + sigset_t tmp_mask;
> +
> + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> + return false;
> +
> + current->real_blocked = current->blocked;
> + sigfillset(&tmp_mask);
> + set_current_blocked(&tmp_mask);
> + return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> + spin_lock_irq(&current->sighand->siglock);
> + current->blocked = current->real_blocked;
> + recalc_sigpending();
> + spin_unlock_irq(&current->sighand->siglock);
> +}
> +
> +void uaccess_buffer_free(struct task_struct *tsk)
> +{
> + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> + kfree(buf->kbegin);
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + buf->kbegin = buf->kcur = buf->kend = NULL;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor desc;
> +
> + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> + put_user(0, buf->desc_ptr_ptr) ||
> + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> + return;
> +
> + if (desc.size > 1024)
> + desc.size = 1024;
> +
> + if (buf->kend - buf->kbegin != desc.size)
> + buf->kbegin =
> + krealloc_array(buf->kbegin, desc.size,
> + sizeof(struct uaccess_buffer_entry),
> + GFP_KERNEL);
> + if (!buf->kbegin)
> + return;
> +
> + set_syscall_work(UACCESS_BUFFER_EXIT);
> + buf->kcur = buf->kbegin;
> + buf->kend = buf->kbegin + desc.size;
> + buf->ubegin =
> + (struct uaccess_buffer_entry __user *)(unsigned long)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + u64 num_entries = buf->kcur - buf->kbegin;
> + struct uaccess_descriptor desc;
> +
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + desc.addr = (u64)(unsigned long)(buf->ubegin + num_entries);
> + desc.size = buf->kend - buf->kcur;
> + buf->kcur = NULL;
> + if (copy_to_user(buf->ubegin, buf->kbegin,
> + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +}
> +
> +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len)
> +{
> + size_t retval;
> +
> + clear_syscall_work(UACCESS_BUFFER_EXIT);
> + retval = copy_from_user(to, from, len);
> + if (current->uaccess_buffer.kcur)
> + set_syscall_work(UACCESS_BUFFER_EXIT);
> + return retval;
> +}
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

2021-12-09 21:42:46

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

On Wed, Dec 8, 2021 at 1:35 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
> >
> > With uaccess logging the contract is that the kernel must not report
> > accessing more data than necessary, as this can lead to false positive
> > reports in downstream consumers. This generally works out of the box
> > when instrumenting copy_{from,to}_user(), but with the data argument
> > to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> > much as we can, if the PAGE_SIZE sized access failed) and figure out
> > later how much we actually need.
> >
> > To prevent this from leading to a false positive report, use
> > copy_from_user_nolog(), which will prevent the access from being logged.
> > Recall that it is valid for the kernel to report accessing less
> > data than it actually accessed, as uaccess logging is a best-effort
> > mechanism for reporting uaccesses.
> >
> > Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > fs/namespace.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 659a8f39c61a..8f5f2aaca64e 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -31,6 +31,7 @@
> > #include <uapi/linux/mount.h>
> > #include <linux/fs_context.h>
> > #include <linux/shmem_fs.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include "pnode.h"
> > #include "internal.h"
> > @@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
> > if (!copy)
> > return ERR_PTR(-ENOMEM);
> >
> > - left = copy_from_user(copy, data, PAGE_SIZE);
> > + /*
> > + * Use copy_from_user_nolog to avoid reporting overly large accesses in
> > + * the uaccess buffer, as this can lead to false positive reports in
> > + * downstream consumers.
> > + */
> > + left = copy_from_user_nolog(copy, data, PAGE_SIZE);
>
> A late idea...
> Maybe it's better to log them with a new UACCESS_BUFFER_FLAG_OVERREAD
> flag. Better for user-space, at least can detect UAFs by checking the
> first byte. And a more logical kernel annotation (maybe will be used
> in some other tools? or if we ever check user tags in the kernel).
>
> Probably not too important today since we use this only in 2 places,
> but longer term may be better.

I'm not sure about this. The overreads are basically an implementation
detail of the kernel, so I'm not sure it makes sense to expose them. A
scheme where we expose all overreads wouldn't necessarily help with UAF
detection either, because what if, for example, the kernel reads *behind*
the user-provided pointer? I guess that could lead to false positives.

> Btw, what's the story with BPF accesses? Can we log them theoretically?
>
> Previously the comment said:
>
> + /*
> + * Avoid copy_from_user() here as it may leak information about the BPF
> + * program to userspace via the uaccess buffer.
> + */
>
> but now it says something very generic:
>
> /*
> * Avoid logging uaccesses here as the BPF program may not be following
> * the uaccess log rules.
> */

Yes, theoretically we should be able to log them, but we don't need to
do that now. See my reply here:

https://lore.kernel.org/all/CAMn1gO5B5Q3hfN6kugv2wmdFGNhJb75iRX1zmCkw3wnueN1dtg@mail.gmail.com/#:~:text=This%20comment%20was,the%20comment%20accordingly.

Peter

2021-12-09 21:44:31

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] arm64: add support for uaccess logging

On Wed, Dec 8, 2021 at 1:49 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
> >
> > arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
> > uaccess logging directly to the architecture.
> >
> > Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/include/asm/thread_info.h | 7 ++++++-
> > arch/arm64/kernel/ptrace.c | 7 +++++++
> > arch/arm64/kernel/signal.c | 5 +++++
> > arch/arm64/kernel/syscall.c | 1 +
> > 5 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index c4207cf9bb17..6023946abe4a 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -161,6 +161,7 @@ config ARM64
> > select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> > select HAVE_ARCH_TRACEHOOK
> > select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> > + select HAVE_ARCH_UACCESS_BUFFER
> > select HAVE_ARCH_VMAP_STACK
> > select HAVE_ARM_SMCCC
> > select HAVE_ASM_MODVERSIONS
> > diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> > index e1317b7c4525..0461b36251ea 100644
> > --- a/arch/arm64/include/asm/thread_info.h
> > +++ b/arch/arm64/include/asm/thread_info.h
> > @@ -82,6 +82,8 @@ int arch_dup_task_struct(struct task_struct *dst,
> > #define TIF_SVE_VL_INHERIT 24 /* Inherit SVE vl_onexec across exec */
> > #define TIF_SSBD 25 /* Wants SSB mitigation */
> > #define TIF_TAGGED_ADDR 26 /* Allow tagged user addresses */
> > +#define TIF_UACCESS_BUFFER_ENTRY 27 /* thread has non-zero uaccess_desc_addr_addr */
> > +#define TIF_UACCESS_BUFFER_EXIT 28 /* thread has non-zero kcur */
> >
> > #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> > #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> > @@ -98,6 +100,8 @@ int arch_dup_task_struct(struct task_struct *dst,
> > #define _TIF_SVE (1 << TIF_SVE)
> > #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT)
> > #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
> > +#define _TIF_UACCESS_BUFFER_ENTRY (1 << TIF_UACCESS_BUFFER_ENTRY)
> > +#define _TIF_UACCESS_BUFFER_EXIT (1 << TIF_UACCESS_BUFFER_EXIT)
> >
> > #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> > _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> > @@ -106,7 +110,8 @@ int arch_dup_task_struct(struct task_struct *dst,
> >
> > #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
> > _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
> > - _TIF_SYSCALL_EMU)
> > + _TIF_SYSCALL_EMU | _TIF_UACCESS_BUFFER_ENTRY | \
> > + _TIF_UACCESS_BUFFER_EXIT)
> >
> > #ifdef CONFIG_SHADOW_CALL_STACK
> > #define INIT_SCS \
> > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > index 88a9034fb9b5..283372eccaeb 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -29,6 +29,7 @@
> > #include <linux/regset.h>
> > #include <linux/tracehook.h>
> > #include <linux/elf.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <asm/compat.h>
> > #include <asm/cpufeature.h>
> > @@ -1854,6 +1855,9 @@ int syscall_trace_enter(struct pt_regs *regs)
> > if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
> > trace_sys_enter(regs, regs->syscallno);
> >
> > + if (flags & _TIF_UACCESS_BUFFER_ENTRY)
> > + uaccess_buffer_syscall_entry();
> > +
> > audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
> > regs->regs[2], regs->regs[3]);
> >
> > @@ -1866,6 +1870,9 @@ void syscall_trace_exit(struct pt_regs *regs)
> >
> > audit_syscall_exit(regs);
> >
> > + if (flags & _TIF_UACCESS_BUFFER_EXIT)
> > + uaccess_buffer_syscall_exit();
> > +
> > if (flags & _TIF_SYSCALL_TRACEPOINT)
> > trace_sys_exit(regs, syscall_get_return_value(current, regs));
> >
> > diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> > index 8f6372b44b65..5bbd98e5c257 100644
> > --- a/arch/arm64/kernel/signal.c
> > +++ b/arch/arm64/kernel/signal.c
> > @@ -20,6 +20,7 @@
> > #include <linux/tracehook.h>
> > #include <linux/ratelimit.h>
> > #include <linux/syscalls.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <asm/daifflags.h>
> > #include <asm/debug-monitors.h>
> > @@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)
> >
> > void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
> > {
> > + bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
> > +
> > do {
> > if (thread_flags & _TIF_NEED_RESCHED) {
> > /* Unmask Debug and SError for the next task */
> > @@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
> > local_daif_mask();
> > thread_flags = READ_ONCE(current_thread_info()->flags);
> > } while (thread_flags & _TIF_WORK_MASK);
> > +
> > + uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
> > }
> >
> > unsigned long __ro_after_init signal_minsigstksz;
> > diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> > index 50a0f1a38e84..d59022b594f2 100644
> > --- a/arch/arm64/kernel/syscall.c
> > +++ b/arch/arm64/kernel/syscall.c
> > @@ -7,6 +7,7 @@
> > #include <linux/ptrace.h>
> > #include <linux/randomize_kstack.h>
> > #include <linux/syscalls.h>
> > +#include <linux/uaccess-buffer.h>
>
> This looks strange... Does some other header miss this include?

This was left in unintentionally. I'll remove it in v4.

Peter

2021-12-09 22:12:17

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] kernel: introduce uaccess logging

On Wed, Dec 8, 2021 at 7:33 AM Marco Elver <[email protected]> wrote:
>
> On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
> [...]
> > Peter Collingbourne (6):
> > include: split out uaccess instrumentation into a separate header
> > uaccess-buffer: add core code
> > fs: use copy_from_user_nolog() to copy mount() data
> > uaccess-buffer: add CONFIG_GENERIC_ENTRY support
> > arm64: add support for uaccess logging
> > Documentation: document uaccess logging
>
> I think it needs to be possible to disable the feature via a Kconfig
> option. Not all systems want or could even tolerate the additional
> overheads -- even though you say they are minimal elsewhere. For
> example, some embedded systems most likely have absolutely no use for
> this feature, and the increase in .text might be unacceptable. Certain
> features that we usually take for granted are no different (see
> init/Kconfig: FUTEX, EPOLL, etc.). If you'd like it enabled by
> default, given the overheads are small enough, it can do "default y"
> and be configurable only "if EXPERT".

Okay, done.
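
For reference, the resulting option presumably looks something like the
sketch below (the symbol name, dependency and help text are my guesses at
the v4 shape, not copied from it):

config UACCESS_BUFFER
	bool "Log kernel uaccesses to a userspace buffer" if EXPERT
	default y
	depends on HAVE_ARCH_UACCESS_BUFFER
	help
	  Allow userspace to observe the address and size of each access
	  the kernel makes to user memory while servicing a syscall, via
	  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).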

> Is it possible to add a kselftest-style test to
> tools/testing/selftests? In addition to the basic tests, can certain
> non-trivial properties, like masking of signals, also be tested? I
> think that'd be extremely valuable, because I'm sure we'd have to
> backport this to several older kernels.

Yes, I've added a new patch with a kselftest. (Good thing I did:
together with DEBUG_PREEMPT it uncovered a bug in the
pre/post-exit-loop code. Fixed in v4.)
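
For illustration, here is a rough sketch of what the core of such a test
can look like, built only from the uapi structures and prctl value quoted
later in this thread (the real kselftest in v4 may differ; in particular
the quoted code does not show how the kernel reports the number of entries
written, so this sketch pre-zeroes the log and scans for entries the
kernel filled in):

/* uaccess_buffer_test.c: standalone sketch, not the v4 kselftest. */
#include <linux/types.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/utsname.h>

#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63	/* from the v3 prctl.h hunk */
#define UACCESS_BUFFER_FLAG_WRITE 1

struct uaccess_descriptor {
	__u64 addr;	/* address of the uaccess_buffer_entry array */
	__u64 size;	/* number of elements in the array */
};

struct uaccess_buffer_entry {
	__u64 addr;
	__u64 size;
	__u64 flags;
};

int main(void)
{
	static struct uaccess_buffer_entry entries[64];
	static struct uaccess_descriptor desc;
	/* volatile: the kernel reads and clears this pointer behind the
	 * compiler's back at syscall entry. */
	static struct uaccess_descriptor *volatile desc_ptr;
	struct utsname un;
	int i;

	desc.addr = (__u64)(unsigned long)entries;
	desc.size = 64;

	/* Tell the kernel where to look for the descriptor pointer. */
	if (prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR,
		  (unsigned long)&desc_ptr, 0, 0, 0))
		return 1;

	/* Arm logging: the kernel consumes and clears desc_ptr at entry to
	 * the next syscall, so only uname() below is logged. */
	memset(entries, 0, sizeof(entries));
	desc_ptr = &desc;
	uname(&un);

	/* Entries the kernel filled in are now non-zero. */
	for (i = 0; i < 64 && entries[i].addr; i++)
		printf("%s 0x%llx size %llu\n",
		       (entries[i].flags & UACCESS_BUFFER_FLAG_WRITE) ?
				"write" : "read",
		       (unsigned long long)entries[i].addr,
		       (unsigned long long)entries[i].size);
	return 0;
}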

Peter

2021-12-09 22:13:24

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] uaccess-buffer: add core code

On Wed, Dec 8, 2021 at 2:21 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Wed, 8 Dec 2021 at 05:48, Peter Collingbourne <[email protected]> wrote:
> >
> > Add the core code to support uaccess logging. Subsequent patches will
> > hook this up to the arch-specific kernel entry and exit code for
> > certain architectures.
> >
> > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > v3:
> > - performance optimizations for entry/exit code
> > - don't use kcur == NULL to mean overflow
> > - fix potential double free in clone()
> > - don't allocate a new kernel-side uaccess buffer for each syscall
> > - fix uaccess buffer leak on exit
> > - fix some sparse warnings
> >
> > v2:
> > - New interface that avoids multiple syscalls per real syscall and
> > is arch-generic
> > - Avoid logging uaccesses done by BPF programs
> > - Add documentation
> > - Split up into multiple patches
> > - Various code moves, renames etc as requested by Marco
> >
> > fs/exec.c | 3 +
> > include/linux/instrumented-uaccess.h | 6 +-
> > include/linux/sched.h | 5 ++
> > include/linux/uaccess-buffer-info.h | 46 ++++++++++
> > include/linux/uaccess-buffer.h | 112 +++++++++++++++++++++++
> > include/uapi/linux/prctl.h | 3 +
> > include/uapi/linux/uaccess-buffer.h | 27 ++++++
> > kernel/Makefile | 1 +
> > kernel/bpf/helpers.c | 7 +-
> > kernel/fork.c | 4 +
> > kernel/signal.c | 4 +-
> > kernel/sys.c | 6 ++
> > kernel/uaccess-buffer.c | 129 +++++++++++++++++++++++++++
> > 13 files changed, 350 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/uaccess-buffer-info.h
> > create mode 100644 include/linux/uaccess-buffer.h
> > create mode 100644 include/uapi/linux/uaccess-buffer.h
> > create mode 100644 kernel/uaccess-buffer.c
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 537d92c41105..c9975e790f30 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -65,6 +65,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/io_uring.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/uaccess.h>
> > #include <asm/mmu_context.h>
> > @@ -1313,6 +1314,8 @@ int begin_new_exec(struct linux_binprm * bprm)
> > me->personality &= ~bprm->per_clear;
> >
> > clear_syscall_work_syscall_user_dispatch(me);
> > + uaccess_buffer_set_descriptor_addr_addr(0);
> > + uaccess_buffer_free(current);
> >
> > /*
> > * We have to apply CLOEXEC before we change whether the process is
> > diff --git a/include/linux/instrumented-uaccess.h b/include/linux/instrumented-uaccess.h
> > index ece549088e50..b967f4436d15 100644
> > --- a/include/linux/instrumented-uaccess.h
> > +++ b/include/linux/instrumented-uaccess.h
> > @@ -2,7 +2,8 @@
> >
> > /*
> > * This header provides generic wrappers for memory access instrumentation for
> > - * uaccess routines that the compiler cannot emit for: KASAN, KCSAN.
> > + * uaccess routines that the compiler cannot emit for: KASAN, KCSAN,
> > + * uaccess buffers.
> > */
> > #ifndef _LINUX_INSTRUMENTED_UACCESS_H
> > #define _LINUX_INSTRUMENTED_UACCESS_H
> > @@ -11,6 +12,7 @@
> > #include <linux/kasan-checks.h>
> > #include <linux/kcsan-checks.h>
> > #include <linux/types.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > /**
> > * instrument_copy_to_user - instrument reads of copy_to_user
> > @@ -27,6 +29,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> > {
> > kasan_check_read(from, n);
> > kcsan_check_read(from, n);
> > + uaccess_buffer_log_write(to, n);
> > }
> >
> > /**
> > @@ -44,6 +47,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> > {
> > kasan_check_write(to, n);
> > kcsan_check_write(to, n);
> > + uaccess_buffer_log_read(from, n);
> > }
> >
> > #endif /* _LINUX_INSTRUMENTED_UACCESS_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 78c351e35fec..7c5278d7b57d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -34,6 +34,7 @@
> > #include <linux/rseq.h>
> > #include <linux/seqlock.h>
> > #include <linux/kcsan.h>
> > +#include <linux/uaccess-buffer-info.h>
> > #include <asm/kmap_size.h>
> >
> > /* task_struct member predeclarations (sorted alphabetically): */
> > @@ -1484,6 +1485,10 @@ struct task_struct {
> > struct callback_head l1d_flush_kill;
> > #endif
> >
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > + struct uaccess_buffer_info uaccess_buffer;
> > +#endif
> > +
> > /*
> > * New fields for task_struct should be added above here, so that
> > * they are included in the randomized portion of task_struct.
> > diff --git a/include/linux/uaccess-buffer-info.h b/include/linux/uaccess-buffer-info.h
> > new file mode 100644
> > index 000000000000..15e2d8f7c074
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer-info.h
> > @@ -0,0 +1,46 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_INFO_H
> > +#define _LINUX_UACCESS_BUFFER_INFO_H
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +struct uaccess_buffer_info {
> > + /*
> > + * The pointer to pointer to struct uaccess_descriptor. This is the
> > + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> > + */
> > + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> > +
> > + /*
> > + * The pointer to struct uaccess_descriptor read at syscall entry time.
> > + */
> > + struct uaccess_descriptor __user *desc_ptr;
> > +
> > + /*
> > + * A pointer to the kernel's temporary copy of the uaccess log for the
> > + * current syscall. We log to a kernel buffer in order to avoid leaking
> > + * timing information to userspace.
> > + */
> > + struct uaccess_buffer_entry *kbegin;
> > +
> > + /*
> > + * The position of the next uaccess buffer entry for the current
> > + * syscall, or NULL if we are not logging the current syscall.
> > + */
> > + struct uaccess_buffer_entry *kcur;
> > +
> > + /*
> > + * A pointer to the end of the kernel's uaccess log.
> > + */
> > + struct uaccess_buffer_entry *kend;
> > +
> > + /*
> > + * The pointer to the userspace uaccess log, as read from the
> > + * struct uaccess_descriptor.
> > + */
> > + struct uaccess_buffer_entry __user *ubegin;
> > +};
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_INFO_H */
> > diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..f2f46db274f3
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer.h
> > @@ -0,0 +1,112 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_H
> > +#define _LINUX_UACCESS_BUFFER_H
> > +
> > +#include <linux/sched.h>
> > +#include <uapi/linux/uaccess-buffer.h>
> > +
> > +#include <asm-generic/errno-base.h>
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return test_task_syscall_work(tsk, UACCESS_BUFFER_ENTRY);
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void);
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > + __uaccess_buffer_syscall_entry();
> > +}
> > +
> > +void __uaccess_buffer_syscall_exit(void);
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > + __uaccess_buffer_syscall_exit();
> > +}
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void);
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + if (!test_syscall_work(UACCESS_BUFFER_ENTRY))
> > + return false;
> > + return __uaccess_buffer_pre_exit_loop();
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void);
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > + if (pending)
> > + __uaccess_buffer_post_exit_loop();
> > +}
> > +
> > +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
>
> I would move the implementation to .c file. It's a rare path.

Done.

> > +{
> > + current->uaccess_buffer.desc_ptr_ptr =
> > + (struct uaccess_descriptor __user * __user *)addr;
> > + if (addr)
> > + set_syscall_work(UACCESS_BUFFER_ENTRY);
> > + else
> > + clear_syscall_work(UACCESS_BUFFER_ENTRY);
> > + return 0;
> > +}
> > +
> > +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len);
> > +
> > +void uaccess_buffer_free(struct task_struct *tsk);
> > +
> > +void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
> > +static inline void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > +{
> > + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
>
> UACCESS_BUFFER_EXIT is only defined in future patches, so this won't compile.

Right, but there's no way for CONFIG_HAVE_ARCH_UACCESS_BUFFER to be enabled
at this point in the series, so none of this code is built yet and nothing
breaks. We define the constants for this (TIF_UACCESS_BUFFER_EXIT for arm64,
SYSCALL_WORK_UACCESS_BUFFER_EXIT for GENERIC_ENTRY) at the same time as we
enable the respective architecture support.

>
> > + __uaccess_buffer_log_read(from, n);
> > +}
> > +
> > +void __uaccess_buffer_log_write(void __user *to, unsigned long n);
> > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > + if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
> > + __uaccess_buffer_log_write(to, n);
> > +}
> > +
> > +#else
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return false;
> > +}
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > +}
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > +}
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + return false;
> > +}
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > +}
> > +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > +{
> > + return -EINVAL;
> > +}
> > +static inline void uaccess_buffer_free(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +#define copy_from_user_nolog(to, from, len) copy_from_user(to, from, len)
> > +
> > +static inline void uaccess_buffer_log_read(const void __user *from,
> > + unsigned long n)
> > +{
> > +}
> > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > +}
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_H */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index bb73e9a0b24f..74b37469c7b3 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -272,4 +272,7 @@ struct prctl_mm_map {
> > # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> > # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
> >
> > +/* Configure uaccess logging feature */
> > +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> > +
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..bf10f7c78857
> > --- /dev/null
> > +++ b/include/uapi/linux/uaccess-buffer.h
> > @@ -0,0 +1,27 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> > +#define _UAPI_LINUX_UACCESS_BUFFER_H
> > +
> > +#include <linux/types.h>
> > +
> > +/* Location of the uaccess log. */
> > +struct uaccess_descriptor {
> > + /* Address of the uaccess_buffer_entry array. */
> > + __u64 addr;
> > + /* Size of the uaccess_buffer_entry array in number of elements. */
> > + __u64 size;
> > +};
> > +
> > +/* Format of the entries in the uaccess log. */
> > +struct uaccess_buffer_entry {
> > + /* Address being accessed. */
> > + __u64 addr;
> > + /* Number of bytes that were accessed. */
> > + __u64 size;
> > + /* UACCESS_BUFFER_* flags. */
> > + __u64 flags;
> > +};
> > +
> > +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> > +
> > +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 186c49582f45..d4d9be5146c3 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> > obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> > obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > obj-$(CONFIG_CFI_CLANG) += cfi.o
> > +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
> >
> > obj-$(CONFIG_PERF_EVENTS) += events/
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 649f07623df6..ab6520a633ef 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -15,6 +15,7 @@
> > #include <linux/pid_namespace.h>
> > #include <linux/proc_ns.h>
> > #include <linux/security.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include "../../lib/kstrtox.h"
> >
> > @@ -637,7 +638,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> > BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> > const void __user *, user_ptr)
> > {
> > - int ret = copy_from_user(dst, user_ptr, size);
> > + /*
> > + * Avoid logging uaccesses here as the BPF program may not be following
> > + * the uaccess log rules.
> > + */
> > + int ret = copy_from_user_nolog(dst, user_ptr, size);
> >
> > if (unlikely(ret)) {
> > memset(dst, 0, size);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3244cc56b697..8be2ca528a65 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -96,6 +96,7 @@
> > #include <linux/scs.h>
> > #include <linux/io_uring.h>
> > #include <linux/bpf.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <asm/pgalloc.h>
> > #include <linux/uaccess.h>
> > @@ -754,6 +755,7 @@ void __put_task_struct(struct task_struct *tsk)
> > delayacct_tsk_free(tsk);
> > put_signal_struct(tsk->signal);
> > sched_core_free(tsk);
> > + uaccess_buffer_free(tsk);
> >
> > if (!profile_handoff_task(tsk))
> > free_task(tsk);
> > @@ -890,6 +892,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> > if (memcg_charge_kernel_stack(tsk))
> > goto free_stack;
> >
> > + uaccess_buffer_free(orig);
> > +
> > stack_vm_area = task_stack_vm_area(tsk);
> >
> > err = arch_dup_task_struct(tsk, orig);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index a629b11bf3e0..69bf21518bd0 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -45,6 +45,7 @@
> > #include <linux/posix-timers.h>
> > #include <linux/cgroup.h>
> > #include <linux/audit.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/signal.h>
> > @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> > if (sig_fatal(p, sig) &&
> > !(signal->flags & SIGNAL_GROUP_EXIT) &&
> > !sigismember(&t->real_blocked, sig) &&
> > - (sig == SIGKILL || !p->ptrace)) {
> > + (sig == SIGKILL ||
> > + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
> > /*
> > * This signal will be fatal to the whole group.
> > */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 8fdac0d90504..c71a9a9c0f68 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -42,6 +42,7 @@
> > #include <linux/version.h>
> > #include <linux/ctype.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/compat.h>
> > #include <linux/syscalls.h>
> > @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> > break;
> > #endif
> > + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
>
> Does this miss uaccess_buffer_free() for the 0 case?
> It seems that when we set it to 0 we always want to free as well (e.g.
> in exec). I wonder if freeing should be done by
> uaccess_buffer_set_descriptor_addr_addr() itself.
> Both uaccess_buffer_set_descriptor_addr_addr() and
> uaccess_buffer_free() reset task work, which is fine but is somewhat
> suboptimal logically.
> Then task exit could do uaccess_buffer_set_descriptor_addr_addr(0).

We don't need to free in that case because we can just log to the original
location. I originally had uaccess_buffer_set_descriptor_addr_addr() do the
freeing, but you asked me to change it to avoid the free [1]. I guess I don't
have a strong opinion but slightly prefer having the two functions do orthogonal
things.

[1] https://lore.kernel.org/all/CACT4Y+aoiT+z+3CMBNmO0SwXBXpfDCsHY7pPLf54S8V=c-a8ag@mail.gmail.com/#:~:text=Is%20this%20necessary

>
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> > diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> > new file mode 100644
> > index 000000000000..088e43f7611c
> > --- /dev/null
> > +++ b/kernel/uaccess-buffer.c
> > @@ -0,0 +1,129 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Support for uaccess logging via uaccess buffers.
> > + *
> > + * Copyright (C) 2021, Google LLC.
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/mm.h>
> > +#include <linux/prctl.h>
> > +#include <linux/ptrace.h>
> > +#include <linux/sched.h>
> > +#include <linux/signal.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/uaccess-buffer.h>
> > +
> > +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> > + unsigned long flags)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_buffer_entry *entry = buf->kcur;
> > +
> > + if (entry == buf->kend || unlikely(uaccess_kernel()))
> > + return;
> > + entry->addr = addr;
> > + entry->size = size;
> > + entry->flags = flags;
> > +
> > + ++buf->kcur;
> > +}
> > +
> > +void __uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)from, n, 0);
> > +}
> > +EXPORT_SYMBOL(__uaccess_buffer_log_read);
> > +
> > +void __uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> > +}
> > +EXPORT_SYMBOL(__uaccess_buffer_log_write);
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor __user *desc_ptr;
> > + sigset_t tmp_mask;
> > +
> > + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > + return false;
> > +
> > + current->real_blocked = current->blocked;
> > + sigfillset(&tmp_mask);
>
> This and __uaccess_buffer_post_exit_loop() runs only when we have a
> signal/timer interrupt between setting the descriptor address in
> userspace and entering the next syscall, right?
> Just want to make sure this code is not executed for normal uaccess
> tracing for performance reasons.

They only need to run if something in _TIF_WORK_MASK (arm64) or
EXIT_TO_USER_MODE_WORK (GENERIC_ENTRY) is set, i.e. if signals are
pending or some kind of tracing is enabled. That's how it already works
on the arm64 side (the check is in the caller of do_notify_resume()),
but I had neglected to move the calls into the if statement on the
GENERIC_ENTRY side; done in v4. See the sketch just after the quoted
functions below.

> > + set_current_blocked(&tmp_mask);
> > + return true;
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void)
> > +{
> > + spin_lock_irq(&current->sighand->siglock);
> > + current->blocked = current->real_blocked;
> > + recalc_sigpending();
> > + spin_unlock_irq(&current->sighand->siglock);
> > +}
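
Concretely, on the GENERIC_ENTRY side the v4 shape is presumably something
like the sketch below (only the relevant part of exit_to_user_mode_prepare()
is shown, and the surrounding details may differ from the real code):

/* kernel/entry/common.c, sketch of the v4 placement */
static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);

	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) {
		/* Only mask signals when there is exit work to run,
		 * mirroring the check in the arm64 caller of
		 * do_notify_resume(). */
		bool ub_pending = uaccess_buffer_pre_exit_loop();

		ti_work = exit_to_user_mode_loop(regs, ti_work);
		uaccess_buffer_post_exit_loop(ub_pending);
	}

	/* ... arch hooks, lockdep and tracing checks elided ... */
}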
> > +
> > +void uaccess_buffer_free(struct task_struct *tsk)
> > +{
> > + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> > +
> > + kfree(buf->kbegin);
> > + clear_syscall_work(UACCESS_BUFFER_EXIT);
> > + buf->kbegin = buf->kcur = buf->kend = NULL;
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor desc;
> > +
> > + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> > + put_user(0, buf->desc_ptr_ptr) ||
> > + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> > + return;
> > +
> > + if (desc.size > 1024)
> > + desc.size = 1024;
> > +
> > + if (buf->kend - buf->kbegin != desc.size)
> > + buf->kbegin =
> > + krealloc_array(buf->kbegin, desc.size,
> > + sizeof(struct uaccess_buffer_entry),
> > + GFP_KERNEL);
> > + if (!buf->kbegin)
>
> I think we also need to set at least buf->kend to NULL here.
> I am not sure what can go wrong now, but it's a strange state. On next
> iteration we will do "buf->kend - buf->kbegin", where kend is a
> dangling pointer and kbegin is NULL.

Done.
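
Roughly, the fixed allocation path then looks like the sketch below (the
lines after the NULL check are inferred from the struct comments earlier
in the patch, not copied from v4):

	if (buf->kend - buf->kbegin != desc.size)
		buf->kbegin = krealloc_array(buf->kbegin, desc.size,
					     sizeof(struct uaccess_buffer_entry),
					     GFP_KERNEL);
	if (!buf->kbegin) {
		/* Don't leave kcur/kend dangling after a failed resize. */
		buf->kcur = buf->kend = NULL;
		return;
	}
	buf->kcur = buf->kbegin;
	buf->kend = buf->kbegin + desc.size;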

Peter

2021-12-09 22:14:25

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] uaccess-buffer: add core code

On Wed, Dec 8, 2021 at 8:46 AM Marco Elver <[email protected]> wrote:
>
> On Tue, Dec 07, 2021 at 08:48PM -0800, Peter Collingbourne wrote:
> > Add the core code to support uaccess logging. Subsequent patches will
> > hook this up to the arch-specific kernel entry and exit code for
> > certain architectures.
> >
> > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > [...]
> > diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..f2f46db274f3
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer.h
> > @@ -0,0 +1,112 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_H
> > +#define _LINUX_UACCESS_BUFFER_H
> > +
> > +#include <linux/sched.h>
> > +#include <uapi/linux/uaccess-buffer.h>
> > +
> > +#include <asm-generic/errno-base.h>
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
>
> Kernel-doc comments for each of the below would be useful (if __ prefixed
> functions are implementation details, they can be left as-is).

Done.
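
For example, the pre-exit-loop helper might gain something along these
lines (the wording is inferred from the quoted implementation and the
cover letter, not taken from v4):

/**
 * uaccess_buffer_pre_exit_loop - hook before the exit-to-user-mode work loop
 *
 * If the task has armed a uaccess descriptor, block all signals so that an
 * asynchronous signal handler (and the syscalls it makes) cannot run between
 * the point where userspace set up the descriptor and the syscall it intends
 * to log.
 *
 * Return: true if signals were masked; the caller must pass this value to
 * uaccess_buffer_post_exit_loop() so the original mask is restored.
 */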

> > [...]
> > +size_t copy_from_user_nolog(void *to, const void __user *from, size_t len);
>
> copy_from_user() has unsigned long for @len and also return type. I'd
> make them match.

Done.
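
Presumably the declaration then becomes:

unsigned long copy_from_user_nolog(void *to, const void __user *from,
				   unsigned long len);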

> > [...]
> > +#define copy_from_user_nolog(to, from, len) copy_from_user(to, from, len)
>
> "#define copy_from_user_nolog copy_from_user" ?

Done.

Peter

2021-12-10 02:59:23

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

On Thu, 9 Dec 2021 at 22:42, Peter Collingbourne <[email protected]> wrote:
> > > With uaccess logging the contract is that the kernel must not report
> > > accessing more data than necessary, as this can lead to false positive
> > > reports in downstream consumers. This generally works out of the box
> > > when instrumenting copy_{from,to}_user(), but with the data argument
> > > to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> > > much as we can, if the PAGE_SIZE sized access failed) and figure out
> > > later how much we actually need.
> > >
> > > To prevent this from leading to a false positive report, use
> > > copy_from_user_nolog(), which will prevent the access from being logged.
> > > Recall that it is valid for the kernel to report accessing less
> > > data than it actually accessed, as uaccess logging is a best-effort
> > > mechanism for reporting uaccesses.
> > >
> > > Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> > > Signed-off-by: Peter Collingbourne <[email protected]>
> > > ---
> > > fs/namespace.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > index 659a8f39c61a..8f5f2aaca64e 100644
> > > --- a/fs/namespace.c
> > > +++ b/fs/namespace.c
> > > @@ -31,6 +31,7 @@
> > > #include <uapi/linux/mount.h>
> > > #include <linux/fs_context.h>
> > > #include <linux/shmem_fs.h>
> > > +#include <linux/uaccess-buffer.h>
> > >
> > > #include "pnode.h"
> > > #include "internal.h"
> > > @@ -3197,7 +3198,12 @@ static void *copy_mount_options(const void __user * data)
> > > if (!copy)
> > > return ERR_PTR(-ENOMEM);
> > >
> > > - left = copy_from_user(copy, data, PAGE_SIZE);
> > > + /*
> > > + * Use copy_from_user_nolog to avoid reporting overly large accesses in
> > > + * the uaccess buffer, as this can lead to false positive reports in
> > > + * downstream consumers.
> > > + */
> > > + left = copy_from_user_nolog(copy, data, PAGE_SIZE);
> >
> > A late idea...
> > Maybe it's better to log them with a new UACCESS_BUFFER_FLAG_OVERREAD
> > flag. Better for user-space, at least can detect UAFs by checking the
> > first byte. And a more logical kernel annotation (maybe will be used
> > in some other tools? or if we ever check user tags in the kernel).
> >
> > Probably not too important today since we use this only in 2 places,
> > but longer term may be better.
>
> I'm not sure about this. The overreads are basically an implementation
> detail of the kernel, so I'm not sure it makes sense to expose them. A
> scheme where we expose all overreads wouldn't necessarily help with
> UAF, because what if for example the kernel reads *behind* the
> user-provided pointer? I guess it could lead to false positives.

If user-space uses logging to check addressability, then it can safely
check only the first byte (right? there must be at least 1 byte passed
by user-space at that address). And that's enough to detect UAFs.

> > Btw, what's the story with BPF accesses? Can we log them theoretically?
> >
> > Previously the comment said:
> >
> > + /*
> > + * Avoid copy_from_user() here as it may leak information about the BPF
> > + * program to userspace via the uaccess buffer.
> > + */
> >
> > but now it says something very generic:
> >
> > /*
> > * Avoid logging uaccesses here as the BPF program may not be following
> > * the uaccess log rules.
> > */
>
> Yes we should be able to log them theoretically, but we don't need to
> do that now. See my reply here:
>
> https://lore.kernel.org/all/CAMn1gO5B5Q3hfN6kugv2wmdFGNhJb75iRX1zmCkw3wnueN1dtg@mail.gmail.com/#:~:text=This%20comment%20was,the%20comment%20accordingly.

I see. These could be marked with another flag.
I don't have a strong opinion about this, but I am mentioning it
because in my experience it's better to expose more raw information
from the kernel in these cases, rather than hardcoding policies (what's
ignored, why and when) into kernel code: the delay from a kernel change
to wide deployment is 5+ years, and user-space code may need to detect
and deal with all the various versions of the kernel logic.
Say, a fuzzer may still want to know about the mount options (rather
than getting no signal that the kernel reads at least something at that
address). But adding them later with a flag is not really a
backwards-compatible change, because by then addressability-checking
code exists that does not check the new flag and will produce false
positives.

2021-12-10 04:02:33

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

On Thu, Dec 9, 2021 at 6:59 PM Dmitry Vyukov <[email protected]> wrote:
>
> On Thu, 9 Dec 2021 at 22:42, Peter Collingbourne <[email protected]> wrote:
> > > [...]
> > >
> > > A late idea...
> > > Maybe it's better to log them with a new UACCESS_BUFFER_FLAG_OVERREAD
> > > flag. Better for user-space, at least can detect UAFs by checking the
> > > first byte. And a more logical kernel annotation (maybe will be used
> > > in some other tools? or if we ever check user tags in the kernel).
> > >
> > > Probably not too important today since we use this only in 2 places,
> > > but longer term may be better.
> >
> > I'm not sure about this. The overreads are basically an implementation
> > detail of the kernel, so I'm not sure it makes sense to expose them. A
> > scheme where we expose all overreads wouldn't necessarily help with
> > UAF, because what if for example the kernel reads *behind* the
> > user-provided pointer? I guess it could lead to false positives.
>
> If user-space uses logging to check addressability, then it can safely
> check only the first byte (right? there must be at least 1 byte passed
> by user-space at that address). And that's enough to detect UAFs.

I was thinking more e.g. what if the kernel reads an entire page with
copy_from_user() and takes a subset of it later. Then the first byte
could point to some other random allocation in the same page and lead
to a false UAF report if we just consider the first byte.

So I think the use cases for accesses with this flag set may be
limited to things like fuzzers.

> > > Btw, what's the story with BPF accesses? Can we log them theoretically?
> > >
> > > Previously the comment said:
> > >
> > > + /*
> > > + * Avoid copy_from_user() here as it may leak information about the BPF
> > > + * program to userspace via the uaccess buffer.
> > > + */
> > >
> > > but now it says something very generic:
> > >
> > > /*
> > > * Avoid logging uaccesses here as the BPF program may not be following
> > > * the uaccess log rules.
> > > */
> >
> > Yes we should be able to log them theoretically, but we don't need to
> > do that now. See my reply here:
> >
> > https://lore.kernel.org/all/CAMn1gO5B5Q3hfN6kugv2wmdFGNhJb75iRX1zmCkw3wnueN1dtg@mail.gmail.com/#:~:text=This%20comment%20was,the%20comment%20accordingly.
>
> I see. These could be marked with another flag.
> I don't have a strong opinion about this, but I am mentioning it
> because in my experience it's better to expose more raw information
> from the kernel in these cases, rather than hardcoding policies (what's
> ignored, why and when) into kernel code: the delay from a kernel change
> to wide deployment is 5+ years, and user-space code may need to detect
> and deal with all the various versions of the kernel logic.
> Say, a fuzzer may still want to know about the mount options (rather
> than getting no signal that the kernel reads at least something at that
> address). But adding them later with a flag is not really a
> backwards-compatible change, because by then addressability-checking
> code exists that does not check the new flag and will produce false
> positives.

I think this is a good point. I'll see about adding flags for the BPF
and overread cases.
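
A sketch of what such flags could look like on top of the quoted patches
follows; everything below apart from UACCESS_BUFFER_FLAG_WRITE is
hypothetical and not part of the posted series:

/* include/uapi/linux/uaccess-buffer.h */
#define UACCESS_BUFFER_FLAG_WRITE	1 /* access was a write */
#define UACCESS_BUFFER_FLAG_OVERREAD	2 /* hypothetical: kernel read more than it consumed */
#define UACCESS_BUFFER_FLAG_BPF		4 /* hypothetical: access made by a BPF program */

/* kernel/uaccess-buffer.c: a possible helper, reusing uaccess_buffer_log()
 * from patch 2, for call sites like copy_mount_options(). */
void __uaccess_buffer_log_overread(const void __user *from, unsigned long n)
{
	uaccess_buffer_log((unsigned long)from, n, UACCESS_BUFFER_FLAG_OVERREAD);
}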

Peter

2021-12-10 04:23:30

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] fs: use copy_from_user_nolog() to copy mount() data

On Fri, 10 Dec 2021 at 05:02, Peter Collingbourne <[email protected]> wrote:
> > On Thu, 9 Dec 2021 at 22:42, Peter Collingbourne <[email protected]> wrote:
> > > [...]
> > >
> > > I'm not sure about this. The overreads are basically an implementation
> > > detail of the kernel, so I'm not sure it makes sense to expose them. A
> > > scheme where we expose all overreads wouldn't necessarily help with
> > > UAF, because what if for example the kernel reads *behind* the
> > > user-provided pointer? I guess it could lead to false positives.
> >
> > If user-space uses logging to check addressability, then it can safely
> > check only the first byte (right? there must be at least 1 byte passed
> > by user-space at that address). And that's enough to detect UAFs.
>
> I was thinking more e.g. what if the kernel reads an entire page with
> copy_from_user() and takes a subset of it later. Then the first byte
> could point to some other random allocation in the same page and lead
> to a false UAF report if we just consider the first byte.

Hmm, good point. As I said, I don't feel strongly about this; I don't
know what the right answer is.

> [...]