2021-09-22 06:19:30

by Peter Collingbourne

Subject: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

This patch introduces a kernel feature known as uaccess logging.
With uaccess logging, the userspace program passes the address and size
of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
is a request for the kernel to log any uaccesses made during the next
syscall to the uaccess buffer. When the next syscall returns, the address
one past the end of the logged uaccess buffer entries is written to the
location specified by the third argument to the prctl(). In this way,
the userspace program may enumerate the uaccesses logged to the uaccess
buffer to determine which accesses occurred.

Uaccess logging has several use cases focused around bug detection
tools:

1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
making use of the ARM Memory Tagging Extension (MTE) need to monitor
all memory accesses in a program so that they can detect memory
errors. For accesses made purely in userspace, this is achieved
via compiler instrumentation, or for MTE, via direct hardware
support. However, accesses made by the kernel on behalf of the
user program via syscalls (i.e. uaccesses) are invisible to these
tools. With MTE there is some level of error detection possible in
the kernel (in synchronous mode, bad accesses generally result in
returning -EFAULT from the syscall), but by the time we get back to
userspace we've lost the information about the address and size of the
failed access, which makes it harder to produce a useful error report.

With the current versions of the sanitizers, we address this by
interposing the libc syscall stubs with a wrapper that checks the
memory based on what we believe the uaccesses will be. However, this
creates a maintenance burden: each syscall must be annotated with
its uaccesses in order to be recognized by the sanitizer, and these
annotations must be continuously updated as the kernel changes. This
is especially burdensome for syscalls such as ioctl(2) which have a
large surface area of possible uaccesses.

2) Verifying the validity of kernel accesses. This can be achieved in
conjunction with the userspace memory safety tools mentioned in (1).
Even a sanitizer whose syscall wrappers encode complete knowledge of
the kernel's intended API may still diverge from the kernel's actual
uaccesses due to kernel bugs. A sanitizer with knowledge of the kernel's actual
uaccesses may produce more accurate error reports that reveal such
bugs.

An example of such a bug, which was found by an earlier version of this
patch together with a prototype client of the API in HWASan, was fixed
by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
a struct ifreq for socket ioctls"). Although this bug turned out to be
relatively harmless, it was a bug nonetheless, and it's always possible
that more serious bugs of this sort may be introduced in the future.

3) Kernel fuzzing. We may use the list of reported kernel accesses to
guide a kernel fuzzing tool such as syzkaller (so that it knows which
parts of user memory to fuzz), as an alternative to providing the tool
with a list of syscalls and their uaccesses (which, as noted in (2),
may not be accurate).

All signals except SIGKILL and SIGSTOP are masked for the interval
between the prctl() and the next syscall in order to prevent handlers
for intervening asynchronous signals from issuing syscalls that may
cause uaccesses from the wrong syscall to be logged.

The format of a uaccess buffer entry is defined as follows:

struct uaccess_buffer_entry {
	u64 addr, size, flags;
};

The meaning of addr and size should be obvious. On arm64, tag bits
are preserved in the addr field. The current meaning of the flags
field is that bit 0 indicates whether the access was a read (clear)
or a write (set). The meaning of all other flag bits is reserved.
All fields are of type u64 in order to avoid compat concerns.

Here is an example of a code snippet that will enumerate the accesses
performed by a uname(2) syscall:

struct uaccess_buffer_entry entries[64];
uint64_t entries_end64 = (uint64_t)entries;
struct utsname un;
prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
uname(&un);
struct uaccess_buffer_entry *entries_end =
	(struct uaccess_buffer_entry *)entries_end64;
for (struct uaccess_buffer_entry *i = entries; i != entries_end; ++i) {
	printf("%s at 0x%lx size 0x%lx\n",
	       i->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
	       (unsigned long)i->addr, (unsigned long)i->size);
}

Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
course, not all of the accesses may fit in the buffer, but aside from
that, there are syscalls such as async I/O that are currently missed due
to the uaccesses occurring on a different kernel task (this is analogous
to how async I/O accesses are exempt from userspace MTE checks). We
view this as acceptable, as the access buffer can be sized sufficiently
large to handle syscalls that make a reasonable number of uaccesses,
and syscalls that use a different task for uaccesses are rare. In
many cases, the sanitizer does not need to see every memory access,
so it's fine if we miss the odd uaccess here and there. Even for those
sanitizers that do need to see every memory access it still represents
a much lower maintenance burden if we just have to handle the unusual
syscalls specially.
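
For example, a client that needs to notice dropped entries could treat a
completely filled buffer as potentially truncated. A minimal sketch, reusing
the entries[] array from the snippet above (handle_possible_truncation() is
a hypothetical client-side helper, not part of this patch):

if (entries_end == &entries[64]) {
	/* The buffer filled up; later accesses, if any, were dropped. */
	handle_possible_truncation();
}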

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64 and
architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

One downside of this ABI is that it involves making two syscalls per
"real" syscall, which can harm performance. One possible way to avoid
this may be to have the prctl() register the uaccess buffer location
once at thread startup and use the same location for all syscalls in
the thread. However, because the program may be making syscalls very
early, before TLS is available, this may not always work. Furthermore,
because of the same asynchronous signal concerns that prompted temporarily
masking signals after the prctl(), the syscall stub would need to be made
reentrant, and it is unclear whether this is feasible without manually
masking asynchronous signals using rt_sigprocmask(2) while reading the
uaccess buffer, defeating the purpose of avoiding the extra syscall.

One idea that we considered involved using the stack pointer address as
a unique identifier for the syscall, but this would need to be
arch-specific, as we do not currently appear to have an arch-generic way
of retrieving the stack pointer; the userspace side would also need some
arch-specific code for this to work. It's also possible that a longjmp()
past the signal handler would make the stack pointer address not unique
enough for this purpose.

On the other hand, by allocating the uaccess log on the stack and blocking
asynchronous signals for the interval between the prctl() and the "real"
syscall, we can avoid any reentrancy and TLS concerns.
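
For illustration, a sanitizer syscall wrapper built on this scheme might look
roughly like the following (sketch only; sanitizer_check_access() stands in
for whatever checking the tool performs and is not part of this patch):

long wrapped_syscall3(long nr, long a0, long a1, long a2)
{
	struct uaccess_buffer_entry entries[64];
	uint64_t end = (uint64_t)entries;
	long ret;

	/* Signals are masked from here until the next syscall returns. */
	prctl(PR_LOG_UACCESS, entries, sizeof(entries), &end, 0);
	ret = syscall(nr, a0, a1, a2);
	for (struct uaccess_buffer_entry *i = entries;
	     i != (struct uaccess_buffer_entry *)end; ++i)
		sanitizer_check_access(i->addr, i->size, i->flags);
	return ret;
}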

Another way to avoid the overhead may be to use an architecture-specific
calling convention to pass the address of the uaccess buffer to the kernel
at syscall time in registers currently unused for syscall arguments. For
example, one arm64-specific scheme that was used in a previous iteration
of the patch was:

- Bit 0 of the immediate argument to the SVC instruction must be set.
- Register X6 contains the address of the access buffer.
- Register X7 contains the size of the access buffer in bytes.
- On return, X6 will contain the address of the memory location following
any access buffer entries written by the kernel.

However, this would need to be implemented separately for each
architecture (and some of them don't have enough registers anyway),
whereas the prctl() is (at least in theory) architecture-generic.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
to trace per-syscall. This is so that we can exclude syscalls that
should not be traced, such as syscalls that make up part of the
sanitizer implementation (to avoid infinite recursion when e.g. printing
an error report).

- Tracing would need to be synchronous in order to produce useful
stack traces. For example this could be achieved using the new SIGTRAP
on perf events mechanism. However, this would require logging each
access to the stack (in the form of a sigcontext) and this is more
likely to overflow the stack due to being much larger than a uaccess
buffer entry as well as being unbounded, in contrast to the bounded
buffer size passed to prctl(). An approach based on signal handlers is
also likely to fall foul of the asynchronous signal issues mentioned
previously, together with needing sigreturn to be handled specially
(because it copies a sigcontext from userspace), since otherwise we
could never return from the signal handler. Furthermore, arguments to the
trace events are not available to SIGTRAP. (This on its own wouldn't
be insurmountable though -- we could add the arguments as fields
to siginfo.)

- The ftrace API (https://www.kernel.org/doc/Documentation/trace/ftrace.txt),
e.g. trace_pipe_raw, gives access to the internal ring buffer, but
I don't think it's usable because it's per-CPU and not per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs may
only be loaded as root, among other potential headaches.

Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
Signed-off-by: Peter Collingbourne <[email protected]>
---
arch/Kconfig | 6 ++
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/syscall.c | 2 +
include/linux/instrumented.h | 5 +-
include/linux/sched.h | 3 +
include/linux/uaccess_buffer.h | 43 ++++++++++
include/linux/uaccess_buffer_info.h | 23 ++++++
include/uapi/linux/prctl.h | 9 +++
kernel/Makefile | 1 +
kernel/entry/common.c | 3 +
kernel/sys.c | 6 ++
kernel/uaccess_buffer.c | 118 ++++++++++++++++++++++++++++
12 files changed, 219 insertions(+), 1 deletion(-)
create mode 100644 include/linux/uaccess_buffer.h
create mode 100644 include/linux/uaccess_buffer_info.h
create mode 100644 kernel/uaccess_buffer.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 8df1c7102643..a427f6440cc9 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -31,6 +31,7 @@ config HOTPLUG_SMT
bool

config GENERIC_ENTRY
+ select UACCESS_BUFFER
bool

config KPROBES
@@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
config ARCH_HAS_PARANOID_L1D_FLUSH
bool

+config UACCESS_BUFFER
+ bool
+ help
+ Select if the architecture's syscall entry/exit code supports uaccess buffers.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5c7ae4c3954b..4764e5fd7ba9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -221,6 +221,7 @@ config ARM64
select THREAD_INFO_IN_TASK
select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
select TRACE_IRQFLAGS_SUPPORT
+ select UACCESS_BUFFER
help
ARM 64-bit (AArch64) Linux support.

diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 50a0f1a38e84..c3f8652d84a5 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
goto trace_exit;
}

+ uaccess_buffer_syscall_entry();
invoke_syscall(regs, scno, sc_nr, syscall_table);
+ uaccess_buffer_syscall_exit();

/*
* The tracing status may have changed under our feet, so we have to
diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
index 42faebbaa202..9144936edcb1 100644
--- a/include/linux/instrumented.h
+++ b/include/linux/instrumented.h
@@ -2,7 +2,7 @@

/*
* This header provides generic wrappers for memory access instrumentation that
- * the compiler cannot emit for: KASAN, KCSAN.
+ * the compiler cannot emit for: KASAN, KCSAN, access buffers.
*/
#ifndef _LINUX_INSTRUMENTED_H
#define _LINUX_INSTRUMENTED_H
@@ -11,6 +11,7 @@
#include <linux/kasan-checks.h>
#include <linux/kcsan-checks.h>
#include <linux/types.h>
+#include <linux/uaccess_buffer.h>

/**
* instrument_read - instrument regular read access
@@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
{
kasan_check_read(from, n);
kcsan_check_read(from, n);
+ uaccess_buffer_log_write(to, n);
}

/**
@@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
{
kasan_check_write(to, n);
kcsan_check_write(to, n);
+ uaccess_buffer_log_read(from, n);
}

#endif /* _LINUX_INSTRUMENTED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e12b524426b0..3fecb0487b97 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
#include <linux/rseq.h>
#include <linux/seqlock.h>
#include <linux/kcsan.h>
+#include <linux/uaccess_buffer_info.h>
#include <asm/kmap_size.h>

/* task_struct member predeclarations (sorted alphabetically): */
@@ -1487,6 +1488,8 @@ struct task_struct {
struct callback_head l1d_flush_kill;
#endif

+ struct uaccess_buffer_info uaccess_buffer;
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
new file mode 100644
index 000000000000..3b81f2a192a4
--- /dev/null
+++ b/include/linux/uaccess_buffer.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ACCESS_BUFFER_H
+#define _LINUX_ACCESS_BUFFER_H
+
+#include <asm-generic/errno-base.h>
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+void uaccess_buffer_log_read(const void __user *from, unsigned long n);
+void uaccess_buffer_log_write(void __user *to, unsigned long n);
+
+void uaccess_buffer_syscall_entry(void);
+void uaccess_buffer_syscall_exit(void);
+
+int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
+ unsigned long store_end_addr);
+
+#else
+
+static inline void uaccess_buffer_log_read(const void __user *from,
+ unsigned long n)
+{
+}
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+}
+
+static inline void uaccess_buffer_syscall_entry(void)
+{
+}
+static inline void uaccess_buffer_syscall_exit(void)
+{
+}
+
+static inline int uaccess_buffer_set_logging(unsigned long addr,
+ unsigned long size,
+ unsigned long store_end_addr)
+{
+ return -EINVAL;
+}
+#endif
+
+#endif /* _LINUX_ACCESS_BUFFER_H */
diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
new file mode 100644
index 000000000000..a6cefe6e73b5
--- /dev/null
+++ b/include/linux/uaccess_buffer_info.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ACCESS_BUFFER_INFO_H
+#define _LINUX_ACCESS_BUFFER_INFO_H
+
+#include <uapi/asm/signal.h>
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+struct uaccess_buffer_info {
+ unsigned long addr, size;
+ unsigned long store_end_addr;
+ sigset_t saved_sigmask;
+ u8 state;
+};
+
+#else
+
+struct uaccess_buffer_info {
+};
+
+#endif
+
+#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43bd7f713c39..d8baacaef800 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -269,4 +269,13 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
# define PR_SCHED_CORE_MAX 4

+/* Log uaccesses to a user-provided buffer */
+#define PR_LOG_UACCESS 63
+
+/* Format of the entries in the uaccess log. */
+struct uaccess_buffer_entry {
+ __u64 addr, size, flags;
+};
+# define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 4df609be42d0..75a5d95ce9c3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
obj-$(CONFIG_CFI_CLANG) += cfi.o
+obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..c7e7ff8cbab3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);

+ uaccess_buffer_syscall_entry();
+
return syscall;
}

@@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
local_irq_enable();
}

+ uaccess_buffer_syscall_exit();
rseq_syscall(regs);

/*
diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..df487600773c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/version.h>
#include <linux/ctype.h>
#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess_buffer.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_LOG_UACCESS:
+ if (arg5)
+ return -EINVAL;
+ error = uaccess_buffer_set_logging(arg2, arg3, arg4);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
new file mode 100644
index 000000000000..b9da89887c4b
--- /dev/null
+++ b/kernel/uaccess_buffer.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/compat.h>
+#include <linux/prctl.h>
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/uaccess.h>
+#include <linux/uaccess_buffer.h>
+#include <linux/uaccess_buffer_info.h>
+
+#ifdef CONFIG_UACCESS_BUFFER
+
+/*
+ * We use a separate implementation of copy_to_user() that avoids the call
+ * to instrument_copy_to_user() as this would otherwise lead to infinite
+ * recursion.
+ */
+static unsigned long
+uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
+{
+ if (!access_ok(to, n))
+ return n;
+ return raw_copy_to_user(to, from, n);
+}
+
+static void uaccess_buffer_log(unsigned long addr, unsigned long size,
+ unsigned long flags)
+{
+ struct uaccess_buffer_entry entry;
+
+ if (current->uaccess_buffer.size < sizeof(entry) ||
+ unlikely(uaccess_kernel()))
+ return;
+ entry.addr = addr;
+ entry.size = size;
+ entry.flags = flags;
+
+ /*
+ * If our uaccess fails, abort the log so that the end address writeback
+ * does not occur and userspace sees zero accesses.
+ */
+ if (uaccess_buffer_copy_to_user(
+ (void __user *)current->uaccess_buffer.addr, &entry,
+ sizeof(entry))) {
+ current->uaccess_buffer.state = 0;
+ current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
+ }
+
+ current->uaccess_buffer.addr += sizeof(entry);
+ current->uaccess_buffer.size -= sizeof(entry);
+}
+
+void uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)from, n, 0);
+}
+EXPORT_SYMBOL(uaccess_buffer_log_read);
+
+void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
+}
+EXPORT_SYMBOL(uaccess_buffer_log_write);
+
+int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
+ unsigned long store_end_addr)
+{
+ sigset_t temp_sigmask;
+
+ current->uaccess_buffer.addr = addr;
+ current->uaccess_buffer.size = size;
+ current->uaccess_buffer.store_end_addr = store_end_addr;
+
+ /*
+ * Allow 2 syscalls before resetting the state: the current one (i.e.
+ * prctl) and the next one, whose accesses we want to log.
+ */
+ current->uaccess_buffer.state = 2;
+
+ /*
+ * Temporarily mask signals so that an intervening asynchronous signal
+ * will not interfere with the logging.
+ */
+ current->uaccess_buffer.saved_sigmask = current->blocked;
+ sigfillset(&temp_sigmask);
+ sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+ __set_current_blocked(&temp_sigmask);
+
+ return 0;
+}
+
+void uaccess_buffer_syscall_entry(void)
+{
+ /*
+ * The current syscall may be e.g. rt_sigprocmask, and therefore we want
+ * to reset the mask before the syscall and not after, so that our
+ * temporary mask is unobservable.
+ */
+ if (current->uaccess_buffer.state == 1)
+ __set_current_blocked(&current->uaccess_buffer.saved_sigmask);
+}
+
+void uaccess_buffer_syscall_exit(void)
+{
+ if (current->uaccess_buffer.state > 0) {
+ --current->uaccess_buffer.state;
+ if (current->uaccess_buffer.state == 0) {
+ u64 addr64 = current->uaccess_buffer.addr;
+
+ uaccess_buffer_copy_to_user(
+ (void __user *)
+ current->uaccess_buffer.store_end_addr,
+ &addr64, sizeof(addr64));
+ current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
+ }
+ }
+}
+
+#endif
--
2.33.0.464.g1972c5931b-goog


2021-09-22 06:31:28

by Cyrill Gorcunov

Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Tue, Sep 21, 2021 at 11:18:09PM -0700, Peter Collingbourne wrote:
> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the third argument to the prctl(). In this way,
> the userspace program may enumerate the uaccesses logged to the access
> buffer to determine which accesses occurred.
...
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess_buffer_info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> + struct uaccess_buffer_info uaccess_buffer;
> +

Hi, Peter! I didn't read the patch carefully yet (will do once time permits),
but at a glance, shouldn't this member be under #ifdef CONFIG_UACCESS_BUFFER
or something? task_struct is already bloated too much :(
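
I mean something like this (untested, just to illustrate):

#ifdef CONFIG_UACCESS_BUFFER
	struct uaccess_buffer_info uaccess_buffer;
#endif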

> + case PR_LOG_UACCESS:
> + if (arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> + break;

Same here (unless I missed something obvious). If there is no support
for CONFIG_UACCESS_BUFFER, we should return an error, I guess.

2021-09-22 10:48:35

by Marco Elver

Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Tue, Sep 21, 2021 at 11:18PM -0700, Peter Collingbourne wrote:
> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the third argument to the prctl(). In this way,
> the userspace program may enumerate the uaccesses logged to the access
> buffer to determine which accesses occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> making use of the ARM Memory Tagging Extension (MTE) need to monitor
> all memory accesses in a program so that they can detect memory
> errors. For accesses made purely in userspace, this is achieved
> via compiler instrumentation, or for MTE, via direct hardware
> support. However, accesses made by the kernel on behalf of the
> user program via syscalls (i.e. uaccesses) are invisible to these
> tools. With MTE there is some level of error detection possible in
> the kernel (in synchronous mode, bad accesses generally result in
> returning -EFAULT from the syscall), but by the time we get back to
> userspace we've lost the information about the address and size of the
> failed access, which makes it harder to produce a useful error report.
>
> With the current versions of the sanitizers, we address this by
> interposing the libc syscall stubs with a wrapper that checks the
> memory based on what we believe the uaccesses will be. However, this
> creates a maintenance burden: each syscall must be annotated with
> its uaccesses in order to be recognized by the sanitizer, and these
> annotations must be continuously updated as the kernel changes. This
> is especially burdensome for syscalls such as ioctl(2) which have a
> large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses. This can be achieved in
> conjunction with the userspace memory safety tools mentioned in (1).
> Even a sanitizer whose syscall wrappers have complete knowledge of
> the kernel's intended API may vary from the kernel's actual uaccesses
> due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> uaccesses may produce more accurate error reports that reveal such
> bugs.
>
> An example of such a bug, which was found by an earlier version of this
> patch together with a prototype client of the API in HWASan, was fixed
> by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> a struct ifreq for socket ioctls"). Although this bug turned out to
> relatively harmless, it was a bug nonetheless and it's always possible
> that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> guide a kernel fuzzing tool such as syzkaller (so that it knows which
> parts of user memory to fuzz), as an alternative to providing the tool
> with a list of syscalls and their uaccesses (which again thanks to
> (2) may not be accurate).
>
> All signals except SIGKILL and SIGSTOP are masked for the interval
> between the prctl() and the next syscall in order to prevent handlers
> for intervening asynchronous signals from issuing syscalls that may
> cause uaccesses from the wrong syscall to be logged.
>
> The format of a uaccess buffer entry is defined as follows:
>
> struct access_buffer_entry {

uaccess_buffer_entry (missing 'u')?

Here and below.

> u64 addr, size, flags;
> };
>
> The meaning of addr and size should be obvious. On arm64, tag bits
> are preserved in the addr field. The current meaning of the flags
> field is that bit 0 indicates whether the access was a read (clear)
> or a write (set). The meaning of all other flag bits is reserved.
> All fields are of type u64 in order to avoid compat concerns.
>
> Here is an example of a code snippet that will enumerate the accesses
> performed by a uname(2) syscall:
>
> struct access_buffer_entry entries[64];
> uint64_t entries_end64 = (uint64_t)&entries;
> struct utsname un;
> prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
> uname(&un);
> struct access_buffer_entry *entries_end = (struct uaccess_buffer_entry *)entries_end64;
> for (struct acccess_buffer_entry *i = entries; i != entries_end; ++i) {
> printf("%s at 0x%lu size 0x%lx\n",
> entries[i].flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> (unsigned long)entries[i].addr, (unsigned long)entries[i].size);
> }

I think it would be good to persist information like this in dedicated
Documentation. If I had to guess, Documentation/admin-guide/uaccess-buffer.rst
could be the appropriate location.

It also allows writing some background, and then shortening the commit
message / cover letter (details can be placed in Documentation,
including the example).

> Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> course, not all of the accesses may fit in the buffer, but aside from
> that, there are syscalls such as async I/O that are currently missed due
> to the uaccesses occurring on a different kernel task (this is analogous
> to how async I/O accesses are exempt from userspace MTE checks). We
> view this as acceptable, as the access buffer can be sized sufficiently
> large to handle syscalls that make a reasonable number of uaccesses,
> and syscalls that use a different task for uaccesses are rare. In
> many cases, the sanitizer does not need to see every memory access,
> so it's fine if we miss the odd uaccess here and there. Even for those
> sanitizers that do need to see every memory access it still represents
> a much lower maintenance burden if we just have to handle the unusual
> syscalls specially.
>
> Because we don't have a common kernel entry/exit code path that is used
> on all architectures, uaccess logging is only implemented for arm64 and
> architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
>
> One downside of this ABI is that it involves making two syscalls per
> "real" syscall, which can harm performance. One possible way to avoid
> this may be to have the prctl() register the uaccess buffer location
> once at thread startup and use the same location for all syscalls in
> the thread. However, because the program may be making syscalls very
> early, before TLS is available, this may not always work. Furthermore,
> because of the same asynchronous signal concerns that prompted temporarily
> masking signals after the prctl(), the syscall stub would need to be made
> reentrant, and it is unclear whether this is feasible without manually
> masking asynchronous signals using rt_sigprocmask(2) while reading the
> uaccess buffer, defeating the purpose of avoiding the extra syscall.
>
> One idea that we considered involved using the stack pointer address as
> a unique identifier for the syscall, but this currently would need to be
> arch-specific as we currently do not appear to have an arch-generic way
> of retrieving the stack pointer; the userspace side would also need some
> arch-specific code for this to work. It's also possible that a longjmp()
> past the signal handler would make the stack pointer address not unique
> enough for this purpose.
>
> On the other hand, by allocating the uaccess log on the stack and blocking
> asynchronous signals for the interval between the prctl() and the "real"
> syscall, we can avoid any reentrancy and TLS concerns.
>
> Another way to avoid the overhead may be to use an architecture-specific
> calling convention to pass the address of the uaccess buffer to the kernel
> at syscall time in registers currently unused for syscall arguments. For
> example, one arm64-specific scheme that was used in a previous iteration
> of the patch was:
>
> - Bit 0 of the immediate argument to the SVC instruction must be set.
> - Register X6 contains the address of the access buffer.
> - Register X7 contains the size of the access buffer in bytes.
> - On return, X6 will contain the address of the memory location following
> any access buffer entries written by the kernel.
>
> However, this would need to be implemented separately for each
> architecture (and some of them don't have enough registers anyway),
> whereas the prctl() is (at least in theory) architecture-generic.
>
> We also evaluated implementing this on top of the existing tracepoint
> facility, but concluded that it is not suitable for this purpose:
>
> - Tracepoints have a per-task granularity at best, whereas we really want
> to trace per-syscall. This is so that we can exclude syscalls that
> should not be traced, such as syscalls that make up part of the
> sanitizer implementation (to avoid infinite recursion when e.g. printing
> an error report).
>
> - Tracing would need to be synchronous in order to produce useful
> stack traces. For example this could be achieved using the new SIGTRAP
> on perf events mechanism. However, this would require logging each
> access to the stack (in the form of a sigcontext) and this is more
> likely to overflow the stack due to being much larger than a uaccess
> buffer entry as well as being unbounded, in contrast to the bounded
> buffer size passed to prctl(). An approach based on signal handlers is
> also likely to fall foul of the asynchronous signal issues mentioned
> previously, together with needing sigreturn to be handled specially
> (because it copies a sigcontext from userspace) otherwise we could
> never return from the signal handler. Furthermore, arguments to the
> trace events are not available to SIGTRAP. (This on its own wouldn't
> be insurmountable though -- we could add the arguments as fields
> to siginfo.)
>
> - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
> -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
> I don't think it's useable because it's per-CPU and not per-task.
>
> - Tracepoints can be used by eBPF programs, but eBPF programs may
> only be loaded as root, among other potential headaches.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> arch/Kconfig | 6 ++
> arch/arm64/Kconfig | 1 +
> arch/arm64/kernel/syscall.c | 2 +

The arch-enablement changes should be their own patches (one for arm64,
and another for the generic entry code). Moving bits of the commit
message into a cover letter and Documentation would probably further
help reviewability.

So I think there should be at least 4 patches (core code, arm64, generic
entry, Documentation).

> include/linux/instrumented.h | 5 +-
> include/linux/sched.h | 3 +
> include/linux/uaccess_buffer.h | 43 ++++++++++
> include/linux/uaccess_buffer_info.h | 23 ++++++
> include/uapi/linux/prctl.h | 9 +++
> kernel/Makefile | 1 +
> kernel/entry/common.c | 3 +
> kernel/sys.c | 6 ++
> kernel/uaccess_buffer.c | 118 ++++++++++++++++++++++++++++
> 12 files changed, 219 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/uaccess_buffer.h
> create mode 100644 include/linux/uaccess_buffer_info.h
> create mode 100644 kernel/uaccess_buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8df1c7102643..a427f6440cc9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
> bool
>
> config GENERIC_ENTRY
> + select UACCESS_BUFFER
> bool
>
> config KPROBES
> @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
> config ARCH_HAS_PARANOID_L1D_FLUSH
> bool
>
> +config UACCESS_BUFFER

This is not a user-selectable config option, so I think the name should
be HAVE_ARCH_UACCESS_BUFFER.

> + bool
> + help
> + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..4764e5fd7ba9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,6 +221,7 @@ config ARM64
> select THREAD_INFO_IN_TASK
> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> select TRACE_IRQFLAGS_SUPPORT
> + select UACCESS_BUFFER

Once it's called HAVE_ARCH_UACCESS_BUFFER, it should also be reordered
to be in alphabetical order with other selects here to reduce risk of
merge conflicts (when things are added at the end).

> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..c3f8652d84a5 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
> goto trace_exit;
> }
>
> + uaccess_buffer_syscall_entry();
> invoke_syscall(regs, scno, sc_nr, syscall_table);
> + uaccess_buffer_syscall_exit();

I think this is missing an #include <linux/uaccess_buffer.h>, although
it's currently implicit via instrumented.h.

>
> /*
> * The tracing status may have changed under our feet, so we have to
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..9144936edcb1 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_H
> #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess_buffer.h>

instrumented.h is included from almost everywhere. As is done for
KASAN and KCSAN, it helps to minimize the headers required for the
checks.

In this case, for consistency with the others, I recommend having a
header <linux/uaccess-buffer-checks.h> with just the 2 functions
required here. And then put the rest (including the struct) into
<linux/uaccess-buffer.h>.
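
I.e. something roughly like this (sketch):

/* include/linux/uaccess-buffer-checks.h */
#ifdef CONFIG_UACCESS_BUFFER
void uaccess_buffer_log_read(const void __user *from, unsigned long n);
void uaccess_buffer_log_write(void __user *to, unsigned long n);
#else
static inline void uaccess_buffer_log_read(const void __user *from,
					    unsigned long n) {}
static inline void uaccess_buffer_log_write(void __user *to,
					    unsigned long n) {}
#endif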

> /**
> * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess_buffer_info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> + struct uaccess_buffer_info uaccess_buffer;
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> new file mode 100644
> index 000000000000..3b81f2a192a4
> --- /dev/null
> +++ b/include/linux/uaccess_buffer.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_H
> +#define _LINUX_ACCESS_BUFFER_H
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +

As suggested above, the two functions above can go into a single
header uaccess-buffer-checks.h (which also doesn't need to include
errno-base.h).

The rest below (merged with the struct definition) can then go into the
second header.

> +void uaccess_buffer_syscall_entry(void);
> +void uaccess_buffer_syscall_exit(void);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_logging(unsigned long addr,
> + unsigned long size,
> + unsigned long store_end_addr)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_H */
> diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> new file mode 100644
> index 000000000000..a6cefe6e73b5
> --- /dev/null
> +++ b/include/linux/uaccess_buffer_info.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> +#define _LINUX_ACCESS_BUFFER_INFO_H
> +
> +#include <uapi/asm/signal.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {

Please add comments for each variable if possible.

> + unsigned long addr, size;
> + unsigned long store_end_addr;
> + sigset_t saved_sigmask;
> + u8 state;

I think 'state' is less clear than it could be. Looking at the rest of
the code, 'state' could just be 'enabled' (or 'enable_count')?
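
E.g. with comments and the rename applied, something like (sketch):

struct uaccess_buffer_info {
	/* Address of the next entry to be written, and bytes remaining. */
	unsigned long addr, size;
	/* Userspace address at which to store the final addr on syscall exit. */
	unsigned long store_end_addr;
	/* Signal mask to be restored at the next syscall entry. */
	sigset_t saved_sigmask;
	/* Remaining syscall exits before logging is disabled. */
	u8 enable_count;
};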

> +};
> +
> +#else
> +
> +struct uaccess_buffer_info {
> +};

I'm not sure what's cleaner: a) in the #else case defining an empty
struct, or b) in the #else case not defining the struct and guarding the
use in sched.h with an #ifdef.

(b) is currently the standard way of doing it, and unless it improves
readability elsewhere (e.g. need to pass current->uaccess_buffer to a
function outside kernel/uaccess_buffer.c), I'd probably just do (b).

> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43bd7f713c39..d8baacaef800 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,13 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
> # define PR_SCHED_CORE_MAX 4
>
> +/* Log uaccesses to a user-provided buffer */
> +#define PR_LOG_UACCESS 63
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + __u64 addr, size, flags;

Comments for the above variables (in which case putting them on their
own line would be preferred).
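
E.g. (sketch):

struct uaccess_buffer_entry {
	__u64 addr;	/* address of the access (tag bits preserved on arm64) */
	__u64 size;	/* size of the access in bytes */
	__u64 flags;	/* UACCESS_BUFFER_FLAG_* */
};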

> +};
> +# define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */

The struct and UACCESS_BUFFER_FLAG_WRITE does not belong in prctl.h, and
should probably go in another appropriate header in uapi/linux.

> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..75a5d95ce9c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..c7e7ff8cbab3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
> if (work & SYSCALL_WORK_ENTER)
> syscall = syscall_trace_enter(regs, syscall, work);
>
> + uaccess_buffer_syscall_entry();

This is also missing an #include.

> +
> return syscall;
> }
>
> @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> local_irq_enable();
> }
>
> + uaccess_buffer_syscall_exit();
> rseq_syscall(regs);
>
> /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..df487600773c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess_buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_LOG_UACCESS:
> + if (arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> new file mode 100644
> index 000000000000..b9da89887c4b
> --- /dev/null
> +++ b/kernel/uaccess_buffer.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0

This is a new file, and probably needs a Copyright header. See
e.g. mm/kfence/core.c for template.

> +#include <linux/compat.h>
> +#include <linux/prctl.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess_buffer.h>
> +#include <linux/uaccess_buffer_info.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +

The #ifdef is redundant, given the Makefile already guards compilation
of this file.

> +/*
> + * We use a separate implementation of copy_to_user() that avoids the call
> + * to instrument_copy_to_user() as this would otherwise lead to infinite
> + * recursion.
> + */
> +static unsigned long
> +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> + if (!access_ok(to, n))
> + return n;
> + return raw_copy_to_user(to, from, n);
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_entry entry;
> +
> + if (current->uaccess_buffer.size < sizeof(entry) ||
> + unlikely(uaccess_kernel()))
> + return;
> + entry.addr = addr;
> + entry.size = size;
> + entry.flags = flags;
> +
> + /*
> + * If our uaccess fails, abort the log so that the end address writeback
> + * does not occur and userspace sees zero accesses.
> + */
> + if (uaccess_buffer_copy_to_user(
> + (void __user *)current->uaccess_buffer.addr, &entry,
> + sizeof(entry))) {
> + current->uaccess_buffer.state = 0;
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + }
> +
> + current->uaccess_buffer.addr += sizeof(entry);
> + current->uaccess_buffer.size -= sizeof(entry);
> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr)
> +{
> + sigset_t temp_sigmask;
> +
> + current->uaccess_buffer.addr = addr;
> + current->uaccess_buffer.size = size;
> + current->uaccess_buffer.store_end_addr = store_end_addr;
> +
> + /*
> + * Allow 2 syscalls before resetting the state: the current one (i.e.
> + * prctl) and the next one, whose accesses we want to log.
> + */
> + current->uaccess_buffer.state = 2;
> +
> + /*
> + * Temporarily mask signals so that an intervening asynchronous signal
> + * will not interfere with the logging.
> + */
> + current->uaccess_buffer.saved_sigmask = current->blocked;
> + sigfillset(&temp_sigmask);
> + sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + __set_current_blocked(&temp_sigmask);
> +
> + return 0;
> +}
> +
> +void uaccess_buffer_syscall_entry(void)
> +{
> + /*
> + * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> + * to reset the mask before the syscall and not after, so that our
> + * temporary mask is unobservable.
> + */
> + if (current->uaccess_buffer.state == 1)
> + __set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> +}
> +
> +void uaccess_buffer_syscall_exit(void)
> +{
> + if (current->uaccess_buffer.state > 0) {

This could just be

+ if (!current->uaccess_buffer.state)
+ return;

.. which avoids the deep nesting below and helps with formatting.

This is also the point where the name 'state' just seems too generic, and
'enabled' (or 'enable_count') would make it clearer what is happening.
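
I.e. the whole function could become something like (sketch, with the rename):

void uaccess_buffer_syscall_exit(void)
{
	u64 addr64;

	if (!current->uaccess_buffer.enable_count ||
	    --current->uaccess_buffer.enable_count)
		return;

	addr64 = current->uaccess_buffer.addr;
	uaccess_buffer_copy_to_user(
		(void __user *)current->uaccess_buffer.store_end_addr,
		&addr64, sizeof(addr64));
	current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
}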

> + --current->uaccess_buffer.state;
> + if (current->uaccess_buffer.state == 0) {
> + u64 addr64 = current->uaccess_buffer.addr;
> +
> + uaccess_buffer_copy_to_user(
> + (void __user *)
> + current->uaccess_buffer.store_end_addr,
> + &addr64, sizeof(addr64));
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + }
> + }
> +}
> +
> +#endif

Thanks,
-- Marco

2021-09-22 13:49:51

by Dmitry Vyukov

Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, 22 Sept 2021 at 08:18, Peter Collingbourne <[email protected]> wrote:
>
> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the third argument to the prctl(). In this way,
> the userspace program may enumerate the uaccesses logged to the access
> buffer to determine which accesses occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> making use of the ARM Memory Tagging Extension (MTE) need to monitor
> all memory accesses in a program so that they can detect memory
> errors. For accesses made purely in userspace, this is achieved
> via compiler instrumentation, or for MTE, via direct hardware
> support. However, accesses made by the kernel on behalf of the
> user program via syscalls (i.e. uaccesses) are invisible to these
> tools. With MTE there is some level of error detection possible in
> the kernel (in synchronous mode, bad accesses generally result in
> returning -EFAULT from the syscall), but by the time we get back to
> userspace we've lost the information about the address and size of the
> failed access, which makes it harder to produce a useful error report.
>
> With the current versions of the sanitizers, we address this by
> interposing the libc syscall stubs with a wrapper that checks the
> memory based on what we believe the uaccesses will be. However, this
> creates a maintenance burden: each syscall must be annotated with
> its uaccesses in order to be recognized by the sanitizer, and these
> annotations must be continuously updated as the kernel changes. This
> is especially burdensome for syscalls such as ioctl(2) which have a
> large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses. This can be achieved in
> conjunction with the userspace memory safety tools mentioned in (1).
> Even a sanitizer whose syscall wrappers have complete knowledge of
> the kernel's intended API may vary from the kernel's actual uaccesses
> due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> uaccesses may produce more accurate error reports that reveal such
> bugs.
>
> An example of such a bug, which was found by an earlier version of this
> patch together with a prototype client of the API in HWASan, was fixed
> by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> a struct ifreq for socket ioctls"). Although this bug turned out to
> relatively harmless, it was a bug nonetheless and it's always possible
> that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> guide a kernel fuzzing tool such as syzkaller (so that it knows which
> parts of user memory to fuzz), as an alternative to providing the tool
> with a list of syscalls and their uaccesses (which again thanks to
> (2) may not be accurate).
>
> All signals except SIGKILL and SIGSTOP are masked for the interval
> between the prctl() and the next syscall in order to prevent handlers
> for intervening asynchronous signals from issuing syscalls that may
> cause uaccesses from the wrong syscall to be logged.
>
> The format of a uaccess buffer entry is defined as follows:
>
> struct access_buffer_entry {
> u64 addr, size, flags;
> };
>
> The meaning of addr and size should be obvious. On arm64, tag bits
> are preserved in the addr field. The current meaning of the flags
> field is that bit 0 indicates whether the access was a read (clear)
> or a write (set). The meaning of all other flag bits is reserved.
> All fields are of type u64 in order to avoid compat concerns.
>
> Here is an example of a code snippet that will enumerate the accesses
> performed by a uname(2) syscall:
>
> struct access_buffer_entry entries[64];
> uint64_t entries_end64 = (uint64_t)&entries;
> struct utsname un;
> prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
> uname(&un);
> struct access_buffer_entry *entries_end = (struct uaccess_buffer_entry *)entries_end64;
> for (struct acccess_buffer_entry *i = entries; i != entries_end; ++i) {
> printf("%s at 0x%lu size 0x%lx\n",
> entries[i].flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> (unsigned long)entries[i].addr, (unsigned long)entries[i].size);
> }
>
> Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> course, not all of the accesses may fit in the buffer, but aside from
> that, there are syscalls such as async I/O that are currently missed due
> to the uaccesses occurring on a different kernel task (this is analogous
> to how async I/O accesses are exempt from userspace MTE checks). We
> view this as acceptable, as the access buffer can be sized sufficiently
> large to handle syscalls that make a reasonable number of uaccesses,
> and syscalls that use a different task for uaccesses are rare. In
> many cases, the sanitizer does not need to see every memory access,
> so it's fine if we miss the odd uaccess here and there. Even for those
> sanitizers that do need to see every memory access it still represents
> a much lower maintenance burden if we just have to handle the unusual
> syscalls specially.
>
> Because we don't have a common kernel entry/exit code path that is used
> on all architectures, uaccess logging is only implemented for arm64 and
> architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
>
> One downside of this ABI is that it involves making two syscalls per
> "real" syscall, which can harm performance. One possible way to avoid
> this may be to have the prctl() register the uaccess buffer location
> once at thread startup and use the same location for all syscalls in
> the thread. However, because the program may be making syscalls very
> early, before TLS is available, this may not always work. Furthermore,
> because of the same asynchronous signal concerns that prompted temporarily
> masking signals after the prctl(), the syscall stub would need to be made
> reentrant, and it is unclear whether this is feasible without manually
> masking asynchronous signals using rt_sigprocmask(2) while reading the
> uaccess buffer, defeating the purpose of avoiding the extra syscall.
>
> One idea that we considered involved using the stack pointer address as
> a unique identifier for the syscall, but this currently would need to be
> arch-specific as we currently do not appear to have an arch-generic way
> of retrieving the stack pointer; the userspace side would also need some
> arch-specific code for this to work. It's also possible that a longjmp()
> past the signal handler would make the stack pointer address not unique
> enough for this purpose.
>
> On the other hand, by allocating the uaccess log on the stack and blocking
> asynchronous signals for the interval between the prctl() and the "real"
> syscall, we can avoid any reentrancy and TLS concerns.
>
> Another way to avoid the overhead may be to use an architecture-specific
> calling convention to pass the address of the uaccess buffer to the kernel
> at syscall time in registers currently unused for syscall arguments. For
> example, one arm64-specific scheme that was used in a previous iteration
> of the patch was:
>
> - Bit 0 of the immediate argument to the SVC instruction must be set.
> - Register X6 contains the address of the access buffer.
> - Register X7 contains the size of the access buffer in bytes.
> - On return, X6 will contain the address of the memory location following
> any access buffer entries written by the kernel.
>
> However, this would need to be implemented separately for each
> architecture (and some of them don't have enough registers anyway),
> whereas the prctl() is (at least in theory) architecture-generic.
>
> We also evaluated implementing this on top of the existing tracepoint
> facility, but concluded that it is not suitable for this purpose:
>
> - Tracepoints have a per-task granularity at best, whereas we really want
> to trace per-syscall. This is so that we can exclude syscalls that
> should not be traced, such as syscalls that make up part of the
> sanitizer implementation (to avoid infinite recursion when e.g. printing
> an error report).
>
> - Tracing would need to be synchronous in order to produce useful
> stack traces. For example this could be achieved using the new SIGTRAP
> on perf events mechanism. However, this would require logging each
> access to the stack (in the form of a sigcontext) and this is more
> likely to overflow the stack due to being much larger than a uaccess
> buffer entry as well as being unbounded, in contrast to the bounded
> buffer size passed to prctl(). An approach based on signal handlers is
> also likely to fall foul of the asynchronous signal issues mentioned
> previously, together with needing sigreturn to be handled specially
> (because it copies a sigcontext from userspace) otherwise we could
> never return from the signal handler. Furthermore, arguments to the
> trace events are not available to SIGTRAP. (This on its own wouldn't
> be insurmountable though -- we could add the arguments as fields
> to siginfo.)
>
> - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
> -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
> I don't think it's useable because it's per-CPU and not per-task.
>
> - Tracepoints can be used by eBPF programs, but eBPF programs may
> only be loaded as root, among other potential headaches.

Hi Peter,

Is this intended to be used with raw syscalls only? I think for
sanitizers we want to use this with libc syscall wrappers and more
complex libc functions as well. The signal blocking assumes that there
will be exactly one real syscall after the prctl(). I wonder whether
this can create problems with libc functions, in particular if a libc
function makes more than one syscall, no syscalls, or a variable
number of syscalls.
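
For concreteness, a rough sketch of the one-prctl()-per-raw-syscall
wrapper pattern that this interface seems to assume. PR_LOG_UACCESS and
struct uaccess_buffer_entry are taken from the patch's <linux/prctl.h>
additions; sanitizer_check_entries() is a hypothetical sanitizer helper:

  #include <stdint.h>
  #include <linux/prctl.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  void sanitizer_check_entries(const struct uaccess_buffer_entry *begin,
                               const struct uaccess_buffer_entry *end);

  /* One prctl() arms logging for exactly the next syscall, so this only
   * composes cleanly with code that issues a single raw syscall. */
  long sanitizer_raw_syscall(long nr, long a0, long a1, long a2,
                             long a3, long a4, long a5)
  {
          struct uaccess_buffer_entry log[64];
          uint64_t log_end = (uint64_t)(uintptr_t)log;
          long ret;

          prctl(PR_LOG_UACCESS, log, sizeof(log), &log_end, 0);
          ret = syscall(nr, a0, a1, a2, a3, a4, a5);  /* the one logged call */
          sanitizer_check_entries(log,
                  (struct uaccess_buffer_entry *)(uintptr_t)log_end);
          return ret;
  }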

> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> arch/Kconfig | 6 ++
> arch/arm64/Kconfig | 1 +
> arch/arm64/kernel/syscall.c | 2 +
> include/linux/instrumented.h | 5 +-
> include/linux/sched.h | 3 +
> include/linux/uaccess_buffer.h | 43 ++++++++++
> include/linux/uaccess_buffer_info.h | 23 ++++++
> include/uapi/linux/prctl.h | 9 +++
> kernel/Makefile | 1 +
> kernel/entry/common.c | 3 +
> kernel/sys.c | 6 ++
> kernel/uaccess_buffer.c | 118 ++++++++++++++++++++++++++++
> 12 files changed, 219 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/uaccess_buffer.h
> create mode 100644 include/linux/uaccess_buffer_info.h
> create mode 100644 kernel/uaccess_buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8df1c7102643..a427f6440cc9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
> bool
>
> config GENERIC_ENTRY
> + select UACCESS_BUFFER
> bool
>
> config KPROBES
> @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
> config ARCH_HAS_PARANOID_L1D_FLUSH
> bool
>
> +config UACCESS_BUFFER
> + bool
> + help
> + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..4764e5fd7ba9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,6 +221,7 @@ config ARM64
> select THREAD_INFO_IN_TASK
> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> select TRACE_IRQFLAGS_SUPPORT
> + select UACCESS_BUFFER
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..c3f8652d84a5 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
> goto trace_exit;
> }
>
> + uaccess_buffer_syscall_entry();
> invoke_syscall(regs, scno, sc_nr, syscall_table);
> + uaccess_buffer_syscall_exit();
>
> /*
> * The tracing status may have changed under our feet, so we have to
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..9144936edcb1 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_H
> #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess_buffer.h>
>
> /**
> * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess_buffer_info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> + struct uaccess_buffer_info uaccess_buffer;
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> new file mode 100644
> index 000000000000..3b81f2a192a4
> --- /dev/null
> +++ b/include/linux/uaccess_buffer.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_H
> +#define _LINUX_ACCESS_BUFFER_H
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +
> +void uaccess_buffer_syscall_entry(void);
> +void uaccess_buffer_syscall_exit(void);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_logging(unsigned long addr,
> + unsigned long size,
> + unsigned long store_end_addr)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_H */
> diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> new file mode 100644
> index 000000000000..a6cefe6e73b5
> --- /dev/null
> +++ b/include/linux/uaccess_buffer_info.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> +#define _LINUX_ACCESS_BUFFER_INFO_H
> +
> +#include <uapi/asm/signal.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> + unsigned long addr, size;
> + unsigned long store_end_addr;
> + sigset_t saved_sigmask;
> + u8 state;
> +};
> +
> +#else
> +
> +struct uaccess_buffer_info {
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43bd7f713c39..d8baacaef800 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,13 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
> # define PR_SCHED_CORE_MAX 4
>
> +/* Log uaccesses to a user-provided buffer */
> +#define PR_LOG_UACCESS 63
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + __u64 addr, size, flags;
> +};
> +# define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..75a5d95ce9c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..c7e7ff8cbab3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
> if (work & SYSCALL_WORK_ENTER)
> syscall = syscall_trace_enter(regs, syscall, work);
>
> + uaccess_buffer_syscall_entry();
> +
> return syscall;
> }
>
> @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> local_irq_enable();
> }
>
> + uaccess_buffer_syscall_exit();
> rseq_syscall(regs);
>
> /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..df487600773c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess_buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_LOG_UACCESS:
> + if (arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> new file mode 100644
> index 000000000000..b9da89887c4b
> --- /dev/null
> +++ b/kernel/uaccess_buffer.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/compat.h>
> +#include <linux/prctl.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess_buffer.h>
> +#include <linux/uaccess_buffer_info.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * We use a separate implementation of copy_to_user() that avoids the call
> + * to instrument_copy_to_user() as this would otherwise lead to infinite
> + * recursion.
> + */
> +static unsigned long
> +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> + if (!access_ok(to, n))
> + return n;
> + return raw_copy_to_user(to, from, n);
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_entry entry;
> +
> + if (current->uaccess_buffer.size < sizeof(entry) ||
> + unlikely(uaccess_kernel()))
> + return;
> + entry.addr = addr;
> + entry.size = size;
> + entry.flags = flags;
> +
> + /*
> + * If our uaccess fails, abort the log so that the end address writeback
> + * does not occur and userspace sees zero accesses.
> + */
> + if (uaccess_buffer_copy_to_user(
> + (void __user *)current->uaccess_buffer.addr, &entry,
> + sizeof(entry))) {
> + current->uaccess_buffer.state = 0;
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + }
> +
> + current->uaccess_buffer.addr += sizeof(entry);
> + current->uaccess_buffer.size -= sizeof(entry);
> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr)
> +{
> + sigset_t temp_sigmask;
> +
> + current->uaccess_buffer.addr = addr;
> + current->uaccess_buffer.size = size;
> + current->uaccess_buffer.store_end_addr = store_end_addr;
> +
> + /*
> + * Allow 2 syscalls before resetting the state: the current one (i.e.
> + * prctl) and the next one, whose accesses we want to log.
> + */
> + current->uaccess_buffer.state = 2;
> +
> + /*
> + * Temporarily mask signals so that an intervening asynchronous signal
> + * will not interfere with the logging.
> + */
> + current->uaccess_buffer.saved_sigmask = current->blocked;
> + sigfillset(&temp_sigmask);
> + sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + __set_current_blocked(&temp_sigmask);
> +
> + return 0;
> +}
> +
> +void uaccess_buffer_syscall_entry(void)
> +{
> + /*
> + * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> + * to reset the mask before the syscall and not after, so that our
> + * temporary mask is unobservable.
> + */
> + if (current->uaccess_buffer.state == 1)
> + __set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> +}
> +
> +void uaccess_buffer_syscall_exit(void)
> +{
> + if (current->uaccess_buffer.state > 0) {
> + --current->uaccess_buffer.state;
> + if (current->uaccess_buffer.state == 0) {
> + u64 addr64 = current->uaccess_buffer.addr;
> +
> + uaccess_buffer_copy_to_user(
> + (void __user *)
> + current->uaccess_buffer.store_end_addr,
> + &addr64, sizeof(addr64));
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + }
> + }
> +}
> +
> +#endif
> --
> 2.33.0.464.g1972c5931b-goog
>

2021-09-22 14:25:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

Peter Collingbourne <[email protected]> writes:

> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the third argument to the prctl(). In this way,
> the userspace program may enumerate the uaccesses logged to the access
> buffer to determine which accesses occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> making use of the ARM Memory Tagging Extension (MTE) need to monitor
> all memory accesses in a program so that they can detect memory
> errors. For accesses made purely in userspace, this is achieved
> via compiler instrumentation, or for MTE, via direct hardware
> support. However, accesses made by the kernel on behalf of the
> user program via syscalls (i.e. uaccesses) are invisible to these
> tools. With MTE there is some level of error detection possible in
> the kernel (in synchronous mode, bad accesses generally result in
> returning -EFAULT from the syscall), but by the time we get back to
> userspace we've lost the information about the address and size of the
> failed access, which makes it harder to produce a useful error report.
>
> With the current versions of the sanitizers, we address this by
> interposing the libc syscall stubs with a wrapper that checks the
> memory based on what we believe the uaccesses will be. However, this
> creates a maintenance burden: each syscall must be annotated with
> its uaccesses in order to be recognized by the sanitizer, and these
> annotations must be continuously updated as the kernel changes. This
> is especially burdensome for syscalls such as ioctl(2) which have a
> large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses. This can be achieved in
> conjunction with the userspace memory safety tools mentioned in (1).
> Even a sanitizer whose syscall wrappers have complete knowledge of
> the kernel's intended API may vary from the kernel's actual uaccesses
> due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> uaccesses may produce more accurate error reports that reveal such
> bugs.
>
> An example of such a bug, which was found by an earlier version of this
> patch together with a prototype client of the API in HWASan, was fixed
> by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> a struct ifreq for socket ioctls"). Although this bug turned out to
> relatively harmless, it was a bug nonetheless and it's always possible
> that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> guide a kernel fuzzing tool such as syzkaller (so that it knows which
> parts of user memory to fuzz), as an alternative to providing the tool
> with a list of syscalls and their uaccesses (which again thanks to
> (2) may not be accurate).

How is logging the kernel's activity like this not a significant
information leak? How is this safe for unprivileged users?

Eric



2021-09-22 15:33:13

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> Peter Collingbourne <[email protected]> writes:
>
> > This patch introduces a kernel feature known as uaccess logging.
> > With uaccess logging, the userspace program passes the address and size
> > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > is a request for the kernel to log any uaccesses made during the next
> > syscall to the uaccess buffer. When the next syscall returns, the address
> > one past the end of the logged uaccess buffer entries is written to the
> > location specified by the third argument to the prctl(). In this way,
> > the userspace program may enumerate the uaccesses logged to the access
> > buffer to determine which accesses occurred.
> > [...]
> > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > parts of user memory to fuzz), as an alternative to providing the tool
> > with a list of syscalls and their uaccesses (which again thanks to
> > (2) may not be accurate).
>
> How is logging the kernel's activity like this not a significant
> information leak? How is this safe for unprivileged users?

This does result in userspace being able to "watch" the kernel progress
through a syscall. I'd say it's less dangerous than userfaultfd, but
still worrisome. (And userfaultfd is normally disabled[1] for unprivileged
users trying to interpose the kernel accessing user memory.)

Regardless, this is a pretty useful tool for this kind of fuzzing.
Perhaps the timing exposure could be mitigated by having the kernel
collect the record in a separate kernel-allocated buffer and flush the
results to userspace at syscall exit? (This would solve the
copy_to_user() recursion issue too.)
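
A minimal sketch of that alternative, assuming a small fixed-size
per-task buffer in kernel memory that is flushed with a single
copy_to_user() at syscall exit. The names here are hypothetical and not
part of the posted patch; struct uaccess_buffer_entry is the one the
patch adds to <uapi/linux/prctl.h>:

  #include <linux/minmax.h>
  #include <linux/types.h>
  #include <linux/uaccess.h>

  #define UACCESS_KLOG_ENTRIES 64

  struct uaccess_klog {
          struct uaccess_buffer_entry ent[UACCESS_KLOG_ENTRIES];
          unsigned int count;
          void __user *uaddr;        /* buffer registered by userspace */
          unsigned long usize;       /* its size in bytes */
  };

  /* Called from the instrumentation hooks: no uaccess happens here. */
  static void uaccess_klog_append(struct uaccess_klog *log, u64 addr,
                                  u64 size, u64 flags)
  {
          if (log->count < UACCESS_KLOG_ENTRIES) {
                  log->ent[log->count].addr = addr;
                  log->ent[log->count].size = size;
                  log->ent[log->count].flags = flags;
                  log->count++;
          }
  }

  /* Called once from syscall exit; the single copy_to_user() cannot
   * recurse into the logger because logging has already finished. */
  static void uaccess_klog_flush(struct uaccess_klog *log)
  {
          unsigned long bytes = min_t(unsigned long, log->usize,
                                      log->count * sizeof(log->ent[0]));

          if (bytes && copy_to_user(log->uaddr, log->ent, bytes))
                  bytes = 0;      /* drop the log on fault, as the patch does */
          log->count = 0;
  }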

I'm pondering what else might be getting exposed by creating this level
of probing... kernel addresses would already be getting rejected, so
they wouldn't show up in the buffer. Hmm. Jann, any thoughts here?


Some other thoughts:


Instead of reimplementing copy_*_user() with a new wrapper that
bypasses some checks and adds others and has to stay in sync, etc,
how about just adding a "recursion" flag? Something like:

copy_from_user(...)
    instrument_copy_from_user(...)
        uaccess_buffer_log_read(...)
            if (current->uaccess_buffer.writing)
                return;
            uaccess_buffer_log(...)
                current->uaccess_buffer.writing = true;
                copy_to_user(...)
                current->uaccess_buffer.writing = false;
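
Concretely, a sketch of that guard layered on the patch's logger. The
"writing" flag is the new piece (a bool that would be added to struct
uaccess_buffer_info); the other fields and struct uaccess_buffer_entry
are the ones the patch already defines:

  #include <linux/prctl.h>
  #include <linux/sched.h>
  #include <linux/uaccess.h>

  static void uaccess_buffer_log(unsigned long addr, unsigned long size,
                                 unsigned long flags)
  {
          struct uaccess_buffer_entry entry = {
                  .addr = addr, .size = size, .flags = flags,
          };

          if (current->uaccess_buffer.writing ||
              current->uaccess_buffer.size < sizeof(entry))
                  return;

          /* Cut off re-entry, then use the ordinary copy_to_user(): the
           * logging hook invoked from inside that copy sees the flag and
           * bails out instead of recursing. */
          current->uaccess_buffer.writing = true;
          if (copy_to_user((void __user *)current->uaccess_buffer.addr,
                           &entry, sizeof(entry))) {
                  current->uaccess_buffer.state = 0;
                  current->uaccess_buffer.addr = 0;
                  current->uaccess_buffer.size = 0;
          } else {
                  current->uaccess_buffer.addr += sizeof(entry);
                  current->uaccess_buffer.size -= sizeof(entry);
          }
          current->uaccess_buffer.writing = false;
  }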


How about using this via seccomp instead of a per-syscall prctl? This
would mean you would have very specific control over which syscalls
should get the uaccess tracing, and wouldn't need to deal with
the signal mask (I think). I would imagine something similar to
SECCOMP_FILTER_FLAG_LOG, maybe SECCOMP_FILTER_FLAG_UACCESS_TRACE,
plus a new top-level seccomp command (like SECCOMP_GET_NOTIF_SIZES),
maybe named SECCOMP_SET_UACCESS_TRACE_BUFFER.

This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
program wants to collect the results after every syscall. And maybe this
won't make any sense across exec (losing the mm that was used during
SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.
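
Purely as a strawman, the userspace side of that model might look
roughly as follows. SECCOMP_FILTER_FLAG_UACCESS_TRACE is the proposed,
non-existent flag (given a placeholder value here); the filter, prctl()
and seccomp(2) usage are the existing API:

  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <stddef.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Proposed in this thread; placeholder value, not a real UAPI flag. */
  #define SECCOMP_FILTER_FLAG_UACCESS_TRACE (1UL << 7)

  /* Trace uaccesses only for uname(2); everything else is untouched.
   * (The usual architecture check is omitted for brevity.) */
  static int install_uaccess_trace_filter(void)
  {
          struct sock_filter filter[] = {
                  BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                           offsetof(struct seccomp_data, nr)),
                  BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_uname, 0, 1),
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
          };
          struct sock_fprog prog = {
                  .len = sizeof(filter) / sizeof(filter[0]),
                  .filter = filter,
          };

          if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                  return -1;
          /* Registering the log buffer would use the proposed
           * SECCOMP_SET_UACCESS_TRACE_BUFFER command; its argument layout
           * is not specified in this thread, so it is not sketched here. */
          return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_UACCESS_TRACE, &prog);
  }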


-Kees

[1] https://git.kernel.org/linus/d0d4730ac2e404a5b0da9a87ef38c73e51cb1664

--
Kees Cook

2021-09-22 16:06:11

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <[email protected]> wrote:
> On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > Peter Collingbourne <[email protected]> writes:
> >
> > > This patch introduces a kernel feature known as uaccess logging.
> > > With uaccess logging, the userspace program passes the address and size
> > > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > > is a request for the kernel to log any uaccesses made during the next
> > > syscall to the uaccess buffer. When the next syscall returns, the address
> > > one past the end of the logged uaccess buffer entries is written to the
> > > location specified by the third argument to the prctl(). In this way,
> > > the userspace program may enumerate the uaccesses logged to the access
> > > buffer to determine which accesses occurred.
> > > [...]
> > > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > > parts of user memory to fuzz), as an alternative to providing the tool
> > > with a list of syscalls and their uaccesses (which again thanks to
> > > (2) may not be accurate).
> >
> > How is logging the kernel's activity like this not a significant
> > information leak? How is this safe for unprivileged users?
>
> This does result in userspace being able to "watch" the kernel progress
> through a syscall. I'd say it's less dangerous than userfaultfd, but
> still worrisome. (And userfaultfd is normally disabled[1] for unprivileged
> users trying to interpose the kernel accessing user memory.)
>
> Regardless, this is a pretty useful tool for this kind of fuzzing.
> Perhaps the timing exposure could be mitigated by having the kernel
> collect the record in a separate kernel-allocated buffer and flush the
> results to userspace at syscall exit? (This would solve the
> copy_to_user() recursion issue too.)

Other than what Kees has already said, the only security concern I
have with that patch should be trivial to fix: If the ->uaccess_buffer
machinery writes to current's memory, it must be reset during
execve(), before switching to the new mm, to prevent the old task from
causing the kernel to scribble into the new mm.
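
A minimal sketch of that reset, assuming a helper called from the
execve() path (e.g. from begin_new_exec()) before the new mm is
installed; the fields are the ones the patch adds to struct task_struct:

  #include <linux/sched.h>

  /* Hypothetical helper: disarm a pending logging request so that a
   * prctl() issued by the old program cannot make the kernel write log
   * entries or the end address into the new program's address space. */
  static inline void uaccess_buffer_cancel(struct task_struct *tsk)
  {
          tsk->uaccess_buffer.state = 0;
          tsk->uaccess_buffer.addr = 0;
          tsk->uaccess_buffer.size = 0;
          tsk->uaccess_buffer.store_end_addr = 0;
  }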

One aspect that might benefit from some clarification on intended
behavior is: what should happen if there are BPF tracing programs
running (possibly as part of some kind of system-wide profiling or
such) that poke around in userspace memory with BPF's uaccess helpers
(especially "bpf_copy_from_user")?

> I'm pondering what else might be getting exposed by creating this level
> of probing... kernel addresses would already be getting rejected, so
> they wouldn't show up in the buffer. Hmm. Jann, any thoughts here?
>
>
> Some other thoughts:
>
>
> Instead of reimplementing copy_*_user() with a new wrapper that
> bypasses some checks and adds others and has to stay in sync, etc,
> how about just adding a "recursion" flag? Something like:
>
> copy_from_user(...)
> instrument_copy_from_user(...)
> uaccess_buffer_log_read(...)
> if (current->uaccess_buffer.writing)
> return;
> uaccess_buffer_log(...)
> current->uaccess_buffer.writing = true;
> copy_to_user(...)
> current->uaccess_buffer.writing = false;
>
>
> How about using this via seccomp instead of a per-syscall prctl? This
> would mean you would have very specific control over which syscalls
> should get the uaccess tracing, and wouldn't need to deal with
> the signal mask (I think). I would imagine something similar to
> SECCOMP_FILTER_FLAG_LOG, maybe SECCOMP_FILTER_FLAG_UACCESS_TRACE, and
> add a new top-level seccomp command, (like SECCOMP_GET_NOTIF_SIZES)
> maybe named SECCOMP_SET_UACCESS_TRACE_BUFFER.
>
> This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> program wants to collect the results after every syscall. And maybe this
> won't make any sense across exec (losing the mm that was used during
> SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.

And then I guess your plan would be that userspace would be expected
to use the userspace instruction pointer
(seccomp_data::instruction_pointer) to indicate instructions that
should be traced?

Or instead of seccomp, you could do it kinda like
https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html
, with a prctl that specifies a specific instruction pointer?
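
For reference, the existing syscall-user-dispatch interface mentioned
above is driven by a prctl() along these lines (this part is real API,
Linux 5.11+); the suggestion would be a similar prctl() that keys
uaccess logging on the caller's instruction pointer:

  #include <linux/prctl.h>
  #include <sys/prctl.h>

  /* With the selector byte set to SYSCALL_DISPATCH_FILTER_BLOCK,
   * syscalls issued from outside [start, start + len) raise SIGSYS;
   * syscalls from inside the region are always allowed. */
  static char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

  static int enable_syscall_user_dispatch(void *start, unsigned long len)
  {
          return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                       (unsigned long)start, len, &sud_selector);
  }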

2021-09-22 17:48:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On 22.09.21 08:18, Peter Collingbourne wrote:
> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the third argument to the prctl(). In this way,
> the userspace program may enumerate the uaccesses logged to the access
> buffer to determine which accesses occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> making use of the ARM Memory Tagging Extension (MTE) need to monitor
> all memory accesses in a program so that they can detect memory
> errors. For accesses made purely in userspace, this is achieved
> via compiler instrumentation, or for MTE, via direct hardware
> support. However, accesses made by the kernel on behalf of the
> user program via syscalls (i.e. uaccesses) are invisible to these
> tools. With MTE there is some level of error detection possible in
> the kernel (in synchronous mode, bad accesses generally result in
> returning -EFAULT from the syscall), but by the time we get back to
> userspace we've lost the information about the address and size of the
> failed access, which makes it harder to produce a useful error report.
>
> With the current versions of the sanitizers, we address this by
> interposing the libc syscall stubs with a wrapper that checks the
> memory based on what we believe the uaccesses will be. However, this
> creates a maintenance burden: each syscall must be annotated with
> its uaccesses in order to be recognized by the sanitizer, and these
> annotations must be continuously updated as the kernel changes. This
> is especially burdensome for syscalls such as ioctl(2) which have a
> large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses. This can be achieved in
> conjunction with the userspace memory safety tools mentioned in (1).
> Even a sanitizer whose syscall wrappers have complete knowledge of
> the kernel's intended API may vary from the kernel's actual uaccesses
> due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> uaccesses may produce more accurate error reports that reveal such
> bugs.
>
> An example of such a bug, which was found by an earlier version of this
> patch together with a prototype client of the API in HWASan, was fixed
> by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> a struct ifreq for socket ioctls"). Although this bug turned out to
> relatively harmless, it was a bug nonetheless and it's always possible
> that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> guide a kernel fuzzing tool such as syzkaller (so that it knows which
> parts of user memory to fuzz), as an alternative to providing the tool
> with a list of syscalls and their uaccesses (which again thanks to
> (2) may not be accurate).
>
> All signals except SIGKILL and SIGSTOP are masked for the interval
> between the prctl() and the next syscall in order to prevent handlers
> for intervening asynchronous signals from issuing syscalls that may
> cause uaccesses from the wrong syscall to be logged.

Stupid question: can this be exploited from user space to effectively
disable SIGKILL for a long time ... and do we care?

Like, the application allocates a bunch of memory, issues the prctl()
and spins in user space. What would happen if the OOM killer selects
this task as a target and does a do_send_sig_info(SIGKILL,
SEND_SIG_PRIV, ...) ?

--
Thanks,

David / dhildenb

2021-09-22 19:24:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, 22 Sep 2021 19:46:47 +0200
David Hildenbrand <[email protected]> wrote:

> > All signals except SIGKILL and SIGSTOP are masked for the interval
> > between the prctl() and the next syscall in order to prevent handlers
> > for intervening asynchronous signals from issuing syscalls that may
> > cause uaccesses from the wrong syscall to be logged.
>
> Stupid question: can this be exploited from user space to effectively
> disable SIGKILL for a long time ... and do we care?

I first misread it too, but then caught my mistake reading it a second
time. It says "except SIGKILL". So no, it does not disable SIGKILL.

-- Steve

2021-09-22 20:00:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 03:22:50PM -0400, Steven Rostedt wrote:
> On Wed, 22 Sep 2021 19:46:47 +0200
> David Hildenbrand <[email protected]> wrote:
>
> > > All signals except SIGKILL and SIGSTOP are masked for the interval
> > > between the prctl() and the next syscall in order to prevent handlers
> > > for intervening asynchronous signals from issuing syscalls that may
> > > cause uaccesses from the wrong syscall to be logged.
> >
> > Stupid question: can this be exploited from user space to effectively
> > disable SIGKILL for a long time ... and do we care?
>
> I first misread it too, but then caught my mistake reading it a second
> time. It says "except SIGKILL". So no, it does not disable SIGKILL.

Disabling SIGINT might already be a giant nuisance. Letting through
SIGSTOP but not SIGCONT seems awkward. Blocking SIGTRAP seems like a bad
idea too. Blocking SIGBUS as delivered by #MC will be hilarious.

2021-09-22 22:10:47

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 12:56 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 03:22:50PM -0400, Steven Rostedt wrote:
> > On Wed, 22 Sep 2021 19:46:47 +0200
> > David Hildenbrand <[email protected]> wrote:
> >
> > > > All signals except SIGKILL and SIGSTOP are masked for the interval
> > > > between the prctl() and the next syscall in order to prevent handlers
> > > > for intervening asynchronous signals from issuing syscalls that may
> > > > cause uaccesses from the wrong syscall to be logged.
> > >
> > > Stupid question: can this be exploited from user space to effectively
> > > disable SIGKILL for a long time ... and do we care?
> >
> > I first misread it too, but then caught my mistake reading it a second
> > time. It says "except SIGKILL". So no, it does not disable SIGKILL.
>
> Disabling SIGINT might already be a giant nuisance. Letting through
> SIGSTOP but not SIGCONT seems awkward. Blocking SIGTRAP seems like a bad
> idea too. Blocking SIGBUS as delivered by #MC will be hillarious.

I'm only blocking the signals that are already blockable from
userspace via rt_sigprocmask (which prevents blocking SIGKILL and
SIGSTOP, but allows blocking the others including SIGBUS, for which
the man page states that the result is undefined if synchronously
generated while blocked). So in terms of blocking signals I don't
think this is exposing any new capabilities.
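
For comparison, the same mask is already reachable from userspace; a
minimal illustration using the glibc API:

  #include <signal.h>

  /* sigfillset() puts SIGKILL and SIGSTOP into the set, but the kernel
   * silently refuses to block them -- the same mask the prctl()
   * installs temporarily. */
  static void block_all_blockable(sigset_t *saved)
  {
          sigset_t all;

          sigfillset(&all);
          sigprocmask(SIG_SETMASK, &all, saved);
  }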

Per the sigaction man page, SIGKILL and SIGSTOP can't have userspace
signal handlers, so we don't need to block them in order to prevent
intervening asynchronous signal handlers (nor do we want to due to the
DoS potential). I would need to double-check the behavior, but I
believe that for SIGCONT, continuing the process is separate from
signal delivery and unaffected by blocking (see prepare_signal() in
kernel/signal.c) -- so the SIGCONT will make the process continue, and
the handler, if any, will be called once the syscall returns and the
signal mask is automatically restored.

Peter

2021-09-22 22:35:18

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 6:45 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Wed, 22 Sept 2021 at 08:18, Peter Collingbourne <[email protected]> wrote:
> >
> > This patch introduces a kernel feature known as uaccess logging.
> > With uaccess logging, the userspace program passes the address and size
> > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > is a request for the kernel to log any uaccesses made during the next
> > syscall to the uaccess buffer. When the next syscall returns, the address
> > one past the end of the logged uaccess buffer entries is written to the
> > location specified by the third argument to the prctl(). In this way,
> > the userspace program may enumerate the uaccesses logged to the access
> > buffer to determine which accesses occurred.
> >
> > Uaccess logging has several use cases focused around bug detection
> > tools:
> >
> > 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> > making use of the ARM Memory Tagging Extension (MTE) need to monitor
> > all memory accesses in a program so that they can detect memory
> > errors. For accesses made purely in userspace, this is achieved
> > via compiler instrumentation, or for MTE, via direct hardware
> > support. However, accesses made by the kernel on behalf of the
> > user program via syscalls (i.e. uaccesses) are invisible to these
> > tools. With MTE there is some level of error detection possible in
> > the kernel (in synchronous mode, bad accesses generally result in
> > returning -EFAULT from the syscall), but by the time we get back to
> > userspace we've lost the information about the address and size of the
> > failed access, which makes it harder to produce a useful error report.
> >
> > With the current versions of the sanitizers, we address this by
> > interposing the libc syscall stubs with a wrapper that checks the
> > memory based on what we believe the uaccesses will be. However, this
> > creates a maintenance burden: each syscall must be annotated with
> > its uaccesses in order to be recognized by the sanitizer, and these
> > annotations must be continuously updated as the kernel changes. This
> > is especially burdensome for syscalls such as ioctl(2) which have a
> > large surface area of possible uaccesses.
> >
> > 2) Verifying the validity of kernel accesses. This can be achieved in
> > conjunction with the userspace memory safety tools mentioned in (1).
> > Even a sanitizer whose syscall wrappers have complete knowledge of
> > the kernel's intended API may vary from the kernel's actual uaccesses
> > due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> > uaccesses may produce more accurate error reports that reveal such
> > bugs.
> >
> > An example of such a bug, which was found by an earlier version of this
> > patch together with a prototype client of the API in HWASan, was fixed
> > by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> > a struct ifreq for socket ioctls"). Although this bug turned out to
> > relatively harmless, it was a bug nonetheless and it's always possible
> > that more serious bugs of this sort may be introduced in the future.
> >
> > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > parts of user memory to fuzz), as an alternative to providing the tool
> > with a list of syscalls and their uaccesses (which again thanks to
> > (2) may not be accurate).
> >
> > All signals except SIGKILL and SIGSTOP are masked for the interval
> > between the prctl() and the next syscall in order to prevent handlers
> > for intervening asynchronous signals from issuing syscalls that may
> > cause uaccesses from the wrong syscall to be logged.
> >
> > The format of a uaccess buffer entry is defined as follows:
> >
> > struct uaccess_buffer_entry {
> > u64 addr, size, flags;
> > };
> >
> > The meaning of addr and size should be obvious. On arm64, tag bits
> > are preserved in the addr field. The current meaning of the flags
> > field is that bit 0 indicates whether the access was a read (clear)
> > or a write (set). The meaning of all other flag bits is reserved.
> > All fields are of type u64 in order to avoid compat concerns.
> >
> > Here is an example of a code snippet that will enumerate the accesses
> > performed by a uname(2) syscall:
> >
> > struct uaccess_buffer_entry entries[64];
> > uint64_t entries_end64 = (uint64_t)&entries;
> > struct utsname un;
> > prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
> > uname(&un);
> > struct uaccess_buffer_entry *entries_end = (struct uaccess_buffer_entry *)entries_end64;
> > for (struct uaccess_buffer_entry *i = entries; i != entries_end; ++i) {
> > printf("%s at 0x%lx size 0x%lx\n",
> > i->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> > (unsigned long)i->addr, (unsigned long)i->size);
> > }
> >
> > Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> > course, not all of the accesses may fit in the buffer, but aside from
> > that, there are syscalls such as async I/O that are currently missed due
> > to the uaccesses occurring on a different kernel task (this is analogous
> > to how async I/O accesses are exempt from userspace MTE checks). We
> > view this as acceptable, as the access buffer can be sized sufficiently
> > large to handle syscalls that make a reasonable number of uaccesses,
> > and syscalls that use a different task for uaccesses are rare. In
> > many cases, the sanitizer does not need to see every memory access,
> > so it's fine if we miss the odd uaccess here and there. Even for those
> > sanitizers that do need to see every memory access it still represents
> > a much lower maintenance burden if we just have to handle the unusual
> > syscalls specially.
> >
> > Because we don't have a common kernel entry/exit code path that is used
> > on all architectures, uaccess logging is only implemented for arm64 and
> > architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
> >
> > One downside of this ABI is that it involves making two syscalls per
> > "real" syscall, which can harm performance. One possible way to avoid
> > this may be to have the prctl() register the uaccess buffer location
> > once at thread startup and use the same location for all syscalls in
> > the thread. However, because the program may be making syscalls very
> > early, before TLS is available, this may not always work. Furthermore,
> > because of the same asynchronous signal concerns that prompted temporarily
> > masking signals after the prctl(), the syscall stub would need to be made
> > reentrant, and it is unclear whether this is feasible without manually
> > masking asynchronous signals using rt_sigprocmask(2) while reading the
> > uaccess buffer, defeating the purpose of avoiding the extra syscall.
> >
> > One idea that we considered involved using the stack pointer address as
> > a unique identifier for the syscall, but this currently would need to be
> > arch-specific as we currently do not appear to have an arch-generic way
> > of retrieving the stack pointer; the userspace side would also need some
> > arch-specific code for this to work. It's also possible that a longjmp()
> > past the signal handler would make the stack pointer address not unique
> > enough for this purpose.
> >
> > On the other hand, by allocating the uaccess log on the stack and blocking
> > asynchronous signals for the interval between the prctl() and the "real"
> > syscall, we can avoid any reentrancy and TLS concerns.
> >
> > Another way to avoid the overhead may be to use an architecture-specific
> > calling convention to pass the address of the uaccess buffer to the kernel
> > at syscall time in registers currently unused for syscall arguments. For
> > example, one arm64-specific scheme that was used in a previous iteration
> > of the patch was:
> >
> > - Bit 0 of the immediate argument to the SVC instruction must be set.
> > - Register X6 contains the address of the access buffer.
> > - Register X7 contains the size of the access buffer in bytes.
> > - On return, X6 will contain the address of the memory location following
> > any access buffer entries written by the kernel.
> >
> > However, this would need to be implemented separately for each
> > architecture (and some of them don't have enough registers anyway),
> > whereas the prctl() is (at least in theory) architecture-generic.
> >
> > We also evaluated implementing this on top of the existing tracepoint
> > facility, but concluded that it is not suitable for this purpose:
> >
> > - Tracepoints have a per-task granularity at best, whereas we really want
> > to trace per-syscall. This is so that we can exclude syscalls that
> > should not be traced, such as syscalls that make up part of the
> > sanitizer implementation (to avoid infinite recursion when e.g. printing
> > an error report).
> >
> > - Tracing would need to be synchronous in order to produce useful
> > stack traces. For example this could be achieved using the new SIGTRAP
> > on perf events mechanism. However, this would require logging each
> > access to the stack (in the form of a sigcontext) and this is more
> > likely to overflow the stack due to being much larger than a uaccess
> > buffer entry as well as being unbounded, in contrast to the bounded
> > buffer size passed to prctl(). An approach based on signal handlers is
> > also likely to fall foul of the asynchronous signal issues mentioned
> > previously, together with needing sigreturn to be handled specially
> > (because it copies a sigcontext from userspace) otherwise we could
> > never return from the signal handler. Furthermore, arguments to the
> > trace events are not available to SIGTRAP. (This on its own wouldn't
> > be insurmountable though -- we could add the arguments as fields
> > to siginfo.)
> >
> > - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
> > -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
> > I don't think it's useable because it's per-CPU and not per-task.
> >
> > - Tracepoints can be used by eBPF programs, but eBPF programs may
> > only be loaded as root, among other potential headaches.
>
> Hi Peter,
>
> Is this intended to be used with the real syscall only? I think for
> sanitizers we want to use this with libc syscall wrappers and more
> complex libc functions as well. Signal blocking assumes that there
> will be one and only one real syscall. I wonder if this can create
> problems with libc functions, in particular if a libc function does
> more than one syscall, no syscalls, or a variable number of syscalls.

Yes, the intent is that this will only be used from the code that
executes the "real" syscall. Otherwise, as you point out, there's the
possibility of the libc function executing a different number of
syscalls, especially if symbol interposition comes into the picture.
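
To illustrate the intended call site, here is a minimal sketch of a
sanitizer-style wrapper around the raw syscall. PR_LOG_UACCESS and struct
uaccess_buffer_entry are assumed to come from the patched <linux/prctl.h>
as in this patch; check_entries() is a hypothetical sanitizer callback,
not part of the patch.

#include <stdint.h>
#include <sys/prctl.h>  /* patched headers: PR_LOG_UACCESS, uaccess_buffer_entry */
#include <unistd.h>

void check_entries(struct uaccess_buffer_entry *begin,
                   struct uaccess_buffer_entry *end); /* hypothetical */

long logged_syscall(long nr, long a0, long a1, long a2,
                    long a3, long a4, long a5)
{
        struct uaccess_buffer_entry entries[64]; /* stack-allocated log */
        uint64_t end = (uint64_t)entries;
        long ret;

        /* Arm logging for exactly the next syscall; the kernel keeps
         * asynchronous signals masked until that syscall is made. */
        prctl(PR_LOG_UACCESS, entries, sizeof(entries), &end, 0);
        ret = syscall(nr, a0, a1, a2, a3, a4, a5);

        /* [entries, end) now describes the uaccesses of that one syscall. */
        check_entries(entries, (struct uaccess_buffer_entry *)end);
        return ret;
}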

Peter

2021-09-23 08:11:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On 22.09.21 21:22, Steven Rostedt wrote:
> On Wed, 22 Sep 2021 19:46:47 +0200
> David Hildenbrand <[email protected]> wrote:
>
>>> All signals except SIGKILL and SIGSTOP are masked for the interval
>>> between the prctl() and the next syscall in order to prevent handlers
>>> for intervening asynchronous signals from issuing syscalls that may
>>> cause uaccesses from the wrong syscall to be logged.
>>
>> Stupid question: can this be exploited from user space to effectively
>> disable SIGKILL for a long time ... and do we care?
>
> I first misread it too, but then caught my mistake reading it a second
> time. It says "except SIGKILL". So no, it does not disable SIGKILL.

Thanks for pointing out the obvious Steve :)

--
Thanks,

David / dhildenb

2021-09-24 23:08:22

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 8:59 AM Jann Horn <[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <[email protected]> wrote:
> > On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > > Peter Collingbourne <[email protected]> writes:
> > >
> > > > This patch introduces a kernel feature known as uaccess logging.
> > > > With uaccess logging, the userspace program passes the address and size
> > > > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > > > is a request for the kernel to log any uaccesses made during the next
> > > > syscall to the uaccess buffer. When the next syscall returns, the address
> > > > one past the end of the logged uaccess buffer entries is written to the
> > > > location specified by the third argument to the prctl(). In this way,
> > > > the userspace program may enumerate the uaccesses logged to the access
> > > > buffer to determine which accesses occurred.
> > > > [...]
> > > > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > > > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > > > parts of user memory to fuzz), as an alternative to providing the tool
> > > > with a list of syscalls and their uaccesses (which again thanks to
> > > > (2) may not be accurate).
> > >
> > > How is logging the kernel's activity like this not a significant
> > > information leak? How is this safe for unprivileged users?
> >
> > This does result in userspace being able to "watch" the kernel progress
> > through a syscall. I'd say it's less dangerous than userfaultfd, but
> > still worrisome. (And userfaultfd is normally disabled[1] for unprivileged
> > users trying to interpose the kernel accessing user memory.)
> >
> > Regardless, this is a pretty useful tool for this kind of fuzzing.
> > Perhaps the timing exposure could be mitigated by having the kernel
> > collect the record in a separate kernel-allocated buffer and flush the
> > results to userspace at syscall exit? (This would solve the
> > copy_to_user() recursion issue too.)

Seems reasonable. I suppose that in terms of timing information we're
already (unavoidably) exposing how long the syscall took overall, and
we probably shouldn't deliberately expose more than that.

That being said, I'm wondering if that has security implications on
its own if it's then possible for userspace to manipulate the kernel
into allocating a large buffer (either at prctl() time or as a result
of getting the kernel to do a large number of uaccesses). Perhaps it
can be mitigated by limiting the size of the uaccess buffer provided
at prctl() time.

> Other than what Kees has already said, the only security concern I
> have with that patch should be trivial to fix: If the ->uaccess_buffer
> machinery writes to current's memory, it must be reset during
> execve(), before switching to the new mm, to prevent the old task from
> causing the kernel to scribble into the new mm.

Yes, that's a good point. I'll fix that in the next version.
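
For concreteness, a minimal sketch of the kind of reset meant here (the
field names are from this v1 patch; the helper name and the exact hook
point in the execve path are assumptions, not the actual follow-up change):

/* Sketch: drop any logging state that points into the old mm so the
 * kernel cannot scribble into the new program's address space. Would be
 * called from the execve path before switching to the new mm. */
static void uaccess_buffer_reset_for_exec(struct task_struct *tsk)
{
        tsk->uaccess_buffer.state = 0;
        tsk->uaccess_buffer.addr = 0;
        tsk->uaccess_buffer.size = 0;
        tsk->uaccess_buffer.store_end_addr = 0;
}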

> One aspect that might benefit from some clarification on intended
> behavior is: what should happen if there are BPF tracing programs
> running (possibly as part of some kind of system-wide profiling or
> such) that poke around in userspace memory with BPF's uaccess helpers
> (especially "bpf_copy_from_user")?

I think we should probably be ignoring those accesses, since we cannot
know a priori whether the accesses are directly associated with the
syscall or not, and this is after all a best-effort mechanism.

> > I'm pondering what else might be getting exposed by creating this level
> > of probing... kernel addresses would already be getting rejected, so
> > they wouldn't show up in the buffer. Hmm. Jann, any thoughts here?
> >
> >
> > Some other thoughts:
> >
> >
> > Instead of reimplementing copy_*_user() with a new wrapper that
> > bypasses some checks and adds others and has to stay in sync, etc,
> > how about just adding a "recursion" flag? Something like:
> >
> > copy_from_user(...)
> >     instrument_copy_from_user(...)
> >         uaccess_buffer_log_read(...)
> >             if (current->uaccess_buffer.writing)
> >                 return;
> >             uaccess_buffer_log(...)
> >                 current->uaccess_buffer.writing = true;
> >                 copy_to_user(...)
> >                 current->uaccess_buffer.writing = false;
> >
> >
> > How about using this via seccomp instead of a per-syscall prctl? This
> > would mean you would have very specific control over which syscalls
> > should get the uaccess tracing, and wouldn't need to deal with
> > the signal mask (I think). I would imagine something similar to
> > SECCOMP_FILTER_FLAG_LOG, maybe SECCOMP_FILTER_FLAG_UACCESS_TRACE, and
> > add a new top-level seccomp command, (like SECCOMP_GET_NOTIF_SIZES)
> > maybe named SECCOMP_SET_UACCESS_TRACE_BUFFER.
> >
> > This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> > program wants to collect the results after every syscall. And maybe this
> > won't make any sense across exec (losing the mm that was used during
> > SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.
>
> And then I guess your plan would be that userspace would be expected
> to use the userspace instruction pointer
> (seccomp_data::instruction_pointer) to indicate instructions that
> should be traced?
>
> Or instead of seccomp, you could do it kinda like
> https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html
> , with a prctl that specifies a specific instruction pointer?

Given a choice between these two options, I would prefer the prctl()
because userspace programs may already be using seccomp filters and
sanitizers shouldn't interfere with them.

However, in either the seccomp filter or prctl() case, you still have
the problem of deciding where to log to. Keep in mind that you would
need to prevent intervening async signals (that occur between when the
syscall happens and when we read the log) from triggering additional
syscalls that may overwrite the log (as a result of using the same
syscall wrapper). This implies that the log location would need to be
per-syscall, so the location would need to be on the stack (or
equivalent, like a heap-allocated buffer with lifetime tied to that of
the syscall wrapper function). So then, how do you notify the kernel
of that location? One possibility is arch-specific augmentation of the
syscall calling convention, as I mentioned in the initial message. But
then if you're doing that, you might as well dispense with the seccomp
filter or prctl() entirely and have the request for a log be
communicated purely via the calling convention.

It seems like any approach which avoids the multiple syscalls is going
to be arch-specific to some degree. So I think I would prefer to start
with a "slow but simple" API, and let architectures provide their own
arch-specific mechanisms if they wish to optimize it.

Peter

2021-09-26 02:21:53

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Fri, Sep 24, 2021 at 02:50:04PM -0700, Peter Collingbourne wrote:
> On Wed, Sep 22, 2021 at 8:59 AM Jann Horn <[email protected]> wrote:
> >
> > On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <[email protected]> wrote:
> > > On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > > > Peter Collingbourne <[email protected]> writes:
> > > > > This patch introduces a kernel feature known as uaccess logging.
> > > > > [...]
> > > > [...]
> > > > How is logging the kernel's activity like this not a significant
> > > > information leak? How is this safe for unprivileged users?
> > > [...]
> > > Regardless, this is a pretty useful tool for this kind of fuzzing.
> > > Perhaps the timing exposure could be mitigated by having the kernel
> > > collect the record in a separate kernel-allocated buffer and flush the
> > > results to userspace at syscall exit? (This would solve the
> > > copy_to_user() recursion issue too.)
>
> Seems reasonable. I suppose that in terms of timing information we're
> already (unavoidably) exposing how long the syscall took overall, and
> we probably shouldn't deliberately expose more than that.

Right -- I can't think of anything that can really use this today,
but it very much feels like the kind of information that could aid in
a timing race.

> That being said, I'm wondering if that has security implications on
> its own if it's then possible for userspace to manipulate the kernel
> into allocating a large buffer (either at prctl() time or as a result
> of getting the kernel to do a large number of uaccesses). Perhaps it
> can be mitigated by limiting the size of the uaccess buffer provided
> at prctl() time.

There are a lot of exact-size allocation controls already (which I think
is an unavoidable but separate issue[1]), but perhaps this could be
mitigated by making the reserved buffer be PAGE_SIZE granular?

> > One aspect that might benefit from some clarification on intended
> > behavior is: what should happen if there are BPF tracing programs
> > running (possibly as part of some kind of system-wide profiling or
> > such) that poke around in userspace memory with BPF's uaccess helpers
> > (especially "bpf_copy_from_user")?
>
> I think we should probably be ignoring those accesses, since we cannot
> know a priori whether the accesses are directly associated with the
> syscall or not, and this is after all a best-effort mechanism.

Perhaps the "don't log this uaccess" flag I suggested could be
repurposed by BPF too, as a general "make this access invisible to
PR_LOG_UACCESS" flag? i.e. this bit:

> > > Instead of reimplementing copy_*_user() with a new wrapper that
> > > bypasses some checks and adds others and has to stay in sync, etc,
> > > how about just adding a "recursion" flag? Something like:
> > >
> > > copy_from_user(...)
> > >     instrument_copy_from_user(...)
> > >         uaccess_buffer_log_read(...)
> > >             if (current->uaccess_buffer.writing)
> > >                 return;
> > >             uaccess_buffer_log(...)
> > >                 current->uaccess_buffer.writing = true;
> > >                 copy_to_user(...)
> > >                 current->uaccess_buffer.writing = false;



> > > This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> > > program wants to collect the results after every syscall. And maybe this
> > > won't make any sense across exec (losing the mm that was used during
> > > SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.
> >
> > And then I guess your plan would be that userspace would be expected
> > to use the userspace instruction pointer
> > (seccomp_data::instruction_pointer) to indicate instructions that
> > should be traced?

That could be one way -- but seccomp filters would allow a bunch of
ways.

> >
> > Or instead of seccomp, you could do it kinda like
> > https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html
> > , with a prctl that specifies a specific instruction pointer?
>
> Given a choice between these two options, I would prefer the prctl()
> because userspace programs may already be using seccomp filters and
> sanitizers shouldn't interfere with it.

That's fair -- the "I wish we could make complex decisions about which
syscalls to act on" sounds like seccomp.

> However, in either the seccomp filter or prctl() case, you still have
> the problem of deciding where to log to. Keep in mind that you would
> need to prevent intervening async signals (that occur between when the
> syscall happens and when we read the log) from triggering additional

Could the sig handler also set the "make the uaccess invisible" flag?
(It would need to be a "depth" flag, most likely.)

-Kees

[1] https://github.com/KSPP/linux/issues/9

--
Kees Cook

2021-11-23 05:17:44

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Tue, Sep 21, 2021 at 11:30 PM Cyrill Gorcunov <[email protected]> wrote:
>
> On Tue, Sep 21, 2021 at 11:18:09PM -0700, Peter Collingbourne wrote:
> > This patch introduces a kernel feature known as uaccess logging.
> > With uaccess logging, the userspace program passes the address and size
> > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > is a request for the kernel to log any uaccesses made during the next
> > syscall to the uaccess buffer. When the next syscall returns, the address
> > one past the end of the logged uaccess buffer entries is written to the
> > location specified by the third argument to the prctl(). In this way,
> > the userspace program may enumerate the uaccesses logged to the access
> > buffer to determine which accesses occurred.
> ...
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e12b524426b0..3fecb0487b97 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -34,6 +34,7 @@
> > #include <linux/rseq.h>
> > #include <linux/seqlock.h>
> > #include <linux/kcsan.h>
> > +#include <linux/uaccess_buffer_info.h>
> > #include <asm/kmap_size.h>
> >
> > /* task_struct member predeclarations (sorted alphabetically): */
> > @@ -1487,6 +1488,8 @@ struct task_struct {
> > struct callback_head l1d_flush_kill;
> > #endif
> >
> > + struct uaccess_buffer_info uaccess_buffer;
> > +
>
> Hi, Peter! I didn't read the patch carefully yet (will do once time permit)
> but from a glance should not this member be under #ifdef CONFIG_UACCESS_BUFFER
> or something? task_struct is already bloated too much :(

Yes, I've now added an ifdef here (previously I had the ifdef inside
struct uaccess_buffer_info itself, but I think that would still leave
some unused space in task_struct due to C struct layout rules).

>
> > + case PR_LOG_UACCESS:
> > + if (arg5)
> > + return -EINVAL;
> > + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> > + break;
>
> Same here (if only I didn't miss something obvious). If there is no support
> for CONFIG_UACCESS_BUFFER we should return an error I guess.

The uaccess_buffer_set_logging (now
uaccess_buffer_set_descriptor_addr_addr) function is defined to return
-EINVAL if CONFIG_UACCESS_BUFFER is not defined.

Peter

2021-11-23 05:18:17

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Wed, Sep 22, 2021 at 3:44 AM Marco Elver <[email protected]> wrote:
>
> On Tue, Sep 21, 2021 at 11:18PM -0700, Peter Collingbourne wrote:
> > This patch introduces a kernel feature known as uaccess logging.
> > With uaccess logging, the userspace program passes the address and size
> > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > is a request for the kernel to log any uaccesses made during the next
> > syscall to the uaccess buffer. When the next syscall returns, the address
> > one past the end of the logged uaccess buffer entries is written to the
> > location specified by the third argument to the prctl(). In this way,
> > the userspace program may enumerate the uaccesses logged to the access
> > buffer to determine which accesses occurred.
> >
> > Uaccess logging has several use cases focused around bug detection
> > tools:
> >
> > 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> > making use of the ARM Memory Tagging Extension (MTE) need to monitor
> > all memory accesses in a program so that they can detect memory
> > errors. For accesses made purely in userspace, this is achieved
> > via compiler instrumentation, or for MTE, via direct hardware
> > support. However, accesses made by the kernel on behalf of the
> > user program via syscalls (i.e. uaccesses) are invisible to these
> > tools. With MTE there is some level of error detection possible in
> > the kernel (in synchronous mode, bad accesses generally result in
> > returning -EFAULT from the syscall), but by the time we get back to
> > userspace we've lost the information about the address and size of the
> > failed access, which makes it harder to produce a useful error report.
> >
> > With the current versions of the sanitizers, we address this by
> > interposing the libc syscall stubs with a wrapper that checks the
> > memory based on what we believe the uaccesses will be. However, this
> > creates a maintenance burden: each syscall must be annotated with
> > its uaccesses in order to be recognized by the sanitizer, and these
> > annotations must be continuously updated as the kernel changes. This
> > is especially burdensome for syscalls such as ioctl(2) which have a
> > large surface area of possible uaccesses.
> >
> > 2) Verifying the validity of kernel accesses. This can be achieved in
> > conjunction with the userspace memory safety tools mentioned in (1).
> > Even a sanitizer whose syscall wrappers have complete knowledge of
> > the kernel's intended API may vary from the kernel's actual uaccesses
> > due to kernel bugs. A sanitizer with knowledge of the kernel's actual
> > uaccesses may produce more accurate error reports that reveal such
> > bugs.
> >
> > An example of such a bug, which was found by an earlier version of this
> > patch together with a prototype client of the API in HWASan, was fixed
> > by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> > a struct ifreq for socket ioctls"). Although this bug turned out to be
> > relatively harmless, it was a bug nonetheless and it's always possible
> > that more serious bugs of this sort may be introduced in the future.
> >
> > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > parts of user memory to fuzz), as an alternative to providing the tool
> > with a list of syscalls and their uaccesses (which again thanks to
> > (2) may not be accurate).
> >
> > All signals except SIGKILL and SIGSTOP are masked for the interval
> > between the prctl() and the next syscall in order to prevent handlers
> > for intervening asynchronous signals from issuing syscalls that may
> > cause uaccesses from the wrong syscall to be logged.
> >
> > The format of a uaccess buffer entry is defined as follows:
> >
> > struct access_buffer_entry {
>
> uaccess_buffer_entry (missing 'u')?
>
> Here and below.

Done.

>
> > u64 addr, size, flags;
> > };
> >
> > The meaning of addr and size should be obvious. On arm64, tag bits
> > are preserved in the addr field. The current meaning of the flags
> > field is that bit 0 indicates whether the access was a read (clear)
> > or a write (set). The meaning of all other flag bits is reserved.
> > All fields are of type u64 in order to avoid compat concerns.
> >
> > Here is an example of a code snippet that will enumerate the accesses
> > performed by a uname(2) syscall:
> >
> > struct access_buffer_entry entries[64];
> > uint64_t entries_end64 = (uint64_t)&entries;
> > struct utsname un;
> > prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
> > uname(&un);
> > struct access_buffer_entry *entries_end = (struct uaccess_buffer_entry *)entries_end64;
> > for (struct acccess_buffer_entry *i = entries; i != entries_end; ++i) {
> > printf("%s at 0x%lu size 0x%lx\n",
> > entries[i].flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> > (unsigned long)entries[i].addr, (unsigned long)entries[i].size);
> > }
>
> I think it would be good to persist information like this in dedicated
> Documentation. If I had to guess, Documentation/admin-guide/uaccess-buffer.rst
> could be the appropriate location.
>
> It also allows to write some background, and then shorten the commit
> message / cover letter (details can be placed in Documentation,
> including the example).

Done.

> > Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> > course, not all of the accesses may fit in the buffer, but aside from
> > that, there are syscalls such as async I/O that are currently missed due
> > to the uaccesses occurring on a different kernel task (this is analogous
> > to how async I/O accesses are exempt from userspace MTE checks). We
> > view this as acceptable, as the access buffer can be sized sufficiently
> > large to handle syscalls that make a reasonable number of uaccesses,
> > and syscalls that use a different task for uaccesses are rare. In
> > many cases, the sanitizer does not need to see every memory access,
> > so it's fine if we miss the odd uaccess here and there. Even for those
> > sanitizers that do need to see every memory access it still represents
> > a much lower maintenance burden if we just have to handle the unusual
> > syscalls specially.
> >
> > Because we don't have a common kernel entry/exit code path that is used
> > on all architectures, uaccess logging is only implemented for arm64 and
> > architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
> >
> > One downside of this ABI is that it involves making two syscalls per
> > "real" syscall, which can harm performance. One possible way to avoid
> > this may be to have the prctl() register the uaccess buffer location
> > once at thread startup and use the same location for all syscalls in
> > the thread. However, because the program may be making syscalls very
> > early, before TLS is available, this may not always work. Furthermore,
> > because of the same asynchronous signal concerns that prompted temporarily
> > masking signals after the prctl(), the syscall stub would need to be made
> > reentrant, and it is unclear whether this is feasible without manually
> > masking asynchronous signals using rt_sigprocmask(2) while reading the
> > uaccess buffer, defeating the purpose of avoiding the extra syscall.
> >
> > One idea that we considered involved using the stack pointer address as
> > a unique identifier for the syscall, but this currently would need to be
> > arch-specific as we currently do not appear to have an arch-generic way
> > of retrieving the stack pointer; the userspace side would also need some
> > arch-specific code for this to work. It's also possible that a longjmp()
> > past the signal handler would make the stack pointer address not unique
> > enough for this purpose.
> >
> > On the other hand, by allocating the uaccess log on the stack and blocking
> > asynchronous signals for the interval between the prctl() and the "real"
> > syscall, we can avoid any reentrancy and TLS concerns.
> >
> > Another way to avoid the overhead may be to use an architecture-specific
> > calling convention to pass the address of the uaccess buffer to the kernel
> > at syscall time in registers currently unused for syscall arguments. For
> > example, one arm64-specific scheme that was used in a previous iteration
> > of the patch was:
> >
> > - Bit 0 of the immediate argument to the SVC instruction must be set.
> > - Register X6 contains the address of the access buffer.
> > - Register X7 contains the size of the access buffer in bytes.
> > - On return, X6 will contain the address of the memory location following
> > any access buffer entries written by the kernel.
> >
> > However, this would need to be implemented separately for each
> > architecture (and some of them don't have enough registers anyway),
> > whereas the prctl() is (at least in theory) architecture-generic.
> >
> > We also evaluated implementing this on top of the existing tracepoint
> > facility, but concluded that it is not suitable for this purpose:
> >
> > - Tracepoints have a per-task granularity at best, whereas we really want
> > to trace per-syscall. This is so that we can exclude syscalls that
> > should not be traced, such as syscalls that make up part of the
> > sanitizer implementation (to avoid infinite recursion when e.g. printing
> > an error report).
> >
> > - Tracing would need to be synchronous in order to produce useful
> > stack traces. For example this could be achieved using the new SIGTRAP
> > on perf events mechanism. However, this would require logging each
> > access to the stack (in the form of a sigcontext) and this is more
> > likely to overflow the stack due to being much larger than a uaccess
> > buffer entry as well as being unbounded, in contrast to the bounded
> > buffer size passed to prctl(). An approach based on signal handlers is
> > also likely to fall foul of the asynchronous signal issues mentioned
> > previously, together with needing sigreturn to be handled specially
> > (because it copies a sigcontext from userspace) otherwise we could
> > never return from the signal handler. Furthermore, arguments to the
> > trace events are not available to SIGTRAP. (This on its own wouldn't
> > be insurmountable though -- we could add the arguments as fields
> > to siginfo.)
> >
> > - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
> > -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
> > I don't think it's useable because it's per-CPU and not per-task.
> >
> > - Tracepoints can be used by eBPF programs, but eBPF programs may
> > only be loaded as root, among other potential headaches.
> >
> > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > arch/Kconfig | 6 ++
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/kernel/syscall.c | 2 +
>
> The arch-enablement changes should be their own patches (one for arm64,
> and another for the generic entry code). And it probably further helps
> reviewability by moving bits of the commit message into a cover letter
> and Documentation.
>
> So I think there should be at least 4 patches (core code, arm64, generic
> entry, Documentation).

Done.

> > include/linux/instrumented.h | 5 +-
> > include/linux/sched.h | 3 +
> > include/linux/uaccess_buffer.h | 43 ++++++++++
> > include/linux/uaccess_buffer_info.h | 23 ++++++
> > include/uapi/linux/prctl.h | 9 +++
> > kernel/Makefile | 1 +
> > kernel/entry/common.c | 3 +
> > kernel/sys.c | 6 ++
> > kernel/uaccess_buffer.c | 118 ++++++++++++++++++++++++++++
> > 12 files changed, 219 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/uaccess_buffer.h
> > create mode 100644 include/linux/uaccess_buffer_info.h
> > create mode 100644 kernel/uaccess_buffer.c
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 8df1c7102643..a427f6440cc9 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -31,6 +31,7 @@ config HOTPLUG_SMT
> > bool
> >
> > config GENERIC_ENTRY
> > + select UACCESS_BUFFER
> > bool
> >
> > config KPROBES
> > @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
> > config ARCH_HAS_PARANOID_L1D_FLUSH
> > bool
> >
> > +config UACCESS_BUFFER
>
> This is not a user-selectable config option, so I think the name should
> be HAVE_ARCH_UACCESS_BUFFER.

Done. (There doesn't seem to be a naming convention for these -- even
for the non-configurable ones I sometimes see unadorned names,
ARCH_HAS_, etc. -- but something with ARCH is probably clearer.)

> > + bool
> > + help
> > + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> > +
> > source "kernel/gcov/Kconfig"
> >
> > source "scripts/gcc-plugins/Kconfig"
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 5c7ae4c3954b..4764e5fd7ba9 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -221,6 +221,7 @@ config ARM64
> > select THREAD_INFO_IN_TASK
> > select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> > select TRACE_IRQFLAGS_SUPPORT
> > + select UACCESS_BUFFER
>
> Once it's called HAVE_ARCH_UACCESS_BUFFER, it should also be reordered
> to be in alphabetical order with other selects here to reduce risk of
> merge conflicts (when things are added at the end).

Done.

> > help
> > ARM 64-bit (AArch64) Linux support.
> >
> > diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> > index 50a0f1a38e84..c3f8652d84a5 100644
> > --- a/arch/arm64/kernel/syscall.c
> > +++ b/arch/arm64/kernel/syscall.c
> > @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
> > goto trace_exit;
> > }
> >
> > + uaccess_buffer_syscall_entry();
> > invoke_syscall(regs, scno, sc_nr, syscall_table);
> > + uaccess_buffer_syscall_exit();
>
> I think this is missing an #include <linux/access_buffer.h>, although
> it's currently implicit via instrumented.h.

Yes, this became clear once I split the header in two.

> > /*
> > * The tracing status may have changed under our feet, so we have to
> > diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> > index 42faebbaa202..9144936edcb1 100644
> > --- a/include/linux/instrumented.h
> > +++ b/include/linux/instrumented.h
> > @@ -2,7 +2,7 @@
> >
> > /*
> > * This header provides generic wrappers for memory access instrumentation that
> > - * the compiler cannot emit for: KASAN, KCSAN.
> > + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
> > */
> > #ifndef _LINUX_INSTRUMENTED_H
> > #define _LINUX_INSTRUMENTED_H
> > @@ -11,6 +11,7 @@
> > #include <linux/kasan-checks.h>
> > #include <linux/kcsan-checks.h>
> > #include <linux/types.h>
> > +#include <linux/uaccess_buffer.h>
>
> instrumented.h is included from almost everywhere. Like is done for
> KASAN and KCSAN, it helps to minimize the header required for the
> checks.
>
> In this case, for consistency with the others, I recommend having a
> header <linux/uaccess-buffer-checks.h> with just the 2 functions
> required here. And then put the rest (including the struct) into
> <linux/uaccess-buffer.h>.

Hmm, are these really "checks" though? (Well, most likely userspace
will end up using these data structures for checks, but that's
irrelevant to the kernel.) I think that "uaccess-buffer-log-hooks.h"
would be a better name.

I also ended up moving the struct to the log hooks header so that I
would be able to access the task_struct in the inlined helper
functions in uaccess-buffer.h. I guess the alternative is that we add
yet another header just for the struct.

> > /**
> > * instrument_read - instrument regular read access
> > @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> > {
> > kasan_check_read(from, n);
> > kcsan_check_read(from, n);
> > + uaccess_buffer_log_write(to, n);
> > }
> >
> > /**
> > @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> > {
> > kasan_check_write(to, n);
> > kcsan_check_write(to, n);
> > + uaccess_buffer_log_read(from, n);
> > }
> >
> > #endif /* _LINUX_INSTRUMENTED_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e12b524426b0..3fecb0487b97 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -34,6 +34,7 @@
> > #include <linux/rseq.h>
> > #include <linux/seqlock.h>
> > #include <linux/kcsan.h>
> > +#include <linux/uaccess_buffer_info.h>
> > #include <asm/kmap_size.h>
> >
> > /* task_struct member predeclarations (sorted alphabetically): */
> > @@ -1487,6 +1488,8 @@ struct task_struct {
> > struct callback_head l1d_flush_kill;
> > #endif
> >
> > + struct uaccess_buffer_info uaccess_buffer;
> > +
> > /*
> > * New fields for task_struct should be added above here, so that
> > * they are included in the randomized portion of task_struct.
> > diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> > new file mode 100644
> > index 000000000000..3b81f2a192a4
> > --- /dev/null
> > +++ b/include/linux/uaccess_buffer.h
> > @@ -0,0 +1,43 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_ACCESS_BUFFER_H
> > +#define _LINUX_ACCESS_BUFFER_H
> > +
> > +#include <asm-generic/errno-base.h>
> > +
> > +#ifdef CONFIG_UACCESS_BUFFER
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> > +
>
> As suggested above, the above 2 functions above can go into a single
> header uaccess-buffer-checks.h (which also doesn't need to include
> errno-base.h).
>
> The rest below (merged with the struct definition) can then go into the
> second header.

Done (modulo above).

> > +void uaccess_buffer_syscall_entry(void);
> > +void uaccess_buffer_syscall_exit(void);
> > +
> > +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> > + unsigned long store_end_addr);
> > +
> > +#else
> > +
> > +static inline void uaccess_buffer_log_read(const void __user *from,
> > + unsigned long n)
> > +{
> > +}
> > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > +}
> > +
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > +}
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > +}
> > +
> > +static inline int uaccess_buffer_set_logging(unsigned long addr,
> > + unsigned long size,
> > + unsigned long store_end_addr)
> > +{
> > + return -EINVAL;
> > +}
> > +#endif
> > +
> > +#endif /* _LINUX_ACCESS_BUFFER_H */
> > diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> > new file mode 100644
> > index 000000000000..a6cefe6e73b5
> > --- /dev/null
> > +++ b/include/linux/uaccess_buffer_info.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> > +#define _LINUX_ACCESS_BUFFER_INFO_H
> > +
> > +#include <uapi/asm/signal.h>
> > +
> > +#ifdef CONFIG_UACCESS_BUFFER
> > +
> > +struct uaccess_buffer_info {
>
> Please add comments for each variable if possible.

Done.

> > + unsigned long addr, size;
> > + unsigned long store_end_addr;
> > + sigset_t saved_sigmask;
> > + u8 state;
>
> I think 'state' is less clear than it could be. Looking at the rest of
> the code, 'state' could just be 'enabled' (or 'enable_count')?

I ended up removing this field as a result of the new userspace API.

> > +};
> > +
> > +#else
> > +
> > +struct uaccess_buffer_info {
> > +};
>
> I'm not sure what's cleaner: a) in the #else case defining an empty
> struct, or b) in the #else case not defining the struct and guarding the
> use in sched.h with an #ifdef.
>
> (b) is currently the standard way of doing it, and unless it improves
> readability elsewhere (e.g. need to pass current->uaccess_buffer to a
> function outside kernel/uaccess_buffer.c), I'd probably just do (b).

Yeah, and as I mentioned in my other reply I think it also avoids a
bit of struct padding in the case where the feature is disabled.

> > +#endif
> > +
> > +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 43bd7f713c39..d8baacaef800 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -269,4 +269,13 @@ struct prctl_mm_map {
> > # define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
> > # define PR_SCHED_CORE_MAX 4
> >
> > +/* Log uaccesses to a user-provided buffer */
> > +#define PR_LOG_UACCESS 63
> > +
> > +/* Format of the entries in the uaccess log. */
> > +struct uaccess_buffer_entry {
> > + __u64 addr, size, flags;
>
> Comments for the above variables (in which case putting them on their
> own line would be preferred).

Done.

> > +};
> > +# define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
>
> The struct and UACCESS_BUFFER_FLAG_WRITE does not belong in prctl.h, and
> should probably go in another appropriate header in uapi/linux.

Done.

> > +
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 4df609be42d0..75a5d95ce9c3 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> > obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> > obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > obj-$(CONFIG_CFI_CLANG) += cfi.o
> > +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
> >
> > obj-$(CONFIG_PERF_EVENTS) += events/
> >
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index bf16395b9e13..c7e7ff8cbab3 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
> > if (work & SYSCALL_WORK_ENTER)
> > syscall = syscall_trace_enter(regs, syscall, work);
> >
> > + uaccess_buffer_syscall_entry();
>
> This is also missing an #include.

Done.

>
> > +
> > return syscall;
> > }
> >
> > @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> > local_irq_enable();
> > }
> >
> > + uaccess_buffer_syscall_exit();
> > rseq_syscall(regs);
> >
> > /*
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 8fdac0d90504..df487600773c 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -42,6 +42,7 @@
> > #include <linux/version.h>
> > #include <linux/ctype.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess_buffer.h>
> >
> > #include <linux/compat.h>
> > #include <linux/syscalls.h>
> > @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> > break;
> > #endif
> > + case PR_LOG_UACCESS:
> > + if (arg5)
> > + return -EINVAL;
> > + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> > diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> > new file mode 100644
> > index 000000000000..b9da89887c4b
> > --- /dev/null
> > +++ b/kernel/uaccess_buffer.c
> > @@ -0,0 +1,118 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> This is a new file, and probably needs a Copyright header. See
> e.g. mm/kfence/core.c for template.

Done.

> > +#include <linux/compat.h>
> > +#include <linux/prctl.h>
> > +#include <linux/sched.h>
> > +#include <linux/signal.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/uaccess_buffer.h>
> > +#include <linux/uaccess_buffer_info.h>
> > +
> > +#ifdef CONFIG_UACCESS_BUFFER
> > +
>
> The #ifdef is redundant, given the Makefile already guards compilation
> of this file.

Removed.

> > +/*
> > + * We use a separate implementation of copy_to_user() that avoids the call
> > + * to instrument_copy_to_user() as this would otherwise lead to infinite
> > + * recursion.
> > + */
> > +static unsigned long
> > +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> > +{
> > + if (!access_ok(to, n))
> > + return n;
> > + return raw_copy_to_user(to, from, n);
> > +}
> > +
> > +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> > + unsigned long flags)
> > +{
> > + struct uaccess_buffer_entry entry;
> > +
> > + if (current->uaccess_buffer.size < sizeof(entry) ||
> > + unlikely(uaccess_kernel()))
> > + return;
> > + entry.addr = addr;
> > + entry.size = size;
> > + entry.flags = flags;
> > +
> > + /*
> > + * If our uaccess fails, abort the log so that the end address writeback
> > + * does not occur and userspace sees zero accesses.
> > + */
> > + if (uaccess_buffer_copy_to_user(
> > + (void __user *)current->uaccess_buffer.addr, &entry,
> > + sizeof(entry))) {
> > + current->uaccess_buffer.state = 0;
> > + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> > + }
> > +
> > + current->uaccess_buffer.addr += sizeof(entry);
> > + current->uaccess_buffer.size -= sizeof(entry);
> > +}
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)from, n, 0);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_read);
> > +
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_write);
> > +
> > +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> > + unsigned long store_end_addr)
> > +{
> > + sigset_t temp_sigmask;
> > +
> > + current->uaccess_buffer.addr = addr;
> > + current->uaccess_buffer.size = size;
> > + current->uaccess_buffer.store_end_addr = store_end_addr;
> > +
> > + /*
> > + * Allow 2 syscalls before resetting the state: the current one (i.e.
> > + * prctl) and the next one, whose accesses we want to log.
> > + */
> > + current->uaccess_buffer.state = 2;
> > +
> > + /*
> > + * Temporarily mask signals so that an intervening asynchronous signal
> > + * will not interfere with the logging.
> > + */
> > + current->uaccess_buffer.saved_sigmask = current->blocked;
> > + sigfillset(&temp_sigmask);
> > + sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> > + __set_current_blocked(&temp_sigmask);
> > +
> > + return 0;
> > +}
> > +
> > +void uaccess_buffer_syscall_entry(void)
> > +{
> > + /*
> > + * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> > + * to reset the mask before the syscall and not after, so that our
> > + * temporary mask is unobservable.
> > + */
> > + if (current->uaccess_buffer.state == 1)
> > + __set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> > +}
> > +
> > +void uaccess_buffer_syscall_exit(void)
> > +{
> > + if (current->uaccess_buffer.state > 0) {
>
> This could just be
>
> + if (!current->uaccess_buffer.state)
> + return;
>
> .. which avoids the deep nesting below and helps with formatting.
>
> This is also the point where the name 'state' just seems too generic, and
> 'enabled' (or 'enable_count') would make it clearer what is happening.

This code is removed as a result of the new interface.

Peter

2021-11-23 05:18:27

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

On Sat, Sep 25, 2021 at 7:20 PM Kees Cook <[email protected]> wrote:
>
> On Fri, Sep 24, 2021 at 02:50:04PM -0700, Peter Collingbourne wrote:
> > On Wed, Sep 22, 2021 at 8:59 AM Jann Horn <[email protected]> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <[email protected]> wrote:
> > > > On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > > > > Peter Collingbourne <[email protected]> writes:
> > > > > > This patch introduces a kernel feature known as uaccess logging.
> > > > > > [...]
> > > > > [...]
> > > > > How is logging the kernel's activity like this not a significant
> > > > > information leak? How is this safe for unprivileged users?
> > > > [...]
> > > > Regardless, this is a pretty useful tool for this kind of fuzzing.
> > > > Perhaps the timing exposure could be mitigated by having the kernel
> > > > collect the record in a separate kernel-allocated buffer and flush the
> > > > results to userspace at syscall exit? (This would solve the
> > > > copy_to_user() recursion issue too.)
> >
> > Seems reasonable. I suppose that in terms of timing information we're
> > already (unavoidably) exposing how long the syscall took overall, and
> > we probably shouldn't deliberately expose more than that.
>
> Right -- I can't think of anything that can really use this today,
> but it very much feels like the kind of information that could aid in
> a timing race.

Okay, this now goes via a kernel-allocated buffer.

> > That being said, I'm wondering if that has security implications on
> > its own if it's then possible for userspace to manipulate the kernel
> > into allocating a large buffer (either at prctl() time or as a result
> > of getting the kernel to do a large number of uaccesses). Perhaps it
> > can be mitigated by limiting the size of the uaccess buffer provided
> > at prctl() time.
>
> There are a lot of exact-size allocation controls already (which I think
> is an unavoidable but separate issue[1]), but perhaps this could be
> mitigated by making the reserved buffer be PAGE_SIZE granular?

I was more thinking about userspace causing a kernel OOM or something
by making the kernel allocate large buffers. I decided to mitigate it
by putting an upper limit on the size of the kernel-side buffer.

Since it sounds like exact-size allocations are a pre-existing issue
we probably don't need to do anything about them at this time.
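
As a rough sketch of that mitigation (the limit and the helper here are
illustrative, not taken from the v2 patch):

#define UACCESS_BUFFER_MAX_ENTRIES 1024 /* illustrative cap */

static struct uaccess_buffer_entry *uaccess_buffer_alloc(u64 nr_entries)
{
        /* Clamp the request so userspace cannot force huge allocations. */
        if (nr_entries > UACCESS_BUFFER_MAX_ENTRIES)
                nr_entries = UACCESS_BUFFER_MAX_ENTRIES;
        return kcalloc(nr_entries, sizeof(struct uaccess_buffer_entry),
                       GFP_KERNEL);
}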

> > > One aspect that might benefit from some clarification on intended
> > > behavior is: what should happen if there are BPF tracing programs
> > > running (possibly as part of some kind of system-wide profiling or
> > > such) that poke around in userspace memory with BPF's uaccess helpers
> > > (especially "bpf_copy_from_user")?
> >
> > I think we should probably be ignoring those accesses, since we cannot
> > know a priori whether the accesses are directly associated with the
> > syscall or not, and this is after all a best-effort mechanism.
>
> Perhaps the "don't log this uaccess" flag I suggested could be
> repurposed by BPF too, as a general "make this access invisible to
> PR_LOG_UACCESS" flag? i.e. this bit:

Since we ended up not needing this flag (because of the kernel-side
buffer) I ended up just making BPF use raw_copy_from_user().
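
Roughly speaking the idea is the following (a sketch only; the real helper
lives in kernel/bpf/helpers.c and differs in detail):

/* Copy via raw_copy_from_user() so that instrument_copy_from_user(), and
 * hence the uaccess log, is never involved. */
static int bpf_copy_from_user_unlogged(void *dst, u32 size,
                                       const void __user *user_ptr)
{
        int ret = -EFAULT;

        if (access_ok(user_ptr, size) &&
            raw_copy_from_user(dst, user_ptr, size) == 0)
                ret = 0;
        if (ret)
                memset(dst, 0, size);
        return ret;
}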

> > > > Instead of reimplementing copy_*_user() with a new wrapper that
> > > > bypasses some checks and adds others and has to stay in sync, etc,
> > > > how about just adding a "recursion" flag? Something like:
> > > >
> > > > copy_from_user(...)
> > > >     instrument_copy_from_user(...)
> > > >         uaccess_buffer_log_read(...)
> > > >             if (current->uaccess_buffer.writing)
> > > >                 return;
> > > >             uaccess_buffer_log(...)
> > > >                 current->uaccess_buffer.writing = true;
> > > >                 copy_to_user(...)
> > > >                 current->uaccess_buffer.writing = false;
>
>
>
> > > > This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> > > > program wants to collect the results after every syscall. And maybe this
> > > > won't make any sense across exec (losing the mm that was used during
> > > > SECCOMP_SET_UACCESS_TRACE_BUFFER). Hmmm.
> > >
> > > And then I guess your plan would be that userspace would be expected
> > > to use the userspace instruction pointer
> > > (seccomp_data::instruction_pointer) to indicate instructions that
> > > should be traced?
>
> That could be one way -- but seccomp filters would allow a bunch of
> ways.
>
> > >
> > > Or instead of seccomp, you could do it kinda like
> > > https://www.kernel.org/doc/html/latest/admin-guide/syscall-user-dispatch.html
> > > , with a prctl that specifies a specific instruction pointer?
> >
> > Given a choice between these two options, I would prefer the prctl()
> > because userspace programs may already be using seccomp filters and
> > sanitizers shouldn't interfere with it.
>
> That's fair -- the "I wish we could make complex decisions about which
> syscalls to act on" sounds like seccomp.
>
> > However, in either the seccomp filter or prctl() case, you still have
> > the problem of deciding where to log to. Keep in mind that you would
> > need to prevent intervening async signals (that occur between when the
> > syscall happens and when we read the log) from triggering additional
>
> Could the sig handler also set the "make the uaccess invisible" flag?
> (It would need to be a "depth" flag, most likely.)

It's more complicated than that because you can longjmp() out of a
signal handler and that won't necessarily call sigreturn(). The kernel
doesn't really have a concept of "depth" as applied to signal
handlers; it's all managed on the userspace stack.

I brainstormed this with Dmitry a bit out of band and we came up with
a nice solution that avoids the two syscalls, is arch-generic and
avoids the problem with asynchronous signal handlers. I'll paste a bit
from the documentation that I wrote, but please see the full
documentation in v2 patch 5/5 for more details.

The feature may be used via the following prctl:

.. code-block:: c

    uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
    prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);

Supplying a non-zero address as the second argument to ``prctl``
will cause the kernel to read an address from that address on each
kernel entry (referred to as the *uaccess descriptor address*).

When entering the kernel to handle a syscall with a non-zero uaccess
descriptor address, the kernel will read a data structure of type
``struct uaccess_descriptor`` from the uaccess descriptor address,
which is defined as follows:

.. code-block:: c

    struct uaccess_descriptor {
            uint64_t addr, size;
    };

This data structure contains the address and size (in array elements)
of a *uaccess buffer*, which is an array of data structures of type
``struct uaccess_buffer_entry``. Before returning to userspace, the
kernel will log information about uaccesses to sequential entries
in the uaccess buffer. It will also store ``NULL`` to the uaccess
descriptor address, and store the address and size of the unused
portion of the uaccess buffer to the uaccess descriptor.

[...]

When entering the kernel for a reason other than a syscall (for
example, when IPI'd due to an incoming asynchronous signal) with
a non-zero uaccess descriptor address, any signals other
than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
initialized with ``sigfillset(set)``. This is to prevent incoming
signals from interfering with uaccess logging.
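
For illustration, here is a minimal userspace sketch of that flow, based
only on the excerpt above. The prctl name and both structs are taken from
the v2 documentation; building it needs the patched uapi headers for the
PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR value, and the local struct definitions
should be dropped if your headers already provide them.

#include <stdint.h>
#include <stdio.h>
#include <sys/prctl.h>   /* patched headers: PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR */
#include <sys/utsname.h>

/* These mirror the v2 uapi definitions, repeated here only for readability. */
struct uaccess_descriptor {
        uint64_t addr, size;
};
struct uaccess_buffer_entry {
        uint64_t addr, size, flags;
};

static __thread uint64_t uaccess_desc_addr; /* the per-thread "TLS slot" */

int main(void)
{
        /* Register the slot once per thread; zero in the slot means
         * "do not log the next syscall". */
        prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &uaccess_desc_addr, 0, 0, 0);

        struct uaccess_buffer_entry entries[64];
        struct uaccess_descriptor desc = {
                .addr = (uint64_t)entries,
                .size = sizeof(entries) / sizeof(entries[0]), /* in elements */
        };

        /* Arm logging for the next syscall by pointing the slot at desc. */
        uaccess_desc_addr = (uint64_t)&desc;

        struct utsname un;
        uname(&un);

        /* The kernel has cleared the slot and rewritten desc to describe the
         * unused portion, so desc.addr is one past the last logged entry. */
        struct uaccess_buffer_entry *end =
                (struct uaccess_buffer_entry *)desc.addr;
        for (struct uaccess_buffer_entry *e = entries; e != end; ++e)
                printf("%s at 0x%lx size 0x%lx\n",
                       (e->flags & 1) ? "WRITE" : "READ", /* bit 0 = write */
                       (unsigned long)e->addr, (unsigned long)e->size);
        return 0;
}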

Peter