2021-11-23 05:17:15

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 0/5] kernel: introduce uaccess logging

This patch series introduces a kernel feature known as uaccess
logging, which allows userspace programs to be made aware of the
address and size of uaccesses performed by the kernel during
the servicing of a syscall. More details on the motivation
for and interface to this feature are available in the file
Documentation/admin-guide/uaccess-logging.rst added by the final
patch in the series.

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64
and architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

The proposed interface is the result of numerous iterations and
prototyping and is based on a proposal by Dmitry Vyukov. The interface
preserves the correspondence between uaccess log identity and syscall
identity while tolerating incoming asynchronous signals in the interval
between setting up the logging and the actual syscall. We considered
a number of alternative designs but rejected them for various reasons:

- The design from v1 of this patch [1] proposed notifying the kernel
of the address and size of the uaccess buffer via a prctl that
would also automatically mask and unmask asynchronous signals as
needed, but this would require multiple syscalls per "real" syscall,
harming performance.

- We considered extending the syscall calling convention to
designate currently-unused registers to be used to pass the
location of the uaccess buffer, but this was rejected for being
architecture-specific.

- One idea that we considered involved using the stack pointer address
as a unique identifier for the syscall, but this currently would
need to be arch-specific as we currently do not appear to have an
arch-generic way of retrieving the stack pointer; the userspace
side would also need some arch-specific code for this to work. It's
also possible that a longjmp() past the signal handler would make
the stack pointer address not unique enough for this purpose.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
to trace per-syscall. This is so that we can exclude syscalls that
should not be traced, such as syscalls that make up part of the
sanitizer implementation (to avoid infinite recursion when e.g. printing
an error report).

- Tracing would need to be synchronous in order to produce useful
stack traces. For example this could be achieved using the new SIGTRAP
on perf events mechanism. However, this would require logging each
access to the stack (in the form of a sigcontext) and this is more
likely to overflow the stack due to being much larger than a uaccess
buffer entry as well as being unbounded, in contrast to the bounded
buffer size passed to prctl(). An approach based on signal handlers is
also likely to fall foul of the asynchronous signal issues mentioned
previously, together with needing sigreturn to be handled specially
(because it copies a sigcontext from userspace) otherwise we could
never return from the signal handler. Furthermore, arguments to the
trace events are not available to SIGTRAP. (This on its own wouldn't
be insurmountable though -- we could add the arguments as fields
to siginfo.)

- The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
-- e.g. trace_pipe_raw gives access to the internal ring buffer, but
I don't think it's usable because it's per-CPU and not per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs may
only be loaded as root, among other potential headaches.

[1] https://lore.kernel.org/all/[email protected]/

Peter Collingbourne (5):
fs: use raw_copy_from_user() to copy mount() data
uaccess-buffer: add core code
uaccess-buffer: add CONFIG_GENERIC_ENTRY support
arm64: add support for uaccess logging
Documentation: document uaccess logging

Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
arch/Kconfig | 6 +
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/signal.c | 5 +
arch/arm64/kernel/syscall.c | 3 +
fs/exec.c | 2 +
fs/namespace.c | 7 +-
include/linux/instrumented.h | 5 +-
include/linux/sched.h | 4 +
include/linux/uaccess-buffer-log-hooks.h | 59 +++++++
include/linux/uaccess-buffer.h | 79 ++++++++++
include/uapi/linux/prctl.h | 3 +
include/uapi/linux/uaccess-buffer.h | 25 +++
kernel/Makefile | 1 +
kernel/bpf/helpers.c | 6 +-
kernel/entry/common.c | 7 +
kernel/fork.c | 3 +
kernel/signal.c | 4 +-
kernel/sys.c | 6 +
kernel/uaccess-buffer.c | 125 +++++++++++++++
21 files changed, 497 insertions(+), 4 deletions(-)
create mode 100644 Documentation/admin-guide/uaccess-logging.rst
create mode 100644 include/linux/uaccess-buffer-log-hooks.h
create mode 100644 include/linux/uaccess-buffer.h
create mode 100644 include/uapi/linux/uaccess-buffer.h
create mode 100644 kernel/uaccess-buffer.c

--
2.34.0.rc2.393.gf8c9666880-goog



2021-11-23 05:17:17

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 1/5] fs: use raw_copy_from_user() to copy mount() data

With uaccess logging the contract is that the kernel must not report
accessing more data than necessary, as this can lead to false positive
reports in downstream consumers. This generally works out of the box
when instrumenting copy_{from,to}_user(), but with the data argument
to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
much as we can, if the PAGE_SIZE sized access failed) and figure out
later how much we actually need.

To prevent this from leading to a false positive report, use
raw_copy_from_user(), which will prevent the access from being logged.
Recall that it is valid for the kernel to report accessing less
data than it actually accessed, as uaccess logging is a best-effort
mechanism for reporting uaccesses.

Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
Signed-off-by: Peter Collingbourne <[email protected]>
---
fs/namespace.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 659a8f39c61a..695b30e391f0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3197,7 +3197,12 @@ static void *copy_mount_options(const void __user * data)
if (!copy)
return ERR_PTR(-ENOMEM);

- left = copy_from_user(copy, data, PAGE_SIZE);
+ /*
+ * Use raw_copy_from_user to avoid reporting overly large accesses in
+ * the uaccess buffer, as this can lead to false positive reports in
+ * downstream consumers.
+ */
+ left = raw_copy_from_user(copy, data, PAGE_SIZE);

/*
* Not all architectures have an exact copy_from_user(). Resort to
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-23 05:17:21

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 2/5] uaccess-buffer: add core code

Add the core code to support uaccess logging. Subsequent patches will
hook this up to the arch-specific kernel entry and exit code for
certain architectures.

Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
Signed-off-by: Peter Collingbourne <[email protected]>
---
v2:
- New interface that avoids multiple syscalls per real syscall and
is arch-generic
- Avoid logging uaccesses done by BPF programs
- Add documentation
- Split up into multiple patches
- Various code moves, renames etc as requested by Marco

arch/Kconfig | 5 +
fs/exec.c | 2 +
include/linux/instrumented.h | 5 +-
include/linux/sched.h | 4 +
include/linux/uaccess-buffer-log-hooks.h | 59 +++++++++++
include/linux/uaccess-buffer.h | 79 ++++++++++++++
include/uapi/linux/prctl.h | 3 +
include/uapi/linux/uaccess-buffer.h | 25 +++++
kernel/Makefile | 1 +
kernel/bpf/helpers.c | 6 +-
kernel/fork.c | 3 +
kernel/signal.c | 4 +-
kernel/sys.c | 6 ++
kernel/uaccess-buffer.c | 125 +++++++++++++++++++++++
14 files changed, 324 insertions(+), 3 deletions(-)
create mode 100644 include/linux/uaccess-buffer-log-hooks.h
create mode 100644 include/linux/uaccess-buffer.h
create mode 100644 include/uapi/linux/uaccess-buffer.h
create mode 100644 kernel/uaccess-buffer.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 26b8ed11639d..6030298a7e9a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1302,6 +1302,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
config DYNAMIC_SIGFRAME
bool

+config HAVE_ARCH_UACCESS_BUFFER
+ bool
+ help
+ Select if the architecture's syscall entry/exit code supports uaccess buffers.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..5f30314f3ec6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
#include <linux/vmalloc.h>
#include <linux/io_uring.h>
#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>

#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1313,6 +1314,7 @@ int begin_new_exec(struct linux_binprm * bprm)
me->personality &= ~bprm->per_clear;

clear_syscall_work_syscall_user_dispatch(me);
+ uaccess_buffer_set_descriptor_addr_addr(0);

/*
* We have to apply CLOEXEC before we change whether the process is
diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
index 42faebbaa202..c96be1695614 100644
--- a/include/linux/instrumented.h
+++ b/include/linux/instrumented.h
@@ -2,7 +2,7 @@

/*
* This header provides generic wrappers for memory access instrumentation that
- * the compiler cannot emit for: KASAN, KCSAN.
+ * the compiler cannot emit for: KASAN, KCSAN, uaccess buffers.
*/
#ifndef _LINUX_INSTRUMENTED_H
#define _LINUX_INSTRUMENTED_H
@@ -11,6 +11,7 @@
#include <linux/kasan-checks.h>
#include <linux/kcsan-checks.h>
#include <linux/types.h>
+#include <linux/uaccess-buffer-log-hooks.h>

/**
* instrument_read - instrument regular read access
@@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
{
kasan_check_read(from, n);
kcsan_check_read(from, n);
+ uaccess_buffer_log_write(to, n);
}

/**
@@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
{
kasan_check_write(to, n);
kcsan_check_write(to, n);
+ uaccess_buffer_log_read(from, n);
}

#endif /* _LINUX_INSTRUMENTED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..1f978deaa3f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1484,6 +1484,10 @@ struct task_struct {
struct callback_head l1d_flush_kill;
#endif

+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+ struct uaccess_buffer_info uaccess_buffer;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/uaccess-buffer-log-hooks.h b/include/linux/uaccess-buffer-log-hooks.h
new file mode 100644
index 000000000000..bddc84ddce32
--- /dev/null
+++ b/include/linux/uaccess-buffer-log-hooks.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
+#define _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
+
+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+
+struct uaccess_buffer_info {
+ /*
+ * The pointer to pointer to struct uaccess_descriptor. This is the
+ * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
+ */
+ struct uaccess_descriptor __user *__user *desc_ptr_ptr;
+
+ /*
+ * The pointer to struct uaccess_descriptor read at syscall entry time.
+ */
+ struct uaccess_descriptor __user *desc_ptr;
+
+ /*
+ * A pointer to the kernel's temporary copy of the uaccess log for the
+ * current syscall. We log to a kernel buffer in order to avoid leaking
+ * timing information to userspace.
+ */
+ struct uaccess_buffer_entry *kbegin;
+
+ /*
+ * The position of the next uaccess buffer entry for the current
+ * syscall.
+ */
+ struct uaccess_buffer_entry *kcur;
+
+ /*
+ * A pointer to the end of the kernel's uaccess log.
+ */
+ struct uaccess_buffer_entry *kend;
+
+ /*
+ * The pointer to the userspace uaccess log, as read from the
+ * struct uaccess_descriptor.
+ */
+ struct uaccess_buffer_entry __user *ubegin;
+};
+
+void uaccess_buffer_log_read(const void __user *from, unsigned long n);
+void uaccess_buffer_log_write(void __user *to, unsigned long n);
+
+#else
+
+static inline void uaccess_buffer_log_read(const void __user *from,
+ unsigned long n)
+{
+}
+static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+}
+
+#endif
+
+#endif /* _LINUX_UACCESS_BUFFER_LOG_HOOKS_H */
diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..14261368d3a9
--- /dev/null
+++ b/include/linux/uaccess-buffer.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UACCESS_BUFFER_H
+#define _LINUX_UACCESS_BUFFER_H
+
+#include <linux/sched.h>
+#include <uapi/linux/uaccess-buffer.h>
+
+#include <asm-generic/errno-base.h>
+
+#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+ return tsk->uaccess_buffer.desc_ptr_ptr;
+}
+
+void __uaccess_buffer_syscall_entry(void);
+static inline void uaccess_buffer_syscall_entry(void)
+{
+ if (uaccess_buffer_maybe_blocked(current))
+ __uaccess_buffer_syscall_entry();
+}
+
+void __uaccess_buffer_syscall_exit(void);
+static inline void uaccess_buffer_syscall_exit(void)
+{
+ if (uaccess_buffer_maybe_blocked(current))
+ __uaccess_buffer_syscall_exit();
+}
+
+bool __uaccess_buffer_pre_exit_loop(void);
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+ if (!uaccess_buffer_maybe_blocked(current))
+ return false;
+ return __uaccess_buffer_pre_exit_loop();
+}
+
+void __uaccess_buffer_post_exit_loop(void);
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+ if (pending)
+ __uaccess_buffer_post_exit_loop();
+}
+
+void uaccess_buffer_cancel_log(struct task_struct *tsk);
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
+
+#else
+
+static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
+{
+ return false;
+}
+
+static inline void uaccess_buffer_syscall_entry(void)
+{
+}
+static inline void uaccess_buffer_syscall_exit(void)
+{
+}
+static inline bool uaccess_buffer_pre_exit_loop(void)
+{
+ return false;
+}
+static inline void uaccess_buffer_post_exit_loop(bool pending)
+{
+}
+static inline void uaccess_buffer_cancel_log(struct task_struct *tsk)
+{
+}
+
+static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+ return -EINVAL;
+}
+#endif
+
+#endif /* _LINUX_UACCESS_BUFFER_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index bb73e9a0b24f..74b37469c7b3 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -272,4 +272,7 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2

+/* Configure uaccess logging feature */
+#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
new file mode 100644
index 000000000000..619b17dc25c4
--- /dev/null
+++ b/include/uapi/linux/uaccess-buffer.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
+#define _UAPI_LINUX_UACCESS_BUFFER_H
+
+/* Location of the uaccess log. */
+struct uaccess_descriptor {
+ /* Address of the uaccess_buffer_entry array. */
+ __u64 addr;
+ /* Size of the uaccess_buffer_entry array in number of elements. */
+ __u64 size;
+};
+
+/* Format of the entries in the uaccess log. */
+struct uaccess_buffer_entry {
+ /* Address being accessed. */
+ __u64 addr;
+ /* Number of bytes that were accessed. */
+ __u64 size;
+ /* UACCESS_BUFFER_* flags. */
+ __u64 flags;
+};
+
+#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
+
+#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 186c49582f45..d4d9be5146c3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
obj-$(CONFIG_CFI_CLANG) += cfi.o
+obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 649f07623df6..167b50177066 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -637,7 +637,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
const void __user *, user_ptr)
{
- int ret = copy_from_user(dst, user_ptr, size);
+ /*
+ * Avoid copy_from_user() here as it may leak information about the BPF
+ * program to userspace via the uaccess buffer.
+ */
+ int ret = raw_copy_from_user(dst, user_ptr, size);

if (unlikely(ret)) {
memset(dst, 0, size);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..c7abe7e7c7cd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -96,6 +96,7 @@
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
+#include <linux/uaccess-buffer.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -890,6 +891,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
if (memcg_charge_kernel_stack(tsk))
goto free_stack;

+ uaccess_buffer_cancel_log(tsk);
+
stack_vm_area = task_stack_vm_area(tsk);

err = arch_dup_task_struct(tsk, orig);
diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..69bf21518bd0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -45,6 +45,7 @@
#include <linux/posix-timers.h>
#include <linux/cgroup.h>
#include <linux/audit.h>
+#include <linux/uaccess-buffer.h>

#define CREATE_TRACE_POINTS
#include <trace/events/signal.h>
@@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
if (sig_fatal(p, sig) &&
!(signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
- (sig == SIGKILL || !p->ptrace)) {
+ (sig == SIGKILL ||
+ !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
/*
* This signal will be fatal to the whole group.
*/
diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..c71a9a9c0f68 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/version.h>
#include <linux/ctype.h>
#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess-buffer.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = uaccess_buffer_set_descriptor_addr_addr(arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
new file mode 100644
index 000000000000..e1c6d6ab9af8
--- /dev/null
+++ b/kernel/uaccess-buffer.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Support for uaccess logging via uaccess buffers.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#include <linux/compat.h>
+#include <linux/mm.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uaccess-buffer.h>
+
+static void uaccess_buffer_log(unsigned long addr, unsigned long size,
+ unsigned long flags)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_buffer_entry *entry = buf->kcur;
+
+ if (!entry || unlikely(uaccess_kernel()))
+ return;
+ entry->addr = addr;
+ entry->size = size;
+ entry->flags = flags;
+
+ ++buf->kcur;
+ if (buf->kcur == buf->kend)
+ buf->kcur = 0;
+}
+
+void uaccess_buffer_log_read(const void __user *from, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)from, n, 0);
+}
+EXPORT_SYMBOL(uaccess_buffer_log_read);
+
+void uaccess_buffer_log_write(void __user *to, unsigned long n)
+{
+ uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
+}
+EXPORT_SYMBOL(uaccess_buffer_log_write);
+
+int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
+{
+ current->uaccess_buffer.desc_ptr_ptr =
+ (struct uaccess_descriptor __user * __user *)addr;
+ uaccess_buffer_cancel_log(current);
+ return 0;
+}
+
+bool __uaccess_buffer_pre_exit_loop(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_descriptor __user *desc_ptr;
+ sigset_t tmp_mask;
+
+ if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
+ return false;
+
+ current->real_blocked = current->blocked;
+ sigfillset(&tmp_mask);
+ set_current_blocked(&tmp_mask);
+ return true;
+}
+
+void __uaccess_buffer_post_exit_loop(void)
+{
+ spin_lock_irq(&current->sighand->siglock);
+ current->blocked = current->real_blocked;
+ recalc_sigpending();
+ spin_unlock_irq(&current->sighand->siglock);
+}
+
+void uaccess_buffer_cancel_log(struct task_struct *tsk)
+{
+ struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
+
+ if (buf->kcur) {
+ buf->kcur = 0;
+ kfree(buf->kbegin);
+ }
+}
+
+void __uaccess_buffer_syscall_entry(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ struct uaccess_descriptor desc;
+
+ if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
+ put_user(0, buf->desc_ptr_ptr) ||
+ copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
+ return;
+
+ if (desc.size > 1024)
+ desc.size = 1024;
+
+ buf->kbegin = kmalloc_array(
+ desc.size, sizeof(struct uaccess_buffer_entry), GFP_KERNEL);
+ buf->kcur = buf->kbegin;
+ buf->kend = buf->kbegin + desc.size;
+ buf->ubegin = (struct uaccess_buffer_entry __user *)desc.addr;
+}
+
+void __uaccess_buffer_syscall_exit(void)
+{
+ struct uaccess_buffer_info *buf = &current->uaccess_buffer;
+ u64 num_entries = buf->kcur - buf->kbegin;
+ struct uaccess_descriptor desc;
+
+ if (!buf->kcur)
+ return;
+
+ desc.addr = (u64)(buf->ubegin + num_entries);
+ desc.size = buf->kend - buf->kcur;
+ buf->kcur = 0;
+ if (copy_to_user(buf->ubegin, buf->kbegin,
+ num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
+ (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
+
+ kfree(buf->kbegin);
+}
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-23 05:17:30

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 3/5] uaccess-buffer: add CONFIG_GENERIC_ENTRY support

Add uaccess logging support on architectures that use
CONFIG_GENERIC_ENTRY (currently only s390 and x86).

Link: https://linux-review.googlesource.com/id/I3c5eb19a7e4a1dbe6095f6971f7826c4b0663f7d
Signed-off-by: Peter Collingbourne <[email protected]>
---
arch/Kconfig | 1 +
kernel/entry/common.c | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6030298a7e9a..bc9cdbfdb57e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -31,6 +31,7 @@ config HOTPLUG_SMT
bool

config GENERIC_ENTRY
+ select HAVE_ARCH_UACCESS_BUFFER
bool

config KPROBES
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..e06f88f77a70 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
#include <linux/livepatch.h>
#include <linux/audit.h>
#include <linux/tick.h>
+#include <linux/uaccess-buffer.h>

#include "common.h"

@@ -89,6 +90,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);

+ uaccess_buffer_syscall_entry();
+
return syscall;
}

@@ -197,14 +200,17 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+ bool uaccess_buffer_pending;

lockdep_assert_irqs_disabled();

/* Flush pending rcuog wakeup before the last need_resched() check */
tick_nohz_user_enter_prepare();

+ uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
+ uaccess_buffer_post_exit_loop(uaccess_buffer_pending);

arch_exit_to_user_mode_prepare(regs, ti_work);

@@ -271,6 +277,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
local_irq_enable();
}

+ uaccess_buffer_syscall_exit();
rseq_syscall(regs);

/*
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-23 05:17:34

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 4/5] arm64: add support for uaccess logging

arm64 does not use CONFIG_GENERIC_ENTRY, so add the support for
uaccess logging directly to the architecture.

Link: https://linux-review.googlesource.com/id/I88de539fb9c4a9d27fa8cccbe201a6e4382faf89
Signed-off-by: Peter Collingbourne <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/signal.c | 5 +++++
arch/arm64/kernel/syscall.c | 3 +++
3 files changed, 9 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..6023946abe4a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -161,6 +161,7 @@ config ARM64
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ select HAVE_ARCH_UACCESS_BUFFER
select HAVE_ARCH_VMAP_STACK
select HAVE_ARM_SMCCC
select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 8f6372b44b65..5bbd98e5c257 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/tracehook.h>
#include <linux/ratelimit.h>
#include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>

#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
@@ -919,6 +920,8 @@ static void do_signal(struct pt_regs *regs)

void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
{
+ bool uaccess_buffer_pending = uaccess_buffer_pre_exit_loop();
+
do {
if (thread_flags & _TIF_NEED_RESCHED) {
/* Unmask Debug and SError for the next task */
@@ -950,6 +953,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
local_daif_mask();
thread_flags = READ_ONCE(current_thread_info()->flags);
} while (thread_flags & _TIF_WORK_MASK);
+
+ uaccess_buffer_post_exit_loop(uaccess_buffer_pending);
}

unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 50a0f1a38e84..6de801c1af05 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -7,6 +7,7 @@
#include <linux/ptrace.h>
#include <linux/randomize_kstack.h>
#include <linux/syscalls.h>
+#include <linux/uaccess-buffer.h>

#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
@@ -139,7 +140,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
goto trace_exit;
}

+ uaccess_buffer_syscall_entry();
invoke_syscall(regs, scno, sc_nr, syscall_table);
+ uaccess_buffer_syscall_exit();

/*
* The tracing status may have changed under our feet, so we have to
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-23 05:17:36

by Peter Collingbourne

[permalink] [raw]
Subject: [PATCH v2 5/5] Documentation: document uaccess logging

Add documentation for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
Signed-off-by: Peter Collingbourne <[email protected]>
---
Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
2 files changed, 150 insertions(+)
create mode 100644 Documentation/admin-guide/uaccess-logging.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..4f6ee447ab2f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -54,6 +54,7 @@ ABI will be found here.
:maxdepth: 1

sysfs-rules
+ uaccess-logging

The rest of this manual consists of various unordered guides on how to
configure specific aspects of kernel behavior to your liking.
diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
new file mode 100644
index 000000000000..4b2b297afc00
--- /dev/null
+++ b/Documentation/admin-guide/uaccess-logging.rst
@@ -0,0 +1,149 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Uaccess Logging
+===============
+
+Background
+----------
+
+Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
+making use of the ARM Memory Tagging Extension (MTE) need to
+monitor all memory accesses in a program so that they can detect
+memory errors. Furthermore, fuzzing tools such as syzkaller need to
+monitor all memory accesses so that they know which parts of memory
+to fuzz. For accesses made purely in userspace, this is achieved
+via compiler instrumentation, or for MTE, via direct hardware
+support. However, accesses made by the kernel on behalf of the user
+program via syscalls (i.e. uaccesses) are normally invisible to
+these tools.
+
+Traditionally, the sanitizers have handled this by interposing the libc
+syscall stubs with a wrapper that checks the memory based on what we
+believe the uaccesses will be. However, this creates a maintenance
+burden: each syscall must be annotated with its uaccesses in order
+to be recognized by the sanitizer, and these annotations must be
+continuously updated as the kernel changes.
+
+The kernel's uaccess logging feature provides userspace tools with
+the address and size of each userspace access, thereby allowing these
+tools to report memory errors involving these accesses without needing
+annotations for every syscall.
+
+By relying on the kernel's actual uaccesses, rather than a
+reimplementation of them, the userspace memory safety tools may
+play a dual role of verifying the validity of kernel accesses. Even
+a sanitizer whose syscall wrappers have complete knowledge of the
+kernel's intended API may vary from the kernel's actual uaccesses due
+to kernel bugs. A sanitizer with knowledge of the kernel's actual
+uaccesses may produce more accurate error reports that reveal such
+bugs. For example, a kernel that accesses more memory than expected
+by the userspace program could indicate that either userspace or the
+kernel has the wrong idea about which kernel functionality is being
+requested -- either way, there is a bug.
+
+Interface
+---------
+
+The feature may be used via the following prctl:
+
+.. code-block:: c
+
+ uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
+ prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
+
+Supplying a non-zero address as the second argument to ``prctl``
+will cause the kernel to read an address (referred to as the *uaccess
+descriptor address*) from that address on each kernel entry.
+
+When entering the kernel with a non-zero uaccess descriptor address
+to handle a syscall, the kernel will read a data structure of type
+``struct uaccess_descriptor`` from the uaccess descriptor address,
+which is defined as follows:
+
+.. code-block:: c
+
+ struct uaccess_descriptor {
+ uint64_t addr, size;
+ };
+
+This data structure contains the address and size (in array elements)
+of a *uaccess buffer*, which is an array of data structures of type
+``struct uaccess_buffer_entry``. Before returning to userspace, the
+kernel will log information about uaccesses to sequential entries
+in the uaccess buffer. It will also store ``NULL`` to the uaccess
+descriptor address, and store the address and size of the unused
+portion of the uaccess buffer to the uaccess descriptor.
+
+The format of a uaccess buffer entry is defined as follows:
+
+.. code-block:: c
+
+ struct uaccess_buffer_entry {
+ uint64_t addr, size, flags;
+ };
+
+The meaning of ``addr`` and ``size`` should be obvious. On arm64,
+tag bits are preserved in the ``addr`` field. There is currently
+one flag bit assignment for the ``flags`` field:
+
+.. code-block:: c
+
+ #define UACCESS_BUFFER_FLAG_WRITE 1
+
+This flag is set if the access was a write, or clear if it was a
+read. The meaning of all other flag bits is reserved.
+
+When entering the kernel with a non-zero uaccess descriptor
+address for a reason other than a syscall (for example, when
+IPI'd due to an incoming asynchronous signal), any signals other
+than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
+``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
+initialized with ``sigfillset(set)``. This is to prevent incoming
+signals from interfering with uaccess logging.
+
+Example
+-------
+
+Here is an example of a code snippet that will enumerate the accesses
+performed by a ``uname(2)`` syscall:
+
+.. code-block:: c
+
+ struct uaccess_buffer_entry entries[64];
+ struct uaccess_descriptor desc;
+ uint64_t desc_addr = 0;
+ prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
+
+ desc.addr = (uint64_t)&entries;
+ desc.size = 64;
+ desc_addr = (uint64_t)&desc;
+
+ struct utsname un;
+ uname(&un);
+
+ struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
+ for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
+ printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
+ (unsigned long)entry->addr, (unsigned long)entry->size);
+ }
+
+Limitations
+-----------
+
+This feature is currently only supported on the arm64, s390 and x86
+architectures.
+
+Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
+course, not all of the accesses may fit in the buffer, but aside from
+that, not all internal kernel APIs that access userspace memory are
+covered. Therefore, userspace programs should tolerate unreported
+accesses.
+
+On the other hand, the kernel guarantees that it will not
+(intentionally) report accessing more data than it is specified
+to read. For example, if the kernel implements a syscall that is
+specified to read a data structure of size ``N`` bytes by first
+reading a page's worth of data and then only using the first ``N``
+bytes from it, the kernel will either report reading ``N`` bytes or
+not report the access at all.
--
2.34.0.rc2.393.gf8c9666880-goog


2021-11-23 07:46:37

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] Documentation: document uaccess logging

On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
>
> Add documentation for the uaccess logging feature.
>
> Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> Documentation/admin-guide/index.rst | 1 +
> Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
> 2 files changed, 150 insertions(+)
> create mode 100644 Documentation/admin-guide/uaccess-logging.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 1bedab498104..4f6ee447ab2f 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -54,6 +54,7 @@ ABI will be found here.
> :maxdepth: 1
>
> sysfs-rules
> + uaccess-logging
>
> The rest of this manual consists of various unordered guides on how to
> configure specific aspects of kernel behavior to your liking.
> diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
> new file mode 100644
> index 000000000000..4b2b297afc00
> --- /dev/null
> +++ b/Documentation/admin-guide/uaccess-logging.rst
> @@ -0,0 +1,149 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +Uaccess Logging
> +===============
> +
> +Background
> +----------
> +
> +Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
> +making use of the ARM Memory Tagging Extension (MTE) need to
> +monitor all memory accesses in a program so that they can detect
> +memory errors. Furthermore, fuzzing tools such as syzkaller need to
> +monitor all memory accesses so that they know which parts of memory
> +to fuzz. For accesses made purely in userspace, this is achieved
> +via compiler instrumentation, or for MTE, via direct hardware
> +support. However, accesses made by the kernel on behalf of the user
> +program via syscalls (i.e. uaccesses) are normally invisible to
> +these tools.
> +
> +Traditionally, the sanitizers have handled this by interposing the libc
> +syscall stubs with a wrapper that checks the memory based on what we
> +believe the uaccesses will be. However, this creates a maintenance
> +burden: each syscall must be annotated with its uaccesses in order
> +to be recognized by the sanitizer, and these annotations must be
> +continuously updated as the kernel changes.
> +
> +The kernel's uaccess logging feature provides userspace tools with
> +the address and size of each userspace access, thereby allowing these
> +tools to report memory errors involving these accesses without needing
> +annotations for every syscall.
> +
> +By relying on the kernel's actual uaccesses, rather than a
> +reimplementation of them, the userspace memory safety tools may
> +play a dual role of verifying the validity of kernel accesses. Even
> +a sanitizer whose syscall wrappers have complete knowledge of the
> +kernel's intended API may vary from the kernel's actual uaccesses due
> +to kernel bugs. A sanitizer with knowledge of the kernel's actual
> +uaccesses may produce more accurate error reports that reveal such
> +bugs. For example, a kernel that accesses more memory than expected
> +by the userspace program could indicate that either userspace or the
> +kernel has the wrong idea about which kernel functionality is being
> +requested -- either way, there is a bug.
> +
> +Interface
> +---------
> +
> +The feature may be used via the following prctl:
> +
> +.. code-block:: c
> +
> + uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
> + prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
> +
> +Supplying a non-zero address as the second argument to ``prctl``

Is it possible to unregister it? Is it what happens when 0 is passed
as addr? If so, please describe.
It may be handy to do one-off tracing with the address on stack.

> +will cause the kernel to read an address (referred to as the *uaccess
> +descriptor address*) from that address on each kernel entry.
> +
> +When entering the kernel with a non-zero uaccess descriptor address
> +to handle a syscall, the kernel will read a data structure of type
> +``struct uaccess_descriptor`` from the uaccess descriptor address,
> +which is defined as follows:
> +
> +.. code-block:: c
> +
> + struct uaccess_descriptor {
> + uint64_t addr, size;
> + };

Want to double check the extension story. If we ever want flags in
uaccess_descriptor, we can add a flag to prctl that would say that
address must point to uaccess_descriptor_v2 that contains flags,
right?
And similarly we can extend uaccess_buffer_entry, right?

> +This data structure contains the address and size (in array elements)
> +of a *uaccess buffer*, which is an array of data structures of type
> +``struct uaccess_buffer_entry``. Before returning to userspace, the
> +kernel will log information about uaccesses to sequential entries
> +in the uaccess buffer. It will also store ``NULL`` to the uaccess
> +descriptor address, and store the address and size of the unused
> +portion of the uaccess buffer to the uaccess descriptor.
> +
> +The format of a uaccess buffer entry is defined as follows:
> +
> +.. code-block:: c
> +
> + struct uaccess_buffer_entry {
> + uint64_t addr, size, flags;
> + };
> +
> +The meaning of ``addr`` and ``size`` should be obvious. On arm64,

I would say explicitly "addr and size contain address and size of the
user memory access".

> +tag bits are preserved in the ``addr`` field. There is currently
> +one flag bit assignment for the ``flags`` field:
> +
> +.. code-block:: c
> +
> + #define UACCESS_BUFFER_FLAG_WRITE 1
> +
> +This flag is set if the access was a write, or clear if it was a
> +read. The meaning of all other flag bits is reserved.
> +
> +When entering the kernel with a non-zero uaccess descriptor
> +address for a reason other than a syscall (for example, when
> +IPI'd due to an incoming asynchronous signal), any signals other
> +than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> +``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> +initialized with ``sigfillset(set)``. This is to prevent incoming
> +signals from interfering with uaccess logging.
> +
> +Example
> +-------
> +
> +Here is an example of a code snippet that will enumerate the accesses
> +performed by a ``uname(2)`` syscall:
> +
> +.. code-block:: c
> +
> + struct uaccess_buffer_entry entries[64];
> + struct uaccess_descriptor desc;
> + uint64_t desc_addr = 0;
> + prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
> +
> + desc.addr = (uint64_t)&entries;
> + desc.size = 64;
> + desc_addr = (uint64_t)&desc;

We don't need any additional compiler barriers here, right?
It seems that we only need to prevent re-ordering of these writes with
the next and previous syscalls, which the compiler should do already.

> + struct utsname un;
> + uname(&un);
> +
> + struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
> + for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
> + printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> + (unsigned long)entry->addr, (unsigned long)entry->size);
> + }
> +
> +Limitations
> +-----------
> +
> +This feature is currently only supported on the arm64, s390 and x86
> +architectures.
> +
> +Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> +course, not all of the accesses may fit in the buffer, but aside from
> +that, not all internal kernel APIs that access userspace memory are
> +covered. Therefore, userspace programs should tolerate unreported
> +accesses.
> +
> +On the other hand, the kernel guarantees that it will not
> +(intentionally) report accessing more data than it is specified
> +to read. For example, if the kernel implements a syscall that is
> +specified to read a data structure of size ``N`` bytes by first
> +reading a page's worth of data and then only using the first ``N``
> +bytes from it, the kernel will either report reading ``N`` bytes or
> +not report the access at all.
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 07:51:11

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] fs: use raw_copy_from_user() to copy mount() data

On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
>
> With uaccess logging the contract is that the kernel must not report
> accessing more data than necessary, as this can lead to false positive
> reports in downstream consumers. This generally works out of the box
> when instrumenting copy_{from,to}_user(), but with the data argument
> to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> much as we can, if the PAGE_SIZE sized access failed) and figure out
> later how much we actually need.
>
> To prevent this from leading to a false positive report, use
> raw_copy_from_user(), which will prevent the access from being logged.
> Recall that it is valid for the kernel to report accessing less
> data than it actually accessed, as uaccess logging is a best-effort
> mechanism for reporting uaccesses.
>
> Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> fs/namespace.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 659a8f39c61a..695b30e391f0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3197,7 +3197,12 @@ static void *copy_mount_options(const void __user * data)
> if (!copy)
> return ERR_PTR(-ENOMEM);
>
> - left = copy_from_user(copy, data, PAGE_SIZE);
> + /*
> + * Use raw_copy_from_user to avoid reporting overly large accesses in
> + * the uaccess buffer, as this can lead to false positive reports in
> + * downstream consumers.
> + */
> + left = raw_copy_from_user(copy, data, PAGE_SIZE);

This will skip KASAN/etc checks as well, right? I guess it is fine b/c
this affects just this place and the code looks safe (famous last
words :)) and we can refine it in future.
But I wonder about false positives under KMSAN. However, we probably
can add an explicit KMSAN annotation to mark it as initialised.
Alex?

> /*
> * Not all architectures have an exact copy_from_user(). Resort to
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 09:56:34

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] uaccess-buffer: add core code

.On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
>
> Add the core code to support uaccess logging. Subsequent patches will
> hook this up to the arch-specific kernel entry and exit code for
> certain architectures.

I don't see where we block signals when a user writes to the addr. I
expected to see some get_user from the addr somewhere in the signal
handling code. What am I missing?

> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <[email protected]>
> ---
> v2:
> - New interface that avoids multiple syscalls per real syscall and
> is arch-generic
> - Avoid logging uaccesses done by BPF programs
> - Add documentation
> - Split up into multiple patches
> - Various code moves, renames etc as requested by Marco
>
> arch/Kconfig | 5 +
> fs/exec.c | 2 +
> include/linux/instrumented.h | 5 +-
> include/linux/sched.h | 4 +
> include/linux/uaccess-buffer-log-hooks.h | 59 +++++++++++
> include/linux/uaccess-buffer.h | 79 ++++++++++++++
> include/uapi/linux/prctl.h | 3 +
> include/uapi/linux/uaccess-buffer.h | 25 +++++
> kernel/Makefile | 1 +
> kernel/bpf/helpers.c | 6 +-
> kernel/fork.c | 3 +
> kernel/signal.c | 4 +-
> kernel/sys.c | 6 ++
> kernel/uaccess-buffer.c | 125 +++++++++++++++++++++++
> 14 files changed, 324 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/uaccess-buffer-log-hooks.h
> create mode 100644 include/linux/uaccess-buffer.h
> create mode 100644 include/uapi/linux/uaccess-buffer.h
> create mode 100644 kernel/uaccess-buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 26b8ed11639d..6030298a7e9a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1302,6 +1302,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
> config DYNAMIC_SIGFRAME
> bool
>
> +config HAVE_ARCH_UACCESS_BUFFER
> + bool
> + help
> + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/fs/exec.c b/fs/exec.c
> index 537d92c41105..5f30314f3ec6 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -65,6 +65,7 @@
> #include <linux/vmalloc.h>
> #include <linux/io_uring.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -1313,6 +1314,7 @@ int begin_new_exec(struct linux_binprm * bprm)
> me->personality &= ~bprm->per_clear;
>
> clear_syscall_work_syscall_user_dispatch(me);
> + uaccess_buffer_set_descriptor_addr_addr(0);
>
> /*
> * We have to apply CLOEXEC before we change whether the process is
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..c96be1695614 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, uaccess buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_H
> #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess-buffer-log-hooks.h>
>
> /**
> * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 78c351e35fec..1f978deaa3f8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1484,6 +1484,10 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> + struct uaccess_buffer_info uaccess_buffer;
> +#endif

Shouldn't this be controlled by an additional config that a user can
enable/disable?
If I am reading this correctly, the current implementation forces
uaccess logging for all arches that support it. Some embed kernels may
not want this.


> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess-buffer-log-hooks.h b/include/linux/uaccess-buffer-log-hooks.h
> new file mode 100644
> index 000000000000..bddc84ddce32
> --- /dev/null
> +++ b/include/linux/uaccess-buffer-log-hooks.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> +#define _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> + /*
> + * The pointer to pointer to struct uaccess_descriptor. This is the
> + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> + */
> + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> +
> + /*
> + * The pointer to struct uaccess_descriptor read at syscall entry time.
> + */
> + struct uaccess_descriptor __user *desc_ptr;
> +
> + /*
> + * A pointer to the kernel's temporary copy of the uaccess log for the
> + * current syscall. We log to a kernel buffer in order to avoid leaking
> + * timing information to userspace.
> + */
> + struct uaccess_buffer_entry *kbegin;
> +
> + /*
> + * The position of the next uaccess buffer entry for the current
> + * syscall.
> + */
> + struct uaccess_buffer_entry *kcur;
> +
> + /*
> + * A pointer to the end of the kernel's uaccess log.
> + */
> + struct uaccess_buffer_entry *kend;
> +
> + /*
> + * The pointer to the userspace uaccess log, as read from the
> + * struct uaccess_descriptor.
> + */
> + struct uaccess_buffer_entry __user *ubegin;
> +};
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_LOG_HOOKS_H */
> diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..14261368d3a9
> --- /dev/null
> +++ b/include/linux/uaccess-buffer.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UACCESS_BUFFER_H
> +#define _LINUX_UACCESS_BUFFER_H
> +
> +#include <linux/sched.h>
> +#include <uapi/linux/uaccess-buffer.h>
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return tsk->uaccess_buffer.desc_ptr_ptr;
> +}
> +
> +void __uaccess_buffer_syscall_entry(void);
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> + if (uaccess_buffer_maybe_blocked(current))
> + __uaccess_buffer_syscall_entry();
> +}
> +
> +void __uaccess_buffer_syscall_exit(void);
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> + if (uaccess_buffer_maybe_blocked(current))
> + __uaccess_buffer_syscall_exit();
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void);
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + if (!uaccess_buffer_maybe_blocked(current))
> + return false;
> + return __uaccess_buffer_pre_exit_loop();
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void);
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> + if (pending)
> + __uaccess_buffer_post_exit_loop();
> +}
> +
> +void uaccess_buffer_cancel_log(struct task_struct *tsk);
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> +
> +#else
> +
> +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> +{
> + return false;
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +static inline bool uaccess_buffer_pre_exit_loop(void)
> +{
> + return false;
> +}
> +static inline void uaccess_buffer_post_exit_loop(bool pending)
> +{
> +}
> +static inline void uaccess_buffer_cancel_log(struct task_struct *tsk)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_UACCESS_BUFFER_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index bb73e9a0b24f..74b37469c7b3 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -272,4 +272,7 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
>
> +/* Configure uaccess logging feature */
> +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> new file mode 100644
> index 000000000000..619b17dc25c4
> --- /dev/null
> +++ b/include/uapi/linux/uaccess-buffer.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> +#define _UAPI_LINUX_UACCESS_BUFFER_H
> +
> +/* Location of the uaccess log. */
> +struct uaccess_descriptor {
> + /* Address of the uaccess_buffer_entry array. */
> + __u64 addr;
> + /* Size of the uaccess_buffer_entry array in number of elements. */
> + __u64 size;
> +};
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + /* Address being accessed. */
> + __u64 addr;
> + /* Number of bytes that were accessed. */
> + __u64 size;
> + /* UACCESS_BUFFER_* flags. */
> + __u64 flags;
> +};
> +
> +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> +
> +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 186c49582f45..d4d9be5146c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 649f07623df6..167b50177066 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -637,7 +637,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> const void __user *, user_ptr)
> {
> - int ret = copy_from_user(dst, user_ptr, size);
> + /*
> + * Avoid copy_from_user() here as it may leak information about the BPF
> + * program to userspace via the uaccess buffer.
> + */
> + int ret = raw_copy_from_user(dst, user_ptr, size);

Here I am more concerned about KASAN/KMSAN checks.
What exactly is the attack vector here? Are these accesses secret?
Can't the same info be obtained using userfaultfd/unmapping memory?

raw_copy_from_user also skips access_ok, is it ok?


> if (unlikely(ret)) {
> memset(dst, 0, size);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3244cc56b697..c7abe7e7c7cd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -96,6 +96,7 @@
> #include <linux/scs.h>
> #include <linux/io_uring.h>
> #include <linux/bpf.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -890,6 +891,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> if (memcg_charge_kernel_stack(tsk))
> goto free_stack;
>
> + uaccess_buffer_cancel_log(tsk);

Why do we need this?
tsk is a newly allocated task_struct. If I am not mistaken, it's not
zero initialized, so are we kfree'ing garbage?
But we may need to do something with tasks after arch_dup_task_struct.

> stack_vm_area = task_stack_vm_area(tsk);
>
> err = arch_dup_task_struct(tsk, orig);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..69bf21518bd0 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -45,6 +45,7 @@
> #include <linux/posix-timers.h>
> #include <linux/cgroup.h>
> #include <linux/audit.h>
> +#include <linux/uaccess-buffer.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/signal.h>
> @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> - (sig == SIGKILL || !p->ptrace)) {
> + (sig == SIGKILL ||
> + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {

Why do we need this change?

> /*
> * This signal will be fatal to the whole group.
> */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..c71a9a9c0f68 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess-buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> new file mode 100644
> index 000000000000..e1c6d6ab9af8
> --- /dev/null
> +++ b/kernel/uaccess-buffer.c
> @@ -0,0 +1,125 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Support for uaccess logging via uaccess buffers.
> + *
> + * Copyright (C) 2021, Google LLC.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/mm.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess-buffer.h>
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_buffer_entry *entry = buf->kcur;
> +
> + if (!entry || unlikely(uaccess_kernel()))
> + return;
> + entry->addr = addr;
> + entry->size = size;
> + entry->flags = flags;
> +
> + ++buf->kcur;
> + if (buf->kcur == buf->kend)
> + buf->kcur = 0;

= NULL;

> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> +{
> + current->uaccess_buffer.desc_ptr_ptr =
> + (struct uaccess_descriptor __user * __user *)addr;
> + uaccess_buffer_cancel_log(current);

Is this necessary? It looks more reasonable and useful to not call
cancel. In most cases the user won't setup it twice/change. But if the
user anyhow asked to trace the prctl, why not trace it?

> + return 0;
> +}
> +
> +bool __uaccess_buffer_pre_exit_loop(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor __user *desc_ptr;
> + sigset_t tmp_mask;
> +
> + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> + return false;
> +
> + current->real_blocked = current->blocked;
> + sigfillset(&tmp_mask);
> + set_current_blocked(&tmp_mask);
> + return true;
> +}
> +
> +void __uaccess_buffer_post_exit_loop(void)
> +{
> + spin_lock_irq(&current->sighand->siglock);
> + current->blocked = current->real_blocked;
> + recalc_sigpending();
> + spin_unlock_irq(&current->sighand->siglock);
> +}
> +
;> +void uaccess_buffer_cancel_log(struct task_struct *tsk)
> +{
> + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> +
> + if (buf->kcur) {

uaccess_buffer_log sets kcur to NULL on overflow and we call this
function from a middle of fork, it looks strange that kfree'ing
something depends on the previous buffer overflow. Should we check
kbegin here?

> + buf->kcur = 0;

= NULL
I would also set kend to NULL to not leave a dangling pointer.

> + kfree(buf->kbegin);
> + }
> +}
> +
> +void __uaccess_buffer_syscall_entry(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + struct uaccess_descriptor desc;
> +
> + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> + put_user(0, buf->desc_ptr_ptr) ||
> + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> + return;
> +
> + if (desc.size > 1024)
> + desc.size = 1024;
> +
> + buf->kbegin = kmalloc_array(
> + desc.size, sizeof(struct uaccess_buffer_entry), GFP_KERNEL);

Is not handling error intentional here?
Maybe it's better to check the error just to make the code more
explicit (and maybe prevent some future bugs).

Actually an interesting attack vector. If you can make this kmalloc
fail, you can prevent sanitizers from detecting any of the bad
accesses :)

Does it make sense to flag the error somewhere in the desc?... or I am
thinking if we should pre-allocate the buffer, if we start tracing a
task, we will trace lots of syscalls, so avoiding kmalloc/kfree per
syscall can make sense. What do you think?

> + buf->kcur = buf->kbegin;
> + buf->kend = buf->kbegin + desc.size;
> + buf->ubegin = (struct uaccess_buffer_entry __user *)desc.addr;
> +}
> +
> +void __uaccess_buffer_syscall_exit(void)
> +{
> + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> + u64 num_entries = buf->kcur - buf->kbegin;
> + struct uaccess_descriptor desc;
> +
> + if (!buf->kcur)

uaccess_buffer_log sets kcur to NULL on overflow. I think we need to
check kbegin here.


> + return;
> +
> + desc.addr = (u64)(buf->ubegin + num_entries);
> + desc.size = buf->kend - buf->kcur;

Is the user expected to use size in any of reasonable scenarios?
We cap size at 1024 at entry, so the size can be truncated here.

> + buf->kcur = 0;

= NULL

> + if (copy_to_user(buf->ubegin, buf->kbegin,
> + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> +
> + kfree(buf->kbegin);

What if we enter exit/exit_group with logging enabled, won't we leak the buffer?

> +}
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>

2021-11-23 10:08:21

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] uaccess-buffer: add core code

On Tue, 23 Nov 2021 at 10:56, Dmitry Vyukov <[email protected]> wrote:
>
> .On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> >
> > Add the core code to support uaccess logging. Subsequent patches will
> > hook this up to the arch-specific kernel entry and exit code for
> > certain architectures.
>
> I don't see where we block signals when a user writes to the addr. I
> expected to see some get_user from the addr somewhere in the signal
> handling code. What am I missing?
>
> > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > v2:
> > - New interface that avoids multiple syscalls per real syscall and
> > is arch-generic
> > - Avoid logging uaccesses done by BPF programs
> > - Add documentation
> > - Split up into multiple patches
> > - Various code moves, renames etc as requested by Marco
> >
> > arch/Kconfig | 5 +
> > fs/exec.c | 2 +
> > include/linux/instrumented.h | 5 +-
> > include/linux/sched.h | 4 +
> > include/linux/uaccess-buffer-log-hooks.h | 59 +++++++++++
> > include/linux/uaccess-buffer.h | 79 ++++++++++++++
> > include/uapi/linux/prctl.h | 3 +
> > include/uapi/linux/uaccess-buffer.h | 25 +++++
> > kernel/Makefile | 1 +
> > kernel/bpf/helpers.c | 6 +-
> > kernel/fork.c | 3 +
> > kernel/signal.c | 4 +-
> > kernel/sys.c | 6 ++
> > kernel/uaccess-buffer.c | 125 +++++++++++++++++++++++
> > 14 files changed, 324 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/uaccess-buffer-log-hooks.h
> > create mode 100644 include/linux/uaccess-buffer.h
> > create mode 100644 include/uapi/linux/uaccess-buffer.h
> > create mode 100644 kernel/uaccess-buffer.c
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 26b8ed11639d..6030298a7e9a 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1302,6 +1302,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
> > config DYNAMIC_SIGFRAME
> > bool
> >
> > +config HAVE_ARCH_UACCESS_BUFFER
> > + bool
> > + help
> > + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> > +
> > source "kernel/gcov/Kconfig"
> >
> > source "scripts/gcc-plugins/Kconfig"
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 537d92c41105..5f30314f3ec6 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -65,6 +65,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/io_uring.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/uaccess.h>
> > #include <asm/mmu_context.h>
> > @@ -1313,6 +1314,7 @@ int begin_new_exec(struct linux_binprm * bprm)
> > me->personality &= ~bprm->per_clear;
> >
> > clear_syscall_work_syscall_user_dispatch(me);
> > + uaccess_buffer_set_descriptor_addr_addr(0);
> >
> > /*
> > * We have to apply CLOEXEC before we change whether the process is
> > diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> > index 42faebbaa202..c96be1695614 100644
> > --- a/include/linux/instrumented.h
> > +++ b/include/linux/instrumented.h
> > @@ -2,7 +2,7 @@
> >
> > /*
> > * This header provides generic wrappers for memory access instrumentation that
> > - * the compiler cannot emit for: KASAN, KCSAN.
> > + * the compiler cannot emit for: KASAN, KCSAN, uaccess buffers.
> > */
> > #ifndef _LINUX_INSTRUMENTED_H
> > #define _LINUX_INSTRUMENTED_H
> > @@ -11,6 +11,7 @@
> > #include <linux/kasan-checks.h>
> > #include <linux/kcsan-checks.h>
> > #include <linux/types.h>
> > +#include <linux/uaccess-buffer-log-hooks.h>
> >
> > /**
> > * instrument_read - instrument regular read access
> > @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> > {
> > kasan_check_read(from, n);
> > kcsan_check_read(from, n);
> > + uaccess_buffer_log_write(to, n);
> > }
> >
> > /**
> > @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> > {
> > kasan_check_write(to, n);
> > kcsan_check_write(to, n);
> > + uaccess_buffer_log_read(from, n);
> > }
> >
> > #endif /* _LINUX_INSTRUMENTED_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 78c351e35fec..1f978deaa3f8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1484,6 +1484,10 @@ struct task_struct {
> > struct callback_head l1d_flush_kill;
> > #endif
> >
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > + struct uaccess_buffer_info uaccess_buffer;
> > +#endif
>
> Shouldn't this be controlled by an additional config that a user can
> enable/disable?
> If I am reading this correctly, the current implementation forces
> uaccess logging for all arches that support it. Some embed kernels may
> not want this.
>
>
> > /*
> > * New fields for task_struct should be added above here, so that
> > * they are included in the randomized portion of task_struct.
> > diff --git a/include/linux/uaccess-buffer-log-hooks.h b/include/linux/uaccess-buffer-log-hooks.h
> > new file mode 100644
> > index 000000000000..bddc84ddce32
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer-log-hooks.h
> > @@ -0,0 +1,59 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > +#define _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +struct uaccess_buffer_info {
> > + /*
> > + * The pointer to pointer to struct uaccess_descriptor. This is the
> > + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> > + */
> > + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> > +
> > + /*
> > + * The pointer to struct uaccess_descriptor read at syscall entry time.
> > + */
> > + struct uaccess_descriptor __user *desc_ptr;
> > +
> > + /*
> > + * A pointer to the kernel's temporary copy of the uaccess log for the
> > + * current syscall. We log to a kernel buffer in order to avoid leaking
> > + * timing information to userspace.
> > + */
> > + struct uaccess_buffer_entry *kbegin;
> > +
> > + /*
> > + * The position of the next uaccess buffer entry for the current
> > + * syscall.
> > + */
> > + struct uaccess_buffer_entry *kcur;
> > +
> > + /*
> > + * A pointer to the end of the kernel's uaccess log.
> > + */
> > + struct uaccess_buffer_entry *kend;
> > +
> > + /*
> > + * The pointer to the userspace uaccess log, as read from the
> > + * struct uaccess_descriptor.
> > + */
> > + struct uaccess_buffer_entry __user *ubegin;
> > +};
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> > +
> > +#else
> > +
> > +static inline void uaccess_buffer_log_read(const void __user *from,
> > + unsigned long n)
> > +{
> > +}
> > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > +}
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_LOG_HOOKS_H */
> > diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..14261368d3a9
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer.h
> > @@ -0,0 +1,79 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_H
> > +#define _LINUX_UACCESS_BUFFER_H
> > +
> > +#include <linux/sched.h>
> > +#include <uapi/linux/uaccess-buffer.h>
> > +
> > +#include <asm-generic/errno-base.h>
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return tsk->uaccess_buffer.desc_ptr_ptr;
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void);
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > + if (uaccess_buffer_maybe_blocked(current))
> > + __uaccess_buffer_syscall_entry();
> > +}
> > +
> > +void __uaccess_buffer_syscall_exit(void);
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > + if (uaccess_buffer_maybe_blocked(current))
> > + __uaccess_buffer_syscall_exit();
> > +}
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void);
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + if (!uaccess_buffer_maybe_blocked(current))
> > + return false;
> > + return __uaccess_buffer_pre_exit_loop();
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void);
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > + if (pending)
> > + __uaccess_buffer_post_exit_loop();
> > +}
> > +
> > +void uaccess_buffer_cancel_log(struct task_struct *tsk);
> > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> > +
> > +#else
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > +}
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > +}
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + return false;
> > +}
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > +}
> > +static inline void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > +{
> > + return -EINVAL;
> > +}
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_H */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index bb73e9a0b24f..74b37469c7b3 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -272,4 +272,7 @@ struct prctl_mm_map {
> > # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> > # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
> >
> > +/* Configure uaccess logging feature */
> > +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> > +
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..619b17dc25c4
> > --- /dev/null
> > +++ b/include/uapi/linux/uaccess-buffer.h
> > @@ -0,0 +1,25 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> > +#define _UAPI_LINUX_UACCESS_BUFFER_H
> > +
> > +/* Location of the uaccess log. */
> > +struct uaccess_descriptor {
> > + /* Address of the uaccess_buffer_entry array. */
> > + __u64 addr;
> > + /* Size of the uaccess_buffer_entry array in number of elements. */
> > + __u64 size;
> > +};
> > +
> > +/* Format of the entries in the uaccess log. */
> > +struct uaccess_buffer_entry {
> > + /* Address being accessed. */
> > + __u64 addr;
> > + /* Number of bytes that were accessed. */
> > + __u64 size;
> > + /* UACCESS_BUFFER_* flags. */
> > + __u64 flags;
> > +};
> > +
> > +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> > +
> > +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 186c49582f45..d4d9be5146c3 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> > obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> > obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > obj-$(CONFIG_CFI_CLANG) += cfi.o
> > +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
> >
> > obj-$(CONFIG_PERF_EVENTS) += events/
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 649f07623df6..167b50177066 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -637,7 +637,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> > BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> > const void __user *, user_ptr)
> > {
> > - int ret = copy_from_user(dst, user_ptr, size);
> > + /*
> > + * Avoid copy_from_user() here as it may leak information about the BPF
> > + * program to userspace via the uaccess buffer.
> > + */
> > + int ret = raw_copy_from_user(dst, user_ptr, size);
>
> Here I am more concerned about KASAN/KMSAN checks.
> What exactly is the attack vector here? Are these accesses secret?
> Can't the same info be obtained using userfaultfd/unmapping memory?
>
> raw_copy_from_user also skips access_ok, is it ok?

One way to do this may be:

uaccess_buffer_log_pause();
copy_from_user(...);
uaccess_buffer_log_resume();


> > if (unlikely(ret)) {
> > memset(dst, 0, size);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3244cc56b697..c7abe7e7c7cd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -96,6 +96,7 @@
> > #include <linux/scs.h>
> > #include <linux/io_uring.h>
> > #include <linux/bpf.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <asm/pgalloc.h>
> > #include <linux/uaccess.h>
> > @@ -890,6 +891,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> > if (memcg_charge_kernel_stack(tsk))
> > goto free_stack;
> >
> > + uaccess_buffer_cancel_log(tsk);
>
> Why do we need this?
> tsk is a newly allocated task_struct. If I am not mistaken, it's not
> zero initialized, so are we kfree'ing garbage?
> But we may need to do something with tasks after arch_dup_task_struct.
>
> > stack_vm_area = task_stack_vm_area(tsk);
> >
> > err = arch_dup_task_struct(tsk, orig);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index a629b11bf3e0..69bf21518bd0 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -45,6 +45,7 @@
> > #include <linux/posix-timers.h>
> > #include <linux/cgroup.h>
> > #include <linux/audit.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/signal.h>
> > @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> > if (sig_fatal(p, sig) &&
> > !(signal->flags & SIGNAL_GROUP_EXIT) &&
> > !sigismember(&t->real_blocked, sig) &&
> > - (sig == SIGKILL || !p->ptrace)) {
> > + (sig == SIGKILL ||
> > + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>
> Why do we need this change?
>
> > /*
> > * This signal will be fatal to the whole group.
> > */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 8fdac0d90504..c71a9a9c0f68 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -42,6 +42,7 @@
> > #include <linux/version.h>
> > #include <linux/ctype.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/compat.h>
> > #include <linux/syscalls.h>
> > @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> > break;
> > #endif
> > + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> > diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> > new file mode 100644
> > index 000000000000..e1c6d6ab9af8
> > --- /dev/null
> > +++ b/kernel/uaccess-buffer.c
> > @@ -0,0 +1,125 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Support for uaccess logging via uaccess buffers.
> > + *
> > + * Copyright (C) 2021, Google LLC.
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/mm.h>
> > +#include <linux/prctl.h>
> > +#include <linux/ptrace.h>
> > +#include <linux/sched.h>
> > +#include <linux/signal.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/uaccess-buffer.h>
> > +
> > +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> > + unsigned long flags)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_buffer_entry *entry = buf->kcur;
> > +
> > + if (!entry || unlikely(uaccess_kernel()))
> > + return;
> > + entry->addr = addr;
> > + entry->size = size;
> > + entry->flags = flags;
> > +
> > + ++buf->kcur;
> > + if (buf->kcur == buf->kend)
> > + buf->kcur = 0;
>
> = NULL;
>
> > +}
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)from, n, 0);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_read);
> > +
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_write);
> > +
> > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > +{
> > + current->uaccess_buffer.desc_ptr_ptr =
> > + (struct uaccess_descriptor __user * __user *)addr;
> > + uaccess_buffer_cancel_log(current);
>
> Is this necessary? It looks more reasonable and useful to not call
> cancel. In most cases the user won't setup it twice/change. But if the
> user anyhow asked to trace the prctl, why not trace it?
>
> > + return 0;
> > +}
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor __user *desc_ptr;
> > + sigset_t tmp_mask;
> > +
> > + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > + return false;
> > +
> > + current->real_blocked = current->blocked;
> > + sigfillset(&tmp_mask);
> > + set_current_blocked(&tmp_mask);
> > + return true;
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void)
> > +{
> > + spin_lock_irq(&current->sighand->siglock);
> > + current->blocked = current->real_blocked;
> > + recalc_sigpending();
> > + spin_unlock_irq(&current->sighand->siglock);
> > +}
> > +
> ;> +void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > +{
> > + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> > +
> > + if (buf->kcur) {
>
> uaccess_buffer_log sets kcur to NULL on overflow and we call this
> function from a middle of fork, it looks strange that kfree'ing
> something depends on the previous buffer overflow. Should we check
> kbegin here?
>
> > + buf->kcur = 0;
>
> = NULL
> I would also set kend to NULL to not leave a dangling pointer.
>
> > + kfree(buf->kbegin);
> > + }
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor desc;
> > +
> > + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> > + put_user(0, buf->desc_ptr_ptr) ||
> > + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> > + return;
> > +
> > + if (desc.size > 1024)
> > + desc.size = 1024;
> > +
> > + buf->kbegin = kmalloc_array(
> > + desc.size, sizeof(struct uaccess_buffer_entry), GFP_KERNEL);
>
> Is not handling error intentional here?
> Maybe it's better to check the error just to make the code more
> explicit (and maybe prevent some future bugs).
>
> Actually an interesting attack vector. If you can make this kmalloc
> fail, you can prevent sanitizers from detecting any of the bad
> accesses :)
>
> Does it make sense to flag the error somewhere in the desc?... or I am
> thinking if we should pre-allocate the buffer, if we start tracing a
> task, we will trace lots of syscalls, so avoiding kmalloc/kfree per
> syscall can make sense. What do you think?
>
> > + buf->kcur = buf->kbegin;
> > + buf->kend = buf->kbegin + desc.size;
> > + buf->ubegin = (struct uaccess_buffer_entry __user *)desc.addr;
> > +}
> > +
> > +void __uaccess_buffer_syscall_exit(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + u64 num_entries = buf->kcur - buf->kbegin;
> > + struct uaccess_descriptor desc;
> > +
> > + if (!buf->kcur)
>
> uaccess_buffer_log sets kcur to NULL on overflow. I think we need to
> check kbegin here.
>
>
> > + return;
> > +
> > + desc.addr = (u64)(buf->ubegin + num_entries);
> > + desc.size = buf->kend - buf->kcur;
>
> Is the user expected to use size in any of reasonable scenarios?
> We cap size at 1024 at entry, so the size can be truncated here.
>
> > + buf->kcur = 0;
>
> = NULL
>
> > + if (copy_to_user(buf->ubegin, buf->kbegin,
> > + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> > + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> > +
> > + kfree(buf->kbegin);
>
> What if we enter exit/exit_group with logging enabled, won't we leak the buffer?
>
> > +}
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >

2021-11-23 10:09:45

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] fs: use raw_copy_from_user() to copy mount() data

On Tue, Nov 23, 2021 at 8:51 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> >
> > With uaccess logging the contract is that the kernel must not report
> > accessing more data than necessary, as this can lead to false positive
> > reports in downstream consumers. This generally works out of the box
> > when instrumenting copy_{from,to}_user(), but with the data argument
> > to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> > much as we can, if the PAGE_SIZE sized access failed) and figure out
> > later how much we actually need.
> >
> > To prevent this from leading to a false positive report, use
> > raw_copy_from_user(), which will prevent the access from being logged.
> > Recall that it is valid for the kernel to report accessing less
> > data than it actually accessed, as uaccess logging is a best-effort
> > mechanism for reporting uaccesses.
> >
> > Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > fs/namespace.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 659a8f39c61a..695b30e391f0 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -3197,7 +3197,12 @@ static void *copy_mount_options(const void __user * data)
> > if (!copy)
> > return ERR_PTR(-ENOMEM);
> >
> > - left = copy_from_user(copy, data, PAGE_SIZE);
> > + /*
> > + * Use raw_copy_from_user to avoid reporting overly large accesses in
> > + * the uaccess buffer, as this can lead to false positive reports in
> > + * downstream consumers.
> > + */
> > + left = raw_copy_from_user(copy, data, PAGE_SIZE);

I don't really like the idea of using raw_copy_from_user() anywhere.
Right now users of instrumented.h can more or less assume they see all
usercopy events, and removing the copy_from_user() call from here
looks like a regression.

Cannot the usercopy logger decide whether it wants to log the access
based on the size (e.g. skip accesses >= PAGE_SIZE)?
Will it help if we can instrument both sides of copy_from_user() (see
the code here: https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/14103/4)?

If not, maybe we can disable/enable uaccess logging for the current
task around these accesses?

> This will skip KASAN/etc checks as well, right? I guess it is fine b/c
> this affects just this place and the code looks safe (famous last
> words :)) and we can refine it in future.
> But I wonder about false positives under KMSAN. However, we probably
> can add an explicit KMSAN annotation to mark it as initialised.
> Alex?
>
> > /*
> > * Not all architectures have an exact copy_from_user(). Resort to
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2021-11-23 10:20:24

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] uaccess-buffer: add core code

On Tue, Nov 23, 2021 at 10:56 AM Dmitry Vyukov <[email protected]> wrote:
>
> .On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> >
> > Add the core code to support uaccess logging. Subsequent patches will
> > hook this up to the arch-specific kernel entry and exit code for
> > certain architectures.
>
> I don't see where we block signals when a user writes to the addr. I
> expected to see some get_user from the addr somewhere in the signal
> handling code. What am I missing?
>
> > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > v2:
> > - New interface that avoids multiple syscalls per real syscall and
> > is arch-generic
> > - Avoid logging uaccesses done by BPF programs
> > - Add documentation
> > - Split up into multiple patches
> > - Various code moves, renames etc as requested by Marco
> >
> > arch/Kconfig | 5 +
> > fs/exec.c | 2 +
> > include/linux/instrumented.h | 5 +-
> > include/linux/sched.h | 4 +
> > include/linux/uaccess-buffer-log-hooks.h | 59 +++++++++++
> > include/linux/uaccess-buffer.h | 79 ++++++++++++++
> > include/uapi/linux/prctl.h | 3 +
> > include/uapi/linux/uaccess-buffer.h | 25 +++++
> > kernel/Makefile | 1 +
> > kernel/bpf/helpers.c | 6 +-
> > kernel/fork.c | 3 +
> > kernel/signal.c | 4 +-
> > kernel/sys.c | 6 ++
> > kernel/uaccess-buffer.c | 125 +++++++++++++++++++++++
> > 14 files changed, 324 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/uaccess-buffer-log-hooks.h
> > create mode 100644 include/linux/uaccess-buffer.h
> > create mode 100644 include/uapi/linux/uaccess-buffer.h
> > create mode 100644 kernel/uaccess-buffer.c
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 26b8ed11639d..6030298a7e9a 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1302,6 +1302,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
> > config DYNAMIC_SIGFRAME
> > bool
> >
> > +config HAVE_ARCH_UACCESS_BUFFER
> > + bool
> > + help
> > + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> > +
> > source "kernel/gcov/Kconfig"
> >
> > source "scripts/gcc-plugins/Kconfig"
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 537d92c41105..5f30314f3ec6 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -65,6 +65,7 @@
> > #include <linux/vmalloc.h>
> > #include <linux/io_uring.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/uaccess.h>
> > #include <asm/mmu_context.h>
> > @@ -1313,6 +1314,7 @@ int begin_new_exec(struct linux_binprm * bprm)
> > me->personality &= ~bprm->per_clear;
> >
> > clear_syscall_work_syscall_user_dispatch(me);
> > + uaccess_buffer_set_descriptor_addr_addr(0);
> >
> > /*
> > * We have to apply CLOEXEC before we change whether the process is
> > diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> > index 42faebbaa202..c96be1695614 100644
> > --- a/include/linux/instrumented.h
> > +++ b/include/linux/instrumented.h
> > @@ -2,7 +2,7 @@
> >
> > /*
> > * This header provides generic wrappers for memory access instrumentation that
> > - * the compiler cannot emit for: KASAN, KCSAN.
> > + * the compiler cannot emit for: KASAN, KCSAN, uaccess buffers.
> > */
> > #ifndef _LINUX_INSTRUMENTED_H
> > #define _LINUX_INSTRUMENTED_H
> > @@ -11,6 +11,7 @@
> > #include <linux/kasan-checks.h>
> > #include <linux/kcsan-checks.h>
> > #include <linux/types.h>
> > +#include <linux/uaccess-buffer-log-hooks.h>
> >
> > /**
> > * instrument_read - instrument regular read access
> > @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> > {
> > kasan_check_read(from, n);
> > kcsan_check_read(from, n);
> > + uaccess_buffer_log_write(to, n);
> > }
> >
> > /**
> > @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> > {
> > kasan_check_write(to, n);
> > kcsan_check_write(to, n);
> > + uaccess_buffer_log_read(from, n);
> > }
> >
> > #endif /* _LINUX_INSTRUMENTED_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 78c351e35fec..1f978deaa3f8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1484,6 +1484,10 @@ struct task_struct {
> > struct callback_head l1d_flush_kill;
> > #endif
> >
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > + struct uaccess_buffer_info uaccess_buffer;
> > +#endif
>
> Shouldn't this be controlled by an additional config that a user can
> enable/disable?
> If I am reading this correctly, the current implementation forces
> uaccess logging for all arches that support it. Some embed kernels may
> not want this.
>
>
> > /*
> > * New fields for task_struct should be added above here, so that
> > * they are included in the randomized portion of task_struct.
> > diff --git a/include/linux/uaccess-buffer-log-hooks.h b/include/linux/uaccess-buffer-log-hooks.h
> > new file mode 100644
> > index 000000000000..bddc84ddce32
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer-log-hooks.h
> > @@ -0,0 +1,59 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > +#define _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +struct uaccess_buffer_info {
> > + /*
> > + * The pointer to pointer to struct uaccess_descriptor. This is the
> > + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> > + */
> > + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> > +
> > + /*
> > + * The pointer to struct uaccess_descriptor read at syscall entry time.
> > + */
> > + struct uaccess_descriptor __user *desc_ptr;
> > +
> > + /*
> > + * A pointer to the kernel's temporary copy of the uaccess log for the
> > + * current syscall. We log to a kernel buffer in order to avoid leaking
> > + * timing information to userspace.
> > + */
> > + struct uaccess_buffer_entry *kbegin;
> > +
> > + /*
> > + * The position of the next uaccess buffer entry for the current
> > + * syscall.
> > + */
> > + struct uaccess_buffer_entry *kcur;
> > +
> > + /*
> > + * A pointer to the end of the kernel's uaccess log.
> > + */
> > + struct uaccess_buffer_entry *kend;
> > +
> > + /*
> > + * The pointer to the userspace uaccess log, as read from the
> > + * struct uaccess_descriptor.
> > + */
> > + struct uaccess_buffer_entry __user *ubegin;
> > +};
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> > +
> > +#else
> > +
> > +static inline void uaccess_buffer_log_read(const void __user *from,
> > + unsigned long n)
> > +{
> > +}
> > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > +}
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_LOG_HOOKS_H */
> > diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..14261368d3a9
> > --- /dev/null
> > +++ b/include/linux/uaccess-buffer.h
> > @@ -0,0 +1,79 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_UACCESS_BUFFER_H
> > +#define _LINUX_UACCESS_BUFFER_H
> > +
> > +#include <linux/sched.h>
> > +#include <uapi/linux/uaccess-buffer.h>
> > +
> > +#include <asm-generic/errno-base.h>
> > +
> > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return tsk->uaccess_buffer.desc_ptr_ptr;
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void);
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > + if (uaccess_buffer_maybe_blocked(current))
> > + __uaccess_buffer_syscall_entry();
> > +}
> > +
> > +void __uaccess_buffer_syscall_exit(void);
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > + if (uaccess_buffer_maybe_blocked(current))
> > + __uaccess_buffer_syscall_exit();
> > +}
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void);
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + if (!uaccess_buffer_maybe_blocked(current))
> > + return false;
> > + return __uaccess_buffer_pre_exit_loop();
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void);
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > + if (pending)
> > + __uaccess_buffer_post_exit_loop();
> > +}
> > +
> > +void uaccess_buffer_cancel_log(struct task_struct *tsk);
> > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> > +
> > +#else
> > +
> > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void uaccess_buffer_syscall_entry(void)
> > +{
> > +}
> > +static inline void uaccess_buffer_syscall_exit(void)
> > +{
> > +}
> > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > +{
> > + return false;
> > +}
> > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > +{
> > +}
> > +static inline void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > +{
> > + return -EINVAL;
> > +}
> > +#endif
> > +
> > +#endif /* _LINUX_UACCESS_BUFFER_H */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index bb73e9a0b24f..74b37469c7b3 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -272,4 +272,7 @@ struct prctl_mm_map {
> > # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> > # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
> >
> > +/* Configure uaccess logging feature */
> > +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> > +
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> > new file mode 100644
> > index 000000000000..619b17dc25c4
> > --- /dev/null
> > +++ b/include/uapi/linux/uaccess-buffer.h
> > @@ -0,0 +1,25 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> > +#define _UAPI_LINUX_UACCESS_BUFFER_H
> > +
> > +/* Location of the uaccess log. */
> > +struct uaccess_descriptor {
> > + /* Address of the uaccess_buffer_entry array. */
> > + __u64 addr;
> > + /* Size of the uaccess_buffer_entry array in number of elements. */
> > + __u64 size;
> > +};
> > +
> > +/* Format of the entries in the uaccess log. */
> > +struct uaccess_buffer_entry {
> > + /* Address being accessed. */
> > + __u64 addr;
> > + /* Number of bytes that were accessed. */
> > + __u64 size;
> > + /* UACCESS_BUFFER_* flags. */
> > + __u64 flags;
> > +};
> > +
> > +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> > +
> > +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 186c49582f45..d4d9be5146c3 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> > obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> > obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > obj-$(CONFIG_CFI_CLANG) += cfi.o
> > +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
> >
> > obj-$(CONFIG_PERF_EVENTS) += events/
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 649f07623df6..167b50177066 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -637,7 +637,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> > BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> > const void __user *, user_ptr)
> > {
> > - int ret = copy_from_user(dst, user_ptr, size);
> > + /*
> > + * Avoid copy_from_user() here as it may leak information about the BPF
> > + * program to userspace via the uaccess buffer.
> > + */
> > + int ret = raw_copy_from_user(dst, user_ptr, size);
>
> Here I am more concerned about KASAN/KMSAN checks.
> What exactly is the attack vector here? Are these accesses secret?

If there are concerns about leaking information in this particular
case, any other call to copy_from_user() added in the future will be
prone to the same issues.
So if uaccess logging is meant for production use, it should be
possible to lock the feature down from unwanted users.

> Can't the same info be obtained using userfaultfd/unmapping memory?
>
> raw_copy_from_user also skips access_ok, is it ok?
>
>
> > if (unlikely(ret)) {
> > memset(dst, 0, size);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3244cc56b697..c7abe7e7c7cd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -96,6 +96,7 @@
> > #include <linux/scs.h>
> > #include <linux/io_uring.h>
> > #include <linux/bpf.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <asm/pgalloc.h>
> > #include <linux/uaccess.h>
> > @@ -890,6 +891,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> > if (memcg_charge_kernel_stack(tsk))
> > goto free_stack;
> >
> > + uaccess_buffer_cancel_log(tsk);
>
> Why do we need this?
> tsk is a newly allocated task_struct. If I am not mistaken, it's not
> zero initialized, so are we kfree'ing garbage?
> But we may need to do something with tasks after arch_dup_task_struct.
>
> > stack_vm_area = task_stack_vm_area(tsk);
> >
> > err = arch_dup_task_struct(tsk, orig);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index a629b11bf3e0..69bf21518bd0 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -45,6 +45,7 @@
> > #include <linux/posix-timers.h>
> > #include <linux/cgroup.h>
> > #include <linux/audit.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/signal.h>
> > @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> > if (sig_fatal(p, sig) &&
> > !(signal->flags & SIGNAL_GROUP_EXIT) &&
> > !sigismember(&t->real_blocked, sig) &&
> > - (sig == SIGKILL || !p->ptrace)) {
> > + (sig == SIGKILL ||
> > + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
>
> Why do we need this change?
>
> > /*
> > * This signal will be fatal to the whole group.
> > */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 8fdac0d90504..c71a9a9c0f68 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -42,6 +42,7 @@
> > #include <linux/version.h>
> > #include <linux/ctype.h>
> > #include <linux/syscall_user_dispatch.h>
> > +#include <linux/uaccess-buffer.h>
> >
> > #include <linux/compat.h>
> > #include <linux/syscalls.h>
> > @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> > break;
> > #endif
> > + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> > diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> > new file mode 100644
> > index 000000000000..e1c6d6ab9af8
> > --- /dev/null
> > +++ b/kernel/uaccess-buffer.c
> > @@ -0,0 +1,125 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Support for uaccess logging via uaccess buffers.
> > + *
> > + * Copyright (C) 2021, Google LLC.
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/mm.h>
> > +#include <linux/prctl.h>
> > +#include <linux/ptrace.h>
> > +#include <linux/sched.h>
> > +#include <linux/signal.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/uaccess-buffer.h>
> > +
> > +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> > + unsigned long flags)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_buffer_entry *entry = buf->kcur;
> > +
> > + if (!entry || unlikely(uaccess_kernel()))
> > + return;
> > + entry->addr = addr;
> > + entry->size = size;
> > + entry->flags = flags;
> > +
> > + ++buf->kcur;
> > + if (buf->kcur == buf->kend)
> > + buf->kcur = 0;
>
> = NULL;
>
> > +}
> > +
> > +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)from, n, 0);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_read);
> > +
> > +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > +{
> > + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> > +}
> > +EXPORT_SYMBOL(uaccess_buffer_log_write);
> > +
> > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > +{
> > + current->uaccess_buffer.desc_ptr_ptr =
> > + (struct uaccess_descriptor __user * __user *)addr;
> > + uaccess_buffer_cancel_log(current);
>
> Is this necessary? It looks more reasonable and useful to not call
> cancel. In most cases the user won't setup it twice/change. But if the
> user anyhow asked to trace the prctl, why not trace it?
>
> > + return 0;
> > +}
> > +
> > +bool __uaccess_buffer_pre_exit_loop(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor __user *desc_ptr;
> > + sigset_t tmp_mask;
> > +
> > + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > + return false;
> > +
> > + current->real_blocked = current->blocked;
> > + sigfillset(&tmp_mask);
> > + set_current_blocked(&tmp_mask);
> > + return true;
> > +}
> > +
> > +void __uaccess_buffer_post_exit_loop(void)
> > +{
> > + spin_lock_irq(&current->sighand->siglock);
> > + current->blocked = current->real_blocked;
> > + recalc_sigpending();
> > + spin_unlock_irq(&current->sighand->siglock);
> > +}
> > +
> ;> +void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > +{
> > + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> > +
> > + if (buf->kcur) {
>
> uaccess_buffer_log sets kcur to NULL on overflow and we call this
> function from a middle of fork, it looks strange that kfree'ing
> something depends on the previous buffer overflow. Should we check
> kbegin here?
>
> > + buf->kcur = 0;
>
> = NULL
> I would also set kend to NULL to not leave a dangling pointer.
>
> > + kfree(buf->kbegin);
> > + }
> > +}
> > +
> > +void __uaccess_buffer_syscall_entry(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + struct uaccess_descriptor desc;
> > +
> > + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> > + put_user(0, buf->desc_ptr_ptr) ||
> > + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> > + return;
> > +
> > + if (desc.size > 1024)
> > + desc.size = 1024;
> > +
> > + buf->kbegin = kmalloc_array(
> > + desc.size, sizeof(struct uaccess_buffer_entry), GFP_KERNEL);
>
> Is not handling error intentional here?
> Maybe it's better to check the error just to make the code more
> explicit (and maybe prevent some future bugs).
>
> Actually an interesting attack vector. If you can make this kmalloc
> fail, you can prevent sanitizers from detecting any of the bad
> accesses :)
>
> Does it make sense to flag the error somewhere in the desc?... or I am
> thinking if we should pre-allocate the buffer, if we start tracing a
> task, we will trace lots of syscalls, so avoiding kmalloc/kfree per
> syscall can make sense. What do you think?
>
> > + buf->kcur = buf->kbegin;
> > + buf->kend = buf->kbegin + desc.size;
> > + buf->ubegin = (struct uaccess_buffer_entry __user *)desc.addr;
> > +}
> > +
> > +void __uaccess_buffer_syscall_exit(void)
> > +{
> > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > + u64 num_entries = buf->kcur - buf->kbegin;
> > + struct uaccess_descriptor desc;
> > +
> > + if (!buf->kcur)
>
> uaccess_buffer_log sets kcur to NULL on overflow. I think we need to
> check kbegin here.
>
>
> > + return;
> > +
> > + desc.addr = (u64)(buf->ubegin + num_entries);
> > + desc.size = buf->kend - buf->kcur;
>
> Is the user expected to use size in any of reasonable scenarios?
> We cap size at 1024 at entry, so the size can be truncated here.
>
> > + buf->kcur = 0;
>
> = NULL
>
> > + if (copy_to_user(buf->ubegin, buf->kbegin,
> > + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> > + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> > +
> > + kfree(buf->kbegin);
>
> What if we enter exit/exit_group with logging enabled, won't we leak the buffer?
>
> > +}
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2021-11-23 14:53:59

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 2/5] uaccess-buffer: add core code

From: Alexander Potapenko
> Sent: 23 November 2021 10:20
...
> So if uaccess logging is meant for production use, it should be
> possible to lock the feature down from unwanted users.

Isn't the conditional for logging every get_user() and put_user()
going to completely kill performance?

I'm not sure I'd want this code compiled into any kernel I use.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-11-25 13:42:25

by kernel test robot

[permalink] [raw]
Subject: [uaccess] 7cd6f10220: BUG:unable_to_handle_page_fault_for_address



Greeting,

FYI, we noticed the following commit (built with clang-14):

commit: 7cd6f102201f3ea35eea1b990f7543e890b7fdbb ("[PATCH v2 3/5] uaccess-buffer: add CONFIG_GENERIC_ENTRY support")
url: https://github.com/0day-ci/linux/commits/Peter-Collingbourne/kernel-introduce-uaccess-logging/20211123-131922
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+----------------------------------------------------------+------------+------------+
| | e050ed271b | 7cd6f10220 |
+----------------------------------------------------------+------------+------------+
| boot_successes | 16 | 0 |
| boot_failures | 0 | 16 |
| BUG:unable_to_handle_page_fault_for_address | 0 | 10 |
| Oops:#[##] | 0 | 10 |
| RIP:kfree | 0 | 10 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 16 |
| WARNING:at_mm/slub.c:#free_nonslab_page | 0 | 6 |
| RIP:free_nonslab_page | 0 | 6 |
| BUG:KASAN:double-free_or_invalid-free_in_dup_task_struct | 0 | 6 |
| maybe_for_address#:#[##] | 0 | 6 |
| RIP:__memcpy | 0 | 6 |
+----------------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 29.153667][ T2] BUG: unable to handle page fault for address: ffffebf7d0000008
[ 29.154602][ T2] #PF: supervisor read access in kernel mode
[ 29.155284][ T2] #PF: error_code(0x0000) - not-present page
[ 29.155975][ T2] PGD 0 P4D 0
[ 29.156359][ T2] Oops: 0000 [#1] PREEMPT SMP KASAN PTI
[ 29.156771][ T2] CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.16.0-rc1-00007-g7cd6f102201f #1 aaaec4470dd30d48a14d7cba8ba3e2c3760eb3bd
[ 29.156771][ T2] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 29.156771][ T2] RIP: 0010:kfree (include/linux/page-flags.h:198 include/linux/mm.h:863 mm/slub.c:4556)
[ 29.156771][ T2] Code: 00 00 80 72 09 48 8b 0d 8e 1a 69 03 eb 0a 48 b9 00 00 00 80 7f 77 00 00 48 01 d9 48 81 e9 00 00 00 80 48 c1 e9 0c 48 c1 e1 06 <4c> 8b 7c 01 08 41 f6 c7 01 0f 85 d3 00 00 00 48 01 c8 49 89 c7 49
All code
========
0: 00 00 add %al,(%rax)
2: 80 72 09 48 xorb $0x48,0x9(%rdx)
6: 8b 0d 8e 1a 69 03 mov 0x3691a8e(%rip),%ecx # 0x3691a9a
c: eb 0a jmp 0x18
e: 48 b9 00 00 00 80 7f movabs $0x777f80000000,%rcx
15: 77 00 00
18: 48 01 d9 add %rbx,%rcx
1b: 48 81 e9 00 00 00 80 sub $0xffffffff80000000,%rcx
22: 48 c1 e9 0c shr $0xc,%rcx
26: 48 c1 e1 06 shl $0x6,%rcx
2a:* 4c 8b 7c 01 08 mov 0x8(%rcx,%rax,1),%r15 <-- trapping instruction
2f: 41 f6 c7 01 test $0x1,%r15b
33: 0f 85 d3 00 00 00 jne 0x10c
39: 48 01 c8 add %rcx,%rax
3c: 49 89 c7 mov %rax,%r15
3f: 49 rex.WB

Code starting with the faulting instruction
===========================================
0: 4c 8b 7c 01 08 mov 0x8(%rcx,%rax,1),%r15
5: 41 f6 c7 01 test $0x1,%r15b
9: 0f 85 d3 00 00 00 jne 0xe2
f: 48 01 c8 add %rcx,%rax
12: 49 89 c7 mov %rax,%r15
15: 49 rex.WB
[ 29.156771][ T2] RSP: 0000:ffffc9000002fc08 EFLAGS: 00010206
[ 29.156771][ T2] RAX: ffffea0000000000 RBX: 0000067400000161 RCX: 000001f7d0000000
[ 29.156771][ T2] RDX: dffffc0000000000 RSI: ffffffff83c824e0 RDI: ffffffff841d22a0
[ 29.156771][ T2] RBP: ffff888131593628 R08: dffffc0000000000 R09: fffffbfff0a326f9
[ 29.156771][ T2] R10: dffff7fff0a326fa R11: 1ffffffff0a326f8 R12: ffff8881315a0000
[ 29.156771][ T2] R13: dffffc0000000000 R14: ffffffff81190728 R15: ffff8881315a26c0
[ 29.156771][ T2] FS: 0000000000000000(0000) GS:ffff8883ae800000(0000) knlGS:0000000000000000
[ 29.156771][ T2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.156771][ T2] CR2: ffffebf7d0000008 CR3: 0000000004c16000 CR4: 00000000000406f0
[ 29.156771][ T2] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 29.156771][ T2] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 29.156771][ T2] Call Trace:
[ 29.156771][ T2] <TASK>
[ 29.156771][ T2] ? find_vm_area (mm/vmalloc.c:2497)
[ 29.156771][ T2] dup_task_struct (include/linux/sched/task.h:148 kernel/fork.c:896)
[ 29.156771][ T2] copy_process (kernel/fork.c:?)
[ 29.156771][ T2] ? __lock_acquire (kernel/locking/lockdep.c:?)
[ 29.156771][ T2] kernel_clone (kernel/fork.c:2585)
[ 29.156771][ T2] ? sched_clock_cpu (kernel/sched/clock.c:292 kernel/sched/clock.c:382)
[ 29.156771][ T2] ? kthread_unuse_mm (kernel/kthread.c:272)
[ 29.156771][ T2] kernel_thread (kernel/fork.c:2637)
[ 29.156771][ T2] ? kthread_unuse_mm (kernel/kthread.c:272)
[ 29.156771][ T2] kthreadd (kernel/kthread.c:351 kernel/kthread.c:685)
[ 29.156771][ T2] ? trace_sched_kthread_stop_ret (kernel/kthread.c:658)
[ 29.156771][ T2] ret_from_fork (??:?)
[ 29.156771][ T2] </TASK>
[ 29.156771][ T2] Modules linked in:
[ 29.156771][ T2] CR2: ffffebf7d0000008
[ 29.156771][ T2] ---[ end trace a8dc7679c1d35edd ]---
[ 29.156771][ T2] RIP: 0010:kfree (include/linux/page-flags.h:198 include/linux/mm.h:863 mm/slub.c:4556)
[ 29.156771][ T2] Code: 00 00 80 72 09 48 8b 0d 8e 1a 69 03 eb 0a 48 b9 00 00 00 80 7f 77 00 00 48 01 d9 48 81 e9 00 00 00 80 48 c1 e9 0c 48 c1 e1 06 <4c> 8b 7c 01 08 41 f6 c7 01 0f 85 d3 00 00 00 48 01 c8 49 89 c7 49
All code
========
0: 00 00 add %al,(%rax)
2: 80 72 09 48 xorb $0x48,0x9(%rdx)
6: 8b 0d 8e 1a 69 03 mov 0x3691a8e(%rip),%ecx # 0x3691a9a
c: eb 0a jmp 0x18
e: 48 b9 00 00 00 80 7f movabs $0x777f80000000,%rcx
15: 77 00 00
18: 48 01 d9 add %rbx,%rcx
1b: 48 81 e9 00 00 00 80 sub $0xffffffff80000000,%rcx
22: 48 c1 e9 0c shr $0xc,%rcx
26: 48 c1 e1 06 shl $0x6,%rcx
2a:* 4c 8b 7c 01 08 mov 0x8(%rcx,%rax,1),%r15 <-- trapping instruction
2f: 41 f6 c7 01 test $0x1,%r15b
33: 0f 85 d3 00 00 00 jne 0x10c
39: 48 01 c8 add %rcx,%rax
3c: 49 89 c7 mov %rax,%r15
3f: 49 rex.WB

Code starting with the faulting instruction
===========================================
0: 4c 8b 7c 01 08 mov 0x8(%rcx,%rax,1),%r15
5: 41 f6 c7 01 test $0x1,%r15b
9: 0f 85 d3 00 00 00 jne 0xe2
f: 48 01 c8 add %rcx,%rax
12: 49 89 c7 mov %rax,%r15
15: 49 rex.WB


To reproduce:

# build kernel
cd linux
cp config-5.16.0-rc1-00007-g7cd6f102201f .config
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (8.22 kB)
config-5.16.0-rc1-00007-g7cd6f102201f (120.92 kB)
job-script (4.99 kB)
dmesg.xz (8.80 kB)
Download all attachments

2021-12-08 03:52:36

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] uaccess-buffer: add core code

On Tue, Nov 23, 2021 at 2:20 AM Alexander Potapenko <[email protected]> wrote:
>
> On Tue, Nov 23, 2021 at 10:56 AM Dmitry Vyukov <[email protected]> wrote:
> >
> > .On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> > >
> > > Add the core code to support uaccess logging. Subsequent patches will
> > > hook this up to the arch-specific kernel entry and exit code for
> > > certain architectures.
> >
> > I don't see where we block signals when a user writes to the addr. I
> > expected to see some get_user from the addr somewhere in the signal
> > handling code. What am I missing?

This happens in the uaccess_buffer_{pre,post}_exit_loop() hooks. These
hooks go around the part of the kernel exit code that deals with
(among other things) signal handling.

> >
> > > Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> > > Signed-off-by: Peter Collingbourne <[email protected]>
> > > ---
> > > v2:
> > > - New interface that avoids multiple syscalls per real syscall and
> > > is arch-generic
> > > - Avoid logging uaccesses done by BPF programs
> > > - Add documentation
> > > - Split up into multiple patches
> > > - Various code moves, renames etc as requested by Marco
> > >
> > > arch/Kconfig | 5 +
> > > fs/exec.c | 2 +
> > > include/linux/instrumented.h | 5 +-
> > > include/linux/sched.h | 4 +
> > > include/linux/uaccess-buffer-log-hooks.h | 59 +++++++++++
> > > include/linux/uaccess-buffer.h | 79 ++++++++++++++
> > > include/uapi/linux/prctl.h | 3 +
> > > include/uapi/linux/uaccess-buffer.h | 25 +++++
> > > kernel/Makefile | 1 +
> > > kernel/bpf/helpers.c | 6 +-
> > > kernel/fork.c | 3 +
> > > kernel/signal.c | 4 +-
> > > kernel/sys.c | 6 ++
> > > kernel/uaccess-buffer.c | 125 +++++++++++++++++++++++
> > > 14 files changed, 324 insertions(+), 3 deletions(-)
> > > create mode 100644 include/linux/uaccess-buffer-log-hooks.h
> > > create mode 100644 include/linux/uaccess-buffer.h
> > > create mode 100644 include/uapi/linux/uaccess-buffer.h
> > > create mode 100644 kernel/uaccess-buffer.c
> > >
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index 26b8ed11639d..6030298a7e9a 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -1302,6 +1302,11 @@ config ARCH_HAS_PARANOID_L1D_FLUSH
> > > config DYNAMIC_SIGFRAME
> > > bool
> > >
> > > +config HAVE_ARCH_UACCESS_BUFFER
> > > + bool
> > > + help
> > > + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> > > +
> > > source "kernel/gcov/Kconfig"
> > >
> > > source "scripts/gcc-plugins/Kconfig"
> > > diff --git a/fs/exec.c b/fs/exec.c
> > > index 537d92c41105..5f30314f3ec6 100644
> > > --- a/fs/exec.c
> > > +++ b/fs/exec.c
> > > @@ -65,6 +65,7 @@
> > > #include <linux/vmalloc.h>
> > > #include <linux/io_uring.h>
> > > #include <linux/syscall_user_dispatch.h>
> > > +#include <linux/uaccess-buffer.h>
> > >
> > > #include <linux/uaccess.h>
> > > #include <asm/mmu_context.h>
> > > @@ -1313,6 +1314,7 @@ int begin_new_exec(struct linux_binprm * bprm)
> > > me->personality &= ~bprm->per_clear;
> > >
> > > clear_syscall_work_syscall_user_dispatch(me);
> > > + uaccess_buffer_set_descriptor_addr_addr(0);
> > >
> > > /*
> > > * We have to apply CLOEXEC before we change whether the process is
> > > diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> > > index 42faebbaa202..c96be1695614 100644
> > > --- a/include/linux/instrumented.h
> > > +++ b/include/linux/instrumented.h
> > > @@ -2,7 +2,7 @@
> > >
> > > /*
> > > * This header provides generic wrappers for memory access instrumentation that
> > > - * the compiler cannot emit for: KASAN, KCSAN.
> > > + * the compiler cannot emit for: KASAN, KCSAN, uaccess buffers.
> > > */
> > > #ifndef _LINUX_INSTRUMENTED_H
> > > #define _LINUX_INSTRUMENTED_H
> > > @@ -11,6 +11,7 @@
> > > #include <linux/kasan-checks.h>
> > > #include <linux/kcsan-checks.h>
> > > #include <linux/types.h>
> > > +#include <linux/uaccess-buffer-log-hooks.h>
> > >
> > > /**
> > > * instrument_read - instrument regular read access
> > > @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> > > {
> > > kasan_check_read(from, n);
> > > kcsan_check_read(from, n);
> > > + uaccess_buffer_log_write(to, n);
> > > }
> > >
> > > /**
> > > @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> > > {
> > > kasan_check_write(to, n);
> > > kcsan_check_write(to, n);
> > > + uaccess_buffer_log_read(from, n);
> > > }
> > >
> > > #endif /* _LINUX_INSTRUMENTED_H */
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 78c351e35fec..1f978deaa3f8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1484,6 +1484,10 @@ struct task_struct {
> > > struct callback_head l1d_flush_kill;
> > > #endif
> > >
> > > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > > + struct uaccess_buffer_info uaccess_buffer;
> > > +#endif
> >
> > Shouldn't this be controlled by an additional config that a user can
> > enable/disable?
> > If I am reading this correctly, the current implementation forces
> > uaccess logging for all arches that support it. Some embed kernels may
> > not want this.

My intent is to get the overhead (for programs that don't use this
feature) low enough that we don't need to make it configurable.

My initial performance measurements showed a large overhead on a
DragonBoard 845c so I spent some time optimizing the entry/exit code
paths. The new code is in the v3 patch series and with that I see
the following, measured with [1] for invalid syscall and a similar
program for uname, which is probably the simplest syscall that does
a copy_*_user() (all measurements in ns/syscall):

small core big core
(Cortex-A55) (Cortex-A75)
before after before after
invalid 242.9 +/- 0.06 244.6 +/- 0.02 147.4 +/- 0.06 148.1 +/- 0.04
uname 601.7 +/- 0.1 607.0 +/- 0.7 283.0 +/- 0.4 285.7 +/-
0.6

So it looks like even in the worst case where the program is doing
nothing but issuing syscalls the overhead is < 1%. I would be surprised
if it was measurable in the real world.

I also measured the overhead on x86 using similar programs (under
virtualization, since I don't currently have a setup to test the
kernel on bare-metal on x86) but there was too much noise for me to
observe a difference.

[1] https://lore.kernel.org/all/CAMn1gO4MwRV8bmFJ_SeY5tsYNPn2ZP56LjAhafygjFaKuu5ouw@mail.gmail.com/

> >
> >
> > > /*
> > > * New fields for task_struct should be added above here, so that
> > > * they are included in the randomized portion of task_struct.
> > > diff --git a/include/linux/uaccess-buffer-log-hooks.h b/include/linux/uaccess-buffer-log-hooks.h
> > > new file mode 100644
> > > index 000000000000..bddc84ddce32
> > > --- /dev/null
> > > +++ b/include/linux/uaccess-buffer-log-hooks.h
> > > @@ -0,0 +1,59 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > > +#define _LINUX_UACCESS_BUFFER_LOG_HOOKS_H
> > > +
> > > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > > +
> > > +struct uaccess_buffer_info {
> > > + /*
> > > + * The pointer to pointer to struct uaccess_descriptor. This is the
> > > + * value controlled by prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR).
> > > + */
> > > + struct uaccess_descriptor __user *__user *desc_ptr_ptr;
> > > +
> > > + /*
> > > + * The pointer to struct uaccess_descriptor read at syscall entry time.
> > > + */
> > > + struct uaccess_descriptor __user *desc_ptr;
> > > +
> > > + /*
> > > + * A pointer to the kernel's temporary copy of the uaccess log for the
> > > + * current syscall. We log to a kernel buffer in order to avoid leaking
> > > + * timing information to userspace.
> > > + */
> > > + struct uaccess_buffer_entry *kbegin;
> > > +
> > > + /*
> > > + * The position of the next uaccess buffer entry for the current
> > > + * syscall.
> > > + */
> > > + struct uaccess_buffer_entry *kcur;
> > > +
> > > + /*
> > > + * A pointer to the end of the kernel's uaccess log.
> > > + */
> > > + struct uaccess_buffer_entry *kend;
> > > +
> > > + /*
> > > + * The pointer to the userspace uaccess log, as read from the
> > > + * struct uaccess_descriptor.
> > > + */
> > > + struct uaccess_buffer_entry __user *ubegin;
> > > +};
> > > +
> > > +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> > > +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> > > +
> > > +#else
> > > +
> > > +static inline void uaccess_buffer_log_read(const void __user *from,
> > > + unsigned long n)
> > > +{
> > > +}
> > > +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > > +{
> > > +}
> > > +
> > > +#endif
> > > +
> > > +#endif /* _LINUX_UACCESS_BUFFER_LOG_HOOKS_H */
> > > diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
> > > new file mode 100644
> > > index 000000000000..14261368d3a9
> > > --- /dev/null
> > > +++ b/include/linux/uaccess-buffer.h
> > > @@ -0,0 +1,79 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_UACCESS_BUFFER_H
> > > +#define _LINUX_UACCESS_BUFFER_H
> > > +
> > > +#include <linux/sched.h>
> > > +#include <uapi/linux/uaccess-buffer.h>
> > > +
> > > +#include <asm-generic/errno-base.h>
> > > +
> > > +#ifdef CONFIG_HAVE_ARCH_UACCESS_BUFFER
> > > +
> > > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > > +{
> > > + return tsk->uaccess_buffer.desc_ptr_ptr;
> > > +}
> > > +
> > > +void __uaccess_buffer_syscall_entry(void);
> > > +static inline void uaccess_buffer_syscall_entry(void)
> > > +{
> > > + if (uaccess_buffer_maybe_blocked(current))
> > > + __uaccess_buffer_syscall_entry();
> > > +}
> > > +
> > > +void __uaccess_buffer_syscall_exit(void);
> > > +static inline void uaccess_buffer_syscall_exit(void)
> > > +{
> > > + if (uaccess_buffer_maybe_blocked(current))
> > > + __uaccess_buffer_syscall_exit();
> > > +}
> > > +
> > > +bool __uaccess_buffer_pre_exit_loop(void);
> > > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > > +{
> > > + if (!uaccess_buffer_maybe_blocked(current))
> > > + return false;
> > > + return __uaccess_buffer_pre_exit_loop();
> > > +}
> > > +
> > > +void __uaccess_buffer_post_exit_loop(void);
> > > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > > +{
> > > + if (pending)
> > > + __uaccess_buffer_post_exit_loop();
> > > +}
> > > +
> > > +void uaccess_buffer_cancel_log(struct task_struct *tsk);
> > > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr);
> > > +
> > > +#else
> > > +
> > > +static inline bool uaccess_buffer_maybe_blocked(struct task_struct *tsk)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > +static inline void uaccess_buffer_syscall_entry(void)
> > > +{
> > > +}
> > > +static inline void uaccess_buffer_syscall_exit(void)
> > > +{
> > > +}
> > > +static inline bool uaccess_buffer_pre_exit_loop(void)
> > > +{
> > > + return false;
> > > +}
> > > +static inline void uaccess_buffer_post_exit_loop(bool pending)
> > > +{
> > > +}
> > > +static inline void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > > +{
> > > +}
> > > +
> > > +static inline int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > > +{
> > > + return -EINVAL;
> > > +}
> > > +#endif
> > > +
> > > +#endif /* _LINUX_UACCESS_BUFFER_H */
> > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > > index bb73e9a0b24f..74b37469c7b3 100644
> > > --- a/include/uapi/linux/prctl.h
> > > +++ b/include/uapi/linux/prctl.h
> > > @@ -272,4 +272,7 @@ struct prctl_mm_map {
> > > # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
> > > # define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
> > >
> > > +/* Configure uaccess logging feature */
> > > +#define PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR 63
> > > +
> > > #endif /* _LINUX_PRCTL_H */
> > > diff --git a/include/uapi/linux/uaccess-buffer.h b/include/uapi/linux/uaccess-buffer.h
> > > new file mode 100644
> > > index 000000000000..619b17dc25c4
> > > --- /dev/null
> > > +++ b/include/uapi/linux/uaccess-buffer.h
> > > @@ -0,0 +1,25 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > +#ifndef _UAPI_LINUX_UACCESS_BUFFER_H
> > > +#define _UAPI_LINUX_UACCESS_BUFFER_H
> > > +
> > > +/* Location of the uaccess log. */
> > > +struct uaccess_descriptor {
> > > + /* Address of the uaccess_buffer_entry array. */
> > > + __u64 addr;
> > > + /* Size of the uaccess_buffer_entry array in number of elements. */
> > > + __u64 size;
> > > +};
> > > +
> > > +/* Format of the entries in the uaccess log. */
> > > +struct uaccess_buffer_entry {
> > > + /* Address being accessed. */
> > > + __u64 addr;
> > > + /* Number of bytes that were accessed. */
> > > + __u64 size;
> > > + /* UACCESS_BUFFER_* flags. */
> > > + __u64 flags;
> > > +};
> > > +
> > > +#define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> > > +
> > > +#endif /* _UAPI_LINUX_UACCESS_BUFFER_H */
> > > diff --git a/kernel/Makefile b/kernel/Makefile
> > > index 186c49582f45..d4d9be5146c3 100644
> > > --- a/kernel/Makefile
> > > +++ b/kernel/Makefile
> > > @@ -114,6 +114,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> > > obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> > > obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > > obj-$(CONFIG_CFI_CLANG) += cfi.o
> > > +obj-$(CONFIG_HAVE_ARCH_UACCESS_BUFFER) += uaccess-buffer.o
> > >
> > > obj-$(CONFIG_PERF_EVENTS) += events/
> > >
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 649f07623df6..167b50177066 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -637,7 +637,11 @@ const struct bpf_func_proto bpf_event_output_data_proto = {
> > > BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
> > > const void __user *, user_ptr)
> > > {
> > > - int ret = copy_from_user(dst, user_ptr, size);
> > > + /*
> > > + * Avoid copy_from_user() here as it may leak information about the BPF
> > > + * program to userspace via the uaccess buffer.
> > > + */
> > > + int ret = raw_copy_from_user(dst, user_ptr, size);
> >
> > Here I am more concerned about KASAN/KMSAN checks.
> > What exactly is the attack vector here? Are these accesses secret?
>
> If there are concerns about leaking information in this particular
> case, any other call to copy_from_user() added in the future will be
> prone to the same issues.
> So if uaccess logging is meant for production use, it should be
> possible to lock the feature down from unwanted users.

This comment was probably poorly worded. This isn't really about
exposing "secret" accesses. It's more that the accesses carried out
by the BPF program are not necessarily in response to the syscall,
nor may they be compliant with the rules that we've set for uaccess
logging. For example they could be using the same trick of reading
PAGE_SIZE bytes that mount() is doing.

If we hit one of these issues in the kernel itself, we may consider
that a bug that we can fix. But we don't necessarily have control
over what BPF programs are doing. We may want to allow BPF programs to
opt into uaccess logging, but I reckon that should be done separately.

I've changed the wording of the comment accordingly.

> > Can't the same info be obtained using userfaultfd/unmapping memory?
> >
> > raw_copy_from_user also skips access_ok, is it ok?

Okay, I've introduced a copy_from_user_nolog() function that disables
uaccess logging and used it here as well as in mount().

> >
> >
> > > if (unlikely(ret)) {
> > > memset(dst, 0, size);
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 3244cc56b697..c7abe7e7c7cd 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -96,6 +96,7 @@
> > > #include <linux/scs.h>
> > > #include <linux/io_uring.h>
> > > #include <linux/bpf.h>
> > > +#include <linux/uaccess-buffer.h>
> > >
> > > #include <asm/pgalloc.h>
> > > #include <linux/uaccess.h>
> > > @@ -890,6 +891,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> > > if (memcg_charge_kernel_stack(tsk))
> > > goto free_stack;
> > >
> > > + uaccess_buffer_cancel_log(tsk);
> >
> > Why do we need this?
> > tsk is a newly allocated task_struct. If I am not mistaken, it's not
> > zero initialized, so are we kfree'ing garbage?
> > But we may need to do something with tasks after arch_dup_task_struct.

Right, the intent was to call this on orig.

Without something like this I think we'll call the syscall exit handler
twice, which would lead to a double free if we're not careful. That
aside, in order to avoid the complexities resulting from e.g. clone3()
with CLONE_VM set, it seems best not to report anything from this
syscall.

I guess I never hit a problem with this in practice because
1) Android uses CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
2) my prototype patch on the userspace side never sets up a uaccess
descriptor for clone() (because it doesn't go via the usual syscall
wrappers that I instrumented).

> >
> > > stack_vm_area = task_stack_vm_area(tsk);
> > >
> > > err = arch_dup_task_struct(tsk, orig);
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index a629b11bf3e0..69bf21518bd0 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -45,6 +45,7 @@
> > > #include <linux/posix-timers.h>
> > > #include <linux/cgroup.h>
> > > #include <linux/audit.h>
> > > +#include <linux/uaccess-buffer.h>
> > >
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/signal.h>
> > > @@ -1031,7 +1032,8 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> > > if (sig_fatal(p, sig) &&
> > > !(signal->flags & SIGNAL_GROUP_EXIT) &&
> > > !sigismember(&t->real_blocked, sig) &&
> > > - (sig == SIGKILL || !p->ptrace)) {
> > > + (sig == SIGKILL ||
> > > + !(p->ptrace || uaccess_buffer_maybe_blocked(p)))) {
> >
> > Why do we need this change?

This code is, as best I can tell, an optimization to handle the
case where a process is sent an asynchronous signal without a signal
handler. In this case, the process's tasks are killed immediately by
starting a group exit and waking them up. This avoids a round trip
through one of the process's tasks, but it doesn't match the documented
behavior in the case where the uaccess descriptor address is non-zero
-- in this case, we need to behave as if most signals are blocked. I
guess it doesn't really matter in the end because no signal handler
would be invoked anyway, but it doesn't seem like we should change
the blocking behavior just because of this optimization. So if a task
has a uaccess descriptor address address, we just send the signal to
a specific thread and let it handle blocking if necessary.

I guess an alternative solution would be to have the sending thread
do a cross-process read of the uaccess descriptor address, but that
seems like more trouble than it's worth.

> >
> > > /*
> > > * This signal will be fatal to the whole group.
> > > */
> > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > index 8fdac0d90504..c71a9a9c0f68 100644
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -42,6 +42,7 @@
> > > #include <linux/version.h>
> > > #include <linux/ctype.h>
> > > #include <linux/syscall_user_dispatch.h>
> > > +#include <linux/uaccess-buffer.h>
> > >
> > > #include <linux/compat.h>
> > > #include <linux/syscalls.h>
> > > @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > > error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> > > break;
> > > #endif
> > > + case PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR:
> > > + if (arg3 || arg4 || arg5)
> > > + return -EINVAL;
> > > + error = uaccess_buffer_set_descriptor_addr_addr(arg2);
> > > + break;
> > > default:
> > > error = -EINVAL;
> > > break;
> > > diff --git a/kernel/uaccess-buffer.c b/kernel/uaccess-buffer.c
> > > new file mode 100644
> > > index 000000000000..e1c6d6ab9af8
> > > --- /dev/null
> > > +++ b/kernel/uaccess-buffer.c
> > > @@ -0,0 +1,125 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Support for uaccess logging via uaccess buffers.
> > > + *
> > > + * Copyright (C) 2021, Google LLC.
> > > + */
> > > +
> > > +#include <linux/compat.h>
> > > +#include <linux/mm.h>
> > > +#include <linux/prctl.h>
> > > +#include <linux/ptrace.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/signal.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/uaccess-buffer.h>
> > > +
> > > +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> > > + unsigned long flags)
> > > +{
> > > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > > + struct uaccess_buffer_entry *entry = buf->kcur;
> > > +
> > > + if (!entry || unlikely(uaccess_kernel()))
> > > + return;
> > > + entry->addr = addr;
> > > + entry->size = size;
> > > + entry->flags = flags;
> > > +
> > > + ++buf->kcur;
> > > + if (buf->kcur == buf->kend)
> > > + buf->kcur = 0;
> >
> > = NULL;

Done.

> > > +}
> > > +
> > > +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> > > +{
> > > + uaccess_buffer_log((unsigned long)from, n, 0);
> > > +}
> > > +EXPORT_SYMBOL(uaccess_buffer_log_read);
> > > +
> > > +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> > > +{
> > > + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> > > +}
> > > +EXPORT_SYMBOL(uaccess_buffer_log_write);
> > > +
> > > +int uaccess_buffer_set_descriptor_addr_addr(unsigned long addr)
> > > +{
> > > + current->uaccess_buffer.desc_ptr_ptr =
> > > + (struct uaccess_descriptor __user * __user *)addr;
> > > + uaccess_buffer_cancel_log(current);
> >
> > Is this necessary? It looks more reasonable and useful to not call
> > cancel. In most cases the user won't setup it twice/change. But if the
> > user anyhow asked to trace the prctl, why not trace it?

No particularly good reason. I previously guarded all the uaccess
buffer hooks using the desc_ptr_ptr field, so we could end up not
following the syscall exit code path if someone set this to 0. This
prctl doesn't do any uaccesses, so I guess it doesn't really matter
in the end. But we can relatively easily arrange for the syscall exit
code to check kcur instead, and since we need to check that anyway,
I guess it makes the code slightly more efficient.

> >
> > > + return 0;
> > > +}
> > > +
> > > +bool __uaccess_buffer_pre_exit_loop(void)
> > > +{
> > > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > > + struct uaccess_descriptor __user *desc_ptr;
> > > + sigset_t tmp_mask;
> > > +
> > > + if (get_user(desc_ptr, buf->desc_ptr_ptr) || !desc_ptr)
> > > + return false;
> > > +
> > > + current->real_blocked = current->blocked;
> > > + sigfillset(&tmp_mask);
> > > + set_current_blocked(&tmp_mask);
> > > + return true;
> > > +}
> > > +
> > > +void __uaccess_buffer_post_exit_loop(void)
> > > +{
> > > + spin_lock_irq(&current->sighand->siglock);
> > > + current->blocked = current->real_blocked;
> > > + recalc_sigpending();
> > > + spin_unlock_irq(&current->sighand->siglock);
> > > +}
> > > +
> > ;> +void uaccess_buffer_cancel_log(struct task_struct *tsk)
> > > +{
> > > + struct uaccess_buffer_info *buf = &tsk->uaccess_buffer;
> > > +
> > > + if (buf->kcur) {
> >
> > uaccess_buffer_log sets kcur to NULL on overflow and we call this
> > function from a middle of fork, it looks strange that kfree'ing
> > something depends on the previous buffer overflow. Should we check
> > kbegin here?

It's worse than this because we lose our place in the kernel-allocated
buffer when we set kcur to NULL (well, I guess we can recover by
noticing that kbegin != NULL and kcur == NULL, but that adds a bit more
complexity). I'm not sure why I wrote it this way, I think it was just
a remnant of when I had the buffers copied to userspace directly. Now
I leave kcur pointing at the end of the buffer if we overflow.

> > > + buf->kcur = 0;
> >
> > = NULL
> > I would also set kend to NULL to not leave a dangling pointer.

Yes, and kbegin when we get around to destroying this (now in
uaccess_buffer_free).

> >
> > > + kfree(buf->kbegin);
> > > + }
> > > +}
> > > +
> > > +void __uaccess_buffer_syscall_entry(void)
> > > +{
> > > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > > + struct uaccess_descriptor desc;
> > > +
> > > + if (get_user(buf->desc_ptr, buf->desc_ptr_ptr) || !buf->desc_ptr ||
> > > + put_user(0, buf->desc_ptr_ptr) ||
> > > + copy_from_user(&desc, buf->desc_ptr, sizeof(desc)))
> > > + return;
> > > +
> > > + if (desc.size > 1024)
> > > + desc.size = 1024;
> > > +
> > > + buf->kbegin = kmalloc_array(
> > > + desc.size, sizeof(struct uaccess_buffer_entry), GFP_KERNEL);
> >
> > Is not handling error intentional here?
> > Maybe it's better to check the error just to make the code more
> > explicit (and maybe prevent some future bugs).

Done.

> > Actually an interesting attack vector. If you can make this kmalloc
> > fail, you can prevent sanitizers from detecting any of the bad
> > accesses :)
> >
> > Does it make sense to flag the error somewhere in the desc?... or I am
> > thinking if we should pre-allocate the buffer, if we start tracing a
> > task, we will trace lots of syscalls, so avoiding kmalloc/kfree per
> > syscall can make sense. What do you think?

Makes sense. Now I preserve kbegin and kend between syscalls
(reallocing as necessary) and only free it together with the
task_struct.

> > > + buf->kcur = buf->kbegin;
> > > + buf->kend = buf->kbegin + desc.size;
> > > + buf->ubegin = (struct uaccess_buffer_entry __user *)desc.addr;
> > > +}
> > > +
> > > +void __uaccess_buffer_syscall_exit(void)
> > > +{
> > > + struct uaccess_buffer_info *buf = &current->uaccess_buffer;
> > > + u64 num_entries = buf->kcur - buf->kbegin;
> > > + struct uaccess_descriptor desc;
> > > +
> > > + if (!buf->kcur)
> >
> > uaccess_buffer_log sets kcur to NULL on overflow. I think we need to
> > check kbegin here.

Fixed by changing the overflow behavior per above.

> >
> >
> > > + return;
> > > +
> > > + desc.addr = (u64)(buf->ubegin + num_entries);
> > > + desc.size = buf->kend - buf->kcur;
> >
> > Is the user expected to use size in any of reasonable scenarios?
> > We cap size at 1024 at entry, so the size can be truncated here.
> >
> > > + buf->kcur = 0;
> >
> > = NULL

Done.

> >
> > > + if (copy_to_user(buf->ubegin, buf->kbegin,
> > > + num_entries * sizeof(struct uaccess_buffer_entry)) == 0)
> > > + (void)copy_to_user(buf->desc_ptr, &desc, sizeof(desc));
> > > +
> > > + kfree(buf->kbegin);
> >
> > What if we enter exit/exit_group with logging enabled, won't we leak the buffer?

Right. Fixed by adding a hook to task_struct destruction.

Peter

2021-12-08 03:53:44

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] fs: use raw_copy_from_user() to copy mount() data

On Tue, Nov 23, 2021 at 2:09 AM Alexander Potapenko <[email protected]> wrote:
>
> On Tue, Nov 23, 2021 at 8:51 AM Dmitry Vyukov <[email protected]> wrote:
> >
> > On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> > >
> > > With uaccess logging the contract is that the kernel must not report
> > > accessing more data than necessary, as this can lead to false positive
> > > reports in downstream consumers. This generally works out of the box
> > > when instrumenting copy_{from,to}_user(), but with the data argument
> > > to mount() we use copy_from_user() to copy PAGE_SIZE bytes (or as
> > > much as we can, if the PAGE_SIZE sized access failed) and figure out
> > > later how much we actually need.
> > >
> > > To prevent this from leading to a false positive report, use
> > > raw_copy_from_user(), which will prevent the access from being logged.
> > > Recall that it is valid for the kernel to report accessing less
> > > data than it actually accessed, as uaccess logging is a best-effort
> > > mechanism for reporting uaccesses.
> > >
> > > Link: https://linux-review.googlesource.com/id/I5629b92a725c817acd9a861288338dd605cafee6
> > > Signed-off-by: Peter Collingbourne <[email protected]>
> > > ---
> > > fs/namespace.c | 7 ++++++-
> > > 1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > index 659a8f39c61a..695b30e391f0 100644
> > > --- a/fs/namespace.c
> > > +++ b/fs/namespace.c
> > > @@ -3197,7 +3197,12 @@ static void *copy_mount_options(const void __user * data)
> > > if (!copy)
> > > return ERR_PTR(-ENOMEM);
> > >
> > > - left = copy_from_user(copy, data, PAGE_SIZE);
> > > + /*
> > > + * Use raw_copy_from_user to avoid reporting overly large accesses in
> > > + * the uaccess buffer, as this can lead to false positive reports in
> > > + * downstream consumers.
> > > + */
> > > + left = raw_copy_from_user(copy, data, PAGE_SIZE);
>
> I don't really like the idea of using raw_copy_from_user() anywhere.
> Right now users of instrumented.h can more or less assume they see all
> usercopy events, and removing the copy_from_user() call from here
> looks like a regression.
>
> Cannot the usercopy logger decide whether it wants to log the access
> based on the size (e.g. skip accesses >= PAGE_SIZE)?
> Will it help if we can instrument both sides of copy_from_user() (see
> the code here: https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/14103/4)?
>
> If not, maybe we can disable/enable uaccess logging for the current
> task around these accesses?

This seems reasonable, done in v3.

Peter

2021-12-10 21:30:12

by Peter Collingbourne

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] Documentation: document uaccess logging

On Mon, Nov 22, 2021 at 11:46 PM Dmitry Vyukov <[email protected]> wrote:
>
> On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <[email protected]> wrote:
> >
> > Add documentation for the uaccess logging feature.
> >
> > Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
> > Signed-off-by: Peter Collingbourne <[email protected]>
> > ---
> > Documentation/admin-guide/index.rst | 1 +
> > Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
> > 2 files changed, 150 insertions(+)
> > create mode 100644 Documentation/admin-guide/uaccess-logging.rst
> >
> > diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> > index 1bedab498104..4f6ee447ab2f 100644
> > --- a/Documentation/admin-guide/index.rst
> > +++ b/Documentation/admin-guide/index.rst
> > @@ -54,6 +54,7 @@ ABI will be found here.
> > :maxdepth: 1
> >
> > sysfs-rules
> > + uaccess-logging
> >
> > The rest of this manual consists of various unordered guides on how to
> > configure specific aspects of kernel behavior to your liking.
> > diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
> > new file mode 100644
> > index 000000000000..4b2b297afc00
> > --- /dev/null
> > +++ b/Documentation/admin-guide/uaccess-logging.rst
> > @@ -0,0 +1,149 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============
> > +Uaccess Logging
> > +===============
> > +
> > +Background
> > +----------
> > +
> > +Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
> > +making use of the ARM Memory Tagging Extension (MTE) need to
> > +monitor all memory accesses in a program so that they can detect
> > +memory errors. Furthermore, fuzzing tools such as syzkaller need to
> > +monitor all memory accesses so that they know which parts of memory
> > +to fuzz. For accesses made purely in userspace, this is achieved
> > +via compiler instrumentation, or for MTE, via direct hardware
> > +support. However, accesses made by the kernel on behalf of the user
> > +program via syscalls (i.e. uaccesses) are normally invisible to
> > +these tools.
> > +
> > +Traditionally, the sanitizers have handled this by interposing the libc
> > +syscall stubs with a wrapper that checks the memory based on what we
> > +believe the uaccesses will be. However, this creates a maintenance
> > +burden: each syscall must be annotated with its uaccesses in order
> > +to be recognized by the sanitizer, and these annotations must be
> > +continuously updated as the kernel changes.
> > +
> > +The kernel's uaccess logging feature provides userspace tools with
> > +the address and size of each userspace access, thereby allowing these
> > +tools to report memory errors involving these accesses without needing
> > +annotations for every syscall.
> > +
> > +By relying on the kernel's actual uaccesses, rather than a
> > +reimplementation of them, the userspace memory safety tools may
> > +play a dual role of verifying the validity of kernel accesses. Even
> > +a sanitizer whose syscall wrappers have complete knowledge of the
> > +kernel's intended API may vary from the kernel's actual uaccesses due
> > +to kernel bugs. A sanitizer with knowledge of the kernel's actual
> > +uaccesses may produce more accurate error reports that reveal such
> > +bugs. For example, a kernel that accesses more memory than expected
> > +by the userspace program could indicate that either userspace or the
> > +kernel has the wrong idea about which kernel functionality is being
> > +requested -- either way, there is a bug.
> > +
> > +Interface
> > +---------
> > +
> > +The feature may be used via the following prctl:
> > +
> > +.. code-block:: c
> > +
> > + uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
> > + prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
> > +
> > +Supplying a non-zero address as the second argument to ``prctl``
>
> Is it possible to unregister it? Is it what happens when 0 is passed
> as addr? If so, please describe.
> It may be handy to do one-off tracing with the address on stack.

Yes, done in v3.

> > +will cause the kernel to read an address (referred to as the *uaccess
> > +descriptor address*) from that address on each kernel entry.
> > +
> > +When entering the kernel with a non-zero uaccess descriptor address
> > +to handle a syscall, the kernel will read a data structure of type
> > +``struct uaccess_descriptor`` from the uaccess descriptor address,
> > +which is defined as follows:
> > +
> > +.. code-block:: c
> > +
> > + struct uaccess_descriptor {
> > + uint64_t addr, size;
> > + };
>
> Want to double check the extension story. If we ever want flags in
> uaccess_descriptor, we can add a flag to prctl that would say that
> address must point to uaccess_descriptor_v2 that contains flags,
> right?
> And similarly we can extend uaccess_buffer_entry, right?

Yes, we can specify a flag bit in e.g. the third argument to prctl,
which could switch us to using new struct definitions for the uaccess
descriptor and uaccess buffer entries.

> > +This data structure contains the address and size (in array elements)
> > +of a *uaccess buffer*, which is an array of data structures of type
> > +``struct uaccess_buffer_entry``. Before returning to userspace, the
> > +kernel will log information about uaccesses to sequential entries
> > +in the uaccess buffer. It will also store ``NULL`` to the uaccess
> > +descriptor address, and store the address and size of the unused
> > +portion of the uaccess buffer to the uaccess descriptor.
> > +
> > +The format of a uaccess buffer entry is defined as follows:
> > +
> > +.. code-block:: c
> > +
> > + struct uaccess_buffer_entry {
> > + uint64_t addr, size, flags;
> > + };
> > +
> > +The meaning of ``addr`` and ``size`` should be obvious. On arm64,
>
> I would say explicitly "addr and size contain address and size of the
> user memory access".

Done in v3.

> > +tag bits are preserved in the ``addr`` field. There is currently
> > +one flag bit assignment for the ``flags`` field:
> > +
> > +.. code-block:: c
> > +
> > + #define UACCESS_BUFFER_FLAG_WRITE 1
> > +
> > +This flag is set if the access was a write, or clear if it was a
> > +read. The meaning of all other flag bits is reserved.
> > +
> > +When entering the kernel with a non-zero uaccess descriptor
> > +address for a reason other than a syscall (for example, when
> > +IPI'd due to an incoming asynchronous signal), any signals other
> > +than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> > +``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> > +initialized with ``sigfillset(set)``. This is to prevent incoming
> > +signals from interfering with uaccess logging.
> > +
> > +Example
> > +-------
> > +
> > +Here is an example of a code snippet that will enumerate the accesses
> > +performed by a ``uname(2)`` syscall:
> > +
> > +.. code-block:: c
> > +
> > + struct uaccess_buffer_entry entries[64];
> > + struct uaccess_descriptor desc;
> > + uint64_t desc_addr = 0;
> > + prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
> > +
> > + desc.addr = (uint64_t)&entries;
> > + desc.size = 64;
> > + desc_addr = (uint64_t)&desc;
>
> We don't need any additional compiler barriers here, right?
> It seems that we only need to prevent re-ordering of these writes with
> the next and previous syscalls, which the compiler should do already.

Right. From the compiler's perspective the address of desc_addr is
leaked at the prctl call site, so any external function call
(including syscalls) could read or write to it.

Peter