2024-04-04 19:03:17

by Marco Elver

Subject: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

With all the known caveats, tracing BPF programs may directly write to
user-space memory with the bpf_probe_write_user() helper. Memory safety
is an obvious problem when using this helper, since it is too easy to
overwrite memory across all running processes that user space did not
expect to be touched (neither the verifier nor the kernel knows what may
be touched). While it is possible to come up with mechanisms to safely
communicate to the BPF program which memory region may be written to,
there are no built-in guarantees of safety. For this reason, the helper
produces a warning in the kernel log, and in newer kernels it is
possible to disallow use of the helper since 51e1bb9eeaf7 ("bpf: Add
lockdown check for probe_write_user helper").

Nevertheless, direct user-space memory writes from BPF programs can be
useful to efficiently manipulate and communicate with cooperating user
space processes.

For example, one of our use cases is for events that happen relatively
frequently in the kernel (e.g. specific scheduler events), but a set of
user space threads want to check for such events in very hot code paths
to make more optimal decisions (the cost of such a check can be no more
than a load and compare). The types of events and heuristics used may
change based on system environment and application, and a BPF program
provides the best trade-offs in terms of performance and deployment.

To achieve better safety, introduce tagged user writable regions that
must explicitly be registered before tracing BPF programs may use them:

1. The prctl() option PR_BPF_REGISTER_WRITABLE allows any user space
process (that is allowed to use prctl()) to register tagged writable
memory regions for the current thread.

2. Conversely, the prctl() option PR_BPF_UNREGISTER_WRITABLE allows a
user space process to unregister a writable memory region that was
previously registered from the current thread. This must be done
before freeing memory if the thread that registered a region is
still running.

3. Tracing BPF programs may write to any registered region in the
current thread with bpf_probe_write_user_registered(). If the memory
region has been tagged with a non-zero value, the BPF program must
provide a matching tag.

Admin capabilities are still required to attach BPF programs that use
the new bpf_probe_write_user_registered() helper.

With this interface, user space threads are guaranteed that no writes
happen to regions that they did not explicitly register. Tagging can be
used to associate additional semantics with the memory region.

A note on tag allocation: Since such programs can only be installed by
the local system administrator, tag allocation may be done by the system
administrator. For example, by providing headers with tag definitions,
or a central service to distribute tags to the BPF program loader and to
user applications.
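
For illustration, a minimal user space sketch of the intended usage
(error handling elided; the struct layout, tag value and function names
are examples only, not part of this patch):

  #include <stdint.h>
  #include <sys/prctl.h>  /* PR_BPF_*_WRITABLE from this patch's uapi update */

  #define MY_TAG 0x1234   /* example tag agreed upon with the BPF program */

  struct hints {
          volatile uint64_t sched_events; /* updated by the tracing BPF program */
  };
  static struct hints hints;

  static void hints_register(void)
  {
          /* arg2 = start, arg3 = size, arg4 = tag */
          prctl(PR_BPF_REGISTER_WRITABLE, &hints, sizeof(hints), MY_TAG, 0);
  }

  static void hints_unregister(void)
  {
          /* Must be called before the memory is freed or reused. */
          prctl(PR_BPF_UNREGISTER_WRITABLE, &hints, sizeof(hints), 0, 0);
  }

  /* Hot path: checking for a new event is a load and a compare. */
  static inline int sched_event_pending(uint64_t last_seen)
  {
          return hints.sched_events != last_seen;
  }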

Signed-off-by: Marco Elver <[email protected]>
---
Documentation/bpf/bpf_design_QA.rst | 3 +-
fs/exec.c | 2 +
include/linux/sched.h | 5 +
include/linux/trace_events.h | 19 +++
include/uapi/linux/bpf.h | 16 ++
include/uapi/linux/capability.h | 3 +-
include/uapi/linux/prctl.h | 5 +
kernel/fork.c | 6 +
kernel/sys.c | 7 +
kernel/trace/bpf_trace.c | 220 +++++++++++++++++++++++++++-
tools/include/uapi/linux/bpf.h | 16 ++
tools/include/uapi/linux/prctl.h | 5 +
12 files changed, 303 insertions(+), 4 deletions(-)

diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index eb19c945f4d5..5f081d254a44 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -254,7 +254,8 @@ A: Sort-of.
Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
program is loaded the kernel will print warning message, so
-this helper is only useful for experiments and prototypes.
+this helper is only useful for experiments and prototypes. A safer,
+but more limited, alternative is bpf_probe_write_user_registered().
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
diff --git a/fs/exec.c b/fs/exec.c
index cf1df7f16e55..38bf71cbdf5e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
#include <linux/syscall_user_dispatch.h>
#include <linux/coredump.h>
#include <linux/time_namespace.h>
+#include <linux/trace_events.h>
#include <linux/user_events.h>
#include <linux/rseq.h>

@@ -1881,6 +1882,7 @@ static int bprm_execve(struct linux_binprm *bprm)
user_events_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
+ bpf_user_writable_free(current);
return retval;

out:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..81ab3b5df430 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
struct blk_plug;
struct bpf_local_storage;
struct bpf_run_ctx;
+struct bpf_user_writable;
struct capture_control;
struct cfs_rq;
struct fs_struct;
@@ -1501,6 +1502,10 @@ struct task_struct {
struct bpf_local_storage __rcu *bpf_storage;
/* Used for BPF run context */
struct bpf_run_ctx *bpf_ctx;
+#ifdef CONFIG_BPF_EVENTS
+ /* Used for bpf_probe_write_user_registered() */
+ struct bpf_user_writable *bpf_user_writable;
+#endif
#endif

#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6f9bdfb09d1d..658c1bfc9048 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -775,6 +775,10 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
unsigned long *missed);
int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int bpf_user_writable_register(void __user *start, size_t size, u32 tag);
+int bpf_user_writable_unregister(void __user *start, size_t size);
+int bpf_user_writable_copy(struct task_struct *t, u64 clone_flags);
+void bpf_user_writable_free(struct task_struct *t);
#else
static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
@@ -826,6 +830,21 @@ bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
{
return -EOPNOTSUPP;
}
+static inline int bpf_user_writable_register(void __user *start, size_t size, u32 tag)
+{
+ return -EOPNOTSUPP;
+}
+static inline int bpf_user_writable_unregister(void __user *start, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+static inline int bpf_user_writable_copy(struct task_struct *t, u64 clone_flags)
+{
+ return 0;
+}
+static inline void bpf_user_writable_free(struct task_struct *t)
+{
+}
#endif

enum {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3c42b9f1bada..f5b86792b99d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5782,6 +5782,21 @@ union bpf_attr {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_probe_write_user_registered(void *dst, const void *src, u32 len, u32 tag)
+ * Description
+ * Attempt in a safe way to write *len* bytes from the buffer *src*
+ * to *dst* in memory. Writes are permitted for threads that have
+ * registered the target memory region as writable via the prctl()
+ * PR_BPF_REGISTER_WRITABLE. If the region was registered with a
+ * non-zero tag, a matching *tag* must be provided.
+ *
+ * This helper should not be used to implement any kind of
+ * security mechanism because of TOC-TOU attacks, but rather to
+ * debug, divert, and manipulate execution of cooperative
+ * processes.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5996,6 +6011,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(probe_write_user_registered, 212, ##ctx) \
/* */

/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 5bb906098697..144db9d2bea9 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -401,7 +401,8 @@ struct vfs_ns_cap_data {
* - bpf_probe_read to read arbitrary kernel memory is allowed
* - bpf_trace_printk to print kernel memory is allowed
*
- * CAP_SYS_ADMIN is required to use bpf_probe_write_user.
+ * CAP_SYS_ADMIN is required to use bpf_probe_write_user and
+ * bpf_probe_write_user_registered.
*
* CAP_SYS_ADMIN is required to iterate system wide loaded
* programs, maps, links, BTFs and convert their IDs to file descriptors.
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..4a9372e675ae 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,9 @@ struct prctl_mm_map {
# define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK 0xc
# define PR_RISCV_V_VSTATE_CTRL_MASK 0x1f

+/* Register tagged writable memory region for the current task. */
+#define PR_BPF_REGISTER_WRITABLE 71
+/* Unregister tagged writable memory region for the current task. */
+#define PR_BPF_UNREGISTER_WRITABLE 72
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 39a5046c2f0b..3ea87b1e5806 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -98,6 +98,7 @@
#include <linux/io_uring.h>
#include <linux/bpf.h>
#include <linux/stackprotector.h>
+#include <linux/trace_events.h>
#include <linux/user_events.h>
#include <linux/iommu.h>
#include <linux/rseq.h>
@@ -606,6 +607,7 @@ void free_task(struct task_struct *tsk)
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
bpf_task_storage_free(tsk);
+ bpf_user_writable_free(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -2353,6 +2355,9 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_BPF_SYSCALL
RCU_INIT_POINTER(p->bpf_storage, NULL);
p->bpf_ctx = NULL;
+ retval = bpf_user_writable_copy(p, clone_flags);
+ if (retval)
+ goto bad_fork_cleanup_policy;
#endif

/* Perform scheduler related setup. Assign this task to a CPU. */
@@ -2664,6 +2669,7 @@ __latent_entropy struct task_struct *copy_process(
bad_fork_cleanup_perf:
perf_event_free_task(p);
bad_fork_cleanup_policy:
+ bpf_user_writable_free(p);
lockdep_free_task(p);
#ifdef CONFIG_NUMA
mpol_put(p->mempolicy);
diff --git a/kernel/sys.c b/kernel/sys.c
index 8bb106a56b3a..c317e462600b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/mount.h>
#include <linux/gfp.h>
#include <linux/syscore_ops.h>
+#include <linux/trace_events.h>
#include <linux/version.h>
#include <linux/ctype.h>
#include <linux/syscall_user_dispatch.h>
@@ -2760,6 +2761,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_V_GET_CONTROL:
error = RISCV_V_GET_CONTROL();
break;
+ case PR_BPF_REGISTER_WRITABLE:
+ error = bpf_user_writable_register((void __user *)arg2, arg3, arg4);
+ break;
+ case PR_BPF_UNREGISTER_WRITABLE:
+ error = bpf_user_writable_unregister((void __user *)arg2, arg3);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0a5c4efc73c3..643454c398a2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -82,6 +82,181 @@ static struct bpf_raw_event_map *bpf_get_raw_tracepoint_module(const char *name)
}
#endif /* CONFIG_MODULES */

+/* Memory region writable with bpf_probe_write_user_registered(). */
+struct bpf_user_writable_entry {
+ void __user *start;
+ size_t size;
+ u32 tag;
+};
+
+struct bpf_user_writable {
+ u32 capacity;
+ u32 size;
+ struct bpf_user_writable_entry entries[];
+};
+
+static inline int bpf_user_writable_entry_cmp(const void *_key, const void *_ent)
+{
+ const struct bpf_user_writable_entry *key = (const struct bpf_user_writable_entry *)_key;
+ const struct bpf_user_writable_entry *ent = (const struct bpf_user_writable_entry *)_ent;
+ void __user *key_end = key->start + key->size - 1;
+ void __user *ent_end = ent->start + ent->size - 1;
+
+ if (ent->start <= key->start && key_end <= ent_end)
+ return 0; /* key memory range contained in entry */
+
+ return key->start < ent->start ? -1 : 1;
+}
+
+/*
+ * Returns true if the memory region from @ptr with size @size is a subset of
+ * any registered region of the current task.
+ */
+static const struct bpf_user_writable_entry *bpf_user_writable_find(void __user *ptr, size_t size)
+{
+ const struct bpf_user_writable_entry key = {.start = ptr, .size = size};
+ const struct bpf_user_writable *uw = current->bpf_user_writable;
+
+ if (!uw)
+ return NULL;
+
+ /*
+ * We want bpf_probe_write_user_registered() to be as fast as the
+ * non-registered version (for small uw->size): regular bsearch() does
+ * too many calls, esp. because bpf_user_writable_entry_cmp() needs to
+ * be outlined. Use the inline version instead, which also gives the
+ * compiler a chance to inline bpf_user_writable_entry_cmp().
+ */
+ return __inline_bsearch(&key, uw->entries, uw->size, sizeof(key), bpf_user_writable_entry_cmp);
+}
+
+int bpf_user_writable_register(void __user *start, size_t size, u32 tag)
+{
+ const struct bpf_user_writable_entry *exist;
+ struct bpf_user_writable_entry *ent;
+ struct bpf_user_writable *uw;
+
+ /*
+ * Sanity check the user didn't try to register inaccessible memory.
+ */
+ if (!start || !size || !access_ok(start, size))
+ return -EFAULT;
+
+ /*
+ * Trying to add a range with the same address as already added is most
+ * likely a bug - ensure the range was unregistered before re-adding.
+ *
+ * This also ensures that we always retain the property that no two
+ * entries have the same start and we can sort them based on the start
+ * address alone.
+ */
+ exist = bpf_user_writable_find(start, 1);
+ if (exist && exist->start == start)
+ return -EEXIST;
+
+ if (current->bpf_user_writable) {
+ uw = current->bpf_user_writable;
+ } else { /* initial alloc */
+ uw = kmalloc(struct_size(uw, entries, 1), GFP_KERNEL);
+ if (!uw)
+ return -ENOMEM;
+ uw->capacity = 1;
+ uw->size = 0;
+ current->bpf_user_writable = uw;
+ }
+
+ if (uw->size == uw->capacity) { /* grow exponentially */
+ const u32 ncap = uw->capacity * 2;
+ struct bpf_user_writable *new_uw;
+
+ /* Upper bound of 2^31 entries - should be more than enough. */
+ if (uw->capacity > S32_MAX)
+ return -ENOMEM;
+
+ new_uw = krealloc(uw, struct_size(uw, entries, ncap), GFP_KERNEL);
+ if (!new_uw)
+ return -ENOMEM;
+
+ current->bpf_user_writable = uw = new_uw;
+ uw->capacity = ncap;
+ }
+
+ /* Insert new entry and sort. */
+ ent = &uw->entries[uw->size++];
+ ent->start = start;
+ ent->size = size;
+ ent->tag = tag;
+
+ sort(uw->entries, uw->size, sizeof(*ent), bpf_user_writable_entry_cmp, NULL);
+
+ return 0;
+}
+
+int bpf_user_writable_unregister(void __user *start, size_t size)
+{
+ const struct bpf_user_writable_entry *ent = bpf_user_writable_find(start, size);
+ struct bpf_user_writable *uw = current->bpf_user_writable;
+ size_t del_idx;
+
+ if (!ent)
+ return -ENOENT;
+
+ /* To unregister, require the precise range as used on registration. */
+ if (ent->start != start || ent->size != size)
+ return -ENOENT;
+
+ /*
+ * Shift all entries after the to-be-deleted one left by one. The array
+ * remains sorted.
+ */
+ del_idx = ent - uw->entries;
+ for (size_t i = del_idx + 1; i < uw->size; ++i)
+ uw->entries[i - 1] = uw->entries[i];
+ uw->size--;
+
+ return 0;
+}
+
+int bpf_user_writable_copy(struct task_struct *t, u64 clone_flags)
+{
+ const struct bpf_user_writable *src;
+ struct bpf_user_writable *dst;
+
+ if (WARN_ON_ONCE(t == current))
+ return 0;
+
+ t->bpf_user_writable = NULL;
+
+ src = current->bpf_user_writable;
+ if (!src || (clone_flags & CLONE_VM)) {
+ /*
+ * Shared VM: this thread gets its own user writable regions.
+ */
+ return 0;
+ }
+
+ /*
+ * A fork()'ed task (with separate VM) should behave as closely to the
+ * original task as possible, and we will copy the writable regions.
+ */
+ dst = kmalloc(struct_size(dst, entries, src->capacity), GFP_KERNEL);
+ if (!dst)
+ return -ENOMEM;
+ memcpy(dst, src, struct_size(dst, entries, src->size));
+ t->bpf_user_writable = dst;
+
+ return 0;
+}
+
+void bpf_user_writable_free(struct task_struct *t)
+{
+ if (!t->bpf_user_writable)
+ return;
+
+ kfree(t->bpf_user_writable);
+ t->bpf_user_writable = NULL;
+}
+
u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);

@@ -324,8 +499,8 @@ static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
};
#endif /* CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE */

-BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
- u32, size)
+static __always_inline int
+bpf_probe_write_user_common(void __user *unsafe_ptr, const void *src, u32 size)
{
/*
* Ensure we're in user context which is safe for the helper to
@@ -349,6 +524,12 @@ BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
return copy_to_user_nofault(unsafe_ptr, src, size);
}

+BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
+ u32, size)
+{
+ return bpf_probe_write_user_common(unsafe_ptr, src, size);
+}
+
static const struct bpf_func_proto bpf_probe_write_user_proto = {
.func = bpf_probe_write_user,
.gpl_only = true,
@@ -369,6 +550,39 @@ static const struct bpf_func_proto *bpf_get_probe_write_proto(void)
return &bpf_probe_write_user_proto;
}

+BPF_CALL_4(bpf_probe_write_user_registered, void __user *, unsafe_ptr, const void *, src,
+ u32, size, u32, tag)
+{
+ const struct bpf_user_writable_entry *ent = bpf_user_writable_find(unsafe_ptr, size);
+
+ if (!ent)
+ return -EPERM;
+
+ /* A region with tag 0 matches any tag. */
+ if (ent->tag && ent->tag != tag)
+ return -EPERM;
+
+ return bpf_probe_write_user_common(unsafe_ptr, src, size);
+}
+
+static const struct bpf_func_proto bpf_probe_write_user_registered_proto = {
+ .func = bpf_probe_write_user_registered,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_PTR_TO_MEM | MEM_RDONLY,
+ .arg3_type = ARG_CONST_SIZE,
+ .arg4_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *bpf_get_probe_write_registered_proto(void)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return NULL;
+
+ return &bpf_probe_write_user_registered_proto;
+}
+
#define MAX_TRACE_PRINTK_VARARGS 3
#define BPF_TRACE_PRINTK_SIZE 1024

@@ -1552,6 +1766,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_probe_write_user:
return security_locked_down(LOCKDOWN_BPF_WRITE_USER) < 0 ?
NULL : bpf_get_probe_write_proto();
+ case BPF_FUNC_probe_write_user_registered:
+ return bpf_get_probe_write_registered_proto();
case BPF_FUNC_probe_read_user:
return &bpf_probe_read_user_proto;
case BPF_FUNC_probe_read_kernel:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3c42b9f1bada..f5b86792b99d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5782,6 +5782,21 @@ union bpf_attr {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_probe_write_user_registered(void *dst, const void *src, u32 len, u32 tag)
+ * Description
+ * Attempt in a safe way to write *len* bytes from the buffer *src*
+ * to *dst* in memory. Writes are permitted for threads that have
+ * registered the target memory region as writable via the prctl()
+ * PR_BPF_REGISTER_WRITABLE. If the region was registered with a
+ * non-zero tag, a matching *tag* must be provided.
+ *
+ * This helper should not be used to implement any kind of
+ * security mechanism because of TOC-TOU attacks, but rather to
+ * debug, divert, and manipulate execution of cooperative
+ * processes.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5996,6 +6011,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(probe_write_user_registered, 212, ##ctx) \
/* */

/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 370ed14b1ae0..4a9372e675ae 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -306,4 +306,9 @@ struct prctl_mm_map {
# define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK 0xc
# define PR_RISCV_V_VSTATE_CTRL_MASK 0x1f

+/* Register tagged writable memory region for the current task. */
+#define PR_BPF_REGISTER_WRITABLE 71
+/* Unregister tagged writable memory region for the current task. */
+#define PR_BPF_UNREGISTER_WRITABLE 72
+
#endif /* _LINUX_PRCTL_H */
--
2.44.0.478.gd926399ef9-goog



2024-04-04 19:29:04

by Marco Elver

Subject: [PATCH bpf-next 2/2] selftests/bpf: Add test for bpf_probe_write_user_registered()

Introduce a BPF test program and user space code to test
bpf_probe_write_user_registered().

The test program also demonstrates two ways a BPF program may obtain
the addresses it can write to: either by tracing prctl() or by simply
accessing current->bpf_user_writable directly.

Signed-off-by: Marco Elver <[email protected]>
---
.../prog_tests/probe_write_user_registered.c | 325 ++++++++++++++++++
.../progs/test_probe_write_user_registered.c | 219 ++++++++++++
2 files changed, 544 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/probe_write_user_registered.c
create mode 100644 tools/testing/selftests/bpf/progs/test_probe_write_user_registered.c

diff --git a/tools/testing/selftests/bpf/prog_tests/probe_write_user_registered.c b/tools/testing/selftests/bpf/prog_tests/probe_write_user_registered.c
new file mode 100644
index 000000000000..78ac0756d365
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/probe_write_user_registered.c
@@ -0,0 +1,325 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023, Google LLC. */
+
+#include <malloc.h>
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/prctl.h>
+#include <time.h>
+
+#include <test_progs.h>
+#include "test_probe_write_user_registered.skel.h"
+
+#define TEST_TAG 0xf23c39ab
+
+/* Encoding of the test access-type in the tv_nsec parameter. */
+enum test_access {
+ TEST_SUB_REGION,
+ TEST_EQ_REGION,
+ TEST_ONE_BY_ONE,
+ TEST_ANY_TAG,
+};
+
+/* This will be written to by the BPF program. */
+struct test_data {
+ volatile uint64_t padding_start;
+ volatile uint64_t nanosleep_arg;
+ volatile uint64_t padding_end;
+};
+
+static struct test_data test_data;
+
+static void prctl_register_writable(const volatile void *start, size_t size, uint32_t tag)
+{
+ ASSERT_OK(prctl(PR_BPF_REGISTER_WRITABLE, start, size, tag, 0), __func__);
+}
+
+static void prctl_unregister_writable(const volatile void *start, size_t size)
+{
+ ASSERT_OK(prctl(PR_BPF_UNREGISTER_WRITABLE, start, size, 0, 0), __func__);
+}
+
+/* Returns the actual tv_nsec value derived from base and test_access. */
+static uint64_t do_nanosleep(uint64_t base, enum test_access test_access)
+{
+ const uint64_t tv_nsec = base << 8 | test_access;
+ struct timespec ts = {};
+
+ ts.tv_sec = 0;
+ ts.tv_nsec = tv_nsec;
+ syscall(__NR_nanosleep, &ts, NULL);
+
+ return tv_nsec;
+}
+
+/*
+ * Test that the basic usage works: register, write from BPF program,
+ * unregister, after which no more writes can happen.
+ */
+static void test_register_and_unregister(struct test_probe_write_user_registered *skel)
+{
+ uint64_t nsec = 1234;
+ uint64_t expect;
+
+ prctl_register_writable(&test_data, sizeof(test_data), TEST_TAG);
+
+ /* Check that we see the writes. */
+ for (int i = 0; i < 3; ++i) {
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(++nsec, TEST_SUB_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ }
+
+ /* Registered the whole region, so this should also work... */
+ for (int i = 0; i < 3; ++i) {
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(++nsec, TEST_EQ_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ }
+
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+
+ /* No more writes after unregistration. */
+ test_data.nanosleep_arg = 0;
+ do_nanosleep(++nsec, TEST_SUB_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, 0, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 0, __func__);
+}
+
+/*
+ * Test that accesses with mismatching tags fail.
+ */
+static void test_bad_tag(struct test_probe_write_user_registered *skel)
+{
+ uint64_t expect;
+
+ prctl_register_writable(&test_data, sizeof(test_data), TEST_TAG);
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(1234, TEST_SUB_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ do_nanosleep(9999, TEST_ANY_TAG); /* fails */
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+}
+
+/*
+ * Test that the "any" (zero) tag works.
+ */
+static void test_any_tag(struct test_probe_write_user_registered *skel)
+{
+ uint64_t nsec = 1234;
+ uint64_t expect;
+
+ prctl_register_writable(&test_data, sizeof(test_data), 0);
+
+ for (int i = 0; i < 3; ++i) {
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(++nsec, TEST_ANY_TAG);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 0, __func__);
+ }
+
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+
+ test_data.nanosleep_arg = 0;
+ do_nanosleep(++nsec, TEST_ANY_TAG);
+ ASSERT_EQ(test_data.nanosleep_arg, 0, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 0, __func__);
+}
+
+/*
+ * Test that invalid prctl() fail.
+ */
+static void test_invalid_prctl(struct test_probe_write_user_registered *skel)
+{
+ ASSERT_ERR(prctl(PR_BPF_REGISTER_WRITABLE, NULL, 1, 0, 0), __func__);
+ ASSERT_ERR(prctl(PR_BPF_REGISTER_WRITABLE, &test_data, 0, 0, 0), __func__);
+ prctl_register_writable(&test_data, sizeof(test_data), TEST_TAG);
+ ASSERT_ERR(prctl(PR_BPF_REGISTER_WRITABLE, &test_data, sizeof(test_data), 0, 0), __func__);
+ ASSERT_ERR(prctl(PR_BPF_REGISTER_WRITABLE, &test_data, 2, 0, 0), __func__);
+ prctl_register_writable((void *)&test_data + 1, 1, TEST_TAG);
+ prctl_register_writable((void *)&test_data - 1, 1, TEST_TAG);
+
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, &test_data, 1, 0, 0), __func__);
+ prctl_unregister_writable((void *)&test_data - 1, 1);
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+ prctl_unregister_writable((void *)&test_data + 1, 1);
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, 0x123456, 1, 0, 0), __func__);
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, &test_data, sizeof(test_data), 0, 0), __func__);
+}
+
+/*
+ * Test that we can register multiple regions and they all work.
+ */
+static void test_multiple_region(struct test_probe_write_user_registered *skel)
+{
+ uint64_t expect;
+
+ prctl_register_writable(&test_data.nanosleep_arg, sizeof(uint64_t), TEST_TAG);
+ prctl_register_writable(&test_data.padding_end, sizeof(uint64_t), TEST_TAG);
+ /* First one last, so the test program knows where to start. */
+ prctl_register_writable(&test_data.padding_start, sizeof(uint64_t), TEST_TAG);
+
+ memset(&test_data, 0, sizeof(test_data));
+ do_nanosleep(0xf00d, TEST_EQ_REGION); /* fails */
+ ASSERT_EQ(test_data.nanosleep_arg, 0, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__); /* found first */
+
+ expect = do_nanosleep(0xf33d, TEST_ONE_BY_ONE);
+ ASSERT_EQ(test_data.padding_start, expect, __func__);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(test_data.padding_end, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+
+ prctl_unregister_writable(&test_data.padding_start, sizeof(uint64_t));
+ prctl_unregister_writable(&test_data.nanosleep_arg, sizeof(uint64_t));
+ prctl_unregister_writable(&test_data.padding_end, sizeof(uint64_t));
+}
+
+static void *test_thread_func(void *arg)
+{
+ struct test_probe_write_user_registered *skel = arg;
+
+ /* If this fails, the thread didn't inherit the region. */
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, &test_data, sizeof(test_data), 0, 0), __func__);
+ /* So that the BPF user_writable task storage is filled. */
+ prctl_register_writable(&test_data, 1, TEST_TAG);
+ prctl_unregister_writable(&test_data, 1);
+
+ /* Test that there really is no way it'll write. */
+ test_data.nanosleep_arg = 0;
+ do_nanosleep(9999, TEST_SUB_REGION); /* fails */
+ ASSERT_EQ(test_data.nanosleep_arg, 0, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 0, __func__);
+
+ return NULL;
+}
+
+/*
+ * Test that threads (CLONE_VM) do not inherit writable regions.
+ */
+static void test_thread(struct test_probe_write_user_registered *skel)
+{
+ uint64_t expect;
+ pthread_t tid;
+
+ prctl_register_writable(&test_data, sizeof(test_data), TEST_TAG);
+
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(1234, TEST_SUB_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+
+ ASSERT_OK(pthread_create(&tid, NULL, test_thread_func, skel), "pthread_create");
+ ASSERT_OK(pthread_join(tid, NULL), "pthread_join");
+
+ ASSERT_EQ(test_data.nanosleep_arg, 0, __func__);
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+}
+
+/*
+ * Test that fork() does inherit writable regions.
+ */
+static void test_fork(struct test_probe_write_user_registered *skel)
+{
+ uint64_t expect;
+ int pid, status;
+
+ prctl_register_writable(&test_data, sizeof(test_data), TEST_TAG);
+
+ test_data.nanosleep_arg = 0;
+ expect = do_nanosleep(1234, TEST_SUB_REGION);
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+
+ pid = fork();
+ if (!pid) {
+ test_data.nanosleep_arg = 0; /* write prefault */
+ expect = do_nanosleep(3333, TEST_SUB_REGION);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ exit(!ASSERT_EQ(test_data.nanosleep_arg, expect, __func__));
+ }
+
+ status = -1;
+ waitpid(pid, &status, 0);
+ ASSERT_EQ(status, 0, __func__);
+
+ ASSERT_EQ(test_data.nanosleep_arg, expect, __func__);
+ prctl_unregister_writable(&test_data, sizeof(test_data));
+}
+
+/*
+ * Test that the kernel can allocate lots of regions and find them.
+ */
+static void test_stress_regions(struct test_probe_write_user_registered *skel)
+{
+ const int STRESS_SIZE = 200;
+ struct test_data *large = malloc(STRESS_SIZE * sizeof(*large));
+ uint64_t expect;
+
+ ASSERT_NEQ(large, NULL, __func__);
+
+ memset(large, 0, STRESS_SIZE * sizeof(*large));
+
+ for (int i = 0; i < STRESS_SIZE; ++i) {
+ prctl_register_writable(&large[i], sizeof(*large), TEST_TAG);
+ ASSERT_ERR(prctl(PR_BPF_REGISTER_WRITABLE, &large[i], sizeof(*large), 0, 0), __func__);
+ expect = do_nanosleep(777, TEST_SUB_REGION);
+ ASSERT_EQ(large[i].nanosleep_arg, expect, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, 1, __func__);
+ }
+
+ for (int i = 0; i < STRESS_SIZE; ++i) {
+ prctl_unregister_writable(&large[i], sizeof(*large));
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, &large[i], sizeof(*large), 0, 0), __func__);
+ large[i].nanosleep_arg = 0;
+ do_nanosleep(1992, TEST_SUB_REGION); /* no more writes */
+ ASSERT_EQ(large[i].nanosleep_arg, 0, __func__);
+ ASSERT_EQ(skel->data->found_user_registered, i < STRESS_SIZE - 1 ? 1 : 0, __func__);
+ }
+
+ for (int i = 0; i < STRESS_SIZE; ++i)
+ ASSERT_ERR(prctl(PR_BPF_UNREGISTER_WRITABLE, &large[i], sizeof(*large), 0, 0), __func__);
+
+ free(large);
+}
+
+/*
+ * Test setup.
+ */
+void test_probe_write_user_registered(void)
+{
+ struct test_probe_write_user_registered *skel;
+
+ skel = test_probe_write_user_registered__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open and load"))
+ return;
+
+ if (!ASSERT_OK(test_probe_write_user_registered__attach(skel), "attach"))
+ goto cleanup;
+
+ if (test__start_subtest("register_and_unregister"))
+ test_register_and_unregister(skel);
+ if (test__start_subtest("bad_tag"))
+ test_bad_tag(skel);
+ if (test__start_subtest("any_tag"))
+ test_any_tag(skel);
+ if (test__start_subtest("invalid_prctl"))
+ test_invalid_prctl(skel);
+ if (test__start_subtest("multiple_region"))
+ test_multiple_region(skel);
+ if (test__start_subtest("thread"))
+ test_thread(skel);
+ if (test__start_subtest("fork"))
+ test_fork(skel);
+ if (test__start_subtest("stress_regions"))
+ test_stress_regions(skel);
+
+cleanup:
+ test_probe_write_user_registered__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_probe_write_user_registered.c b/tools/testing/selftests/bpf/progs/test_probe_write_user_registered.c
new file mode 100644
index 000000000000..9174ff2e36f9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_probe_write_user_registered.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023, Google LLC. */
+#include "vmlinux.h"
+#include <asm/unistd.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+/*
+ * We just need the CLONE_VM definition. Without __ASSEMBLY__ sched.h would
+ * redefine clone_args, which is already defined by vmlinux.h
+ */
+#define __ASSEMBLY__
+#include <linux/sched.h>
+#undef __ASSEMBLY__
+
+#define TEST_TAG 0xf23c39ab
+
+/* Encoding of the test access-type in the tv_nsec parameter. */
+enum test_access {
+ TEST_SUB_REGION,
+ TEST_EQ_REGION,
+ TEST_ONE_BY_ONE,
+ TEST_ANY_TAG,
+};
+#define TEST_ACCESS(nsec) ((enum test_access)((nsec) & 0xff))
+
+struct test_data {
+ __u64 padding_start;
+ __u64 nanosleep_arg;
+ __u64 padding_end;
+};
+
+struct user_writable {
+ void *start;
+ size_t size;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct user_writable);
+} user_writable SEC(".maps");
+
+int found_user_registered = -1;
+
+/*
+ * This is used to test that the contents of per-task bpf_user_writable is sane.
+ *
+ * It also demonstrates another way (vs. prctl()) how the BPF program can obtain
+ * addresses associated with a tag. Beware, however, that this is O(#registered)
+ * and a production BPF program should cache its result in task local storage.
+ */
+static int find_user_registered(__u32 tag, void *start)
+{
+ const struct bpf_user_writable *uw = bpf_get_current_task_btf()->bpf_user_writable;
+ int count = 0;
+
+ if (!uw)
+ return count;
+
+ /*
+ * Ensure termination of the loop to make the verifier happy. Use
+ * bpf_loop() if you expect a very large number of registered regions.
+ */
+ for (__u32 idx = 0; idx < uw->size && idx < 1024; ++idx) {
+ if (uw->entries[idx].tag == tag && uw->entries[idx].start == start)
+ count++;
+ }
+
+ return count;
+}
+
+static void sys_nanosleep(struct pt_regs *regs)
+{
+ struct __kernel_timespec *ts;
+ struct user_writable *w;
+ __u32 dummy = -99;
+ __u64 tv_nsec;
+ int err;
+
+ _Static_assert(sizeof(ts->tv_nsec) == sizeof(tv_nsec), "ABI");
+
+ found_user_registered = -1;
+
+ w = bpf_task_storage_get(&user_writable, bpf_get_current_task_btf(), 0, 0);
+ if (!w)
+ return;
+
+ ts = (void *)PT_REGS_PARM1_CORE_SYSCALL(regs);
+ if (bpf_probe_read_user(&tv_nsec, sizeof(ts->tv_nsec), &ts->tv_nsec))
+ return;
+
+ found_user_registered = find_user_registered(TEST_TAG, w->start);
+
+ bpf_printk("doing test accesses");
+
+ /*
+ * Test failing accesses before, so that if they actually succeed, we
+ * won't do the real write and the test will detect a missed write.
+ */
+ if (!bpf_probe_write_user_registered(w->start + w->size - 1, &dummy, sizeof(dummy), TEST_TAG))
+ return;
+ if (!bpf_probe_write_user_registered(w->start - 1, &dummy, sizeof(dummy), TEST_TAG))
+ return;
+ if (!bpf_probe_write_user_registered(w->start + 100, &dummy, sizeof(dummy), TEST_TAG))
+ return;
+ if (TEST_ACCESS(tv_nsec) != TEST_ANY_TAG) {
+ if (!bpf_probe_write_user_registered(w->start, &dummy, sizeof(dummy), 123))
+ return;
+ if (!bpf_probe_write_user_registered(w->start, &dummy, sizeof(dummy), 0))
+ return;
+ }
+
+ switch (TEST_ACCESS(tv_nsec)) {
+ case TEST_SUB_REGION:
+ bpf_printk("sub region write");
+ err = bpf_probe_write_user_registered(w->start + sizeof(__u64), &tv_nsec, sizeof(tv_nsec), TEST_TAG);
+ break;
+ case TEST_EQ_REGION: {
+ struct test_data out = {};
+
+ bpf_printk("whole region write");
+ out.nanosleep_arg = tv_nsec;
+ err = bpf_probe_write_user_registered(w->start, &out, sizeof(out), TEST_TAG);
+ break;
+ }
+ case TEST_ONE_BY_ONE:
+ bpf_printk("write one by one");
+ for (int i = 0; i < 3; ++i) {
+ err = bpf_probe_write_user_registered(w->start + i * sizeof(__u64), &tv_nsec,
+ sizeof(tv_nsec), TEST_TAG);
+ if (err)
+ break;
+ }
+ break;
+ case TEST_ANY_TAG:
+ bpf_printk("any tag write");
+ err = bpf_probe_write_user_registered(w->start + sizeof(__u64), &tv_nsec, sizeof(tv_nsec), 93845);
+ break;
+ default:
+ bpf_printk("unknown access method");
+ return;
+ }
+
+ if (err)
+ bpf_printk("write failed: %d", err);
+ else
+ bpf_printk("write success");
+}
+
+static void sys_prctl(struct pt_regs *regs)
+{
+ struct user_writable *w;
+ __u32 tag;
+
+ if (PT_REGS_PARM1_CORE_SYSCALL(regs) != /*PR_BPF_REGISTER_WRITABLE*/71)
+ return;
+
+ tag = (__u32)PT_REGS_PARM4_CORE_SYSCALL(regs);
+ if (tag && tag != TEST_TAG)
+ return;
+
+ w = bpf_task_storage_get(&user_writable, bpf_get_current_task_btf(), 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!w)
+ return;
+
+ bpf_printk("registered user writable region with tag %x", tag);
+ w->start = (void *)PT_REGS_PARM2_CORE_SYSCALL(regs);
+ w->size = PT_REGS_PARM3_CORE_SYSCALL(regs);
+}
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(sys_enter, struct pt_regs *regs, long id)
+{
+ switch (id) {
+ case __NR_prctl:
+ sys_prctl(regs);
+ break;
+ case __NR_nanosleep:
+ sys_nanosleep(regs);
+ break;
+ default:
+ break;
+ }
+ return 0;
+}
+
+/*
+ * The user writable region is copied on fork(). Also copy the per-task map we
+ * use in this test.
+ */
+SEC("tp_btf/task_newtask")
+int BPF_PROG(task_newtask, struct task_struct *t, unsigned long clone_flags)
+{
+ const struct user_writable *src;
+ struct user_writable *dst;
+
+ if (clone_flags & CLONE_VM)
+ return 0;
+
+ src = bpf_task_storage_get(&user_writable, bpf_get_current_task_btf(), 0, 0);
+ if (!src)
+ return 0;
+
+ dst = bpf_task_storage_get(&user_writable, t, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!dst) {
+ bpf_printk("failed to copy user_writable on fork()");
+ return 0;
+ }
+ *dst = *src;
+ bpf_printk("fork copied user writable region");
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.44.0.478.gd926399ef9-goog


2024-04-04 23:31:27

by Alexei Starovoitov

Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Thu, Apr 4, 2024 at 12:02 PM Marco Elver <[email protected]> wrote:
>
> With all the known caveats, tracing BPF programs may directly write to
> user-space memory with the bpf_probe_write_user() helper. Memory safety
> is an obvious problem when using this helper, since it is too easy to
> overwrite memory across all running processes that user space did not
> expect to be touched (neither the verifier nor the kernel knows what may
> be touched). While it is possible to come up with mechanisms to safely
> communicate to the BPF program which memory region may be written to,
> there are no built-in guarantees of safety. For this reason, the helper
> produces a warning in the kernel log, and in newer kernels it is
> possible to disallow use of the helper since 51e1bb9eeaf7 ("bpf: Add
> lockdown check for probe_write_user helper").

So is it a fix or a feature?

> Nevertheless, direct user-space memory writes from BPF programs can be
> useful to efficiently manipulate and communicate with cooperating user
> space processes.

But there are many different ways for bpf to communicate with user space:
perf ringbuf, bpf ringbuf, various maps including mmap-ed array and arena.
The commit log doesn't explain why we need another one.

> For example, one of our use cases is for events that happen relatively
> frequently in the kernel (e.g. specific scheduler events), but a set of
> user space threads want to check for such events in very hot code paths
> to make more optimal decisions (the cost of such a check can be no more
> than a load and compare). The types of events and heuristics used may
> change based on system environment and application, and a BPF program
> provides the best trade-offs in terms of performance and deployment.

and the tasks can use mmaped array shared across all or unique to each
process.
And both bpf and user space can read/write them with a single instruction.
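
(For concreteness, a minimal sketch of the mmap-able array approach
referred to here; the map name and the libbpf skeleton accessor are
illustrative, not from this series.)

  /* BPF side (e.g. shared.bpf.c): a single-element mmap-able array. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } shared SEC(".maps");

  char _license[] SEC("license") = "GPL";

  /* User side, after loading the skeleton (needs <sys/mman.h>, <unistd.h>,
   * <bpf/libbpf.h>): map the value in and read/write it directly. The same
   * mapping is visible to every process that mmap()s the map fd, i.e. it
   * is global rather than per process.
   */
  __u64 *val = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE,
                    MAP_SHARED, bpf_map__fd(skel->maps.shared), 0);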

>
> To achieve better safety, introduce tagged user writable regions that
> must explicitly be registered before tracing BPF programs may use them:
>
> 1. The prctl() option PR_BPF_REGISTER_WRITABLE allows any user space
> process (that is allowed to use prctl()) to register tagged writable
> memory regions for the current thread.
>
> 2. Conversely, the prctl() option PR_BPF_UNREGISTER_WRITABLE allows a
> user space process to unregister a writable memory region that was
> previously registered from the current thread. This must be done
> before freeing memory if the thread that registered a region is
> still running.
>
> 3. Tracing BPF programs may write to any registered region in the
> current thread with bpf_probe_write_user_registered(). If the memory
> region has been tagged with a non-zero value, the BPF program must
> provide a matching tag.
>
> Admin capabilities are still required to attach BPF programs that use
> the new bpf_probe_write_user_registered() helper.

We stopped adding new helpers ~2 years ago.
Only new kfuncs are allowed.

>
> With this interface, user space threads are guaranteed that no writes
> happen to regions that they did not explicitly register. Tagging can be
> used to associate additional semantics with the memory region.
>
> A note on tag allocation: Since such programs can only be installed by
> the local system administrator, tag allocation may be done by the system
> administrator. For example, by providing headers with tag definitions,
> or a central service to distribute tags to the BPF program loader and to
> user applications.

Not clear how that's achieved in practice.
To do prctl(REGISTER, ... tag)
the process will just pass this u32 tag.
There is no way for the admin or other process to enforce certain
tag usage.
Since there is no way to enforce it, the extra tag seems like weak
protection against something? What in particular?

2024-04-05 08:29:03

by Marco Elver

Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Apr 4, 2024 at 12:02 PM Marco Elver <[email protected]> wrote:
> >
> > With all the known caveats, tracing BPF programs may directly write to
> > user-space memory with the bpf_probe_write_user() helper. Memory safety
> > is an obvious problem when using this helper, since it is too easy to
> > overwrite memory across all running processes that user space did not
> > expect to be touched (neither the verifier nor the kernel knows what may
> > be touched). While it is possible to come up with mechanisms to safely
> > communicate to the BPF program which memory region may be written to,
> > there are no built-in guarantees of safety. For this reason, the helper
> > produces a warning in the kernel log, and in newer kernels it is
> > possible to disallow use of the helper since 51e1bb9eeaf7 ("bpf: Add
> > lockdown check for probe_write_user helper").
>
> So is it a fix or a feature?

Feature. The above paragraph is just an intro. Remove it?

> > Nevertheless, direct user-space memory writes from BPF programs can be
> > useful to efficiently manipulate and communicate with cooperating user
> > space processes.
>
> But there are many different ways for bpf to communicate with user space:
> perf ringbuf, bpf ringbuf, various maps including mmap-ed array and arena.
> The commit log doesn't explain why we need another one.
>
> > For example, one of our use cases is for events that happen relatively
> > frequently in the kernel (e.g. specific scheduler events), but a set of
> > user space threads want to check for such events in very hot code paths
> > to make more optimal decisions (the cost of such a check can be no more
> > than a load and compare). The types of events and heuristics used may
> > change based on system environment and application, and a BPF program
> > provides the best trade-offs in terms of performance and deployment.
>
> and the tasks can use mmaped array shared across all or unique to each
> process.
> And both bpf and user space can read/write them with a single instruction.

That's BPF_F_MMAPABLE, right?

That does not work because the mmapped region is global. Our requirements are:

1. Single tracing BPF program.

2. Per-process (per VM) memory region (here it's per-thread, but each
thread just registers the same process-wide region). No sharing
between processes.

3. From #2 it follows: exec unregisters the registered memory region;
fork gets a cloned region.

4. Unprivileged processes can do prctl(REGISTER). Some of them might
not be able to use the bpf syscall.

The reason for #2 is that each user space process also writes to the
memory region (read by the BPF program to make updates depending on
what state it finds), and having shared state between processes
doesn't work here.
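
(As a concrete illustration of that interaction, a BPF-side fragment in
the style of the selftest from patch 2/2; the hints layout, the
WANTS_EVENTS flag and the update_hints() name are made up for the
example.)

  struct hints {
          __u64 user_state;       /* written by the user space thread */
          __u64 kernel_events;    /* written by the BPF program */
  };
  #define WANTS_EVENTS 0x1        /* example flag in user_state */

  static void update_hints(void *start, __u32 tag)
  {
          struct hints h;

          /* Read the state the thread left for us, then update it. */
          if (bpf_probe_read_user(&h, sizeof(h), start))
                  return;
          if (h.user_state & WANTS_EVENTS) {
                  h.kernel_events++;
                  bpf_probe_write_user_registered(start, &h, sizeof(h), tag);
          }
  }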

Is there any reasonable BPF facility that can do this today? (If
BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
happy camper.)

bpf_probe_write_user() can, but safety is not built in, and getting
fork + exec right is brittle.

> > To achieve better safety, introduce tagged user writable regions that
> > must explicitly be registered before tracing BPF programs may use them:
> >
> > 1. The prctl() option PR_BPF_REGISTER_WRITABLE allows any user space
> > process (that is allowed to use prctl()) to register tagged writable
> > memory regions for the current thread.
> >
> > 2. Conversely, the prctl() option PR_BPF_UNREGISTER_WRITABLE allows a
> > user space process to unregister a writable memory region that was
> > previously registered from the current thread. This must be done
> > before freeing memory if the thread that registered a region is
> > still running.
> >
> > 3. Tracing BPF programs may write to any registered region in the
> > current thread with bpf_probe_write_user_registered(). If the memory
> > region has been tagged with a non-zero value, the BPF program must
> > provide a matching tag.
> >
> > Admin capabilities are still required to attach BPF programs that use
> > the new bpf_probe_write_user_registered() helper.
>
> We stopped adding new helpers ~2 years ago.
> Only new kfuncs are allowed.

Sure.

> >
> > With this interface, user space threads are guaranteed that no writes
> > happen to regions that they did not explicitly register. Tagging can be
> > used to associate additional semantics with the memory region.
> >
> > A note on tag allocation: Since such programs can only be installed by
> > the local system administrator, tag allocation may be done by the system
> > administrator. For example, by providing headers with tag definitions,
> > or a central service to distribute tags to the BPF program loader and to
> > user applications.
>
> Not clear how that's achieved in practice.
> To do prctl(REGISTER, ... tag)
> the process will just pass this u32 tag.
> There is no way for the admin or other process to enforce certain
> tag usage.
> Since there is no way to enforce it, the extra tag seems like weak
> protection against something? What in particular?

The main goal is to a) avoid accidental writes into areas the user
space program doesn't want written to, along with b) weakly
associating "type" via a tag. As soon as the user space program does
prctl(REGISTER, tag) it wants writes to happen. It's the user space
program's fault if it uses random tags, because in practice, we know
exactly which BPF programs are running in production and which tags
they service. The programs we do _care_ about are reviewed and do use
the right tags. The mechanism is not for protecting BPF programs or
what data they communicate, but for the benefit of the user space
program itself to bound the memory that could be touched in its own
address space (use the wrong region/tag .. tough luck).

2024-04-05 20:28:56

by Andrii Nakryiko

Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <[email protected]> wrote:
>
> On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Thu, Apr 4, 2024 at 12:02 PM Marco Elver <[email protected]> wrote:
> > >
> > > With all the known caveats, tracing BPF programs may directly write to
> > > user-space memory with the bpf_probe_write_user() helper. Memory safety
> > > is an obvious problem when using this helper, since it is too easy to
> > > overwrite memory across all running processes that user space did not
> > > expect to be touched (neither the verifier nor the kernel knows what may
> > > be touched). While it is possible to come up with mechanisms to safely
> > > communicate to the BPF program which memory region may be written to,
> > > there are no built-in guarantees of safety. For this reason, the helper
> > > produces a warning in the kernel log, and in newer kernels it is
> > > possible to disallow use of the helper since 51e1bb9eeaf7 ("bpf: Add
> > > lockdown check for probe_write_user helper").
> >
> > So is it a fix or a feature?
>
> Feature. The above paragraph is just an intro. Remove it?
>
> > > Nevertheless, direct user-space memory writes from BPF programs can be
> > > useful to efficiently manipulate and communicate with cooperating user
> > > space processes.
> >
> > But there are many different ways for bpf to communicate with user space:
> > perf ringbuf, bpf ringbuf, various maps including mmap-ed array and arena.
> > The commit log doesn't explain why we need another one.
> >
> > > For example, one of our use cases is for events that happen relatively
> > > frequently in the kernel (e.g. specific scheduler events), but a set of
> > > user space threads want to check for such events in very hot code paths
> > > to make more optimal decisions (the cost of such a check can be no more
> > > than a load and compare). The types of events and heuristics used may
> > > change based on system environment and application, and a BPF program
> > > provides the best trade-offs in terms of performance and deployment.
> >
> > and the tasks can use mmaped array shared across all or unique to each
> > process.
> > And both bpf and user space can read/write them with a single instruction.
>
> That's BPF_F_MMAPABLE, right?
>
> That does not work because the mmapped region is global. Our requirements are:
>
> 1. Single tracing BPF program.
>
> 2. Per-process (per VM) memory region (here it's per-thread, but each
> thread just registers the same process-wide region). No sharing
> between processes.
>
> 3. From #2 it follows: exec unregisters the registered memory region;
> fork gets a cloned region.
>
> 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> not be able to use the bpf syscall.
>
> The reason for #2 is that each user space process also writes to the
> memory region (read by the BPF program to make updates depending on
> what state it finds), and having shared state between processes
> doesn't work here.
>
> Is there any reasonable BPF facility that can do this today? (If
> BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> happy camper.)

You could simulate something like this with multi-element ARRAY +
BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
processes, so it's not an exact fit.

But what seems to be much closer is using BPF task-local storage, if
we support mmap()'ing its memory into user-space. We've had previous
discussions on how to achieve this (the simplest being that
mmap(task_local_map_fd, ...) maps current thread's part of BPF task
local storage). You won't get automatic cloning (you'd have to do that
from the BPF program on fork/exec tracepoint, for example), and within
the process you'd probably want to have just one thread (main?) to
mmap() initially and just share the pointer across all relevant
threads. But this is a more generic building block, IMO. Relying on a
BPF map also means pinning is possible, along with all the other BPF
map abstraction benefits.
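
(For reference, a condensed sketch of the task-local storage building
block as it exists today; the mmap() support discussed above does not
exist yet, and the map/program names are illustrative.)

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, __u64);
  } per_task SEC(".maps");

  SEC("tp_btf/sys_enter")
  int BPF_PROG(count_syscalls, struct pt_regs *regs, long id)
  {
          __u64 *v = bpf_task_storage_get(&per_task, bpf_get_current_task_btf(),
                                          0, BPF_LOCAL_STORAGE_GET_F_CREATE);

          if (v)
                  (*v)++; /* per-task state; today only BPF-side access */
          return 0;
  }

  char _license[] SEC("license") = "GPL";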


>
> bpf_probe_write_user() can, but safety is not built in, and getting
> fork + exec right is brittle.
>
> > > To achieve better safety, introduce tagged user writable regions that
> > > must explicitly be registered before tracing BPF programs may use them:
> > >
> > > 1. The prctl() option PR_BPF_REGISTER_WRITABLE allows any user space
> > > process (that is allowed to use prctl()) to register tagged writable
> > > memory regions for the current thread.
> > >
> > > 2. Conversely, the prctl() option PR_BPF_UNREGISTER_WRITABLE allows a
> > > user space process to unregister a writable memory region that was
> > > previously registered from the current thread. This must be done
> > > before freeing memory if the thread that registered a region is
> > > still running.
> > >
> > > 3. Tracing BPF programs may write to any registered region in the
> > > current thread with bpf_probe_write_user_registered(). If the memory
> > > region has been tagged with a non-zero value, the BPF program must
> > > provide a matching tag.
> > >
> > > Admin capabilities are still required to attach BPF programs that use
> > > the new bpf_probe_write_user_registered() helper.
> >
> > We stopped adding new helpers ~2 years ago.
> > Only new kfuncs are allowed.
>
> Sure.
>
> > >
> > > With this interface, user space threads are guaranteed that no writes
> > > happen to regions that they did not explicitly register. Tagging can be
> > > used to associate additional semantics with the memory region.
> > >
> > > A note on tag allocation: Since such programs can only be installed by
> > > the local system administrator, tag allocation may be done by the system
> > > administrator. For example, by providing headers with tag definitions,
> > > or a central service to distribute tags to the BPF program loader and to
> > > user applications.
> >
> > Not clear how that's achieved in practice.
> > To do prctl(REGISTER, ... tag)
> > the process will just pass this u32 tag.
> > There is no way for the admin or other process to enforce certain
> > tag usage.
> > Since there is no way to enforce it, the extra tag seems like weak
> > protection against something? What in particular?
>
> The main goal is to a) avoid accidental writes into areas the user
> space program doesn't want written to, along with b) weakly
> associating "type" via a tag. As soon as the user space program does
> prctl(REGISTER, tag) it wants writes to happen. It's the user space
> program's fault if it uses random tags, because in practice, we know
> exactly which BPF programs are running in production and which tags
> they service. The programs we do _care_ about are reviewed and do use
> the right tags. The mechanism is not for protecting BPF programs or
> what data they communicate, but for the benefit of the user space
> program itself to bound the memory that could be touched in its own
> address space (use the wrong region/tag .. tough luck).

2024-04-08 09:32:26

by Marco Elver

Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <[email protected]> wrote:
>
> On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <[email protected]> wrote:
> >
> > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > <[email protected]> wrote:
[...]
> > > and the tasks can use mmaped array shared across all or unique to each
> > > process.
> > > And both bpf and user space can read/write them with a single instruction.
> >
> > That's BPF_F_MMAPABLE, right?
> >
> > That does not work because the mmapped region is global. Our requirements are:
> >
> > 1. Single tracing BPF program.
> >
> > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > thread just registers the same process-wide region). No sharing
> > between processes.
> >
> > 3. From #2 it follows: exec unregisters the registered memory region;
> > fork gets a cloned region.
> >
> > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > not be able to use the bpf syscall.
> >
> > The reason for #2 is that each user space process also writes to the
> > memory region (read by the BPF program to make updates depending on
> > what state it finds), and having shared state between processes
> > doesn't work here.
> >
> > Is there any reasonable BPF facility that can do this today? (If
> > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > happy camper.)
>
> You could simulate something like this with multi-element ARRAY +
> BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> processes, so it's not an exact fit.

Right, for production use this is infeasible.
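
(For reference, the suggestion above would look roughly like the sketch
below; map name, value type and MAX_PROCS are illustrative assumptions,
and having to pre-size max_entries is exactly the pre-allocation problem.)

  /* BPF side: pre-sized, mmap-able array indexed by a per-process id. */
  struct per_proc_state { __u64 flags; };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, MAX_PROCS);       /* must be pre-allocated */
          __type(key, __u32);
          __type(value, struct per_proc_state);
  } state SEC(".maps");

  /* User space (libbpf): map the whole array and use one slot. */
  int map_fd = bpf_object__find_map_fd_by_name(obj, "state");
  struct per_proc_state *base = mmap(NULL, MAX_PROCS * sizeof(*base),
                                     PROT_READ | PROT_WRITE, MAP_SHARED,
                                     map_fd, 0);
  base[my_slot].flags = 0;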

> But what seems to be much closer is using BPF task-local storage, if
> we support mmap()'ing its memory into user-space. We've had previous
> discussions on how to achieve this (the simplest being that
> mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> local storage). You won't get automatic cloning (you'd have to do that
> from the BPF program on fork/exec tracepoint, for example), and within
> the process you'd probably want to have just one thread (main?) to
> mmap() initially and just share the pointer across all relevant
> threads.

In the way you imagine it, would that allow all threads to share the
same memory, despite it being task-local? Presumably each task's local
storage would be mapped to just point to the same memory?

> But this is a more generic building block, IMO. This relying
> on BPF map also means pinning is possible and all the other BPF map
> abstraction benefits.

Deployment-wise it will make things harder because unprivileged
processes still have to somehow get the map's shared fd to
mmap() it. Not unsolvable, and in general what you describe looks
interesting, but I currently can't see how it will be simpler.

In the absence of all that, is a safer "bpf_probe_write_user()" like I
proposed in this patch ("bpf_probe_write_user_registered()") at all
appealing?

Thanks,
-- Marco

2024-04-08 18:24:44

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <[email protected]> wrote:
>
> On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <[email protected]> wrote:
> >
> > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <[email protected]> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > <[email protected]> wrote:
> [...]
> > > > and the tasks can use mmaped array shared across all or unique to each
> > > > process.
> > > > And both bpf and user space can read/write them with a single instruction.
> > >
> > > That's BPF_F_MMAPABLE, right?
> > >
> > > That does not work because the mmapped region is global. Our requirements are:

That sounds not like "requirements" but like a description of the
proposed solution.
Pls share the actual use case.
This "tracing prog" sounds more like a ghost scheduler that
wants to interact with known user processes.

> > >
> > > 1. Single tracing BPF program.
> > >
> > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > thread just registers the same process-wide region). No sharing
> > > between processes.
> > >
> > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > fork gets a cloned region.
> > >
> > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > not be able to use the bpf syscall.
> > >
> > > The reason for #2 is that each user space process also writes to the
> > > memory region (read by the BPF program to make updates depending on
> > > what state it finds), and having shared state between processes
> > > doesn't work here.
> > >
> > > Is there any reasonable BPF facility that can do this today? (If
> > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > happy camper.)
> >
> > You could simulate something like this with multi-element ARRAY +
> > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > processes, so it's not an exact fit.
>
> Right, for production use this is infeasible.

Last I heard, ghost agent and a few important tasks can mmap bpf array
and share it with bpf prog.
So quite feasible.

>
> > But what seems to be much closer is using BPF task-local storage, if
> > we support mmap()'ing its memory into user-space. We've had previous
> > discussions on how to achieve this (the simplest being that
> > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > local storage). You won't get automatic cloning (you'd have to do that
> > from the BPF program on fork/exec tracepoint, for example), and within
> > the process you'd probably want to have just one thread (main?) to
> > mmap() initially and just share the pointer across all relevant
> > threads.
>
> In the way you imagine it, would that allow all threads sharing the
> same memory, despite it being task-local? Presumably each task's local
> storage would be mapped to just point to the same memory?
>
> > But this is a more generic building block, IMO. This relying
> > on BPF map also means pinning is possible and all the other BPF map
> > abstraction benefits.
>
> Deployment-wise it will make things harder because unprivileged
> processes still have to somehow get the map's shared fd to
> mmap() it. Not unsolvable, and in general what you describe looks
> interesting, but I currently can't see how it will be simpler.

bpf map can be pinned into bpffs for any unpriv process to access.
Then any task can bpf_obj_get it and mmap it.
If you have a few such tasks then a bpf array will do.
If you have millions of tasks then use bpf arena which is a sparse array.
Use pid as an index or some other per-task id.
Both bpf prog and all tasks can read/write such shared memory
with normal load/store instructions.
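
A rough sketch of that flow from an unprivileged task's side (the pin
path and region size are assumptions; the map must have been created
mmap-able and pinned with suitable bpffs permissions by the privileged
loader):

  #include <bpf/bpf.h>
  #include <sys/mman.h>

  /* Look up the map pinned by the privileged loader... */
  int fd = bpf_obj_get("/sys/fs/bpf/tcmalloc_state");
  /* ...and map it; both this task and the bpf prog can now read/write
   * 'mem' with plain loads and stores. */
  void *mem = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);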

> In absence of all that, is a safer "bpf_probe_write_user()" like I
> proposed in this patch ("bpf_probe_write_user_registered()") at all
> appealing?

To be honest, another "probe" variant is not appealing.
It's pretty much bpf_probe_write_user without pr_warn_ratelimited.
The main issue with bpf_probe_read/write_user() is their non-determinism.
They will error when memory is swapped out.
These helpers are ok-ish for observability when consumers understand
that some events might be lost, but for 24/7 production use
losing reads becomes a problem that bpf prog cannot mitigate.
What is the bpf prog supposed to do when this safer bpf_probe_write_user errors?
Use some other mechanism to communicate with user space?
A mechanism with such builtin randomness in behavior is a footgun for
bpf users.
We have bpf_copy_from_user*() that don't have this non-determinism.
We can introduce bpf_copy_to_user(), but it will only be usable
from sleepable bpf progs.
Since it sounds like you need this somewhere the scheduler makes
decisions, I suspect bpf array or arena is a better fit.

Or something that extends bpf local storage map.
See long discussion:
https://lore.kernel.org/bpf/[email protected]/

I still like the idea to let user tasks register memory in
bpf local storage map, the kernel will pin such pages,
and then bpf prog can read/write these regions directly.
In bpf prog it will be:
ptr = bpf_task_storage_get(&map, task, ...);
if (ptr) { *ptr = ... }
and direct read/write into the same memory from user space.
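
For reference, the task-local storage side already looks roughly like
this with today's API (map and field names are illustrative); the
missing piece is mapping the same memory into user space:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct shared_state { __u64 on_cpu; };        /* illustrative */

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct shared_state);
  } task_state SEC(".maps");

  SEC("tp_btf/sched_switch")
  int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next)
  {
          struct shared_state *ptr;

          /* Per-task storage, created on first use. */
          ptr = bpf_task_storage_get(&task_state, next, 0,
                                     BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (ptr)
                  ptr->on_cpu = 1;
          return 0;
  }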

2024-04-09 08:11:57

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov
<[email protected]> wrote:
>
> On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <[email protected]> wrote:
> >
> > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <[email protected]> wrote:
> > >
> > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <[email protected]> wrote:
> > > >
> > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > > <[email protected]> wrote:
> > [...]
> > > > > and the tasks can use mmaped array shared across all or unique to each
> > > > > process.
> > > > > And both bpf and user space can read/write them with a single instruction.
> > > >
> > > > That's BPF_F_MMAPABLE, right?
> > > >
> > > > That does not work because the mmapped region is global. Our requirements are:
>
> It sounds not like "requirements", but a description of the proposed
> solution.

These requirements could also be met differently, e.g. with the
proposed task-local storage maps if done right (the devil appears to
be in the implementation details - those details are currently beyond
my knowledge of the BPF subsystem internals). They really are the bare
minimum requirements for the use case. The proposed solution probably
happens to be the simplest one, and mapping the requirements onto it
is relatively straightforward.

> Pls share the actual use case.
> This "tracing prog" sounds more like a ghost scheduler that
> wants to interact with known user processes.

It's tcmalloc - we have a BPF program provide a simpler variant of the
"thread re-scheduling" notifications discussed here:
https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com/

Various synchronization algorithms can be optimized if they know about
scheduling events. This has been proposed in [1] to implement an
adaptive mutex. However, we think there are many more possibilities,
and it really depends on the kind of scheduler state being exposed
("thread on CPU" as in [1], or more detail, such as which thread was
switched in, which was switched out, where the thread was migrated to,
etc.). Committing to such narrow use-case ABIs and extending the main
kernel with more and more of this information does not scale. Instead,
BPF makes it easy to expose this information where it's required.

[1] https://lore.kernel.org/all/[email protected]/
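
To make the hot-path check concrete (purely illustrative; the struct,
field and update policy are assumptions, with the tracing BPF program
keeping owner_on_cpu up to date from its scheduling hooks):

  #include <stdatomic.h>

  struct sched_hints {
          _Atomic int owner_on_cpu;   /* written by the tracing BPF program */
  };

  /* Adaptive lock slow path: only spin while the holder is on-CPU. */
  static inline int worth_spinning(struct sched_hints *h)
  {
          /* The hot-path cost is a single load and compare. */
          return atomic_load_explicit(&h->owner_on_cpu,
                                      memory_order_relaxed) != 0;
  }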

> > > > 1. Single tracing BPF program.
> > > >
> > > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > > thread just registers the same process-wide region). No sharing
> > > > between processes.
> > > >
> > > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > > fork gets a cloned region.
> > > >
> > > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > > not be able to use the bpf syscall.
> > > >
> > > > The reason for #2 is that each user space process also writes to the
> > > > memory region (read by the BPF program to make updates depending on
> > > > what state it finds), and having shared state between processes
> > > > doesn't work here.
> > > >
> > > > Is there any reasonable BPF facility that can do this today? (If
> > > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > > happy camper.)
> > >
> > > You could simulate something like this with multi-element ARRAY +
> > > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > > processes, so it's not an exact fit.
> >
> > Right, for production use this is infeasible.
>
> Last I heard, ghost agent and a few important tasks can mmap bpf array
> and share it with bpf prog.
> So quite feasible.

Nothing related to ghost.

It's tcmalloc, i.e. every process running everywhere.

> > > But what seems to be much closer is using BPF task-local storage, if
> > > we support mmap()'ing its memory into user-space. We've had previous
> > > discussions on how to achieve this (the simplest being that
> > > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > > local storage). You won't get automatic cloning (you'd have to do that
> > > from the BPF program on fork/exec tracepoint, for example), and within
> > > the process you'd probably want to have just one thread (main?) to
> > > mmap() initially and just share the pointer across all relevant
> > > threads.
> >
> > In the way you imagine it, would that allow all threads sharing the
> > same memory, despite it being task-local? Presumably each task's local
> > storage would be mapped to just point to the same memory?
> >
> > > But this is a more generic building block, IMO. This relying
> > > on BPF map also means pinning is possible and all the other BPF map
> > > abstraction benefits.
> >
> > Deployment-wise it will make things harder because unprivileged
> > processes still have to somehow get the map's shared fd to
> > mmap() it. Not unsolvable, and in general what you describe looks
> > interesting, but I currently can't see how it will be simpler.
>
> bpf map can be pinned into bpffs for any unpriv process to access.
> Then any task can bpf_obj_get it and mmap it.
> If you have a few such tasks then a bpf array will do.
> If you have millions of tasks then use bpf arena which is a sparse array.
> Use pid as an index or some other per-task id.
> Both bpf prog and all tasks can read/write such shared memory
> with normal load/store instructions.
>
> > In absence of all that, is a safer "bpf_probe_write_user()" like I
> > proposed in this patch ("bpf_probe_write_user_registered()") at all
> > appealing?
>
> To be honest, another "probe" variant is not appealing.
> It's pretty much bpf_probe_write_user without pr_warn_ratelimited.

Fair enough.

> The main issue with bpf_probe_read/write_user() is their non-determinism.
> They will error when memory is swapped out.

Right. Although user space mlock'ing and prefaulting the memory
solves it in the common case (some corner cases like after fork() are
still tricky).

> These helpers are ok-ish for observability when consumers understand
> that some events might be lost, but for 24/7 production use
> losing reads becomes a problem that bpf prog cannot mitigate.
> What is the bpf prog supposed to do when this safer bpf_probe_write_user errors?
> Use some other mechanism to communicate with user space?

Right, for use cases where these errors aren't ok it does not work.
But if the algorithm is tolerant to lossy reads/writes from the BPF
program side, it's not an issue.

> A mechanism with such builtin randomness in behavior is a footgun for
> bpf users.
> We have bpf_copy_from_user*() that don't have this non-determinism.
> We can introduce bpf_copy_to_user(), but it will be usable
> from sleepable bpf prog.
> While it sounds you need it somewhere where scheduler makes decisions,
> so I suspect bpf array or arena is a better fit.

Correct, it can't sleep because it wants scheduler events.

> Or something that extends bpf local storage map.
> See long discussion:
> https://lore.kernel.org/bpf/[email protected]/
>
> I still like the idea to let user tasks register memory in
> bpf local storage map, the kernel will pin such pages,
> and then bpf prog can read/write these regions directly.
> In bpf prog it will be:
> ptr = bpf_task_storage_get(&map, task, ...);
> if (ptr) { *ptr = ... }
> and direct read/write into the same memory from user space.

I would certainly welcome something like this - I agree this looks
better than bpf_probe_read/write.

Thanks,
-- Marco

2024-04-09 21:42:27

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Tue, Apr 9, 2024 at 1:11 AM Marco Elver <[email protected]> wrote:
>
> On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <[email protected]> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <[email protected]> wrote:
> > > >
> > > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <[email protected]> wrote:
> > > > >
> > > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > > > <[email protected]> wrote:
> > > [...]
> > > > > > and the tasks can use mmaped array shared across all or unique to each
> > > > > > process.
> > > > > > And both bpf and user space can read/write them with a single instruction.
> > > > >
> > > > > That's BPF_F_MMAPABLE, right?
> > > > >
> > > > > That does not work because the mmapped region is global. Our requirements are:
> >
> > It sounds not like "requirements", but a description of the proposed
> > solution.
>
> These requirements can also be implemented differently, e.g. with the
> proposed task-local storage maps if done right (the devil appears to
> be in the implementation-details - these details are currently beyond
> my knowledge of the BPF subsystem internals). They really are the bare
> minimum requirements for the use case. The proposed solution probably
> happens to be the simplest one, and mapping requirements is relatively
> straightforward for it.
>
> > Pls share the actual use case.
> > This "tracing prog" sounds more like a ghost scheduler that
> > wants to interact with known user processes.
>
> It's tcmalloc - we have a BPF program provide a simpler variant of the
> "thread re-scheduling" notifications discussed here:
> https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com/
>
> Various synchronization algorithms can be optimized if they know about
> scheduling events. This has been proposed in [1] to implement an
> adaptive mutex. However, we think that there are many more
> possibilities, but it really depends on the kind of scheduler state
> being exposed ("thread on CPU" as in [1], or more details, like
> details about which thread was switched in, which was switched out,
> where was the thread migrated to, etc.). Committing to these narrow
> use case ABIs and extending the main kernel with more and more such
> information does not scale. Instead, BPF easily allows to expose this
> information where it's required.
>
> [1] https://lore.kernel.org/all/[email protected]/

Thanks for the links. That's very helpful.
I was about to mention rseq.
I think it's a better fit, but Mathieu doesn't want non-uapi bits in it.
Ideally we could have added a pointer + size to rseq as a scratch area
that a bpf prog can read/write.
User space would register such an area, the kernel would pin the pages,
and bpf would access them directly.
Pretty much the same idea as bpf local storage, but without a bpf map.
rseq has its pros and cons.
Having a bpf local storage map to manage such pinned pages provides
separation of concerns. And it's up to tcmalloc whether to use
one such map for all tasks or a map for every tcmalloc instance.
Such scratch areas are per-task per-map.
rseq dictates per-task only, but has no setup concerns thanks
to glibc integration.
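
A purely hypothetical sketch of that shape (no such fields exist in the
current struct rseq uapi; this only illustrates the idea):

  /* Hypothetical only - not part of any uapi. */
  struct rseq_bpf_scratch {
          __u64 scratch_ptr;    /* user memory the kernel would pin */
          __u64 scratch_len;    /* bpf progs could read/write it directly */
  };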

> > The main issue with bpf_probe_read/write_user() is their non-determinism.
> > They will error when memory is swapped out.
>
> Right. Although, user space mlock'ing and prefaulting the memory
> solves it in the common case (some corner cases like after fork() are
> still tricky).

Yeah, it gets tricky indeed.
With a bpf local storage map managing the pinned pages, the map_fd's
auto-close property addresses unpinning. No need to hack into the
forking/exiting paths, but tcmalloc would need to re-register
the areas into bpf local storage after fork... guessing.

> > These helpers are ok-ish for observability when consumers understand
> > that some events might be lost, but for 24/7 production use
> > losing reads becomes a problem that bpf prog cannot mitigate.
> > What is the bpf prog supposed to do when this safer bpf_probe_write_user errors?
> > Use some other mechanism to communicate with user space?
>
> Right, for use cases where these errors aren't ok it does not work.
> But if the algorithm is tolerant to lossy reads/writes from the BPF
> program side, it's not an issue.

Reading Dmitry Vyukov's comments in that thread... looks like
he's also concerned with the 'probe' part of rseq api.

> > A mechanism with such builtin randomness in behavior is a footgun for
> > bpf users.
> > We have bpf_copy_from_user*() that don't have this non-determinism.
> > We can introduce bpf_copy_to_user(), but it will be usable
> > from sleepable bpf prog.
> > While it sounds you need it somewhere where scheduler makes decisions,
> > so I suspect bpf array or arena is a better fit.
>
> Correct, it can't sleep because it wants scheduler events.
>
> > Or something that extends bpf local storage map.
> > See long discussion:
> > https://lore.kernel.org/bpf/[email protected]/
> >
> > I still like the idea to let user tasks register memory in
> > bpf local storage map, the kernel will pin such pages,
> > and then bpf prog can read/write these regions directly.
> > In bpf prog it will be:
> > ptr = bpf_task_storage_get(&map, task, ...);
> > if (ptr) { *ptr = ... }
> > and direct read/write into the same memory from user space.
>
> I would certainly welcome something like this - I agree this looks
> better than bpf_probe_read/write.

Great to hear. It's on the todo list, but no one has started working on it.
If you have cycles to prototype it, that would be great.