New version of the sleepable bpf_timer code, without the HID changes, as
they can now go through the HID tree independently.
For reference, the use cases I have in mind:
---
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out-of-proximity event, but the device cannot
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event into the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example a device might send us an event saying that it came back
from sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raising a bpf event, and letting that userspace program inject the events
and commands.
However, we are keeping that program alive as a daemon just for
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup, and a kernel workqueue seems simpler to handle.
bpf_timers currently run in a soft IRQ context; this patch
series implements a sleepable context for them.
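For reference, a minimal sketch of what a BPF program using this series
looks like (modeled on the selftest added in patch 6; the map, callback
and section names here are only illustrative):

```c
struct elem {
	struct bpf_timer t;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} timer_map SEC(".maps");

/* Runs from a workqueue, so sleepable kfuncs are allowed here. */
static int defer_cb(void *map, int *key, struct bpf_timer *timer)
{
	/* e.g. re-inject the deferred event or talk to the device */
	return 0;
}

SEC("syscall") /* illustrative attach point */
int defer_event(void *ctx)
{
	int key = 0;
	struct bpf_timer *timer = bpf_map_lookup_elem(&timer_map, &key);

	if (!timer)
		return 0;
	bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC);
	bpf_timer_set_sleepable_cb(timer, defer_cb);
	/* nsecs must be 0: the callback is offloaded immediately */
	bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE);
	return 0;
}
```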
Cheers,
Benjamin
To: Alexei Starovoitov <[email protected]>
To: Daniel Borkmann <[email protected]>
To: Andrii Nakryiko <[email protected]>
To: Martin KaFai Lau <[email protected]>
To: Eduard Zingerman <[email protected]>
To: Song Liu <[email protected]>
To: Yonghong Song <[email protected]>
To: John Fastabend <[email protected]>
To: KP Singh <[email protected]>
To: Stanislav Fomichev <[email protected]>
To: Hao Luo <[email protected]>
To: Jiri Olsa <[email protected]>
To: Mykola Lysenko <[email protected]>
To: Shuah Khan <[email protected]>
Cc: Benjamin Tissoires <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
---
Changes in v5:
- took various reviews into account
- rewrote the tests to be separate, to avoid an ugly include
- Link to v4: https://lore.kernel.org/r/[email protected]
Changes in v4:
- dropped the HID changes, they can go independently from bpf-core
- addressed Alexei's and Eduard's remarks
- added selftests
- Link to v3: https://lore.kernel.org/r/[email protected]
Changes in v3:
- fixed the crash from v2
- changed the API to have only BPF_F_TIMER_SLEEPABLE for
bpf_timer_start()
- split the new kfuncs/verifier patch into several sub-patches, for
easier reviews
- Link to v2: https://lore.kernel.org/r/[email protected]
Changes in v2:
- make use of bpf_timer (and dropped the custom HID handling)
- implemented bpf_timer_set_sleepable_cb as a kfunc
- still not implemented global subprogs
- no sleepable bpf_timer selftests yet
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Benjamin Tissoires (6):
bpf/helpers: introduce sleepable bpf_timers
bpf/verifier: add bpf_timer as a kfunc capable type
bpf/helpers: introduce bpf_timer_set_sleepable_cb() kfunc
bpf/helpers: mark the callback of bpf_timer_set_sleepable_cb() as sleepable
tools: sync include/uapi/linux/bpf.h
selftests/bpf: add sleepable timer tests
include/linux/bpf_verifier.h | 1 +
include/uapi/linux/bpf.h | 4 +
kernel/bpf/helpers.c | 132 ++++++++++++++-
kernel/bpf/verifier.c | 96 ++++++++++-
tools/include/uapi/linux/bpf.h | 4 +
tools/testing/selftests/bpf/bpf_experimental.h | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
tools/testing/selftests/bpf/prog_tests/timer.c | 34 ++++
.../testing/selftests/bpf/progs/timer_sleepable.c | 185 +++++++++++++++++++++
10 files changed, 458 insertions(+), 9 deletions(-)
---
base-commit: 9187210eee7d87eea37b45ea93454a88681894a4
change-id: 20240205-hid-bpf-sleepable-c01260fd91c4
Best regards,
--
Benjamin Tissoires <[email protected]>
We need to extend the bpf_timer API, but the way forward relies on kfuncs.
So make bpf_timer known to kfuncs from the verifier's PoV.
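Concretely, this lets a kfunc declare a bpf_timer argument, as the kfunc
introduced later in this series does. A sketch of the resulting signature
(taken from patch 3; shown here only for illustration):

```c
/* The verifier now recognizes the timer argument and enforces that it
 * points to a map value at offset 0 (see check_kfunc_args() below).
 */
__bpf_kfunc int bpf_timer_set_sleepable_cb_impl(struct bpf_timer_kern *timer,
		int (callback_fn)(void *map, int *key, struct bpf_timer *timer),
		void *aux__ign);
```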
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v5:
- also check for the reg offset
changes in v4:
- enforce KF_ARG_PTR_TO_TIMER to be of type PTR_TO_MAP_VALUE
new in v3 (split from v2 02/10)
---
kernel/bpf/verifier.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 63749ad5ac6b..24a604e26ec7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10826,6 +10826,7 @@ enum {
KF_ARG_LIST_NODE_ID,
KF_ARG_RB_ROOT_ID,
KF_ARG_RB_NODE_ID,
+ KF_ARG_TIMER_ID,
};
BTF_ID_LIST(kf_arg_btf_ids)
@@ -10834,6 +10835,7 @@ BTF_ID(struct, bpf_list_head)
BTF_ID(struct, bpf_list_node)
BTF_ID(struct, bpf_rb_root)
BTF_ID(struct, bpf_rb_node)
+BTF_ID(struct, bpf_timer_kern)
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
const struct btf_param *arg, int type)
@@ -10877,6 +10879,12 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
}
+static bool is_kfunc_arg_timer(const struct btf *btf, const struct btf_param *arg)
+{
+ bool ret = __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TIMER_ID);
+ return ret;
+}
+
static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
const struct btf_param *arg)
{
@@ -10946,6 +10954,7 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_NULL,
KF_ARG_PTR_TO_CONST_STR,
KF_ARG_PTR_TO_MAP,
+ KF_ARG_PTR_TO_TIMER,
};
enum special_kfunc_type {
@@ -11102,6 +11111,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (is_kfunc_arg_map(meta->btf, &args[argno]))
return KF_ARG_PTR_TO_MAP;
+ if (is_kfunc_arg_timer(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_TIMER;
+
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
if (!btf_type_is_struct(ref_t)) {
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
@@ -11735,6 +11747,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_CALLBACK:
case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
case KF_ARG_PTR_TO_CONST_STR:
+ case KF_ARG_PTR_TO_TIMER:
/* Trusted by default */
break;
default:
@@ -12021,6 +12034,16 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
if (ret)
return ret;
break;
+ case KF_ARG_PTR_TO_TIMER:
+ if (reg->type != PTR_TO_MAP_VALUE) {
+ verbose(env, "arg#%d doesn't point to a map value\n", i);
+ return -EINVAL;
+ }
+ if (reg->off) {
+ verbose(env, "arg#%d offset must be zero\n", i);
+ return -EINVAL;
+ }
+ break;
}
}
--
2.44.0
They are implemented as a workqueue, which means that there are no
guarantees of timing or ordering.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v5
changes in v4:
- dropped __bpf_timer_compute_key()
- use a spin_lock instead of a semaphore
- ensure bpf_timer_cancel_and_free is not complaining about
non sleepable context and use cancel_work() instead of
cancel_work_sync()
- return -EINVAL if a delay is given to bpf_timer_start() with
BPF_F_TIMER_SLEEPABLE
changes in v3:
- extracted the implementation in bpf_timer only, without
bpf_timer_set_sleepable_cb()
- rely on schedule_work() only, from bpf_timer_start()
- add semaphore to ensure bpf_timer_work_cb() is accessing
consistent data
changes in v2 (compared to the one attached to v1 0/9):
- make use of a kfunc
- add a (non-used) BPF_F_TIMER_SLEEPABLE
- the callback is *not* called, as it makes the kernel crash
---
include/uapi/linux/bpf.h | 4 +++
kernel/bpf/helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 88 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3c42b9f1bada..b90def29d796 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7461,10 +7461,14 @@ struct bpf_core_relo {
* - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
* relative to current time.
* - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
+ * - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
+ * no guarantees of ordering or timing (consider this as being just
+ * offloaded immediately).
*/
enum {
BPF_F_TIMER_ABS = (1ULL << 0),
BPF_F_TIMER_CPU_PIN = (1ULL << 1),
+ BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
};
/* BPF numbers iterator state */
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a89587859571..38de73a9df83 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1094,14 +1094,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
* bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
* Inner maps can contain bpf timers as well. ops->map_release_uref is
* freeing the timers when inner map is replaced or deleted by user space.
+ *
+ * sleepable_lock protects only the setup of the workqueue, not the callback
+ * itself. This is done to ensure that we don't concurrently free the
+ * callback or the associated program.
*/
struct bpf_hrtimer {
struct hrtimer timer;
+ struct work_struct work;
struct bpf_map *map;
struct bpf_prog *prog;
void __rcu *callback_fn;
void *value;
struct rcu_head rcu;
+ spinlock_t sleepable_lock;
};
/* the actual struct hidden inside uapi struct bpf_timer */
@@ -1114,6 +1120,49 @@ struct bpf_timer_kern {
struct bpf_spin_lock lock;
} __attribute__((aligned(8)));
+static void bpf_timer_work_cb(struct work_struct *work)
+{
+ struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
+ struct bpf_map *map = t->map;
+ bpf_callback_t callback_fn;
+ void *value = t->value;
+ unsigned long flags;
+ void *key;
+ u32 idx;
+
+ BTF_TYPE_EMIT(struct bpf_timer);
+
+ spin_lock_irqsave(&t->sleepable_lock, flags);
+
+ callback_fn = READ_ONCE(t->callback_fn);
+ if (!callback_fn) {
+ spin_unlock_irqrestore(&t->sleepable_lock, flags);
+ return;
+ }
+
+ if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+ /* compute the key */
+ idx = ((char *)value - array->value) / array->elem_size;
+ key = &idx;
+ } else { /* hash or lru */
+ key = value - round_up(map->key_size, 8);
+ }
+
+ /* prevent the callback from being freed by bpf_timer_cancel() while
+ * running, so we can release the sleepable lock
+ */
+ bpf_prog_inc(t->prog);
+
+ spin_unlock_irqrestore(&t->sleepable_lock, flags);
+
+ callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
+ /* The verifier checked that return value is zero. */
+
+ bpf_prog_put(t->prog);
+}
+
static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
@@ -1192,6 +1241,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
t->prog = NULL;
rcu_assign_pointer(t->callback_fn, NULL);
hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+ INIT_WORK(&t->work, bpf_timer_work_cb);
+ spin_lock_init(&t->sleepable_lock);
t->timer.function = bpf_timer_cb;
WRITE_ONCE(timer->timer, t);
/* Guarantee the order between timer->timer and map->usercnt. So
@@ -1237,6 +1288,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
ret = -EINVAL;
goto out;
}
+ spin_lock(&t->sleepable_lock);
if (!atomic64_read(&t->map->usercnt)) {
/* maps with timers must be either held by user space
* or pinned in bpffs. Otherwise timer might still be
@@ -1263,6 +1315,8 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
}
rcu_assign_pointer(t->callback_fn, callback_fn);
out:
+ if (t)
+ spin_unlock(&t->sleepable_lock);
__bpf_spin_unlock_irqrestore(&timer->lock);
return ret;
}
@@ -1283,8 +1337,12 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
if (in_nmi())
return -EOPNOTSUPP;
- if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
+ if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
return -EINVAL;
+
+ if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
+ return -EINVAL;
+
__bpf_spin_lock_irqsave(&timer->lock);
t = timer->timer;
if (!t || !t->prog) {
@@ -1300,7 +1358,10 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
if (flags & BPF_F_TIMER_CPU_PIN)
mode |= HRTIMER_MODE_PINNED;
- hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
+ if (flags & BPF_F_TIMER_SLEEPABLE)
+ schedule_work(&t->work);
+ else
+ hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
out:
__bpf_spin_unlock_irqrestore(&timer->lock);
return ret;
@@ -1348,13 +1409,22 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
ret = -EDEADLK;
goto out;
}
+ spin_lock(&t->sleepable_lock);
drop_prog_refcnt(t);
+ spin_unlock(&t->sleepable_lock);
out:
__bpf_spin_unlock_irqrestore(&timer->lock);
/* Cancel the timer and wait for associated callback to finish
* if it was running.
*/
ret = ret ?: hrtimer_cancel(&t->timer);
+
+ /* also cancel the sleepable work, but *do not* wait for
+ * it to finish if it was running as we might not be in a
+ * sleepable context
+ */
+ ret = ret ?: cancel_work(&t->work);
+
rcu_read_unlock();
return ret;
}
@@ -1383,11 +1453,13 @@ void bpf_timer_cancel_and_free(void *val)
t = timer->timer;
if (!t)
goto out;
+ spin_lock(&t->sleepable_lock);
drop_prog_refcnt(t);
/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
* this timer, since it won't be initialized.
*/
WRITE_ONCE(timer->timer, NULL);
+ spin_unlock(&t->sleepable_lock);
out:
__bpf_spin_unlock_irqrestore(&timer->lock);
if (!t)
@@ -1410,6 +1482,16 @@ void bpf_timer_cancel_and_free(void *val)
*/
if (this_cpu_read(hrtimer_running) != t)
hrtimer_cancel(&t->timer);
+
+ /* also cancel the sleepable work, but *do not* wait for
+ * it to finish if it was running as we might not be in a
+ * sleepable context. Same reason as above, it's fine to
+ * free 't': the subprog callback will never access it anymore
+ * and can not reschedule itself since timer->timer = NULL was
+ * already done.
+ */
+ cancel_work(&t->work);
+
kfree_rcu(t, rcu);
}
--
2.44.0
In this patch, bpf_timer_set_sleepable_cb() is functionally equivalent
to bpf_timer_set_callback(), except that it enforces that the timer is
started with BPF_F_TIMER_SLEEPABLE.
But given that bpf_timer_set_callback() is a helper while
bpf_timer_set_sleepable_cb() is a kfunc, we need to teach the verifier
about its attached callback.
Marking that callback as sleepable will be done in a separate patch.
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v5:
- enforced sleepable timers to only have BPF_F_TIMER_SLEEPABLE
- use is_bpf_timer_set_sleepable_cb_impl_kfunc() instead of generic
is_async_cb()
changes in v4:
- added a new (ignored) argument to the kfunc so that we do not
need to walk the stack
new in v3 (split from v2 02/10)
---
kernel/bpf/helpers.c | 46 +++++++++++++++++++++++++++++++++++++--
kernel/bpf/verifier.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 102 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 38de73a9df83..04f5a9b2c723 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1108,6 +1108,7 @@ struct bpf_hrtimer {
void *value;
struct rcu_head rcu;
spinlock_t sleepable_lock;
+ bool is_sleepable;
};
/* the actual struct hidden inside uapi struct bpf_timer */
@@ -1273,8 +1274,8 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
.arg3_type = ARG_ANYTHING,
};
-BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callback_fn,
- struct bpf_prog_aux *, aux)
+static int __bpf_timer_set_callback(struct bpf_timer_kern *timer, void *callback_fn,
+ struct bpf_prog_aux *aux, bool is_sleepable)
{
struct bpf_prog *prev, *prog = aux->prog;
struct bpf_hrtimer *t;
@@ -1314,6 +1315,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
t->prog = prog;
}
rcu_assign_pointer(t->callback_fn, callback_fn);
+ t->is_sleepable = is_sleepable;
out:
if (t)
spin_unlock(&t->sleepable_lock);
@@ -1321,6 +1323,12 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
return ret;
}
+BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callback_fn,
+ struct bpf_prog_aux *, aux)
+{
+ return __bpf_timer_set_callback(timer, callback_fn, aux, false);
+}
+
static const struct bpf_func_proto bpf_timer_set_callback_proto = {
.func = bpf_timer_set_callback,
.gpl_only = true,
@@ -1350,6 +1358,11 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
goto out;
}
+ if (t->is_sleepable && flags != BPF_F_TIMER_SLEEPABLE) {
+ ret = -EINVAL;
+ goto out;
+ }
+
if (flags & BPF_F_TIMER_ABS)
mode = HRTIMER_MODE_ABS_SOFT;
else
@@ -2627,6 +2640,34 @@ __bpf_kfunc void bpf_throw(u64 cookie)
WARN(1, "A call to BPF exception callback should never return\n");
}
+/**
+ * bpf_timer_set_sleepable_cb_impl() - Configure the timer to call the
+ * %callback_fn static function in a sleepable context.
+ * @timer: The bpf_timer that needs to be configured
+ * @callback_fn: a static bpf function
+ *
+ * @returns %0 on success. %-EINVAL if %timer was not initialized with
+ * bpf_timer_init() earlier. %-EPERM if %timer is in a map that doesn't
+ * have any user references.
+ * The user space should either hold a file descriptor to a map with timers
+ * or pin such map in bpffs. When map is unpinned or file descriptor is
+ * closed all timers in the map will be cancelled and freed.
+ *
+ * This kfunc is equivalent to %bpf_timer_set_callback except that it tells
+ * the verifier that the target callback is run in a sleepable context.
+ */
+__bpf_kfunc int bpf_timer_set_sleepable_cb_impl(struct bpf_timer_kern *timer,
+ int (callback_fn)(void *map, int *key, struct bpf_timer *timer),
+ void *aux__ign)
+{
+ struct bpf_prog_aux *aux = (struct bpf_prog_aux *)aux__ign;
+
+ if (!aux)
+ return -EINVAL;
+
+ return __bpf_timer_set_callback(timer, (void *)callback_fn, aux, true);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -2703,6 +2744,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_null)
BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
BTF_ID_FLAGS(func, bpf_dynptr_size)
BTF_ID_FLAGS(func, bpf_dynptr_clone)
+BTF_ID_FLAGS(func, bpf_timer_set_sleepable_cb_impl)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 24a604e26ec7..458765d881f3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -501,8 +501,12 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
}
static bool is_sync_callback_calling_kfunc(u32 btf_id);
+static bool is_async_callback_calling_kfunc(u32 btf_id);
+static bool is_callback_calling_kfunc(u32 btf_id);
static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
+static bool is_bpf_timer_set_sleepable_cb_impl_kfunc(u32 btf_id);
+
static bool is_sync_callback_calling_function(enum bpf_func_id func_id)
{
return func_id == BPF_FUNC_for_each_map_elem ||
@@ -530,7 +534,8 @@ static bool is_sync_callback_calling_insn(struct bpf_insn *insn)
static bool is_async_callback_calling_insn(struct bpf_insn *insn)
{
- return bpf_helper_call(insn) && is_async_callback_calling_function(insn->imm);
+ return (bpf_helper_call(insn) && is_async_callback_calling_function(insn->imm)) ||
+ (bpf_pseudo_kfunc_call(insn) && is_async_callback_calling_kfunc(insn->imm));
}
static bool is_may_goto_insn(struct bpf_insn *insn)
@@ -9471,7 +9476,7 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
*/
env->subprog_info[subprog].is_cb = true;
if (bpf_pseudo_kfunc_call(insn) &&
- !is_sync_callback_calling_kfunc(insn->imm)) {
+ !is_callback_calling_kfunc(insn->imm)) {
verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
func_id_name(insn->imm), insn->imm);
return -EFAULT;
@@ -10981,6 +10986,7 @@ enum special_kfunc_type {
KF_bpf_percpu_obj_drop_impl,
KF_bpf_throw,
KF_bpf_iter_css_task_new,
+ KF_bpf_timer_set_sleepable_cb_impl,
};
BTF_SET_START(special_kfunc_set)
@@ -11007,6 +11013,7 @@ BTF_ID(func, bpf_throw)
#ifdef CONFIG_CGROUPS
BTF_ID(func, bpf_iter_css_task_new)
#endif
+BTF_ID(func, bpf_timer_set_sleepable_cb_impl)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
@@ -11037,6 +11044,7 @@ BTF_ID(func, bpf_iter_css_task_new)
#else
BTF_ID_UNUSED
#endif
+BTF_ID(func, bpf_timer_set_sleepable_cb_impl)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -11365,12 +11373,28 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)
return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
}
+static bool is_async_callback_calling_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_timer_set_sleepable_cb_impl];
+}
+
static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
{
return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
insn->imm == special_kfunc_list[KF_bpf_throw];
}
+static bool is_bpf_timer_set_sleepable_cb_impl_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_timer_set_sleepable_cb_impl];
+}
+
+static bool is_callback_calling_kfunc(u32 btf_id)
+{
+ return is_sync_callback_calling_kfunc(btf_id) ||
+ is_async_callback_calling_kfunc(btf_id);
+}
+
static bool is_rbtree_lock_required_kfunc(u32 btf_id)
{
return is_bpf_rbtree_api_kfunc(btf_id);
@@ -12155,6 +12179,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
}
}
+ if (is_bpf_timer_set_sleepable_cb_impl_kfunc(meta.func_id)) {
+ err = push_callback_call(env, insn, insn_idx, meta.subprogno,
+ set_timer_callback_state);
+ if (err) {
+ verbose(env, "kfunc %s#%d failed callback verification\n",
+ func_name, meta.func_id);
+ return err;
+ }
+ }
+
rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta);
rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta);
@@ -19548,6 +19582,28 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
*cnt = 1;
+ } else if (is_bpf_timer_set_sleepable_cb_impl_kfunc(desc->func_id)) {
+ /* The verifier will process callback_fn as many times as necessary
+ * with different maps and the register states prepared by
+ * set_timer_callback_state will be accurate.
+ *
+ * The following use case is valid:
+ * map1 is shared by prog1, prog2, prog3.
+ * prog1 calls bpf_timer_init for some map1 elements
+ * prog2 calls bpf_timer_set_callback for some map1 elements.
+ * Those that were not bpf_timer_init-ed will return -EINVAL.
+ * prog3 calls bpf_timer_start for some map1 elements.
+ * Those that were not both bpf_timer_init-ed and
+ * bpf_timer_set_callback-ed will return -EINVAL.
+ */
+ struct bpf_insn ld_addrs[2] = {
+ BPF_LD_IMM64(BPF_REG_3, (long)env->prog->aux),
+ };
+
+ insn_buf[0] = ld_addrs[0];
+ insn_buf[1] = ld_addrs[1];
+ insn_buf[2] = *insn;
+ *cnt = 3;
}
return 0;
}
--
2.44.0
Now that we have bpf_timer_set_sleepable_cb() available and working, we
can tag the attached callback as sleepable, and let the verifier check
the calls and kfuncs in the correct context.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v5
changes in v4:
- use a function parameter to forward the sleepable information
new in v3 (split from v2 02/10)
---
include/linux/bpf_verifier.h | 1 +
kernel/bpf/verifier.c | 13 ++++++++++---
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7cb1b75eee38..14e4ee67b694 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -426,6 +426,7 @@ struct bpf_verifier_state {
* while they are still in use.
*/
bool used_as_loop_entry;
+ bool in_sleepable;
/* first and last insn idx of this verifier state */
u32 first_insn_idx;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 458765d881f3..39a16184c0d8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1434,6 +1434,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
}
dst_state->speculative = src->speculative;
dst_state->active_rcu_lock = src->active_rcu_lock;
+ dst_state->in_sleepable = src->in_sleepable;
dst_state->curframe = src->curframe;
dst_state->active_lock.ptr = src->active_lock.ptr;
dst_state->active_lock.id = src->active_lock.id;
@@ -2407,7 +2408,7 @@ static void init_func_state(struct bpf_verifier_env *env,
/* Similar to push_stack(), but for async callbacks */
static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
int insn_idx, int prev_insn_idx,
- int subprog)
+ int subprog, bool is_sleepable)
{
struct bpf_verifier_stack_elem *elem;
struct bpf_func_state *frame;
@@ -2434,6 +2435,7 @@ static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
* Initialize it similar to do_check_common().
*/
elem->st.branches = 1;
+ elem->st.in_sleepable = is_sleepable;
frame = kzalloc(sizeof(*frame), GFP_KERNEL);
if (!frame)
goto err;
@@ -5279,7 +5281,8 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
static bool in_sleepable(struct bpf_verifier_env *env)
{
- return env->prog->sleepable;
+ return env->prog->sleepable ||
+ (env->cur_state && env->cur_state->in_sleepable);
}
/* The non-sleepable programs and sleepable programs with explicit bpf_rcu_read_lock()
@@ -9493,7 +9496,8 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
/* there is no real recursion here. timer callbacks are async */
env->subprog_info[subprog].is_async_cb = true;
async_cb = push_async_cb(env, env->subprog_info[subprog].start,
- insn_idx, subprog);
+ insn_idx, subprog,
+ is_bpf_timer_set_sleepable_cb_impl_kfunc(insn->imm));
if (!async_cb)
return -EFAULT;
callee = async_cb->frame[0];
@@ -16941,6 +16945,9 @@ static bool states_equal(struct bpf_verifier_env *env,
if (old->active_rcu_lock != cur->active_rcu_lock)
return false;
+ if (old->in_sleepable != cur->in_sleepable)
+ return false;
+
/* for states to be equal callsites have to be the same
* and all frame states need to be equivalent
*/
--
2.44.0
bpf_experimental.h and ../bpf_testmod/bpf_testmod_kfunc.h are both
including vmlinux.h, which is not compatible with including time.h
or bpf_tcp_helpers.h.
So keep sleepable tests in a separate bpf source file.
The first correct test is run twice for convenience:
- first through RUN_TESTS
- then we ensure that the timer was actually executed, in a manual
load/attach/run
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v5:
- keep sleepable test separate
new in v4
---
tools/testing/selftests/bpf/bpf_experimental.h | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
tools/testing/selftests/bpf/prog_tests/timer.c | 34 ++++
.../testing/selftests/bpf/progs/timer_sleepable.c | 185 +++++++++++++++++++++
5 files changed, 230 insertions(+)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index a5b9df38c162..06ed98b3bc51 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -459,4 +459,9 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it,
extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
+extern int bpf_timer_set_sleepable_cb_impl(struct bpf_timer *timer,
+ int (callback_fn)(void *map, int *key, struct bpf_timer *timer),
+ void *aux__ign) __ksym;
+#define bpf_timer_set_sleepable_cb(timer, cb) \
+ bpf_timer_set_sleepable_cb_impl(timer, cb, NULL)
#endif
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..eb2b78552ca2 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -494,6 +494,10 @@ __bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused
return arg;
}
+__bpf_kfunc void bpf_kfunc_call_test_sleepable(void)
+{
+}
+
BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -520,6 +524,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS | KF_RCU)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_destructive, KF_DESTRUCTIVE)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_sleepable, KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index 7c664dd61059..ce5cd763561c 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -96,6 +96,7 @@ void bpf_kfunc_call_test_pass2(struct prog_test_pass2 *p) __ksym;
void bpf_kfunc_call_test_mem_len_fail2(__u64 *mem, int len) __ksym;
void bpf_kfunc_call_test_destructive(void) __ksym;
+void bpf_kfunc_call_test_sleepable(void) __ksym;
void bpf_kfunc_call_test_offset(struct prog_test_ref_kfunc *p);
struct prog_test_member *bpf_kfunc_call_memb_acquire(void);
diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index d66687f1ee6a..0c2c3ffe1887 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -3,6 +3,7 @@
#include <test_progs.h>
#include "timer.skel.h"
#include "timer_failure.skel.h"
+#include "timer_sleepable.skel.h"
#define NUM_THR 8
@@ -95,3 +96,36 @@ void serial_test_timer(void)
RUN_TESTS(timer_failure);
}
+
+/* isolate sleepable tests from main timer tests
+ * because if test_timer fails, it spews the console
+ * with 10000 * 5 "spin_lock_thread:PASS:test_run_opts retval 0 nsec"
+ */
+void serial_test_sleepable_timer(void)
+{
+ struct timer_sleepable *timer_sleepable_skel = NULL;
+ int err, prog_fd;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ RUN_TESTS(timer_sleepable);
+
+ /* re-run the success test to check if the timer was actually executed */
+
+ timer_sleepable_skel = timer_sleepable__open_and_load();
+ if (!ASSERT_OK_PTR(timer_sleepable_skel, "timer_sleepable_skel_load"))
+ return;
+
+ err = timer_sleepable__attach(timer_sleepable_skel);
+ if (!ASSERT_OK(err, "timer_sleepable_attach"))
+ return;
+
+ prog_fd = bpf_program__fd(timer_sleepable_skel->progs.test_call_sleepable);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "test_run");
+ ASSERT_EQ(topts.retval, 0, "test_run");
+
+ usleep(50); /* 10 usecs should be enough, but give it extra */
+
+ ASSERT_EQ(timer_sleepable_skel->bss->ok_sleepable, 1, "ok_sleepable");
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_sleepable.c b/tools/testing/selftests/bpf/progs/timer_sleepable.c
new file mode 100644
index 000000000000..97612a020e8c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_sleepable.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook
+ * Copyright (c) 2024 Benjamin Tissoires
+ */
+
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define CLOCK_MONOTONIC 1
+
+struct elem {
+ struct bpf_timer t;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 3);
+ __type(key, int);
+ __type(value, struct elem);
+} timer_map SEC(".maps");
+
+__u32 ok_sleepable;
+
+/* callback for sleepable timer */
+static int timer_cb_sleepable(void *map, int *key, struct bpf_timer *timer)
+{
+ bpf_kfunc_call_test_sleepable();
+ ok_sleepable |= (1 << *key);
+ return 0;
+}
+
+SEC("syscall")
+/* check that calling bpf_timer_start() with BPF_F_TIMER_SLEEPABLE on a sleepable
+ * callback works
+ */
+__retval(0)
+long test_call_sleepable(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ if (ok_sleepable & 1)
+ return -1;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (timer) {
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC) != 0)
+ return -2;
+ bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable);
+ if (bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE))
+ return -3;
+ } else {
+ return -4;
+ }
+
+ return 0;
+}
+
+SEC("?syscall")
+__log_level(2)
+__failure
+/* check that bpf_timer_set_callback() can not be called with a
+ * sleepable callback
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_kfunc_call_test_sleepable#") /* anchor message */
+__msg("program must be sleepable to call sleepable kfunc bpf_kfunc_call_test_sleepable")
+long test_non_sleepable_sleepable_callback(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (timer) {
+ bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, timer_cb_sleepable);
+ bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE);
+ }
+
+ return 0;
+}
+
+SEC("syscall")
+/* check that calling bpf_timer_start() without BPF_F_TIMER_SLEEPABLE on a sleepable
+ * callback returns -EINVAL
+ */
+__retval(-22)
+long test_call_sleepable_missing_flag(void *ctx)
+{
+ int key = 1;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable))
+ return 3;
+
+ return bpf_timer_start(timer, 0, 0);
+}
+
+SEC("syscall")
+/* check that calling bpf_timer_start() with a non-zero delay on a sleepable
+ * callback returns -EINVAL
+ */
+__retval(-22)
+long test_call_sleepable_delay(void *ctx)
+{
+ int key = 2;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable))
+ return 3;
+
+ return bpf_timer_start(timer, 1, BPF_F_TIMER_SLEEPABLE);
+}
+
+SEC("?syscall")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_timer_set_sleepable_cb()
+ * is a correct bpf_timer pointer.
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_timer_set_sleepable_cb_impl#") /* anchor message */
+__msg("arg#0 doesn't point to a map value")
+long test_wrong_pointer(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb((void *)&timer, timer_cb_sleepable))
+ return 3;
+
+ return -22;
+}
+
+SEC("?syscall")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_timer_set_sleepable_cb()
+ * is a correct bpf_timer pointer.
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_timer_set_sleepable_cb_impl#") /* anchor message */
+__msg("arg#0 offset can not be greater than 0")
+long test_wrong_pointer_offset(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb((void *)timer + 1, timer_cb_sleepable))
+ return 3;
+
+ return -22;
+}
--
2.44.0
cp include/uapi/linux/bpf.h tools/include/uapi/linux/bpf.h
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v5
new in v4
---
tools/include/uapi/linux/bpf.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3c42b9f1bada..b90def29d796 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7461,10 +7461,14 @@ struct bpf_core_relo {
* - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
* relative to current time.
* - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
+ * - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
+ * no guarantees of ordering or timing (consider this as being just
+ * offloaded immediately).
*/
enum {
BPF_F_TIMER_ABS = (1ULL << 0),
BPF_F_TIMER_CPU_PIN = (1ULL << 1),
+ BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
};
/* BPF numbers iterator state */
--
2.44.0
From: Benjamin Tissoires <[email protected]>
bpf_experimental.h and ../bpf_testmod/bpf_testmod_kfunc.h both include
vmlinux.h, which is incompatible with including time.h or
bpf_tcp_helpers.h.
So keep sleepable tests in a separate bpf source file.
The first correct test is run twice for convenience:
- first through RUN_TESTS
- then we ensure that the timer was actually executed, in a manual
load/attach/run
Signed-off-by: Benjamin Tissoires <[email protected]>
---
Sorry, I realized the tests are failing :(
This is an updated version that replaces the one in v5
changes in v5b:
- do not use SEC(syscall) as they are run in a sleepable context
already
- duplicate test_call_sleepable() so we can test if the callback gets
called
changes in v5:
- keep sleepable test separate
new in v4
---
.../testing/selftests/bpf/bpf_experimental.h | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 5 +
.../bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
.../testing/selftests/bpf/prog_tests/timer.c | 34 +++
.../selftests/bpf/progs/timer_sleepable.c | 212 ++++++++++++++++++
5 files changed, 257 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/timer_sleepable.c
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index a5b9df38c162..06ed98b3bc51 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -459,4 +459,9 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it,
extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
+extern int bpf_timer_set_sleepable_cb_impl(struct bpf_timer *timer,
+ int (callback_fn)(void *map, int *key, struct bpf_timer *timer),
+ void *aux__ign) __ksym;
+#define bpf_timer_set_sleepable_cb(timer, cb) \
+ bpf_timer_set_sleepable_cb_impl(timer, cb, NULL)
#endif
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..eb2b78552ca2 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -494,6 +494,10 @@ __bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused
return arg;
}
+__bpf_kfunc void bpf_kfunc_call_test_sleepable(void)
+{
+}
+
BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -520,6 +524,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS | KF_RCU)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_destructive, KF_DESTRUCTIVE)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_sleepable, KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index 7c664dd61059..ce5cd763561c 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -96,6 +96,7 @@ void bpf_kfunc_call_test_pass2(struct prog_test_pass2 *p) __ksym;
void bpf_kfunc_call_test_mem_len_fail2(__u64 *mem, int len) __ksym;
void bpf_kfunc_call_test_destructive(void) __ksym;
+void bpf_kfunc_call_test_sleepable(void) __ksym;
void bpf_kfunc_call_test_offset(struct prog_test_ref_kfunc *p);
struct prog_test_member *bpf_kfunc_call_memb_acquire(void);
diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index d66687f1ee6a..c6c7c623b31c 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -3,6 +3,7 @@
#include <test_progs.h>
#include "timer.skel.h"
#include "timer_failure.skel.h"
+#include "timer_sleepable.skel.h"
#define NUM_THR 8
@@ -95,3 +96,36 @@ void serial_test_timer(void)
RUN_TESTS(timer_failure);
}
+
+/* isolate sleepable tests from main timer tests
+ * because if test_timer fails, it floods the console
+ * with 10000 * 5 "spin_lock_thread:PASS:test_run_opts retval 0 nsec"
+ */
+void serial_test_sleepable_timer(void)
+{
+ struct timer_sleepable *timer_sleepable_skel = NULL;
+ int err, prog_fd;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ RUN_TESTS(timer_sleepable);
+
+ /* re-run the success test to check if the timer was actually executed */
+
+ timer_sleepable_skel = timer_sleepable__open_and_load();
+ if (!ASSERT_OK_PTR(timer_sleepable_skel, "timer_sleepable_skel_load"))
+ return;
+
+ err = timer_sleepable__attach(timer_sleepable_skel);
+ if (!ASSERT_OK(err, "timer_sleepable_attach"))
+ return;
+
+ prog_fd = bpf_program__fd(timer_sleepable_skel->progs.test_syscall_sleepable);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "test_run");
+ ASSERT_EQ(topts.retval, 0, "test_run");
+
+ usleep(50); /* 10 usecs should be enough, but give it extra */
+
+ ASSERT_EQ(timer_sleepable_skel->bss->ok_sleepable, 1, "ok_sleepable");
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_sleepable.c b/tools/testing/selftests/bpf/progs/timer_sleepable.c
new file mode 100644
index 000000000000..07df35b227bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_sleepable.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook
+ * Copyright (c) 2024 Benjamin Tissoires
+ */
+
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define CLOCK_MONOTONIC 1
+
+struct elem {
+ struct bpf_timer t;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 3);
+ __type(key, int);
+ __type(value, struct elem);
+} timer_map SEC(".maps");
+
+__u32 ok_sleepable;
+
+/* callback for sleepable timer */
+static int timer_cb_sleepable(void *map, int *key, struct bpf_timer *timer)
+{
+ bpf_kfunc_call_test_sleepable();
+ ok_sleepable |= (1 << *key);
+ return 0;
+}
+
+SEC("tc")
+/* check that calling bpf_timer_start() with BPF_F_TIMER_SLEEPABLE on a sleepable
+ * callback works
+ */
+__retval(0)
+long test_call_sleepable(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ if (ok_sleepable & 1)
+ return -1;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (timer) {
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC) != 0)
+ return -2;
+ bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable);
+ if (bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE))
+ return -3;
+ } else {
+ return -4;
+ }
+
+ return 0;
+}
+
+SEC("syscall")
+/* check that calling bpf_timer_start() with BPF_F_TIMER_SLEEPABLE on a sleepable
+ * callback works.
+ */
+__retval(0)
+long test_syscall_sleepable(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ if (ok_sleepable & 1)
+ return -1;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (timer) {
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC) != 0)
+ return -2;
+ bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable);
+ if (bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE))
+ return -3;
+ } else {
+ return -4;
+ }
+
+ return 0;
+}
+
+SEC("?tc")
+__log_level(2)
+__failure
+/* check that bpf_timer_set_callback() can not be called with a
+ * sleepable callback
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_kfunc_call_test_sleepable#") /* anchor message */
+__msg("program must be sleepable to call sleepable kfunc bpf_kfunc_call_test_sleepable")
+long test_non_sleepable_sleepable_callback(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (timer) {
+ bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, timer_cb_sleepable);
+ bpf_timer_start(timer, 0, BPF_F_TIMER_SLEEPABLE);
+ }
+
+ return 0;
+}
+
+SEC("tc")
+/* check that calling bpf_timer_start() without BPF_F_TIMER_SLEEPABLE on a sleepable
+ * callback returns -EINVAL
+ */
+__retval(-22)
+long test_call_sleepable_missing_flag(void *ctx)
+{
+ int key = 1;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable))
+ return 3;
+
+ return bpf_timer_start(timer, 0, 0);
+}
+
+SEC("tc")
+/* check that calling bpf_timer_start() with a non-zero delay on a sleepable
+ * callback returns -EINVAL
+ */
+__retval(-22)
+long test_call_sleepable_delay(void *ctx)
+{
+ int key = 2;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb(timer, timer_cb_sleepable))
+ return 3;
+
+ return bpf_timer_start(timer, 1, BPF_F_TIMER_SLEEPABLE);
+}
+
+SEC("?tc")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_timer_set_sleepable_cb()
+ * is a correct bpf_timer pointer.
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_timer_set_sleepable_cb_impl#") /* anchor message */
+__msg("arg#0 doesn't point to a map value")
+long test_wrong_pointer(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb((void *)&timer, timer_cb_sleepable))
+ return 3;
+
+ return -22;
+}
+
+SEC("?tc")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_timer_set_sleepable_cb()
+ * is a correct bpf_timer pointer.
+ */
+__msg("mark_precise: frame0: regs=r1 stack= before")
+__msg(": (85) call bpf_timer_set_sleepable_cb_impl#") /* anchor message */
+__msg("arg#0 offset can not be greater than 0")
+long test_wrong_pointer_offset(void *ctx)
+{
+ int key = 0;
+ struct bpf_timer *timer;
+
+ timer = bpf_map_lookup_elem(&timer_map, &key);
+ if (!timer)
+ return 1;
+
+ if (bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC))
+ return 2;
+
+ if (bpf_timer_set_sleepable_cb((void *)timer + 1, timer_cb_sleepable))
+ return 3;
+
+ return -22;
+}
--
2.44.0
On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
>
> We need to extend the bpf_timer API, but the way forward relies on kfuncs.
> So make bpf_timer known for kfuncs from the verifier PoV
>
> Signed-off-by: Benjamin Tissoires <[email protected]>
>
> ---
>
> changes in v5:
> - also check for the reg offset
>
> changes in v4:
> - enforce KF_ARG_PTR_TO_TIMER to be of type PTR_TO_MAP_VALUE
>
> new in v3 (split from v2 02/10)
> ---
> kernel/bpf/verifier.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 63749ad5ac6b..24a604e26ec7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10826,6 +10826,7 @@ enum {
> KF_ARG_LIST_NODE_ID,
> KF_ARG_RB_ROOT_ID,
> KF_ARG_RB_NODE_ID,
> + KF_ARG_TIMER_ID,
> };
>
> BTF_ID_LIST(kf_arg_btf_ids)
> @@ -10834,6 +10835,7 @@ BTF_ID(struct, bpf_list_head)
> BTF_ID(struct, bpf_list_node)
> BTF_ID(struct, bpf_rb_root)
> BTF_ID(struct, bpf_rb_node)
> +BTF_ID(struct, bpf_timer_kern)
>
> static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
> const struct btf_param *arg, int type)
> @@ -10877,6 +10879,12 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
> return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
> }
>
> +static bool is_kfunc_arg_timer(const struct btf *btf, const struct btf_param *arg)
> +{
> + bool ret = __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TIMER_ID);
> + return ret;
> +}
> +
> static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
> const struct btf_param *arg)
> {
> @@ -10946,6 +10954,7 @@ enum kfunc_ptr_arg_type {
> KF_ARG_PTR_TO_NULL,
> KF_ARG_PTR_TO_CONST_STR,
> KF_ARG_PTR_TO_MAP,
> + KF_ARG_PTR_TO_TIMER,
> };
>
> enum special_kfunc_type {
> @@ -11102,6 +11111,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> if (is_kfunc_arg_map(meta->btf, &args[argno]))
> return KF_ARG_PTR_TO_MAP;
>
> + if (is_kfunc_arg_timer(meta->btf, &args[argno]))
> + return KF_ARG_PTR_TO_TIMER;
> +
> if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> if (!btf_type_is_struct(ref_t)) {
> verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
> @@ -11735,6 +11747,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> case KF_ARG_PTR_TO_CALLBACK:
> case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
> case KF_ARG_PTR_TO_CONST_STR:
> + case KF_ARG_PTR_TO_TIMER:
> /* Trusted by default */
> break;
> default:
> @@ -12021,6 +12034,16 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> if (ret)
> return ret;
> break;
> + case KF_ARG_PTR_TO_TIMER:
> + if (reg->type != PTR_TO_MAP_VALUE) {
> + verbose(env, "arg#%d doesn't point to a map value\n", i);
> + return -EINVAL;
> + }
> + if (reg->off) {
> + verbose(env, "arg#%d offset can not be greater than 0\n", i);
> + return -EINVAL;
> + }
This won't be correct. You don't really check whether the timer exists
at reg->off (and if you did, this would still restrict it to 0 offset,
and not check the variable offset which would be non-zero). What I
would suggest is calling process_timer_func (see how dynptr calls the
same underlying process_dynptr_func to enforce type invariants). This
would allow sharing the same checks and avoid bugs from creeping in.
It does all checks wrt constant/variable offset and looking up the
timer field offset and matching it against the one in the pointer.
> + break;
> }
> }
>
>
> --
> 2.44.0
>
>
On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
>
> In this patch, bpf_timer_set_sleepable_cb() is functionally equivalent
> to bpf_timer_set_callback(), to the exception that it enforces
> the timer to be started with BPF_F_TIMER_SLEEPABLE.
>
> But given that bpf_timer_set_callback() is a helper when
> bpf_timer_set_sleepable_cb() is a kfunc, we need to teach the verifier
> about its attached callback.
> Marking that callback as sleepable will be done in a separate patch
>
> Acked-by: Eduard Zingerman <[email protected]>
> Signed-off-by: Benjamin Tissoires <[email protected]>
>
> ---
> [...]
>
> @@ -19548,6 +19582,28 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> *cnt = 1;
> + } else if (is_bpf_timer_set_sleepable_cb_impl_kfunc(desc->func_id)) {
> + /* The verifier will process callback_fn as many times as necessary
> + * with different maps and the register states prepared by
> + * set_timer_callback_state will be accurate.
> + *
> + * The following use case is valid:
> + * map1 is shared by prog1, prog2, prog3.
> + * prog1 calls bpf_timer_init for some map1 elements
> + * prog2 calls bpf_timer_set_callback for some map1 elements.
> + * Those that were not bpf_timer_init-ed will return -EINVAL.
> + * prog3 calls bpf_timer_start for some map1 elements.
> + * Those that were not both bpf_timer_init-ed and
> + * bpf_timer_set_callback-ed will return -EINVAL.
> + */
> + struct bpf_insn ld_addrs[2] = {
> + BPF_LD_IMM64(BPF_REG_3, (long)env->prog->aux),
> + };
> +
> + insn_buf[0] = ld_addrs[0];
> + insn_buf[1] = ld_addrs[1];
> + insn_buf[2] = *insn;
> + *cnt = 3;
> }
Would be better to handle the fixup of all kfuncs in one place, i.e.
fixup_kfunc_call.
> return 0;
> }
>
> --
> 2.44.0
>
>
On Fri, Mar 22, 2024 at 9:31 AM Kumar Kartikeya Dwivedi
<[email protected]> wrote:
>
> On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
> >
> > We need to extend the bpf_timer API, but the way forward relies on kfuncs.
> > So make bpf_timer known for kfuncs from the verifier PoV
> >
> > Signed-off-by: Benjamin Tissoires <[email protected]>
> >
> > [...]
> > @@ -12021,6 +12034,16 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > if (ret)
> > return ret;
> > break;
> > + case KF_ARG_PTR_TO_TIMER:
> > + if (reg->type != PTR_TO_MAP_VALUE) {
> > + verbose(env, "arg#%d doesn't point to a map value\n", i);
> > + return -EINVAL;
> > + }
> > + if (reg->off) {
> > + verbose(env, "arg#%d offset can not be greater than 0\n", i);
> > + return -EINVAL;
> > + }
>
> This won't be correct. You don't really check whether the timer exists
> at reg->off (and if you did, this would still restrict it to 0 offset,
> and not check the variable offset which would be non-zero). What I
> would suggest is calling process_timer_func (see how dynptr calls the
> same underlying process_dynptr_func to enforce type invariants). This
> would allow sharing the same checks and avoid bugs from creeping in.
> It does all checks wrt constant/variable offset and looking up the
> timer field offset and matching it against the one in the pointer.
Observation is correct. The patch is buggy,
but the suggestion to follow process_dynptr_func() will lead
to unnecessary complexity.
dynptr-s are on stack with plenty of extra checks.
In this case bpf_timer is in map_value.
Much simpler is to follow the KF_ARG_PTR_TO_MAP approach.
Set ref_id, ref_t, ref_tname to bpf_timer and
process_kf_arg_ptr_to_btf_id() should work as-is.
You still need
+BTF_ID(struct, bpf_timer_kern)
to recognize that argument in a kfunc,
but the selftests will be using 'struct bpf_timer t',
so KF_ARG_PTR_TO_TIMER would need to match against 'struct bpf_timer'
and not against 'struct bpf_timer_kern'.
On Sun, 24 Mar 2024 at 04:53, Alexei Starovoitov
<[email protected]> wrote:
>
> On Fri, Mar 22, 2024 at 9:31 AM Kumar Kartikeya Dwivedi
> <[email protected]> wrote:
> >
> > On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
> > >
> > > We need to extend the bpf_timer API, but the way forward relies on kfuncs.
> > > So make bpf_timer known for kfuncs from the verifier PoV
> > >
> > > Signed-off-by: Benjamin Tissoires <[email protected]>
> > >
> > > [...]
> > > @@ -12021,6 +12034,16 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > if (ret)
> > > return ret;
> > > break;
> > > + case KF_ARG_PTR_TO_TIMER:
> > > + if (reg->type != PTR_TO_MAP_VALUE) {
> > > + verbose(env, "arg#%d doesn't point to a map value\n", i);
> > > + return -EINVAL;
> > > + }
> > > + if (reg->off) {
> > > + verbose(env, "arg#%d offset can not be greater than 0\n", i);
> > > + return -EINVAL;
> > > + }
> >
> > This won't be correct. You don't really check whether the timer exists
> > at reg->off (and if you did, this would still restrict it to 0 offset,
> > and not check the variable offset which would be non-zero). What I
> > would suggest is calling process_timer_func (see how dynptr calls the
> > same underlying process_dynptr_func to enforce type invariants). This
> > would allow sharing the same checks and avoid bugs from creeping in.
> > It does all checks wrt constant/variable offset and looking up the
> > timer field offset and matching it against the one in the pointer.
>
> Observation is correct. The patch is buggy,
> but the suggestion to follow process_dynptr_func() will lead
> to unnecessary complexity.
> dynptr-s are on stack with plenty of extra checks.
The suggestion was to call process_timer_func, not process_dynptr_func.
> In this case bpf_timer is in map_value.
> Much simpler is to follow KF_ARG_PTR_TO_MAP approach.
What I meant by the example was that dynptr handling does the same
thing for kfuncs and helpers (using the same function), so timer
arguments should do the same (i.e. use process_timer_func), which will
do all checks for constant offset (ensuring var_off is tnum_is_const)
and match it against btf_record->timer_off.
> [...]
On Sat, Mar 23, 2024 at 9:01 PM Kumar Kartikeya Dwivedi
<[email protected]> wrote:
>
> On Sun, 24 Mar 2024 at 04:53, Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:31 AM Kumar Kartikeya Dwivedi
> > <[email protected]> wrote:
> > >
> > > On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
> > > >
> > > > We need to extend the bpf_timer API, but the way forward relies on kfuncs.
> > > > So make bpf_timer known for kfuncs from the verifier PoV
> > > >
> > > > Signed-off-by: Benjamin Tissoires <[email protected]>
> > > >
> > > > [...]
> > > > @@ -12021,6 +12034,16 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > > if (ret)
> > > > return ret;
> > > > break;
> > > > + case KF_ARG_PTR_TO_TIMER:
> > > > + if (reg->type != PTR_TO_MAP_VALUE) {
> > > > + verbose(env, "arg#%d doesn't point to a map value\n", i);
> > > > + return -EINVAL;
> > > > + }
> > > > + if (reg->off) {
> > > > + verbose(env, "arg#%d offset can not be greater than 0\n", i);
> > > > + return -EINVAL;
> > > > + }
> > >
> > > This won't be correct. You don't really check whether the timer exists
> > > at reg->off (and if you did, this would still restrict it to 0 offset,
> > > and not check the variable offset which would be non-zero). What I
> > > would suggest is calling process_timer_func (see how dynptr calls the
> > > same underlying process_dynptr_func to enforce type invariants). This
> > > would allow sharing the same checks and avoid bugs from creeping in.
> > > It does all checks wrt constant/variable offset and looking up the
> > > timer field offset and matching it against the one in the pointer.
> >
> > Observation is correct. The patch is buggy,
> > but the suggestion to follow process_dynptr_func() will lead
> > to unnecessary complexity.
> > dynptr-s are on stack with plenty of extra checks.
>
> The suggestion was to call process_timer_func, not process_dynptr_func.
>
> > In this case bpf_timer is in map_value.
> > Much simpler is to follow KF_ARG_PTR_TO_MAP approach.
>
> What I meant by the example was that dynptr handling does the same
> thing for kfuncs and helpers (using the same function), so timer
> arguments should do the same (i.e. use process_timer_func), which will
> do all checks for constant offset (ensuring var_off is tnum_is_const)
> and match it against btf_record->timer_off.
I don't follow. Please elaborate with a patch.
The var_off and off are part of the bug, but they're not the biggest part of it.
On Sun, 24 Mar 2024 at 05:38, Alexei Starovoitov
<[email protected]> wrote:
>
> On Sat, Mar 23, 2024 at 9:01 PM Kumar Kartikeya Dwivedi
> <[email protected]> wrote:
> >
> > On Sun, 24 Mar 2024 at 04:53, Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 9:31 AM Kumar Kartikeya Dwivedi
> > > <[email protected]> wrote:
> > > >
> > > > On Fri, 22 Mar 2024 at 15:57, Benjamin Tissoires <[email protected]> wrote:
> > > > > [Benjamin's v5 patch quoted in full above, snipped]
> > > >
> > > > This won't be correct. You don't really check whether the timer exists
> > > > at reg->off (and if you did, this would still restrict it to 0 offset,
> > > > and not check the variable offset which would be non-zero). What I
> > > > would suggest is calling process_timer_func (see how dynptr calls the
> > > > same underlying process_dynptr_func to enforce type invariants). This
> > > > would allow sharing the same checks and avoid bugs from creeping in.
> > > > It does all checks wrt constant/variable offset and looking up the
> > > > timer field offset and matching it against the one in the pointer.
> > >
> > > Observation is correct. The patch is buggy,
> > > but the suggestion to follow process_dynptr_func() will lead
> > > to unnecessary complexity.
> > > dynptr-s are on stack with plenty of extra checks.
> >
> > The suggestion was to call process_timer_func, not process_dynptr_func.
> >
> > > In this case bpf_timer is in map_value.
> > > Much simpler is to follow KF_ARG_PTR_TO_MAP approach.
> >
> > What I meant by the example was that dynptr handling does the same
> > thing for kfuncs and helpers (using the same function), so timer
> > arguments should do the same (i.e. use process_timer_func), which will
> > do all checks for constant offset (ensuring var_off is tnum_is_const)
> > and match it against btf_record->timer_off.
>
> I don't follow. Please elaborate with a patch.
> The var_off and off are part of the bug, but they're not the biggest part of it.
Not compile tested.
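Kumar's attached patch isn't shown in this excerpt. Purely as a hypothetical sketch of the direction described above (routing the kfunc timer argument through the same process_timer_func() used by the helper path), the KF_ARG_PTR_TO_TIMER case in check_kfunc_args() might look roughly like this; the meta plumbing is an assumption, since the kfunc path uses bpf_kfunc_call_arg_meta rather than bpf_call_arg_meta:

```
case KF_ARG_PTR_TO_TIMER:
	if (reg->type != PTR_TO_MAP_VALUE) {
		verbose(env, "arg#%d doesn't point to a map value\n", i);
		return -EINVAL;
	}
	/* process_timer_func() already checks that var_off is a
	 * constant and that the resulting offset matches the map's
	 * btf_record timer_off, so the kfunc path would get the
	 * same invariants as the bpf_timer_*() helpers.
	 */
	ret = process_timer_func(env, regno, &timer_arg_meta /* hypothetical */);
	if (ret < 0)
		return ret;
	break;
```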
On Sat, Mar 23, 2024 at 9:57 PM Kumar Kartikeya Dwivedi
<[email protected]> wrote:
> > > [earlier discussion quoted above, snipped]
> >
> > I don't follow. Please elaborate with a patch.
> > The var_off and off are part of the bug, but they're not the biggest part of it.
>
> Not compile tested.
I see. All makes sense to me.
Benjamin,
pls incorporate it in your set.
On Fri, Mar 22, 2024 at 7:56 AM Benjamin Tissoires <[email protected]> wrote:
>
> They are implemented as a workqueue, which means that there are no
> guarantees of timing nor ordering.
>
> Signed-off-by: Benjamin Tissoires <[email protected]>
>
> ---
>
> no changes in v5
>
> changes in v4:
> - dropped __bpf_timer_compute_key()
> - use a spin_lock instead of a semaphore
> - ensure bpf_timer_cancel_and_free is not complaining about
> non sleepable context and use cancel_work() instead of
> cancel_work_sync()
> - return -EINVAL if a delay is given to bpf_timer_start() with
> BPF_F_TIMER_SLEEPABLE
>
> changes in v3:
> - extracted the implementation in bpf_timer only, without
> bpf_timer_set_sleepable_cb()
> - rely on schedule_work() only, from bpf_timer_start()
> - add semaphore to ensure bpf_timer_work_cb() is accessing
> consistent data
>
> changes in v2 (compared to the one attaches to v1 0/9):
> - make use of a kfunc
> - add a (non-used) BPF_F_TIMER_SLEEPABLE
> - the callback is *not* called, it makes the kernel crashes
> ---
> include/uapi/linux/bpf.h | 4 +++
> kernel/bpf/helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3c42b9f1bada..b90def29d796 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7461,10 +7461,14 @@ struct bpf_core_relo {
> * - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
> * relative to current time.
> * - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
> + * - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
> + * no guarantees of ordering nor timing (consider this as being just
> + * offloaded immediately).
> */
> enum {
> BPF_F_TIMER_ABS = (1ULL << 0),
> BPF_F_TIMER_CPU_PIN = (1ULL << 1),
> + BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
> };
>
> /* BPF numbers iterator state */
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index a89587859571..38de73a9df83 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1094,14 +1094,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
> * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
> * Inner maps can contain bpf timers as well. ops->map_release_uref is
> * freeing the timers when inner map is replaced or deleted by user space.
> + *
> + * sleepable_lock protects only the setup of the workqueue, not the callback
> + * itself. This is done to ensure we don't run concurrently a free of the
> + * callback or the associated program.
I recall there was a discussion about this lock earlier,
but I don't remember what the conclusion was.
The above comment is not enough to understand what it protects.
In general, how is a sleepable cb fundamentally different
from a non-sleepable one when it comes to races?
bpf_timer_set_callback() is racy for both sleepable and non-sleepable
callbacks, and the latter handles it fine.
Note that struct bpf_hrtimer is rcu protected.
See kfree_rcu(t, rcu); in bpf_timer_cancel_and_free().
> */
> struct bpf_hrtimer {
> struct hrtimer timer;
> + struct work_struct work;
> struct bpf_map *map;
> struct bpf_prog *prog;
> void __rcu *callback_fn;
> void *value;
> struct rcu_head rcu;
> + spinlock_t sleepable_lock;
> };
>
> /* the actual struct hidden inside uapi struct bpf_timer */
> @@ -1114,6 +1120,49 @@ struct bpf_timer_kern {
> struct bpf_spin_lock lock;
> } __attribute__((aligned(8)));
>
> +static void bpf_timer_work_cb(struct work_struct *work)
> +{
> + struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
> + struct bpf_map *map = t->map;
> + bpf_callback_t callback_fn;
> + void *value = t->value;
> + unsigned long flags;
> + void *key;
> + u32 idx;
> +
> + BTF_TYPE_EMIT(struct bpf_timer);
> +
> + spin_lock_irqsave(&t->sleepable_lock, flags);
> +
> + callback_fn = READ_ONCE(t->callback_fn);
> + if (!callback_fn) {
> + spin_unlock_irqrestore(&t->sleepable_lock, flags);
> + return;
> + }
> +
> + if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> + /* compute the key */
> + idx = ((char *)value - array->value) / array->elem_size;
> + key = &idx;
> + } else { /* hash or lru */
> + key = value - round_up(map->key_size, 8);
> + }
> +
> + /* prevent the callback to be freed by bpf_timer_cancel() while running
> + * so we can release the sleepable lock
> + */
> + bpf_prog_inc(t->prog);
> +
> + spin_unlock_irqrestore(&t->sleepable_lock, flags);
Why prog_inc?
The sleepable progs need rcu_read_lock_trace() + migrate_disable()
anyway, which are missing here.
Probably best to call __bpf_prog_enter_sleepable_recur()
like kern_sys_bpf() does.
Now with that, the bpf_timer_cancel() can drop prog refcnt to zero
and it's ok, since rcu_read_lock_trace() will protect it.
> +
> + callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
> + /* The verifier checked that return value is zero. */
the prog will finish and will be freed after rcu_read_unlock_trace().
Seems fine to me. No need for inc/dec refcnt.
> +
> + bpf_prog_put(t->prog);
> +}
> +
> static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
>
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> @@ -1192,6 +1241,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> t->prog = NULL;
> rcu_assign_pointer(t->callback_fn, NULL);
> hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> + INIT_WORK(&t->work, bpf_timer_work_cb);
> + spin_lock_init(&t->sleepable_lock);
> t->timer.function = bpf_timer_cb;
> WRITE_ONCE(timer->timer, t);
> /* Guarantee the order between timer->timer and map->usercnt. So
> @@ -1237,6 +1288,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> ret = -EINVAL;
> goto out;
> }
> + spin_lock(&t->sleepable_lock);
> if (!atomic64_read(&t->map->usercnt)) {
> /* maps with timers must be either held by user space
> * or pinned in bpffs. Otherwise timer might still be
> @@ -1263,6 +1315,8 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> }
> rcu_assign_pointer(t->callback_fn, callback_fn);
> out:
> + if (t)
> + spin_unlock(&t->sleepable_lock);
> __bpf_spin_unlock_irqrestore(&timer->lock);
If a lock is really needed, why can't timer->lock be reused?
The pattern of two locks in pretty much the same data structure
begs the question of what is going on here.
> return ret;
> }
> @@ -1283,8 +1337,12 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
>
> if (in_nmi())
> return -EOPNOTSUPP;
> - if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
> + if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
> return -EINVAL;
> +
> + if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
> + return -EINVAL;
> +
> __bpf_spin_lock_irqsave(&timer->lock);
> t = timer->timer;
> if (!t || !t->prog) {
> @@ -1300,7 +1358,10 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> if (flags & BPF_F_TIMER_CPU_PIN)
> mode |= HRTIMER_MODE_PINNED;
>
> - hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> + if (flags & BPF_F_TIMER_SLEEPABLE)
> + schedule_work(&t->work);
> + else
> + hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> out:
> __bpf_spin_unlock_irqrestore(&timer->lock);
> return ret;
> @@ -1348,13 +1409,22 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> ret = -EDEADLK;
> goto out;
> }
> + spin_lock(&t->sleepable_lock);
> drop_prog_refcnt(t);
> + spin_unlock(&t->sleepable_lock);
this also looks odd.
> out:
> __bpf_spin_unlock_irqrestore(&timer->lock);
> /* Cancel the timer and wait for associated callback to finish
> * if it was running.
> */
> ret = ret ?: hrtimer_cancel(&t->timer);
> +
> + /* also cancel the sleepable work, but *do not* wait for
> + * it to finish if it was running as we might not be in a
> + * sleepable context
> + */
> + ret = ret ?: cancel_work(&t->work);
> +
> rcu_read_unlock();
> return ret;
> }
> @@ -1383,11 +1453,13 @@ void bpf_timer_cancel_and_free(void *val)
> t = timer->timer;
> if (!t)
> goto out;
> + spin_lock(&t->sleepable_lock);
> drop_prog_refcnt(t);
> /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> * this timer, since it won't be initialized.
> */
> WRITE_ONCE(timer->timer, NULL);
> + spin_unlock(&t->sleepable_lock);
This one I don't understand either.
pw-bot: cr
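Putting these review suggestions together (drop the prog refcounting and rely on trace-RCU plus the recursion-guarded sleepable entry helpers, as kern_sys_bpf() does), a rough sketch of the reworked callback could be the following. It is not compile-tested, and the interplay between the explicit rcu_read_lock_trace() and what __bpf_prog_enter_sleepable_recur() already takes internally would need checking:

```
static void bpf_timer_work_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
	struct bpf_tramp_run_ctx run_ctx = {};
	bpf_callback_t callback_fn;
	void *key;	/* computed from t->map/t->value as in the patch above */
	u64 start;

	/* trace-RCU keeps t->prog and the callback alive even if
	 * bpf_timer_cancel() drops the prog refcount to zero while we
	 * are running, so no bpf_prog_inc()/bpf_prog_put() is needed
	 */
	rcu_read_lock_trace();
	callback_fn = READ_ONCE(t->callback_fn);
	if (callback_fn) {
		start = __bpf_prog_enter_sleepable_recur(t->prog, &run_ctx);
		if (start)	/* zero means recursion was detected */
			callback_fn((u64)(long)t->map, (u64)(long)key,
				    (u64)(long)t->value, 0, 0);
		__bpf_prog_exit_sleepable_recur(t->prog, start, &run_ctx);
	}
	rcu_read_unlock_trace();
}
```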
On Mar 24 2024, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2024 at 9:57 PM Kumar Kartikeya Dwivedi
> <[email protected]> wrote:
>
> > > [earlier discussion quoted above, snipped]
> >
> > Not compile tested.
Compiles just fine :)
>
> I see. All makes sense to me.
>
> Benjamin,
> pls incorporate it in your set.
>
OK, done!
I just had to revert to the following or KF_ARG_TIMER_ID was not
recognized by the verifier:
---
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ee20e9d14bd..a5e147468ac8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10848,7 +10848,7 @@ BTF_ID(struct, bpf_list_head)
BTF_ID(struct, bpf_list_node)
BTF_ID(struct, bpf_rb_root)
BTF_ID(struct, bpf_rb_node)
-BTF_ID(struct, bpf_timer)
+BTF_ID(struct, bpf_timer_kern)
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
const struct btf_param *arg, int type)
---
Cheers,
Benjamin
On Mon, Mar 25, 2024 at 1:50 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Fri, Mar 22, 2024 at 7:56 AM Benjamin Tissoires <[email protected]> wrote:
> >
> > They are implemented as a workqueue, which means that there are no
> > guarantees of timing nor ordering.
> >
> > Signed-off-by: Benjamin Tissoires <[email protected]>
> >
> > [changelog, diffstat and uapi hunk quoted in full above, snipped]
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a89587859571..38de73a9df83 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1094,14 +1094,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
> > * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
> > * Inner maps can contain bpf timers as well. ops->map_release_uref is
> > * freeing the timers when inner map is replaced or deleted by user space.
> > + *
> > + * sleepable_lock protects only the setup of the workqueue, not the callback
> > + * itself. This is done to ensure we don't run concurrently a free of the
> > + * callback or the associated program.
>
> I recall there was a discussion about this lock earlier,
> but I don't remember what the conclusion was.
There wasn't much of a conclusion TBH.
> The above comment is not enough to understand what it protects.
>
> In general, how is a sleepable cb fundamentally different
> from a non-sleepable one when it comes to races?
I think there are 2 main differences:
- it's sleepable, so classic RCU locking doesn't work (I didn't know
about rcu_read_lock_trace() up to now)
- when we cancel(_and_free) the program, we cannot afford to wait for
the program to finish, because that program might take ages to do so.
OTOH, hrtimer callbacks run in soft IRQ context with IRQs disabled,
so nothing can interrupt them AFAICT. We can also wait for the timer cb
to finish in that case because it can't sleep.
>
> bpf_timer_set_callback() is racy for both sleepable and non-sleepable
> callbacks, and the latter handles it fine.
I don't think bpf_timer_set_callback() is the problem: in both cases
(sleepable or not) we are under the spinlock from bpf_timer_kern, so
the race is cut short there.
>
> Note that struct bpf_hrtimer is rcu protected.
> See kfree_rcu(t, rcu); in bpf_timer_cancel_and_free().
Sorry, RCU is still hard for me to grasp, and even when I think I
get it, I still don't see how it would be sufficient in the sleepable
bpf_timer_work_cb() without any lock protecting the very first access
to the struct bpf_hrtimer.
>
> > */
> > struct bpf_hrtimer {
> > struct hrtimer timer;
> > + struct work_struct work;
> > struct bpf_map *map;
> > struct bpf_prog *prog;
> > void __rcu *callback_fn;
> > void *value;
> > struct rcu_head rcu;
> > + spinlock_t sleepable_lock;
> > };
> >
> > /* the actual struct hidden inside uapi struct bpf_timer */
> > @@ -1114,6 +1120,49 @@ struct bpf_timer_kern {
> > struct bpf_spin_lock lock;
> > } __attribute__((aligned(8)));
> >
> > +static void bpf_timer_work_cb(struct work_struct *work)
> > +{
> > + struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
> > + struct bpf_map *map = t->map;
> > + bpf_callback_t callback_fn;
> > + void *value = t->value;
> > + unsigned long flags;
> > + void *key;
> > + u32 idx;
> > +
> > + BTF_TYPE_EMIT(struct bpf_timer);
> > +
> > + spin_lock_irqsave(&t->sleepable_lock, flags);
> > +
> > + callback_fn = READ_ONCE(t->callback_fn);
> > + if (!callback_fn) {
> > + spin_unlock_irqrestore(&t->sleepable_lock, flags);
> > + return;
> > + }
> > +
> > + if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> > + struct bpf_array *array = container_of(map, struct bpf_array, map);
> > +
> > + /* compute the key */
> > + idx = ((char *)value - array->value) / array->elem_size;
> > + key = &idx;
> > + } else { /* hash or lru */
> > + key = value - round_up(map->key_size, 8);
> > + }
> > +
> > + /* prevent the callback to be freed by bpf_timer_cancel() while running
> > + * so we can release the sleepable lock
> > + */
> > + bpf_prog_inc(t->prog);
> > +
> > + spin_unlock_irqrestore(&t->sleepable_lock, flags);
>
> Why prog_inc?
> The sleepable progs need rcu_read_lock_trace() + migrate_disable()
> anyway, which are missing here.
> Probably best to call __bpf_prog_enter_sleepable_recur()
> like kern_sys_bpf() does.
Sounds like a good idea.
But as I was playing with it, I realized that t->prog is not RCU
protected, so I have no guarantees that the value is correct while
calling __bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)...
Should I manually call rcu_read_lock_trace() first, before
__bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)?
>
> Now with that, the bpf_timer_cancel() can drop prog refcnt to zero
> and it's ok, since rcu_read_lock_trace() will protect it.
OK, this is a good step forward, thanks!
>
> > +
> > + callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
> > + /* The verifier checked that return value is zero. */
>
> the prog will finish and will be freed after rcu_read_unlock_trace().
> Seems fine to me. No need for inc/dec refcnt.
Ack
>
> > +
> > + bpf_prog_put(t->prog);
> > +}
> > +
> > static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> >
> > static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> > @@ -1192,6 +1241,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> > t->prog = NULL;
> > rcu_assign_pointer(t->callback_fn, NULL);
> > hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> > + INIT_WORK(&t->work, bpf_timer_work_cb);
> > + spin_lock_init(&t->sleepable_lock);
> > t->timer.function = bpf_timer_cb;
> > WRITE_ONCE(timer->timer, t);
> > /* Guarantee the order between timer->timer and map->usercnt. So
> > @@ -1237,6 +1288,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> > ret = -EINVAL;
> > goto out;
> > }
> > + spin_lock(&t->sleepable_lock);
> > if (!atomic64_read(&t->map->usercnt)) {
> > /* maps with timers must be either held by user space
> > * or pinned in bpffs. Otherwise timer might still be
> > @@ -1263,6 +1315,8 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> > }
> > rcu_assign_pointer(t->callback_fn, callback_fn);
> > out:
> > + if (t)
> > + spin_unlock(&t->sleepable_lock);
> > __bpf_spin_unlock_irqrestore(&timer->lock);
>
> If a lock is really needed, why can't timer->lock be reused?
> The pattern of two locks in pretty much the same data structure
> begs the question of what is going on here.
Agree, but I can't find a way to reuse timer->lock:
- ideally I would add struct work_struct into struct bpf_timer_kern
directly, but there is a warning about the size of bpf_timer_kern
which makes me feel like we cannot extend it
- adding a pointer back from struct bpf_hrtimer to bpf_timer_kern is
not a solution either, as the timer might be freed while we are outside
of the lock in bpf_timer_kern...
Though if I had reliable access from bpf_timer_work_cb() to the
matching bpf_timer_kern, I could take its ->lock while I need to
access ->timer, and then everything would be much easier.
>
> > return ret;
> > }
> > @@ -1283,8 +1337,12 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> >
> > if (in_nmi())
> > return -EOPNOTSUPP;
> > - if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
> > + if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
> > return -EINVAL;
> > +
> > + if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
> > + return -EINVAL;
> > +
> > __bpf_spin_lock_irqsave(&timer->lock);
> > t = timer->timer;
> > if (!t || !t->prog) {
> > @@ -1300,7 +1358,10 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> > if (flags & BPF_F_TIMER_CPU_PIN)
> > mode |= HRTIMER_MODE_PINNED;
> >
> > - hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> > + if (flags & BPF_F_TIMER_SLEEPABLE)
> > + schedule_work(&t->work);
> > + else
> > + hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> > out:
> > __bpf_spin_unlock_irqrestore(&timer->lock);
> > return ret;
> > @@ -1348,13 +1409,22 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> > ret = -EDEADLK;
> > goto out;
> > }
> > + spin_lock(&t->sleepable_lock);
> > drop_prog_refcnt(t);
> > + spin_unlock(&t->sleepable_lock);
>
> this also looks odd.
I basically need to protect "t->prog = NULL;" from happening while
bpf_timer_work_cb is setting up the bpf program to be run.
>
> > out:
> > __bpf_spin_unlock_irqrestore(&timer->lock);
> > /* Cancel the timer and wait for associated callback to finish
> > * if it was running.
> > */
> > ret = ret ?: hrtimer_cancel(&t->timer);
> > +
> > + /* also cancel the sleepable work, but *do not* wait for
> > + * it to finish if it was running as we might not be in a
> > + * sleepable context
> > + */
> > + ret = ret ?: cancel_work(&t->work);
> > +
> > rcu_read_unlock();
> > return ret;
> > }
> > @@ -1383,11 +1453,13 @@ void bpf_timer_cancel_and_free(void *val)
> > t = timer->timer;
> > if (!t)
> > goto out;
> > + spin_lock(&t->sleepable_lock);
> > drop_prog_refcnt(t);
> > /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> > * this timer, since it won't be initialized.
> > */
> > WRITE_ONCE(timer->timer, NULL);
> > + spin_unlock(&t->sleepable_lock);
>
> This one I don't understand either.
Same as above, I do not want t->prog to be set to NULL.
Also, side note: if anyone feels like it would go faster to fix those
races by themselves instead of teaching me how to properly do it, that
is definitely fine by me :)
Cheers,
Benjamin
On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
<[email protected]> wrote:
> > > goto out;
> > > }
> > > + spin_lock(&t->sleepable_lock);
> > > drop_prog_refcnt(t);
> > > + spin_unlock(&t->sleepable_lock);
> >
> > this also looks odd.
>
> I basically need to protect "t->prog = NULL;" from happening while
> bpf_timer_work_cb is setting up the bpf program to be run.
Ok. I think I understand the race you're trying to fix.
bpf_timer_cancel_and_free() does
cancel_work()
and then proceeds with
kfree_rcu(t, rcu);
That's the only race, and these extra locks don't help with it.
The t->prog = NULL is nothing to worry about.
bpf_timer_work_cb() might still see callback_fn == NULL
while it's being set up, and that's ok.
The locks don't help with that either.
I suggest dropping sleepable_lock everywhere.
READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
Add rcu_read_lock_trace() before calling the bpf prog.
The race to fix is the 'cancel_work + kfree_rcu' one above,
since kfree_rcu() might free 'struct bpf_hrtimer *t'
while the work is pending, and the workqueue internal
logic might then use-after-free the struct work_struct.
Even if execution luckily makes it into bpf_timer_work_cb(),
it's too late: the 'struct work_struct *work' argument
might already be freed.
To fix this problem, how about the following:
don't call kfree_rcu and instead queue the work to free it.
After cancel_work(&t->work); the work_struct can be reused.
So set it up to call "freeing callback" and do
schedule_work(&t->work);
There is a big assumption here that new work won't be
executed before cancelled work completes.
Need to check with wq experts.
Another approach is to do something smart with
cancel_work() return code.
If it returns true set a flag inside bpf_hrtimer and
make bpf_timer_work_cb() free(t) after bpf prog finishes.
> Also, side note: if anyone feels like it would go faster to fix those
> races themselves instead of teaching me how to properly do it, that
> is definitely fine by me :)
Most of the time goes into analyzing and thinking :)
Whoever codes it doesn't speed things much.
Pls do another respin if you still have cycles to work on it.
On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> <[email protected]> wrote:
> > > > goto out;
> > > > }
> > > > + spin_lock(&t->sleepable_lock);
> > > > drop_prog_refcnt(t);
> > > > + spin_unlock(&t->sleepable_lock);
> > >
> > > this also looks odd.
> >
> > I basically need to protect "t->prog = NULL;" from happening while
> > bpf_timer_work_cb is setting up the bpf program to be run.
>
> Ok. I think I understand the race you're trying to fix.
> The bpf_timer_cancel_and_free() is doing
> cancel_work()
> and proceeds with
> kfree_rcu(t, rcu);
>
> That's the only race and these extra locks don't help.
>
> The t->prog = NULL is nothing to worry about.
> The bpf_timer_work_cb() might still see callback_fn == NULL
> "when it's being setup" and it's ok.
> These locks don't help that.
>
> I suggest to drop sleepable_lock everywhere.
> READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> Add rcu_read_lock_trace() before calling bpf prog.
>
> The race to fix is above 'cancel_work + kfree_rcu'
> since kfree_rcu might free 'struct bpf_hrtimer *t'
> while the work is pending and work_queue internal
> logic might UAF struct work_struct work.
> By the time it may luckily enter bpf_timer_work_cb() it's too late.
> The argument 'struct work_struct *work' might already be freed.
>
> To fix this problem, how about the following:
> don't call kfree_rcu and instead queue the work to free it.
> After cancel_work(&t->work); the work_struct can be reused.
> So set it up to call "freeing callback" and do
> schedule_work(&t->work);
>
> There is a big assumption here that new work won't be
> executed before cancelled work completes.
> Need to check with wq experts.
>
> Another approach is to do something smart with
> cancel_work() return code.
> If it returns true set a flag inside bpf_hrtimer and
> make bpf_timer_work_cb() free(t) after bpf prog finishes.
Looking through wq code... I think I have to correct myself.
cancel_work and immediate free is probably fine from wq pov.
It has this comment:
worker->current_func(work);
/*
* While we must be careful to not use "work" after this, the trace
* point will only record its address.
*/
trace_workqueue_execute_end(work, worker->current_func);
the bpf_timer_work_cb() might still be running bpf prog.
So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
since kfree_rcu(t, rcu); could have freed it by then.
There is also this code in net/rxrpc/rxperf.c
cancel_work(&call->work);
kfree(call);
So it looks like it's fine to drop sleepable_lock,
add rcu_read_lock_trace() and things should be ok.
On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> > <[email protected]> wrote:
> > > > > goto out;
> > > > > }
> > > > > + spin_lock(&t->sleepable_lock);
> > > > > drop_prog_refcnt(t);
> > > > > + spin_unlock(&t->sleepable_lock);
> > > >
> > > > this also looks odd.
> > >
> > > I basically need to protect "t->prog = NULL;" from happening while
> > > bpf_timer_work_cb is setting up the bpf program to be run.
> >
> > Ok. I think I understand the race you're trying to fix.
> > The bpf_timer_cancel_and_free() is doing
> > cancel_work()
> > and proceeds with
> > kfree_rcu(t, rcu);
> >
> > That's the only race and these extra locks don't help.
> >
> > The t->prog = NULL is nothing to worry about.
> > The bpf_timer_work_cb() might still see callback_fn == NULL
> > "when it's being setup" and it's ok.
> > These locks don't help that.
> >
> > I suggest to drop sleepable_lock everywhere.
> > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > Add rcu_read_lock_trace() before calling bpf prog.
> >
> > The race to fix is above 'cancel_work + kfree_rcu'
> > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > while the work is pending and work_queue internal
> > logic might UAF struct work_struct work.
> > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > The argument 'struct work_struct *work' might already be freed.
> >
> > To fix this problem, how about the following:
> > don't call kfree_rcu and instead queue the work to free it.
> > After cancel_work(&t->work); the work_struct can be reused.
> > So set it up to call "freeing callback" and do
> > schedule_work(&t->work);
> >
> > There is a big assumption here that new work won't be
> > executed before cancelled work completes.
> > Need to check with wq experts.
> >
> > Another approach is to do something smart with
> > cancel_work() return code.
> > If it returns true set a flag inside bpf_hrtimer and
> > make bpf_timer_work_cb() free(t) after bpf prog finishes.
>
> Looking through wq code... I think I have to correct myself.
> cancel_work and immediate free is probably fine from wq pov.
> It has this comment:
> worker->current_func(work);
> /*
> * While we must be careful to not use "work" after this, the trace
> * point will only record its address.
> */
> trace_workqueue_execute_end(work, worker->current_func);
>
> the bpf_timer_work_cb() might still be running bpf prog.
> So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> since kfree_rcu(t, rcu); could have freed it by then.
> There is also this code in net/rxrpc/rxperf.c
> cancel_work(&call->work);
> kfree(call);
Correction to correction.
Above piece in rxrpc is buggy.
The following race is possible:
cpu 0:
  process_one_work()
    set_work_pool_and_clear_pending(work, pool->id, 0);

cpu 1:
  cancel_work()
  kfree_rcu(work)

cpu 0:
  worker->current_func(work);
Here 'work' is a pointer to freed memory.
Though wq code will not be touching it, callback will UAF.
Also what I proposed earlier as:
INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
won't guarantee the ordering.
Since the callback function is different,
find_worker_executing_work() will consider it a separate work item.
Another option is to keep the bpf_timer_work_cb callback
and add a 'bool free_me;' to struct bpf_hrtimer
and let the callback free it.
But it's also racy.
cancel_work() may return false, though worker->current_func(work)
wasn't called yet.
So we cannot set 'free_me' in bpf_timer_cancel_and_free()
in a race-free manner.
After brainstorming with Tejun it seems the best is to use
another work_struct to call a different callback and do
cancel_work_sync() there.
So we need something like:
struct bpf_hrtimer {
union {
struct hrtimer timer;
+ struct work_struct work;
};
struct bpf_map *map;
struct bpf_prog *prog;
void __rcu *callback_fn;
void *value;
union {
struct rcu_head rcu;
+ struct work_struct sync_work;
};
+ u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
};
'work' will be used to call bpf_timer_work_cb.
'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
And, of course,
schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
instead of kfree_rcu.
On Thu, Apr 4, 2024 at 4:44 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> > > <[email protected]> wrote:
> > > > > > goto out;
> > > > > > }
> > > > > > + spin_lock(&t->sleepable_lock);
> > > > > > drop_prog_refcnt(t);
> > > > > > + spin_unlock(&t->sleepable_lock);
> > > > >
> > > > > this also looks odd.
> > > >
> > > > I basically need to protect "t->prog = NULL;" from happening while
> > > > bpf_timer_work_cb is setting up the bpf program to be run.
> > >
> > > Ok. I think I understand the race you're trying to fix.
> > > The bpf_timer_cancel_and_free() is doing
> > > cancel_work()
> > > and proceeds with
> > > kfree_rcu(t, rcu);
> > >
> > > That's the only race and these extra locks don't help.
Thanks a lot for pinpointing the location of the race. Indeed, when I
read your email this morning I said "of course, this was obvious" :(
>
> > >
> > > The t->prog = NULL is nothing to worry about.
> > > The bpf_timer_work_cb() might still see callback_fn == NULL
> > > "when it's being setup" and it's ok.
> > > These locks don't help that.
> > >
> > > I suggest to drop sleepable_lock everywhere.
> > > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > > Add rcu_read_lock_trace() before calling bpf prog.
> > >
> > > The race to fix is above 'cancel_work + kfree_rcu'
> > > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > > while the work is pending and work_queue internal
> > > logic might UAF struct work_struct work.
> > > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > > The argument 'struct work_struct *work' might already be freed.
> > >
> > > To fix this problem, how about the following:
> > > don't call kfree_rcu and instead queue the work to free it.
> > > After cancel_work(&t->work); the work_struct can be reused.
> > > So set it up to call "freeing callback" and do
> > > schedule_work(&t->work);
> > >
> > > There is a big assumption here that new work won't be
> > > executed before cancelled work completes.
> > > Need to check with wq experts.
> > >
> > > Another approach is to do something smart with
> > > cancel_work() return code.
> > > If it returns true set a flag inside bpf_hrtimer and
> > > make bpf_timer_work_cb() free(t) after bpf prog finishes.
> >
> > Looking through wq code... I think I have to correct myself.
> > cancel_work and immediate free is probably fine from wq pov.
> > It has this comment:
> > worker->current_func(work);
> > /*
> > * While we must be careful to not use "work" after this, the trace
> > * point will only record its address.
> > */
> > trace_workqueue_execute_end(work, worker->current_func);
> >
> > the bpf_timer_work_cb() might still be running bpf prog.
> > So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> > since kfree_rcu(t, rcu); could have freed it by then.
> > There is also this code in net/rxrpc/rxperf.c
> > cancel_work(&call->work);
> > kfree(call);
>
> Correction to correction.
> Above piece in rxrpc is buggy.
> The following race is possible:
> cpu 0
> process_one_work()
> set_work_pool_and_clear_pending(work, pool->id, 0);
>
> cpu 1
> cancel_work()
> kfree_rcu(work)
>
> worker->current_func(work);
>
> Here 'work' is a pointer to freed memory.
> Though wq code will not be touching it, callback will UAF.
>
> Also what I proposed earlier as:
> INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
> won't guarantee the ordering.
> Since the callback function is different,
> find_worker_executing_work() will consider it a separate work item.
>
> Another option is to keep the bpf_timer_work_cb callback
> and add a 'bool free_me;' to struct bpf_hrtimer
> and let the callback free it.
> But it's also racy.
> cancel_work() may return false, though worker->current_func(work)
> wasn't called yet.
> So we cannot set 'free_me' in bpf_timer_cancel_and_free()
> in a race-free manner.
>
> After brainstorming with Tejun it seems the best is to use
> another work_struct to call a different callback and do
> cancel_work_sync() there.
Works for me. I should be able to spin a v6 soon enough, but I have a
couple of remaining questions below:
>
> So we need something like:
>
> struct bpf_hrtimer {
> union {
> struct hrtimer timer;
> + struct work_struct work;
> };
> struct bpf_map *map;
> struct bpf_prog *prog;
> void __rcu *callback_fn;
> void *value;
> union {
Are you sure we need an union here? If we get to call kfree_rcu() we
need to have both struct rcu_head and sync_work, not one or the other.
> struct rcu_head rcu;
> + struct work_struct sync_work;
> };
> + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
(like in my v2 or v3 IIRC). But that means that once a timer is
initialized it needs to be of one or the other type (especially true
with the first union in this struct).
So should we reject during run time bpf_timer_set_callback() for
sleepable timers and only allow bpf_timer_set_sleepable_cb() for
those? (and the invert in the other case).
This version of the patch allows for one timer to be used as softIRQ
or WQ, depending on the timer_set_callback that is used. But it might
be simpler for the kfree_rcu race to define the bpf_timer to be of one
kind, so we are sure to call the correct kfree method.
> };
>
> 'work' will be used to call bpf_timer_work_cb.
> 'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
>
> And, of course,
> schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
> instead of kfree_rcu.
>
Cheers,
Benjamin
On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
<[email protected]> wrote:
>
>
> >
> > So we need something like:
> >
> > struct bpf_hrtimer {
> > union {
> > struct hrtimer timer;
> > + struct work_struct work;
> > };
> > struct bpf_map *map;
> > struct bpf_prog *prog;
> > void __rcu *callback_fn;
> > void *value;
> > union {
>
> Are you sure we need an union here? If we get to call kfree_rcu() we
> need to have both struct rcu_head and sync_work, not one or the other.
why? with an extra flag it's one or the other.
In bpf_timer_cancel_and_free()
if (flag & SLEEPABLE) {
schedule_work() to cancel_work_sync + kfree_rcu
} else {
hrtimer_cancel
kfree_rcu
}
> > struct rcu_head rcu;
> > + struct work_struct sync_work;
> > };
> > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
>
> If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> (like in my v2 or v3 IIRC). But that means that once a timer is
> initialized it needs to be of one or the other type (especially true
> with the first union in this struct).
yes. That's an idea.
The code to support wq vs timer seems to be diverging more
than what we expected initially.
It seems cleaner to set it as init time and enforce in
other helpers.
Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
too far.
It's already at 112 bytes and some people use bpf_timer per flow.
So potentially millions of such timers.
Adding extra sizeof(struct work_struct)=32 * 2 that won't be
used is too much.
Note that sizeof(struct hrtimer)=64, so unions make everything
fit nicely.
> So should we reject during run time bpf_timer_set_callback() for
> sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> those? (and the invert in the other case).
yes.
> This version of the patch allows for one timer to be used as softIRQ
> or WQ, depending on the timer_set_callback that is used. But it might
> be simpler for the kfree_rcu race to define the bpf_timer to be of one
> kind, so we are sure to call the correct kfree method.
I think one or another simplifies the code and makes it easier
to think through combinations.
I'm still contemplating adding new "struct bpf_wq" and new kfuncs
to completely separate wq vs timer.
The code reuse seems to be relatively small.
We can potentially factor out internals of bpf_timer_* into smaller
helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
One more thing.
bpf_timer_cancel() api turned out to be troublesome.
Not only it cancels the timer, but drops callback too.
It was a surprising behavior for people familiar with
kernel timer api-s.
We should not repeat this mistake with wq.
We can try to fix bpf_timer_cancel() too.
If we drop drop_prog_refcnt() from it it shouldn't affect
existing bpf_timer users who are forced to do:
bpf_timer_cancel()
bpf_timer_set_callback()
bpf_timer_start()
all the time.
If/when bpf_timer_cancel() stops dropping the callback
such bpf prog won't be affected. So low chance of breaking any prog.
wdyt?
On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> <[email protected]> wrote:
> >
> >
> > >
> > > So we need something like:
> > >
> > > struct bpf_hrtimer {
> > > union {
> > > struct hrtimer timer;
> > > + struct work_struct work;
> > > };
> > > struct bpf_map *map;
> > > struct bpf_prog *prog;
> > > void __rcu *callback_fn;
> > > void *value;
> > > union {
> >
> > Are you sure we need an union here? If we get to call kfree_rcu() we
> > need to have both struct rcu_head and sync_work, not one or the other.
>
> why? with an extra flag it's one or the other.
> In bpf_timer_cancel_and_free()
> if (flag & SLEEPABLE) {
> schedule_work() to cancel_work_sync + kfree_rcu
> } else {
> hrtimer_cancel
> kfree_rcu
> }
I thought kfree_rcu required struct rcu_head, and given that we need
to initialize sync_work it will be poisoned...
>
> > > struct rcu_head rcu;
> > > + struct work_struct sync_work;
> > > };
> > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> >
> > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > (like in my v2 or v3 IIRC). But that means that once a timer is
> > initialized it needs to be of one or the other type (especially true
> > with the first union in this struct).
>
> yes. That's an idea.
> The code to support wq vs timer seems to be diverging more
> than what we expected initially.
> It seems cleaner to set it as init time and enforce in
> other helpers.
OK, works for me.
>
> Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> too far.
> It's already at 112 bytes and some people use bpf_timer per flow.
> So potentially millions of such timers.
> Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> used is too much.
> Note that sizeof(struct hrtimer)=64, so unions make everything
> fit nicely.
Maybe we should do
union {
struct hrtimer timer;
struct {
struct work_struct work;
struct work_struct sync_work;
}
}
(not nice to read but at least we don't change the size at the beginning)
>
> > So should we reject during run time bpf_timer_set_callback() for
> > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > those? (and the invert in the other case).
>
> yes.
>
> > This version of the patch allows for one timer to be used as softIRQ
> > or WQ, depending on the timer_set_callback that is used. But it might
> > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > kind, so we are sure to call the correct kfree method.
>
> I think one or another simplifies the code and makes it easier
> to think through combinations.
>
> I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> to completely separate wq vs timer.
> The code reuse seems to be relatively small.
There is some code reuse in the verifier, but it can be factored out I think.
Though the biggest reuse might be in the map portion of bpf_timer,
which I haven't looked much TBH.
> We can potentially factor out internals of bpf_timer_* into smaller
> helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
Yeah, also, given that we are going to enforce delay == 0 for
sleepable timers (wq), the user api would be much cleaner if we can
have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
easier to deal with).
>
> One more thing.
> bpf_timer_cancel() api turned out to be troublesome.
> Not only it cancels the timer, but drops callback too.
> It was a surprising behavior for people familiar with
> kernel timer api-s.
> We should not repeat this mistake with wq.
>
> We can try to fix bpf_timer_cancel() too.
> If we drop drop_prog_refcnt() from it it shouldn't affect
> existing bpf_timer users who are forced to do:
> bpf_timer_cancel()
> bpf_timer_set_callback()
> bpf_timer_start()
> all the time.
> If/when bpf_timer_cancel() stops dropping the callback
> such bpf prog won't be affected. So low chance of breaking any prog.
> wdyt?
>
How would a program know that set_callback() is not required after a
cancel() because the kernel kept the callback around? It seems that it's
going to be hard to know that (short of first trying a start()),
and it will add more code.
timer_cancel() would be hard to change, but we can always make the change
and add a new kfunc timer_cancel_no_drop() which would clearly let
new programs know that set_callback() is not required to be
called. In a few kernel releases we could remove it and say that
timer_cancel() is the same (and replace it with a #define)
Anyway, the more I think of it, the more I think the best API would be
a dedicated wq API. It's probably going to need a little bit more
work, but it'll be more or less this work plus the new bpf_wq type in
the map.
Cheers,
Benjamin
On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires
<[email protected]> wrote:
>
> On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> > <[email protected]> wrote:
> > >
> > >
> > > >
> > > > So we need something like:
> > > >
> > > > struct bpf_hrtimer {
> > > > union {
> > > > struct hrtimer timer;
> > > > + struct work_struct work;
> > > > };
> > > > struct bpf_map *map;
> > > > struct bpf_prog *prog;
> > > > void __rcu *callback_fn;
> > > > void *value;
> > > > union {
> > >
> > > Are you sure we need an union here? If we get to call kfree_rcu() we
> > > need to have both struct rcu_head and sync_work, not one or the other.
> >
> > why? with an extra flag it's one or the other.
> > In bpf_timer_cancel_and_free()
> > if (flag & SLEEPABLE) {
> > schedule_work() to cancel_work_sync + kfree_rcu
> > } else {
> > hrtimer_cancel
> > kfree_rcu
> > }
>
> I thought kfree_rcu required struct rcu_head, and given that we need
> to initialize sync_work it will be poisoned...
yes. It needs rcu_head.
But where do you see a conflict?
INIT_WORK + schedule_work() will use that space,
then cancel_work_sync() will wait on a different work_struct,
then kfree_rcu() will reuse that space.
In case of hrtimers none of the work_structs will be used.
>
> >
> > > > struct rcu_head rcu;
> > > > + struct work_struct sync_work;
> > > > };
> > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> > >
> > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > > (like in my v2 or v3 IIRC). But that means that once a timer is
> > > initialized it needs to be of one or the other type (especially true
> > > with the first union in this struct).
> >
> > yes. That's an idea.
> > The code to support wq vs timer seems to be diverging more
> > than what we expected initially.
> > It seems cleaner to set it as init time and enforce in
> > other helpers.
>
> OK, works for me.
>
> >
> > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> > too far.
> > It's already at 112 bytes and some people use bpf_timer per flow.
> > So potentially millions of such timers.
> > Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> > used is too much.
> > Note that sizeof(struct hrtimer)=64, so unions make everything
> > fit nicely.
>
> Maybe we should do
> union {
> struct hrtimer timer;
> struct {
> struct work_struct work;
> struct work_struct sync_work;
> }
> }
It's also ok, but sharing rcu_head and work_struct seems
cleaner, since it highlights that they're exclusive.
> (not nice to read but at least we don't change the size at the beginning)
>
> >
> > > So should we reject during run time bpf_timer_set_callback() for
> > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > > those? (and the invert in the other case).
> >
> > yes.
> >
> > > This version of the patch allows for one timer to be used as softIRQ
> > > or WQ, depending on the timer_set_callback that is used. But it might
> > > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > > kind, so we are sure to call the correct kfree method.
> >
> > I think one or another simplifies the code and makes it easier
> > to think through combinations.
> >
> > I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> > to completely separate wq vs timer.
> > The code reuse seems to be relatively small.
>
> There is some code reuse in the verifier, but it can be factored out I think.
>
> Though the biggest reuse might be in the map portion of bpf_timer,
> which I haven't looked much TBH.
Right. It's all the 'case BPF_TIMER:' in various places.
New 'struct bpf_wq' would need another entry in btf_field_type.
But that should be a straightforward addition.
>
> > We can potentially factor out internals of bpf_timer_* into smaller
> > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
>
> Yeah, also, given that we are going to enforce delay == 0 for
> sleepable timers (wq), the user api would be much cleaner if we can
> have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
> easier to deal with).
It seems so.
Kinda hard to judge one way or the other without looking at
the final code, but it seems separation is worth attempting, at least.
Also if we ever do hrtimer+wq we probably will be using
'struct delayed_work' instead of rolling our own
'struct hrtimer' + 'struct work_struct' combo.
It seems wq logic already made such a combination special enough
and thought through the races, so we better just follow that path.
In that case it might be yet another 'struct bpf_delayed_wq'
and another set of kfuncs.
Considering that cancel_work() and cancel_delayed_work()
are separate api in the kernel.
Multiplexing all of them under bpf_timer_cancel()
seems wrong.
In the past we were somewhat limited in terms of helpers.
We tried not to add them unless absolutely necessary because
of uapi considerations.
Now with kfuncs we can add/tweak/remove them at will.
>
> >
> > One more thing.
> > bpf_timer_cancel() api turned out to be troublesome.
> > Not only it cancels the timer, but drops callback too.
> > It was a surprising behavior for people familiar with
> > kernel timer api-s.
> > We should not repeat this mistake with wq.
> >
> > We can try to fix bpf_timer_cancel() too.
> > If we drop drop_prog_refcnt() from it it shouldn't affect
> > existing bpf_timer users who are forced to do:
> > bpf_timer_cancel()
> > bpf_timer_set_callback()
> > bpf_timer_start()
> > all the time.
> > If/when bpf_timer_cancel() stops dropping the callback
> > such bpf prog won't be affected. So low chance of breaking any prog.
> > wdyt?
> >
>
> How would a program know set_callback() is not required after a
> cancel() because the kernel kept it around? It seems that it's going
> to be hard for them to know that (unless by trying first a start()),
> and it will add more code.
>
> timer_cancel() would be hard to change but we can always do the change
> and add a new kfunc timer_cancel_no_drop() which would clearly allow
that works too.
> for new programs to know that set_callback() is not required to be
> called. In a few kernel releases we could remove it and say that
> timer_cancel() is the same (and replaced by a #define)
#define won't work, since the mechanics of detecting and calling
helpers vs kfuncs are quite different.
> Anyway, the more I think of it, the more I think the best API would be
> a dedicated wq API. It's probably going to need a little bit more
> work, but it'll be more or less this work plus the new bpf_wq type in
> the map.
It seems to me as well.
Thanks for brainstorming.
On Thu, Apr 4, 2024 at 8:29 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires
> <[email protected]> wrote:
> >
> > On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> > > <[email protected]> wrote:
> > > >
> > > >
> > > > >
> > > > > So we need something like:
> > > > >
> > > > > struct bpf_hrtimer {
> > > > > union {
> > > > > struct hrtimer timer;
> > > > > + struct work_struct work;
> > > > > };
> > > > > struct bpf_map *map;
> > > > > struct bpf_prog *prog;
> > > > > void __rcu *callback_fn;
> > > > > void *value;
> > > > > union {
> > > >
> > > > Are you sure we need an union here? If we get to call kfree_rcu() we
> > > > need to have both struct rcu_head and sync_work, not one or the other.
> > >
> > > why? with an extra flag it's one or the other.
> > > In bpf_timer_cancel_and_free()
> > > if (flag & SLEEPABLE) {
> > > schedule_work() to cancel_work_sync + kfree_rcu
> > > } else {
> > > hrtimer_cancel
> > > kfree_rcu
> > > }
> >
> > I thought kfree_rcu required struct rcu_head, and given that we need
> > to initialize sync_work it will be poisoned...
>
> yes. It needs rcu_head.
> But where do you see a conflict?
> INIT_WORK + schedule_work() will use that space,
> then cancel_work_sync() will wait on a different work_struct,
> then kfree_rcu() will reuse that space.
Yeah, sorry, I hadn't realized that the memory used by kfree_rcu
doesn't need to be initialized.
>
> In case of hrtimers none of the work_structs will be used.
>
> >
> > >
> > > > > struct rcu_head rcu;
> > > > > + struct work_struct sync_work;
> > > > > };
> > > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> > > >
> > > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > > > (like in my v2 or v3 IIRC). But that means that once a timer is
> > > > initialized it needs to be of one or the other type (especially true
> > > > with the first union in this struct).
> > >
> > > yes. That's an idea.
> > > The code to support wq vs timer seems to be diverging more
> > > than what we expected initially.
> > > It seems cleaner to set it as init time and enforce in
> > > other helpers.
> >
> > OK, works for me.
> >
> > >
> > > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> > > too far.
> > > It's already at 112 bytes and some people use bpf_timer per flow.
> > > So potentially millions of such timers.
> > > Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> > > used is too much.
> > > Note that sizeof(struct hrtimer)=64, so unions make everything
> > > fit nicely.
> >
> > Maybe we should do
> > union {
> > struct hrtimer timer;
> > struct {
> > struct work_struct work;
> > struct work_struct sync_work;
> > }
> > }
>
> It's also ok, but sharing rcu_head and work_struct seems
> cleaner, since it highlights that they're exclusive.
>
> > (not nice to read but at least we don't change the size at the beginning)
> >
> > >
> > > > So should we reject during run time bpf_timer_set_callback() for
> > > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > > > those? (and the inverse in the other case).
> > >
> > > yes.
> > >
> > > > This version of the patch allows one timer to be used either in
> > > > softIRQ or in a WQ, depending on which timer_set_callback() is
> > > > used. But it might be simpler for the kfree_rcu race to define
> > > > the bpf_timer to be of one kind, so we are sure to call the
> > > > correct kfree method.
> > >
> > > I think one or the other simplifies the code and makes it easier
> > > to think through the combinations.
> > >
> > > I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> > > to completely separate wq vs timer.
> > > The code reuse seems to be relatively small.
> >
> > There is some code reuse in the verifier, but it can be factored out I think.
> >
> > Though the biggest reuse might be in the map portion of bpf_timer,
> > which I haven't looked at much, TBH.
>
> Right. It's all the 'case BPF_TIMER:' in various places.
> New 'struct bpf_wq' would need another entry in btf_field_type.
> But that should be a straightforward addition.
>
> >
> > > We can potentially factor out internals of bpf_timer_* into smaller
> > > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
> >
> > Yeah, also, given that we are going to enforce delay == 0 for
> > sleepable timers (wq), the user api would be much cleaner if we can
> > have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
> > easier to deal with).
>
> It seems so.
> Kinda hard to judge one way or the other without looking at
> the final code, but it seems separation is worth attempting, at least.
>
> Also if we ever do hrtimer+wq we probably will be using
> 'struct delayed_work' instead of rolling our own
> 'struct hrtimer' + 'struct work_struct' combo.
>
> It seems the wq logic already treats such a combination as special
> enough and has thought through the races, so we better just follow that path.
> In that case it might be yet another 'struct bpf_delayed_wq'
> and another set of kfuncs.
> Considering that cancel_work() and cancel_delayed_work() are
> separate APIs in the kernel, multiplexing all of them under
> bpf_timer_cancel() seems wrong.
> In the past we were somewhat limited in terms of helpers.
> We tried not to add them unless absolutely necessary because
> of uapi considerations.
> Now with kfuncs we can add/tweak/remove them at will.
>
> >
> > >
> > > One more thing.
> > > bpf_timer_cancel() api turned out to be troublesome.
> > > Not only does it cancel the timer, it drops the callback too.
> > > It was a surprising behavior for people familiar with
> > > kernel timer api-s.
> > > We should not repeat this mistake with wq.
> > >
> > > We can try to fix bpf_timer_cancel() too.
> > > If we drop drop_prog_refcnt() from it, it shouldn't affect
> > > existing bpf_timer users who are forced to do:
> > > bpf_timer_cancel()
> > > bpf_timer_set_callback()
> > > bpf_timer_start()
> > > all the time.
> > > If/when bpf_timer_cancel() stops dropping the callback,
> > > such bpf progs won't be affected. So low chance of breaking any prog.
> > > wdyt?
> > >
> >
> > How would a program know set_callback() is not required after a
> > cancel() because the kernel kept the callback around? It seems that
> > it's going to be hard for it to know that (unless it tries a
> > start() first), and it will add more code.
> >
> > timer_cancel() would be hard to change, but we can always do the
> > change and add a new kfunc timer_cancel_no_drop() which would clearly allow
>
> that works too.
>
> > for new programs to know that set_callback() is not required to be
> > called. In a few kernel releases we could remove it and say that
> > timer_cancel() is the same (and replaced by a #define)
>
> #define won't work, since mechanics of detecting and calling
> helpers vs kfuncs is quite different.
>
> > Anyway, the more I think of it, the more I think the best API would be
> > a dedicated wq API. It's probably going to need a little bit more
> > work, but it'll be more or less this work plus the new bpf_wq type in
> > the map.
>
> It seems to me as well.
>
> Thanks for brainstorming.
>
Alright, as of today (and I'm about to be AFK for the weekend), I got
your changes in and working (I think). I'll review the series on
Monday and send it back so we have a baseline to compare bpf_wq
against.
Cheers,
Benjamin
On Fri, Apr 5, 2024 at 8:46 AM Benjamin Tissoires
<[email protected]> wrote:
>
> Alright, as of today (and I'm about to be AFK for the weekend), I got
> your changes in and working (I think). I'll review the series on
> Monday and send it back so we have a baseline to compare bpf_wq
> against.
Nice! Looking forward to it.