This is a followup of sleepable bpf_timer[0].
When discussing sleepable bpf_timer, it was thought that we should give
a try to bpf_wq, as the 2 APIs are similar but distinct enough to
justify a new one.
So here it is.
I tried to keep as much as possible common code in kernel/bpf/helpers.c
but I couldn't get away with code duplication in kernel/bpf/verifier.c.
This series introduces a basic bpf_wq support:
- creation is supported
- assignment is supported
- running a simple bpf_wq is also supported.
We will probably need to extend the API further with:
- a full delayed_work API (can be piggy backed on top with a correct
flag)
- bpf_wq_cancel() <- apparently not, this is shooting ourself in the
foot
- bpf_wq_cancel_sync() (for sleepable programs)
- documentation
---
For reference, the use cases I have in mind:
---
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out of proximity event, but the device can not
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event in the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example a device might send us an event saying that it came back
from sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raise a bpf event, and let that userspace program inject the events and
commands.
However, we are just keeping that program alive as a daemon for just
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup. So a kernel workqueue seems simpler to handle.
bpf_timers are currently running in a soft IRQ context, this patch
series implements a sleppable context for them.
Cheers,
Benjamin
To: Alexei Starovoitov <[email protected]>
To: Daniel Borkmann <[email protected]>
To: Andrii Nakryiko <[email protected]>
To: Martin KaFai Lau <[email protected]>
To: Eduard Zingerman <[email protected]>
To: Song Liu <[email protected]>
To: Yonghong Song <[email protected]>
To: John Fastabend <[email protected]>
To: KP Singh <[email protected]>
To: Stanislav Fomichev <[email protected]>
To: Hao Luo <[email protected]>
To: Jiri Olsa <[email protected]>
To: Mykola Lysenko <[email protected]>
To: Shuah Khan <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
[0] https://lore.kernel.org/all/[email protected]/
---
Changes in v2:
- took previous review into account
- mainly dropped BPF_F_WQ_SLEEPABLE
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Benjamin Tissoires (16):
bpf: make timer data struct more generic
bpf: replace bpf_timer_init with a generic helper
bpf: replace bpf_timer_set_callback with a generic helper
bpf: replace bpf_timer_cancel_and_free with a generic helper
bpf: add support for bpf_wq user type
tools: sync include/uapi/linux/bpf.h
bpf: verifier: bail out if the argument is not a map
bpf: add support for KF_ARG_PTR_TO_WORKQUEUE
bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps
selftests/bpf: add bpf_wq tests
bpf: wq: add bpf_wq_init
selftests/bpf: wq: add bpf_wq_init() checks
bpf: wq: add bpf_wq_set_callback_impl
selftests/bpf: add checks for bpf_wq_set_callback()
bpf: add bpf_wq_start
selftests/bpf: wq: add bpf_wq_start() checks
include/linux/bpf.h | 13 +-
include/linux/bpf_verifier.h | 1 +
include/uapi/linux/bpf.h | 4 +
kernel/bpf/arraymap.c | 18 +-
kernel/bpf/btf.c | 17 +
kernel/bpf/hashtab.c | 55 +++-
kernel/bpf/helpers.c | 348 ++++++++++++++++-----
kernel/bpf/syscall.c | 15 +-
kernel/bpf/verifier.c | 154 ++++++++-
tools/include/uapi/linux/bpf.h | 4 +
tools/testing/selftests/bpf/bpf_experimental.h | 7 +
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 5 +
.../selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
tools/testing/selftests/bpf/prog_tests/wq.c | 41 +++
tools/testing/selftests/bpf/progs/wq.c | 186 +++++++++++
tools/testing/selftests/bpf/progs/wq_failures.c | 144 +++++++++
16 files changed, 910 insertions(+), 103 deletions(-)
---
base-commit: ffa6b26b4d8a0520b78636ca9373ab842cb3b1a8
change-id: 20240411-bpf_wq-fe24e8d24f5e
Best regards,
--
Benjamin Tissoires <[email protected]>
To be able to add workqueues and reuse most of the timer code, we need
to make bpf_hrtimer more generic.
There is no code change except that the new struct gets a new u64 flags
attribute. We are still below 2 cache lines, so this shouldn't impact
the current running codes.
The ordering is also changed. Everything related to async callback
is now on top of bpf_hrtimer.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
kernel/bpf/helpers.c | 71 ++++++++++++++++++++++++++++------------------------
1 file changed, 38 insertions(+), 33 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 8cde717137bd..5a069c70b5e6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1079,11 +1079,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
.arg5_type = ARG_CONST_SIZE_OR_ZERO,
};
+struct bpf_async_cb {
+ struct bpf_map *map;
+ struct bpf_prog *prog;
+ void __rcu *callback_fn;
+ void *value;
+ struct rcu_head rcu;
+ u64 flags;
+};
+
/* BPF map elements can contain 'struct bpf_timer'.
* Such map owns all of its BPF timers.
* 'struct bpf_timer' is allocated as part of map element allocation
* and it's zero initialized.
- * That space is used to keep 'struct bpf_timer_kern'.
+ * That space is used to keep 'struct bpf_async_kern'.
* bpf_timer_init() allocates 'struct bpf_hrtimer', inits hrtimer, and
* remembers 'struct bpf_map *' pointer it's part of.
* bpf_timer_set_callback() increments prog refcnt and assign bpf callback_fn.
@@ -1096,16 +1105,12 @@ const struct bpf_func_proto bpf_snprintf_proto = {
* freeing the timers when inner map is replaced or deleted by user space.
*/
struct bpf_hrtimer {
+ struct bpf_async_cb cb;
struct hrtimer timer;
- struct bpf_map *map;
- struct bpf_prog *prog;
- void __rcu *callback_fn;
- void *value;
- struct rcu_head rcu;
};
/* the actual struct hidden inside uapi struct bpf_timer */
-struct bpf_timer_kern {
+struct bpf_async_kern {
struct bpf_hrtimer *timer;
/* bpf_spin_lock is used here instead of spinlock_t to make
* sure that it always fits into space reserved by struct bpf_timer
@@ -1119,14 +1124,14 @@ static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
{
struct bpf_hrtimer *t = container_of(hrtimer, struct bpf_hrtimer, timer);
- struct bpf_map *map = t->map;
- void *value = t->value;
+ struct bpf_map *map = t->cb.map;
+ void *value = t->cb.value;
bpf_callback_t callback_fn;
void *key;
u32 idx;
BTF_TYPE_EMIT(struct bpf_timer);
- callback_fn = rcu_dereference_check(t->callback_fn, rcu_read_lock_bh_held());
+ callback_fn = rcu_dereference_check(t->cb.callback_fn, rcu_read_lock_bh_held());
if (!callback_fn)
goto out;
@@ -1155,7 +1160,7 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
return HRTIMER_NORESTART;
}
-BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map,
+BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
u64, flags)
{
clockid_t clockid = flags & (MAX_CLOCKS - 1);
@@ -1163,8 +1168,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
int ret = 0;
BUILD_BUG_ON(MAX_CLOCKS != 16);
- BUILD_BUG_ON(sizeof(struct bpf_timer_kern) > sizeof(struct bpf_timer));
- BUILD_BUG_ON(__alignof__(struct bpf_timer_kern) != __alignof__(struct bpf_timer));
+ BUILD_BUG_ON(sizeof(struct bpf_async_kern) > sizeof(struct bpf_timer));
+ BUILD_BUG_ON(__alignof__(struct bpf_async_kern) != __alignof__(struct bpf_timer));
if (in_nmi())
return -EOPNOTSUPP;
@@ -1187,10 +1192,10 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
ret = -ENOMEM;
goto out;
}
- t->value = (void *)timer - map->record->timer_off;
- t->map = map;
- t->prog = NULL;
- rcu_assign_pointer(t->callback_fn, NULL);
+ t->cb.value = (void *)timer - map->record->timer_off;
+ t->cb.map = map;
+ t->cb.prog = NULL;
+ rcu_assign_pointer(t->cb.callback_fn, NULL);
hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
t->timer.function = bpf_timer_cb;
WRITE_ONCE(timer->timer, t);
@@ -1222,7 +1227,7 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
.arg3_type = ARG_ANYTHING,
};
-BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callback_fn,
+BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
struct bpf_prog_aux *, aux)
{
struct bpf_prog *prev, *prog = aux->prog;
@@ -1237,7 +1242,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
ret = -EINVAL;
goto out;
}
- if (!atomic64_read(&t->map->usercnt)) {
+ if (!atomic64_read(&t->cb.map->usercnt)) {
/* maps with timers must be either held by user space
* or pinned in bpffs. Otherwise timer might still be
* running even when bpf prog is detached and user space
@@ -1246,7 +1251,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
ret = -EPERM;
goto out;
}
- prev = t->prog;
+ prev = t->cb.prog;
if (prev != prog) {
/* Bump prog refcnt once. Every bpf_timer_set_callback()
* can pick different callback_fn-s within the same prog.
@@ -1259,9 +1264,9 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
if (prev)
/* Drop prev prog refcnt when swapping with new prog */
bpf_prog_put(prev);
- t->prog = prog;
+ t->cb.prog = prog;
}
- rcu_assign_pointer(t->callback_fn, callback_fn);
+ rcu_assign_pointer(t->cb.callback_fn, callback_fn);
out:
__bpf_spin_unlock_irqrestore(&timer->lock);
return ret;
@@ -1275,7 +1280,7 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
.arg2_type = ARG_PTR_TO_FUNC,
};
-BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, flags)
+BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
{
struct bpf_hrtimer *t;
int ret = 0;
@@ -1287,7 +1292,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
return -EINVAL;
__bpf_spin_lock_irqsave(&timer->lock);
t = timer->timer;
- if (!t || !t->prog) {
+ if (!t || !t->cb.prog) {
ret = -EINVAL;
goto out;
}
@@ -1315,18 +1320,18 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
.arg3_type = ARG_ANYTHING,
};
-static void drop_prog_refcnt(struct bpf_hrtimer *t)
+static void drop_prog_refcnt(struct bpf_async_cb *async)
{
- struct bpf_prog *prog = t->prog;
+ struct bpf_prog *prog = async->prog;
if (prog) {
bpf_prog_put(prog);
- t->prog = NULL;
- rcu_assign_pointer(t->callback_fn, NULL);
+ async->prog = NULL;
+ rcu_assign_pointer(async->callback_fn, NULL);
}
}
-BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
+BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
{
struct bpf_hrtimer *t;
int ret = 0;
@@ -1348,7 +1353,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
ret = -EDEADLK;
goto out;
}
- drop_prog_refcnt(t);
+ drop_prog_refcnt(&t->cb);
out:
__bpf_spin_unlock_irqrestore(&timer->lock);
/* Cancel the timer and wait for associated callback to finish
@@ -1371,7 +1376,7 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
*/
void bpf_timer_cancel_and_free(void *val)
{
- struct bpf_timer_kern *timer = val;
+ struct bpf_async_kern *timer = val;
struct bpf_hrtimer *t;
/* Performance optimization: read timer->timer without lock first. */
@@ -1383,7 +1388,7 @@ void bpf_timer_cancel_and_free(void *val)
t = timer->timer;
if (!t)
goto out;
- drop_prog_refcnt(t);
+ drop_prog_refcnt(&t->cb);
/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
* this timer, since it won't be initialized.
*/
@@ -1410,7 +1415,7 @@ void bpf_timer_cancel_and_free(void *val)
*/
if (this_cpu_read(hrtimer_running) != t)
hrtimer_cancel(&t->timer);
- kfree_rcu(t, rcu);
+ kfree_rcu(t, cb.rcu);
}
BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
--
2.44.0
No code change except for the new flags argument being stored in the
local data struct.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
kernel/bpf/helpers.c | 91 ++++++++++++++++++++++++++++++++++++----------------
1 file changed, 63 insertions(+), 28 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5a069c70b5e6..1e6d1011303b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1111,7 +1111,10 @@ struct bpf_hrtimer {
/* the actual struct hidden inside uapi struct bpf_timer */
struct bpf_async_kern {
- struct bpf_hrtimer *timer;
+ union {
+ struct bpf_async_cb *cb;
+ struct bpf_hrtimer *timer;
+ };
/* bpf_spin_lock is used here instead of spinlock_t to make
* sure that it always fits into space reserved by struct bpf_timer
* regardless of LOCKDEP and spinlock debug flags.
@@ -1119,6 +1122,10 @@ struct bpf_async_kern {
struct bpf_spin_lock lock;
} __attribute__((aligned(8)));
+enum bpf_async_type {
+ BPF_ASYNC_TYPE_TIMER = 0,
+};
+
static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
@@ -1160,46 +1167,55 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
return HRTIMER_NORESTART;
}
-BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
- u64, flags)
+static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
+ enum bpf_async_type type)
{
- clockid_t clockid = flags & (MAX_CLOCKS - 1);
+ struct bpf_async_cb *cb;
struct bpf_hrtimer *t;
+ clockid_t clockid;
+ size_t size;
int ret = 0;
- BUILD_BUG_ON(MAX_CLOCKS != 16);
- BUILD_BUG_ON(sizeof(struct bpf_async_kern) > sizeof(struct bpf_timer));
- BUILD_BUG_ON(__alignof__(struct bpf_async_kern) != __alignof__(struct bpf_timer));
-
if (in_nmi())
return -EOPNOTSUPP;
- if (flags >= MAX_CLOCKS ||
- /* similar to timerfd except _ALARM variants are not supported */
- (clockid != CLOCK_MONOTONIC &&
- clockid != CLOCK_REALTIME &&
- clockid != CLOCK_BOOTTIME))
+ switch (type) {
+ case BPF_ASYNC_TYPE_TIMER:
+ size = sizeof(struct bpf_hrtimer);
+ break;
+ default:
return -EINVAL;
- __bpf_spin_lock_irqsave(&timer->lock);
- t = timer->timer;
+ }
+
+ __bpf_spin_lock_irqsave(&async->lock);
+ t = async->timer;
if (t) {
ret = -EBUSY;
goto out;
}
+
/* allocate hrtimer via map_kmalloc to use memcg accounting */
- t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, map->numa_node);
- if (!t) {
+ cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
+ if (!cb) {
ret = -ENOMEM;
goto out;
}
- t->cb.value = (void *)timer - map->record->timer_off;
- t->cb.map = map;
- t->cb.prog = NULL;
- rcu_assign_pointer(t->cb.callback_fn, NULL);
- hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
- t->timer.function = bpf_timer_cb;
- WRITE_ONCE(timer->timer, t);
- /* Guarantee the order between timer->timer and map->usercnt. So
+
+ if (type == BPF_ASYNC_TYPE_TIMER) {
+ clockid = flags & (MAX_CLOCKS - 1);
+ t = (struct bpf_hrtimer *)cb;
+
+ hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+ t->timer.function = bpf_timer_cb;
+ cb->value = (void *)async - map->record->timer_off;
+ }
+ cb->map = map;
+ cb->prog = NULL;
+ cb->flags = flags;
+ rcu_assign_pointer(cb->callback_fn, NULL);
+
+ WRITE_ONCE(async->cb, cb);
+ /* Guarantee the order between async->cb and map->usercnt. So
* when there are concurrent uref release and bpf timer init, either
* bpf_timer_cancel_and_free() called by uref release reads a no-NULL
* timer or atomic64_read() below returns a zero usercnt.
@@ -1209,15 +1225,34 @@ BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map
/* maps with timers must be either held by user space
* or pinned in bpffs.
*/
- WRITE_ONCE(timer->timer, NULL);
- kfree(t);
+ WRITE_ONCE(async->cb, NULL);
+ kfree(cb);
ret = -EPERM;
}
out:
- __bpf_spin_unlock_irqrestore(&timer->lock);
+ __bpf_spin_unlock_irqrestore(&async->lock);
return ret;
}
+BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
+ u64, flags)
+{
+ clock_t clockid = flags & (MAX_CLOCKS - 1);
+
+ BUILD_BUG_ON(MAX_CLOCKS != 16);
+ BUILD_BUG_ON(sizeof(struct bpf_async_kern) > sizeof(struct bpf_timer));
+ BUILD_BUG_ON(__alignof__(struct bpf_async_kern) != __alignof__(struct bpf_timer));
+
+ if (flags >= MAX_CLOCKS ||
+ /* similar to timerfd except _ALARM variants are not supported */
+ (clockid != CLOCK_MONOTONIC &&
+ clockid != CLOCK_REALTIME &&
+ clockid != CLOCK_BOOTTIME))
+ return -EINVAL;
+
+ return __bpf_async_init(timer, map, flags, BPF_ASYNC_TYPE_TIMER);
+}
+
static const struct bpf_func_proto bpf_timer_init_proto = {
.func = bpf_timer_init,
.gpl_only = true,
--
2.44.0
In the same way we have a generic __bpf_async_init(), we also need
to share code between timer and workqueue for the set_callback call.
We just add an unused flags parameter, as it will be used for workqueues.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
kernel/bpf/helpers.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1e6d1011303b..d0a645b09d3d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1262,22 +1262,23 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
.arg3_type = ARG_ANYTHING,
};
-BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
- struct bpf_prog_aux *, aux)
+static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
+ struct bpf_prog_aux *aux, unsigned int flags,
+ enum bpf_async_type type)
{
struct bpf_prog *prev, *prog = aux->prog;
- struct bpf_hrtimer *t;
+ struct bpf_async_cb *cb;
int ret = 0;
if (in_nmi())
return -EOPNOTSUPP;
- __bpf_spin_lock_irqsave(&timer->lock);
- t = timer->timer;
- if (!t) {
+ __bpf_spin_lock_irqsave(&async->lock);
+ cb = async->cb;
+ if (!cb) {
ret = -EINVAL;
goto out;
}
- if (!atomic64_read(&t->cb.map->usercnt)) {
+ if (!atomic64_read(&cb->map->usercnt)) {
/* maps with timers must be either held by user space
* or pinned in bpffs. Otherwise timer might still be
* running even when bpf prog is detached and user space
@@ -1286,7 +1287,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callb
ret = -EPERM;
goto out;
}
- prev = t->cb.prog;
+ prev = cb->prog;
if (prev != prog) {
/* Bump prog refcnt once. Every bpf_timer_set_callback()
* can pick different callback_fn-s within the same prog.
@@ -1299,14 +1300,20 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callb
if (prev)
/* Drop prev prog refcnt when swapping with new prog */
bpf_prog_put(prev);
- t->cb.prog = prog;
+ cb->prog = prog;
}
- rcu_assign_pointer(t->cb.callback_fn, callback_fn);
+ rcu_assign_pointer(cb->callback_fn, callback_fn);
out:
- __bpf_spin_unlock_irqrestore(&timer->lock);
+ __bpf_spin_unlock_irqrestore(&async->lock);
return ret;
}
+BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
+ struct bpf_prog_aux *, aux)
+{
+ return __bpf_async_set_callback(timer, callback_fn, aux, 0, BPF_ASYNC_TYPE_TIMER);
+}
+
static const struct bpf_func_proto bpf_timer_set_callback_proto = {
.func = bpf_timer_set_callback,
.gpl_only = true,
--
2.44.0
Same reason than most bpf_timer* functions, we need almost the same for
workqueues.
So extract the generic part out of it so bpf_wq_cancel_and_free can reuse
it.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
kernel/bpf/helpers.c | 42 +++++++++++++++++++++++++-----------------
1 file changed, 25 insertions(+), 17 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d0a645b09d3d..78847f444f79 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1413,36 +1413,44 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
.arg1_type = ARG_PTR_TO_TIMER,
};
-/* This function is called by map_delete/update_elem for individual element and
- * by ops->map_release_uref when the user space reference to a map reaches zero.
- */
-void bpf_timer_cancel_and_free(void *val)
+static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
{
- struct bpf_async_kern *timer = val;
- struct bpf_hrtimer *t;
+ struct bpf_async_cb *cb;
- /* Performance optimization: read timer->timer without lock first. */
- if (!READ_ONCE(timer->timer))
- return;
+ /* Performance optimization: read async->cb without lock first. */
+ if (!READ_ONCE(async->cb))
+ return NULL;
- __bpf_spin_lock_irqsave(&timer->lock);
+ __bpf_spin_lock_irqsave(&async->lock);
/* re-read it under lock */
- t = timer->timer;
- if (!t)
+ cb = async->cb;
+ if (!cb)
goto out;
- drop_prog_refcnt(&t->cb);
+ drop_prog_refcnt(cb);
/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
* this timer, since it won't be initialized.
*/
- WRITE_ONCE(timer->timer, NULL);
+ WRITE_ONCE(async->cb, NULL);
out:
- __bpf_spin_unlock_irqrestore(&timer->lock);
+ __bpf_spin_unlock_irqrestore(&async->lock);
+ return cb;
+}
+
+/* This function is called by map_delete/update_elem for individual element and
+ * by ops->map_release_uref when the user space reference to a map reaches zero.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+ struct bpf_hrtimer *t;
+
+ t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
+
if (!t)
return;
/* Cancel the timer and wait for callback to complete if it was running.
* If hrtimer_cancel() can be safely called it's safe to call kfree(t)
* right after for both preallocated and non-preallocated maps.
- * The timer->timer = NULL was already done and no code path can
+ * The async->cb = NULL was already done and no code path can
* see address 't' anymore.
*
* Check that bpf_map_delete/update_elem() wasn't called from timer
@@ -1451,7 +1459,7 @@ void bpf_timer_cancel_and_free(void *val)
* return -1). Though callback_fn is still running on this cpu it's
* safe to do kfree(t) because bpf_timer_cb() read everything it needed
* from 't'. The bpf subprog callback_fn won't be able to access 't',
- * since timer->timer = NULL was already done. The timer will be
+ * since async->cb = NULL was already done. The timer will be
* effectively cancelled because bpf_timer_cb() will return
* HRTIMER_NORESTART.
*/
--
2.44.0
Mostly a copy/paste from the bpf_timer API, without the initialization
and free, as they will be done in a separate patch.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
include/linux/bpf.h | 11 ++++++++++-
include/uapi/linux/bpf.h | 4 ++++
kernel/bpf/btf.c | 17 +++++++++++++++++
kernel/bpf/syscall.c | 6 +++++-
kernel/bpf/verifier.c | 9 +++++++++
5 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5034c1b4ded7..c7dcfd395555 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -185,7 +185,7 @@ struct bpf_map_ops {
enum {
/* Support at most 10 fields in a BTF type */
- BTF_FIELDS_MAX = 10,
+ BTF_FIELDS_MAX = 11,
};
enum btf_field_type {
@@ -202,6 +202,7 @@ enum btf_field_type {
BPF_GRAPH_NODE = BPF_RB_NODE | BPF_LIST_NODE,
BPF_GRAPH_ROOT = BPF_RB_ROOT | BPF_LIST_HEAD,
BPF_REFCOUNT = (1 << 9),
+ BPF_WORKQUEUE = (1 << 10),
};
typedef void (*btf_dtor_kfunc_t)(void *);
@@ -238,6 +239,7 @@ struct btf_record {
u32 field_mask;
int spin_lock_off;
int timer_off;
+ int wq_off;
int refcount_off;
struct btf_field fields[];
};
@@ -312,6 +314,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
return "bpf_spin_lock";
case BPF_TIMER:
return "bpf_timer";
+ case BPF_WORKQUEUE:
+ return "bpf_wq";
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
return "kptr";
@@ -340,6 +344,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
return sizeof(struct bpf_spin_lock);
case BPF_TIMER:
return sizeof(struct bpf_timer);
+ case BPF_WORKQUEUE:
+ return sizeof(struct bpf_wq);
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
case BPF_KPTR_PERCPU:
@@ -367,6 +373,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
return __alignof__(struct bpf_spin_lock);
case BPF_TIMER:
return __alignof__(struct bpf_timer);
+ case BPF_WORKQUEUE:
+ return __alignof__(struct bpf_wq);
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
case BPF_KPTR_PERCPU:
@@ -406,6 +414,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
case BPF_SPIN_LOCK:
case BPF_TIMER:
+ case BPF_WORKQUEUE:
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
case BPF_KPTR_PERCPU:
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cee0a7915c08..e4ae83550fb3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7306,6 +7306,10 @@ struct bpf_timer {
__u64 __opaque[2];
} __attribute__((aligned(8)));
+struct bpf_wq {
+ __u64 __opaque[2];
+} __attribute__((aligned(8)));
+
struct bpf_dynptr {
__u64 __opaque[2];
} __attribute__((aligned(8)));
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 90c4a32d89ff..a9b5b28a2630 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3464,6 +3464,15 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
goto end;
}
}
+ if (field_mask & BPF_WORKQUEUE) {
+ if (!strcmp(name, "bpf_wq")) {
+ if (*seen_mask & BPF_WORKQUEUE)
+ return -E2BIG;
+ *seen_mask |= BPF_WORKQUEUE;
+ type = BPF_WORKQUEUE;
+ goto end;
+ }
+ }
field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root");
@@ -3515,6 +3524,7 @@ static int btf_find_struct_field(const struct btf *btf,
switch (field_type) {
case BPF_SPIN_LOCK:
case BPF_TIMER:
+ case BPF_WORKQUEUE:
case BPF_LIST_NODE:
case BPF_RB_NODE:
case BPF_REFCOUNT:
@@ -3582,6 +3592,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
switch (field_type) {
case BPF_SPIN_LOCK:
case BPF_TIMER:
+ case BPF_WORKQUEUE:
case BPF_LIST_NODE:
case BPF_RB_NODE:
case BPF_REFCOUNT:
@@ -3816,6 +3827,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
rec->spin_lock_off = -EINVAL;
rec->timer_off = -EINVAL;
+ rec->wq_off = -EINVAL;
rec->refcount_off = -EINVAL;
for (i = 0; i < cnt; i++) {
field_type_size = btf_field_type_size(info_arr[i].type);
@@ -3846,6 +3858,11 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
/* Cache offset for faster lookup at runtime */
rec->timer_off = rec->fields[i].offset;
break;
+ case BPF_WORKQUEUE:
+ WARN_ON_ONCE(rec->wq_off >= 0);
+ /* Cache offset for faster lookup at runtime */
+ rec->wq_off = rec->fields[i].offset;
+ break;
case BPF_REFCOUNT:
WARN_ON_ONCE(rec->refcount_off >= 0);
/* Cache offset for faster lookup at runtime */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d392ec83655..0848e4141b00 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -559,6 +559,7 @@ void btf_record_free(struct btf_record *rec)
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
+ case BPF_WORKQUEUE:
/* Nothing to release */
break;
default:
@@ -608,6 +609,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
+ case BPF_WORKQUEUE:
/* Nothing to acquire */
break;
default:
@@ -679,6 +681,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
case BPF_TIMER:
bpf_timer_cancel_and_free(field_ptr);
break;
+ case BPF_WORKQUEUE:
+ break;
case BPF_KPTR_UNREF:
WRITE_ONCE(*(u64 *)field_ptr, 0);
break;
@@ -1085,7 +1089,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
map->record = btf_parse_fields(btf, value_type,
BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
- BPF_RB_ROOT | BPF_REFCOUNT,
+ BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE,
map->value_size);
if (!IS_ERR_OR_NULL(map->record)) {
int i;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2aad6d90550f..deaf2e1ab690 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1838,6 +1838,8 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
*/
if (btf_record_has_field(map->inner_map_meta->record, BPF_TIMER))
reg->map_uid = reg->id;
+ if (btf_record_has_field(map->inner_map_meta->record, BPF_WORKQUEUE))
+ reg->map_uid = reg->id;
} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
reg->type = PTR_TO_XDP_SOCK;
} else if (map->map_type == BPF_MAP_TYPE_SOCKMAP ||
@@ -18155,6 +18157,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
}
}
+ if (btf_record_has_field(map->record, BPF_WORKQUEUE)) {
+ if (is_tracing_prog_type(prog_type)) {
+ verbose(env, "tracing progs cannot use bpf_wq yet\n");
+ return -EINVAL;
+ }
+ }
+
if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
!bpf_offload_prog_map_match(prog, map)) {
verbose(env, "offload device mismatch between prog and map\n");
--
2.44.0
cp include/uapi/linux/bpf.h tools/include/uapi/linux/bpf.h
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
tools/include/uapi/linux/bpf.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index cee0a7915c08..e4ae83550fb3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7306,6 +7306,10 @@ struct bpf_timer {
__u64 __opaque[2];
} __attribute__((aligned(8)));
+struct bpf_wq {
+ __u64 __opaque[2];
+} __attribute__((aligned(8)));
+
struct bpf_dynptr {
__u64 __opaque[2];
} __attribute__((aligned(8)));
--
2.44.0
When a kfunc is declared with a KF_ARG_PTR_TO_MAP, we should have
reg->map_ptr set to a non NULL value, otherwise, that means that the
underlying type is not a map.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
new in v2
---
kernel/bpf/verifier.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index deaf2e1ab690..b95fd63864f1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11720,6 +11720,11 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_NULL:
continue;
case KF_ARG_PTR_TO_MAP:
+ if (!reg->map_ptr) {
+ verbose(env, "pointer in R%d isn't map pointer\n", regno);
+ return -EINVAL;
+ }
+ fallthrough;
case KF_ARG_PTR_TO_ALLOC_BTF_ID:
case KF_ARG_PTR_TO_BTF_ID:
if (!is_kfunc_trusted_args(meta) && !is_kfunc_rcu(meta))
--
2.44.0
Introduce support for KF_ARG_PTR_TO_WORKQUEUE. The kfuncs will use bpf_wq
as argument and that will be recognized as workqueue argument by verifier.
bpf_wq_kern casting can happen inside kfunc, but using bpf_wq in
argument makes life easier for users who work with non-kern type in BPF
progs.
Duplicate process_timer_func into process_wq_func.
meta argument is only needed to ensure bpf_wq_init's workqueue and map
arguments are coming from the same map (map_uid logic is necessary for
correct inner-map handling), so also amend check_kfunc_args() to
match what helpers functions check is doing.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- clean up already done checks in process_wq_func()
- adjust comment
- gate check_kfunc_args check with map->record->wq_off.
---
kernel/bpf/verifier.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b95fd63864f1..5ff7b52d9032 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -332,6 +332,10 @@ struct bpf_kfunc_call_arg_meta {
u8 spi;
u8 frameno;
} iter;
+ struct {
+ struct bpf_map *ptr;
+ int uid;
+ } map;
u64 mem_size;
};
@@ -7598,6 +7602,23 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
return 0;
}
+static int process_wq_func(struct bpf_verifier_env *env, int regno,
+ struct bpf_kfunc_call_arg_meta *meta)
+{
+ struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno];
+ struct bpf_map *map = reg->map_ptr;
+ u64 val = reg->var_off.value;
+
+ if (map->record->wq_off != val + reg->off) {
+ verbose(env, "off %lld doesn't point to 'struct bpf_wq' that is at %d\n",
+ val + reg->off, map->record->wq_off);
+ return -EINVAL;
+ }
+ meta->map.uid = reg->map_uid;
+ meta->map.ptr = map;
+ return 0;
+}
+
static int process_kptr_func(struct bpf_verifier_env *env, int regno,
struct bpf_call_arg_meta *meta)
{
@@ -10843,6 +10864,7 @@ enum {
KF_ARG_LIST_NODE_ID,
KF_ARG_RB_ROOT_ID,
KF_ARG_RB_NODE_ID,
+ KF_ARG_WORKQUEUE_ID,
};
BTF_ID_LIST(kf_arg_btf_ids)
@@ -10851,6 +10873,7 @@ BTF_ID(struct, bpf_list_head)
BTF_ID(struct, bpf_list_node)
BTF_ID(struct, bpf_rb_root)
BTF_ID(struct, bpf_rb_node)
+BTF_ID(struct, bpf_wq)
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
const struct btf_param *arg, int type)
@@ -10894,6 +10917,11 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
}
+static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
+}
+
static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
const struct btf_param *arg)
{
@@ -10963,6 +10991,7 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_NULL,
KF_ARG_PTR_TO_CONST_STR,
KF_ARG_PTR_TO_MAP,
+ KF_ARG_PTR_TO_WORKQUEUE,
};
enum special_kfunc_type {
@@ -11119,6 +11148,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (is_kfunc_arg_map(meta->btf, &args[argno]))
return KF_ARG_PTR_TO_MAP;
+ if (is_kfunc_arg_wq(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_WORKQUEUE;
+
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
if (!btf_type_is_struct(ref_t)) {
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
@@ -11724,6 +11756,29 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
verbose(env, "pointer in R%d isn't map pointer\n", regno);
return -EINVAL;
}
+ if (meta->map.ptr && reg->map_ptr->record->wq_off >= 0) {
+ /* Use map_uid (which is unique id of inner map) to reject:
+ * inner_map1 = bpf_map_lookup_elem(outer_map, key1)
+ * inner_map2 = bpf_map_lookup_elem(outer_map, key2)
+ * if (inner_map1 && inner_map2) {
+ * wq = bpf_map_lookup_elem(inner_map1);
+ * if (wq)
+ * // mismatch would have been allowed
+ * bpf_wq_init(wq, inner_map2);
+ * }
+ *
+ * Comparing map_ptr is enough to distinguish normal and outer maps.
+ */
+ if (meta->map.ptr != reg->map_ptr ||
+ meta->map.uid != reg->map_uid) {
+ verbose(env,
+ "workqueue pointer in R1 map_uid=%d doesn't match map pointer in R2 map_uid=%d\n",
+ meta->map.uid, reg->map_uid);
+ return -EINVAL;
+ }
+ }
+ meta->map.ptr = reg->map_ptr;
+ meta->map.uid = reg->map_uid;
fallthrough;
case KF_ARG_PTR_TO_ALLOC_BTF_ID:
case KF_ARG_PTR_TO_BTF_ID:
@@ -11757,6 +11812,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_CALLBACK:
case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
case KF_ARG_PTR_TO_CONST_STR:
+ case KF_ARG_PTR_TO_WORKQUEUE:
/* Trusted by default */
break;
default:
@@ -12043,6 +12099,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
if (ret)
return ret;
break;
+ case KF_ARG_PTR_TO_WORKQUEUE:
+ if (reg->type != PTR_TO_MAP_VALUE) {
+ verbose(env, "arg#%d doesn't point to a map value\n", i);
+ return -EINVAL;
+ }
+ ret = process_wq_func(env, regno, meta);
+ if (ret < 0)
+ return ret;
+ break;
}
}
--
2.44.0
Currently bpf_wq_cancel_and_free() is just a placeholder as there is
no memory allocation for bpf_wq just yet.
Again, duplication of the bpf_timer approach
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- removed extra fallthrough
---
include/linux/bpf.h | 2 ++
kernel/bpf/arraymap.c | 18 ++++++++++-------
kernel/bpf/hashtab.c | 55 ++++++++++++++++++++++++++++++++++++++++-----------
kernel/bpf/helpers.c | 8 ++++++++
kernel/bpf/syscall.c | 9 +++++++++
5 files changed, 73 insertions(+), 19 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c7dcfd395555..64bf0cf3ee95 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -534,6 +534,7 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
bool lock_src);
void bpf_timer_cancel_and_free(void *timer);
+void bpf_wq_cancel_and_free(void *timer);
void bpf_list_head_free(const struct btf_field *field, void *list_head,
struct bpf_spin_lock *spin_lock);
void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
@@ -2204,6 +2205,7 @@ void bpf_map_free_record(struct bpf_map *map);
struct btf_record *btf_record_dup(const struct btf_record *rec);
bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *rec_b);
void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
+void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 8c1e6d7654bb..580d07b15471 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -428,17 +428,21 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
return (void *)round_down((unsigned long)array, PAGE_SIZE);
}
-static void array_map_free_timers(struct bpf_map *map)
+static void array_map_free_timers_wq(struct bpf_map *map)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
- /* We don't reset or free fields other than timer on uref dropping to zero. */
- if (!btf_record_has_field(map->record, BPF_TIMER))
- return;
+ /* We don't reset or free fields other than timer and workqueue
+ * on uref dropping to zero.
+ */
+ if (btf_record_has_field(map->record, BPF_TIMER))
+ for (i = 0; i < array->map.max_entries; i++)
+ bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
- for (i = 0; i < array->map.max_entries; i++)
- bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i));
+ if (btf_record_has_field(map->record, BPF_WORKQUEUE))
+ for (i = 0; i < array->map.max_entries; i++)
+ bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i));
}
/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
@@ -782,7 +786,7 @@ const struct bpf_map_ops array_map_ops = {
.map_alloc = array_map_alloc,
.map_free = array_map_free,
.map_get_next_key = array_map_get_next_key,
- .map_release_uref = array_map_free_timers,
+ .map_release_uref = array_map_free_timers_wq,
.map_lookup_elem = array_map_lookup_elem,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83a9a74260e9..4f8590067c6a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -240,6 +240,26 @@ static void htab_free_prealloced_timers(struct bpf_htab *htab)
}
}
+static void htab_free_prealloced_wq(struct bpf_htab *htab)
+{
+ u32 num_entries = htab->map.max_entries;
+ int i;
+
+ if (!btf_record_has_field(htab->map.record, BPF_WORKQUEUE))
+ return;
+ if (htab_has_extra_elems(htab))
+ num_entries += num_possible_cpus();
+
+ for (i = 0; i < num_entries; i++) {
+ struct htab_elem *elem;
+
+ elem = get_htab_elem(htab, i);
+ bpf_obj_free_workqueue(htab->map.record,
+ elem->key + round_up(htab->map.key_size, 8));
+ cond_resched();
+ }
+}
+
static void htab_free_prealloced_fields(struct bpf_htab *htab)
{
u32 num_entries = htab->map.max_entries;
@@ -1495,7 +1515,7 @@ static void delete_all_elements(struct bpf_htab *htab)
migrate_enable();
}
-static void htab_free_malloced_timers(struct bpf_htab *htab)
+static void htab_free_malloced_timers_or_wq(struct bpf_htab *htab, bool is_timer)
{
int i;
@@ -1507,24 +1527,35 @@ static void htab_free_malloced_timers(struct bpf_htab *htab)
hlist_nulls_for_each_entry(l, n, head, hash_node) {
/* We only free timer on uref dropping to zero */
- bpf_obj_free_timer(htab->map.record, l->key + round_up(htab->map.key_size, 8));
+ if (is_timer)
+ bpf_obj_free_timer(htab->map.record,
+ l->key + round_up(htab->map.key_size, 8));
+ else
+ bpf_obj_free_workqueue(htab->map.record,
+ l->key + round_up(htab->map.key_size, 8));
}
cond_resched_rcu();
}
rcu_read_unlock();
}
-static void htab_map_free_timers(struct bpf_map *map)
+static void htab_map_free_timers_and_wq(struct bpf_map *map)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
- /* We only free timer on uref dropping to zero */
- if (!btf_record_has_field(htab->map.record, BPF_TIMER))
- return;
- if (!htab_is_prealloc(htab))
- htab_free_malloced_timers(htab);
- else
- htab_free_prealloced_timers(htab);
+ /* We only free timer and workqueue on uref dropping to zero */
+ if (btf_record_has_field(htab->map.record, BPF_TIMER)) {
+ if (!htab_is_prealloc(htab))
+ htab_free_malloced_timers_or_wq(htab, true);
+ else
+ htab_free_prealloced_timers(htab);
+ }
+ if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) {
+ if (!htab_is_prealloc(htab))
+ htab_free_malloced_timers_or_wq(htab, false);
+ else
+ htab_free_prealloced_wq(htab);
+ }
}
/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
@@ -2260,7 +2291,7 @@ const struct bpf_map_ops htab_map_ops = {
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
- .map_release_uref = htab_map_free_timers,
+ .map_release_uref = htab_map_free_timers_and_wq,
.map_lookup_elem = htab_map_lookup_elem,
.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
.map_update_elem = htab_map_update_elem,
@@ -2281,7 +2312,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
- .map_release_uref = htab_map_free_timers,
+ .map_release_uref = htab_map_free_timers_and_wq,
.map_lookup_elem = htab_lru_map_lookup_elem,
.map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem,
.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 78847f444f79..9fd12d480b8b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1468,6 +1468,14 @@ void bpf_timer_cancel_and_free(void *val)
kfree_rcu(t, cb.rcu);
}
+/* This function is called by map_delete/update_elem for individual element and
+ * by ops->map_release_uref when the user space reference to a map reaches zero.
+ */
+void bpf_wq_cancel_and_free(void *val)
+{
+ BTF_TYPE_EMIT(struct bpf_wq);
+}
+
BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
{
unsigned long *kptr = map_value;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0848e4141b00..63e368337483 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -661,6 +661,13 @@ void bpf_obj_free_timer(const struct btf_record *rec, void *obj)
bpf_timer_cancel_and_free(obj + rec->timer_off);
}
+void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj)
+{
+ if (WARN_ON_ONCE(!btf_record_has_field(rec, BPF_WORKQUEUE)))
+ return;
+ bpf_wq_cancel_and_free(obj + rec->wq_off);
+}
+
void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
{
const struct btf_field *fields;
@@ -682,6 +689,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
bpf_timer_cancel_and_free(field_ptr);
break;
case BPF_WORKQUEUE:
+ bpf_wq_cancel_and_free(field_ptr);
break;
case BPF_KPTR_UNREF:
WRITE_ONCE(*(u64 *)field_ptr, 0);
@@ -1119,6 +1127,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
}
break;
case BPF_TIMER:
+ case BPF_WORKQUEUE:
if (map->map_type != BPF_MAP_TYPE_HASH &&
map->map_type != BPF_MAP_TYPE_LRU_HASH &&
map->map_type != BPF_MAP_TYPE_ARRAY) {
--
2.44.0
We simply try in all supported map types if we can store/load a bpf_wq.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
no changes in v2
---
tools/testing/selftests/bpf/prog_tests/wq.c | 11 +++
tools/testing/selftests/bpf/progs/wq.c | 135 ++++++++++++++++++++++++++++
2 files changed, 146 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/wq.c b/tools/testing/selftests/bpf/prog_tests/wq.c
new file mode 100644
index 000000000000..9a07b8bc2c52
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/wq.c
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Benjamin Tissoires */
+#include <test_progs.h>
+#include "wq.skel.h"
+
+void serial_test_wq(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ RUN_TESTS(wq);
+}
diff --git a/tools/testing/selftests/bpf/progs/wq.c b/tools/testing/selftests/bpf/progs/wq.c
new file mode 100644
index 000000000000..8c668ad04059
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/wq.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Benjamin Tissoires
+ */
+
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct hmap_elem {
+ int counter;
+ struct bpf_timer timer; /* unused */
+ struct bpf_spin_lock lock; /* unused */
+ struct bpf_wq work;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 1000);
+ __type(key, int);
+ __type(value, struct hmap_elem);
+} hmap SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __uint(max_entries, 1000);
+ __type(key, int);
+ __type(value, struct hmap_elem);
+} hmap_malloc SEC(".maps");
+
+struct elem {
+ struct bpf_wq w;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 2);
+ __type(key, int);
+ __type(value, struct elem);
+} array SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_LRU_HASH);
+ __uint(max_entries, 4);
+ __type(key, int);
+ __type(value, struct elem);
+} lru SEC(".maps");
+
+static int test_elem_callback(void *map, int *key)
+{
+ struct elem init = {}, *val;
+
+ if (map == &lru &&
+ bpf_map_update_elem(map, key, &init, 0))
+ return -1;
+
+ val = bpf_map_lookup_elem(map, key);
+ if (!val)
+ return -2;
+
+ return 0;
+}
+
+static int test_hmap_elem_callback(void *map, int *key)
+{
+ struct hmap_elem init = {}, *val;
+
+ if (bpf_map_update_elem(map, key, &init, 0))
+ return -1;
+
+ val = bpf_map_lookup_elem(map, key);
+ if (!val)
+ return -2;
+
+ return 0;
+}
+
+SEC("tc")
+/* test that workqueues can be used from an array
+ */
+__retval(0)
+long test_call_array_sleepable(void *ctx)
+{
+ int key = 0;
+
+ return test_elem_callback(&array, &key);
+}
+
+SEC("syscall")
+/* Same test than above but from a sleepable context.
+ */
+__retval(0)
+long test_syscall_array_sleepable(void *ctx)
+{
+ int key = 1;
+
+ return test_elem_callback(&array, &key);
+}
+
+SEC("tc")
+/* test that workqueues can be used from a hashmap
+ */
+__retval(0)
+long test_call_hash_sleepable(void *ctx)
+{
+ int key = 2;
+
+ return test_hmap_elem_callback(&hmap, &key);
+}
+
+SEC("tc")
+/* test that workqueues can be used from a hashmap
+ * with NO_PREALLOC.
+ */
+__retval(0)
+long test_call_hash_malloc_sleepable(void *ctx)
+{
+ int key = 3;
+
+ return test_hmap_elem_callback(&hmap_malloc, &key);
+}
+
+SEC("tc")
+/* test that workqueues can be used from a LRU map
+ */
+__retval(0)
+long test_call_lru_sleepable(void *ctx)
+{
+ int key = 4;
+
+ return test_elem_callback(&lru, &key);
+}
--
2.44.0
We need to teach the verifier about the second argument which is declared
as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
this extra case if we declared the second argument as struct bpf_map *,
but that means users will have to do extra casting to have their program
compile.
We also need to duplicate the timer code for the checking if the map
argument is matching the provided workqueue.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
FWIW, I still have one concern with this implementation:
- bpf_wq_work() access ->prog without protection, but I think this might
be racing with bpf_wq_set_callback(): if we have the following:
CPU 0 CPU 1
bpf_wq_set_callback()
bpf_start()
bpf_wq_work():
prog = cb->prog;
bpf_wq_set_callback()
cb->prog = prog;
bpf_prog_put(prev)
rcu_assign_ptr(cb->callback_fn,
callback_fn);
callback = READ_ONCE(w->cb.callback_fn);
As I understand callback_fn is fine, prog might be, but we clearly
have an inconstency between "prog" and "callback_fn" as they can come
from 2 different bpf_wq_set_callback() calls.
IMO we should protect this by the async->lock, but I'm not sure if
it's OK or not.
---
changes in v2:
- dropped BPF_F_WQ_SLEEPABLE
- use __map suffix to drop special case in verifier
- in bpf_wq_work: use new local variable cb
---
kernel/bpf/helpers.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 102 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9fd12d480b8b..b12486adc47a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1109,11 +1109,18 @@ struct bpf_hrtimer {
struct hrtimer timer;
};
-/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_work {
+ struct bpf_async_cb cb;
+ struct work_struct work;
+ struct work_struct delete_work;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer and bpf_wq */
struct bpf_async_kern {
union {
struct bpf_async_cb *cb;
struct bpf_hrtimer *timer;
+ struct bpf_work *work;
};
/* bpf_spin_lock is used here instead of spinlock_t to make
* sure that it always fits into space reserved by struct bpf_timer
@@ -1124,6 +1131,7 @@ struct bpf_async_kern {
enum bpf_async_type {
BPF_ASYNC_TYPE_TIMER = 0,
+ BPF_ASYNC_TYPE_WQ,
};
static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
@@ -1167,11 +1175,64 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
return HRTIMER_NORESTART;
}
+static void bpf_wq_work(struct work_struct *work)
+{
+ struct bpf_work *w = container_of(work, struct bpf_work, work);
+ struct bpf_tramp_run_ctx __maybe_unused run_ctx;
+ struct bpf_async_cb *cb = &w->cb;
+ struct bpf_prog *prog = cb->prog;
+ struct bpf_map *map = cb->map;
+ bpf_callback_t callback_fn;
+ void *value = cb->value;
+ void *key;
+ u32 idx;
+
+ BTF_TYPE_EMIT(struct bpf_wq);
+
+ callback_fn = READ_ONCE(cb->callback_fn);
+ if (!callback_fn || !prog)
+ return;
+
+ if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+ struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+ /* compute the key */
+ idx = ((char *)value - array->value) / array->elem_size;
+ key = &idx;
+ } else { /* hash or lru */
+ key = value - round_up(map->key_size, 8);
+ }
+
+ run_ctx.bpf_cookie = 0;
+
+ if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) {
+ /* recursion detected */
+ __bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx);
+ return;
+ }
+
+ callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
+ /* The verifier checked that return value is zero. */
+
+ __bpf_prog_exit_sleepable_recur(prog, 0 /* bpf_prog_run does runtime stats */,
+ &run_ctx);
+}
+
+static void bpf_wq_delete_work(struct work_struct *work)
+{
+ struct bpf_work *w = container_of(work, struct bpf_work, delete_work);
+
+ cancel_work_sync(&w->work);
+
+ kfree_rcu(w, cb.rcu);
+}
+
static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
enum bpf_async_type type)
{
struct bpf_async_cb *cb;
struct bpf_hrtimer *t;
+ struct bpf_work *w;
clockid_t clockid;
size_t size;
int ret = 0;
@@ -1183,6 +1244,9 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
case BPF_ASYNC_TYPE_TIMER:
size = sizeof(struct bpf_hrtimer);
break;
+ case BPF_ASYNC_TYPE_WQ:
+ size = sizeof(struct bpf_work);
+ break;
default:
return -EINVAL;
}
@@ -1201,13 +1265,22 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
goto out;
}
- if (type == BPF_ASYNC_TYPE_TIMER) {
+ switch (type) {
+ case BPF_ASYNC_TYPE_TIMER:
clockid = flags & (MAX_CLOCKS - 1);
t = (struct bpf_hrtimer *)cb;
hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
t->timer.function = bpf_timer_cb;
cb->value = (void *)async - map->record->timer_off;
+ break;
+ case BPF_ASYNC_TYPE_WQ:
+ w = (struct bpf_work *)cb;
+
+ INIT_WORK(&w->work, bpf_wq_work);
+ INIT_WORK(&w->delete_work, bpf_wq_delete_work);
+ cb->value = (void *)async - map->record->wq_off;
+ break;
}
cb->map = map;
cb->prog = NULL;
@@ -1473,7 +1546,19 @@ void bpf_timer_cancel_and_free(void *val)
*/
void bpf_wq_cancel_and_free(void *val)
{
+ struct bpf_work *work;
+
BTF_TYPE_EMIT(struct bpf_wq);
+
+ work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
+ if (!work)
+ return;
+ /* Trigger cancel of the sleepable work, but *do not* wait for
+ * it to finish if it was running as we might not be in a
+ * sleepable context.
+ * kfree will be called once the work has finished.
+ */
+ schedule_work(&work->delete_work);
}
BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
@@ -2612,6 +2697,20 @@ __bpf_kfunc void bpf_throw(u64 cookie)
WARN(1, "A call to BPF exception callback should never return\n");
}
+__bpf_kfunc int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags)
+{
+ struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
+ struct bpf_map *map = p__map;
+
+ BUILD_BUG_ON(sizeof(struct bpf_async_kern) > sizeof(struct bpf_wq));
+ BUILD_BUG_ON(__alignof__(struct bpf_async_kern) != __alignof__(struct bpf_wq));
+
+ if (flags)
+ return -EINVAL;
+
+ return __bpf_async_init(async, map, flags, BPF_ASYNC_TYPE_WQ);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -2689,6 +2788,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
BTF_ID_FLAGS(func, bpf_dynptr_size)
BTF_ID_FLAGS(func, bpf_dynptr_clone)
BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
+BTF_ID_FLAGS(func, bpf_wq_init)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
--
2.44.0
Allows to test if allocation/free works
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- dropped mark_precise checks
---
tools/testing/selftests/bpf/bpf_experimental.h | 1 +
tools/testing/selftests/bpf/prog_tests/wq.c | 8 +++
tools/testing/selftests/bpf/progs/wq.c | 10 ++++
tools/testing/selftests/bpf/progs/wq_failures.c | 78 +++++++++++++++++++++++++
4 files changed, 97 insertions(+)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 3329ea080865..785b91b629be 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -470,4 +470,5 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it,
extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
+extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
#endif
diff --git a/tools/testing/selftests/bpf/prog_tests/wq.c b/tools/testing/selftests/bpf/prog_tests/wq.c
index 9a07b8bc2c52..26ab69796103 100644
--- a/tools/testing/selftests/bpf/prog_tests/wq.c
+++ b/tools/testing/selftests/bpf/prog_tests/wq.c
@@ -2,6 +2,7 @@
/* Copyright (c) 2024 Benjamin Tissoires */
#include <test_progs.h>
#include "wq.skel.h"
+#include "wq_failures.skel.h"
void serial_test_wq(void)
{
@@ -9,3 +10,10 @@ void serial_test_wq(void)
RUN_TESTS(wq);
}
+
+void serial_test_failures_wq(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ RUN_TESTS(wq_failures);
+}
diff --git a/tools/testing/selftests/bpf/progs/wq.c b/tools/testing/selftests/bpf/progs/wq.c
index 8c668ad04059..f92466eb8fb1 100644
--- a/tools/testing/selftests/bpf/progs/wq.c
+++ b/tools/testing/selftests/bpf/progs/wq.c
@@ -52,6 +52,7 @@ struct {
static int test_elem_callback(void *map, int *key)
{
struct elem init = {}, *val;
+ struct bpf_wq *wq;
if (map == &lru &&
bpf_map_update_elem(map, key, &init, 0))
@@ -61,12 +62,17 @@ static int test_elem_callback(void *map, int *key)
if (!val)
return -2;
+ wq = &val->w;
+ if (bpf_wq_init(wq, map, 0) != 0)
+ return -3;
+
return 0;
}
static int test_hmap_elem_callback(void *map, int *key)
{
struct hmap_elem init = {}, *val;
+ struct bpf_wq *wq;
if (bpf_map_update_elem(map, key, &init, 0))
return -1;
@@ -75,6 +81,10 @@ static int test_hmap_elem_callback(void *map, int *key)
if (!val)
return -2;
+ wq = &val->work;
+ if (bpf_wq_init(wq, map, 0) != 0)
+ return -3;
+
return 0;
}
diff --git a/tools/testing/selftests/bpf/progs/wq_failures.c b/tools/testing/selftests/bpf/progs/wq_failures.c
new file mode 100644
index 000000000000..db7015c7d541
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/wq_failures.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Benjamin Tissoires
+ */
+
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct elem {
+ struct bpf_wq w;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 2);
+ __type(key, int);
+ __type(value, struct elem);
+} array SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_LRU_HASH);
+ __uint(max_entries, 4);
+ __type(key, int);
+ __type(value, struct elem);
+} lru SEC(".maps");
+
+SEC("tc")
+/* test that bpf_wq_init takes a map as a second argument
+ */
+__log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+__failure
+__msg(": (85) call bpf_wq_init#") /* anchor message */
+__msg("pointer in R2 isn't map pointer")
+long test_wq_init_nomap(void *ctx)
+{
+ struct bpf_wq *wq;
+ struct elem *val;
+ int key = 0;
+
+ val = bpf_map_lookup_elem(&array, &key);
+ if (!val)
+ return -1;
+
+ wq = &val->w;
+ if (bpf_wq_init(wq, &key, 0) != 0)
+ return -3;
+
+ return 0;
+}
+
+SEC("tc")
+/* test that the workqueue is part of the map in bpf_wq_init
+ */
+__log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+__failure
+__msg(": (85) call bpf_wq_init#") /* anchor message */
+__msg("workqueue pointer in R1 map_uid=0 doesn't match map pointer in R2 map_uid=0")
+long test_wq_init_wrong_map(void *ctx)
+{
+ struct bpf_wq *wq;
+ struct elem *val;
+ int key = 0;
+
+ val = bpf_map_lookup_elem(&array, &key);
+ if (!val)
+ return -1;
+
+ wq = &val->w;
+ if (bpf_wq_init(wq, &lru, 0) != 0)
+ return -3;
+
+ return 0;
+}
--
2.44.0
To support sleepable async callbacks, we need to tell push_async_cb()
whether the cb is sleepable or not.
The verifier now detects that we are in bpf_wq_set_callback_impl and
can allow a sleepable callback to happen.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- dropped flags check for sleepable
- squashed with v1 14/18 as there is no point in having it separated now
---
include/linux/bpf_verifier.h | 1 +
kernel/bpf/helpers.c | 15 +++++++++
kernel/bpf/verifier.c | 75 ++++++++++++++++++++++++++++++++++++++++----
3 files changed, 85 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 36d19cd32eb5..9db35530c878 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -426,6 +426,7 @@ struct bpf_verifier_state {
* while they are still in use.
*/
bool used_as_loop_entry;
+ bool in_sleepable;
/* first and last insn idx of this verifier state */
u32 first_insn_idx;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b12486adc47a..85d3f483f27f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2711,6 +2711,20 @@ __bpf_kfunc int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags)
return __bpf_async_init(async, map, flags, BPF_ASYNC_TYPE_WQ);
}
+__bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
+ int (callback_fn)(void *map, int *key, struct bpf_wq *wq),
+ unsigned int flags,
+ void *aux__ign)
+{
+ struct bpf_prog_aux *aux = (struct bpf_prog_aux *)aux__ign;
+ struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
+
+ if (flags)
+ return -EINVAL;
+
+ return __bpf_async_set_callback(async, callback_fn, aux, flags, BPF_ASYNC_TYPE_WQ);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -2789,6 +2803,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_size)
BTF_ID_FLAGS(func, bpf_dynptr_clone)
BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
BTF_ID_FLAGS(func, bpf_wq_init)
+BTF_ID_FLAGS(func, bpf_wq_set_callback_impl)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5ff7b52d9032..e8603cff395b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -501,8 +501,12 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
}
static bool is_sync_callback_calling_kfunc(u32 btf_id);
+static bool is_async_callback_calling_kfunc(u32 btf_id);
+static bool is_callback_calling_kfunc(u32 btf_id);
static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
+static bool is_bpf_wq_set_callback_impl_kfunc(u32 btf_id);
+
static bool is_sync_callback_calling_function(enum bpf_func_id func_id)
{
return func_id == BPF_FUNC_for_each_map_elem ||
@@ -530,7 +534,8 @@ static bool is_sync_callback_calling_insn(struct bpf_insn *insn)
static bool is_async_callback_calling_insn(struct bpf_insn *insn)
{
- return bpf_helper_call(insn) && is_async_callback_calling_function(insn->imm);
+ return (bpf_helper_call(insn) && is_async_callback_calling_function(insn->imm)) ||
+ (bpf_pseudo_kfunc_call(insn) && is_async_callback_calling_kfunc(insn->imm));
}
static bool is_may_goto_insn(struct bpf_insn *insn)
@@ -1429,6 +1434,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
}
dst_state->speculative = src->speculative;
dst_state->active_rcu_lock = src->active_rcu_lock;
+ dst_state->in_sleepable = src->in_sleepable;
dst_state->curframe = src->curframe;
dst_state->active_lock.ptr = src->active_lock.ptr;
dst_state->active_lock.id = src->active_lock.id;
@@ -2404,7 +2410,7 @@ static void init_func_state(struct bpf_verifier_env *env,
/* Similar to push_stack(), but for async callbacks */
static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
int insn_idx, int prev_insn_idx,
- int subprog)
+ int subprog, bool is_sleepable)
{
struct bpf_verifier_stack_elem *elem;
struct bpf_func_state *frame;
@@ -2431,6 +2437,7 @@ static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
* Initialize it similar to do_check_common().
*/
elem->st.branches = 1;
+ elem->st.in_sleepable = is_sleepable;
frame = kzalloc(sizeof(*frame), GFP_KERNEL);
if (!frame)
goto err;
@@ -5278,7 +5285,8 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
static bool in_sleepable(struct bpf_verifier_env *env)
{
- return env->prog->sleepable;
+ return env->prog->sleepable ||
+ (env->cur_state && env->cur_state->in_sleepable);
}
/* The non-sleepable programs and sleepable programs with explicit bpf_rcu_read_lock()
@@ -9513,7 +9521,7 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
*/
env->subprog_info[subprog].is_cb = true;
if (bpf_pseudo_kfunc_call(insn) &&
- !is_sync_callback_calling_kfunc(insn->imm)) {
+ !is_callback_calling_kfunc(insn->imm)) {
verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
func_id_name(insn->imm), insn->imm);
return -EFAULT;
@@ -9527,10 +9535,11 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
if (is_async_callback_calling_insn(insn)) {
struct bpf_verifier_state *async_cb;
- /* there is no real recursion here. timer callbacks are async */
+ /* there is no real recursion here. timer and workqueue callbacks are async */
env->subprog_info[subprog].is_async_cb = true;
async_cb = push_async_cb(env, env->subprog_info[subprog].start,
- insn_idx, subprog);
+ insn_idx, subprog,
+ is_bpf_wq_set_callback_impl_kfunc(insn->imm));
if (!async_cb)
return -EFAULT;
callee = async_cb->frame[0];
@@ -11018,6 +11027,7 @@ enum special_kfunc_type {
KF_bpf_percpu_obj_drop_impl,
KF_bpf_throw,
KF_bpf_iter_css_task_new,
+ KF_bpf_wq_set_callback_impl,
};
BTF_SET_START(special_kfunc_set)
@@ -11044,6 +11054,7 @@ BTF_ID(func, bpf_throw)
#ifdef CONFIG_CGROUPS
BTF_ID(func, bpf_iter_css_task_new)
#endif
+BTF_ID(func, bpf_wq_set_callback_impl)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
@@ -11074,6 +11085,7 @@ BTF_ID(func, bpf_iter_css_task_new)
#else
BTF_ID_UNUSED
#endif
+BTF_ID(func, bpf_wq_set_callback_impl)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -11402,12 +11414,28 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)
return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
}
+static bool is_async_callback_calling_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+}
+
static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
{
return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
insn->imm == special_kfunc_list[KF_bpf_throw];
}
+static bool is_bpf_wq_set_callback_impl_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+}
+
+static bool is_callback_calling_kfunc(u32 btf_id)
+{
+ return is_sync_callback_calling_kfunc(btf_id) ||
+ is_async_callback_calling_kfunc(btf_id);
+}
+
static bool is_rbtree_lock_required_kfunc(u32 btf_id)
{
return is_bpf_rbtree_api_kfunc(btf_id);
@@ -12219,6 +12247,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
}
}
+ if (is_bpf_wq_set_callback_impl_kfunc(meta.func_id)) {
+ err = push_callback_call(env, insn, insn_idx, meta.subprogno,
+ set_timer_callback_state);
+ if (err) {
+ verbose(env, "kfunc %s#%d failed callback verification\n",
+ func_name, meta.func_id);
+ return err;
+ }
+ }
+
rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta);
rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta);
@@ -16982,6 +17020,9 @@ static bool states_equal(struct bpf_verifier_env *env,
if (old->active_rcu_lock != cur->active_rcu_lock)
return false;
+ if (old->in_sleepable != cur->in_sleepable)
+ return false;
+
/* for states to be equal callsites have to be the same
* and all frame states need to be equivalent
*/
@@ -19653,6 +19694,28 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
*cnt = 1;
+ } else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
+ /* The verifier will process callback_fn as many times as necessary
+ * with different maps and the register states prepared by
+ * set_timer_callback_state will be accurate.
+ *
+ * The following use case is valid:
+ * map1 is shared by prog1, prog2, prog3.
+ * prog1 calls bpf_timer_init for some map1 elements
+ * prog2 calls bpf_timer_set_callback for some map1 elements.
+ * Those that were not bpf_timer_init-ed will return -EINVAL.
+ * prog3 calls bpf_timer_start for some map1 elements.
+ * Those that were not both bpf_timer_init-ed and
+ * bpf_timer_set_callback-ed will return -EINVAL.
+ */
+ struct bpf_insn ld_addrs[2] = {
+ BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux),
+ };
+
+ insn_buf[0] = ld_addrs[0];
+ insn_buf[1] = ld_addrs[1];
+ insn_buf[2] = *insn;
+ *cnt = 3;
}
return 0;
}
--
2.44.0
We assign the callback and set everything up.
The actual tests of these callbacks will be done when bpf_wq_start() is
available.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- removed mark_precise checks
- adapted without BPF_F_WQ_SLEEPABLE
---
tools/testing/selftests/bpf/bpf_experimental.h | 5 ++
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 5 ++
.../selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h | 1 +
tools/testing/selftests/bpf/progs/wq.c | 41 +++++++++++---
tools/testing/selftests/bpf/progs/wq_failures.c | 66 ++++++++++++++++++++++
5 files changed, 111 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 785b91b629be..b80b39f76034 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -471,4 +471,9 @@ extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
+extern int bpf_wq_set_callback_impl(struct bpf_wq *wq,
+ int (callback_fn)(void *map, int *key, struct bpf_wq *wq),
+ unsigned int flags__k, void *aux__ign) __ksym;
+#define bpf_wq_set_callback(timer, cb, flags) \
+ bpf_wq_set_callback_impl(timer, cb, flags, NULL)
#endif
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..eb2b78552ca2 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -494,6 +494,10 @@ __bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused
return arg;
}
+__bpf_kfunc void bpf_kfunc_call_test_sleepable(void)
+{
+}
+
BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -520,6 +524,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS | KF_RCU)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_destructive, KF_DESTRUCTIVE)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_sleepable, KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index 7c664dd61059..ce5cd763561c 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -96,6 +96,7 @@ void bpf_kfunc_call_test_pass2(struct prog_test_pass2 *p) __ksym;
void bpf_kfunc_call_test_mem_len_fail2(__u64 *mem, int len) __ksym;
void bpf_kfunc_call_test_destructive(void) __ksym;
+void bpf_kfunc_call_test_sleepable(void) __ksym;
void bpf_kfunc_call_test_offset(struct prog_test_ref_kfunc *p);
struct prog_test_member *bpf_kfunc_call_memb_acquire(void);
diff --git a/tools/testing/selftests/bpf/progs/wq.c b/tools/testing/selftests/bpf/progs/wq.c
index f92466eb8fb1..f746296d32b1 100644
--- a/tools/testing/selftests/bpf/progs/wq.c
+++ b/tools/testing/selftests/bpf/progs/wq.c
@@ -49,7 +49,8 @@ struct {
__type(value, struct elem);
} lru SEC(".maps");
-static int test_elem_callback(void *map, int *key)
+static int test_elem_callback(void *map, int *key,
+ int (callback_fn)(void *map, int *key, struct bpf_wq *wq))
{
struct elem init = {}, *val;
struct bpf_wq *wq;
@@ -66,10 +67,14 @@ static int test_elem_callback(void *map, int *key)
if (bpf_wq_init(wq, map, 0) != 0)
return -3;
+ if (bpf_wq_set_callback(wq, callback_fn, 0))
+ return -4;
+
return 0;
}
-static int test_hmap_elem_callback(void *map, int *key)
+static int test_hmap_elem_callback(void *map, int *key,
+ int (callback_fn)(void *map, int *key, struct bpf_wq *wq))
{
struct hmap_elem init = {}, *val;
struct bpf_wq *wq;
@@ -85,6 +90,28 @@ static int test_hmap_elem_callback(void *map, int *key)
if (bpf_wq_init(wq, map, 0) != 0)
return -3;
+ if (bpf_wq_set_callback(wq, callback_fn, 0))
+ return -4;
+
+ return 0;
+}
+
+__u32 ok;
+__u32 ok_sleepable;
+
+/* callback for non sleepable workqueue */
+static int wq_callback(void *map, int *key, struct bpf_wq *work)
+{
+ bpf_kfunc_common_test();
+ ok |= (1 << *key);
+ return 0;
+}
+
+/* callback for sleepable workqueue */
+static int wq_cb_sleepable(void *map, int *key, struct bpf_wq *work)
+{
+ bpf_kfunc_call_test_sleepable();
+ ok_sleepable |= (1 << *key);
return 0;
}
@@ -96,7 +123,7 @@ long test_call_array_sleepable(void *ctx)
{
int key = 0;
- return test_elem_callback(&array, &key);
+ return test_elem_callback(&array, &key, wq_cb_sleepable);
}
SEC("syscall")
@@ -107,7 +134,7 @@ long test_syscall_array_sleepable(void *ctx)
{
int key = 1;
- return test_elem_callback(&array, &key);
+ return test_elem_callback(&array, &key, wq_cb_sleepable);
}
SEC("tc")
@@ -118,7 +145,7 @@ long test_call_hash_sleepable(void *ctx)
{
int key = 2;
- return test_hmap_elem_callback(&hmap, &key);
+ return test_hmap_elem_callback(&hmap, &key, wq_callback);
}
SEC("tc")
@@ -130,7 +157,7 @@ long test_call_hash_malloc_sleepable(void *ctx)
{
int key = 3;
- return test_hmap_elem_callback(&hmap_malloc, &key);
+ return test_hmap_elem_callback(&hmap_malloc, &key, wq_callback);
}
SEC("tc")
@@ -141,5 +168,5 @@ long test_call_lru_sleepable(void *ctx)
{
int key = 4;
- return test_elem_callback(&lru, &key);
+ return test_elem_callback(&lru, &key, wq_callback);
}
diff --git a/tools/testing/selftests/bpf/progs/wq_failures.c b/tools/testing/selftests/bpf/progs/wq_failures.c
index db7015c7d541..4cbdb425f223 100644
--- a/tools/testing/selftests/bpf/progs/wq_failures.c
+++ b/tools/testing/selftests/bpf/progs/wq_failures.c
@@ -27,6 +27,20 @@ struct {
__type(value, struct elem);
} lru SEC(".maps");
+/* callback for non sleepable workqueue */
+static int wq_callback(void *map, int *key, struct bpf_wq *work)
+{
+ bpf_kfunc_common_test();
+ return 0;
+}
+
+/* callback for sleepable workqueue */
+static int wq_cb_sleepable(void *map, int *key, struct bpf_wq *work)
+{
+ bpf_kfunc_call_test_sleepable();
+ return 0;
+}
+
SEC("tc")
/* test that bpf_wq_init takes a map as a second argument
*/
@@ -76,3 +90,55 @@ long test_wq_init_wrong_map(void *ctx)
return 0;
}
+
+SEC("?tc")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_wq_set_callback()
+ * is a correct bpf_wq pointer.
+ */
+__msg(": (85) call bpf_wq_set_callback_impl#") /* anchor message */
+__msg("arg#0 doesn't point to a map value")
+long test_wrong_wq_pointer(void *ctx)
+{
+ int key = 0;
+ struct bpf_wq *wq;
+
+ wq = bpf_map_lookup_elem(&array, &key);
+ if (!wq)
+ return 1;
+
+ if (bpf_wq_init(wq, &array, 0))
+ return 2;
+
+ if (bpf_wq_set_callback((void *)&wq, wq_callback, 0))
+ return 3;
+
+ return -22;
+}
+
+SEC("?tc")
+__log_level(2)
+__failure
+/* check that the first argument of bpf_wq_set_callback()
+ * is a correct bpf_wq pointer.
+ */
+__msg(": (85) call bpf_wq_set_callback_impl#") /* anchor message */
+__msg("off 1 doesn't point to 'struct bpf_wq' that is at 0")
+long test_wrong_wq_pointer_offset(void *ctx)
+{
+ int key = 0;
+ struct bpf_wq *wq;
+
+ wq = bpf_map_lookup_elem(&array, &key);
+ if (!wq)
+ return 1;
+
+ if (bpf_wq_init(wq, &array, 0))
+ return 2;
+
+ if (bpf_wq_set_callback((void *)wq + 1, wq_cb_sleepable, 0))
+ return 3;
+
+ return -22;
+}
--
2.44.0
again, copy/paste from bpf_timer_start().
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- dropped spinlock and use READ_ONCE() instead
---
kernel/bpf/helpers.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 85d3f483f27f..07550a657d7f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2711,6 +2711,23 @@ __bpf_kfunc int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags)
return __bpf_async_init(async, map, flags, BPF_ASYNC_TYPE_WQ);
}
+__bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
+{
+ struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
+ struct bpf_work *w;
+
+ if (in_nmi())
+ return -EOPNOTSUPP;
+ if (flags)
+ return -EINVAL;
+ w = READ_ONCE(async->work);
+ if (!w || !READ_ONCE(w->cb.prog))
+ return -EINVAL;
+
+ schedule_work(&w->work);
+ return 0;
+}
+
__bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
int (callback_fn)(void *map, int *key, struct bpf_wq *wq),
unsigned int flags,
@@ -2804,6 +2821,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_clone)
BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
BTF_ID_FLAGS(func, bpf_wq_init)
BTF_ID_FLAGS(func, bpf_wq_set_callback_impl)
+BTF_ID_FLAGS(func, bpf_wq_start)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
--
2.44.0
Allows to test if allocation/free works
Signed-off-by: Benjamin Tissoires <[email protected]>
---
changes in v2:
- remove #define CLOCK_MONOTONIC 1 leftover
---
tools/testing/selftests/bpf/bpf_experimental.h | 1 +
tools/testing/selftests/bpf/prog_tests/wq.c | 22 ++++++++++++++++++++++
tools/testing/selftests/bpf/progs/wq.c | 20 +++++++++++++++++---
3 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index b80b39f76034..93c5a6c446b3 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -471,6 +471,7 @@ extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
+extern int bpf_wq_start(struct bpf_wq *wq, unsigned int flags) __weak __ksym;
extern int bpf_wq_set_callback_impl(struct bpf_wq *wq,
int (callback_fn)(void *map, int *key, struct bpf_wq *wq),
unsigned int flags__k, void *aux__ign) __ksym;
diff --git a/tools/testing/selftests/bpf/prog_tests/wq.c b/tools/testing/selftests/bpf/prog_tests/wq.c
index 26ab69796103..8a4a91d944cc 100644
--- a/tools/testing/selftests/bpf/prog_tests/wq.c
+++ b/tools/testing/selftests/bpf/prog_tests/wq.c
@@ -6,9 +6,31 @@
void serial_test_wq(void)
{
+ struct wq *wq_skel = NULL;
+ int err, prog_fd;
+
LIBBPF_OPTS(bpf_test_run_opts, topts);
RUN_TESTS(wq);
+
+ /* re-run the success test to check if the timer was actually executed */
+
+ wq_skel = wq__open_and_load();
+ if (!ASSERT_OK_PTR(wq_skel, "wq_skel_load"))
+ return;
+
+ err = wq__attach(wq_skel);
+ if (!ASSERT_OK(err, "wq_attach"))
+ return;
+
+ prog_fd = bpf_program__fd(wq_skel->progs.test_syscall_array_sleepable);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "test_run");
+ ASSERT_EQ(topts.retval, 0, "test_run");
+
+ usleep(50); /* 10 usecs should be enough, but give it extra */
+
+ ASSERT_EQ(wq_skel->bss->ok_sleepable, (1 << 1), "ok_sleepable");
}
void serial_test_failures_wq(void)
diff --git a/tools/testing/selftests/bpf/progs/wq.c b/tools/testing/selftests/bpf/progs/wq.c
index f746296d32b1..8011a27a2fbd 100644
--- a/tools/testing/selftests/bpf/progs/wq.c
+++ b/tools/testing/selftests/bpf/progs/wq.c
@@ -49,12 +49,19 @@ struct {
__type(value, struct elem);
} lru SEC(".maps");
+__u32 ok;
+__u32 ok_sleepable;
+
static int test_elem_callback(void *map, int *key,
int (callback_fn)(void *map, int *key, struct bpf_wq *wq))
{
struct elem init = {}, *val;
struct bpf_wq *wq;
+ if ((ok & (1 << *key) ||
+ (ok_sleepable & (1 << *key))))
+ return -22;
+
if (map == &lru &&
bpf_map_update_elem(map, key, &init, 0))
return -1;
@@ -70,6 +77,9 @@ static int test_elem_callback(void *map, int *key,
if (bpf_wq_set_callback(wq, callback_fn, 0))
return -4;
+ if (bpf_wq_start(wq, 0))
+ return -5;
+
return 0;
}
@@ -79,6 +89,10 @@ static int test_hmap_elem_callback(void *map, int *key,
struct hmap_elem init = {}, *val;
struct bpf_wq *wq;
+ if ((ok & (1 << *key) ||
+ (ok_sleepable & (1 << *key))))
+ return -22;
+
if (bpf_map_update_elem(map, key, &init, 0))
return -1;
@@ -93,12 +107,12 @@ static int test_hmap_elem_callback(void *map, int *key,
if (bpf_wq_set_callback(wq, callback_fn, 0))
return -4;
+ if (bpf_wq_start(wq, 0))
+ return -5;
+
return 0;
}
-__u32 ok;
-__u32 ok_sleepable;
-
/* callback for non sleepable workqueue */
static int wq_callback(void *map, int *key, struct bpf_wq *work)
{
--
2.44.0
On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
>
> We need to teach the verifier about the second argument which is declared
> as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
> this extra case if we declared the second argument as struct bpf_map *,
> but that means users will have to do extra casting to have their program
> compile.
>
> We also need to duplicate the timer code for the checking if the map
> argument is matching the provided workqueue.
>
> Signed-off-by: Benjamin Tissoires <[email protected]>
>
> ---
>
> FWIW, I still have one concern with this implementation:
> - bpf_wq_work() access ->prog without protection, but I think this might
> be racing with bpf_wq_set_callback(): if we have the following:
>
> CPU 0 CPU 1
> bpf_wq_set_callback()
> bpf_start()
> bpf_wq_work():
> prog = cb->prog;
>
> bpf_wq_set_callback()
> cb->prog = prog;
> bpf_prog_put(prev)
> rcu_assign_ptr(cb->callback_fn,
> callback_fn);
> callback = READ_ONCE(w->cb.callback_fn);
>
> As I understand callback_fn is fine, prog might be, but we clearly
> have an inconstency between "prog" and "callback_fn" as they can come
> from 2 different bpf_wq_set_callback() calls.
>
> IMO we should protect this by the async->lock, but I'm not sure if
> it's OK or not.
I see the concern, but I think it's overkill.
Here 'prog' is used to pass it into __bpf_prog_enter_sleepable_recur()
to keep the standard pattern of calling into sleepable prog.
But it won't recurse.
We can open code migrate_disable,etc from there except this_cpu_inc_return,
but it's an overkill.
The passed 'prog' is irrelevant.
If somebody tries really hard by having two progs sharing the same
map with bpf_wq and racing to set_callback... I can see how
prog won't match callback, but it won't make a difference.
prog is not going trigger recursion check (unless somebody
tries is obsessed) and not going to UAF.
I imagine it's possible to attach somewhere in core wq callback
invocation path with fentry, set_callback to the same prog,
and technically it's kinda sorta recursion, but different subprogs,
so not a safety issue.
The code as-is is fine. imo.
On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> @@ -11018,6 +11027,7 @@ enum special_kfunc_type {
> KF_bpf_percpu_obj_drop_impl,
> KF_bpf_throw,
> KF_bpf_iter_css_task_new,
> + KF_bpf_wq_set_callback_impl,
> };
>
> BTF_SET_START(special_kfunc_set)
> @@ -11044,6 +11054,7 @@ BTF_ID(func, bpf_throw)
> #ifdef CONFIG_CGROUPS
> BTF_ID(func, bpf_iter_css_task_new)
> #endif
> +BTF_ID(func, bpf_wq_set_callback_impl)
> BTF_SET_END(special_kfunc_set)
>
> BTF_ID_LIST(special_kfunc_list)
> @@ -11074,6 +11085,7 @@ BTF_ID(func, bpf_iter_css_task_new)
> #else
> BTF_ID_UNUSED
> #endif
> +BTF_ID(func, bpf_wq_set_callback_impl)
This is broken on !CONFIG_CGROUPS.
KF_bpf_wq_set_callback_impl in enum special_kfunc_type won't
match special_kfunc_list.
I moved this line up while applying.
>
> static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
> {
> @@ -11402,12 +11414,28 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)
> return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
> }
>
> +static bool is_async_callback_calling_kfunc(u32 btf_id)
> +{
> + return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
> +}
> +
> static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
> {
> return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
> insn->imm == special_kfunc_list[KF_bpf_throw];
> }
>
> +static bool is_bpf_wq_set_callback_impl_kfunc(u32 btf_id)
> +{
> + return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
> +}
> +
> +static bool is_callback_calling_kfunc(u32 btf_id)
> +{
> + return is_sync_callback_calling_kfunc(btf_id) ||
> + is_async_callback_calling_kfunc(btf_id);
> +}
> +
> static bool is_rbtree_lock_required_kfunc(u32 btf_id)
> {
> return is_bpf_rbtree_api_kfunc(btf_id);
> @@ -12219,6 +12247,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> }
> }
>
> + if (is_bpf_wq_set_callback_impl_kfunc(meta.func_id)) {
> + err = push_callback_call(env, insn, insn_idx, meta.subprogno,
> + set_timer_callback_state);
> + if (err) {
> + verbose(env, "kfunc %s#%d failed callback verification\n",
> + func_name, meta.func_id);
> + return err;
> + }
> + }
> +
> rcu_lock = is_kfunc_bpf_rcu_read_lock(&meta);
> rcu_unlock = is_kfunc_bpf_rcu_read_unlock(&meta);
>
> @@ -16982,6 +17020,9 @@ static bool states_equal(struct bpf_verifier_env *env,
> if (old->active_rcu_lock != cur->active_rcu_lock)
> return false;
>
> + if (old->in_sleepable != cur->in_sleepable)
> + return false;
> +
> /* for states to be equal callsites have to be the same
> * and all frame states need to be equivalent
> */
> @@ -19653,6 +19694,28 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> *cnt = 1;
> + } else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
> + /* The verifier will process callback_fn as many times as necessary
> + * with different maps and the register states prepared by
> + * set_timer_callback_state will be accurate.
> + *
> + * The following use case is valid:
> + * map1 is shared by prog1, prog2, prog3.
> + * prog1 calls bpf_timer_init for some map1 elements
> + * prog2 calls bpf_timer_set_callback for some map1 elements.
> + * Those that were not bpf_timer_init-ed will return -EINVAL.
> + * prog3 calls bpf_timer_start for some map1 elements.
> + * Those that were not both bpf_timer_init-ed and
> + * bpf_timer_set_callback-ed will return -EINVAL.
> + */
Also removed this comment. It's not correct here.
> + struct bpf_insn ld_addrs[2] = {
> + BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux),
> + };
> +
> + insn_buf[0] = ld_addrs[0];
> + insn_buf[1] = ld_addrs[1];
> + insn_buf[2] = *insn;
> + *cnt = 3;
> }
> return 0;
> }
>
> --
> 2.44.0
>
On Sat, Apr 20, 2024 at 2:09 AM Benjamin Tissoires <[email protected]> wrote:
>
>
> -static void htab_map_free_timers(struct bpf_map *map)
> +static void htab_map_free_timers_and_wq(struct bpf_map *map)
> {
> struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>
> - /* We only free timer on uref dropping to zero */
> - if (!btf_record_has_field(htab->map.record, BPF_TIMER))
> - return;
> - if (!htab_is_prealloc(htab))
> - htab_free_malloced_timers(htab);
> - else
> - htab_free_prealloced_timers(htab);
> + /* We only free timer and workqueue on uref dropping to zero */
> + if (btf_record_has_field(htab->map.record, BPF_TIMER)) {
> + if (!htab_is_prealloc(htab))
> + htab_free_malloced_timers_or_wq(htab, true);
> + else
> + htab_free_prealloced_timers(htab);
> + }
> + if (btf_record_has_field(htab->map.record, BPF_WORKQUEUE)) {
> + if (!htab_is_prealloc(htab))
> + htab_free_malloced_timers_or_wq(htab, false);
> + else
> + htab_free_prealloced_wq(htab);
> + }
I missed this earlier, but pls follow up to improve this part.
Double walk of all elements once for timer and 2nd for wq
is inefficient. Better to handle it in one go in prealloc and malloced cases.
Until we have delayed_wq it's quite possible that somebody will
have both timers and wq in map elements.
On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
>
> +
> +SEC("tc")
> +/* test that workqueues can be used from an array
> + */
> +__retval(0)
> +long test_call_array_sleepable(void *ctx)
> +{
> + int key = 0;
> +
> + return test_elem_callback(&array, &key);
> +}
> +
> +SEC("syscall")
> +/* Same test than above but from a sleepable context.
> + */
I adjusted all these comments to be normal single line comments.
The main thing to follow up is to figure out how to wait for wq
to finish.
Since we have this message:
/test_progs -t wq
..
#521 wq:OK
Summary: 2/9 PASSED, 0 SKIPPED, 0 FAILED
Failed to unload bpf_testmod.ko from kernel: -11
and doing manual "rmmod bpf_testmod"
after the test run is quite annoying.
Overall great stuff. Thank you for pushing it through.
Much appreciated!
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <[email protected]>:
On Sat, 20 Apr 2024 11:09:00 +0200 you wrote:
> This is a followup of sleepable bpf_timer[0].
>
> When discussing sleepable bpf_timer, it was thought that we should give
> a try to bpf_wq, as the 2 APIs are similar but distinct enough to
> justify a new one.
>
> So here it is.
>
> [...]
Here is the summary with links:
- [bpf-next,v2,01/16] bpf: make timer data struct more generic
https://git.kernel.org/bpf/bpf-next/c/be2749beff62
- [bpf-next,v2,02/16] bpf: replace bpf_timer_init with a generic helper
https://git.kernel.org/bpf/bpf-next/c/56b4a177ae63
- [bpf-next,v2,03/16] bpf: replace bpf_timer_set_callback with a generic helper
https://git.kernel.org/bpf/bpf-next/c/073f11b02643
- [bpf-next,v2,04/16] bpf: replace bpf_timer_cancel_and_free with a generic helper
https://git.kernel.org/bpf/bpf-next/c/fc22d9495f0b
- [bpf-next,v2,05/16] bpf: add support for bpf_wq user type
https://git.kernel.org/bpf/bpf-next/c/d56b63cf0c0f
- [bpf-next,v2,06/16] tools: sync include/uapi/linux/bpf.h
https://git.kernel.org/bpf/bpf-next/c/f1d0a2fbb008
- [bpf-next,v2,07/16] bpf: verifier: bail out if the argument is not a map
https://git.kernel.org/bpf/bpf-next/c/ad2c03e691be
- [bpf-next,v2,08/16] bpf: add support for KF_ARG_PTR_TO_WORKQUEUE
https://git.kernel.org/bpf/bpf-next/c/d940c9b94d7e
- [bpf-next,v2,09/16] bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps
https://git.kernel.org/bpf/bpf-next/c/246331e3f1ea
- [bpf-next,v2,10/16] selftests/bpf: add bpf_wq tests
https://git.kernel.org/bpf/bpf-next/c/b4abee7c1ae3
- [bpf-next,v2,11/16] bpf: wq: add bpf_wq_init
https://git.kernel.org/bpf/bpf-next/c/eb48f6cd41a0
- [bpf-next,v2,12/16] selftests/bpf: wq: add bpf_wq_init() checks
https://git.kernel.org/bpf/bpf-next/c/e3d9eac99afd
- [bpf-next,v2,13/16] bpf: wq: add bpf_wq_set_callback_impl
https://git.kernel.org/bpf/bpf-next/c/81f1d7a583fa
- [bpf-next,v2,14/16] selftests/bpf: add checks for bpf_wq_set_callback()
https://git.kernel.org/bpf/bpf-next/c/01b7b1c5f3cc
- [bpf-next,v2,15/16] bpf: add bpf_wq_start
https://git.kernel.org/bpf/bpf-next/c/8e83da9732d9
- [bpf-next,v2,16/16] selftests/bpf: wq: add bpf_wq_start() checks
https://git.kernel.org/bpf/bpf-next/c/8290dba51910
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
On Tue, Apr 23, 2024 at 7:55 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> >
> > We need to teach the verifier about the second argument which is declared
> > as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
> > this extra case if we declared the second argument as struct bpf_map *,
> > but that means users will have to do extra casting to have their program
> > compile.
> >
> > We also need to duplicate the timer code for the checking if the map
> > argument is matching the provided workqueue.
> >
> > Signed-off-by: Benjamin Tissoires <[email protected]>
> >
> > ---
> >
> > FWIW, I still have one concern with this implementation:
> > - bpf_wq_work() access ->prog without protection, but I think this might
> > be racing with bpf_wq_set_callback(): if we have the following:
> >
> > CPU 0 CPU 1
> > bpf_wq_set_callback()
> > bpf_start()
> > bpf_wq_work():
> > prog = cb->prog;
> >
> > bpf_wq_set_callback()
> > cb->prog = prog;
> > bpf_prog_put(prev)
> > rcu_assign_ptr(cb->callback_fn,
> > callback_fn);
> > callback = READ_ONCE(w->cbcallback_fn);
> >
> > As I understand callback_fn is fine, prog might be, but we clearly
> > have an inconstency between "prog" and "callback_fn" as they can come
> > from 2 different bpf_wq_set_callback() calls.
> >
> > IMO we should protect this by the async->lock, but I'm not sure if
> > it's OK or not.
>
> I see the concern, but I think it's overkill.
> Here 'prog' is used to pass it into __bpf_prog_enter_sleepable_recur()
> to keep the standard pattern of calling into sleepable prog.
> But it won't recurse.
> We can open code migrate_disable,etc from there except this_cpu_inc_return,
> but it's an overkill.
> The passed 'prog' is irrelevant.
> If somebody tries really hard by having two progs sharing the same
> map with bpf_wq and racing to set_callback... I can see how
> prog won't match callback, but it won't make a difference.
> prog is not going trigger recursion check (unless somebody
> tries is obsessed) and not going to UAF.
> I imagine it's possible to attach somewhere in core wq callback
> invocation path with fentry, set_callback to the same prog,
> and technically it's kinda sorta recursion, but different subprogs,
> so not a safety issue.
> The code as-is is fine. imo.
After sleeping on it, I realized that the use of
__bpf_prog_enter_sleepable_recur() here is very much incorrect :(
The tests are passing only because we don't inc prog->active
when we run the prog via prog_run cmd.
Adding the following:
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2..0732dfe22204 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1514,7 +1514,9 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
}
rcu_read_lock_trace();
+ this_cpu_inc_return(*(prog->active));
retval = bpf_prog_run_pin_on_cpu(prog, ctx);
+ this_cpu_dec(*(prog->active));
rcu_read_unlock_trace();
makes the test fail sporadically.
Or 100% fail when the kernel is booted with 1 cpu.
Could you send a quick follow up to
replace __bpf_prog_enter_sleepable_recur() with
rcu_read_lock_trace();
migrate_disable();
?
Or I'll do it in an hour or so.
On Wed, Apr 24, 2024 at 8:06 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Tue, Apr 23, 2024 at 7:55 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> > >
> > > We need to teach the verifier about the second argument which is declared
> > > as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
> > > this extra case if we declared the second argument as struct bpf_map *,
> > > but that means users will have to do extra casting to have their program
> > > compile.
> > >
> > > We also need to duplicate the timer code for the checking if the map
> > > argument is matching the provided workqueue.
> > >
> > > Signed-off-by: Benjamin Tissoires <[email protected]>
> > >
> > > ---
> > >
> > > FWIW, I still have one concern with this implementation:
> > > - bpf_wq_work() access ->prog without protection, but I think this might
> > > be racing with bpf_wq_set_callback(): if we have the following:
> > >
> > > CPU 0 CPU 1
> > > bpf_wq_set_callback()
> > > bpf_start()
> > > bpf_wq_work():
> > > prog = cb->prog;
> > >
> > > bpf_wq_set_callback()
> > > cb->prog = prog;
> > > bpf_prog_put(prev)
> > > rcu_assign_ptr(cb->callback_fn,
> > > callback_fn);
> > > callback = READ_ONCE(w->cb.callback_fn);
> > >
> > > As I understand callback_fn is fine, prog might be, but we clearly
> > > have an inconstency between "prog" and "callback_fn" as they can come
> > > from 2 different bpf_wq_set_callback() calls.
> > >
> > > IMO we should protect this by the async->lock, but I'm not sure if
> > > it's OK or not.
> >
> > I see the concern, but I think it's overkill.
> > Here 'prog' is used to pass it into __bpf_prog_enter_sleepable_recur()
> > to keep the standard pattern of calling into sleepable prog.
> > But it won't recurse.
> > We can open code migrate_disable,etc from there except this_cpu_inc_return,
> > but it's an overkill.
> > The passed 'prog' is irrelevant.
> > If somebody tries really hard by having two progs sharing the same
> > map with bpf_wq and racing to set_callback... I can see how
> > prog won't match callback, but it won't make a difference.
> > prog is not going trigger recursion check (unless somebody
> > tries is obsessed) and not going to UAF.
> > I imagine it's possible to attach somewhere in core wq callback
> > invocation path with fentry, set_callback to the same prog,
> > and technically it's kinda sorta recursion, but different subprogs,
> > so not a safety issue.
> > The code as-is is fine. imo.
>
> After sleeping on it, I realized that the use of
> __bpf_prog_enter_sleepable_recur() here is very much incorrect :(
> The tests are passing only because we don't inc prog->active
> when we run the prog via prog_run cmd.
> Adding the following:
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index f6aad4ed2ab2..0732dfe22204 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -1514,7 +1514,9 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
> }
>
> rcu_read_lock_trace();
> + this_cpu_inc_return(*(prog->active));
> retval = bpf_prog_run_pin_on_cpu(prog, ctx);
> + this_cpu_dec(*(prog->active));
> rcu_read_unlock_trace();
>
> makes the test fail sporadically.
> Or 100% fail when the kernel is booted with 1 cpu.
>
> Could you send a quick follow up to
> replace __bpf_prog_enter_sleepable_recur() with
> rcu_read_lock_trace();
> migrate_disable();
> ?
>
> Or I'll do it in an hour or so.
Considering two broken-build reports already
I've applied the following fix:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=dc92febf7b93da5049fe177804e6b1961fcc6bd7
that addresses the build issue on !JIT and fixes this recursion problem.
On Tue, Apr 23, 2024 at 7:56 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> >
> > +
> > +SEC("tc")
> > +/* test that workqueues can be used from an array
> > + */
> > +__retval(0)
> > +long test_call_array_sleepable(void *ctx)
> > +{
> > + int key = 0;
> > +
> > + return test_elem_callback(&array, &key);
> > +}
> > +
> > +SEC("syscall")
> > +/* Same test than above but from a sleepable context.
> > + */
>
> I adjusted all these comments to be normal single line comments.
>
> The main thing to follow up is to figure out how to wait for wq
> to finish.
> Since we have this message:
> ./test_progs -t wq
> ...
> #521 wq:OK
> Summary: 2/9 PASSED, 0 SKIPPED, 0 FAILED
> Failed to unload bpf_testmod.ko from kernel: -11
>
> and doing manual "rmmod bpf_testmod"
> after the test run is quite annoying.
>
> Overall great stuff. Thank you for pushing it through.
> Much appreciated!
Hi Benjamin,
I've ended up fixing it as well,
because it was causing CI issues.
See
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=82e38a505c9868e784ec31e743fd8a9fa5ca1084
On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
>
> Allows to test if allocation/free works
>
> Signed-off-by: Benjamin Tissoires <[email protected]>
>
> ---
>
> changes in v2:
> - dropped mark_precise checks
> ---
> tools/testing/selftests/bpf/bpf_experimental.h | 1 +
> tools/testing/selftests/bpf/prog_tests/wq.c | 8 +++
> tools/testing/selftests/bpf/progs/wq.c | 10 ++++
> tools/testing/selftests/bpf/progs/wq_failures.c | 78 +++++++++++++++++++++++++
> 4 files changed, 97 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index 3329ea080865..785b91b629be 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -470,4 +470,5 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it,
> extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
> extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
>
> +extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
> #endif
> diff --git a/tools/testing/selftests/bpf/prog_tests/wq.c b/tools/testing/selftests/bpf/prog_tests/wq.c
> index 9a07b8bc2c52..26ab69796103 100644
> --- a/tools/testing/selftests/bpf/prog_tests/wq.c
> +++ b/tools/testing/selftests/bpf/prog_tests/wq.c
> @@ -2,6 +2,7 @@
> /* Copyright (c) 2024 Benjamin Tissoires */
> #include <test_progs.h>
> #include "wq.skel.h"
> +#include "wq_failures.skel.h"
>
> void serial_test_wq(void)
> {
> @@ -9,3 +10,10 @@ void serial_test_wq(void)
>
> RUN_TESTS(wq);
> }
> +
> +void serial_test_failures_wq(void)
> +{
> + LIBBPF_OPTS(bpf_test_run_opts, topts);
> +
unused leftover?
> + RUN_TESTS(wq_failures);
> +}
[...]
On Apr 24 2024, Alexei Starovoitov wrote:
> On Wed, Apr 24, 2024 at 8:06 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Tue, Apr 23, 2024 at 7:55 PM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> > > >
> > > > We need to teach the verifier about the second argument which is declared
> > > > as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
> > > > this extra case if we declared the second argument as struct bpf_map *,
> > > > but that means users will have to do extra casting to have their program
> > > > compile.
> > > >
> > > > We also need to duplicate the timer code for the checking if the map
> > > > argument is matching the provided workqueue.
> > > >
> > > > Signed-off-by: Benjamin Tissoires <[email protected]>
> > > >
> > > > ---
> > > >
> > > > FWIW, I still have one concern with this implementation:
> > > > - bpf_wq_work() access ->prog without protection, but I think this might
> > > > be racing with bpf_wq_set_callback(): if we have the following:
> > > >
> > > > CPU 0 CPU 1
> > > > bpf_wq_set_callback()
> > > > bpf_start()
> > > > bpf_wq_work():
> > > > prog = cb->prog;
> > > >
> > > > bpf_wq_set_callback()
> > > > cb->prog = prog;
> > > > bpf_prog_put(prev)
> > > > rcu_assign_ptr(cb->callback_fn,
> > > > callback_fn);
> > > > callback = READ_ONCE(w->cb.callback_fn);
> > > >
> > > > As I understand callback_fn is fine, prog might be, but we clearly
> > > > have an inconstency between "prog" and "callback_fn" as they can come
> > > > from 2 different bpf_wq_set_callback() calls.
> > > >
> > > > IMO we should protect this by the async->lock, but I'm not sure if
> > > > it's OK or not.
> > >
> > > I see the concern, but I think it's overkill.
> > > Here 'prog' is used to pass it into __bpf_prog_enter_sleepable_recur()
> > > to keep the standard pattern of calling into sleepable prog.
> > > But it won't recurse.
> > > We can open code migrate_disable,etc from there except this_cpu_inc_return,
> > > but it's an overkill.
> > > The passed 'prog' is irrelevant.
> > > If somebody tries really hard by having two progs sharing the same
> > > map with bpf_wq and racing to set_callback... I can see how
> > > prog won't match callback, but it won't make a difference.
> > > prog is not going trigger recursion check (unless somebody
> > > tries is obsessed) and not going to UAF.
> > > I imagine it's possible to attach somewhere in core wq callback
> > > invocation path with fentry, set_callback to the same prog,
> > > and technically it's kinda sorta recursion, but different subprogs,
> > > so not a safety issue.
> > > The code as-is is fine. imo.
> >
> > After sleeping on it, I realized that the use of
> > __bpf_prog_enter_sleepable_recur() here is very much incorrect :(
> > The tests are passing only because we don't inc prog->active
> > when we run the prog via prog_run cmd.
> > Adding the following:
> > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> > index f6aad4ed2ab2..0732dfe22204 100644
> > --- a/net/bpf/test_run.c
> > +++ b/net/bpf/test_run.c
> > @@ -1514,7 +1514,9 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
> > }
> >
> > rcu_read_lock_trace();
> > + this_cpu_inc_return(*(prog->active));
> > retval = bpf_prog_run_pin_on_cpu(prog, ctx);
> > + this_cpu_dec(*(prog->active));
> > rcu_read_unlock_trace();
> >
> > makes the test fail sporadically.
> > Or 100% fail when the kernel is booted with 1 cpu.
> >
> > Could you send a quick follow up to
> > replace __bpf_prog_enter_sleepable_recur() with
> > rcu_read_lock_trace();
> > migrate_disable();
> > ?
> >
> > Or I'll do it in an hour or so.
>
> Considering two broken-build reports already
> I've applied the following fix:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=dc92febf7b93da5049fe177804e6b1961fcc6bd7
>
> that addresses the build issue on !JIT and fixes this recursion problem.
Thanks a lot for fixing this on my behalf. I was slightly puzzled by the
broken-build report but really I wasn't in position to think straight
yesterday (got a pretty big muscular pain in the back and had to go to
the doctor to get painkillers)
I haven't forgoten about the double list walk in 9/16, I'll hopefully
send the fix today.
Cheers,
Benjamin
On Apr 24 2024, Andrii Nakryiko wrote:
> On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <[email protected]> wrote:
> >
> > Allows to test if allocation/free works
> >
> > Signed-off-by: Benjamin Tissoires <[email protected]>
> >
> > ---
> >
> > changes in v2:
> > - dropped mark_precise checks
> > ---
> > tools/testing/selftests/bpf/bpf_experimental.h | 1 +
> > tools/testing/selftests/bpf/prog_tests/wq.c | 8 +++
> > tools/testing/selftests/bpf/progs/wq.c | 10 ++++
> > tools/testing/selftests/bpf/progs/wq_failures.c | 78 +++++++++++++++++++++++++
> > 4 files changed, 97 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> > index 3329ea080865..785b91b629be 100644
> > --- a/tools/testing/selftests/bpf/bpf_experimental.h
> > +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> > @@ -470,4 +470,5 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it,
> > extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
> > extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
> >
> > +extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
> > #endif
> > diff --git a/tools/testing/selftests/bpf/prog_tests/wq.c b/tools/testing/selftests/bpf/prog_tests/wq.c
> > index 9a07b8bc2c52..26ab69796103 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/wq.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/wq.c
> > @@ -2,6 +2,7 @@
> > /* Copyright (c) 2024 Benjamin Tissoires */
> > #include <test_progs.h>
> > #include "wq.skel.h"
> > +#include "wq_failures.skel.h"
> >
> > void serial_test_wq(void)
> > {
> > @@ -9,3 +10,10 @@ void serial_test_wq(void)
> >
> > RUN_TESTS(wq);
> > }
> > +
> > +void serial_test_failures_wq(void)
> > +{
> > + LIBBPF_OPTS(bpf_test_run_opts, topts);
> > +
>
> unused leftover?
Oops, yeah. Looks like it.
I'll send a removal of it while fixing 9/16
Cheers,
Benjamin
>
> > + RUN_TESTS(wq_failures);
> > +}
>
> [...]