Changes since v2:
* Simplify the branch condition on [3/4] (peterz)
* Rebase [4/4] accordingly
Frederic Weisbecker (4):
task_work: s/task_work_cancel()/task_work_cancel_func()/
task_work: Introduce task_work_cancel() again
perf: Fix event leak upon exit
perf: Fix event leak upon exec and file release
include/linux/perf_event.h | 1 +
include/linux/task_work.h | 3 ++-
kernel/events/core.c | 49 +++++++++++++++++++++++++++++---------
kernel/irq/manage.c | 2 +-
kernel/task_work.c | 34 ++++++++++++++++++++++----
security/keys/keyctl.c | 2 +-
6 files changed, 72 insertions(+), 19 deletions(-)
--
2.34.1
A proper task_work_cancel() API that actually cancels a callback and not
*any* callback pointing to a given function is going to be needed for
perf event freeing. Do the appropriate rename to prepare for that.
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/task_work.h | 2 +-
kernel/irq/manage.c | 2 +-
kernel/task_work.c | 10 +++++-----
security/keys/keyctl.c | 2 +-
4 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..23ab01ae185e 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -30,7 +30,7 @@ int task_work_add(struct task_struct *task, struct callback_head *twork,
struct callback_head *task_work_cancel_match(struct task_struct *task,
bool (*match)(struct callback_head *, void *data), void *data);
-struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
+struct callback_head *task_work_cancel_func(struct task_struct *, task_work_func_t);
void task_work_run(void);
static inline void exit_task_work(struct task_struct *task)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index bf9ae8a8686f..ab767e62b19a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1333,7 +1333,7 @@ static int irq_thread(void *data)
* synchronize_hardirq(). So neither IRQTF_RUNTHREAD nor the
* oneshot mask bit can be set.
*/
- task_work_cancel(current, irq_thread_dtor);
+ task_work_cancel_func(current, irq_thread_dtor);
return 0;
}
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..54ac24059daa 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -120,9 +120,9 @@ static bool task_work_func_match(struct callback_head *cb, void *data)
}
/**
- * task_work_cancel - cancel a pending work added by task_work_add()
- * @task: the task which should execute the work
- * @func: identifies the work to remove
+ * task_work_cancel_func - cancel a pending work matching a function added by task_work_add()
+ * @task: the task which should execute the func's work
+ * @func: identifies the func to match with a work to remove
*
* Find the last queued pending work with ->func == @func and remove
* it from queue.
@@ -131,7 +131,7 @@ static bool task_work_func_match(struct callback_head *cb, void *data)
* The found work or NULL if not found.
*/
struct callback_head *
-task_work_cancel(struct task_struct *task, task_work_func_t func)
+task_work_cancel_func(struct task_struct *task, task_work_func_t func)
{
return task_work_cancel_match(task, task_work_func_match, func);
}
@@ -168,7 +168,7 @@ void task_work_run(void)
if (!work)
break;
/*
- * Synchronize with task_work_cancel(). It can not remove
+ * Synchronize with task_work_cancel_match(). It can not remove
* the first entry == work, cmpxchg(task_works) must fail.
* But it can remove another entry from the ->next list.
*/
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 10ba439968f7..3aff32a2bcf3 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1693,7 +1693,7 @@ long keyctl_session_to_parent(void)
goto unlock;
/* cancel an already pending keyring replacement */
- oldwork = task_work_cancel(parent, key_change_session_keyring);
+ oldwork = task_work_cancel_func(parent, key_change_session_keyring);
/* the replacement session keyring is applied just prior to userspace
* restarting */
--
2.34.1
Re-introduce task_work_cancel(), this time to cancel an actual callback
and not *any* callback pointing to a given function. This is going to be
needed for perf event freeing.
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/task_work.h | 1 +
kernel/task_work.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 23ab01ae185e..26b8a47f41fc 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -31,6 +31,7 @@ int task_work_add(struct task_struct *task, struct callback_head *twork,
struct callback_head *task_work_cancel_match(struct task_struct *task,
bool (*match)(struct callback_head *, void *data), void *data);
struct callback_head *task_work_cancel_func(struct task_struct *, task_work_func_t);
+bool task_work_cancel(struct task_struct *task, struct callback_head *cb);
void task_work_run(void);
static inline void exit_task_work(struct task_struct *task)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 54ac24059daa..2134ac8057a9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -136,6 +136,30 @@ task_work_cancel_func(struct task_struct *task, task_work_func_t func)
return task_work_cancel_match(task, task_work_func_match, func);
}
+static bool task_work_match(struct callback_head *cb, void *data)
+{
+ return cb == data;
+}
+
+/**
+ * task_work_cancel - cancel a pending work added by task_work_add()
+ * @task: the task which should execute the work
+ * @cb: the callback to remove if queued
+ *
+ * Remove a callback from a task's queue if queued.
+ *
+ * RETURNS:
+ * True if the callback was queued and got cancelled, false otherwise.
+ */
+bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
+{
+ struct callback_head *ret;
+
+ ret = task_work_cancel_match(task, task_work_match, cb);
+
+ return ret == cb;
+}
+
/**
* task_work_run - execute the works added by task_work_add()
*
--
2.34.1
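To make the difference between the two APIs above concrete, here is a
minimal usage sketch (hypothetical caller and names, not part of the
series): task_work_cancel_func() removes the last queued work whose
->func matches, while the re-introduced task_work_cancel() removes one
specific callback_head instance.

#include <linux/task_work.h>

static void my_twork_func(struct callback_head *cb)
{
        /* Deferred work, runs on the task's return to user space. */
}

static struct callback_head work_a, work_b;

static void queue_then_cancel(struct task_struct *task)
{
        init_task_work(&work_a, my_twork_func);
        init_task_work(&work_b, my_twork_func);

        if (task_work_add(task, &work_a, TWA_RESUME))
                return;
        if (task_work_add(task, &work_b, TWA_RESUME))
                return;

        /* Removes the last queued work with ->func == my_twork_func (work_b). */
        task_work_cancel_func(task, my_twork_func);

        /* Removes exactly work_a, regardless of which function it points to. */
        task_work_cancel(task, &work_a);
}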
The perf pending task work is never waited for when the matching event
is released. In the case of a child event, released via free_event()
directly, this can potentially result in a leaked event, such as in the
following scenario, which doesn't even require a weak IRQ work
implementation to trigger:
schedule()
prepare_task_switch()
=======> <NMI>
perf_event_overflow()
event->pending_sigtrap = ...
irq_work_queue(&event->pending_irq)
<======= </NMI>
perf_event_task_sched_out()
event_sched_out()
event->pending_sigtrap = 0;
atomic_long_inc_not_zero(&event->refcount)
task_work_add(&event->pending_task)
finish_lock_switch()
=======> <IRQ>
perf_pending_irq()
//do nothing, rely on pending task work
<======= </IRQ>
begin_new_exec()
perf_event_exit_task()
perf_event_exit_event()
// If is child event
free_event()
WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
// event is leaked
Similar scenarios can also happen with perf_event_remove_on_exec() or
simply against concurrent perf_event_release().
Fix this by synchronizing against any remaining pending task work while
freeing the event, just as is done with remaining pending IRQ work. This
means that the pending task callback neither needs nor should hold a
reference to the event, as such a reference would prevent the event from
ever being freed.
Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++++----
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a..89ae41bb5f70 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -786,6 +786,7 @@ struct perf_event {
struct irq_work pending_irq;
struct callback_head pending_task;
unsigned int pending_work;
+ struct rcuwait pending_work_wait;
atomic_t event_limit;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f2a366e736a4..3c6a8ad3c6f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2288,7 +2288,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
if (state != PERF_EVENT_STATE_OFF &&
!event->pending_work &&
!task_work_add(current, &event->pending_task, TWA_RESUME)) {
- WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
event->pending_work = 1;
} else {
local_dec(&event->ctx->nr_pending);
@@ -5184,9 +5183,35 @@ static bool exclusive_event_installable(struct perf_event *event,
static void perf_addr_filters_splice(struct perf_event *event,
struct list_head *head);
+static void perf_pending_task_sync(struct perf_event *event)
+{
+ struct callback_head *head = &event->pending_task;
+
+ if (!event->pending_work)
+ return;
+ /*
+ * If the task is queued to the current task's queue, we
+ * obviously can't wait for it to complete. Simply cancel it.
+ */
+ if (task_work_cancel(current, head)) {
+ event->pending_work = 0;
+ local_dec(&event->ctx->nr_pending);
+ return;
+ }
+
+ /*
+ * All accesses related to the event are within the same
+ * non-preemptible section in perf_pending_task(). The RCU
+ * grace period before the event is freed will make sure all
+ * those accesses are complete by then.
+ */
+ rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
+}
+
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
+ perf_pending_task_sync(event);
unaccount_event(event);
@@ -6804,24 +6829,28 @@ static void perf_pending_task(struct callback_head *head)
struct perf_event *event = container_of(head, struct perf_event, pending_task);
int rctx;
+ /*
+ * All accesses to the event must belong to the same implicit RCU read-side
+ * critical section as the ->pending_work reset. See comment in
+ * perf_pending_task_sync().
+ */
+ preempt_disable_notrace();
/*
* If we 'fail' here, that's OK, it means recursion is already disabled
* and we won't recurse 'further'.
*/
- preempt_disable_notrace();
rctx = perf_swevent_get_recursion_context();
if (event->pending_work) {
event->pending_work = 0;
perf_sigtrap(event);
local_dec(&event->ctx->nr_pending);
+ rcuwait_wake_up(&event->pending_work_wait);
}
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
preempt_enable_notrace();
-
- put_event(event);
}
#ifdef CONFIG_GUEST_PERF_EVENTS
@@ -11929,6 +11958,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
init_task_work(&event->pending_task, perf_pending_task);
+ rcuwait_init(&event->pending_work_wait);
mutex_init(&event->mmap_mutex);
raw_spin_lock_init(&event->addr_filters.lock);
--
2.34.1
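For reference, the rcuwait pairing the fix relies on, reduced to a
self-contained sketch with hypothetical names (the real code lives in
perf_pending_task() and perf_pending_task_sync() above): the callback
clears the flag and wakes inside a non-preemptible, implicitly RCU
read-side section, while the free path blocks until the flag is clear.

#include <linux/rcuwait.h>
#include <linux/sched.h>

struct my_obj {
        unsigned int pending;   /* plays the role of event->pending_work */
        struct rcuwait wq;      /* plays the role of event->pending_work_wait */
};

static void my_obj_init(struct my_obj *obj)
{
        obj->pending = 1;
        rcuwait_init(&obj->wq);
}

/* Deferred callback side: clear the flag, then wake any waiter. */
static void my_obj_complete(struct my_obj *obj)
{
        preempt_disable();      /* implicit RCU read-side critical section */
        obj->pending = 0;
        rcuwait_wake_up(&obj->wq);
        preempt_enable();
}

/* Free side: block until the deferred callback is done with @obj. */
static void my_obj_sync(struct my_obj *obj)
{
        if (!obj->pending)
                return;
        rcuwait_wait_event(&obj->wq, !obj->pending, TASK_UNINTERRUPTIBLE);
}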
When a task is scheduled out, pending sigtrap deliveries are deferred
to the target task upon resume to userspace via task_work.

However, failures while adding an event's callback to the task_work
engine are ignored. And since the last call for events exit happens
after task work is eventually closed, there is a small window during
which a pending sigtrap can be queued though ignored, leaking the event
refcount addition, such as in the following scenario:
TASK A
-----
do_exit()
exit_task_work(tsk);
<IRQ>
perf_event_overflow()
event->pending_sigtrap = pending_id;
irq_work_queue(&event->pending_irq);
</IRQ>
=========> PREEMPTION: TASK A -> TASK B
event_sched_out()
event->pending_sigtrap = 0;
atomic_long_inc_not_zero(&event->refcount)
// FAILS: task work has exited
task_work_add(&event->pending_task)
[...]
<IRQ WORK>
perf_pending_irq()
// early return: event->oncpu = -1
</IRQ WORK>
[...]
=========> TASK B -> TASK A
perf_event_exit_task(tsk)
perf_event_exit_event()
free_event()
WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
// leak event due to unexpected refcount == 2
As a result the event is never released while the task exits.

Fix this with appropriate task_work_add() error handling.
Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/events/core.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..f2a366e736a4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2284,18 +2284,15 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
}
if (event->pending_sigtrap) {
- bool dec = true;
-
event->pending_sigtrap = 0;
if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work) {
- event->pending_work = 1;
- dec = false;
+ !event->pending_work &&
+ !task_work_add(current, &event->pending_task, TWA_RESUME)) {
WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
- task_work_add(current, &event->pending_task, TWA_RESUME);
- }
- if (dec)
+ event->pending_work = 1;
+ } else {
local_dec(&event->ctx->nr_pending);
+ }
}
perf_event_set_state(event, state);
--
2.34.1
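The pivot of this fix is that task_work_add() can fail (returning
-ESRCH) once the target task has run exit_task_work(). A minimal sketch
of the defensive pattern, with hypothetical names:

#include <linux/task_work.h>
#include <asm/local.h>

static void queue_deferred(struct task_struct *task,
                           struct callback_head *work,
                           local_t *nr_pending)
{
        /*
         * Only commit bookkeeping (flags, refcounts) once the work is
         * actually queued: -ESRCH here means the task already ran
         * exit_task_work() and the callback will never execute.
         */
        if (!task_work_add(task, work, TWA_RESUME))
                return;                 /* queued: the callback will decrement */

        local_dec(nr_pending);          /* failed: undo immediately, as in the patch */
}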
On Thu, May 16, 2024 at 04:09:32PM +0200, Frederic Weisbecker wrote:
> Changes since v2:
>
> * Simplify the branch condition on [3/4] (peterz)
> * Rebase [4/4] accordingly
>
> Frederic Weisbecker (4):
> task_work: s/task_work_cancel()/task_work_cancel_func()/
> task_work: Introduce task_work_cancel() again
> perf: Fix event leak upon exit
> perf: Fix event leak upon exec and file release
Trying this on linux-rt-devel/6.10.y-rt, after reverting Sebastian's
series that clashes with this patch series:
[acme@nine linux]$ git log --oneline -5
4de7b8e17201 (HEAD -> linux-rt-devel-6.10.y-rt-sigtrap-fix-frederic-v3) Revert "perf: Move irq_work_queue() where the event is prepared."
5efa195af234 Revert "perf: Enqueue SIGTRAP always via task_work."
26ac4dfa180a Revert "perf: Remove perf_swevent_get_recursion_context() from perf_pending_task()."
c2fb5208a68e Revert "perf: Split __perf_pending_irq() out of perf_pending_irq()"
6d20efa57a89 (tag: v6.10-rc1-rt1-rebase, tag: v6.10-rc1-rt1, linux-rt-devel/linux-6.10.y-rt-rebase, linux-rt-devel/linux-6.10.y-rt, linux-rt-devel/for-kbuild-bot/prepare-release, linux-rt-devel/for-kbuild-bot/current-stable) Add localversion for -RT release
[acme@nine linux]$
Testing with the above reverts we're back to:
[ 66.513763] BUG: scheduling while atomic: perf/7940/0x00000002
[ 66.513763] BUG: scheduling while atomic: perf/7938/0x00000002
[ 66.513872] Preemption disabled at:
[ 66.513872] Preemption disabled at:
[ 66.513872] [<0000000000000000>] 0x0
[ 66.513872] [<0000000000000000>] 0x0
[ 66.513878] CPU: 1 PID: 7940 Comm: perf Kdump: loaded Not tainted 6.10.0-rc1-rt1+ #1
[ 66.513881] Hardware name: LENOVO 427623U/427623U, BIOS 8BET45WW (1.25 ) 05/18/2011
[ 66.513882] Call Trace:
[ 66.513885] <TASK>
[ 66.513887] dump_stack_lvl+0x51/0x70
[ 66.513893] __schedule_bug+0x88/0xa0
[ 66.513898] schedule_debug.constprop.0+0xd1/0x120
[ 66.513901] __schedule+0x50/0x680
[ 66.513906] ? __pfx_perf_pmu_nop_int+0x10/0x10
[ 66.513910] ? merge_sched_in+0x202/0x350
[ 66.513913] ? _raw_spin_lock+0x13/0x40
[ 66.513917] schedule_rtlock+0x1d/0x40
[ 66.513920] rtlock_slowlock_locked+0xcd/0x260
[ 66.513924] ? __pfx_perf_pmu_nop_void+0x10/0x10
[ 66.513926] ? perf_ctx_enable+0x55/0x70
[ 66.513930] rt_spin_lock+0x40/0x60
[ 66.513933] do_send_sig_info+0x32/0xb0
[ 66.513936] send_sig_perf+0x6f/0x90
[ 66.513939] perf_pending_task+0x89/0xa0
[ 66.513942] task_work_run+0x58/0x90
[ 66.513946] irqentry_exit_to_user_mode+0x1ce/0x1d0
[ 66.513950] asm_sysvec_irq_work+0x16/0x20
[ 66.513955] RIP: 0033:0x4c05af
[ 66.513957] Code: 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 e8 39 c0 f4 ff 4c 89 e7 48 89 c3 e8 be d3 f4 ff f0 01 1d 37 e6 85 00 8b 05 39 e6 85 00 <83> f8 01 7e 23 89 d9 31 d2 0f 1f 84 00 00 00 00 00 f0 01 0d 19 e6
[ 66.513959] RSP: 002b:00007fcd6cf9bd70 EFLAGS: 00000202
[ 66.513961] RAX: 0000000000000bb8 RBX: 0000000000001f04 RCX: 00007fcd6f41fb7a
[ 66.513963] RDX: 0000000000000004 RSI: 0000000000000080 RDI: 00007ffe83407af4
[ 66.513964] RBP: 00007fcd6cf9bd90 R08: 00007ffe83407af0 R09: 0000000000000004
[ 66.513966] R10: 0000000000000000 R11: 0000000000000282 R12: 00007ffe83407af0
[ 66.513967] R13: 000000000000000d R14: 00007fcd6f421530 R15: 0000000000000000
[ 66.513970] </TASK>
<SNIP more backtraces>
With Frederic's patchset:
[acme@nine linux]$ b4 am -ctsl --cc-trailers [email protected]
Grabbing thread from lore.kernel.org/all/[email protected]/t.mbox.gz
Checking for newer revisions
Grabbing search results from lore.kernel.org
Analyzing 5 messages in the thread
Looking for additional code-review trailers on lore.kernel.org
<SNIP>
Total patches: 4
---
Cover: ./v3_20240516_frederic_perf_fix_leaked_sigtrap_events.cover
Link: https://lore.kernel.org/r/[email protected]
Base: not specified
git am ./v3_20240516_frederic_perf_fix_leaked_sigtrap_events.mbx
[acme@nine linux]$ git am ./v3_20240516_frederic_perf_fix_leaked_sigtrap_events.mbx
Applying: task_work: s/task_work_cancel()/task_work_cancel_func()/
Applying: task_work: Introduce task_work_cancel() again
Applying: perf: Fix event leak upon exit
Applying: perf: Fix event leak upon exec and file release
[acme@nine linux]$
[acme@nine linux]$ git log --oneline -9
1f88fa6e3adb (HEAD -> linux-rt-devel-6.10.y-rt-sigtrap-fix-frederic-v3) perf: Fix event leak upon exec and file release
44cde14a096c perf: Fix event leak upon exit
512f8f5cbaed task_work: Introduce task_work_cancel() again
e7bee294ec69 task_work: s/task_work_cancel()/task_work_cancel_func()/
4de7b8e17201 Revert "perf: Move irq_work_queue() where the event is prepared."
5efa195af234 Revert "perf: Enqueue SIGTRAP always via task_work."
26ac4dfa180a Revert "perf: Remove perf_swevent_get_recursion_context() from perf_pending_task()."
c2fb5208a68e Revert "perf: Split __perf_pending_irq() out of perf_pending_irq()"
6d20efa57a89 (tag: v6.10-rc1-rt1-rebase, tag: v6.10-rc1-rt1, linux-rt-devel/linux-6.10.y-rt-rebase, linux-rt-devel/linux-6.10.y-rt, linux-rt-devel/for-kbuild-bot/prepare-release, linux-rt-devel/for-kbuild-bot/current-stable) Add localversion for -RT release
[acme@nine linux]$
The workload that is used to do that, as a reminder, is 'perf test sigtrap'.
[ 121.217475] BUG: scheduling while atomic: perf/7955/0x00000002
[ 121.217478] BUG: scheduling while atomic: perf/7956/0x00000002
<SNIP list of modules>
[ 121.217492] BUG: scheduling while atomic: perf/7954/0x00000002
<SNIP list of modules>
[ 121.217570] Preemption disabled at:
<SNIP>
[ 121.217571] [<0000000000000000>] 0x0
<SNIP>
[ 121.217609] Preemption disabled at:
<SNIP>
[ 121.217610] [<0000000000000000>] 0x0
Lots of reports mixed up:
[ 121.217575] CPU: 5 PID: 7955 Comm: perf Kdump: loaded Not tainted 6.10.0-rc1.frederic-rt1+ #3
[ 121.217577] i2c_smbus
[ 121.217577] cec
[ 121.217577] Hardware name: LENOVO 427623U/427623U, BIOS 8BET45WW (1.25 ) 05/18/2011
[ 121.217578] firmware_attributes_class
[ 121.217577] i2c_algo_bit
[ 121.217579] snd
[ 121.217579] drm_buddy
[ 121.217579] Call Trace:
[ 121.217580] wmi_bmof
[ 121.217580] ttm
[ 121.217580] mei_me lpc_ich
[ 121.217581] <TASK>
[ 121.217581] sr_mod
[ 121.217582] mei
[ 121.217582] intel_gtt
[ 121.217583] soundcore sparse_keymap
[ 121.217584] sd_mod cdrom
[ 121.217584] platform_profile rfkill
[ 121.217585] drm_display_helper t10_pi
[ 121.217583] dump_stack_lvl+0x51/0x70
[ 121.217586] joydev xfs
[ 121.217587] sg drm_kms_helper
[ 121.217588] libcrc32c
[ 121.217589] sdhci_pci
[ 121.217589] i915
[ 121.217590] crct10dif_pclmul
[ 121.217590] cec
[ 121.217590] ahci
[ 121.217588] __schedule_bug+0x88/0xa0
[ 121.217598] ghash_clmulni_intel
[ 121.217596] __schedule+0x50/0x680
[ 121.217598] drm_display_helper t10_pi
[ 121.217599] mmc_core e1000e
[ 121.217600] sg
[ 121.217601] video
[ 121.217601] drm_kms_helper sdhci_pci
[ 121.217602] wmi
[ 121.217601] ? _raw_spin_lock+0x13/0x40
[ 121.217603] crct10dif_pclmul
[ 121.217603] serio_raw
[ 121.217604] ahci
[ 121.217604] dm_mirror
[ 121.217605] drm
[ 121.217605] dm_region_hash
[ 121.217605] schedule_rtlock+0x1d/0x40
[ 121.217606] dm_log
[ 121.217606] crc32_pclmul cqhci
[ 121.217607] dm_mod fuse
[ 121.217608] libahci
[ 121.217607] rtlock_slowlock_locked+0xcd/0x260
[ 121.217609] Preemption disabled at:
[ 121.217609] crc32c_intel sdhci libata ghash_clmulni_intel
[ 121.217610] [<0000000000000000>] 0x0
[ 121.217612] mmc_core
[ 121.217612] rt_spin_lock+0x40/0x60
[ 121.217612] e1000e video wmi serio_raw dm_mirror
[ 121.217614] do_send_sig_info+0x32/0xb0
[ 121.217615] dm_region_hash dm_log dm_mod fuse
[ 121.217617] send_sig_perf+0x6f/0x90
[ 121.217619] Preemption disabled at:
[ 121.217619] [<0000000000000000>] 0x0
[ 121.217619] perf_pending_task+0x65/0x90
[ 121.217624] task_work_run+0x58/0x90
[ 121.217627] irqentry_exit_to_user_mode+0x1ce/0x1d0
[ 121.217629] asm_sysvec_irq_work+0x16/0x20
[ 121.217633] RIP: 0033:0x4c05af
[ 121.217634] BUG: scheduling while atomic: perf/7953/0x00000002
[ 121.217634] Code: 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 e8 39 c0 f4 ff 4c 89 e7 48 89 c3 e8 be d3 f4 ff f0 01 1d 37 e6 85 00 8b 05 39 e6 85 00 <83> f8 01 7e 23 89 d9 31 d2 0f 1f 84 00 00 00 00 00 f0 01 0d 19 e6
[ 121.217636] RSP: 002b:00007fb693ac5d70 EFLAGS: 00000206
<SNIP>
[ 121.217653] caller is perf_pending_task+0x29/0x90
[ 121.217654] coretemp snd_hda_codec_conexant kvm_intel snd_hda_codec_generic
[ 121.217655] dump_stack_lvl+0x51/0x70
[ 121.217657] mac80211 kvm snd_hda_intel uvcvideo snd_intel_dspcfg libarc4 snd_intel_sdw_acpi
[ 121.217659] __schedule_bug+0x88/0xa0
[ 121.217661] snd_hda_codec uvc rapl videobuf2_vmalloc mei_wdt snd_hda_core
[ 121.217663] schedule_debug.constprop.0+0xd1/0x120
[ 121.217665] videobuf2_memops btusb videobuf2_v4l2 iwlwifi btrtl snd_hwdep btintel
[ 121.217666] __schedule+0x50/0x680
[ 121.217668] videodev snd_seq btbcm snd_ctl_led snd_seq_device iTCO_wdt btmtk
[ 121.217670] ? _raw_spin_lock+0x13/0x40
[ 121.217672] snd_pcm intel_cstate videobuf2_common iTCO_vendor_support bluetooth cfg80211 thinkpad_acpi snd_timer
[ 121.217675] schedule_rtlock+0x1d/0x40
[ 121.217677] i2c_i801 think_lmi mc intel_uncore pcspkr i2c_smbus
[ 121.217678] rtlock_slowlock_locked+0xcd/0x260
[ 121.217680] firmware_attributes_class snd wmi_bmof mei_me lpc_ich mei soundcore sparse_keymap platform_profile
[ 121.217683] rt_spin_lock+0x40/0x60
[ 121.217685] rfkill joydev xfs libcrc32c i915
[ 121.217687] do_send_sig_info+0x32/0xb0
[ 121.217688] cec i2c_algo_bit drm_buddy ttm sr_mod intel_gtt
[ 121.217690] send_sig_perf+0x6f/0x90
[ 121.217691] sd_mod cdrom drm_display_helper t10_pi sg drm_kms_helper sdhci_pci
[ 121.217693] perf_pending_task+0x65/0x90
[ 121.217695] crct10dif_pclmul ahci drm crc32_pclmul cqhci libahci
[ 121.217697] task_work_run+0x58/0x90
[ 121.217698] crc32c_intel sdhci libata ghash_clmulni_intel mmc_core e1000e
[ 121.217700] irqentry_exit_to_user_mode+0x1ce/0x1d0
[ 121.217702] video wmi serio_raw dm_mirror dm_region_hash dm_log
[ 121.217703] asm_sysvec_irq_work+0x16/0x20
[ 121.217705] dm_mod fuse
On Fri, May 31, 2024 at 01:25:26PM -0300, Arnaldo Carvalho de Melo wrote:
> With Frederic's patchset:
>
> <SNIP>
> The workload that is used to do that, as a reminder, is 'perf test sigtrap'.
>
> [ 121.217475] BUG: scheduling while atomic: perf/7955/0x00000002
> [ 121.217478] BUG: scheduling while atomic: perf/7956/0x00000002
> <SNIP list of modules>
> [ 121.217492] BUG: scheduling while atomic: perf/7954/0x00000002
> <SNIP list of modules>
> [ 121.217570] Preemption disabled at:
> <SNIP>
> [ 121.217571] [<0000000000000000>] 0x0
> <SNIP>
> [ 121.217609] Preemption disabled at:
> <SNIP>
> [ 121.217610] [<0000000000000000>] 0x0
Right, because my patchset doesn't fix the pre-existing RT issue where
perf_sigtrap() takes a sleeping lock while preemption is disabled.
Sebastian will need to rebase on top of this patchset and then also
convert the perf recursion context to be per-task on RT to avoid
disabling preemption.
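To spell out the pitfall visible in the backtraces, here is a generic
illustration (self-contained sketch, not the perf code itself): on
PREEMPT_RT a spinlock_t is a sleeping rtmutex, so taking one with
preemption disabled splats, which is exactly what do_send_sig_info()
hits on sighand->siglock under perf_pending_task()'s
preempt_disable_notrace().

#include <linux/spinlock.h>
#include <linux/preempt.h>

static DEFINE_SPINLOCK(my_lock);        /* spinlock_t: a sleeping rtmutex on PREEMPT_RT */

static void rt_pitfall(void)
{
        preempt_disable();              /* atomic context begins */
        /*
         * On PREEMPT_RT, spin_lock() may sleep (rt_spin_lock()), so
         * acquiring it here triggers "BUG: scheduling while atomic".
         */
        spin_lock(&my_lock);
        spin_unlock(&my_lock);
        preempt_enable();
}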
Thanks.
On 2024-05-16 16:09:35 [+0200], Frederic Weisbecker wrote:
> When a task is scheduled out, pending sigtrap deliveries are deferred
> to the target task upon resume to userspace via task_work.
>
> However failures while adding en event's callback to the task_work
an
> engine are ignored. And since the last call for events exit happen
> after task work is eventually closed, there is a small window during
> which pending sigtrap can be queued though ignored, leaking the event
> refcount addition such as in the following scenario:
Sebastian
On 2024-05-16 16:09:32 [+0200], Frederic Weisbecker wrote:
> Changes since v2:
>
> * Simplify the branch condition on [3/4] (peterz)
> * Rebase [4/4] accordingly
This wasn't applied in the meantime?
I will try to rebase my patches on top of this series.
Sebastian