Hi,
Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
the signal gets delayed until event_sched_out(), which then uses
task_work_add() for its delivery. This breaks on PREEMPT_RT because the
signal is delivered with preemption disabled.
While looking at this, I also stumbled upon __perf_pending_irq(), which
requires interrupts to be disabled, but that is not the case on
PREEMPT_RT.
This series aims to address both issues without introducing new ones
along the way ;)
Any testing is appreciated.
v2…v3: https://lore.kernel.org/all/[email protected]/
- Marco suggested adding a few comments:
- Added a comment to __perf_event_overflow() to explain why irq_work
is raised in the in_nmi() case.
- Added a comment to perf_event_exit_event() to explain why the
pending event is deleted.
v1…v2: https://lore.kernel.org/all/[email protected]/
- Marco pointed me to the test suite, which showed two problems:
  - Delayed task_work from NMI / missing events.
    Fixed by triggering a dummy irq_work to enforce an interrupt for
    the exit-to-userland path, which checks task_work.
  - Increased ref-count on cleanup / during exec.
    Mostly addressed by the former change. There is still a window
    if the NMI occurs during execve(). This is addressed by removing
    the task_work before free_event().
  The test suite (remove_on_exec) sometimes fails if the event /
  SIGTRAP is sent before the signal handler is installed.
Sebastian
A signal is delivered by raising irq_work(), which works from any context
including NMI. irq_work() can be delayed if the architecture does not
provide an interrupt vector. In order not to lose a signal, the signal is
injected via task_work during event_sched_out().
Instead of going via irq_work, the signal could be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland instead of triggering irq_work. A dummy IRQ is
required in the NMI case to ensure the task_work is handled before
returning to userland. For this, irq_work is used. An alternative would
be to just raise an interrupt like arch_send_call_function_single_ipi()
does.
During testing with `remove_on_exec' it became visible that the event
can be enqueued via NMI during execve(). The task_work must not be kept
around because free_event() will complain later. Also, the new task will
not have a signal handler installed.
Queue the signal via task_work. Remove perf_event::pending_sigtrap and
use perf_event::pending_work instead. Raise irq_work in the NMI case for
a dummy interrupt. Remove the task_work if the event is freed.
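For context, the task_work consumer is perf_pending_task(); it is not
touched by the hunks below. From memory it looks roughly like this
(details may differ from the exact baseline this applies to):

static void perf_pending_task(struct callback_head *head)
{
	struct perf_event *event = container_of(head, struct perf_event, pending_task);
	int rctx;

	/*
	 * If we 'fail' here, that's OK, it means recursion is already
	 * disabled and we won't recurse 'further'.
	 */
	preempt_disable_notrace();
	rctx = perf_swevent_get_recursion_context();

	if (event->pending_work) {
		event->pending_work = 0;
		perf_sigtrap(event);		/* deliver SIGTRAP to current */
		local_dec(&event->ctx->nr_pending);
	}

	if (rctx >= 0)
		perf_swevent_put_recursion_context(rctx);
	preempt_enable_notrace();

	put_event(event);	/* pairs with the reference taken when queueing */
}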
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/perf_event.h | 3 +-
kernel/events/core.c | 58 ++++++++++++++++++++++----------------
2 files changed, 34 insertions(+), 27 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a9..24ac6765146c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
unsigned int pending_wakeup;
unsigned int pending_kill;
unsigned int pending_disable;
- unsigned int pending_sigtrap;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct callback_head pending_task;
@@ -959,7 +958,7 @@ struct perf_event_context {
struct rcu_head rcu_head;
/*
- * Sum (event->pending_sigtrap + event->pending_work)
+ * Sum (event->pending_work + event->pending_work)
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c7a0274c662c8..e0b2da8de485f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
state = PERF_EVENT_STATE_OFF;
}
- if (event->pending_sigtrap) {
- bool dec = true;
-
- event->pending_sigtrap = 0;
- if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work) {
- event->pending_work = 1;
- dec = false;
- WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
- task_work_add(current, &event->pending_task, TWA_RESUME);
- }
- if (dec)
- local_dec(&event->ctx->nr_pending);
- }
-
perf_event_set_state(event, state);
if (!is_software_event(event))
@@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
* Yay, we hit home and are in the context of the event.
*/
if (cpu == smp_processor_id()) {
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
- }
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event);
@@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
if (regs)
pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
- if (!event->pending_sigtrap) {
- event->pending_sigtrap = pending_id;
+ if (!event->pending_work) {
+ event->pending_work = pending_id;
local_inc(&event->ctx->nr_pending);
- irq_work_queue(&event->pending_irq);
+ WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
+ task_work_add(current, &event->pending_task, TWA_RESUME);
+ /*
+ * The NMI path returns directly to userland. The
+ * irq_work is raised as a dummy interrupt to ensure
+ * regular return path to user is taken and task_work
+ * is processed.
+ */
+ if (in_nmi())
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
- * consuming pending_sigtrap; with exceptions:
+ * consuming pending_work; with exceptions:
*
* 1. Where !exclude_kernel, events can overflow again
* in the kernel without returning to user space.
@@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
* To approximate progress (with false negatives),
* check 32-bit hash of the current IP.
*/
- WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+ WARN_ON_ONCE(event->pending_work != pending_id);
}
event->pending_addr = 0;
@@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
&parent_event->child_total_time_running);
}
+static bool task_work_cb_match(struct callback_head *cb, void *data)
+{
+ struct perf_event *event = container_of(cb, struct perf_event, pending_task);
+
+ return event == data;
+}
+
static void
perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
{
@@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
* Kick perf_poll() for is_event_hup();
*/
perf_event_wakeup(parent_event);
+ /*
+ * Cancel pending task_work and update counters if it has not
+ * yet been delivered to userland. free_event() expects the
+ * reference counter at one and keeping the event around until
+ * the task returns to userland can be a unexpected if there is
+ * no signal handler registered.
+ */
+ if (event->pending_work &&
+ task_work_cancel_match(current, task_work_cb_match, event)) {
+ put_event(event);
+ local_dec(&event->ctx->nr_pending);
+ }
free_event(event);
put_event(parent_event);
return;
--
2.43.0
The irq_work is only accounted for, by incrementing
perf_event_context::nr_pending, if perf_event::pending_sigtrap is zero.
The member perf_event::pending_addr might be overwritten by a subsequent
event if the signal was not yet delivered, and this is expected. The
irq_work will not be enqueued again because it has a check to be
enqueued only once.
Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set, to make it more obvious that the
irq_work is scheduled only once.
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1d..c7a0274c662c8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9595,6 +9595,7 @@ static int __perf_event_overflow(struct perf_event *event,
if (!event->pending_sigtrap) {
event->pending_sigtrap = pending_id;
local_inc(&event->ctx->nr_pending);
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -9614,7 +9615,6 @@ static int __perf_event_overflow(struct perf_event *event,
event->pending_addr = 0;
if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
}
READ_ONCE(event->overflow_handler)(event, data, regs);
--
2.43.0
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up,
while the latter disables perf events.
perf_pending_irq() is an irq_work, but on PREEMPT_RT its callback is
invoked in thread context. This is needed because all the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with interrupts disabled.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.
Split the functionality based on perf_event::pending_disable into a
separate irq_work named `pending_disable_irq' and invoke it in hardirq
context even on PREEMPT_RT. Rename the split-out callback to
perf_pending_disable().
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 31 +++++++++++++++++++++++--------
2 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24ac6765146c7..c1c6600541657 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
unsigned int pending_disable;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
+ struct irq_work pending_disable_irq;
struct callback_head pending_task;
unsigned int pending_work;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5400f7ed2f98b..7266265ed8cc3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2449,7 +2449,7 @@ static void __perf_event_disable(struct perf_event *event,
* hold the top-level event's child_mutex, so any descendant that
* goes to exit will block in perf_event_exit_event().
*
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
* is the current context on this CPU and preemption is disabled,
* hence we can't get into perf_event_task_sched_out for this context.
*/
@@ -2489,7 +2489,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
event->pending_disable = 1;
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
}
#define MAX_INTERRUPTS (~0ULL)
@@ -5175,6 +5175,7 @@ static void perf_addr_filters_splice(struct perf_event *event,
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
+ irq_work_sync(&event->pending_disable_irq);
unaccount_event(event);
@@ -6711,7 +6712,7 @@ static void perf_sigtrap(struct perf_event *event)
/*
* Deliver the pending work in-event-context or follow the context.
*/
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
{
int cpu = READ_ONCE(event->oncpu);
@@ -6749,11 +6750,26 @@ static void __perf_pending_irq(struct perf_event *event)
* irq_work_queue(); // FAILS
*
* irq_work_run()
- * perf_pending_irq()
+ * perf_pending_disable()
*
* But the event runs on CPU-B and wants disabling there.
*/
- irq_work_queue_on(&event->pending_irq, cpu);
+ irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+ int rctx;
+
+ /*
+ * If we 'fail' here, that's OK, it means recursion is already disabled
+ * and we won't recurse 'further'.
+ */
+ rctx = perf_swevent_get_recursion_context();
+ __perf_pending_disable(event);
+ if (rctx >= 0)
+ perf_swevent_put_recursion_context(rctx);
}
static void perf_pending_irq(struct irq_work *entry)
@@ -6776,8 +6792,6 @@ static void perf_pending_irq(struct irq_work *entry)
perf_event_wakeup(event);
}
- __perf_pending_irq(event);
-
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
}
@@ -9572,7 +9586,7 @@ static int __perf_event_overflow(struct perf_event *event,
* is processed.
*/
if (in_nmi())
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -11912,6 +11926,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
mutex_init(&event->mmap_mutex);
--
2.43.0
Le Fri, Mar 22, 2024 at 07:48:22AM +0100, Sebastian Andrzej Siewior a écrit :
> A signal is delivered by raising irq_work() which works from any context
> including NMI. irq_work() can be delayed if the architecture does not
> provide an interrupt vector. In order not to lose a signal, the signal
> is injected via task_work during event_sched_out().
>
> Instead going via irq_work, the signal could be added directly via
> task_work. The signal is sent to current and can be enqueued on its
> return path to userland instead of triggering irq_work. A dummy IRQ is
> required in the NMI case to ensure the task_work is handled before
> returning to user land. For this irq_work is used. An alternative would
> be just raising an interrupt like arch_send_call_function_single_ipi().
>
> During testing with `remove_on_exec' it become visible that the event
> can be enqueued via NMI during execve(). The task_work must not be kept
> because free_event() will complain later. Also the new task will not
> have a sighandler installed.
>
> Queue signal via task_work. Remove perf_event::pending_sigtrap and
> and use perf_event::pending_work instead. Raise irq_work in the NMI case
> for a dummy interrupt. Remove the task_work if the event is freed.
>
> Tested-by: Marco Elver <[email protected]>
> Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
It clashes a bit with a series I have posted. Let's see the details:
> ---
> include/linux/perf_event.h | 3 +-
> kernel/events/core.c | 58 ++++++++++++++++++++++----------------
> 2 files changed, 34 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a9..24ac6765146c7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -781,7 +781,6 @@ struct perf_event {
> unsigned int pending_wakeup;
> unsigned int pending_kill;
> unsigned int pending_disable;
> - unsigned int pending_sigtrap;
> unsigned long pending_addr; /* SIGTRAP */
> struct irq_work pending_irq;
> struct callback_head pending_task;
> @@ -959,7 +958,7 @@ struct perf_event_context {
> struct rcu_head rcu_head;
>
> /*
> - * Sum (event->pending_sigtrap + event->pending_work)
> + * Sum (event->pending_work + event->pending_work)
> *
> * The SIGTRAP is targeted at ctx->task, as such it won't do changing
> * that until the signal is delivered.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c7a0274c662c8..e0b2da8de485f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> state = PERF_EVENT_STATE_OFF;
> }
>
> - if (event->pending_sigtrap) {
> - bool dec = true;
> -
> - event->pending_sigtrap = 0;
> - if (state != PERF_EVENT_STATE_OFF &&
> - !event->pending_work) {
> - event->pending_work = 1;
> - dec = false;
> - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> - task_work_add(current, &event->pending_task, TWA_RESUME);
> - }
> - if (dec)
> - local_dec(&event->ctx->nr_pending);
> - }
> -
> perf_event_set_state(event, state);
>
> if (!is_software_event(event))
> @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> * Yay, we hit home and are in the context of the event.
> */
> if (cpu == smp_processor_id()) {
> - if (event->pending_sigtrap) {
> - event->pending_sigtrap = 0;
> - perf_sigtrap(event);
> - local_dec(&event->ctx->nr_pending);
> - }
> if (event->pending_disable) {
> event->pending_disable = 0;
> perf_event_disable_local(event);
> @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
>
> if (regs)
> pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> - if (!event->pending_sigtrap) {
> - event->pending_sigtrap = pending_id;
> + if (!event->pending_work) {
> + event->pending_work = pending_id;
> local_inc(&event->ctx->nr_pending);
> - irq_work_queue(&event->pending_irq);
> + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> + task_work_add(current, &event->pending_task, TWA_RESUME);
If the overflow happens between exit_task_work() and perf_event_exit_task(),
you're leaking the event. (This was there before this patch).
See:
https://lore.kernel.org/all/[email protected]/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39
> + /*
> + * The NMI path returns directly to userland. The
> + * irq_work is raised as a dummy interrupt to ensure
> + * regular return path to user is taken and task_work
> + * is processed.
> + */
> + if (in_nmi())
> + irq_work_queue(&event->pending_irq);
> } else if (event->attr.exclude_kernel && valid_sample) {
> /*
> * Should not be able to return to user space without
> - * consuming pending_sigtrap; with exceptions:
> + * consuming pending_work; with exceptions:
> *
> * 1. Where !exclude_kernel, events can overflow again
> * in the kernel without returning to user space.
> @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> * To approximate progress (with false negatives),
> * check 32-bit hash of the current IP.
> */
> - WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> + WARN_ON_ONCE(event->pending_work != pending_id);
> }
>
> event->pending_addr = 0;
> @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> &parent_event->child_total_time_running);
> }
>
> +static bool task_work_cb_match(struct callback_head *cb, void *data)
> +{
> + struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> +
> + return event == data;
> +}
I suggest we introduce a proper API to cancel an actual callback head, see:
https://lore.kernel.org/all/[email protected]/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
https://lore.kernel.org/all/[email protected]/T/#m0a347249a462523358724085f2489ce9ed91e640
> +
> static void
> perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> {
> @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> * Kick perf_poll() for is_event_hup();
> */
> perf_event_wakeup(parent_event);
> + /*
> + * Cancel pending task_work and update counters if it has not
> + * yet been delivered to userland. free_event() expects the
> + * reference counter at one and keeping the event around until
> + * the task returns to userland can be a unexpected if there is
> + * no signal handler registered.
> + */
> + if (event->pending_work &&
> + task_work_cancel_match(current, task_work_cb_match, event)) {
> + put_event(event);
> + local_dec(&event->ctx->nr_pending);
> + }
So an exiting task, privileged exec and also exit-on-exec call into this
before releasing the children.
And parents rely on put_event() from file close + the task work.
But what about remote release of children on file close?
See perf_event_release_kernel() directly calling free_event() on them.
One possible fix is to avoid the reference count game around task work
and flush them on free_event().
See here:
https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
Thanks.
> free_event(event);
> put_event(parent_event);
> return;
> --
> 2.43.0
>
>
On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> > state = PERF_EVENT_STATE_OFF;
> > }
> >
> > - if (event->pending_sigtrap) {
> > - bool dec = true;
> > -
> > - event->pending_sigtrap = 0;
> > - if (state != PERF_EVENT_STATE_OFF &&
> > - !event->pending_work) {
> > - event->pending_work = 1;
> > - dec = false;
> > - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > - task_work_add(current, &event->pending_task, TWA_RESUME);
> > - }
> > - if (dec)
> > - local_dec(&event->ctx->nr_pending);
> > - }
> > -
> > perf_event_set_state(event, state);
> >
> > if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> > * Yay, we hit home and are in the context of the event.
> > */
> > if (cpu == smp_processor_id()) {
> > - if (event->pending_sigtrap) {
> > - event->pending_sigtrap = 0;
> > - perf_sigtrap(event);
> > - local_dec(&event->ctx->nr_pending);
> > - }
> > if (event->pending_disable) {
> > event->pending_disable = 0;
> > perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >
> > if (regs)
> > pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > - if (!event->pending_sigtrap) {
> > - event->pending_sigtrap = pending_id;
> > + if (!event->pending_work) {
> > + event->pending_work = pending_id;
> > local_inc(&event->ctx->nr_pending);
> > - irq_work_queue(&event->pending_irq);
> > + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > + task_work_add(current, &event->pending_task, TWA_RESUME);
>
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> https://lore.kernel.org/all/[email protected]/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39
Okay.
> > + /*
> > + * The NMI path returns directly to userland. The
> > + * irq_work is raised as a dummy interrupt to ensure
> > + * regular return path to user is taken and task_work
> > + * is processed.
> > + */
> > + if (in_nmi())
> > + irq_work_queue(&event->pending_irq);
> > } else if (event->attr.exclude_kernel && valid_sample) {
> > /*
> > * Should not be able to return to user space without
> > - * consuming pending_sigtrap; with exceptions:
> > + * consuming pending_work; with exceptions:
> > *
> > * 1. Where !exclude_kernel, events can overflow again
> > * in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> > * To approximate progress (with false negatives),
> > * check 32-bit hash of the current IP.
> > */
> > - WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > + WARN_ON_ONCE(event->pending_work != pending_id);
> > }
> >
> > event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> > &parent_event->child_total_time_running);
> > }
> >
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > + struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > + return event == data;
> > +}
>
> I suggest we introduce a proper API to cancel an actual callback head, see:
>
> https://lore.kernel.org/all/[email protected]/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/[email protected]/T/#m0a347249a462523358724085f2489ce9ed91e640
This rework would work.
> > static void
> > perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > * Kick perf_poll() for is_event_hup();
> > */
> > perf_event_wakeup(parent_event);
> > + /*
> > + * Cancel pending task_work and update counters if it has not
> > + * yet been delivered to userland. free_event() expects the
> > + * reference counter at one and keeping the event around until
> > + * the task returns to userland can be a unexpected if there is
> > + * no signal handler registered.
> > + */
> > + if (event->pending_work &&
> > + task_work_cancel_match(current, task_work_cb_match, event)) {
> > + put_event(event);
> > + local_dec(&event->ctx->nr_pending);
> > + }
>
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
>
> And parents rely on put_event() from file close + the task work.
>
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.
Interesting things you are presenting. I had events popping up at random
even after the task had decided that it won't go back to userland to
handle them, so freeing the event looked like the only option…
> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
>
> See here:
>
> https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
That wake_up() within preempt_disable() section breaks on RT.
How do we go on from here?
> Thanks.
Sebastian
Le Tue, Apr 09, 2024 at 10:57:32AM +0200, Sebastian Andrzej Siewior a écrit :
> > > static void
> > > perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > > {
> > > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > > * Kick perf_poll() for is_event_hup();
> > > */
> > > perf_event_wakeup(parent_event);
> > > + /*
> > > + * Cancel pending task_work and update counters if it has not
> > > + * yet been delivered to userland. free_event() expects the
> > > + * reference counter at one and keeping the event around until
> > > + * the task returns to userland can be a unexpected if there is
> > > + * no signal handler registered.
> > > + */
> > > + if (event->pending_work &&
> > > + task_work_cancel_match(current, task_work_cb_match, event)) {
> > > + put_event(event);
> > > + local_dec(&event->ctx->nr_pending);
> > > + }
> >
> > So exiting task, privileged exec and also exit on exec call into this before
> > releasing the children.
> >
> > And parents rely on put_event() from file close + the task work.
> >
> > But what about remote release of children on file close?
> > See perf_event_release_kernel() directly calling free_event() on them.
>
> Interesting things you are presenting. I had events popping up at random
> even after the task decided that it won't go back to userland to handle
> it so letting it free looked like the only option…
>
> > One possible fix is to avoid the reference count game around task work
> > and flush them on free_event().
> >
> > See here:
> >
> > https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
>
> That wake_up() within preempt_disable() section breaks on RT.
Ah, but the wake-up still wants to go inside recursion protection somehow or
it could generate a task_work loop again due to tracepoint events...
Although... the wake up occurs only when the event is dead after all...
> How do we go on from here?
I'd tend to think you need my patchset first because the problems it
fixes were not easily visible as long as there was an irq_work to take
care of things most of the time. But once you rely on task_work only,
these become a real problem, especially the sync against perf_release().
Thanks.
>
> > Thanks.
>
> Sebastian
On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > That wake_up() within preempt_disable() section breaks on RT.
>
> Ah, but the wake-up still wants to go inside recursion protection somehow or
> it could generate task_work loop again due to tracepoint events...
okay.
> Although... the wake up occurs only when the event is dead after all...
corner case or not, it has to work, right?
> > How do we go on from here?
>
> I'd tend to think you need my patchset first because the problems it
> fixes were not easily visible as long as there was an irq work to take
> care of things most of the time. But once you rely on task_work only then
> these become a real problem. Especially the sync against perf_release().
I don't mind rebasing on top of your series. But is defaulting to
task_work not an option then?
RT-wise, the irq_work is not handled in hardirq context because of the
locks it acquires; it is handled in a thread instead. Depending on its
priority, the task (receiving the event) may run before the irq_work
thread. Therefore the task_work looked neat because the event would be
handled _before_ the task returned to userland.
Couldn't we either flush _or_ remove the task_work in perf_release()?
> Thanks.
Sebastian
Le Tue, Apr 09, 2024 at 03:47:29PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > > That wake_up() within preempt_disable() section breaks on RT.
> >
> > Ah, but the wake-up still wants to go inside recursion protection somehow or
> > it could generate task_work loop again due to tracepoint events...
>
> okay.
>
> > Although... the wake up occurs only when the event is dead after all...
>
> corner case or not, it has to work, right?
Yep.
>
> > > How do we go on from here?
> >
> > I'd tend to think you need my patchset first because the problems it
> > fixes were not easily visible as long as there was an irq work to take
> > care of things most of the time. But once you rely on task_work only then
> > these become a real problem. Especially the sync against perf_release().
>
> I don't mind rebasing on top of your series. But defaulting to task_work
> is not an option then?
>
> RT wise the irq_work is not handled in hardirq because of locks it
> acquires and is handled instead in a thread. Depending on the priority
> the task (receiving the event) it may run before the irq_work-thread.
> Therefore the task_work looked neat because the event would be handled
> _before_ the task returned to userland.
I see.
> Couldn't we either flush _or_ remove the task_work in perf_release()?
Right so the problem in perf_release() is that we may be dealing with task works
of other tasks than current. In that case, task_work_cancel() is fine if it
succeeds. But if it fails, you don't have the guarantee that the task work
isn't concurrently running or about to run. And you have no way to know about
that. So then you need some sort of flushing indeed.
Thanks.
> > Thanks.
> Sebastian
On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > Couldn't we either flush _or_ remove the task_work in perf_release()?
>
> Right so the problem in perf_release() is that we may be dealing with task works
> of other tasks than current. In that case, task_work_cancel() is fine if it
> successes. But if it fails, you don't have the guarantee that the task work
> isn't concurrently running or about to run. And you have no way to know about
> that. So then you need some sort of flushing indeed.
Since perf_release() is preemptible, a wait/sleep for completion would be
best (instead of flushing).
> Thanks.
>
> > > Thanks.
Sebastian
Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> >
> > Right so the problem in perf_release() is that we may be dealing with task works
> > of other tasks than current. In that case, task_work_cancel() is fine if it
> > successes. But if it fails, you don't have the guarantee that the task work
> > isn't concurrently running or about to run. And you have no way to know about
> > that. So then you need some sort of flushing indeed.
>
> Since perf_release() preemptible, a wait/sleep for completion would be
> best (instead of flushing).
Like this then?
https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > Thanks.
> >
> > > > Thanks.
>
> Sebastian
On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > Like this then?
> > >
> > > https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> >
> > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > rcuwait would work then.
>
> Indeed there is only one waiter so that should work. Would
> that be something you can call while preemption is disabled?
rcuwait_wake_up() only does wake_up_process(), which is fine.
wake_up() does spin_lock_irqsave(), which is a no-go.
On the other hand, that preempt-disable needs to go anyway due to
perf_sigtrap(). But a slim wake-up is a slim wake-up ;)
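For illustration, a minimal sketch of how such an rcuwait based flush
could look, following the idea of dropping the reference count game
around the task_work; the field pending_work_wait and the helper
perf_pending_task_sync() are made-up names for this mail, not taken from
any posted patch:

/*
 * Assumes struct perf_event grows
 *	struct rcuwait pending_work_wait;
 * initialised with rcuwait_init() in perf_event_alloc(), and that the
 * task_work callback does rcuwait_wake_up(&event->pending_work_wait)
 * once it has cleared event->pending_work (rcuwait_wake_up() boils down
 * to wake_up_process(), so it is fine from atomic context).
 */
static void perf_pending_task_sync(struct perf_event *event)
{
	if (!event->pending_work)
		return;

	/* Still queued on the current task: cancel and undo the accounting. */
	if (task_work_cancel_match(current, task_work_cb_match, event)) {
		event->pending_work = 0;
		local_dec(&event->ctx->nr_pending);
		return;
	}

	/*
	 * Queued on or already running on another task: sleep until the
	 * callback has finished. The caller must be preemptible.
	 */
	rcuwait_wait_event(&event->pending_work_wait,
			   !event->pending_work, TASK_UNINTERRUPTIBLE);
}

free_event() would call this before tearing the event down; the cancel
fast path covers the common case where the work never ran.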
> Thanks.
Sebastian
Le Wed, Apr 10, 2024 at 04:48:21PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > > Like this then?
> > > >
> > > > https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > >
> > > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > > rcuwait would work then.
> >
> > Indeed there is only one waiter so that should work. Would
> > that be something you can call while preemption is disabled?
>
> rcuwait_wake_up() does only wake_up_process() which is fine.
> wake_up() does spin_lock_irqsave() which is a no.
Duh!
> On the other hand that preempt-disable needs to go anyway due to
> perf_sigtrap(). But a slim wake is a slim wake ;)
Sure thing :)
> > Thanks.
>
> Sebastian
Le Wed, Apr 10, 2024 at 04:06:33PM +0200, Sebastian Andrzej Siewior a écrit :
> On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> > Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > > >
> > > > Right so the problem in perf_release() is that we may be dealing with task works
> > > > of other tasks than current. In that case, task_work_cancel() is fine if it
> > > > successes. But if it fails, you don't have the guarantee that the task work
> > > > isn't concurrently running or about to run. And you have no way to know about
> > > > that. So then you need some sort of flushing indeed.
> > >
> > > Since perf_release() preemptible, a wait/sleep for completion would be
> > > best (instead of flushing).
> >
> > Like this then?
> >
> > https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
>
> Kind of, yes. Do we have more than one waiter? If not, maybe that
> rcuwait would work then.
Indeed there is only one waiter so that should work. Would
that be something you can call while preemption is disabled?
Thanks.
> Otherwise (>1 waiter) we did establish that we may need a per-task
> counter for recursion handling so preempt-disable shouldn't be a problem
> then. The pending_work_wq must not be used outside of task context (means
> no hardirq or something like that).
>
> Sebastian
On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > >
> > > Right so the problem in perf_release() is that we may be dealing with task works
> > > of other tasks than current. In that case, task_work_cancel() is fine if it
> > > successes. But if it fails, you don't have the guarantee that the task work
> > > isn't concurrently running or about to run. And you have no way to know about
> > > that. So then you need some sort of flushing indeed.
> >
> > Since perf_release() preemptible, a wait/sleep for completion would be
> > best (instead of flushing).
>
> Like this then?
>
> https://lore.kernel.org/all/[email protected]/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
Kind of, yes. Do we have more than one waiter? If not, maybe that
rcuwait would work then.
Otherwise (>1 waiter) we did establish that we may need a per-task
counter for recursion handling, so preempt-disable shouldn't be a
problem then. The pending_work_wq must not be used outside of task
context (meaning no hardirq or anything like that).
Sebastian