LinuxLists.cc - [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

2024-03-12 18:08:40

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

Hi,

Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
the signal gets delayed until event_sched_out() which then uses
task_work_add() for its delivery. This breaks on PREEMPT_RT because the
signal is delivered with disabled preemption.

While looking at this, I also stumbled upon __perf_pending_irq() which
requires disabled interrupts but this is not the case on PREEMPT_RT.

This series aim to address both issues while not introducing a new issue
at the same time ;)
Any testing is appreciated.

v1…v2: https://lore.kernel.org/all/[email protected]/
- Marco pointed me to the testsuite that showed two problems:
- Delayed task_work from NMI / missing events.
Fixed by triggering dummy irq_work to enforce an interrupt for
the exit-to-userland path which checks task_work
- Increased ref-count on clean up/ during exec.
Mostly addressed by the former change. There is still a window
if the NMI occurs during execve(). This is addressed by removing
the task_work before free_event().
The testsuite (remove_on_exec) fails sometimes if the event/
SIGTRAP is sent before the sighandler is installed.

Sebastian

2024-03-12 18:08:41

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: [PATCH v2 1/4] perf: Move irq_work_queue() where the event is prepared.

Only if perf_event::pending_sigtrap is zero, the irq_work accounted by
increminging perf_event::nr_pending. The member perf_event::pending_addr
might be overwritten by a subsequent event if the signal was not yet
delivered and is expected. The irq_work will not be enqeueued again
because it has a check to be only enqueued once.

Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1d..c7a0274c662c8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9595,6 +9595,7 @@ static int __perf_event_overflow(struct perf_event *event,
if (!event->pending_sigtrap) {
event->pending_sigtrap = pending_id;
local_inc(&event->ctx->nr_pending);
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -9614,7 +9615,6 @@ static int __perf_event_overflow(struct perf_event *event,
event->pending_addr = 0;
if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
}

READ_ONCE(event->overflow_handler)(event, data, regs);
--
2.43.0

2024-03-12 18:08:51

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: [PATCH v2 2/4] perf: Enqueue SIGTRAP always via task_work.

A signal is delivered by raising irq_work() which works from any context
including NMI. irq_work() can be delayed if the architecture does not
provide an interrupt vector. In order not to lose a signal, the signal
is injected via task_work during event_sched_out().

Instead going via irq_work, the signal could be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland instead of triggering irq_work. A dummy IRQ is
required in the NMI case to ensure the task_work is handled before
returning to user land. For this irq_work is used. An alternative would
be just raising an interrupt like arch_send_call_function_single_ipi().

During testing with `remove_on_exec' it become visible that the event
can be enqueued via NMI during execve(). The task_work must not be kept
because free_event() will complain later. Also the new task will not
have a sighandler installed.

Queue signal via task_work. Remove perf_event::pending_sigtrap and
and use perf_event::pending_work instead. Raise irq_work in the NMI case
for a dummy interrupt. Remove the task_work if the event is freed.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/perf_event.h | 3 +--
kernel/events/core.c | 45 +++++++++++++++++---------------------
2 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a9..24ac6765146c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
unsigned int pending_wakeup;
unsigned int pending_kill;
unsigned int pending_disable;
- unsigned int pending_sigtrap;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct callback_head pending_task;
@@ -959,7 +958,7 @@ struct perf_event_context {
struct rcu_head rcu_head;

/*
- * Sum (event->pending_sigtrap + event->pending_work)
+ * Sum (event->pending_work + event->pending_work)
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c7a0274c662c8..e9926baaa1587 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
state = PERF_EVENT_STATE_OFF;
}

- if (event->pending_sigtrap) {
- bool dec = true;
-
- event->pending_sigtrap = 0;
- if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work) {
- event->pending_work = 1;
- dec = false;
- WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
- task_work_add(current, &event->pending_task, TWA_RESUME);
- }
- if (dec)
- local_dec(&event->ctx->nr_pending);
- }
-
perf_event_set_state(event, state);

if (!is_software_event(event))
@@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
* Yay, we hit home and are in the context of the event.
*/
if (cpu == smp_processor_id()) {
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
- }
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event);
@@ -9592,14 +9572,17 @@ static int __perf_event_overflow(struct perf_event *event,

if (regs)
pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
- if (!event->pending_sigtrap) {
- event->pending_sigtrap = pending_id;
+ if (!event->pending_work) {
+ event->pending_work = pending_id;
local_inc(&event->ctx->nr_pending);
- irq_work_queue(&event->pending_irq);
+ WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
+ task_work_add(current, &event->pending_task, TWA_RESUME);
+ if (in_nmi())
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
- * consuming pending_sigtrap; with exceptions:
+ * consuming pending_work; with exceptions:
*
* 1. Where !exclude_kernel, events can overflow again
* in the kernel without returning to user space.
@@ -9609,7 +9592,7 @@ static int __perf_event_overflow(struct perf_event *event,
* To approximate progress (with false negatives),
* check 32-bit hash of the current IP.
*/
- WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+ WARN_ON_ONCE(event->pending_work != pending_id);
}

event->pending_addr = 0;
@@ -13049,6 +13032,13 @@ static void sync_child_event(struct perf_event *child_event)
&parent_event->child_total_time_running);
}

+static bool task_work_cb_match(struct callback_head *cb, void *data)
+{
+ struct perf_event *event = container_of(cb, struct perf_event, pending_task);
+
+ return event == data;
+}
+
static void
perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
{
@@ -13088,6 +13078,11 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
* Kick perf_poll() for is_event_hup();
*/
perf_event_wakeup(parent_event);
+ if (event->pending_work &&
+ task_work_cancel_match(current, task_work_cb_match, event)) {
+ put_event(event);
+ local_dec(&event->ctx->nr_pending);
+ }
free_event(event);
put_event(parent_event);
return;
--
2.43.0

2024-03-12 18:08:54

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: [PATCH v2 3/4] perf: Remove perf_swevent_get_recursion_context() from perf_pending_task().

perf_swevent_get_recursion_context() is supposed to avoid recursion.
This requires to remain on the same CPU in order to decrement/ increment
the same counter. This is done by using preempt_disable(). Having
preemption disabled while sending a signal leads to locking problems on
PREEMPT_RT because sighand, a spinlock_t, becomes a sleeping lock.

This callback runs in task context and currently delivers only a signal
to "itself". Any kind of recusrion protection in this context is not
required.

Remove recursion protection in perf_pending_task().

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/events/core.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e9926baaa1587..806514d76d8dc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6785,14 +6785,6 @@ static void perf_pending_irq(struct irq_work *entry)
static void perf_pending_task(struct callback_head *head)
{
struct perf_event *event = container_of(head, struct perf_event, pending_task);
- int rctx;
-
- /*
- * If we 'fail' here, that's OK, it means recursion is already disabled
- * and we won't recurse 'further'.
- */
- preempt_disable_notrace();
- rctx = perf_swevent_get_recursion_context();

if (event->pending_work) {
event->pending_work = 0;
@@ -6800,10 +6792,6 @@ static void perf_pending_task(struct callback_head *head)
local_dec(&event->ctx->nr_pending);
}

- if (rctx >= 0)
- perf_swevent_put_recursion_context(rctx);
- preempt_enable_notrace();
-
put_event(event);
}

--
2.43.0

2024-03-12 18:09:15

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: [PATCH v2 4/4] perf: Split __perf_pending_irq() out of perf_pending_irq()

perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf-events.

The irq_work perf_pending_irq(), while this an irq_work, the callback
is invoked in thread context on PREEMPT_RT. This is needed because all
the waking functions (wake_up_all(), kill_fasync()) acquire sleep locks
which must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.

Split functionality based on perf_event::pending_disable into irq_work
named `pending_disable_irq' and invoke it in hardirq context on
PREEMPT_RT. Rename the split out callback to perf_pending_disable().

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 31 +++++++++++++++++++++++--------
2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24ac6765146c7..c1c6600541657 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
unsigned int pending_disable;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
+ struct irq_work pending_disable_irq;
struct callback_head pending_task;
unsigned int pending_work;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 806514d76d8dc..9aafb949fa100 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2449,7 +2449,7 @@ static void __perf_event_disable(struct perf_event *event,
* hold the top-level event's child_mutex, so any descendant that
* goes to exit will block in perf_event_exit_event().
*
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
* is the current context on this CPU and preemption is disabled,
* hence we can't get into perf_event_task_sched_out for this context.
*/
@@ -2489,7 +2489,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
event->pending_disable = 1;
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
}

#define MAX_INTERRUPTS (~0ULL)
@@ -5175,6 +5175,7 @@ static void perf_addr_filters_splice(struct perf_event *event,
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
+ irq_work_sync(&event->pending_disable_irq);

unaccount_event(event);

@@ -6711,7 +6712,7 @@ static void perf_sigtrap(struct perf_event *event)
/*
* Deliver the pending work in-event-context or follow the context.
*/
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
{
int cpu = READ_ONCE(event->oncpu);

@@ -6749,11 +6750,26 @@ static void __perf_pending_irq(struct perf_event *event)
* irq_work_queue(); // FAILS
*
* irq_work_run()
- * perf_pending_irq()
+ * perf_pending_disable()
*
* But the event runs on CPU-B and wants disabling there.
*/
- irq_work_queue_on(&event->pending_irq, cpu);
+ irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+ int rctx;
+
+ /*
+ * If we 'fail' here, that's OK, it means recursion is already disabled
+ * and we won't recurse 'further'.
+ */
+ rctx = perf_swevent_get_recursion_context();
+ __perf_pending_disable(event);
+ if (rctx >= 0)
+ perf_swevent_put_recursion_context(rctx);
}

static void perf_pending_irq(struct irq_work *entry)
@@ -6776,8 +6792,6 @@ static void perf_pending_irq(struct irq_work *entry)
perf_event_wakeup(event);
}

- __perf_pending_irq(event);
-
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
}
@@ -9566,7 +9580,7 @@ static int __perf_event_overflow(struct perf_event *event,
WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
task_work_add(current, &event->pending_task, TWA_RESUME);
if (in_nmi())
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -11906,6 +11920,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);

mutex_init(&event->mmap_mutex);
--
2.43.0

2024-03-12 21:42:51

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Tue, Mar 12, 2024 at 07:01:48PM +0100, Sebastian Andrzej Siewior wrote:
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
>
> While looking at this, I also stumbled upon __perf_pending_irq() which
> requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> This series aim to address both issues while not introducing a new issue
> at the same time ;)
> Any testing is appreciated.
>
> v1…v2: https://lore.kernel.org/all/[email protected]/
> - Marco pointed me to the testsuite that showed two problems:
> - Delayed task_work from NMI / missing events.
> Fixed by triggering dummy irq_work to enforce an interrupt for
> the exit-to-userland path which checks task_work
> - Increased ref-count on clean up/ during exec.
> Mostly addressed by the former change. There is still a window
> if the NMI occurs during execve(). This is addressed by removing
> the task_work before free_event().
> The testsuite (remove_on_exec) fails sometimes if the event/
> SIGTRAP is sent before the sighandler is installed.

Survives:

[acme@nine perf_events]$ for x in {0..1000}; do (perf test sigtrap &); done

And:

[acme@nine perf_events]$ pwd
/home/acme/git/linux/tools/testing/selftests/perf_events
[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done

Nothing on dmesg.

But:

[acme@nine ~]$ pidof exec_child
24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259 24258 24257 24256 24255 24254 24253 24252 24251 24250 24249 24248 24247 24246 24245 24244 24243 24242 24241 24240 24239 24238 24237 24236 24233 24232 24231 24230 24229 24228 24227 24226 24225 24224 24223 24220 24219 24218 24217 24216 24215 24214 24213 24212 24211 24210 24209 24208 24207 24206 24204 24203 24202 24201 24200 24199 24198 24197 24196 24195 24194 24192 24191 24190 24189 24188 24187 24186 24185 24184 24183 24182 24181 24180 24179 24178 24177 24176 24175 24174 24173 24172 24171 24170 24169 24168 24167 24166 24165 24164 24163 24162 24161 24160 24159 24158 24157 24154 24153 24152 24151 24150 24149 24148 24147 24146 24145 24144 24143 24142 24141 24140 24139 24138 24137 24136 24135 24134 24133 24132 24131 24130 24129 24127 24126 24125 24124 24123 24122 24121 24120 24119 24118 24117 24115 24114 24113 24112 24111 24110 24109 24108 24107 24106 24105 24104 24103 24102 24101 24100 24099 24098 24097 24096 24095 24094 24093 24092 24091 24090 24089 24088 24087 24086 24085 24084 24083 24082 24081 24080 24076 24075 24074 24073 24072 24071 24070 24069 24068 24067 24066 24065 24064 24063 24062 24061 24060 24059 24058 24057 24056 24055 24054 24053 24052 24051 24049 24048 24047 24046 24045 24044 24043 24042 24041 24040 24039 24037 24036 24035 24034 24033 24032 24031 24030 24029 24028 24027 24026 24025 24024 24023 24022 24021 24020 24019 24018 24017 24016 24015 24014 24013 24012 24011 24010 24009 24008 24007 24006 24005 24004 24003 24002 23998 23997 23996 23995 23994 23993 23992 23991 23990 23989 23988 23986 23985 23984 23983 23982 23981 23980 23979 23978 23977 23976 23975 23974 23973 23972 23970 23969 23968 23967 23966 23965 23964 23963 23962 23961 23960 23958 23957 23956 23955 23954 23953 23952 23951 23950 23949 23948 23947 23946 23945 23944 23943 23942 23941 23940 23939 23938 23937 23936 23935 23934 23933 23932 23931 23930 23929 23928 23927 23926 23925 23924 23923 23919 23918 23917 23916 23915 23914 23913 23912 23911 23910 23909 23907 23906 23905 23904 23903 23902 23901 23900 23899 23898 23897 23896 23895 23894 23893 23891 23890 23889 23888 23887 23886 23885 23884 23883 23882 23881 23879 23878 23877 23876 23875 23874 23873 23872 23871 23870 23869 23867 23866 23865 23864 23863 23862 23858 23857 23856 23855 23854 23853 23852 23851 23850 23849 23848 23847 23846 23845 23844 23843 23842 23841 23840 23833 23832 23831 23830 23829 23828 23827 23826 23825 23824 23823 23820 23819 23818 23817 23816 23815 23814 23813 23812 23811 23810 23809 23808 23807 23806 23804 23803 23802 23801 23800 23799 23798 23797 23793 23792 23791 23787 23786 23774 23773 23772 23771 23770 23769 23768 23767 23756 23754 23751 23750 23749 23748 23747 23746 23743 23734 23733 23732 23731 23730 23729 23728 23727 23726 23725 23724 23723 23722 23719 23716 23715 23684 23668 23667 23666 23589 23574 23573 23526 23461 23396 23395 23375 23271 23270 23266 23242 23241 23231 23214 23213 23210 23197 23196 23182 23181 23129 23128 23046 22964
[acme@nine ~]$

[root@nine ~]# cat /proc/24263/stack
[<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[root@nine ~]#

[root@nine ~]# for a in `pidof exec_child` ; do cat /proc/$a/stack ; done | sort | uniq -c
1 [<0>] asm_common_interrupt+0x22/0x40
495 [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
2 [<0>] asm_sysvec_reschedule_ipi+0x16/0x20
498 [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[root@nine ~]#

[acme@nine ~]$ ps ax|grep exec_child| wc -l
504
[acme@nine ~]$ ps ax|grep exec_child| tail
24264 pts/0 R 0:04 exec_child
24265 pts/0 R 0:04 exec_child
24266 pts/0 R 0:04 exec_child
24267 pts/0 R 0:04 exec_child
24268 pts/0 R 0:04 exec_child
24269 pts/0 R 0:04 exec_child
24270 pts/0 R 0:04 exec_child
24271 pts/0 R 0:04 exec_child
24273 pts/0 R 0:04 exec_child
26704 pts/1 S+ 0:00 grep --color=auto exec_child
[acme@nine ~]$

All in 'R' state.

[root@nine ~]# killall exec_child
exec_child: no process found
[root@nine ~]# ps ax | grep exec_child | head -5
22964 pts/0 R 0:06 exec_child
23046 pts/0 R 0:05 exec_child
23128 pts/0 R 0:05 exec_child
23129 pts/0 R 0:05 exec_child
23181 pts/0 R 0:05 exec_child
[root@nine ~]# kill 22964 23046 23128 23129 23181
[root@nine ~]# ps ax | grep exec_child | head -5
23182 pts/0 R 0:06 exec_child
23196 pts/0 R 0:06 exec_child
23197 pts/0 R 0:06 exec_child
23210 pts/0 R 0:06 exec_child
23213 pts/0 R 0:06 exec_child
[root@nine ~]#

[root@nine ~]# kill `pidof exec_child`
[root@nine ~]# ps ax | grep exec_child | head -5
26806 pts/0 S+ 0:00 grep --color=auto exec_child
[root@nine ~]#

[acme@nine perf_events]$ ps ax|grep exec_child
26816 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$ ./remove_on_exec
TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN remove_on_exec.fork_only ...
# OK remove_on_exec.fork_only
ok 1 remove_on_exec.fork_only
# RUN remove_on_exec.fork_exec_then_enable ...
# OK remove_on_exec.fork_exec_then_enable
ok 2 remove_on_exec.fork_exec_then_enable
# RUN remove_on_exec.enable_then_fork_exec ...
*# OK remove_on_exec.enable_then_fork_exec
ok 3 remove_on_exec.enable_then_fork_exec
# RUN remove_on_exec.exec_stress ...
*************************# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[acme@nine perf_events]$ ps ax|grep exec_child
26858 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$ ps ax|grep exec_child
26860 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$

[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done
<SNIP>

While it runs I can see:

[acme@nine ~]$ ps ax|grep exec_child
28627 pts/0 R 0:00 exec_child
28668 pts/0 R 0:00 exec_child
28744 pts/0 R 0:00 exec_child
28813 pts/0 R 0:00 exec_child
28912 pts/0 R 0:00 exec_child
29666 pts/1 S+ 0:00 grep --color=auto exec_child
[acme@nine ~]$

at the end they disappeared, on this last run.

But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
we get all those exec_child running a seemingly eternal loop:

[acme@nine ~]$ ps ax|grep exec_child|wc -l
479
[acme@nine ~]$ ps ax|grep exec_child|head
30143 pts/0 R 0:01 exec_child
30144 pts/0 R 0:01 exec_child
30145 pts/0 R 0:01 exec_child
30147 pts/0 R 0:01 exec_child
30150 pts/0 R 0:01 exec_child
30151 pts/0 R 0:01 exec_child
30153 pts/0 R 0:01 exec_child
30154 pts/0 R 0:01 exec_child
30155 pts/0 R 0:01 exec_child
30156 pts/0 R 0:01 exec_child
[acme@nine ~]$ sudo cat /proc/30143/stack
[<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[acme@nine ~]$

[acme@nine perf_events]$ uptime
18:36:17 up 16 min, 2 users, load average: 460.01, 375.77, 240.59
[acme@nine perf_events]$ uptime
18:36:21 up 16 min, 2 users, load average: 461.53, 377.49, 241.87
[acme@nine perf_events]$ uptime
18:36:24 up 16 min, 2 users, load average: 462.85, 379.16, 243.14
[acme@nine perf_events]$ uptime
18:36:27 up 16 min, 2 users, load average: 462.85, 379.16, 243.14
[acme@nine perf_events]$

- Arnaldo

2024-03-13 08:13:23

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On 2024-03-12 18:42:38 [-0300], Arnaldo Carvalho de Melo wrote:
…
> But:
>
> [acme@nine ~]$ pidof exec_child
> 24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259
…

> [root@nine ~]# cat /proc/24263/stack
> [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> [root@nine ~]#
…
> [acme@nine ~]$ ps ax|grep exec_child| wc -l
> 504
> [acme@nine ~]$ ps ax|grep exec_child| tail
> 24264 pts/0 R 0:04 exec_child
> 24265 pts/0 R 0:04 exec_child
> 24266 pts/0 R 0:04 exec_child
> 24267 pts/0 R 0:04 exec_child
> 24268 pts/0 R 0:04 exec_child
> 24269 pts/0 R 0:04 exec_child
> 24270 pts/0 R 0:04 exec_child
> 24271 pts/0 R 0:04 exec_child
> 24273 pts/0 R 0:04 exec_child
> 26704 pts/1 S+ 0:00 grep --color=auto exec_child
> [acme@nine ~]$
>
> All in 'R' state.
>
> [root@nine ~]# killall exec_child
> exec_child: no process found
> [root@nine ~]# ps ax | grep exec_child | head -5
> 22964 pts/0 R 0:06 exec_child
> 23046 pts/0 R 0:05 exec_child
> 23128 pts/0 R 0:05 exec_child
> 23129 pts/0 R 0:05 exec_child
> 23181 pts/0 R 0:05 exec_child
> [root@nine ~]# kill 22964 23046 23128 23129 23181
> [root@nine ~]# ps ax | grep exec_child | head -5
> 23182 pts/0 R 0:06 exec_child
> 23196 pts/0 R 0:06 exec_child
> 23197 pts/0 R 0:06 exec_child
> 23210 pts/0 R 0:06 exec_child
> 23213 pts/0 R 0:06 exec_child
> [root@nine ~]#

You can't kill them?

> at the end they disappeared, on this last run.
>
> But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
> we get all those exec_child running a seemingly eternal loop:

Is this new or was it there? Is this VM or bare metal?

One part I don't get: did you let it run or did you kill it?
`exec_child' spins until a signal is received or the parent kills it. So
it shouldn't remain there for ever. And my guess, that it is in spinning
in userland and not in kernel.
I tried it on bare metal and VM and couldn't reproduce this.
…
>
> - Arnaldo

Sebastian

2024-03-13 13:31:30

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-12 18:42:38 [-0300], Arnaldo Carvalho de Melo wrote:
> …
> > But:
> >
> > [acme@nine ~]$ pidof exec_child
> > 24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259
> …
>
> > [root@nine ~]# cat /proc/24263/stack
> > [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> > [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [root@nine ~]#
> …
> > [acme@nine ~]$ ps ax|grep exec_child| wc -l
> > 504
> > [acme@nine ~]$ ps ax|grep exec_child| tail
> > 24264 pts/0 R 0:04 exec_child
> > 24265 pts/0 R 0:04 exec_child
> > 24266 pts/0 R 0:04 exec_child
> > 24267 pts/0 R 0:04 exec_child
> > 24268 pts/0 R 0:04 exec_child
> > 24269 pts/0 R 0:04 exec_child
> > 24270 pts/0 R 0:04 exec_child
> > 24271 pts/0 R 0:04 exec_child
> > 24273 pts/0 R 0:04 exec_child
> > 26704 pts/1 S+ 0:00 grep --color=auto exec_child
> > [acme@nine ~]$
> >
> > All in 'R' state.
> >
> > [root@nine ~]# killall exec_child
> > exec_child: no process found
> > [root@nine ~]# ps ax | grep exec_child | head -5
> > 22964 pts/0 R 0:06 exec_child
> > 23046 pts/0 R 0:05 exec_child
> > 23128 pts/0 R 0:05 exec_child
> > 23129 pts/0 R 0:05 exec_child
> > 23181 pts/0 R 0:05 exec_child
> > [root@nine ~]# kill 22964 23046 23128 23129 23181
> > [root@nine ~]# ps ax | grep exec_child | head -5
> > 23182 pts/0 R 0:06 exec_child
> > 23196 pts/0 R 0:06 exec_child
> > 23197 pts/0 R 0:06 exec_child
> > 23210 pts/0 R 0:06 exec_child
> > 23213 pts/0 R 0:06 exec_child
> > [root@nine ~]#

> You can't kill them?

I can, they remain in R state and I can kill them with 'kill `pidof exec_child`'

> > at the end they disappeared, on this last run.

> > But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
> > we get all those exec_child running a seemingly eternal loop:
>
> Is this new or was it there? Is this VM or bare metal?

bare metal.

> One part I don't get: did you let it run or did you kill it?

If I let them run they will finish and exit, no exec_child remains.

If I instead try to stop the loop that goes on forking the 100 of them,
then the exec_child remain spinning.

> `exec_child' spins until a signal is received or the parent kills it. So

> it shouldn't remain there for ever. And my guess, that it is in spinning
> in userland and not in kernel.

Checking that now, the stack is the one I posted:

> > [root@nine ~]# cat /proc/24263/stack
> > [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> > [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [root@nine ~]#

> I tried it on bare metal and VM and couldn't reproduce this.

All my tests are in bare metal.

- Arnaldo

2024-03-13 13:47:51

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On 2024-03-13 10:28:41 [-0300], Arnaldo Carvalho de Melo wrote:
> > One part I don't get: did you let it run or did you kill it?
>
> If I let them run they will finish and exit, no exec_child remains.
>
> If I instead try to stop the loop that goes on forking the 100 of them,
> then the exec_child remain spinning.

Okay. So that problem only exists if you intervene. And you can
reproduce this odd behaviour with my patches but not without them,
right?

> > it shouldn't remain there for ever. And my guess, that it is in spinning
> > in userland and not in kernel.
>
> Checking that now, the stack is the one I posted:
>
> > > [root@nine ~]# cat /proc/24263/stack
> > > [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> > > [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [root@nine ~]#

could you resolve irqentry_exit_to_user_mode+0x1c9/0x1e0, please?

> > I tried it on bare metal and VM and couldn't reproduce this.
>
> All my tests are in bare metal.

Would you mind sending me your .config? The shell is bash I guess. I
will try to reproduce your setup on another box…

> - Arnaldo

Sebastian

2024-03-13 13:47:52

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > One part I don't get: did you let it run or did you kill it?

> If I let them run they will finish and exit, no exec_child remains.

> If I instead try to stop the loop that goes on forking the 100 of them,
> then the exec_child remain spinning.

> > `exec_child' spins until a signal is received or the parent kills it. So

> > it shouldn't remain there for ever. And my guess, that it is in spinning
> > in userland and not in kernel.

> Checking that now:

tldr; the tight loop, full details at the end.

100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6

remove_on_exec.c

/* For exec'd child. */
static void exec_child(void)
{
struct sigaction action = {};
const int val = 42;

/* Set up sigtrap handler in case we erroneously receive a trap. */
action.sa_flags = SA_SIGINFO | SA_NODEFER;
action.sa_sigaction = sigtrap_handler;
sigemptyset(&action.sa_mask);
if (sigaction(SIGTRAP, &action, NULL))
_exit((perror("sigaction failed"), 1));

/* Signal parent that we're starting to spin. */
if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
_exit((perror("write failed"), 1));

/* Should hang here until killed. */
while (!signal_count);
}

So probably just a test needing to be a bit more polished?

Seems like it, on a newer machine, faster, I managed to reproduce it on
a non-RT kernel, with one exec_child remaining:

1.44 b6: mov signal_count,%eax
test %eax,%eax
98.56 ↑ je b6

same tight loop:

acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$ pidof exec_child
722300
acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$ ps ax|grep exec_child
722300 pts/2 R 4:08 exec_child
722502 pts/2 S+ 0:00 grep --color=auto exec_child
acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$

- Arnaldo

[root@nine ~]# perf record --call-graph dwarf -p 35785
^C[ perf record: Woken up 48 times to write data ]
[ perf record: Captured and wrote 12.120 MB perf.data (1503 samples) ]

[root@nine ~]# ls -la perf.data
-rw-------. 1 root root 12720152 Mar 13 10:32 perf.data
[root@nine ~]#
[root@nine ~]# perf report --no-child --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles:P'
# Event count (approx.): 926018718
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......................................
#
98.48% exe remove_on_exec [.] exec_child
|
---exec_child
main
__libc_start_call_main
__libc_start_main@@GLIBC_2.34
_start

0.33% exe [kernel.kallsyms] [k] arch_scale_freq_tick
0.13% exe [kernel.kallsyms] [k] debug_smp_processor_id
0.13% exe [kernel.kallsyms] [k] check_cpu_stall
0.13% exe [kernel.kallsyms] [k] acct_account_cputime
0.13% exe [kernel.kallsyms] [k] cpuacct_account_field
0.07% exe [kernel.kallsyms] [k] preempt_count_add
0.07% exe [kernel.kallsyms] [k] update_irq_load_avg
0.07% exe [kernel.kallsyms] [k] cgroup_rstat_updated
0.07% exe [kernel.kallsyms] [k] rcu_sched_clock_irq
0.07% exe [kernel.kallsyms] [k] account_user_time
0.07% exe [kernel.kallsyms] [k] __hrtimer_run_queues
0.07% exe [kernel.kallsyms] [k] tick_nohz_highres_handler
0.07% exe [kernel.kallsyms] [k] ktime_get_update_offsets_now
0.06% exe [kernel.kallsyms] [k] __enqueue_entity
0.06% exe [kernel.kallsyms] [k] tick_sched_handle
0.00% exe [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0

#
# (Tip: To show assembler sample contexts use perf record -b / perf script -F +brstackinsn --xed)
#
[root@nine ~]#

[root@nine ~]# perf annotate --stdio2 exec_child
Samples: 1K of event 'cycles:P', 4000 Hz, Event count (approx.): 911943256, [percent: local period]
exec_child() /home/acme/git/linux/tools/testing/selftests/perf_events/remove_on_exec
Percent

Disassembly of section .text:

00000000004045cf <exec_child>:
push %rbp
mov %rsp,%rbp
sub $0xb0,%rsp
lea -0xa0(%rbp),%rdx
mov $0x0,%eax
mov $0x13,%ecx
mov %rdx,%rdi
rep stos %rax,%es:(%rdi)
movl $0x2a,-0xa4(%rbp)
movl $0x40000004,-0x18(%rbp)
movq $0x402a2e,-0xa0(%rbp)
lea -0xa0(%rbp),%rax
add $0x8,%rax
mov %rax,%rdi
→ callq sigemptyset@plt
lea -0xa0(%rbp),%rax
mov $0x0,%edx
mov %rax,%rsi
mov $0x5,%edi
→ callq sigaction@plt
test %eax,%eax
↓ je 82
mov $0x4058af,%edi
→ callq perror@plt
mov $0x1,%edi
→ callq _exit@plt
82: lea -0xa4(%rbp),%rax
mov $0x4,%edx
mov %rax,%rsi
mov $0x1,%edi
→ callq write@plt
cmp $0xffffffffffffffff,%rax
↓ jne b5
mov $0x4058c0,%edi
→ callq perror@plt
mov $0x1,%edi
→ callq _exit@plt
b5: nop
100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6
nop
nop
leaveq
← retq
[root@nine ~]#

2024-03-13 14:06:59

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 02:46:45PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-13 10:28:41 [-0300], Arnaldo Carvalho de Melo wrote:
> > > One part I don't get: did you let it run or did you kill it?
> >
> > If I let them run they will finish and exit, no exec_child remains.
> >
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child remain spinning.
>
> Okay. So that problem only exists if you intervene. And you can
> reproduce this odd behaviour with my patches but not without them,
> right?

See the next message, I managed to reproduce that behaviour in a non-RT
kernel as well.

- Arnaldo

2024-03-13 14:18:16

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, 13 Mar 2024 at 14:47, Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > > One part I don't get: did you let it run or did you kill it?
>
> > If I let them run they will finish and exit, no exec_child remains.
>
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child remain spinning.
>
> > > `exec_child' spins until a signal is received or the parent kills it. So
>
> > > it shouldn't remain there for ever. And my guess, that it is in spinning
> > > in userland and not in kernel.
>
> > Checking that now:
>
> tldr; the tight loop, full details at the end.
>
> 100.00 b6: mov signal_count,%eax
> test %eax,%eax
> ↑ je b6
>
> remove_on_exec.c
>
> /* For exec'd child. */
> static void exec_child(void)
> {
> struct sigaction action = {};
> const int val = 42;
>
> /* Set up sigtrap handler in case we erroneously receive a trap. */
> action.sa_flags = SA_SIGINFO | SA_NODEFER;
> action.sa_sigaction = sigtrap_handler;
> sigemptyset(&action.sa_mask);
> if (sigaction(SIGTRAP, &action, NULL))
> _exit((perror("sigaction failed"), 1));
>
> /* Signal parent that we're starting to spin. */
> if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
> _exit((perror("write failed"), 1));
>
> /* Should hang here until killed. */
> while (!signal_count);
> }
>
> So probably just a test needing to be a bit more polished?

Yes, possible.

> Seems like it, on a newer machine, faster, I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:
>
> 1.44 b6: mov signal_count,%eax
> test %eax,%eax
> 98.56 ↑ je b6

It's unclear to me why that happens. But I do recall seeing it before,
and my explanation was that with too many concurrent copies of the
test the system either ran out of memory (maybe?) because the stress
test also spawns 30 parallel copies of the "exec_child" subprocess. So
with the 100 parallel copies we end up with 30 * 100 processes. Maybe
that's too much?

In any case, if the kernel didn't fall over during that kind of stress
testing, and the test itself passes when run as a single copy, then
I'd conclude all looks good.

This particular feature of perf along with testing it once before
melted Peter's and my brain [1]. I hope your experience didn't result
in complete brain-melt. ;-)

[1] https://lore.kernel.org/all/[email protected]/

Thanks,
-- Marco

2024-03-13 14:23:38

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On 2024-03-13 10:47:33 [-0300], Arnaldo Carvalho de Melo wrote:
> /* For exec'd child. */
> static void exec_child(void)
> {
> struct sigaction action = {};
> const int val = 42;
>
> /* Set up sigtrap handler in case we erroneously receive a trap. */
> action.sa_flags = SA_SIGINFO | SA_NODEFER;
> action.sa_sigaction = sigtrap_handler;
> sigemptyset(&action.sa_mask);
> if (sigaction(SIGTRAP, &action, NULL))
> _exit((perror("sigaction failed"), 1));
>
> /* Signal parent that we're starting to spin. */
> if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
> _exit((perror("write failed"), 1));
>
> /* Should hang here until killed. */
> while (!signal_count);
> }
>
> So probably just a test needing to be a bit more polished?

Maybe. I'm not sure where this is coming from. Either someone should
kill or the signal should be delivered but, hmm… If the signal isn't
coming then it might one of those without a perf counter.

> Seems like it, on a newer machine, faster, I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:

Okay, so no regression. That is something ;)

> - Arnaldo

Sebastian

2024-03-13 14:37:20

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
>
> While looking at this, I also stumbled upon __perf_pending_irq() which
> requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> This series aim to address both issues while not introducing a new issue
> at the same time ;)
> Any testing is appreciated.
>
> v1…v2: https://lore.kernel.org/all/[email protected]/
> - Marco pointed me to the testsuite that showed two problems:
> - Delayed task_work from NMI / missing events.
> Fixed by triggering dummy irq_work to enforce an interrupt for
> the exit-to-userland path which checks task_work
> - Increased ref-count on clean up/ during exec.
> Mostly addressed by the former change. There is still a window
> if the NMI occurs during execve(). This is addressed by removing
> the task_work before free_event().
> The testsuite (remove_on_exec) fails sometimes if the event/
> SIGTRAP is sent before the sighandler is installed.

Tested-by: Marco Elver <[email protected]>

It does pass the tests in tools/testing/selftests/perf_events (non-RT
kernel, lockdep enabled). But I do recall this being a particularly
sharp corner of perf, so any additional testing and review here is
useful.

Thanks,
-- Marco

2024-03-13 14:42:08

[permalink] [raw]

Subject: Re: [PATCH v2 2/4] perf: Enqueue SIGTRAP always via task_work.

On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> A signal is delivered by raising irq_work() which works from any context
> including NMI. irq_work() can be delayed if the architecture does not
> provide an interrupt vector. In order not to lose a signal, the signal
> is injected via task_work during event_sched_out().
>
> Instead going via irq_work, the signal could be added directly via
> task_work. The signal is sent to current and can be enqueued on its
> return path to userland instead of triggering irq_work. A dummy IRQ is
> required in the NMI case to ensure the task_work is handled before
> returning to user land. For this irq_work is used. An alternative would
> be just raising an interrupt like arch_send_call_function_single_ipi().
>
> During testing with `remove_on_exec' it become visible that the event
> can be enqueued via NMI during execve(). The task_work must not be kept
> because free_event() will complain later. Also the new task will not
> have a sighandler installed.
>
> Queue signal via task_work. Remove perf_event::pending_sigtrap and
> and use perf_event::pending_work instead. Raise irq_work in the NMI case
> for a dummy interrupt. Remove the task_work if the event is freed.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> include/linux/perf_event.h | 3 +--
> kernel/events/core.c | 45 +++++++++++++++++---------------------
> 2 files changed, 21 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a9..24ac6765146c7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -781,7 +781,6 @@ struct perf_event {
> unsigned int pending_wakeup;
> unsigned int pending_kill;
> unsigned int pending_disable;
> - unsigned int pending_sigtrap;
> unsigned long pending_addr; /* SIGTRAP */
> struct irq_work pending_irq;
> struct callback_head pending_task;
> @@ -959,7 +958,7 @@ struct perf_event_context {
> struct rcu_head rcu_head;
>
> /*
> - * Sum (event->pending_sigtrap + event->pending_work)
> + * Sum (event->pending_work + event->pending_work)
> *
> * The SIGTRAP is targeted at ctx->task, as such it won't do changing
> * that until the signal is delivered.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c7a0274c662c8..e9926baaa1587 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> state = PERF_EVENT_STATE_OFF;
> }
>
> - if (event->pending_sigtrap) {
> - bool dec = true;
> -
> - event->pending_sigtrap = 0;
> - if (state != PERF_EVENT_STATE_OFF &&
> - !event->pending_work) {
> - event->pending_work = 1;
> - dec = false;
> - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> - task_work_add(current, &event->pending_task, TWA_RESUME);
> - }
> - if (dec)
> - local_dec(&event->ctx->nr_pending);
> - }
> -
> perf_event_set_state(event, state);
>
> if (!is_software_event(event))
> @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> * Yay, we hit home and are in the context of the event.
> */
> if (cpu == smp_processor_id()) {
> - if (event->pending_sigtrap) {
> - event->pending_sigtrap = 0;
> - perf_sigtrap(event);
> - local_dec(&event->ctx->nr_pending);
> - }
> if (event->pending_disable) {
> event->pending_disable = 0;
> perf_event_disable_local(event);
> @@ -9592,14 +9572,17 @@ static int __perf_event_overflow(struct perf_event *event,
>
> if (regs)
> pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> - if (!event->pending_sigtrap) {
> - event->pending_sigtrap = pending_id;
> + if (!event->pending_work) {
> + event->pending_work = pending_id;
> local_inc(&event->ctx->nr_pending);
> - irq_work_queue(&event->pending_irq);
> + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> + task_work_add(current, &event->pending_task, TWA_RESUME);
> + if (in_nmi())
> + irq_work_queue(&event->pending_irq);

Some brief code comments here would help having to dig through git
history to understand this.

> } else if (event->attr.exclude_kernel && valid_sample) {
> /*
> * Should not be able to return to user space without
> - * consuming pending_sigtrap; with exceptions:
> + * consuming pending_work; with exceptions:
> *
> * 1. Where !exclude_kernel, events can overflow again
> * in the kernel without returning to user space.
> @@ -9609,7 +9592,7 @@ static int __perf_event_overflow(struct perf_event *event,
> * To approximate progress (with false negatives),
> * check 32-bit hash of the current IP.
> */
> - WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> + WARN_ON_ONCE(event->pending_work != pending_id);
> }
>
> event->pending_addr = 0;
> @@ -13049,6 +13032,13 @@ static void sync_child_event(struct perf_event *child_event)
> &parent_event->child_total_time_running);
> }
>
> +static bool task_work_cb_match(struct callback_head *cb, void *data)
> +{
> + struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> +
> + return event == data;
> +}
> +
> static void
> perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> {
> @@ -13088,6 +13078,11 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> * Kick perf_poll() for is_event_hup();
> */
> perf_event_wakeup(parent_event);
> + if (event->pending_work &&
> + task_work_cancel_match(current, task_work_cb_match, event)) {

Brief comment which case this covers would be good.

> + put_event(event);
> + local_dec(&event->ctx->nr_pending);
> + }
> free_event(event);
> put_event(parent_event);
> return;
> --
> 2.43.0
>

2024-03-13 15:24:17

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <[email protected]> wrote:
> > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > the signal gets delayed until event_sched_out() which then uses
> > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > signal is delivered with disabled preemption.

> > While looking at this, I also stumbled upon __perf_pending_irq() which
> > requires disabled interrupts but this is not the case on PREEMPT_RT.

> > This series aim to address both issues while not introducing a new issue
> > at the same time ;)
> > Any testing is appreciated.

> > v1…v2: https://lore.kernel.org/all/[email protected]/
> > - Marco pointed me to the testsuite that showed two problems:
> > - Delayed task_work from NMI / missing events.
> > Fixed by triggering dummy irq_work to enforce an interrupt for
> > the exit-to-userland path which checks task_work
> > - Increased ref-count on clean up/ during exec.
> > Mostly addressed by the former change. There is still a window
> > if the NMI occurs during execve(). This is addressed by removing
> > the task_work before free_event().
> > The testsuite (remove_on_exec) fails sometimes if the event/
> > SIGTRAP is sent before the sighandler is installed.

> Tested-by: Marco Elver <[email protected]>

> It does pass the tests in tools/testing/selftests/perf_events (non-RT
> kernel, lockdep enabled). But I do recall this being a particularly
> sharp corner of perf, so any additional testing and review here is
> useful.

Right, I'm testing with the full 'perf test' suite now.

- Arnaldo

2024-03-13 19:13:59

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 12:23:32PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> > > On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <[email protected]> wrote:
> > > > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > > > the signal gets delayed until event_sched_out() which then uses
> > > > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > > > signal is delivered with disabled preemption.
>
> > > > While looking at this, I also stumbled upon __perf_pending_irq() which
> > > > requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> > > > This series aim to address both issues while not introducing a new issue
> > > > at the same time ;)
> > > > Any testing is appreciated.
>
> > > > v1…v2: https://lore.kernel.org/all/[email protected]/
> > > > - Marco pointed me to the testsuite that showed two problems:
> > > > - Delayed task_work from NMI / missing events.
> > > > Fixed by triggering dummy irq_work to enforce an interrupt for
> > > > the exit-to-userland path which checks task_work
> > > > - Increased ref-count on clean up/ during exec.
> > > > Mostly addressed by the former change. There is still a window
> > > > if the NMI occurs during execve(). This is addressed by removing
> > > > the task_work before free_event().
> > > > The testsuite (remove_on_exec) fails sometimes if the event/
> > > > SIGTRAP is sent before the sighandler is installed.
> >
> > > Tested-by: Marco Elver <[email protected]>

> > > It does pass the tests in tools/testing/selftests/perf_events (non-RT
> > > kernel, lockdep enabled). But I do recall this being a particularly
> > > sharp corner of perf, so any additional testing and review here is
> > > useful.
>
> > Right, I'm testing with the full 'perf test' suite now.
>
> 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> https://github.com/deater/perf_event_tests, storing the results with
> this patchset and then without, to do a diff, lets see...

[root@nine perf_event_tests]# diff -u results.6.8.0-rc7-rt6 results.6.8.0-rc7.sebastian-rt6+ | grep ^[+-]
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7.sebastian-rt6+ 2024-03-13 15:14:11.505333801 -0300
-Linux nine 6.8.0-rc7-rt6 #1 SMP PREEMPT_RT Fri Mar 8 17:36:48 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
+Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing uncore events... SKIPPED
+ Testing uncore events... PASSED
- We are running release 6.8.0-rc7-rt6
+ We are running release 6.8.0-rc7.sebastian-rt6+
- Running on CPU 4
+ Running on CPU 2
- Running on CPU 6
+ Running on CPU 2
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 6
+Running on CPU 2
+Measuring on CPU 6
[root@nine perf_event_tests]#

So basically:

- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing uncore events... SKIPPED
+ Testing uncore events... PASSED

So things improved! I'll re-run to see if these results are stable...

- Arnaldo

2024-03-13 19:17:52

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 12:23:32PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> > On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <[email protected]> wrote:
> > > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > > the signal gets delayed until event_sched_out() which then uses
> > > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > > signal is delivered with disabled preemption.

> > > While looking at this, I also stumbled upon __perf_pending_irq() which
> > > requires disabled interrupts but this is not the case on PREEMPT_RT.

> > > This series aim to address both issues while not introducing a new issue
> > > at the same time ;)
> > > Any testing is appreciated.

> > > v1…v2: https://lore.kernel.org/all/[email protected]/
> > > - Marco pointed me to the testsuite that showed two problems:
> > > - Delayed task_work from NMI / missing events.
> > > Fixed by triggering dummy irq_work to enforce an interrupt for
> > > the exit-to-userland path which checks task_work
> > > - Increased ref-count on clean up/ during exec.
> > > Mostly addressed by the former change. There is still a window
> > > if the NMI occurs during execve(). This is addressed by removing
> > > the task_work before free_event().
> > > The testsuite (remove_on_exec) fails sometimes if the event/
> > > SIGTRAP is sent before the sighandler is installed.
>
> > Tested-by: Marco Elver <[email protected]>
>
> > It does pass the tests in tools/testing/selftests/perf_events (non-RT
> > kernel, lockdep enabled). But I do recall this being a particularly
> > sharp corner of perf, so any additional testing and review here is
> > useful.

> Right, I'm testing with the full 'perf test' suite now.

'perf test' doesn't show any regression, now I'm running Vince Weaver's
https://github.com/deater/perf_event_tests, storing the results with
this patchset and then without, to do a diff, lets see...

- Arnaldo

2024-03-13 20:12:45

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 03:30:52PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> > 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> > https://github.com/deater/perf_event_tests, storing the results with
> > this patchset and then without, to do a diff, lets see...
>
> So things improved! I'll re-run to see if these results are stable...

tldr; No dmesg activity, no kernel splats, most tests passed, nothing
noticeable when running with/without the patch with Vince's regression
tests. So:

Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Reported-by: Arnaldo Carvalho de Melo <[email protected]>

- Arnaldo

Further details:

Without the patch:

[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7-rt6.new 2024-03-13 15:32:43.983245095 -0300
@@ -296,7 +296,7 @@
+ tests/rdpmc/rdpmc_validation
Testing if userspace rdpmc reads give expected results... PASSED
+ tests/rdpmc/rdpmc_multiplexing
- Testing if userspace rdpmc multiplexing works... PASSED
+ Testing if userspace rdpmc multiplexing works... FAILED
+ tests/rdpmc/rdpmc_reset
Testing if resetting while using rdpmc works... PASSED
+ tests/rdpmc/rdpmc_group
@@ -304,15 +304,15 @@
+ tests/rdpmc/rdpmc_attach
Testing if rdpmc attach works... PASSED
+ tests/rdpmc/rdpmc_attach_cpu
- Running on CPU 4
+ Running on CPU 0
Testing if rdpmc behavior on attach CPU... PASSED
+ tests/rdpmc/rdpmc_attach_global_cpu
- Running on CPU 6
+ Running on CPU 3
Testing if rdpmc behavior on attach all procs on other CPU... FAILED
+ tests/rdpmc/rdpmc_attach_other_cpu
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 0
+Running on CPU 3
+Measuring on CPU 0
Testing if rdpmc behavior on attach other CPU... FAILED
+ tests/rdpmc/rdpmc_multiattach
Testing if rdpmc multi-attach works... PASSED

A test flipped results.

Trying again with a more compact output:

[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new | grep ^[+-]
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
- Running on CPU 4
-Testing if rdpmc behavior on attach CPU... PASSED
- + tests/rdpmc/rdpmc_attach_global_cpu
+Testing if rdpmc behavior on attach CPU... FAILED
+ + tests/rdpmc/rdpmc_attach_global_cpu
+ Running on CPU 0
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 7
+Running on CPU 1
+Measuring on CPU 7
[root@nine perf_event_tests]#

Since its that rdpmc that is now always failing without this patch
series, lets try using that .new as the new baseline:

[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new2 ; diff -u results.$(uname -r).new results.$(uname -r).new2 | grep ^[+-]
--- results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
+++ results.6.8.0-rc7-rt6.new2 2024-03-13 17:08:41.438282558 -0300
- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing if userspace rdpmc multiplexing works... PASSED
+ Testing if userspace rdpmc multiplexing works... FAILED
- Running on CPU 6
-Testing if rdpmc behavior on attach CPU... FAILED
+ Running on CPU 2
+Testing if rdpmc behavior on attach CPU... PASSED
- Running on CPU 0
+ Running on CPU 2
- Measuring on CPU 7
-Running on CPU 1
-Measuring on CPU 7
+ Measuring on CPU 2
+Running on CPU 0
+Measuring on CPU 2
[root@nine perf_event_tests]#

2024-03-13 20:14:38

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Wed, Mar 13, 2024 at 05:12:25PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:30:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> > > 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> > > https://github.com/deater/perf_event_tests, storing the results with
> > > this patchset and then without, to do a diff, lets see...

> > So things improved! I'll re-run to see if these results are stable...

> tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> noticeable when running with/without the patch with Vince's regression
> tests. So:

> Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> Reported-by: Arnaldo Carvalho de Melo <[email protected]>

Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.

- Arnaldo

> - Arnaldo
>
> Further details:
>
> Without the patch:
>
> [root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new
> --- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
> +++ results.6.8.0-rc7-rt6.new 2024-03-13 15:32:43.983245095 -0300
> @@ -296,7 +296,7 @@
> + tests/rdpmc/rdpmc_validation
> Testing if userspace rdpmc reads give expected results... PASSED
> + tests/rdpmc/rdpmc_multiplexing
> - Testing if userspace rdpmc multiplexing works... PASSED
> + Testing if userspace rdpmc multiplexing works... FAILED
> + tests/rdpmc/rdpmc_reset
> Testing if resetting while using rdpmc works... PASSED
> + tests/rdpmc/rdpmc_group
> @@ -304,15 +304,15 @@
> + tests/rdpmc/rdpmc_attach
> Testing if rdpmc attach works... PASSED
> + tests/rdpmc/rdpmc_attach_cpu
> - Running on CPU 4
> + Running on CPU 0
> Testing if rdpmc behavior on attach CPU... PASSED
> + tests/rdpmc/rdpmc_attach_global_cpu
> - Running on CPU 6
> + Running on CPU 3
> Testing if rdpmc behavior on attach all procs on other CPU... FAILED
> + tests/rdpmc/rdpmc_attach_other_cpu
> - Measuring on CPU 5
> -Running on CPU 6
> -Measuring on CPU 5
> + Measuring on CPU 0
> +Running on CPU 3
> +Measuring on CPU 0
> Testing if rdpmc behavior on attach other CPU... FAILED
> + tests/rdpmc/rdpmc_multiattach
> Testing if rdpmc multi-attach works... PASSED
>
> A test flipped results.
>
> Trying again with a more compact output:
>
> [root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new | grep ^[+-]
> --- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
> +++ results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
> - Running on CPU 4
> -Testing if rdpmc behavior on attach CPU... PASSED
> - + tests/rdpmc/rdpmc_attach_global_cpu
> +Testing if rdpmc behavior on attach CPU... FAILED
> + + tests/rdpmc/rdpmc_attach_global_cpu
> + Running on CPU 0
> - Measuring on CPU 5
> -Running on CPU 6
> -Measuring on CPU 5
> + Measuring on CPU 7
> +Running on CPU 1
> +Measuring on CPU 7
> [root@nine perf_event_tests]#
>
> Since its that rdpmc that is now always failing without this patch
> series, lets try using that .new as the new baseline:
>
> [root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new2 ; diff -u results.$(uname -r).new results.$(uname -r).new2 | grep ^[+-]
> --- results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
> +++ results.6.8.0-rc7-rt6.new2 2024-03-13 17:08:41.438282558 -0300
> - Testing "branch-misses" generalized event... FAILED
> + Testing "branch-misses" generalized event... PASSED
> - Testing if userspace rdpmc multiplexing works... PASSED
> + Testing if userspace rdpmc multiplexing works... FAILED
> - Running on CPU 6
> -Testing if rdpmc behavior on attach CPU... FAILED
> + Running on CPU 2
> +Testing if rdpmc behavior on attach CPU... PASSED
> - Running on CPU 0
> + Running on CPU 2
> - Measuring on CPU 7
> -Running on CPU 1
> -Measuring on CPU 7
> + Measuring on CPU 2
> +Running on CPU 0
> +Measuring on CPU 2
> [root@nine perf_event_tests]#

2024-03-14 09:10:46

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > noticeable when running with/without the patch with Vince's regression
> > tests. So:
>
> > Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> > Reported-by: Arnaldo Carvalho de Melo <[email protected]>
>
> Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.

Just to be clear: You revert your Tested-by because now you test this on
torvalds/master but not because you reported a regression which I
missed.

> - Arnaldo

Sebastian

2024-03-14 09:36:03

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 2/4] perf: Enqueue SIGTRAP always via task_work.

On 2024-03-13 15:41:18 [+0100], Marco Elver wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index c7a0274c662c8..e9926baaa1587 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -9592,14 +9572,17 @@ static int __perf_event_overflow(struct perf_event *event,
> >
> > if (regs)
> > pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > - if (!event->pending_sigtrap) {
> > - event->pending_sigtrap = pending_id;
> > + if (!event->pending_work) {
> > + event->pending_work = pending_id;
> > local_inc(&event->ctx->nr_pending);
> > - irq_work_queue(&event->pending_irq);
> > + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > + task_work_add(current, &event->pending_task, TWA_RESUME);
> > + if (in_nmi())
> > + irq_work_queue(&event->pending_irq);
>
> Some brief code comments here would help having to dig through git
> history to understand this.

Sure.

> > } else if (event->attr.exclude_kernel && valid_sample) {
> > /*
> > * Should not be able to return to user space without
> > - * consuming pending_sigtrap; with exceptions:
> > + * consuming pending_work; with exceptions:
> > *
> > * 1. Where !exclude_kernel, events can overflow again
> > * in the kernel without returning to user space.
> > @@ -13049,6 +13032,13 @@ static void sync_child_event(struct perf_event *child_event)
> > &parent_event->child_total_time_running);
> > }
> >
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > + struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > + return event == data;
> > +}
> > +
> > static void
> > perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > {
> > @@ -13088,6 +13078,11 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > * Kick perf_poll() for is_event_hup();
> > */
> > perf_event_wakeup(parent_event);
> > + if (event->pending_work &&
> > + task_work_cancel_match(current, task_work_cb_match, event)) {
>
> Brief comment which case this covers would be good.

Okay.

> > + put_event(event);
> > + local_dec(&event->ctx->nr_pending);
> > + }
> > free_event(event);
> > put_event(parent_event);
> > return;

Sebastian

2024-03-14 14:34:55

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > noticeable when running with/without the patch with Vince's regression
> > > tests. So:
> >
> > > Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> > > Reported-by: Arnaldo Carvalho de Melo <[email protected]>
> >
> > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.
>
> Just to be clear: You revert your Tested-by because now you test this on
> torvalds/master but not because you reported a regression which I
> missed.

You got it right. No regressions, the code is good, I just need to test
it a bit further, with torvalds/master, without PREEMPT_RT.

- Arnaldo

2024-03-14 21:22:50

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Thu, Mar 14, 2024 at 11:34:39AM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > > noticeable when running with/without the patch with Vince's regression
> > > > tests. So:

> > > > Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> > > > Reported-by: Arnaldo Carvalho de Melo <[email protected]>

> > > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.

> > Just to be clear: You revert your Tested-by because now you test this on
> > torvalds/master but not because you reported a regression which I
> > missed.

> You got it right. No regressions, the code is good, I just need to test
> it a bit further, with torvalds/master, without PREEMPT_RT.

Tests performed, no regressions detected, same behaviour when killing
the remove_on_exec selftests midway:

[acme@nine perf_events]$ perf annotate --stdio2 exec_child | tail
mov $0x1,%edi
→ callq _exit@plt
b5: nop
100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6
nop
nop
leaveq
← retq
[acme@nine perf_events]$ pidof exec_child
28256 28249 28241 28240 28239 28236 28228 28226 28224 28223 28219 28215 28208 28207 28206 28205 28200 28188 28187 28186 28185 28169 28168 28167 28166 28155 28154 28153 28152 28140 28139 28138 28137 28124 28123 28122 28121 28111 28110 28109 28108 28094 28093 28092 28091 28080 28079 28078 28077 28064 28062 28061 28060 28048 28047 28046 28045 28030 28029 28028 28027 28012 28011 28010 28009 27998 27996 27994 27993 27982 27981 27979 27978 27966 27965 27962 27961 27952 27951 27949 27948 27934 27933 27932 27931 27920 27919 27918 27917 27906 27905 27903 27902 27888 27885 27883 27882
[acme@nine perf_events]$

[acme@nine linux]$ uname -a
Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
[acme@nine linux]$ perf probe -L perf_pending_disable
<perf_pending_disable@/home/acme/git/linux/kernel/events/core.c:0>
0 static void perf_pending_disable(struct irq_work *entry)
{
2 struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
int rctx;

/*
* If we 'fail' here, that's OK, it means recursion is already disabled
* and we won't recurse 'further'.
*/
rctx = perf_swevent_get_recursion_context();
10 __perf_pending_disable(event);
11 if (rctx >= 0)
12 perf_swevent_put_recursion_context(rctx);
}

static void perf_pending_irq(struct irq_work *entry)

[acme@nine linux]$

So I keep my:

Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Reported-by: Arnaldo Carvalho de Melo <[email protected]>

- Arnaldo

2024-03-14 21:46:37

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On Thu, Mar 14, 2024 at 06:22:38PM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, Mar 14, 2024 at 11:34:39AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > > > noticeable when running with/without the patch with Vince's regression
> > > > > tests. So:
>
> > > > > Tested-by: Arnaldo Carvalho de Melo <[email protected]>
> > > > > Reported-by: Arnaldo Carvalho de Melo <[email protected]>
>
> > > > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.
>
> > > Just to be clear: You revert your Tested-by because now you test this on
> > > torvalds/master but not because you reported a regression which I
> > > missed.
>
> > You got it right. No regressions, the code is good, I just need to test
> > it a bit further, with torvalds/master, without PREEMPT_RT.
>
> Tests performed, no regressions detected, same behaviour when killing
> the remove_on_exec selftests midway:

> [acme@nine linux]$ uname -a
> Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux

Re-reading this I noticed I really retested with the rt kernel, d0h, so
here it goes again:

[acme@nine ~]$ uname -r
6.8.0.sigtrapfix+
[acme@nine ~]$ set -o vi
[acme@nine ~]$ perf test sigtrap
68: Sigtrap : Ok
[acme@nine ~]$ cd ~acme/git/linux
[acme@nine linux]$ cd tools/testing/selftests/perf_events && make
make: Nothing to be done for 'all'.
[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done
<snip>
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[acme@nine perf_events]$

So despite this mistake all is well, torvalds/master + your patchkit
seems ok.

Sorry for the noise :-\

- Arnaldo

2024-03-15 07:41:27

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

On 2024-03-14 18:46:21 [-0300], Arnaldo Carvalho de Melo wrote:
> So despite this mistake all is well, torvalds/master + your patchkit
> seems ok.
>
> Sorry for the noise :-\

No worries, thank you.

> - Arnaldo

Sebastian