Hi Steven,
I have systems that are regularly hitting a WARN in trace_event_dyn_put_ref.
The exact message is:
WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20
With the following stacktrace:
perf_trace_init+0x8f/0xd0
perf_tp_event_init+0x1f/0x40
perf_try_init_event+0x4a/0x130
perf_event_alloc+0x497/0xf40
__do_sys_perf_event_open+0x1d4/0xf70
__x64_sys_perf_event_open+0x20/0x30
do_syscall_64+0x5c/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
I've debugged this and think that it's caused by changes in 5.15 that
introduced a refcounter for dynamic events.
Specifically, perf_trace_init() manages the refcount for the events it
finds in the ftrace_events list. However, perf_trace_event_init() ->
perf_trace_event_unreg() can decrement this count unexpectedly if
perf_trace_event_open() encounters an error.
I'm able to reproduce reliably with the following shell script. This is
from an Ubuntu 20.04 userland. If the uprobes in the script don't line
up exactly with what's on your system, any uprobes should do.
--- >8 cut here 8< ---
#!/usr/bin/bash
# A simple-ish reproducer for trace_event_dyn_put_ref() WARN from
# perf_trace_init(). Triggering an -EINTR during uprobe attach is enough to
# land in the error case that does a double-decrement.
cleanup() {
rm -f perf.data
jobs -p | xargs -r kill -9 2>/dev/null
perf probe -d '*'
exit
}
setup() {
perf probe -d '*' 2>/dev/null
perf probe -x /proc/1/root/lib/x86_64-linux-gnu/libc.so.6 getaddrinfo
perf probe -x /proc/1/root/lib/x86_64-linux-gnu/libc.so.6 getaddrinfo%return
perf probe -x /proc/1/root/lib/x86_64-linux-gnu/libc.so.6 backtrace
perf probe -x /proc/1/root/lib/x86_64-linux-gnu/libaudit.so.1 audit_log_acct_message
perf probe -x /proc/1/root/lib/x86_64-linux-gnu/libaudit.so.1 audit_log_acct_message%return
}
setup
trap cleanup INT TERM
while true; do
perf record -e 'task:task_newtask' \
-e 'sched:sched_process_exit' \
-e 'raw_syscalls:sys_enter' \
-e 'raw_syscalls:sys_exit' \
-e 'probe_libaudit:audit_log_acct_message' \
-e 'probe_libaudit:audit_log_acct_message__return' \
-e 'probe_libc:backtrace' \
-e 'probe_libc:getaddrinfo' \
-e 'probe_libc:getaddrinfo__return' \
-a &
sleep "0.$RANDOM"
jobs -p | xargs -r kill -9
rm -f perf.data
done
--- >8 cut here 8< ---
I'm sending along a patch; with it applied, the script above no longer
triggers the WARN. I've tested kprobes and uprobes with it a bit,
and did not observe any obvious regressions. I don't have the full
battery of tests that you do, though. However, if you do have something
else I should test against prior to submission, I'd be happy to do so.
That patch is in the e-mail that follows, and its commit message should
explain the approach to the solution, hopefully to your satisfaction.
Thanks,
-K
Krister Johansen (1):
tracing: fix a WARN from trace_event_dyn_put_ref
kernel/trace/trace_event_perf.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
--
2.25.1
The code in perf_trace_init takes a reference on a trace_event_call that is
looked up as part of the function call. If perf_trace_event_init fails,
however, perf_trace_event_unreg can decrement that refcount from underneath
perf_trace_init. This means that in some failure cases, perf_trace_init
can trigger the WARN in trace_dynevent.c which attempts to guard against
zero reference counts going negative.
The author can reproduce this problem readily by running perf record in a
loop against a series of uprobes with no other users. Killing the record
process before it can finish its setup is enough to trigger this WARN
within a few seconds.
This patch leaves the behavior in perf_trace_event_unreg unchanged, but
moves most of the code in that function to perf_trace_event_cleanup. The
unreg function retains the ability to drop the refcount on the tp_event,
but cleanup does not. This modification is based upon the observation that
all of the other callers of perf_trace_event_init don't bother with
manipulating a reference count on the tp_events that they create. For
those callers, the trace_event_put_ref was already a no-op.
Signed-off-by: Krister Johansen <[email protected]>
Reviewed-by: David Reaver <[email protected]>
Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
Cc: [email protected] # 5.15, 5.18, 5.19
---
kernel/trace/trace_event_perf.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a114549720d6..7762bfd268cd 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -151,13 +151,13 @@ static int perf_trace_event_reg(struct trace_event_call *tp_event,
return ret;
}
-static void perf_trace_event_unreg(struct perf_event *p_event)
+static void perf_trace_event_cleanup(struct perf_event *p_event)
{
struct trace_event_call *tp_event = p_event->tp_event;
int i;
if (--tp_event->perf_refcount > 0)
- goto out;
+ return;
tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
@@ -176,7 +176,13 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
perf_trace_buf[i] = NULL;
}
}
-out:
+}
+
+static void perf_trace_event_unreg(struct perf_event *p_event)
+{
+ struct trace_event_call *tp_event = p_event->tp_event;
+
+ perf_trace_event_cleanup(p_event);
trace_event_put_ref(tp_event);
}
@@ -207,7 +213,7 @@ static int perf_trace_event_init(struct trace_event_call *tp_event,
ret = perf_trace_event_open(p_event);
if (ret) {
- perf_trace_event_unreg(p_event);
+ perf_trace_event_cleanup(p_event);
return ret;
}
--
2.25.1
[ Added the authors of this code to the Cc ]
On Fri, 12 Aug 2022 17:02:20 -0700
Krister Johansen <[email protected]> wrote:
> The code in perf_trace_init takes a reference on a trace_event_call that is
> looked up as part of the function call. If perf_trace_event_init fails,
> however, perf_trace_event_unreg can decrement that refcount from underneath
> perf_trace_init. This means that in some failure cases, perf_trace_init
> can trigger the WARN in trace_dynevent.c which attempts to guard against
> zero reference counts going negative.
>
> The author can reproduce this problem readily by running perf record in a
> loop against a series of uprobes with no other users. Killing the record
> process before it can finish its setup is enough to trigger this WARN
> within a few seconds.
>
> This patch leaves the behavior in perf_trace_event_unreg unchanged, but
> moves most of the code in that function to perf_trace_event_cleanup. The
> unreg function retains the ability to drop the refcount on the tp_event,
> but cleanup does not. This modification is based upon the observation that
> all of the other callers of perf_trace_event_init don't bother with
> manipulating a reference count on the tp_events that they create. For
> those callers, the trace_event_put_ref was already a no-op.
>
> Signed-off-by: Krister Johansen <[email protected]>
> Reviewed-by: David Reaver <[email protected]>
> Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
> Cc: [email protected] # 5.15, 5.18, 5.19
> ---
> kernel/trace/trace_event_perf.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a114549720d6..7762bfd268cd 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -151,13 +151,13 @@ static int perf_trace_event_reg(struct trace_event_call *tp_event,
> return ret;
> }
>
> -static void perf_trace_event_unreg(struct perf_event *p_event)
> +static void perf_trace_event_cleanup(struct perf_event *p_event)
> {
> struct trace_event_call *tp_event = p_event->tp_event;
> int i;
>
> if (--tp_event->perf_refcount > 0)
> - goto out;
> + return;
>
> tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
>
> @@ -176,7 +176,13 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
> perf_trace_buf[i] = NULL;
> }
> }
> -out:
> +}
> +
> +static void perf_trace_event_unreg(struct perf_event *p_event)
> +{
> + struct trace_event_call *tp_event = p_event->tp_event;
> +
> + perf_trace_event_cleanup(p_event);
> trace_event_put_ref(tp_event);
> }
>
> @@ -207,7 +213,7 @@ static int perf_trace_event_init(struct trace_event_call *tp_event,
>
> ret = perf_trace_event_open(p_event);
> if (ret) {
> - perf_trace_event_unreg(p_event);
> + perf_trace_event_cleanup(p_event);
The only problem I have with this patch is the loss of symmetry: where
perf_trace_event_reg() succeeds, unreg() should be the function you call
on failure.
Since perf_trace_event_reg() is only called from perf_trace_event_init()
let's move the perf_trace_event_open() into the perf_trace_event_reg() and
have the unreg still do the clean up on failure.
This way we keep the symmetry between *_reg() and *_unreg().
And then the init function will not have to do any clean up, and can just
end with:
return perf_trace_event_reg(tp_event, p_event);
Thanks,
-- Steve
> return ret;
> }
>
Hi Steven,
Thanks for the feedback.
On Tue, Aug 16, 2022 at 12:25:59PM -0400, Steven Rostedt wrote:
>
> [ Added the authors of this code to the Cc ]
Oops, that was an unintentional omission.
> On Fri, 12 Aug 2022 17:02:20 -0700
> Krister Johansen <[email protected]> wrote:
>
> > The code in perf_trace_init takes a reference on a trace_event_call that is
> > looked up as part of the function call. If perf_trace_event_init fails,
> > however, perf_trace_event_unreg can decrement that refcount from underneath
> > perf_trace_init. This means that in some failure cases, perf_trace_init
> > can trigger the WARN in trace_dynevent.c which attempts to guard against
> > zero reference counts going negative.
> >
> > The author can reproduce this problem readily by running perf record in a
> > loop against a series of uprobes with no other users. Killing the record
> > process before it can finish its setup is enough to trigger this WARN
> > within a few seconds.
> >
> > This patch leaves the behavior in perf_trace_event_unreg unchanged, but
> > moves most of the code in that function to perf_trace_event_cleanup. The
> > unreg function retains the ability to drop the refcount on the tp_event,
> > but cleanup does not. This modification is based upon the observation that
> > all of the other callers of perf_trace_event_init don't bother with
> > manipulating a reference count on the tp_events that they create. For
> > those callers, the trace_event_put_ref was already a no-op.
<...>
> > @@ -207,7 +213,7 @@ static int perf_trace_event_init(struct trace_event_call *tp_event,
> >
> > ret = perf_trace_event_open(p_event);
> > if (ret) {
> > - perf_trace_event_unreg(p_event);
> > + perf_trace_event_cleanup(p_event);
>
> The only problem I have with this patch is the loss of symmetry: where
> perf_trace_event_reg() succeeds, unreg() should be the function you call
> on failure.
>
> Since perf_trace_event_reg() is only called from perf_trace_event_init()
> let's move the perf_trace_event_open() into the perf_trace_event_reg() and
> have the unreg still do the clean up on failure.
>
> This way we keep the symmetry between *_reg() and *_unreg().
>
> And then the init function will not have to do any clean up, and can just
> end with:
>
> return perf_trace_event_reg(tp_event, p_event);
I went to implement this suggestion. In the process, I found that I was
potentially creating more asymmetry. There are a number of places in
the teardown path where the code does something like this:
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
In this situation, it seems like I should move the close into the unreg
as well. However, if I do that, then unreg can't be called from reg
until open has been successful.
The original problem that I set out to solve was to prevent
perf_trace_event_unreg() from decrementing a refcount in
perf_trace_event_init(). If I pull unreg() into the failure path in
reg(), then I think I still need to keep that refcount decrement present
for the users that call unreg() from outside of reg().
Looking at the failure handling in perf_trace_event_reg(), it actually
shares a lot of code with perf_trace_event_unreg(), and unifying the two
would be nice. However, if unreg() is called from reg(), that also makes
it difficult for unreg() to be used separately.
If the primary objection is around the naming, I have a couple of
possible ideas:
1. Rename perf_trace_event_cleanup to perf_trace_event_unreg_noref and
use it in perf_trace_event_init. This keeps some of the reg/unreg
symmetry, but makes it clear that it's the unreg that won't touch the
tp_event refcount.
2. Introduce a function that's complementary to perf_trace_event_init
and name it perf_trace_event_destroy. Modify the logic to be something
like the following:
perf_trace_event_init() {
trace_event_try_get_ref(tp_event);
perf_trace_event_perm();
perf_trace_event_reg();
perf_trace_event_open();
}
perf_trace_event_destroy() {
perf_trace_event_close();
perf_trace_event_unreg();
trace_event_put_ref();
}
And then modify the current callers of the close/unreg pair to use
destroy.
What do you think, would either of these be a solution that you would
find acceptable?
-K
On Tue, 16 Aug 2022 11:54:18 -0700
Krister Johansen <[email protected]> wrote:
> What do you think, would either of these be a solution that you would
> find acceptable?
Actually, I think the issue is that unreg() calls trace_event_put_ref(),
but reg() does not do the try_get_ref(); perf_trace_init() does. That is
the broken symmetry.
I think if we pull out the trace_event_put_ref() from the unreg() function,
we fix the bug and also put back the symmetry.
Does this patch work?
[ Untested, and not even compiled ]
-- Steve
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a114549720d6..61e3a2620fa3 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -157,7 +157,7 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
int i;
if (--tp_event->perf_refcount > 0)
- goto out;
+ return;
tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
@@ -176,8 +176,6 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
perf_trace_buf[i] = NULL;
}
}
-out:
- trace_event_put_ref(tp_event);
}
static int perf_trace_event_open(struct perf_event *p_event)
@@ -241,6 +239,7 @@ void perf_trace_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
}
@@ -292,6 +291,7 @@ void perf_kprobe_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
destroy_local_trace_kprobe(p_event->tp_event);
@@ -347,6 +347,7 @@ void perf_uprobe_destroy(struct perf_event *p_event)
mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+ trace_event_put_ref(p_event->tp_event);
mutex_unlock(&event_mutex);
destroy_local_trace_uprobe(p_event->tp_event);
}
On Tue, Aug 16, 2022 at 03:24:11PM -0400, Steven Rostedt wrote:
> Actually, I think the issue is that unreg() calls trace_event_put_ref(),
> but reg() does not do the try_get_ref(); perf_trace_init() does. That is
> the broken symmetry.
>
> I think if we pull out the trace_event_put_ref() from the unreg() function,
> we fix the bug and also put back the symmetry.
>
> Does this patch work?
Yes, thanks, this change looks correct. It also builds and passes the
reproducer that I included in the original e-mail.
> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a114549720d6..61e3a2620fa3 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -157,7 +157,7 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
> int i;
>
> if (--tp_event->perf_refcount > 0)
> - goto out;
> + return;
>
> tp_event->class->reg(tp_event, TRACE_REG_PERF_UNREGISTER, NULL);
>
> @@ -176,8 +176,6 @@ static void perf_trace_event_unreg(struct perf_event *p_event)
> perf_trace_buf[i] = NULL;
> }
> }
> -out:
> - trace_event_put_ref(tp_event);
> }
>
> static int perf_trace_event_open(struct perf_event *p_event)
> @@ -241,6 +239,7 @@ void perf_trace_destroy(struct perf_event *p_event)
> mutex_lock(&event_mutex);
> perf_trace_event_close(p_event);
> perf_trace_event_unreg(p_event);
> + trace_event_put_ref(p_event->tp_event);
> mutex_unlock(&event_mutex);
> }
>
> @@ -292,6 +291,7 @@ void perf_kprobe_destroy(struct perf_event *p_event)
> mutex_lock(&event_mutex);
> perf_trace_event_close(p_event);
> perf_trace_event_unreg(p_event);
> + trace_event_put_ref(p_event->tp_event);
> mutex_unlock(&event_mutex);
>
> destroy_local_trace_kprobe(p_event->tp_event);
> @@ -347,6 +347,7 @@ void perf_uprobe_destroy(struct perf_event *p_event)
> mutex_lock(&event_mutex);
> perf_trace_event_close(p_event);
> perf_trace_event_unreg(p_event);
> + trace_event_put_ref(p_event->tp_event);
> mutex_unlock(&event_mutex);
> destroy_local_trace_uprobe(p_event->tp_event);
> }
>
If it matters:
Reviewed-by: Krister Johansen <[email protected]>
Tested-by: Krister Johansen <[email protected]>
Thanks again,
-K