rr, a userspace record and replay debugger[0], replays asynchronous events
such as signals and context switches by essentially[1] setting a breakpoint
at the address where the asynchronous event was delivered during recording,
with a condition that the program state matches the state at the time the
event was delivered.
Currently, rr uses software breakpoints that trap (via ptrace) to the
supervisor and evaluates the condition from the supervisor. If the
asynchronous event is delivered in a tight loop (thus requiring the
breakpoint condition to be evaluated repeatedly), the overhead can be
immense. A patch to rr that instead uses hardware breakpoints via perf
events, with an attached BPF program to reject breakpoint hits where the
condition is not satisfied, reduces rr's replay overhead by 94% on a
pathological (but real, customer-provided, not contrived) rr trace.
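
For reference, the userspace side of that approach is roughly the
following (a sketch, not rr's actual code; the traced pid, breakpoint
address, and the fd of an already-loaded BPF_PROG_TYPE_PERF_EVENT
program are assumed to be supplied by the caller):

#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int attach_breakpoint_filter(pid_t tracee, unsigned long bp_addr,
                                    int bpf_prog_fd)
{
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_BREAKPOINT;
        attr.bp_type = HW_BREAKPOINT_X;     /* execute breakpoint */
        attr.bp_addr = bp_addr;
        attr.bp_len = sizeof(long);
        attr.sample_period = 1;             /* overflow on every hit */
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        fd = syscall(SYS_perf_event_open, &attr, tracee, -1, -1, 0);
        if (fd < 0)
                return -1;

        /*
         * The BPF program filters each breakpoint hit. A supervisor
         * would typically also request fasync delivery on the fd
         * (fcntl O_ASYNC/F_SETSIG), omitted here.
         */
        if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}
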
The only obstacle to this approach is that, while the kernel allows a BPF
program to suppress sample output when a perf event overflows, it does not
suppress signalling the perf event fd. This appears to be a simple
oversight in the code. This patch set reorders the overflow handler
callback and the side effects of perf event overflow to allow an overflow
handler to suppress all side effects, changes bpf_overflow_handler() to
suppress those side effects if the BPF program returns zero, and adds a
selftest.
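
For illustration, the kind of filter program this enables might look
like the following (an untested sketch in the style of the BPF
selftests; the checked register and expected value are placeholders,
not the condition rr actually evaluates, and x86-64 is assumed for the
PT_REGS_* macros):

// SPDX-License-Identifier: GPL-2.0
#define __TARGET_ARCH_x86
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

__u64 expected_sp; /* written by the supervisor before attaching */

SEC("perf_event")
int reject_mismatched_hits(struct bpf_perf_event_data *ctx)
{
        /*
         * Returning 0 already drops the sample output; with this series
         * it also suppresses the fd signal, SIGTRAP, and event_limit
         * accounting, so a rejected hit never wakes the supervisor.
         */
        if (PT_REGS_SP(&ctx->regs) != expected_sp)
                return 0;
        return 1;
}

char _license[] SEC("license") = "GPL";
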
The previous version of this patchset can be found at
https://lore.kernel.org/linux-kernel/[email protected]/
Changes since v1:
Patch 1 was added so that a sample suppressed by this mechanism also
neither generates a SIGTRAP nor counts against the event limit.
Patch 2 is v1's patch 1.
Patch 3 is v1's patch 2; it addresses a number of review comments about
the selftest and adds testing for the behavior introduced by patch 1.
[0] https://rr-project.org/
[1] Various optimizations exist to skip as much execution as possible
before setting a breakpoint, and to determine a set of program state that
is practical to check and verify.
The perf subsystem already allows an overflow handler to clear pending_kill
to suppress a fasync signal (although nobody currently does this). Allow an
overflow handler to suppress the other visible side effects of an overflow,
namely event_limit accounting and SIGTRAP generation.
Signed-off-by: Kyle Huey <[email protected]>
---
kernel/events/core.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b704d83a28b2..19fddfc27a4a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9541,6 +9541,12 @@ static int __perf_event_overflow(struct perf_event *event,
*/
event->pending_kill = POLL_IN;
+
+ READ_ONCE(event->overflow_handler)(event, data, regs);
+
+ if (!event->pending_kill)
+ return ret;
+
if (events && atomic_dec_and_test(&event->event_limit)) {
ret = 1;
event->pending_kill = POLL_HUP;
@@ -9584,9 +9590,7 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending_irq);
}
- READ_ONCE(event->overflow_handler)(event, data, regs);
-
- if (*perf_event_fasync(event) && event->pending_kill) {
+ if (*perf_event_fasync(event)) {
event->pending_wakeup = 1;
irq_work_queue(&event->pending_irq);
}
--
2.34.1
On Thu, 7 Dec 2023 at 17:35, Kyle Huey <[email protected]> wrote:
>
> The perf subsystem already allows an overflow handler to clear pending_kill
> to suppress a fasync signal (although nobody currently does this). Allow an
> overflow handler to suppress the other visible side effects of an overflow,
> namely event_limit accounting and SIGTRAP generation.
>
> Signed-off-by: Kyle Huey <[email protected]>
> ---
> kernel/events/core.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b704d83a28b2..19fddfc27a4a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9541,6 +9541,12 @@ static int __perf_event_overflow(struct perf_event *event,
> */
>
> event->pending_kill = POLL_IN;
> +
> + READ_ONCE(event->overflow_handler)(event, data, regs);
> +
> + if (!event->pending_kill)
> + return ret;
It's not at all intuitive that resetting pending_kill to 0 will
suppress everything else, too. There is no relationship between the
fasync signals and SIGTRAP. pending_kill is for the former and
pending_sigtrap is for the latter. One should not affect the other.
A nicer solution would be to properly undo the various pending_*
states (in the case of pending_sigtrap being set it should be enough
to reset pending_sigtrap to 0, and also decrement
event->ctx->nr_pending).
Although I can see why this solution is simpler. Perhaps with enough
comments it might be clearer.
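
To make the undo variant above concrete, it'd roughly be something like
this (untested, only to show the shape of it):

        if (suppressed) {
                if (event->pending_sigtrap) {
                        event->pending_sigtrap = 0;
                        local_dec(&event->ctx->nr_pending);
                }
                event->pending_kill = 0;
                return ret;
        }

with "suppressed" standing in for however the handler reports that it
wants the overflow dropped.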
Preferences?
On Thu, 7 Dec 2023 at 18:47, Kyle Huey <[email protected]> wrote:
>
>
>
> On Thu, Dec 7, 2023, 9:05 AM Marco Elver <[email protected]> wrote:
>>
>> On Thu, 7 Dec 2023 at 17:35, Kyle Huey <[email protected]> wrote:
>> >
>> > The perf subsystem already allows an overflow handler to clear pending_kill
>> > to suppress a fasync signal (although nobody currently does this). Allow an
>> > overflow handler to suppress the other visible side effects of an overflow,
>> > namely event_limit accounting and SIGTRAP generation.
>> >
>> > Signed-off-by: Kyle Huey <[email protected]>
>> > ---
>> > kernel/events/core.c | 10 +++++++---
>> > 1 file changed, 7 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/kernel/events/core.c b/kernel/events/core.c
>> > index b704d83a28b2..19fddfc27a4a 100644
>> > --- a/kernel/events/core.c
>> > +++ b/kernel/events/core.c
>> > @@ -9541,6 +9541,12 @@ static int __perf_event_overflow(struct perf_event *event,
>> > */
>> >
>> > event->pending_kill = POLL_IN;
>> > +
>> > + READ_ONCE(event->overflow_handler)(event, data, regs);
>> > +
>> > + if (!event->pending_kill)
>> > + return ret;
>>
>> It's not at all intuitive that resetting pending_kill to 0 will
>> suppress everything else, too. There is no relationship between the
>> fasync signals and SIGTRAP. pending_kill is for the former and
>> pending_sigtrap is for the latter. One should not affect the other.
>>
>> A nicer solution would be to properly undo the various pending_*
>> states (in the case of pending_sigtrap being set it should be enough
>> to reset pending_sigtrap to 0, and also decrement
>> event->ctx->nr_pending).
>
>
> I don't believe it's possible to correctly undo the event_limit decrement after the fact (if it's e.g. racing with the ioctl that adds to the event limit).
>
>> Although I can see why this solution is simpler. Perhaps with enough
>> comments it might be clearer.
>>
>> Preferences?
>
>
> The cleanest way would probably be to add a return value to the overflow handler function that controls this. It requires changing a bunch of arch specific code on arches I don't have access to though.
Hmm.
Maybe wait for perf maintainers to say what is preferable. (I could
live with just making sure this has no other weird side effects and
more comments.)
Hello,
On Thu, Dec 07, 2023 at 06:53:58PM +0100, Marco Elver wrote:
> On Thu, 7 Dec 2023 at 18:47, Kyle Huey <[email protected]> wrote:
> >
> >
> >
> > On Thu, Dec 7, 2023, 9:05 AM Marco Elver <[email protected]> wrote:
> >>
> >> On Thu, 7 Dec 2023 at 17:35, Kyle Huey <[email protected]> wrote:
> >> >
> >> > The perf subsystem already allows an overflow handler to clear pending_kill
> >> > to suppress a fasync signal (although nobody currently does this). Allow an
> >> > overflow handler to suppress the other visible side effects of an overflow,
> >> > namely event_limit accounting and SIGTRAP generation.
Well, I think it can still hit the throttling logic and generate
a PERF_RECORD_THROTTLE. But it should be rare..
> >> >
> >> > Signed-off-by: Kyle Huey <[email protected]>
> >> > ---
> >> > kernel/events/core.c | 10 +++++++---
> >> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> > index b704d83a28b2..19fddfc27a4a 100644
> >> > --- a/kernel/events/core.c
> >> > +++ b/kernel/events/core.c
> >> > @@ -9541,6 +9541,12 @@ static int __perf_event_overflow(struct perf_event *event,
> >> > */
> >> >
> >> > event->pending_kill = POLL_IN;
> >> > +
> >> > + READ_ONCE(event->overflow_handler)(event, data, regs);
> >> > +
> >> > + if (!event->pending_kill)
> >> > + return ret;
> >>
> >> It's not at all intuitive that resetting pending_kill to 0 will
> >> suppress everything else, too. There is no relationship between the
> >> fasync signals and SIGTRAP. pending_kill is for the former and
> >> pending_sigtrap is for the latter. One should not affect the other.
> >>
> >> A nicer solution would be to properly undo the various pending_*
> >> states (in the case of pending_sigtrap being set it should be enough
> >> to reset pending_sigtrap to 0, and also decrement
> >> event->ctx->nr_pending).
> >
> >
> > I don't believe it's possible to correctly undo the event_limit decrement after the fact (if it's e.g. racing with the ioctl that adds to the event limit).
> >
> >> Although I can see why this solution is simpler. Perhaps with enough
> >> comments it might be clearer.
> >>
> >> Preferences?
> >
> >
> > The cleanest way would probably be to add a return value to the overflow handler function that controls this. It requires changing a bunch of arch specific code on arches I don't have access to though.
>
> Hmm.
>
> Maybe wait for perf maintainers to say what is preferable. (I could
> live with just making sure this has no other weird side effects and
> more comments.)
What if we call the bpf handler directly and check the return value?
Then I think we can also get rid of the original overflow handler.
Something like this (untested..)
Thanks,
Namhyung
---8<---
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e85cd1c0eaf3..1eba6f5bb70b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -809,7 +809,6 @@ struct perf_event {
perf_overflow_handler_t overflow_handler;
void *overflow_handler_context;
#ifdef CONFIG_BPF_SYSCALL
- perf_overflow_handler_t orig_overflow_handler;
struct bpf_prog *prog;
u64 bpf_cookie;
#endif
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c72a41f11af..e1a00646dbbe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9471,6 +9471,12 @@ static inline bool sample_is_allowed(struct perf_event *event, struct pt_regs *r
* Generic event overflow handling, sampling.
*/
+#ifdef CONFIG_BPF_SYSCALL
+static int bpf_overflow_handler(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+#endif
+
static int __perf_event_overflow(struct perf_event *event,
int throttle, struct perf_sample_data *data,
struct pt_regs *regs)
@@ -9487,6 +9493,11 @@ static int __perf_event_overflow(struct perf_event *event,
ret = __perf_event_account_interrupt(event, throttle);
+#ifdef CONFIG_BPF_SYSCALL
+ if (event->prog && bpf_overflow_handler(event, data, regs) == 0)
+ return ret;
+#endif
+
/*
* XXX event_limit might not quite work as expected on inherited
* events
@@ -10346,7 +10357,7 @@ static void perf_event_free_filter(struct perf_event *event)
}
#ifdef CONFIG_BPF_SYSCALL
-static void bpf_overflow_handler(struct perf_event *event,
+static int bpf_overflow_handler(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -10355,7 +10366,7 @@ static void bpf_overflow_handler(struct perf_event *event,
.event = event,
};
struct bpf_prog *prog;
- int ret = 0;
+ int ret = 1;
ctx.regs = perf_arch_bpf_user_pt_regs(regs);
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
@@ -10369,10 +10380,7 @@ static void bpf_overflow_handler(struct perf_event *event,
rcu_read_unlock();
out:
__this_cpu_dec(bpf_prog_active);
- if (!ret)
- return;
-
- event->orig_overflow_handler(event, data, regs);
+ return ret;
}
static int perf_event_set_bpf_handler(struct perf_event *event,
@@ -10408,8 +10416,6 @@ static int perf_event_set_bpf_handler(struct perf_event *event,
event->prog = prog;
event->bpf_cookie = bpf_cookie;
- event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
- WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
return 0;
}
@@ -10420,7 +10426,6 @@ static void perf_event_free_bpf_handler(struct perf_event *event)
if (!prog)
return;
- WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
event->prog = NULL;
bpf_prog_put(prog);
}
@@ -11880,13 +11885,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
overflow_handler = parent_event->overflow_handler;
context = parent_event->overflow_handler_context;
#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
- if (overflow_handler == bpf_overflow_handler) {
+ if (parent_event->prog) {
struct bpf_prog *prog = parent_event->prog;
bpf_prog_inc(prog);
event->prog = prog;
- event->orig_overflow_handler =
- parent_event->orig_overflow_handler;
}
#endif
}
On Thu, Dec 7, 2023 at 2:38 PM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> On Thu, Dec 07, 2023 at 06:53:58PM +0100, Marco Elver wrote:
> > On Thu, 7 Dec 2023 at 18:47, Kyle Huey <[email protected]> wrote:
> > >
> > >
> > >
> > > On Thu, Dec 7, 2023, 9:05 AM Marco Elver <[email protected]> wrote:
> > >>
> > >> On Thu, 7 Dec 2023 at 17:35, Kyle Huey <[email protected]> wrote:
> > >> >
> > >> > The perf subsystem already allows an overflow handler to clear pending_kill
> > >> > to suppress a fasync signal (although nobody currently does this). Allow an
> > >> > overflow handler to suppress the other visible side effects of an overflow,
> > >> > namely event_limit accounting and SIGTRAP generation.
>
> Well, I think it can still hit the throttling logic and generate
> a PERF_RECORD_THROTTLE. But it should be rare..
>
> > >> >
> > >> > Signed-off-by: Kyle Huey <[email protected]>
> > >> > ---
> > >> > kernel/events/core.c | 10 +++++++---
> > >> > 1 file changed, 7 insertions(+), 3 deletions(-)
> > >> >
> > >> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > >> > index b704d83a28b2..19fddfc27a4a 100644
> > >> > --- a/kernel/events/core.c
> > >> > +++ b/kernel/events/core.c
> > >> > @@ -9541,6 +9541,12 @@ static int __perf_event_overflow(struct perf_event *event,
> > >> > */
> > >> >
> > >> > event->pending_kill = POLL_IN;
> > >> > +
> > >> > + READ_ONCE(event->overflow_handler)(event, data, regs);
> > >> > +
> > >> > + if (!event->pending_kill)
> > >> > + return ret;
> > >>
> > >> It's not at all intuitive that resetting pending_kill to 0 will
> > >> suppress everything else, too. There is no relationship between the
> > >> fasync signals and SIGTRAP. pending_kill is for the former and
> > >> pending_sigtrap is for the latter. One should not affect the other.
> > >>
> > >> A nicer solution would be to properly undo the various pending_*
> > >> states (in the case of pending_sigtrap being set it should be enough
> > >> to reset pending_sigtrap to 0, and also decrement
> > >> event->ctx->nr_pending).
> > >
> > >
> > > I don't believe it's possible to correctly undo the event_limit decrement after the fact (if it's e.g. racing with the ioctl that adds to the event limit).
> > >
> > >> Although I can see why this solution is simpler. Perhaps with enough
> > >> comments it might be clearer.
> > >>
> > >> Preferences?
> > >
> > >
> > > The cleanest way would probably be to add a return value to the overflow handler function that controls this. It requires changing a bunch of arch specific code on arches I don't have access to though.
> >
> > Hmm.
> >
> > Maybe wait for perf maintainers to say what is preferable. (I could
> > live with just making sure this has no other weird side effects and
> > more comments.)
>
> What if we call the bpf handler directly and check the return value?
> Then I think we can also get rid of the original overflow handler.
>
> Something like this (untested..)
mmm, that's an interesting idea. I'll experiment with it.
- Kyle
> Thanks,
> Namhyung
>
>
> ---8<---
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e85cd1c0eaf3..1eba6f5bb70b 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -809,7 +809,6 @@ struct perf_event {
> perf_overflow_handler_t overflow_handler;
> void *overflow_handler_context;
> #ifdef CONFIG_BPF_SYSCALL
> - perf_overflow_handler_t orig_overflow_handler;
> struct bpf_prog *prog;
> u64 bpf_cookie;
> #endif
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4c72a41f11af..e1a00646dbbe 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9471,6 +9471,12 @@ static inline bool sample_is_allowed(struct perf_event *event, struct pt_regs *r
> * Generic event overflow handling, sampling.
> */
>
> +#ifdef CONFIG_BPF_SYSCALL
> +static int bpf_overflow_handler(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs);
> +#endif
> +
> static int __perf_event_overflow(struct perf_event *event,
> int throttle, struct perf_sample_data *data,
> struct pt_regs *regs)
> @@ -9487,6 +9493,11 @@ static int __perf_event_overflow(struct perf_event *event,
>
> ret = __perf_event_account_interrupt(event, throttle);
>
> +#ifdef CONFIG_BPF_SYSCALL
> + if (event->prog && bpf_overflow_handler(event, data, regs) == 0)
> + return ret;
> +#endif
> +
> /*
> * XXX event_limit might not quite work as expected on inherited
> * events
> @@ -10346,7 +10357,7 @@ static void perf_event_free_filter(struct perf_event *event)
> }
>
> #ifdef CONFIG_BPF_SYSCALL
> -static void bpf_overflow_handler(struct perf_event *event,
> +static int bpf_overflow_handler(struct perf_event *event,
> struct perf_sample_data *data,
> struct pt_regs *regs)
> {
> @@ -10355,7 +10366,7 @@ static void bpf_overflow_handler(struct perf_event *event,
> .event = event,
> };
> struct bpf_prog *prog;
> - int ret = 0;
> + int ret = 1;
>
> ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> @@ -10369,10 +10380,7 @@ static void bpf_overflow_handler(struct perf_event *event,
> rcu_read_unlock();
> out:
> __this_cpu_dec(bpf_prog_active);
> - if (!ret)
> - return;
> -
> - event->orig_overflow_handler(event, data, regs);
> + return ret;
> }
>
> static int perf_event_set_bpf_handler(struct perf_event *event,
> @@ -10408,8 +10416,6 @@ static int perf_event_set_bpf_handler(struct perf_event *event,
>
> event->prog = prog;
> event->bpf_cookie = bpf_cookie;
> - event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
> - WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
> return 0;
> }
>
> @@ -10420,7 +10426,6 @@ static void perf_event_free_bpf_handler(struct perf_event *event)
> if (!prog)
> return;
>
> - WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
> event->prog = NULL;
> bpf_prog_put(prog);
> }
> @@ -11880,13 +11885,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> overflow_handler = parent_event->overflow_handler;
> context = parent_event->overflow_handler_context;
> #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
> - if (overflow_handler == bpf_overflow_handler) {
> + if (parent_event->prog) {
> struct bpf_prog *prog = parent_event->prog;
>
> bpf_prog_inc(prog);
> event->prog = prog;
> - event->orig_overflow_handler =
> - parent_event->orig_overflow_handler;
> }
> #endif
> }