2023-12-04 20:14:48

by Kyle Huey

[permalink] [raw]
Subject: [PATCH 0/2] Combine perf and bpf for fast eval of hw breakpoint conditions

rr, a userspace record and replay debugger[0], replays asynchronous events
such as signals and context switches by essentially[1] setting a breakpoint
at the address where the asynchronous event was delivered during recording
with a condition that the program state matches the state when the event
was delivered.

Currently, rr uses software breakpoints that trap (via ptrace) to the
supervisor, and evaluates the condition from the supervisor. If the
asynchronous event is delivered in a tight loop (thus requiring the
breakpoint condition to be repeatedly evaluated) the overhead can be
immense. A patch to rr that uses hardware breakpoints via perf events with
an attached bpf program to reject breakpoint hits where the condition is
not satisfied reduces rr's replay overhead by 94% on a pathological (but a
real customer-provided, not contrived) rr trace.

The only obstacle to this approach is that while the kernel allows a bpf
program to suppress sample output when a perf event overflows it does not
suppress signalling the perf event fd. This appears to be a simple
oversight in the code. This patch set fixes that oversight and adds a
selftest.

[0] https://rr-project.org/
[1] Various optimizations exist to skip as much as execution as possible
before setting a breakpoint, and to determine a set of program state that
is practical to check and verify.



2023-12-04 20:14:51

by Kyle Huey

[permalink] [raw]
Subject: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

Returning zero from a bpf program attached to a perf event already
suppresses any data output. This allows it to suppress I/O availability
signals too.

Signed-off-by: Kyle Huey <[email protected]>
---
kernel/events/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b704d83a28b2..34d7b19d45eb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
rcu_read_unlock();
out:
__this_cpu_dec(bpf_prog_active);
- if (!ret)
+ if (!ret) {
+ event->pending_kill = 0;
return;
+ }

event->orig_overflow_handler(event, data, regs);
}
--
2.34.1

2023-12-04 22:19:16

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
>
> Returning zero from a bpf program attached to a perf event already
> suppresses any data output. This allows it to suppress I/O availability
> signals too.

make sense, just one question below

>
> Signed-off-by: Kyle Huey <[email protected]>
> ---
> kernel/events/core.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b704d83a28b2..34d7b19d45eb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> rcu_read_unlock();
> out:
> __this_cpu_dec(bpf_prog_active);
> - if (!ret)
> + if (!ret) {
> + event->pending_kill = 0;
> return;
> + }

What's the distinction between event->pending_kill and
event->pending_wakeup? Should we do something about pending_wakeup?
Asking out of complete ignorance of all these perf specifics.


>
> event->orig_overflow_handler(event, data, regs);
> }
> --
> 2.34.1
>
>

2023-12-05 11:16:55

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Mon, Dec 04, 2023 at 02:18:49PM -0800, Andrii Nakryiko wrote:
> On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
> >
> > Returning zero from a bpf program attached to a perf event already
> > suppresses any data output. This allows it to suppress I/O availability
> > signals too.
>
> make sense, just one question below
>
> >
> > Signed-off-by: Kyle Huey <[email protected]>

Acked-by: Jiri Olsa <[email protected]>

> > ---
> > kernel/events/core.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index b704d83a28b2..34d7b19d45eb 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> > rcu_read_unlock();
> > out:
> > __this_cpu_dec(bpf_prog_active);
> > - if (!ret)
> > + if (!ret) {
> > + event->pending_kill = 0;
> > return;
> > + }
>
> What's the distinction between event->pending_kill and
> event->pending_wakeup? Should we do something about pending_wakeup?
> Asking out of complete ignorance of all these perf specifics.
>

I think zeroing pending_kill is enough.. when it's set the perf code
sets pending_wakeup to call perf_event_wakeup in irq code that wakes
up event's ring buffer readers and sends sigio if pending_kill is set

jirka

>
> >
> > event->orig_overflow_handler(event, data, regs);
> > }
> > --
> > 2.34.1
> >
> >

2023-12-05 18:08:10

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

Hello,

Add Marco Elver to CC.

On Tue, Dec 5, 2023 at 3:16 AM Jiri Olsa <[email protected]> wrote:
>
> On Mon, Dec 04, 2023 at 02:18:49PM -0800, Andrii Nakryiko wrote:
> > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
> > >
> > > Returning zero from a bpf program attached to a perf event already
> > > suppresses any data output. This allows it to suppress I/O availability
> > > signals too.
> >
> > make sense, just one question below
> >
> > >
> > > Signed-off-by: Kyle Huey <[email protected]>
>
> Acked-by: Jiri Olsa <[email protected]>
>
> > > ---
> > > kernel/events/core.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index b704d83a28b2..34d7b19d45eb 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> > > rcu_read_unlock();
> > > out:
> > > __this_cpu_dec(bpf_prog_active);
> > > - if (!ret)
> > > + if (!ret) {
> > > + event->pending_kill = 0;
> > > return;
> > > + }
> >
> > What's the distinction between event->pending_kill and
> > event->pending_wakeup? Should we do something about pending_wakeup?
> > Asking out of complete ignorance of all these perf specifics.
> >
>
> I think zeroing pending_kill is enough.. when it's set the perf code
> sets pending_wakeup to call perf_event_wakeup in irq code that wakes
> up event's ring buffer readers and sends sigio if pending_kill is set

Right, IIUC pending_wakeup is set by the ring buffer code when
a task is waiting for events and it gets enough events (watermark).
So I think it's good for ring buffer to manage the pending_wakeup.

And pending_kill is set when a task wants a signal delivery even
without getting enough events. Clearing pending_kill looks ok
to suppress normal signals but I'm not sure if it's ok for SIGTRAP.

If we want to handle returning 0 from bpf as if the event didn't
happen, I think SIGTRAP and event_limit logic should be done
after the overflow handler depending on pending_kill or something.

Thanks,
Namhyung

2023-12-05 18:17:32

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Tue, 5 Dec 2023 at 19:07, Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> Add Marco Elver to CC.
>
> On Tue, Dec 5, 2023 at 3:16 AM Jiri Olsa <[email protected]> wrote:
> >
> > On Mon, Dec 04, 2023 at 02:18:49PM -0800, Andrii Nakryiko wrote:
> > > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
> > > >
> > > > Returning zero from a bpf program attached to a perf event already
> > > > suppresses any data output. This allows it to suppress I/O availability
> > > > signals too.
> > >
> > > make sense, just one question below
> > >
> > > >
> > > > Signed-off-by: Kyle Huey <[email protected]>
> >
> > Acked-by: Jiri Olsa <[email protected]>
> >
> > > > ---
> > > > kernel/events/core.c | 4 +++-
> > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index b704d83a28b2..34d7b19d45eb 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> > > > rcu_read_unlock();
> > > > out:
> > > > __this_cpu_dec(bpf_prog_active);
> > > > - if (!ret)
> > > > + if (!ret) {
> > > > + event->pending_kill = 0;
> > > > return;
> > > > + }
> > >
> > > What's the distinction between event->pending_kill and
> > > event->pending_wakeup? Should we do something about pending_wakeup?
> > > Asking out of complete ignorance of all these perf specifics.
> > >
> >
> > I think zeroing pending_kill is enough.. when it's set the perf code
> > sets pending_wakeup to call perf_event_wakeup in irq code that wakes
> > up event's ring buffer readers and sends sigio if pending_kill is set
>
> Right, IIUC pending_wakeup is set by the ring buffer code when
> a task is waiting for events and it gets enough events (watermark).
> So I think it's good for ring buffer to manage the pending_wakeup.
>
> And pending_kill is set when a task wants a signal delivery even
> without getting enough events. Clearing pending_kill looks ok
> to suppress normal signals but I'm not sure if it's ok for SIGTRAP.
>
> If we want to handle returning 0 from bpf as if the event didn't
> happen, I think SIGTRAP and event_limit logic should be done
> after the overflow handler depending on pending_kill or something.

I'm not sure which kernel version this is for, but in recent kernels,
the SIGTRAP logic was changed to no longer "abuse" event_limit, and
uses its own "pending_sigtrap" + "pending_work" (on reschedule
transitions).

Thanks,
-- Marco

2023-12-05 18:23:57

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Tue, Dec 5, 2023 at 10:17 AM Marco Elver <[email protected]> wrote:
>
> On Tue, 5 Dec 2023 at 19:07, Namhyung Kim <[email protected]> wrote:
> >
> > Hello,
> >
> > Add Marco Elver to CC.
> >
> > On Tue, Dec 5, 2023 at 3:16 AM Jiri Olsa <[email protected]> wrote:
> > >
> > > On Mon, Dec 04, 2023 at 02:18:49PM -0800, Andrii Nakryiko wrote:
> > > > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
> > > > >
> > > > > Returning zero from a bpf program attached to a perf event already
> > > > > suppresses any data output. This allows it to suppress I/O availability
> > > > > signals too.
> > > >
> > > > make sense, just one question below
> > > >
> > > > >
> > > > > Signed-off-by: Kyle Huey <[email protected]>
> > >
> > > Acked-by: Jiri Olsa <[email protected]>
> > >
> > > > > ---
> > > > > kernel/events/core.c | 4 +++-
> > > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > > index b704d83a28b2..34d7b19d45eb 100644
> > > > > --- a/kernel/events/core.c
> > > > > +++ b/kernel/events/core.c
> > > > > @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> > > > > rcu_read_unlock();
> > > > > out:
> > > > > __this_cpu_dec(bpf_prog_active);
> > > > > - if (!ret)
> > > > > + if (!ret) {
> > > > > + event->pending_kill = 0;
> > > > > return;
> > > > > + }
> > > >
> > > > What's the distinction between event->pending_kill and
> > > > event->pending_wakeup? Should we do something about pending_wakeup?
> > > > Asking out of complete ignorance of all these perf specifics.
> > > >
> > >
> > > I think zeroing pending_kill is enough.. when it's set the perf code
> > > sets pending_wakeup to call perf_event_wakeup in irq code that wakes
> > > up event's ring buffer readers and sends sigio if pending_kill is set
> >
> > Right, IIUC pending_wakeup is set by the ring buffer code when
> > a task is waiting for events and it gets enough events (watermark).
> > So I think it's good for ring buffer to manage the pending_wakeup.
> >
> > And pending_kill is set when a task wants a signal delivery even
> > without getting enough events. Clearing pending_kill looks ok
> > to suppress normal signals but I'm not sure if it's ok for SIGTRAP.
> >
> > If we want to handle returning 0 from bpf as if the event didn't
> > happen, I think SIGTRAP and event_limit logic should be done
> > after the overflow handler depending on pending_kill or something.
>
> I'm not sure which kernel version this is for, but in recent kernels,
> the SIGTRAP logic was changed to no longer "abuse" event_limit, and
> uses its own "pending_sigtrap" + "pending_work" (on reschedule
> transitions).
>
> Thanks,
> -- Marco

The patch was prepared against a 6.7 release candidate.

- Kyle

2023-12-05 18:26:48

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Tue, Dec 5, 2023 at 10:17 AM Marco Elver <[email protected]> wrote:
>
> On Tue, 5 Dec 2023 at 19:07, Namhyung Kim <[email protected]> wrote:
> > If we want to handle returning 0 from bpf as if the event didn't
> > happen, I think SIGTRAP and event_limit logic should be done
> > after the overflow handler depending on pending_kill or something.
>
> I'm not sure which kernel version this is for, but in recent kernels,
> the SIGTRAP logic was changed to no longer "abuse" event_limit, and
> uses its own "pending_sigtrap" + "pending_work" (on reschedule
> transitions).

Oh, I didn't mean SIGTRAP and event_limit together.
Maybe they have an issue separately.

Thanks,
Namhyung

2023-12-05 19:19:55

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH 1/2] perf/bpf: Allow a bpf program to suppress I/O signals.

On Tue, Dec 5, 2023 at 10:07 AM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> Add Marco Elver to CC.
>
> On Tue, Dec 5, 2023 at 3:16 AM Jiri Olsa <[email protected]> wrote:
> >
> > On Mon, Dec 04, 2023 at 02:18:49PM -0800, Andrii Nakryiko wrote:
> > > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey <[email protected]> wrote:
> > > >
> > > > Returning zero from a bpf program attached to a perf event already
> > > > suppresses any data output. This allows it to suppress I/O availability
> > > > signals too.
> > >
> > > make sense, just one question below
> > >
> > > >
> > > > Signed-off-by: Kyle Huey <[email protected]>
> >
> > Acked-by: Jiri Olsa <[email protected]>
> >
> > > > ---
> > > > kernel/events/core.c | 4 +++-
> > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index b704d83a28b2..34d7b19d45eb 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -10417,8 +10417,10 @@ static void bpf_overflow_handler(struct perf_event *event,
> > > > rcu_read_unlock();
> > > > out:
> > > > __this_cpu_dec(bpf_prog_active);
> > > > - if (!ret)
> > > > + if (!ret) {
> > > > + event->pending_kill = 0;
> > > > return;
> > > > + }
> > >
> > > What's the distinction between event->pending_kill and
> > > event->pending_wakeup? Should we do something about pending_wakeup?
> > > Asking out of complete ignorance of all these perf specifics.
> > >
> >
> > I think zeroing pending_kill is enough.. when it's set the perf code
> > sets pending_wakeup to call perf_event_wakeup in irq code that wakes
> > up event's ring buffer readers and sends sigio if pending_kill is set
>
> Right, IIUC pending_wakeup is set by the ring buffer code when
> a task is waiting for events and it gets enough events (watermark).
> So I think it's good for ring buffer to manage the pending_wakeup.
>
> And pending_kill is set when a task wants a signal delivery even
> without getting enough events. Clearing pending_kill looks ok
> to suppress normal signals but I'm not sure if it's ok for SIGTRAP.
>
> If we want to handle returning 0 from bpf as if the event didn't
> happen, I think SIGTRAP and event_limit logic should be done
> after the overflow handler depending on pending_kill or something.

Hmm, yes, perhaps. The SIGTRAP thing (which I was previously unaware
of) would actually be more useful to us than an I/O signal.

I am slightly wary that event_limit appears to have no tests in the kernel tree.

- Kyle

> Thanks,
> Namhyung