2022-03-30 06:00:54

by Beau Belgrave

Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

On Tue, Mar 29, 2022 at 12:50:40PM -0700, Alexei Starovoitov wrote:
> On Tue, Mar 29, 2022 at 11:19 AM Beau Belgrave
> <[email protected]> wrote:
> >
> > Send user_event data to attached eBPF programs for user_event based perf
> > events.
> >
> > Add BPF_ITER flag to allow user_event data to have a zero copy path into
> > eBPF programs if required.
> >
> > Update documentation to describe new flags and structures for eBPF
> > integration.
> >
> > Signed-off-by: Beau Belgrave <[email protected]>
>
> The commit describes _what_ it does, but says nothing about _why_.
> At present I see no use out of bpf and user_events connection.
> The whole user_events feature looks redundant to me.
> We have uprobes and usdt. It doesn't look to me that
> user_events provide anything new that wasn't available earlier.

A lot of the why, in general, for user_events is covered in the first
change in the series.
Link: https://lore.kernel.org/all/[email protected]/

The why was also covered in Linux Plumbers Conference 2021 within the
tracing microconference.

An example of why we want user_events:
We have managed code running that emits data out via Open Telemetry.
Since the code is managed (JIT compiled) there is no stable stub
location to patch; it moves.
We watch the Open Telemetry spans in an eBPF program, and when a span
takes too long we collect stack data and perform other actions.
With user_events and perf we can monitor the entire system from the
root container without needing relay agents within each
cgroup/namespace taking up resources.
We also do not need to enter each cgroup mount namespace and determine
the correct patch location, or the right version of each binary, for
processes that use user_events.
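
For illustration, a rough sketch of the user-side flow (simplified;
the event name/payload here are made up, and the registration details
have changed across versions of this series, so treat the struct
layout and ioctl usage as approximate and see the series'
samples/documentation for the real ABI):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/user_events.h>  /* struct user_reg, DIAG_IOCSREG */

int main(void)
{
        int fd = open("/sys/kernel/tracing/user_events_data", O_RDWR);
        struct user_reg reg = {0};
        int span_ms = 250;
        struct iovec io[2];

        reg.size = sizeof(reg);
        reg.name_args = (__u64)"otel_span_end u32 duration_ms";

        if (fd < 0 || ioctl(fd, DIAG_IOCSREG, &reg) < 0) {
                perror("register");
                return 1;
        }

        /* iov[0] selects which registered event this write is for. */
        io[0].iov_base = &reg.write_index;
        io[0].iov_len = sizeof(reg.write_index);
        io[1].iov_base = &span_ms;
        io[1].iov_len = sizeof(span_ms);

        /* Real code checks whether the event is enabled first. */
        writev(fd, io, 2);
        close(fd);
        return 0;
}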

An example of why we want eBPF integration:
We also have scenarios where we are decoding the data live.
Having the user_event data fed directly to eBPF lets us cast the
incoming payload to a struct and decode it very quickly to determine
if something is wrong.
We can then put that data into maps to perform further aggregation as
required.
We also have scenarios with "skid" problems, where we need to grab
further data exactly while the process that hit the problem is still
running.
eBPF lets us do all of this, which we cannot easily do otherwise.
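
As a sketch of that decode-and-aggregate pattern (the program type,
the context below, and the otel_span layout are all placeholders and
not the interface added by this patch; only the map handling and
helper calls are standard BPF):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Payload layout the emitting process and this program agree on. */
struct otel_span {
        __u32 trace_id;
        __u32 duration_ms;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);
        __type(value, __u64);
} slow_spans SEC(".maps");

/* Stand-in context: assume the attach point hands the program the raw
 * user_event payload and its length in some form (copied, or via the
 * zero-copy path this patch proposes). */
struct fake_user_event_ctx {
        void *data;
        __u32 data_len;
};

SEC("perf_event")       /* section/program type also a placeholder */
int watch_spans(struct fake_user_event_ctx *ctx)
{
        struct otel_span span = {};
        __u64 one = 1, *cnt;

        if (ctx->data_len < sizeof(span))
                return 0;
        if (bpf_probe_read_user(&span, sizeof(span), ctx->data))
                return 0;

        /* Cheap in-kernel filter: only slow spans touch the map. */
        if (span.duration_ms < 100)
                return 0;

        cnt = bpf_map_lookup_elem(&slow_spans, &span.trace_id);
        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        else
                bpf_map_update_elem(&slow_spans, &span.trace_id, &one,
                                    BPF_ANY);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";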

Another benefit of user_events is that the tracing is much faster than
uprobes or other mechanisms that rely on int 3 traps. This is critical
for us to be able to enable it on production systems.

Thanks,
-Beau


2022-03-30 12:18:16

by Alexei Starovoitov

Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

On Tue, Mar 29, 2022 at 1:11 PM Beau Belgrave <[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 12:50:40PM -0700, Alexei Starovoitov wrote:
> > On Tue, Mar 29, 2022 at 11:19 AM Beau Belgrave
> > <[email protected]> wrote:
> > >
> > > Send user_event data to attached eBPF programs for user_event based perf
> > > events.
> > >
> > > Add BPF_ITER flag to allow user_event data to have a zero copy path into
> > > eBPF programs if required.
> > >
> > > Update documentation to describe new flags and structures for eBPF
> > > integration.
> > >
> > > Signed-off-by: Beau Belgrave <[email protected]>
> >
> > The commit describes _what_ it does, but says nothing about _why_.
> > At present I see no use out of bpf and user_events connection.
> > The whole user_events feature looks redundant to me.
> > We have uprobes and usdt. It doesn't look to me that
> > user_events provide anything new that wasn't available earlier.
>
> A lot of the why, in general, for user_events is covered in the first
> change in the series.
> Link: https://lore.kernel.org/all/[email protected]/
>
> The why was also covered in Linux Plumbers Conference 2021 within the
> tracing microconference.
>
> An example of why we want user_events:
> Managed code running that emits data out via Open Telemetry.
> Since it's managed there isn't a stub location to patch, it moves.
> We watch the Open Telemetry spans in an eBPF program, when a span takes
> too long we collect stack data and perform other actions.
> With user_events and perf we can monitor the entire system from the root
> container without having to have relay agents within each
> cgroup/namespace taking up resources.
> We do not need to enter each cgroup mnt space and determine the correct
> patch location or the right version of each binary for processes that
> use user_events.
>
> An example of why we want eBPF integration:
> We also have scenarios where we are live decoding the data quickly.
> Having user_data fed directly to eBPF lets us cast the data coming in to
> a struct and decode very very quickly to determine if something is
> wrong.
> We can take that data quickly and put it into maps to perform further
> aggregation as required.
> We have scenarios that have "skid" problems, where we need to grab
> further data exactly when the process that had the problem was running.
> eBPF lets us do all of this that we cannot easily do otherwise.
>
> Another benefit from user_events is the tracing is much faster than
> uprobes or others using int 3 traps. This is critical to us to enable on
> production systems.

None of it makes sense to me.
To take advantage of user_events, user space has to be modified
and writev syscalls inserted.
This is not cheap, and I cannot see a production system using this interface.
All you did is a poor man's version of lttng, which doesn't rely
on such heavy instrumentation.

2022-03-30 12:19:42

by Beau Belgrave

Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

On Tue, Mar 29, 2022 at 03:31:31PM -0700, Alexei Starovoitov wrote:
> On Tue, Mar 29, 2022 at 1:11 PM Beau Belgrave <[email protected]> wrote:
> >
> > On Tue, Mar 29, 2022 at 12:50:40PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Mar 29, 2022 at 11:19 AM Beau Belgrave
> > > <[email protected]> wrote:
> > > >
> > > > Send user_event data to attached eBPF programs for user_event based perf
> > > > events.
> > > >
> > > > Add BPF_ITER flag to allow user_event data to have a zero copy path into
> > > > eBPF programs if required.
> > > >
> > > > Update documentation to describe new flags and structures for eBPF
> > > > integration.
> > > >
> > > > Signed-off-by: Beau Belgrave <[email protected]>
> > >
> > > The commit describes _what_ it does, but says nothing about _why_.
> > > At present I see no use out of bpf and user_events connection.
> > > The whole user_events feature looks redundant to me.
> > > We have uprobes and usdt. It doesn't look to me that
> > > user_events provide anything new that wasn't available earlier.
> >
> > A lot of the why, in general, for user_events is covered in the first
> > change in the series.
> > Link: https://lore.kernel.org/all/[email protected]/
> >
> > The why was also covered in Linux Plumbers Conference 2021 within the
> > tracing microconference.
> >
> > An example of why we want user_events:
> > Managed code running that emits data out via Open Telemetry.
> > Since it's managed there isn't a stub location to patch, it moves.
> > We watch the Open Telemetry spans in an eBPF program, when a span takes
> > too long we collect stack data and perform other actions.
> > With user_events and perf we can monitor the entire system from the root
> > container without having to have relay agents within each
> > cgroup/namespace taking up resources.
> > We do not need to enter each cgroup mnt space and determine the correct
> > patch location or the right version of each binary for processes that
> > use user_events.
> >
> > An example of why we want eBPF integration:
> > We also have scenarios where we are live decoding the data quickly.
> > Having user_data fed directly to eBPF lets us cast the data coming in to
> > a struct and decode very very quickly to determine if something is
> > wrong.
> > We can take that data quickly and put it into maps to perform further
> > aggregation as required.
> > We have scenarios that have "skid" problems, where we need to grab
> > further data exactly when the process that had the problem was running.
> > eBPF lets us do all of this that we cannot easily do otherwise.
> >
> > Another benefit from user_events is the tracing is much faster than
> > uprobes or others using int 3 traps. This is critical to us to enable on
> > production systems.
>
> None of it makes sense to me.

Sorry.

> To take advantage of user_events user space has to be modified
> and writev syscalls inserted.

Yes, both user_events and lttng require user space modifications to do
tracing correctly. The syscall overheads are real, and the cost depends
on which spectre/meltdown mitigations are in effect.

> This is not cheap and I cannot see a production system using this interface.

But you are fine with the cost of uprobes? They appear to be much more
costly than a syscall-based approach on the hardware I've run on.
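
To put rough numbers on that on a given machine, a trivial harness
like the one below is enough (it times the writev() round trip to an
already registered user_event fd, as in the earlier registration
sketch; absolute numbers vary a lot with the mitigations in effect):

#include <stdio.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* fd/write_index come from the registration step shown earlier. */
void bench_writev(int fd, int write_index, int iterations)
{
        int payload = 42;
        struct iovec io[2] = {
                { &write_index, sizeof(write_index) },
                { &payload, sizeof(payload) },
        };
        long long start = now_ns();

        for (int i = 0; i < iterations; i++)
                writev(fd, io, 2);

        printf("avg %lld ns per writev\n",
               (now_ns() - start) / iterations);
}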

> All you did is a poor man version of lttng that doesn't rely
> on such heavy instrumentation.

Well, I am a frugal person. :)

This work has solved some critical issues we've been having, and I would
appreciate a review of the code if possible.

Thanks,
-Beau

2022-03-30 18:12:45

by Masami Hiramatsu

Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

On Tue, 29 Mar 2022 15:31:31 -0700
Alexei Starovoitov <[email protected]> wrote:

> On Tue, Mar 29, 2022 at 1:11 PM Beau Belgrave <[email protected]> wrote:
> >
> > On Tue, Mar 29, 2022 at 12:50:40PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Mar 29, 2022 at 11:19 AM Beau Belgrave
> > > <[email protected]> wrote:
> > > >
> > > > Send user_event data to attached eBPF programs for user_event based perf
> > > > events.
> > > >
> > > > Add BPF_ITER flag to allow user_event data to have a zero copy path into
> > > > eBPF programs if required.
> > > >
> > > > Update documentation to describe new flags and structures for eBPF
> > > > integration.
> > > >
> > > > Signed-off-by: Beau Belgrave <[email protected]>
> > >
> > > The commit describes _what_ it does, but says nothing about _why_.
> > > At present I see no use out of bpf and user_events connection.
> > > The whole user_events feature looks redundant to me.
> > > We have uprobes and usdt. It doesn't look to me that
> > > user_events provide anything new that wasn't available earlier.
> >
> > A lot of the why, in general, for user_events is covered in the first
> > change in the series.
> > Link: https://lore.kernel.org/all/[email protected]/
> >
> > The why was also covered in Linux Plumbers Conference 2021 within the
> > tracing microconference.
> >
> > An example of why we want user_events:
> > Managed code running that emits data out via Open Telemetry.
> > Since it's managed there isn't a stub location to patch, it moves.
> > We watch the Open Telemetry spans in an eBPF program, when a span takes
> > too long we collect stack data and perform other actions.
> > With user_events and perf we can monitor the entire system from the root
> > container without having to have relay agents within each
> > cgroup/namespace taking up resources.
> > We do not need to enter each cgroup mnt space and determine the correct
> > patch location or the right version of each binary for processes that
> > use user_events.
> >
> > An example of why we want eBPF integration:
> > We also have scenarios where we are live decoding the data quickly.
> > Having user_data fed directly to eBPF lets us cast the data coming in to
> > a struct and decode very very quickly to determine if something is
> > wrong.
> > We can take that data quickly and put it into maps to perform further
> > aggregation as required.
> > We have scenarios that have "skid" problems, where we need to grab
> > further data exactly when the process that had the problem was running.
> > eBPF lets us do all of this that we cannot easily do otherwise.
> >
> > Another benefit from user_events is the tracing is much faster than
> > uprobes or others using int 3 traps. This is critical to us to enable on
> > production systems.
>
> None of it makes sense to me.
> To take advantage of user_events user space has to be modified
> and writev syscalls inserted.

That can be done by introducing new user SDT macros. The current ones
are expected to be used with uprobes (they just record a list of probe
addresses and semaphores in an ELF note section), but we could provide
another, lighter implementation backed by user_events.
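
Just to sketch the idea (entirely hypothetical macro and variable
names; the real systemtap-sdt macros only emit a nop plus an ELF note
and an optional semaphore):

#include <sys/uio.h>
#include <unistd.h>

/* Assumed to be set up once at registration time by the library:
 * user_events_fd, <event>_enabled and <event>_index. */
extern int user_events_fd;

#define LIGHT_TRACE2(event, val1, val2)                                 \
        do {                                                            \
                if (event##_enabled) {                                  \
                        struct iovec __io[3] = {                        \
                                { &event##_index, sizeof(int) },        \
                                { &(val1), sizeof(val1) },              \
                                { &(val2), sizeof(val2) },              \
                        };                                              \
                        writev(user_events_fd, __io, 3);                \
                }                                                       \
        } while (0)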

> This is not cheap and I cannot see a production system using this interface.

I agree with this point. At the very least this needs to be paired with
a user-space library so that applications can use it. But I also think
that a new feature does not always require an existing production
system that relies on it, since that would mean such a production
system had to run an out-of-tree custom kernel. That should be avoided
from the upstream-first policy viewpoint. (However, I would like to
know the actual use case.)

> All you did is a poor man version of lttng that doesn't rely
> on such heavy instrumentation.

Isn't it reasonable to avoid using heavy instrumentation? :-)

Thank you,

--
Masami Hiramatsu <[email protected]>

2022-03-31 04:50:13

by Song Liu

Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

On Tue, Mar 29, 2022 at 4:11 PM Beau Belgrave <[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 03:31:31PM -0700, Alexei Starovoitov wrote:
> > On Tue, Mar 29, 2022 at 1:11 PM Beau Belgrave <[email protected]> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 12:50:40PM -0700, Alexei Starovoitov wrote:
> > > > On Tue, Mar 29, 2022 at 11:19 AM Beau Belgrave
> > > > <[email protected]> wrote:
> > > > >
> > > > > Send user_event data to attached eBPF programs for user_event based perf
> > > > > events.
> > > > >
> > > > > Add BPF_ITER flag to allow user_event data to have a zero copy path into
> > > > > eBPF programs if required.
> > > > >
> > > > > Update documentation to describe new flags and structures for eBPF
> > > > > integration.
> > > > >
> > > > > Signed-off-by: Beau Belgrave <[email protected]>
> > > >
> > > > The commit describes _what_ it does, but says nothing about _why_.
> > > > At present I see no use out of bpf and user_events connection.
> > > > The whole user_events feature looks redundant to me.
> > > > We have uprobes and usdt. It doesn't look to me that
> > > > user_events provide anything new that wasn't available earlier.
> > >
> > > A lot of the why, in general, for user_events is covered in the first
> > > change in the series.
> > > Link: https://lore.kernel.org/all/[email protected]/
> > >
> > > The why was also covered in Linux Plumbers Conference 2021 within the
> > > tracing microconference.
> > >
> > > An example of why we want user_events:
> > > Managed code running that emits data out via Open Telemetry.
> > > Since it's managed there isn't a stub location to patch, it moves.
> > > We watch the Open Telemetry spans in an eBPF program, when a span takes
> > > too long we collect stack data and perform other actions.
> > > With user_events and perf we can monitor the entire system from the root
> > > container without having to have relay agents within each
> > > cgroup/namespace taking up resources.
> > > We do not need to enter each cgroup mnt space and determine the correct
> > > patch location or the right version of each binary for processes that
> > > use user_events.
> > >
> > > An example of why we want eBPF integration:
> > > We also have scenarios where we are live decoding the data quickly.
> > > Having user_data fed directly to eBPF lets us cast the data coming in to
> > > a struct and decode very very quickly to determine if something is
> > > wrong.
> > > We can take that data quickly and put it into maps to perform further
> > > aggregation as required.
> > > We have scenarios that have "skid" problems, where we need to grab
> > > further data exactly when the process that had the problem was running.
> > > eBPF lets us do all of this that we cannot easily do otherwise.
> > >
> > > Another benefit from user_events is the tracing is much faster than
> > > uprobes or others using int 3 traps. This is critical to us to enable on
> > > production systems.
> >
> > None of it makes sense to me.
>
> Sorry.
>
> > To take advantage of user_events user space has to be modified
> > and writev syscalls inserted.
>
> Yes, both user_events and lttng require user space modifications to do
> tracing correctly. The syscall overheads are real, and the cost depends
> on the mitigations around spectre/meltdown.
>
> > This is not cheap and I cannot see a production system using this interface.
>
> But you are fine with uprobe costs? uprobes appear to be much more costly
> than a syscall approach on the hardware I've run on.

Can we achieve the same/similar performance with sys_bpf(BPF_PROG_RUN)?
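
(For reference, roughly what driving a loaded program that way looks
like from user space via the raw bpf(2) test-run command; the payload
is whatever layout the program expects, and for some program types the
input goes in attr.test.ctx_in instead of data_in:)

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>

/* Run an already loaded BPF program (prog_fd from BPF_PROG_LOAD) on a
 * caller-supplied payload via BPF_PROG_TEST_RUN (aka BPF_PROG_RUN). */
int run_prog(int prog_fd, void *payload, __u32 len)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.test.prog_fd = prog_fd;
        attr.test.data_in = (__u64)(unsigned long)payload;
        attr.test.data_size_in = len;
        attr.test.repeat = 1;

        return syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
}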

Thanks,
Song

>
> > All you did is a poor man version of lttng that doesn't rely
> > on such heavy instrumentation.
>
> Well I am a frugal person. :)
>
> This work has solved some critical issues we've been having, and I would
> appreciate a review of the code if possible.
>
> Thanks,
> -Beau