2019-09-18 06:22:24

by Yonghong Song

[permalink] [raw]
Subject: Re: [PATCH] staging: tracing/kprobe: filter kprobe based perf event


Adding cc to [email protected] mailing list since this is really
bpf related.

On 9/17/19 10:24 PM, [email protected] wrote:
> From: Jinshan Xiong <[email protected]>
>
> Invoking bpf program only if kprobe based perf_event has been added into
> the percpu list. This is essential to make event tracing for cgroup to work
> properly.

The issue actually exists for bpf programs with kprobe, uprobe,
tracepoint and trace_syscall_enter/exit.

In all these places, bpf program is called regardless of
whether there are perf events or not on this cpu.
This provides bpf programs more opportunities to see
the events. I guess this is by design.
Alexei/Daniel, could you clarify?

This, unfortunately, has a consequence on cgroup.
Currently, perf event cgroup based on filtering
(PERF_FLAG_PID_CGROUP) won't work for bpf programs
with kprobee/uprobe/tracepoint/trace_syscall.
The reason is the same, bpf programs see
more events than what perf has configured.

the overflow interrupt (nmi) based perf_event bpf programs
are not impacted.

Any suggestions on what is the proper way to move
forward?

One way to start to honor events only permitted by perf
like what this patch did. But I am not sure whether this
contradicts the original motivation for bpf programs
to see all events regardless.

Another way is to do filtering inside bpf program.
We already have bpf_get_cgroup_id() helper.
We may need another helper to check whether the current
is (a descendant of) another cgroup as it is often the cases
to trace all processes under one parent cgroup which may have many
child cgroups.

>
> Signed-off-by: Jinshan Xiong <[email protected]>
> ---
> kernel/trace/trace_kprobe.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 9d483ad9bb6c..40ef0f1945f7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1171,11 +1171,18 @@ static int
> kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
> {
> struct trace_event_call *call = trace_probe_event_call(&tk->tp);
> + struct hlist_head *head = this_cpu_ptr(call->perf_events);
> struct kprobe_trace_entry_head *entry;
> - struct hlist_head *head;
> int size, __size, dsize;
> int rctx;
>
> + /*
> + * If head is empty, the process currently running on this cpu is not
> + * interested by kprobe perf PMU.
> + */
> + if (hlist_empty(head))
> + return 0;
> +
> if (bpf_prog_array_valid(call)) {
> unsigned long orig_ip = instruction_pointer(regs);
> int ret;
> @@ -1193,10 +1200,6 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
> return 0;
> }
>
> - head = this_cpu_ptr(call->perf_events);
> - if (hlist_empty(head))
> - return 0;
> -
> dsize = __get_data_size(&tk->tp, regs);
> __size = sizeof(*entry) + tk->tp.size + dsize;
> size = ALIGN(__size + sizeof(u32), sizeof(u64));
>


2019-09-18 14:37:48

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH] staging: tracing/kprobe: filter kprobe based perf event

On Wed, Sep 18, 2019 at 05:51:10AM +0000, Yonghong Song wrote:
>
> Adding cc to [email protected] mailing list since this is really
> bpf related.
>
> On 9/17/19 10:24 PM, [email protected] wrote:
> > From: Jinshan Xiong <[email protected]>
> >
> > Invoking bpf program only if kprobe based perf_event has been added into
> > the percpu list. This is essential to make event tracing for cgroup to work
> > properly.
>
> The issue actually exists for bpf programs with kprobe, uprobe,
> tracepoint and trace_syscall_enter/exit.
>
> In all these places, bpf program is called regardless of
> whether there are perf events or not on this cpu.
> This provides bpf programs more opportunities to see
> the events. I guess this is by design.
> Alexei/Daniel, could you clarify?

Yes. It is by design.
When bpf is attached to kprobe it will fire on all cpus.
Per-cpu or per-task or per-cgroup filtering is already done
inside bpf programs.
We cannot make this change now it will break all users.