2017-12-14 20:21:24

by Teng Qin

Subject: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

This set of commits attempts to improve three scheduler related
Tracepoints: sched_switch, sched_process_fork, sched_process_exit.

Firstly, these commits add additional flag values, namely preempt,
clone_flags and group_dead, to these Tracepoints, to make the information
exposed via the Tracepoints more useful and complete.

Secondly, these commits expose task_struct pointers in these
Tracepoints. The task_struct pointers are arguments of the Tracepoints
and are currently only used to compute struct field values. But for BPF
programs attached to these Tracepoints, we may want to read additional
task information via the task_struct pointers. This is currently either
impossible, or we have to make an assumption about whether the Tracepoint
is running in the context of the previous / parent or the next / child
task and use the current pointer instead. Exposing the task_struct
pointers explicitly makes such use cases easier and more reliable.
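
For illustration only (this sketch is not part of the series; the context
struct layout and helper headers below are assumptions), a BPF program
attached to sched_switch could then read extra task information roughly
like this:

#include <linux/sched.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"  /* assumed: provides SEC() and bpf_probe_read() */

/* Mirrors the updated sched_switch entry layout from patch 1 (assumed). */
struct sched_switch_args {
	unsigned long long pad;          /* common tracepoint fields */
	char prev_comm[16];
	int prev_pid;
	int prev_prio;
	long long prev_state;
	char next_comm[16];
	int next_pid;
	int next_prio;
	unsigned char preempt;           /* new: bool preempt */
	void *prev_task;                 /* new: struct task_struct *prev */
	void *next_task;                 /* new: struct task_struct *next */
};

SEC("tracepoint/sched/sched_switch")
int trace_switch(struct sched_switch_args *ctx)
{
	struct task_struct *next = (struct task_struct *)ctx->next_task;
	unsigned long nvcsw = 0;

	/* Any task_struct field of either task is reachable, without
	 * guessing whether the probe runs on behalf of prev or next. */
	bpf_probe_read(&nvcsw, sizeof(nvcsw), &next->nvcsw);
	return 0;
}

char _license[] SEC("license") = "GPL";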

Teng Qin (3):
Improve sched_switch Tracepoint
Improve sched_process_fork Tracepoint
Improve sched_process_exit Tracepoint

include/trace/events/sched.h | 54 ++++++++++++++++++++++++++++++++++++--------
kernel/exit.c | 2 +-
kernel/fork.c | 2 +-
3 files changed, 46 insertions(+), 12 deletions(-)

--
2.9.5


2017-12-14 20:20:59

by Teng Qin

Subject: [PATCH tip 1/3] Improve sched_switch Tracepoint

This commit adds the value of the preempt flag to sched_switch's Tracepoint
struct. The value of the flag is already passed into the Tracepoint as an
argument but is currently only used to compute the previous task's state.
Exposing the value is useful when debugging contention issues.

This commit also exposes pointers to the previous and next task_struct in
the Tracepoint's struct. BPF programs can read task information via a
task_struct pointer. Exposing these pointers explicitly gives BPF programs
an easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <[email protected]>
---
include/trace/events/sched.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bc01e06..9297b33 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -141,6 +141,9 @@ TRACE_EVENT(sched_switch,
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
+ __field( bool, preempt )
+ __field( void *, prev_task )
+ __field( void *, next_task )
),

TP_fast_assign(
@@ -151,6 +154,9 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
+ __entry->preempt = preempt;
+ __entry->prev_task = (void *)prev;
+ __entry->next_task = (void *)next;
/* XXX SCHED_DEADLINE */
),

--
2.9.5

2017-12-14 20:21:01

by Teng Qin

Subject: [PATCH tip 3/3] Improve sched_process_exit Tracepoint

This commit adds the group_dead value to sched_process_exit's Tracepoint
struct. The value is passed as an argument of the Tracepoint. It is
useful for differentiating the exit of a regular process from the exit
of the last process of the process group.

This commit also exposes a pointer to the exited task. BPF programs can read
task information via the task_struct pointer. Exposing the pointer explicitly
gives BPF programs an easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <[email protected]>
---
include/trace/events/sched.h | 37 +++++++++++++++++++++++++++++--------
kernel/exit.c | 2 +-
2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index e8bb6b0..c741f4e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -204,6 +204,35 @@ TRACE_EVENT(sched_migrate_task,
__entry->orig_cpu, __entry->dest_cpu)
);

+/*
+ * Tracepoint for a task exiting:
+ */
+TRACE_EVENT(sched_process_exit,
+
+ TP_PROTO(struct task_struct *p, bool group_dead),
+
+ TP_ARGS(p, group_dead),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, prio )
+ __field( bool, group_dead )
+ __field( void *, task_struct )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+ __entry->pid = p->pid;
+ __entry->prio = p->prio;
+ __entry->group_dead = group_dead;
+ __entry->task_struct = (void *)p;
+ ),
+
+ TP_printk("comm=%s pid=%d prio=%d",
+ __entry->comm, __entry->pid, __entry->prio)
+);
+
DECLARE_EVENT_CLASS(sched_process_template,

TP_PROTO(struct task_struct *p),
@@ -233,14 +262,6 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
TP_PROTO(struct task_struct *p),
TP_ARGS(p));

-
-/*
- * Tracepoint for a task exiting:
- */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
- TP_PROTO(struct task_struct *p),
- TP_ARGS(p));
-
/*
* Tracepoint for waiting on task to unschedule:
*/
diff --git a/kernel/exit.c b/kernel/exit.c
index 6b4298a..b613594 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -853,7 +853,7 @@ void __noreturn do_exit(long code)

if (group_dead)
acct_process();
- trace_sched_process_exit(tsk);
+ trace_sched_process_exit(tsk, group_dead);

exit_sem(tsk);
exit_shm(tsk);
--
2.9.5

2017-12-14 20:21:29

by Teng Qin

Subject: [PATCH tip 2/3] Improve sched_process_fork Tracepoint

This commit adds the value of clone_flags to sched_process_fork's
Tracepoint struct. The value is passed as an argument of the Tracepoint.
It is useful when debugging processes created via different clone flags.

This commit also exposes pointers to the parent and child task_struct in
the Tracepoint's struct. BPF programs can read task information via a
task_struct pointer. Exposing these pointers explicitly gives BPF programs
an easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <[email protected]>
---
include/trace/events/sched.h | 11 +++++++++--
kernel/fork.c | 2 +-
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9297b33..e8bb6b0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -278,15 +278,19 @@ TRACE_EVENT(sched_process_wait,
*/
TRACE_EVENT(sched_process_fork,

- TP_PROTO(struct task_struct *parent, struct task_struct *child),
+ TP_PROTO(struct task_struct *parent, struct task_struct *child,
+ unsigned long clone_flags),

- TP_ARGS(parent, child),
+ TP_ARGS(parent, child, clone_flags),

TP_STRUCT__entry(
__array( char, parent_comm, TASK_COMM_LEN )
__field( pid_t, parent_pid )
__array( char, child_comm, TASK_COMM_LEN )
__field( pid_t, child_pid )
+ __field( unsigned long, clone_flags )
+ __field( void *, parent_task )
+ __field( void *, child_task )
),

TP_fast_assign(
@@ -294,6 +298,9 @@ TRACE_EVENT(sched_process_fork,
__entry->parent_pid = parent->pid;
memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
__entry->child_pid = child->pid;
+ __entry->clone_flags = clone_flags;
+ __entry->parent_task = (void *)parent;
+ __entry->child_task = (void *)child;
),

TP_printk("comm=%s pid=%d child_comm=%s child_pid=%d",
diff --git a/kernel/fork.c b/kernel/fork.c
index 432eadf..777985e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2053,7 +2053,7 @@ long _do_fork(unsigned long clone_flags,
struct completion vfork;
struct pid *pid;

- trace_sched_process_fork(current, p);
+ trace_sched_process_fork(current, p, clone_flags);

pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
--
2.9.5

2017-12-14 20:49:46

by Peter Zijlstra

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> This set of commits attempts to improve three scheduler related
> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
>
> Firstly, these commits add additional flag values, namely preempt,
> clone_flags and group_dead, to these Tracepoints, to make the information
> exposed via the Tracepoints more useful and complete.
>
> Secondly, these commits expose task_struct pointers in these
> Tracepoints. The task_struct pointers are arguments of the Tracepoints
> and are currently only used to compute struct field values. But for BPF
> programs attached to these Tracepoints, we may want to read additional
> task information via the task_struct pointers. This is currently either
> impossible, or we have to make an assumption about whether the Tracepoint
> is running in the context of the previous / parent or the next / child
> task and use the current pointer instead. Exposing the task_struct
> pointers explicitly makes such use cases easier and more reliable.
>

NAK

2017-12-15 03:16:51

by Alexei Starovoitov

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
>> This set of commits attempts to improve three scheduler related
>> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
>>
>> Firstly, these commits add additional flag values, namely preempt,
>> clone_flags and group_dead, to these Tracepoints, to make the information
>> exposed via the Tracepoints more useful and complete.
>>
>> Secondly, these commits expose task_struct pointers in these
>> Tracepoints. The task_struct pointers are arguments of the Tracepoints
>> and are currently only used to compute struct field values. But for BPF
>> programs attached to these Tracepoints, we may want to read additional
>> task information via the task_struct pointers. This is currently either
>> impossible, or we have to make an assumption about whether the Tracepoint
>> is running in the context of the previous / parent or the next / child
>> task and use the current pointer instead. Exposing the task_struct
>> pointers explicitly makes such use cases easier and more reliable.
>>
>
> NAK

Not sure what the concern is here.
Is it the first or the second part of the above?
preempt and group_dead are bool, and clone_flags holds uapi-defined
flags, so no kernel internals are being exposed.
The two task_struct pointers are unusable outside of bpf progs.
There are plenty of other tracepoints that store pointers to
kernel structs, and bpf progs already look into them.
So nothing new is being exposed here either.

Note that TP_printk() is kept unchanged, so typical user tracing
that parses trace_pipe won't see any difference.
Apps that use binary APIs use libs like libtracecmd and are also fine.

2017-12-15 07:39:20

by Peter Zijlstra

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
> On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> > On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> > > This set of commits attempts to improve three scheduler related
> > > Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
> > >
> > > Firstly, these commits add additional flag values, namely preempt,
> > > clone_flags and group_dead, to these Tracepoints, to make the information
> > > exposed via the Tracepoints more useful and complete.
> > >
> > > Secondly, these commits expose task_struct pointers in these
> > > Tracepoints. The task_struct pointers are arguments of the Tracepoints
> > > and are currently only used to compute struct field values. But for BPF
> > > programs attached to these Tracepoints, we may want to read additional
> > > task information via the task_struct pointers. This is currently either
> > > impossible, or we have to make an assumption about whether the Tracepoint
> > > is running in the context of the previous / parent or the next / child
> > > task and use the current pointer instead. Exposing the task_struct
> > > pointers explicitly makes such use cases easier and more reliable.
> > >
> >
> > NAK
>
> Not sure what the concern is here.
> Is it the first or the second part of the above?

Definitely the second, but also the first. You know I would have ripped
out all scheduler tracepoints if I could have. They're a pain in the
arse.

A lot of people want to add to the tracepoints, with the end result that
they'll end up a big bloated pile of useless crap. The first part is
just the pieces you want added.

As to the second, that's complete crap; that just makes everything
slower for nobody's benefit. If you register a traceprobe you already get
access to these things.

I think your problem is that you use perf to get access to the
tracepoints, which then means you have to do disgusting things like
this.

2017-12-15 08:54:29

by Teng Qin

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints



On 12/14/17, 23:40, "Peter Zijlstra" <[email protected]> wrote:
> On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
> > On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> > > On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> > > > This set of commits attempts to improve three scheduler related
> > > > Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
> > > >
> > > > Firstly, these commits add additional flag values, namely preempt,
> > > > clone_flags and group_dead, to these Tracepoints, to make the information
> > > > exposed via the Tracepoints more useful and complete.
> > > >
> > > > Secondly, these commits expose task_struct pointers in these
> > > > Tracepoints. The task_struct pointers are arguments of the Tracepoints
> > > > and are currently only used to compute struct field values. But for BPF
> > > > programs attached to these Tracepoints, we may want to read additional
> > > > task information via the task_struct pointers. This is currently either
> > > > impossible, or we have to make an assumption about whether the Tracepoint
> > > > is running in the context of the previous / parent or the next / child
> > > > task and use the current pointer instead. Exposing the task_struct
> > > > pointers explicitly makes such use cases easier and more reliable.
> > > >
> > >
> > > NAK
> >
> > Not sure what the concern is here.
> > Is it the first or the second part of the above?
>
> Definitely the second, but also the first. You know I would have ripped
> out all scheduler tracepoints if I could have. They're a pain in the
> arse.
>
> A lot of people want to add to the tracepoints, with the end result that
> they'll end up a big bloated pile of useless crap. The first part is
> just the pieces you want added.
>
> As to the second, that's complete crap; that just makes everything
> slower for nobody's benefit. If you register a traceprobe you already get
> access to these things.

To have access to the related task_struct is one of the main purposes of these
patches. Take sched_switch as an example. We depend on the implementation
detail of whether the Tracepoint is called from prev or next (which could,
although unlikely, change) and use current to get that task_struct, which
feels, correct me if I'm wrong, like it defeats the purpose of Tracepoints
being more implementation-independent than kprobes. Then we have to figure
out another Tracepoint or, most likely, a kprobe function to get the other
(prev or next) task_struct.
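
To make that concrete, the workaround today looks roughly like the sketch
below (illustrative only; the headers and helper usage are assumptions
rather than code from this series):

#include <linux/sched.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"  /* assumed: SEC(), bpf_get_current_task(), bpf_probe_read() */

SEC("tracepoint/sched/sched_switch")
int on_switch(void *ctx)
{
	/* Assume the probe still runs in the context of 'prev' and use the
	 * current task as a stand-in for it. */
	struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
	unsigned long long start_time = 0;

	bpf_probe_read(&start_time, sizeof(start_time), &prev->start_time);

	/* There is no comparable way to reach 'next' from here; that takes
	 * another Tracepoint or, more likely, a kprobe. */
	return 0;
}

char _license[] SEC("license") = "GPL";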

> I think your problem is that you use perf to get access to the
> tracepoints, which then means you have to do disgusting things like
> this.


2017-12-15 09:53:23

by Peter Zijlstra

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On Fri, Dec 15, 2017 at 08:53:30AM +0000, Teng Qin wrote:
> To have access to the related task_struct is one of the main purposes of these
> patches. Take sched_switch as an example. We depend on the implementation
> detail of whether the Tracepoint is called from prev or next (which could,
> although unlikely, change) and use current to get that task_struct, which
> feels, correct me if I'm wrong, like it defeats the purpose of Tracepoints
> being more implementation-independent than kprobes. Then we have to figure
> out another Tracepoint or, most likely, a kprobe function to get the other
> (prev or next) task_struct.

Go read how tracepoints work. The tracepoint_probe things get you
exactly what you want.
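
For reference, a probe registered on sched_switch gets the raw tracepoint
arguments directly, including both task_struct pointers, with nothing
extra copied into the trace entry. A minimal sketch, assuming code that
can see the tracepoint (e.g. built-in code or an exported tracepoint):

#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static void probe_sched_switch(void *data, bool preempt,
			       struct task_struct *prev,
			       struct task_struct *next)
{
	/* Full access to both task_structs here. */
	pr_debug("switch %s[%d] -> %s[%d] preempt=%d\n",
		 prev->comm, prev->pid, next->comm, next->pid, preempt);
}

static int sched_switch_probe_init(void)
{
	return register_trace_sched_switch(probe_sched_switch, NULL);
}

static void sched_switch_probe_exit(void)
{
	unregister_trace_sched_switch(probe_sched_switch, NULL);
	tracepoint_synchronize_unregister();
}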

2017-12-15 17:10:46

by Alexei Starovoitov

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On 12/14/17 11:39 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
>> On 12/14/17 12:49 PM, Peter Zijlstra wrote:
>>> On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
>>>> This set of commits attempts to improve three scheduler related
>>>> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
>>>>
>>>> Firstly, these commits add additional flag values, namely preempt,
>>>> clone_flags and group_dead, to these Tracepoints, to make the information
>>>> exposed via the Tracepoints more useful and complete.
>>>>
>>>> Secondly, these commits expose task_struct pointers in these
>>>> Tracepoints. The task_struct pointers are arguments of the Tracepoints
>>>> and are currently only used to compute struct field values. But for BPF
>>>> programs attached to these Tracepoints, we may want to read additional
>>>> task information via the task_struct pointers. This is currently either
>>>> impossible, or we have to make an assumption about whether the Tracepoint
>>>> is running in the context of the previous / parent or the next / child
>>>> task and use the current pointer instead. Exposing the task_struct
>>>> pointers explicitly makes such use cases easier and more reliable.
>>>>
>>>
>>> NAK
>>
>> Not sure what the concern is here.
>> Is it the first or the second part of the above?
>
> Definitely the second, but also the first. You know I would have ripped
> out all scheduler tracepoints if I could have. They're a pain in the
> arse.
>
> A lot of people want to add to the tracepoints, with the end result that
> they'll end up a big bloated pile of useless crap. The first part is
> just the pieces you want added.
>
> As to the second, that's complete crap; that just makes everything
> slower for nobody's benefit. If you register a traceprobe you already get
> access to these things.
>
> I think your problem is that you use perf to get access to the
> tracepoints, which then means you have to do disgusting things like
> this.

yeah. Currently bpf progs are called at the end of
perf_trace_##call()
{
.. regular tracepoint copy craft
perf_trace_run_bpf_submit( &copied args )
}

from bpf pov we'd rather get access to raw args passed into
perf_trace_##call.
Sounds like you're suggesting to let bpf side register its
progs directly via tracepoint_probe_register() ?
That would solve the whole thing really nicely indeed.

What would such an api look like?
Something like extending kprobe/uprobe fd-based perf_event_open?
https://www.spinics.net/lists/netdev/msg470567.html
btw could you please apply that set to the tip tree,
or do you want us to route it via bpf-next -> net-next?

Thanks

2017-12-18 09:12:12

by Peter Zijlstra

Subject: Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:

> yeah. Currently bpf progs are called at the end of
> perf_trace_##call()
> {
> .. regular tracepoint copy craft
> perf_trace_run_bpf_submit( &copied args )
> }
>
> from bpf pov we'd rather get access to raw args passed into
> perf_trace_##call.
> Sounds like you're suggesting to let bpf side register its
> progs directly via tracepoint_probe_register() ?
> That would solve the whole thing really nicely indeed.

Not sure I thought that far; but if you want the probe arguments either
grab them from the perf probe you're running in or indeed register your
own, don't play silly buggers and copy them in the trace data.

> What would such an api look like?
> Something like extending kprobe/uprobe fd-based perf_event_open?
> https://www.spinics.net/lists/netdev/msg470567.html
> btw could you please apply that set to the tip tree,
> or do you want us to route it via bpf-next -> net-next?

Oh right, lemme prod Ingo on that.

2017-12-21 02:03:31

by Alexei Starovoitov

Subject: tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
>
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> > .. regular tracepoint copy craft
> > perf_trace_run_bpf_submit( &copied args )
> > }
> >
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
>
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. grab arguments from:
perf_trace_##call(void *__data, proto)

It works ok, but it adds 50-70 bytes of .text to every perf_trace_*
function which, multiplied by the number of tracepoints (~500), adds
~30k of .text to vmlinux in total.
Not too bad, but it also adds an extra branch to the critical
execution path, so not ideal.

2. register your own
This approach I think is much better.
The trick I'm doing is the following:
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
/* no 'static' here. The bpf probe functions are global */ \
notrace void \
__bpf_trace_##call(void *__data, proto) \
{ \
	struct trace_event_call *event_call = __data; \
	\
	__FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args)); \
}

Where CAST_TO_U64 is a macro to cast all args to u64 one by one.
bpf_trace_run*() functions are defined as:
void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3);

so in assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
0xffffffff81132080 <+0>: mov %ecx,%ecx
0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
TP_PROTO(const struct net_device *dev,
const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting the 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), and all other registers are passed as-is.
So all ~500 of the __bpf_trace_*() functions are only 5-10 _bytes_ long,
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata, since I want them to appear in kallsyms so that I don't
have to add another 8-byte field to 'struct trace_event_class'
and 'struct trace_event_call'.
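
Concretely, for xdp_exception the macro generates roughly the following C
(a paraphrase; the exact cast handling inside CAST_TO_U64 is simplified
here):

notrace void __bpf_trace_xdp_exception(void *__data,
				       const struct net_device *dev,
				       const struct bpf_prog *xdp,
				       u32 act)
{
	struct trace_event_call *event_call = __data;

	/* Pointers pass through unchanged; the 32-bit 'act' widens to u64. */
	bpf_trace_run3(event_call,
		       (u64)(unsigned long)dev,
		       (u64)(unsigned long)xdp,
		       (u64)act);
}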

Such an approach, I think, has the lowest overhead we can possibly have
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since many folks are starting to use tracepoint+bpf at speeds of
1M+ events per second, this would be a very valuable optimization.

Diff stat so far:
drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
fs/btrfs/super.c | 4 ++++
include/linux/trace_events.h | 27 +++++++++++++++++++++++++++
include/trace/bpf_probe.h | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/trace/define_trace.h | 1 +
include/uapi/linux/bpf.h | 4 ++++
kernel/trace/bpf_trace.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on the ftrace or perf side,
I'm thinking of extending the BPF_PROG_ATTACH command to specify
which tracepoint to attach to.
The user-space side will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to
find the __tracepoint_xdp_exception record, check that
tp->name == "xdp_exception",
then find the __bpf_trace_xdp_exception() address (also in kallsyms),
and finally use tracepoint_probe_register() to connect the two.
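
A rough sketch of that kernel-side flow (illustrative only; the function
name, error handling, and the private data handed to the probe are
placeholders):

#include <linux/errno.h>
#include <linux/kallsyms.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

static int bpf_attach_to_tp(const char *tp_name, void *tp_data)
{
	char sym[KSYM_NAME_LEN];
	struct tracepoint *tp;
	void *bpf_probe;

	/* Find the struct tracepoint for this event via kallsyms. */
	snprintf(sym, sizeof(sym), "__tracepoint_%s", tp_name);
	tp = (struct tracepoint *)kallsyms_lookup_name(sym);
	if (!tp || strcmp(tp->name, tp_name))
		return -ENOENT;

	/* Find the matching __bpf_trace_* stub generated by the macro above. */
	snprintf(sym, sizeof(sym), "__bpf_trace_%s", tp_name);
	bpf_probe = (void *)kallsyms_lookup_name(sym);
	if (!bpf_probe)
		return -ENOENT;

	/* Connect the two; tp_data becomes the stub's __data argument
	 * (the trace_event_call in the sketch above). */
	return tracepoint_probe_register(tp, bpf_probe, tp_data);
}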

Thoughts?