2019-09-03 15:45:10

by Radim Krčmář

[permalink] [raw]
Subject: [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint

Add a tracepoint for monitoring nr_running values, which is helpful in
discovering scheduling imbalances.

More information and most of the code is in [2/2], while [1/2] fixes a
build issue that popped up because CREATE_TRACE_POINTS is now defined
for several includes.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Jirka Hladký <[email protected]>
Cc: Jiří Vozár <[email protected]>
Cc: [email protected]


Radim Krčmář (2):
x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h
sched/debug: add sched_update_nr_running tracepoint

arch/x86/include/asm/mmu_context.h | 2 --
arch/x86/mm/tlb.c | 2 ++
include/trace/events/sched.h | 22 ++++++++++++++++++++++
kernel/sched/core.c | 7 ++-----
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 7 +++++++
6 files changed, 33 insertions(+), 9 deletions(-)

--
2.23.0


2019-09-03 15:45:12

by Radim Krčmář

[permalink] [raw]
Subject: [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h

asm/mmu_context.h is a tree-wide include that will unnecessarily break
the build if CREATE_TRACE_POINTS is defined when including it.

Signed-off-by: Radim Krčmář <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 2 --
arch/x86/mm/tlb.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 16ae821483c8..e59f230ff981 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -7,8 +7,6 @@
#include <linux/mm_types.h>
#include <linux/pkeys.h>

-#include <trace/events/tlb.h>
-
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e6a9edc5baaf..83cd66a0db99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -18,6 +18,8 @@

#include "mm_internal.h"

+#include <trace/events/tlb.h>
+
/*
* TLB flushing, formerly SMP-only
* c/o Linus Torvalds.
--
2.23.0

2019-09-03 15:45:51

by Radim Krčmář

[permalink] [raw]
Subject: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
custom data gathering points to better understand what was going on in
the scheduler.
Red Hat adapted one of them for the tracepoint framework and created a
tool to plot a heatmap of nr_running, where the sched_update_nr_running
tracepoint is being used for fine grained monitoring of scheduling
imbalance.
The tool is available from https://github.com/jirvoz/plot-nr-running.

The best place for the tracepoints is inside the add/sub_nr_running,
which requires some shenanigans to make it work as they are defined
inside sched.h.
The tracepoints have to be included from sched.h, which means that
CREATE_TRACE_POINTS has to be defined for the whole header and this
might cause problems if tree-wide headers expose tracepoints in sched.h
dependencies, but I'd argue it's the other side's misuse of tracepoints.

Moving the import sched.h line lower would require fixes in s390 and ppc
headers, because they don't include dependecies properly and expect
sched.h to do it, so it is simpler to keep sched.h there and
preventively undefine CREATE_TRACE_POINTS right after.

Exports of the pelt tracepoints remain because they don't need to be
protected by CREATE_TRACE_POINTS and moving them closer would be
unsightly.

Tested-by: Jirka Hladký <[email protected]>
Tested-by: Jiří Vozár <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
---
include/trace/events/sched.h | 22 ++++++++++++++++++++++
kernel/sched/core.c | 7 ++-----
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 7 +++++++
4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..1527fc695609 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
TP_PROTO(struct root_domain *rd, bool overutilized),
TP_ARGS(rd, overutilized));

+TRACE_EVENT(sched_update_nr_running,
+
+ TP_PROTO(int cpu, int change, unsigned int nr_running),
+
+ TP_ARGS(cpu, change, nr_running),
+
+ TP_STRUCT__entry(
+ __field(int, cpu)
+ __field(int, change)
+ __field(unsigned int, nr_running)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->change = change;
+ __entry->nr_running = nr_running;
+ ),
+
+ TP_printk("cpu=%u nr_running=%u (%d)",
+ __entry->cpu, __entry->nr_running, __entry->change)
+);
+
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71981ce84231..31ac37b9f6f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,7 +6,9 @@
*
* Copyright (C) 1991-2002 Linus Torvalds
*/
+#define CREATE_TRACE_POINTS
#include "sched.h"
+#undef CREATE_TRACE_POINTS

#include <linux/nospec.h>

@@ -20,9 +22,6 @@

#include "pelt.h"

-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
/*
* Export tracepoints that act as a bare tracehook (ie: have no trace event
* associated with them) to allow external modules to probe them.
@@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
-
-#undef CREATE_TRACE_POINTS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84959d3285d1..421236d39902 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
*/
#include "sched.h"

-#include <trace/events/sched.h>
-
/*
* Targeted preemption latency for CPU-bound tasks:
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4915f46035a..b89d7786109a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,8 @@
#include "cpupri.h"
#include "cpudeadline.h"

+#include <trace/events/sched.h>
+
#ifdef CONFIG_SCHED_DEBUG
# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
#else
@@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)

rq->nr_running = prev_nr + count;

+ trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
+
#ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
if (!READ_ONCE(rq->rd->overload))
@@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
static inline void sub_nr_running(struct rq *rq, unsigned count)
{
rq->nr_running -= count;
+
+ trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
+
/* Check if we still need preemption */
sched_update_tick_dependency(rq);
}
--
2.23.0

2019-09-03 16:07:17

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 03/09/2019 16:43, Radim Krčmář wrote:
> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> custom data gathering points to better understand what was going on in
> the scheduler.
> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.
> The tool is available from https://github.com/jirvoz/plot-nr-running.
>
> The best place for the tracepoints is inside the add/sub_nr_running,
> which requires some shenanigans to make it work as they are defined
> inside sched.h.
> The tracepoints have to be included from sched.h, which means that
> CREATE_TRACE_POINTS has to be defined for the whole header and this
> might cause problems if tree-wide headers expose tracepoints in sched.h
> dependencies, but I'd argue it's the other side's misuse of tracepoints.
>
> Moving the import sched.h line lower would require fixes in s390 and ppc
> headers, because they don't include dependecies properly and expect
> sched.h to do it, so it is simpler to keep sched.h there and
> preventively undefine CREATE_TRACE_POINTS right after.
>
> Exports of the pelt tracepoints remain because they don't need to be
> protected by CREATE_TRACE_POINTS and moving them closer would be
> unsightly.
>

Pure trace events are frowned upon in scheduler world, try going with
trace points. Qais did something very similar recently:

https://lore.kernel.org/lkml/[email protected]/

You'll have to implement the associated trace events in a module, which
lets you define your own event format and doesn't form an ABI :).

2019-09-04 04:24:33

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> On 03/09/2019 16:43, Radim Krčmář wrote:
> > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > custom data gathering points to better understand what was going on in
> > the scheduler.
> > Red Hat adapted one of them for the tracepoint framework and created a
> > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > tracepoint is being used for fine grained monitoring of scheduling
> > imbalance.
> > The tool is available from https://github.com/jirvoz/plot-nr-running.
> >
> > The best place for the tracepoints is inside the add/sub_nr_running,
> > which requires some shenanigans to make it work as they are defined
> > inside sched.h.
> > The tracepoints have to be included from sched.h, which means that
> > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > might cause problems if tree-wide headers expose tracepoints in sched.h
> > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> >
> > Moving the import sched.h line lower would require fixes in s390 and ppc
> > headers, because they don't include dependecies properly and expect
> > sched.h to do it, so it is simpler to keep sched.h there and
> > preventively undefine CREATE_TRACE_POINTS right after.
> >
> > Exports of the pelt tracepoints remain because they don't need to be
> > protected by CREATE_TRACE_POINTS and moving them closer would be
> > unsightly.
> >
>
> Pure trace events are frowned upon in scheduler world, try going with
> trace points. Qais did something very similar recently:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> You'll have to implement the associated trace events in a module, which
> lets you define your own event format and doesn't form an ABI :).

Is that really true? eBPF programs loaded from userspace can access
tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103

I don't have a strong opinion about considering tracepoints as ABI / API or
not, but just want to get the facts straight :)

thanks,

- Joel

2019-09-04 06:56:56

by Cheng Jian

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint


On 2019/9/4 0:05, Valentin Schneider wrote:
> On 03/09/2019 16:43, Radim Krčmář wrote:
>> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
>> custom data gathering points to better understand what was going on in
>> the scheduler.
>> Red Hat adapted one of them for the tracepoint framework and created a
>> tool to plot a heatmap of nr_running, where the sched_update_nr_running
>> tracepoint is being used for fine grained monitoring of scheduling
>> imbalance.
>> The tool is available from https://github.com/jirvoz/plot-nr-running.
>>
>> The best place for the tracepoints is inside the add/sub_nr_running,
>> which requires some shenanigans to make it work as they are defined
>> inside sched.h.
>> The tracepoints have to be included from sched.h, which means that
>> CREATE_TRACE_POINTS has to be defined for the whole header and this
>> might cause problems if tree-wide headers expose tracepoints in sched.h
>> dependencies, but I'd argue it's the other side's misuse of tracepoints.
>>
>> Moving the import sched.h line lower would require fixes in s390 and ppc
>> headers, because they don't include dependecies properly and expect
>> sched.h to do it, so it is simpler to keep sched.h there and
>> preventively undefine CREATE_TRACE_POINTS right after.
>>
>> Exports of the pelt tracepoints remain because they don't need to be
>> protected by CREATE_TRACE_POINTS and moving them closer would be
>> unsightly.
>>
> Pure trace events are frowned upon in scheduler world, try going with
> trace points. Qais did something very similar recently:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> You'll have to implement the associated trace events in a module, which
> lets you define your own event format and doesn't form an ABI :).
>
> .


Hi Radim,


Why not try Chrome Tracing

1--Record trace data, scheduling, irq, etc.
    You can Produce a JSON file with the format expected by Chrome
    OR
    just save the captured data directly as "xxx.html", it can still work.
    cat /sys/kernel/debug/tracing/trace > trace.html
2--Go to chrome://tracing in Chrome,
3--Click "Load" and open your file, or alternatively drag the file into
Chrome,
4--Profit!


Systrace (Android System Trace) captures and displays execution times of
your app's
processes and other Android system processes, fortunately, there are
some ported
versions for Linux Desktop/Server.

https://github.com/gatieme/systrace



references--
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/trace-event-reading


Thanks,
    - Cheng Jian.


2019-09-04 08:16:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > custom data gathering points to better understand what was going on in
> > > the scheduler.
> > > Red Hat adapted one of them for the tracepoint framework and created a
> > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > tracepoint is being used for fine grained monitoring of scheduling
> > > imbalance.
> > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > >
> > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > which requires some shenanigans to make it work as they are defined
> > > inside sched.h.
> > > The tracepoints have to be included from sched.h, which means that
> > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > >
> > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > headers, because they don't include dependecies properly and expect
> > > sched.h to do it, so it is simpler to keep sched.h there and
> > > preventively undefine CREATE_TRACE_POINTS right after.
> > >
> > > Exports of the pelt tracepoints remain because they don't need to be
> > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > unsightly.
> > >
> >
> > Pure trace events are frowned upon in scheduler world, try going with
> > trace points.

Quite; I hate tracepoints for the API constraints they impose. Been
bitten by that, not want to ever have to deal with that again.

> > Qais did something very similar recently:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > You'll have to implement the associated trace events in a module, which
> > lets you define your own event format and doesn't form an ABI :).
>
> Is that really true? eBPF programs loaded from userspace can access
> tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
>
> I don't have a strong opinion about considering tracepoints as ABI / API or
> not, but just want to get the facts straight :)

eBPF can access all sorts of kernel internals; if we were to deem eBPF
and API we'd be fscked.

2019-09-04 10:46:07

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/04/19 00:23, Joel Fernandes wrote:
> On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > custom data gathering points to better understand what was going on in
> > > the scheduler.
> > > Red Hat adapted one of them for the tracepoint framework and created a
> > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > tracepoint is being used for fine grained monitoring of scheduling
> > > imbalance.
> > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > >
> > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > which requires some shenanigans to make it work as they are defined
> > > inside sched.h.
> > > The tracepoints have to be included from sched.h, which means that
> > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > >
> > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > headers, because they don't include dependecies properly and expect
> > > sched.h to do it, so it is simpler to keep sched.h there and
> > > preventively undefine CREATE_TRACE_POINTS right after.
> > >
> > > Exports of the pelt tracepoints remain because they don't need to be
> > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > unsightly.
> > >
> >
> > Pure trace events are frowned upon in scheduler world, try going with
> > trace points. Qais did something very similar recently:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > You'll have to implement the associated trace events in a module, which
> > lets you define your own event format and doesn't form an ABI :).
>
> Is that really true? eBPF programs loaded from userspace can access
> tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
>
> I don't have a strong opinion about considering tracepoints as ABI / API or
> not, but just want to get the facts straight :)

It is actually true. But you need to make the distinction between a tracepoint
and a trace event first. What Valentin is talking about here is the *bare*
tracepoint without any event associated with them like the one I added to the
scheduler recently. These ones are not accessible via eBPF, unless something
has changed since I last tried.

The current infrastructure needs to be expanded to allow eBPF to attach these
bare tracepoints. Something similar to what I have in [1] is needed - but
instead of creating a new macro it needs to expand the current macro. [2] might
give full context of when I was trying to come up with alternatives to using
trace events.

[1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
[2] https://lore.kernel.org/lkml/[email protected]/

HTH

--
Qais Yousef

2019-09-04 10:55:17

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/04/19 10:14, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > >
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > >
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > >
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > >
> > >
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points.
>
> Quite; I hate tracepoints for the API constraints they impose. Been
> bitten by that, not want to ever have to deal with that again.

s/tracepoints/trace events/ ?

They used to be one and the same but I think using them interchangeably might
cause some confusion now since we have tracepoints without trace events
associated with them.

Not trying to be picky, but the missing distinction confused the hell out of
me when I first started looking at this :-)

--
Qais Yousef

2019-09-04 13:09:43

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 11:43:33AM +0100, Qais Yousef wrote:
> On 09/04/19 00:23, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > >
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > >
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > >
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > >
> > >
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points. Qais did something very similar recently:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > You'll have to implement the associated trace events in a module, which
> > > lets you define your own event format and doesn't form an ABI :).
> >
> > Is that really true? eBPF programs loaded from userspace can access
> > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> >
> > I don't have a strong opinion about considering tracepoints as ABI / API or
> > not, but just want to get the facts straight :)
>
> It is actually true.
>
> But you need to make the distinction between a tracepoint
> and a trace event first.

I know this distinction well.

> What Valentin is talking about here is the *bare*
> tracepoint without any event associated with them like the one I added to the
> scheduler recently. These ones are not accessible via eBPF, unless something
> has changed since I last tried.

Can this tracepoint be registered on with tracepoint_probe_register()?
Quickly looking at these new tracepoint, they can be otherwise how would they
even work right? If so, then eBPF can very well access it. Look at
__bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
BPF_RAW_TRACEPOINT_OPEN.

> The current infrastructure needs to be expanded to allow eBPF to attach these
> bare tracepoints. Something similar to what I have in [1] is needed - but
> instead of creating a new macro it needs to expand the current macro. [2] might
> give full context of when I was trying to come up with alternatives to using
> trace events.
>
> [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> [2] https://lore.kernel.org/lkml/[email protected]/


As I was mentioning, tracepoints, not "trace events" can already be opened
directly with BPF. I don't see how these new tracepoints are different.

I wonder if this distinction of "tracepoint" being non-ABI can be documented
somewhere. I would be happy to do that if there is a place for the same. I
really want some general "policy" in the kernel on where we draw a line in
the sand with respect to tracepoints and ABI :).

For instance, perhaps VFS can also start having non-ABI tracepoints for the
benefit of people tracing the VFS.

thanks,

- Joel

2019-09-04 13:13:09

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 10:14:48AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > >
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > >
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > >
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > >
> > >
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points.
>
> Quite; I hate tracepoints for the API constraints they impose. Been
> bitten by that, not want to ever have to deal with that again.

Your NACKs on trace patches over the years have spoken out loud about this
point ;-)

> > > Qais did something very similar recently:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > You'll have to implement the associated trace events in a module, which
> > > lets you define your own event format and doesn't form an ABI :).
> >
> > Is that really true? eBPF programs loaded from userspace can access
> > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> >
> > I don't have a strong opinion about considering tracepoints as ABI / API or
> > not, but just want to get the facts straight :)
>
> eBPF can access all sorts of kernel internals; if we were to deem eBPF
> and API we'd be fscked.

True. However, for kprobes-based BPF program - it does check for kernel
version to ensure that the BPF program is built against the right kernel
version (in order to ensure the program is built against the right set of
kernel headers). If it is not, then BPF refuses to load the program.

But, to your point bpf_probe_read() can away access kernel memory and
assume structure layouts; so I guess a badly written bpf program can still do
all sorts of ABI-unstable things.

Good to know that the raw tracepoint being Ok since it is non-ABI; is where
we draw the line.

thanks,

- Joel

2019-09-04 13:15:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:

> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.

You should be able to reconstruct this from wakeup and switch
tracepoints.


2019-09-04 13:23:03

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 04/09/2019 14:13, Peter Zijlstra wrote:
> On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:
>
>> Red Hat adapted one of them for the tracepoint framework and created a
>> tool to plot a heatmap of nr_running, where the sched_update_nr_running
>> tracepoint is being used for fine grained monitoring of scheduling
>> imbalance.
>
> You should be able to reconstruct this from wakeup and switch
> tracepoints.
>
>

I never tried to do it myself but know some folks around me gave it a go
(Morten, Qais and maybe even Chris). I can't remember the exact details but
there was something about not getting the right input - duplicated wakeups,
or no migration information (I think sched_migrate doesn't cover all of the
cases) so the information you're building diverges as time progresses.

2019-09-04 14:22:15

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/04/19 09:06, Joel Fernandes wrote:
> >
> > It is actually true.
> >
> > But you need to make the distinction between a tracepoint
> > and a trace event first.
>
> I know this distinction well.
>
> > What Valentin is talking about here is the *bare*
> > tracepoint without any event associated with them like the one I added to the
> > scheduler recently. These ones are not accessible via eBPF, unless something
> > has changed since I last tried.
>
> Can this tracepoint be registered on with tracepoint_probe_register()?
> Quickly looking at these new tracepoint, they can be otherwise how would they
> even work right? If so, then eBPF can very well access it. Look at
> __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> BPF_RAW_TRACEPOINT_OPEN.

Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
maybe I missed something on the way it should be used. AFAICT it was missing
the bits that I implemented in [1]. Maybe the method you mention is lower level
than bcc.

>
> > The current infrastructure needs to be expanded to allow eBPF to attach these
> > bare tracepoints. Something similar to what I have in [1] is needed - but
> > instead of creating a new macro it needs to expand the current macro. [2] might
> > give full context of when I was trying to come up with alternatives to using
> > trace events.
> >
> > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > [2] https://lore.kernel.org/lkml/[email protected]/
>
>
> As I was mentioning, tracepoints, not "trace events" can already be opened
> directly with BPF. I don't see how these new tracepoints are different.
>
> I wonder if this distinction of "tracepoint" being non-ABI can be documented
> somewhere. I would be happy to do that if there is a place for the same. I
> really want some general "policy" in the kernel on where we draw a line in
> the sand with respect to tracepoints and ABI :).
>
> For instance, perhaps VFS can also start having non-ABI tracepoints for the
> benefit of people tracing the VFS.

Good question. I did consider that but failed to come up with a place. AFAIU
the history moved from tracepoints to trace events and now moving back to
tracepoints. Something Steve is not very enthusiastic about.

LPC is coming, sounds like a good venue to discuss this :-)

Cheers

--
Qais Yousef

2019-09-04 14:38:55

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/03/19 17:43, Radim Krčmář wrote:
> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> custom data gathering points to better understand what was going on in
> the scheduler.
> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.
> The tool is available from https://github.com/jirvoz/plot-nr-running.
>
> The best place for the tracepoints is inside the add/sub_nr_running,
> which requires some shenanigans to make it work as they are defined
> inside sched.h.

I managed to hook into sched_switch to get the nr_running of cfs tasks via
eBPF.

```
int on_switch(struct sched_switch_args *args) {
struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
const char *prev_cgroup_name = prev_cgroup->kn->name;

if (prev_cgroup->kn->parent) {
bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
prev->se.cfs_rq->nr_running,
prev_cgroup_name);
} else {
bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
prev->se.cfs_rq->nr_running);
}
return 0;
};
```

You can do something similar by attaching to the sched_switch tracepoint from
a module and a create a new event to get the nr_running.

Now this is not as accurate as your proposed new tracepoint in terms where you
sample nr_running, but should be good enough?

Cheers

--
Qais Yousef

> The tracepoints have to be included from sched.h, which means that
> CREATE_TRACE_POINTS has to be defined for the whole header and this
> might cause problems if tree-wide headers expose tracepoints in sched.h
> dependencies, but I'd argue it's the other side's misuse of tracepoints.
>
> Moving the import sched.h line lower would require fixes in s390 and ppc
> headers, because they don't include dependecies properly and expect
> sched.h to do it, so it is simpler to keep sched.h there and
> preventively undefine CREATE_TRACE_POINTS right after.
>
> Exports of the pelt tracepoints remain because they don't need to be
> protected by CREATE_TRACE_POINTS and moving them closer would be
> unsightly.
>
> Tested-by: Jirka Hladký <[email protected]>
> Tested-by: Jiří Vozár <[email protected]>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> include/trace/events/sched.h | 22 ++++++++++++++++++++++
> kernel/sched/core.c | 7 ++-----
> kernel/sched/fair.c | 2 --
> kernel/sched/sched.h | 7 +++++++
> 4 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e56e55..1527fc695609 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
> TP_PROTO(struct root_domain *rd, bool overutilized),
> TP_ARGS(rd, overutilized));
>
> +TRACE_EVENT(sched_update_nr_running,
> +
> + TP_PROTO(int cpu, int change, unsigned int nr_running),
> +
> + TP_ARGS(cpu, change, nr_running),
> +
> + TP_STRUCT__entry(
> + __field(int, cpu)
> + __field(int, change)
> + __field(unsigned int, nr_running)
> + ),
> +
> + TP_fast_assign(
> + __entry->cpu = cpu;
> + __entry->change = change;
> + __entry->nr_running = nr_running;
> + ),
> +
> + TP_printk("cpu=%u nr_running=%u (%d)",
> + __entry->cpu, __entry->nr_running, __entry->change)
> +);
> +
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71981ce84231..31ac37b9f6f7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6,7 +6,9 @@
> *
> * Copyright (C) 1991-2002 Linus Torvalds
> */
> +#define CREATE_TRACE_POINTS
> #include "sched.h"
> +#undef CREATE_TRACE_POINTS
>
> #include <linux/nospec.h>
>
> @@ -20,9 +22,6 @@
>
> #include "pelt.h"
>
> -#define CREATE_TRACE_POINTS
> -#include <trace/events/sched.h>
> -
> /*
> * Export tracepoints that act as a bare tracehook (ie: have no trace event
> * associated with them) to allow external modules to probe them.
> @@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
> /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
> /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
> };
> -
> -#undef CREATE_TRACE_POINTS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84959d3285d1..421236d39902 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -22,8 +22,6 @@
> */
> #include "sched.h"
>
> -#include <trace/events/sched.h>
> -
> /*
> * Targeted preemption latency for CPU-bound tasks:
> *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c4915f46035a..b89d7786109a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -75,6 +75,8 @@
> #include "cpupri.h"
> #include "cpudeadline.h"
>
> +#include <trace/events/sched.h>
> +
> #ifdef CONFIG_SCHED_DEBUG
> # define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
> #else
> @@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>
> rq->nr_running = prev_nr + count;
>
> + trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
> +
> #ifdef CONFIG_SMP
> if (prev_nr < 2 && rq->nr_running >= 2) {
> if (!READ_ONCE(rq->rd->overload))
> @@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
> static inline void sub_nr_running(struct rq *rq, unsigned count)
> {
> rq->nr_running -= count;
> +
> + trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
> +
> /* Check if we still need preemption */
> sched_update_tick_dependency(rq);
> }
> --
> 2.23.0
>

2019-09-04 14:44:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> On 09/04/19 09:06, Joel Fernandes wrote:
> > >
> > > It is actually true.
> > >
> > > But you need to make the distinction between a tracepoint
> > > and a trace event first.
> >
> > I know this distinction well.
> >
> > > What Valentin is talking about here is the *bare*
> > > tracepoint without any event associated with them like the one I added to the
> > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > has changed since I last tried.
> >
> > Can this tracepoint be registered on with tracepoint_probe_register()?
> > Quickly looking at these new tracepoint, they can be otherwise how would they
> > even work right? If so, then eBPF can very well access it. Look at
> > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > BPF_RAW_TRACEPOINT_OPEN.
>
> Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> maybe I missed something on the way it should be used. AFAICT it was missing
> the bits that I implemented in [1]. Maybe the method you mention is lower level
> than bcc.

Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
tracepoints (not trace events) to probe context switches and such (probably
not through BCC but some other BPF tracing code). Peter had rejected trace
events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
then IIRC.

> > > The current infrastructure needs to be expanded to allow eBPF to attach these
> > > bare tracepoints. Something similar to what I have in [1] is needed - but
> > > instead of creating a new macro it needs to expand the current macro. [2] might
> > > give full context of when I was trying to come up with alternatives to using
> > > trace events.
> > >
> > > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > [2] https://lore.kernel.org/lkml/[email protected]/
> >
> >
> > As I was mentioning, tracepoints, not "trace events" can already be opened
> > directly with BPF. I don't see how these new tracepoints are different.
> >
> > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > somewhere. I would be happy to do that if there is a place for the same. I
> > really want some general "policy" in the kernel on where we draw a line in
> > the sand with respect to tracepoints and ABI :).
> >
> > For instance, perhaps VFS can also start having non-ABI tracepoints for the
> > benefit of people tracing the VFS.
>
> Good question. I did consider that but failed to come up with a place. AFAIU
> the history moved from tracepoints to trace events and now moving back to
> tracepoints. Something Steve is not very enthusiastic about.

Yeah this is a bit of a mess. I think for every recent LPC this has come up.
But the DECLARE_TRACE approach you did is interesting in that it
reduces/removes the API surface for trace-events at least.

thanks,

- Joel

2019-09-04 14:59:24

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/04/19 10:41, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > On 09/04/19 09:06, Joel Fernandes wrote:
> > > >
> > > > It is actually true.
> > > >
> > > > But you need to make the distinction between a tracepoint
> > > > and a trace event first.
> > >
> > > I know this distinction well.
> > >
> > > > What Valentin is talking about here is the *bare*
> > > > tracepoint without any event associated with them like the one I added to the
> > > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > > has changed since I last tried.
> > >
> > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > Quickly looking at these new tracepoint, they can be otherwise how would they
> > > even work right? If so, then eBPF can very well access it. Look at
> > > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > > BPF_RAW_TRACEPOINT_OPEN.
> >
> > Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> > maybe I missed something on the way it should be used. AFAICT it was missing
> > the bits that I implemented in [1]. Maybe the method you mention is lower level
> > than bcc.
>
> Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
> tracepoints (not trace events) to probe context switches and such (probably
> not through BCC but some other BPF tracing code). Peter had rejected trace
> events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
> then IIRC.

Looking at the history BPF_RAW_TRACEPOINT_OPEN was added with the support for
RAW_TRACEPOINT c4f6699dfcb8 (bpf: introduce BPF_RAW_TRACEPOINT).

Anyway, if you ever get a chance please try it and let me know. I might have
done something wrong and you're more of a eBPF guru than I am :-)

>
> > > > The current infrastructure needs to be expanded to allow eBPF to attach these
> > > > bare tracepoints. Something similar to what I have in [1] is needed - but
> > > > instead of creating a new macro it needs to expand the current macro. [2] might
> > > > give full context of when I was trying to come up with alternatives to using
> > > > trace events.
> > > >
> > > > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > > [2] https://lore.kernel.org/lkml/[email protected]/
> > >
> > >
> > > As I was mentioning, tracepoints, not "trace events" can already be opened
> > > directly with BPF. I don't see how these new tracepoints are different.
> > >
> > > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > > somewhere. I would be happy to do that if there is a place for the same. I
> > > really want some general "policy" in the kernel on where we draw a line in
> > > the sand with respect to tracepoints and ABI :).
> > >
> > > For instance, perhaps VFS can also start having non-ABI tracepoints for the
> > > benefit of people tracing the VFS.
> >
> > Good question. I did consider that but failed to come up with a place. AFAIU
> > the history moved from tracepoints to trace events and now moving back to
> > tracepoints. Something Steve is not very enthusiastic about.
>
> Yeah this is a bit of a mess. I think for every recent LPC this has come up.
> But the DECLARE_TRACE approach you did is interesting in that it
> reduces/removes the API surface for trace-events at least.

Yes. And you have the flexibility to add more info to the tracepoint without
worrying about breaking current users.

Another nice feat we discovered is that you can create several trace evenets
from the same tracepoint each exposing different set of info. You can achieve
the same with the current trace events if you use tracepoint_probe_register()
of course, but not if you use the macros. The tendency for in-kernel trace
events is to have 1:1 mapping even if the events can be extracted from
a single tracepoint.

--
Qais Yousef

2019-09-04 15:27:01

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <[email protected]> wrote:
>
> I wonder if this distinction of "tracepoint" being non-ABI can be documented
> somewhere. I would be happy to do that if there is a place for the same. I
> really want some general "policy" in the kernel on where we draw a line in
> the sand with respect to tracepoints and ABI :).

It's been discussed millions times. tracepoints are not abi.
Example: android folks started abusing tracepoints inside bpf core
and we _deleted_ them.
Same thing can be done with _any_ tracepoint.
Do not abuse them and stop the fud about abi.

2019-09-04 15:28:27

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <[email protected]> wrote:
>
> True. However, for kprobes-based BPF program - it does check for kernel
> version to ensure that the BPF program is built against the right kernel
> version (in order to ensure the program is built against the right set of
> kernel headers). If it is not, then BPF refuses to load the program.

This is not true anymore. Users found few ways to workaround that check
in practice. It became useless and it was deleted some time ago.

2019-09-04 15:35:24

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <[email protected]> wrote:
> >
> > True. However, for kprobes-based BPF program - it does check for kernel
> > version to ensure that the BPF program is built against the right kernel
> > version (in order to ensure the program is built against the right set of
> > kernel headers). If it is not, then BPF refuses to load the program.
>
> This is not true anymore. Users found few ways to workaround that check
> in practice. It became useless and it was deleted some time ago.

Wow, Ok! Interesting!

thanks,

- Joel

2019-09-04 15:39:13

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <[email protected]> wrote:
> > >
> > > True. However, for kprobes-based BPF program - it does check for kernel
> > > version to ensure that the BPF program is built against the right kernel
> > > version (in order to ensure the program is built against the right set of
> > > kernel headers). If it is not, then BPF refuses to load the program.
> >
> > This is not true anymore. Users found few ways to workaround that check
> > in practice. It became useless and it was deleted some time ago.
>
> Wow, Ok! Interesting!

the other part of your email says about kernel header requirement.
This is not true any more as well :)
BTF relocations are already supported by the kernel, llvm, libbpf,
bpftool, pahole.
We'll be posting sample code soon.

2019-09-04 15:41:11

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <[email protected]> wrote:
> >
> > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > somewhere. I would be happy to do that if there is a place for the same. I
> > really want some general "policy" in the kernel on where we draw a line in
> > the sand with respect to tracepoints and ABI :).
>
> It's been discussed millions times. tracepoints are not abi.
> Example: android folks started abusing tracepoints inside bpf core
> and we _deleted_ them.

This is news to me, which ones?

> Same thing can be done with _any_ tracepoint.
> Do not abuse them and stop the fud about abi.

I don't know what FUD you are referring to. At least it is not coming from
me. This thread is dealing with the issue about ABI specifically, I jumped in
just now. As I was saying earlier, I don't have a strong opinion about this.
I just want to know what is the agreed upon approach so that we can stick to
it.

It sounds like the agreement here is tracepoints can be added and used
without ABI guarantees, however the same is not true with trace events.
Where's the FUD in that?

thanks,

- Joel

2019-09-04 15:43:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 08:37:22AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes <[email protected]> wrote:
> >
> > On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <[email protected]> wrote:
> > > >
> > > > True. However, for kprobes-based BPF program - it does check for kernel
> > > > version to ensure that the BPF program is built against the right kernel
> > > > version (in order to ensure the program is built against the right set of
> > > > kernel headers). If it is not, then BPF refuses to load the program.
> > >
> > > This is not true anymore. Users found few ways to workaround that check
> > > in practice. It became useless and it was deleted some time ago.
> >
> > Wow, Ok! Interesting!
>
> the other part of your email says about kernel header requirement.
> This is not true any more as well :)
> BTF relocations are already supported by the kernel, llvm, libbpf,
> bpftool, pahole.
> We'll be posting sample code soon.

Ok, this landscape seems to be changing quite a bit. I was going by what I
already know... Looking forward to catching up with the latest. Sorry,

thanks,

- Joel

2019-09-04 15:47:48

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 03:57:59PM +0100, Qais Yousef wrote:
> On 09/04/19 10:41, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > > On 09/04/19 09:06, Joel Fernandes wrote:
> > > > >
> > > > > It is actually true.
> > > > >
> > > > > But you need to make the distinction between a tracepoint
> > > > > and a trace event first.
> > > >
> > > > I know this distinction well.
> > > >
> > > > > What Valentin is talking about here is the *bare*
> > > > > tracepoint without any event associated with them like the one I added to the
> > > > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > > > has changed since I last tried.
> > > >
> > > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > > Quickly looking at these new tracepoint, they can be otherwise how would they
> > > > even work right? If so, then eBPF can very well access it. Look at
> > > > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > > > BPF_RAW_TRACEPOINT_OPEN.
> > >
> > > Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> > > maybe I missed something on the way it should be used. AFAICT it was missing
> > > the bits that I implemented in [1]. Maybe the method you mention is lower level
> > > than bcc.
> >
> > Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
> > tracepoints (not trace events) to probe context switches and such (probably
> > not through BCC but some other BPF tracing code). Peter had rejected trace
> > events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
> > then IIRC.
>
> Looking at the history BPF_RAW_TRACEPOINT_OPEN was added with the support for
> RAW_TRACEPOINT c4f6699dfcb8 (bpf: introduce BPF_RAW_TRACEPOINT).
>
> Anyway, if you ever get a chance please try it and let me know. I might have
> done something wrong and you're more of a eBPF guru than I am :-)

eBPF guru and me? no way ;-) I have tried out BPF_RAW_TRACEPOINT_OPEN before
and it works as expected. Are there not any in-kernel samples? Perhaps Alexei
can post some if there are not.

thanks,

- Joel

2019-09-04 15:53:22

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 4, 2019 at 8:40 AM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <[email protected]> wrote:
> > >
> > > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > > somewhere. I would be happy to do that if there is a place for the same. I
> > > really want some general "policy" in the kernel on where we draw a line in
> > > the sand with respect to tracepoints and ABI :).
> >
> > It's been discussed millions times. tracepoints are not abi.
> > Example: android folks started abusing tracepoints inside bpf core
> > and we _deleted_ them.
>
> This is news to me, which ones?

those that your android teammates abused!

> > Same thing can be done with _any_ tracepoint.
> > Do not abuse them and stop the fud about abi.
>
> I don't know what FUD you are referring to. At least it is not coming from
> me. This thread is dealing with the issue about ABI specifically, I jumped in
> just now. As I was saying earlier, I don't have a strong opinion about this.
> I just want to know what is the agreed upon approach so that we can stick to
> it.
>
> It sounds like the agreement here is tracepoints can be added and used
> without ABI guarantees, however the same is not true with trace events.
> Where's the FUD in that?

Anything in tracing can be deleted.
Tracing is about debugging and introspection.
When underlying kernel code changes the introspection points change as well.

2019-09-04 17:48:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> Anything in tracing can be deleted.
> Tracing is about debugging and introspection.
> When underlying kernel code changes the introspection points change as well.

Right; except when it breaks widely used tools; like say powertop. Been
there, done that.

2019-09-04 17:51:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:

> I managed to hook into sched_switch to get the nr_running of cfs tasks via
> eBPF.
>
> ```
> int on_switch(struct sched_switch_args *args) {
> struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
> struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
> const char *prev_cgroup_name = prev_cgroup->kn->name;
>
> if (prev_cgroup->kn->parent) {
> bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
> prev->se.cfs_rq->nr_running,
> prev_cgroup_name);
> } else {
> bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
> prev->se.cfs_rq->nr_running);
> }
> return 0;
> };
> ```
>
> You can do something similar by attaching to the sched_switch tracepoint from
> a module and a create a new event to get the nr_running.
>
> Now this is not as accurate as your proposed new tracepoint in terms where you
> sample nr_running, but should be good enough?

The above is after deactivate() and gives an up-to-date count for
decrements. Attach something to trace_sched_wakeup() to get the
increment update.

2019-09-04 17:55:21

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > Anything in tracing can be deleted.
> > Tracing is about debugging and introspection.
> > When underlying kernel code changes the introspection points change as well.
>
> Right; except when it breaks widely used tools; like say powertop. Been
> there, done that.

powertop was a lesson learned, but it's not a relevant example anymore.
There are more widely used tools today. Like bcc tools.
And bpftrace is quickly gaining momentum and large user base.
bcc tools did break already several times and people fixed them.

2019-09-05 09:35:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint


* Alexei Starovoitov <[email protected]> wrote:

> On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > Anything in tracing can be deleted.
> > > Tracing is about debugging and introspection.
> > > When underlying kernel code changes the introspection points change as well.
> >
> > Right; except when it breaks widely used tools; like say powertop. Been
> > there, done that.
>
> powertop was a lesson learned, but it's not a relevant example anymore.
> There are more widely used tools today. Like bcc tools.
> And bpftrace is quickly gaining momentum and large user base.
> bcc tools did break already several times and people fixed them.

Are these tools using libtraceevents?

Thanks,

Ingo

2019-09-06 00:15:33

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On Thu, Sep 05, 2019 at 10:13:10AM +0200, Ingo Molnar wrote:
>
> * Alexei Starovoitov <[email protected]> wrote:
>
> > On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > > Anything in tracing can be deleted.
> > > > Tracing is about debugging and introspection.
> > > > When underlying kernel code changes the introspection points change as well.
> > >
> > > Right; except when it breaks widely used tools; like say powertop. Been
> > > there, done that.
> >
> > powertop was a lesson learned, but it's not a relevant example anymore.
> > There are more widely used tools today. Like bcc tools.
> > And bpftrace is quickly gaining momentum and large user base.
> > bcc tools did break already several times and people fixed them.
>
> Are these tools using libtraceevents?

bcc tools and bpftrace are using libbcc.
Which in turn is using libbpf.
libtraceevents is not used.

Interesting example is https://github.com/iovisor/bcc/blob/master/tools/tcplife.py
It's using "inet_sock_set_state" tracepoint when available on newer kernels
and kprobe in tcp_set_state() function on older kernels.
That tracepoint changed significantly over time.
It had different name 'tcp_set_state' and slightly different semantics.
Hence the tool was fixed when that change in tracepoint happened:
https://github.com/iovisor/bcc/commit/fd93dc0409b626b749b90f115d3d550a870ed125

Note that tcp:tcp_set_state tracepoint existed for full kernel release.
Yet people didn't make fuzz about the fact it disappeared in 4.16.
Though tcplife.py tool is simple there are more complex tools based
on this idea that are deployed in netflix and fb that went through the same
tcp_set_state->inet_sock_set_state fixes.

2019-09-09 22:59:26

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

On 09/04/19 19:48, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:
>
> > I managed to hook into sched_switch to get the nr_running of cfs tasks via
> > eBPF.
> >
> > ```
> > int on_switch(struct sched_switch_args *args) {
> > struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
> > struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
> > const char *prev_cgroup_name = prev_cgroup->kn->name;
> >
> > if (prev_cgroup->kn->parent) {
> > bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
> > prev->se.cfs_rq->nr_running,
> > prev_cgroup_name);
> > } else {
> > bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
> > prev->se.cfs_rq->nr_running);
> > }
> > return 0;
> > };
> > ```
> >
> > You can do something similar by attaching to the sched_switch tracepoint from
> > a module and a create a new event to get the nr_running.
> >
> > Now this is not as accurate as your proposed new tracepoint in terms where you
> > sample nr_running, but should be good enough?
>
> The above is after deactivate() and gives an up-to-date count for
> decrements. Attach something to trace_sched_wakeup() to get the
> increment update.

I just remembered that sched_switch and sched_wakeup aren't
EXPORT_TRACEPOINT*() so can't be attached to via out of tree module. But still
accessible via eBPF.

There has been several attempts to export these tracepoints but they were
NACKed because there was no in-kernel module that needed them.

https://lore.kernel.org/lkml/[email protected]/

--
Qais Yousef