2017-03-28 06:35:57

by Dietmar Eggemann

Subject: [RFC PATCH 0/5] CFS load tracking trace events

This patch-set introduces trace events for load (and utilization)
tracking for the following three cfs scheduler bricks: cfs_rq,
sched_entity and task_group.

I've decided to send it out because people are discussing problems with
PELT-related functionality and because it can serve as a base for the
related discussion next week at the OSPM summit in Pisa.

The requirements for the design are:

(1) Support for all combinations related to the following kernel
configuration options: CONFIG_SMP, CONFIG_FAIR_GROUP_SCHED,
CONFIG_SCHED_AUTOGROUP, CONFIG_DEBUG_KERNEL.

(2) All key=value pairs have to appear in all combinations of the
kernel configuration options mentioned under (1).

(3) Stable interface, i.e. only use keys for key=value pairs which
are independent of the underlying (load tracking) implementation.

(4) Minimally invasive to the actual cfs scheduler code, i.e. the trace
event should not have to be guarded by an if condition (e.g.
entity_is_task(se) to distinguish between task and task_group)
or by a kernel configuration switch.

The trace events only expose one load (and one utilization) key=value
pair besides those used to identify the cfs scheduler brick. They do
not provide any internal implementation details of the actual (PELT)
load-tracking mechanism.
This limitation might be too strict since, for a cfs_rq, we can then only
trace either the runnable load (cfs_rq->runnable_load_avg) or the
runnable/blocked load (cfs_rq->avg.load_avg), but not both.
Other load metrics like instantaneous load ({cfs_rq|se}->load.weight)
are not traced but could easily be added.

The following keys are used to identify the cfs scheduler brick:

(1) Cpu number the cfs scheduler brick is attached to.

(2) Task_group path and (css) id.

(3) Task name and pid.

In case a key does not apply, due to an unset kernel configuration option
or the fact that a sched_entity can represent either a task or a
task_group, its value is set to an invalid default:

(1) Task_group path: "(null)"

(2) Task group id: -1

(3) Task name: "(null)"

(4) Task pid: -1

Load tracking trace events are placed into kernel/sched/fair.c:

(1) for cfs_rq:

- In PELT core function __update_load_avg().

- During sched_entity attach/detach.

- During sched_entity load/util propagation down the task_group
hierarchy.

(2) for sched_entity:

- In PELT core function __update_load_avg().

- During sched_entity load/util propagation down the task_group
hierarchy.

(3) for task_group:

- In its PELT update function.

An alternative to using __update_load_avg() would be to put the trace
events into update_cfs_rq_load_avg() for cfs_rq's and into
set_task_rq_fair(), update_load_avg() and sync_entity_load_avg() for
sched_entities. This would also get rid of the if(cfs_rq)/else condition
in __update_load_avg().
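
For the cfs_rq case that alternative would look roughly like this (just a
sketch, reusing the existing call in update_cfs_rq_load_avg(); not what
this series implements):

	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
			scale_load_down(cfs_rq->load.weight),
			cfs_rq->curr != NULL, cfs_rq);

	trace_sched_load_cfs_rq(cfs_rq);	/* cfs_rq is never NULL here */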

These trace events still look a bit fragile.
First of all, this patch-set has to use cfs-specific data structures
in the global task scheduler trace file include/trace/events/sched.h.
And second, the accessor functions (rq_of(), task_of(), etc.) are
private to the cfs scheduler class. If they were public, these
trace events would be easier to code. That said, group_cfs_rq() is
already made public by this patch-set.

This version is based on tip/sched/core as of yesterday (bc4278987e38). It
has been compile-tested on ~160 configurations via 0day's kbuild test
robot.

Dietmar Eggemann (5):
sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG
sched/events: Introduce cfs_rq load tracking trace event
sched/fair: Export group_cfs_rq()
sched/events: Introduce sched_entity load tracking trace event
sched/events: Introduce task_group load tracking trace event

include/linux/sched.h | 10 +++
include/trace/events/sched.h | 143 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/autogroup.c | 2 -
kernel/sched/autogroup.h | 2 -
kernel/sched/fair.c | 26 ++++----
5 files changed, 167 insertions(+), 16 deletions(-)

--
2.11.0


2017-03-28 06:36:09

by Dietmar Eggemann

Subject: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

The trace event keys load and util (utilization) are mapped to:

(1) load : se->avg.load_avg

(2) util : se->avg.util_avg

To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED), the following
special handling is necessary for non-existent key=value pairs:

path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the
sched_entity represents a task.

id = -1 : In case of !CONFIG_FAIR_GROUP_SCHED or the
sched_entity represents a task.

comm = "(null)" : In case sched_entity represents a task_group.

pid = -1 : In case sched_entity represents a task_group.

The following list shows examples of the key=value pairs in different
configurations for:

(1) a task:

cpu=0 path=(null) id=-1 comm=sshd pid=2206 load=102 util=102

(2) a taskgroup:

cpu=1 path=/tg1/tg11/tg111 id=4 comm=(null) pid=-1 load=882 util=510

(3) an autogroup:

cpu=0 path=/autogroup-13 id=0 comm=(null) pid=-1 load=49 util=48

(4) w/o CONFIG_FAIR_GROUP_SCHED:

cpu=0 path=(null) id=-1 comm=sshd pid=2211 load=301 util=265

The trace event is only defined for CONFIG_SMP.

The helper functions __trace_sched_cpu(), __trace_sched_path() and
__trace_sched_id() are extended to deal with sched_entities as well.

Signed-off-by: Dietmar Eggemann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
include/trace/events/sched.h | 63 +++++++++++++++++++++++++++++++++++---------
kernel/sched/fair.c | 3 +++
2 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 51db8a90e45f..647cfaf528fd 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -566,14 +566,15 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
#ifdef CONFIG_SMP
#ifdef CREATE_TRACE_POINTS
static inline
-int __trace_sched_cpu(struct cfs_rq *cfs_rq)
+int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
- struct rq *rq = cfs_rq->rq;
+ struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
#else
- struct rq *rq = container_of(cfs_rq, struct rq, cfs);
+ struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
#endif
- return cpu_of(rq);
+ return rq ? cpu_of(rq)
+ : task_cpu((container_of(se, struct task_struct, se)));
}

static inline
@@ -582,25 +583,24 @@ int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len)
#ifdef CONFIG_FAIR_GROUP_SCHED
int l = path ? len : 0;

- if (task_group_is_autogroup(cfs_rq->tg))
+ if (cfs_rq && task_group_is_autogroup(cfs_rq->tg))
return autogroup_path(cfs_rq->tg, path, l) + 1;
- else
+ else if (cfs_rq && cfs_rq->tg->css.cgroup)
return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1;
-#else
+#endif
if (path)
strcpy(path, "(null)");

return strlen("(null)");
-#endif
}

static inline int __trace_sched_id(struct cfs_rq *cfs_rq)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
- return cfs_rq->tg->css.id;
-#else
- return -1;
+ if (cfs_rq)
+ return cfs_rq->tg->css.id;
#endif
+ return -1;
}
#endif /* CREATE_TRACE_POINTS */

@@ -623,7 +623,7 @@ TRACE_EVENT(sched_load_cfs_rq,
),

TP_fast_assign(
- __entry->cpu = __trace_sched_cpu(cfs_rq);
+ __entry->cpu = __trace_sched_cpu(cfs_rq, NULL);
__trace_sched_path(cfs_rq, __get_dynamic_array(path),
__get_dynamic_array_len(path));
__entry->id = __trace_sched_id(cfs_rq);
@@ -634,6 +634,45 @@ TRACE_EVENT(sched_load_cfs_rq,
TP_printk("cpu=%d path=%s id=%d load=%lu util=%lu", __entry->cpu,
__get_str(path), __entry->id, __entry->load, __entry->util)
);
+
+/*
+ * Tracepoint for sched_entity load tracking:
+ */
+TRACE_EVENT(sched_load_se,
+
+ TP_PROTO(struct sched_entity *se),
+
+ TP_ARGS(se),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(group_cfs_rq(se), NULL, 0) )
+ __field( int, id )
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned long, load )
+ __field( unsigned long, util )
+ ),
+
+ TP_fast_assign(
+ struct task_struct *p = group_cfs_rq(se) ? NULL
+ : container_of(se, struct task_struct, se);
+
+ __entry->cpu = __trace_sched_cpu(group_cfs_rq(se), se);
+ __trace_sched_path(group_cfs_rq(se), __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ __entry->id = __trace_sched_id(group_cfs_rq(se));
+ memcpy(__entry->comm, p ? p->comm : "(null)", TASK_COMM_LEN);
+ __entry->pid = p ? p->pid : -1;
+ __entry->load = se->avg.load_avg;
+ __entry->util = se->avg.util_avg;
+ ),
+
+ TP_printk("cpu=%d path=%s id=%d comm=%s pid=%d load=%lu util=%lu",
+ __entry->cpu, __get_str(path), __entry->id, __entry->comm,
+ __entry->pid, __entry->load, __entry->util)
+);
#endif /* CONFIG_SMP */
#endif /* _TRACE_SCHED_H */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04d4f81b96ae..d1dcb19f5b55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2940,6 +2940,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

if (cfs_rq)
trace_sched_load_cfs_rq(cfs_rq);
+ else
+ trace_sched_load_se(container_of(sa, struct sched_entity, avg));

return decayed;
}
@@ -3162,6 +3164,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
update_tg_cfs_load(cfs_rq, se);

trace_sched_load_cfs_rq(cfs_rq);
+ trace_sched_load_se(se);

return 1;
}
--
2.11.0

2017-03-28 06:35:58

by Dietmar Eggemann

Subject: [RFC PATCH 1/5] sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG

Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
CONFIG_SCHED_AUTOGROUP is enabled, the path of an autogroup has to be
available to be printed in the load tracking trace events provided by
this patch-set, regardless of whether CONFIG_SCHED_DEBUG is set or not.

Signed-off-by: Dietmar Eggemann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/autogroup.c | 2 --
kernel/sched/autogroup.h | 2 --
2 files changed, 4 deletions(-)

diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index da39489d2d80..22be4eaf6ae2 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -260,7 +260,6 @@ void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
}
#endif /* CONFIG_PROC_FS */

-#ifdef CONFIG_SCHED_DEBUG
int autogroup_path(struct task_group *tg, char *buf, int buflen)
{
if (!task_group_is_autogroup(tg))
@@ -268,4 +267,3 @@ int autogroup_path(struct task_group *tg, char *buf, int buflen)

return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
}
-#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched/autogroup.h b/kernel/sched/autogroup.h
index ce40c810cd5c..6a661cfa9584 100644
--- a/kernel/sched/autogroup.h
+++ b/kernel/sched/autogroup.h
@@ -55,11 +55,9 @@ autogroup_task_group(struct task_struct *p, struct task_group *tg)
return tg;
}

-#ifdef CONFIG_SCHED_DEBUG
static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
{
return 0;
}
-#endif

#endif /* CONFIG_SCHED_AUTOGROUP */
--
2.11.0

2017-03-28 06:36:40

by Dietmar Eggemann

Subject: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

The trace event keys load and util (utilization) are mapped to:

(1) load : cfs_rq->runnable_load_avg

(2) util : cfs_rq->avg.util_avg

To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED), the following
special handling is necessary for non-existent key=value pairs:

path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.

id = -1 : In case of !CONFIG_FAIR_GROUP_SCHED.

The following list shows examples of the key=value pairs in different
configurations for:

(1) a root task_group:

cpu=4 path=/ id=1 load=6 util=331

(2) a task_group:

cpu=1 path=/tg1/tg11/tg111 id=4 load=538 util=522

(3) an autogroup:

cpu=3 path=/autogroup-18 id=0 load=997 util=517

(4) w/o CONFIG_FAIR_GROUP_SCHED:

cpu=0 path=(null) id=-1 load=314 util=289

The trace event is only defined for CONFIG_SMP.

The helper function __trace_sched_path() serves two purposes: called with
path == NULL it returns the length needed for the dynamic array, and called
with path != NULL it copies the path into it.

Signed-off-by: Dietmar Eggemann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
include/trace/events/sched.h | 73 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 9 ++++++
2 files changed, 82 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9e3ef6c99e4b..51db8a90e45f 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,79 @@ TRACE_EVENT(sched_wake_idle_without_ipi,

TP_printk("cpu=%d", __entry->cpu)
);
+
+#ifdef CONFIG_SMP
+#ifdef CREATE_TRACE_POINTS
+static inline
+int __trace_sched_cpu(struct cfs_rq *cfs_rq)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct rq *rq = cfs_rq->rq;
+#else
+ struct rq *rq = container_of(cfs_rq, struct rq, cfs);
+#endif
+ return cpu_of(rq);
+}
+
+static inline
+int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ int l = path ? len : 0;
+
+ if (task_group_is_autogroup(cfs_rq->tg))
+ return autogroup_path(cfs_rq->tg, path, l) + 1;
+ else
+ return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1;
+#else
+ if (path)
+ strcpy(path, "(null)");
+
+ return strlen("(null)");
+#endif
+}
+
+static inline int __trace_sched_id(struct cfs_rq *cfs_rq)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ return cfs_rq->tg->css.id;
+#else
+ return -1;
+#endif
+}
+#endif /* CREATE_TRACE_POINTS */
+
+/*
+ * Tracepoint for cfs_rq load tracking:
+ */
+TRACE_EVENT(sched_load_cfs_rq,
+
+ TP_PROTO(struct cfs_rq *cfs_rq),
+
+ TP_ARGS(cfs_rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(cfs_rq, NULL, 0) )
+ __field( int, id )
+ __field( unsigned long, load )
+ __field( unsigned long, util )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = __trace_sched_cpu(cfs_rq);
+ __trace_sched_path(cfs_rq, __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ __entry->id = __trace_sched_id(cfs_rq);
+ __entry->load = cfs_rq->runnable_load_avg;
+ __entry->util = cfs_rq->avg.util_avg;
+ ),
+
+ TP_printk("cpu=%d path=%s id=%d load=%lu util=%lu", __entry->cpu,
+ __get_str(path), __entry->id, __entry->load, __entry->util)
+);
+#endif /* CONFIG_SMP */
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03adf9fb48b1..ac19ab6ced8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2950,6 +2950,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
}

+ if (cfs_rq)
+ trace_sched_load_cfs_rq(cfs_rq);
+
return decayed;
}

@@ -3170,6 +3173,8 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
update_tg_cfs_util(cfs_rq, se);
update_tg_cfs_load(cfs_rq, se);

+ trace_sched_load_cfs_rq(cfs_rq);
+
return 1;
}

@@ -3359,6 +3364,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
set_tg_cfs_propagate(cfs_rq);

cfs_rq_util_change(cfs_rq);
+
+ trace_sched_load_cfs_rq(cfs_rq);
}

/**
@@ -3379,6 +3386,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
set_tg_cfs_propagate(cfs_rq);

cfs_rq_util_change(cfs_rq);
+
+ trace_sched_load_cfs_rq(cfs_rq);
}

/* Add the load generated by se into cfs_rq's load average */
--
2.11.0

2017-03-28 06:36:07

by Dietmar Eggemann

Subject: [RFC PATCH 3/5] sched/fair: Export group_cfs_rq()

Export struct cfs_rq *group_cfs_rq(struct sched_entity *se) to be able
to distinguish sched_entities representing either tasks or task_groups
in the sched_entity related load tracking trace event provided by the
next patch.

Signed-off-by: Dietmar Eggemann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 10 ++++++++++
kernel/sched/fair.c | 12 ------------
2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee84fd43..8a35ff99140b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -392,6 +392,16 @@ struct sched_entity {
#endif
};

+/* cfs_rq "owned" by this sched_entity */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *se)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ return se->my_q;
+#else
+ return NULL;
+#endif
+}
+
struct sched_rt_entity {
struct list_head run_list;
unsigned long timeout;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac19ab6ced8f..04d4f81b96ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -292,12 +292,6 @@ static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
return se->cfs_rq;
}

-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return grp->my_q;
-}
-
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
@@ -449,12 +443,6 @@ static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
return &rq->cfs;
}

-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return NULL;
-}
-
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
}
--
2.11.0

2017-03-28 06:36:38

by Dietmar Eggemann

Subject: [RFC PATCH 5/5] sched/events: Introduce task_group load tracking trace event

The trace event key load is mapped to:

(1) load : cfs_rq->tg->load_avg

The cfs_rq owned by the task_group is used as the only parameter for the
trace event because it has a reference to both the task_group and the cpu.
Using the task_group as a parameter instead would require the cpu as a
second parameter, since a task_group is global and not per-cpu data. The
cpu key only tells on which cpu the value was gathered.

The following list shows examples of the key=value pairs for:

(1) a task group:

cpu=1 path=/tg1/tg11/tg111 id=4 load=517

(2) an autogroup:

cpu=1 path=/autogroup-10 id=0 load=1050

We don't maintain a load signal for the root task group.

The trace event is only defined if cfs group scheduling support
(CONFIG_FAIR_GROUP_SCHED) is enabled.

Signed-off-by: Dietmar Eggemann <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
kernel/sched/fair.c | 2 ++
2 files changed, 33 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 647cfaf528fd..3fe0092176f8 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -673,6 +673,37 @@ TRACE_EVENT(sched_load_se,
__entry->cpu, __get_str(path), __entry->id, __entry->comm,
__entry->pid, __entry->load, __entry->util)
);
+
+/*
+ * Tracepoint for task_group load tracking:
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+TRACE_EVENT(sched_load_tg,
+
+ TP_PROTO(struct cfs_rq *cfs_rq),
+
+ TP_ARGS(cfs_rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(cfs_rq, NULL, 0) )
+ __field( int, id )
+ __field( long, load )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cfs_rq->rq->cpu;
+ __trace_sched_path(cfs_rq, __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ __entry->id = cfs_rq->tg->css.id;
+ __entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
+ ),
+
+ TP_printk("cpu=%d path=%s id=%d load=%ld", __entry->cpu,
+ __get_str(path), __entry->id, __entry->load)
+);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */
#endif /* _TRACE_SCHED_H */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1dcb19f5b55..dbe2d5ef8b9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2997,6 +2997,8 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
+
+ trace_sched_load_tg(cfs_rq);
}

/*
--
2.11.0

2017-03-28 07:56:48

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, Mar 28, 2017 at 07:35:38AM +0100, Dietmar Eggemann wrote:
> The trace event keys load and util (utilization) are mapped to:
>
> (1) load : cfs_rq->runnable_load_avg
>
> (2) util : cfs_rq->avg.util_avg
>
> To let this trace event work for configurations w/ and w/o group
> scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
> special handling is necessary for non-existent key=value pairs:
>
> path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.
>
> id = -1 : In case of !CONFIG_FAIR_GROUP_SCHED.
>
> The following list shows examples of the key=value pairs in different
> configurations for:
>
> (1) a root task_group:
>
> cpu=4 path=/ id=1 load=6 util=331

What's @id and why do we care?

2017-03-28 08:01:05

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, Mar 28, 2017 at 07:35:38AM +0100, Dietmar Eggemann wrote:
>
> + if (cfs_rq)
> + trace_sched_load_cfs_rq(cfs_rq);

You can do that with DEFINE_EVENT_CONDITION and TP_CONDITION.
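
E.g., since sched_load_cfs_rq is a standalone TRACE_EVENT (no event class to
share), a conditional variant could look roughly like this (sketch only,
path/id fields elided):

TRACE_EVENT_CONDITION(sched_load_cfs_rq,

	TP_PROTO(struct cfs_rq *cfs_rq),

	TP_ARGS(cfs_rq),

	TP_CONDITION(cfs_rq != NULL),

	TP_STRUCT__entry(
		__field(	int,		cpu	)
		__field(	unsigned long,	load	)
		__field(	unsigned long,	util	)
	),

	TP_fast_assign(
		__entry->cpu	= __trace_sched_cpu(cfs_rq);
		__entry->load	= cfs_rq->runnable_load_avg;
		__entry->util	= cfs_rq->avg.util_avg;
	),

	TP_printk("cpu=%d load=%lu util=%lu",
		  __entry->cpu, __entry->load, __entry->util)
);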

2017-03-28 08:05:35

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On Tue, Mar 28, 2017 at 07:35:40AM +0100, Dietmar Eggemann wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04d4f81b96ae..d1dcb19f5b55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2940,6 +2940,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> if (cfs_rq)
> trace_sched_load_cfs_rq(cfs_rq);
> + else
> + trace_sched_load_se(container_of(sa, struct sched_entity, avg));
>
> return decayed;
> }
> @@ -3162,6 +3164,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> update_tg_cfs_load(cfs_rq, se);
>
> trace_sched_load_cfs_rq(cfs_rq);
> + trace_sched_load_se(se);
>
> return 1;
> }

Having back-to-back tracepoints is disgusting.

2017-03-28 08:09:11

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On Tue, Mar 28, 2017 at 07:35:40AM +0100, Dietmar Eggemann wrote:
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 51db8a90e45f..647cfaf528fd 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -566,14 +566,15 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
> #ifdef CONFIG_SMP
> #ifdef CREATE_TRACE_POINTS
> static inline
> -int __trace_sched_cpu(struct cfs_rq *cfs_rq)
> +int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> - struct rq *rq = cfs_rq->rq;
> + struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
> #else
> - struct rq *rq = container_of(cfs_rq, struct rq, cfs);
> + struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
> #endif
> - return cpu_of(rq);
> + return rq ? cpu_of(rq)
> + : task_cpu((container_of(se, struct task_struct, se)));
> }

So here you duplicate lots of FAIR_GROUP internals. So why do you then
have to expose group_cfs_rq() in the previous patch?

2017-03-28 10:06:36

by Vincent Guittot

Subject: Re: [RFC PATCH 0/5] CFS load tracking trace events

On 28 March 2017 at 08:35, Dietmar Eggemann <[email protected]> wrote:
> This patch-set introduces trace events for load (and utilization)
> tracking for the following three cfs scheduler bricks: cfs_rq,
> sched_entity and task_group.
>
> I've decided to sent it out because people are discussing problems with
> PELT related functionality and as a base for the related discussion next
> week at the OSPM-summit in Pisa.
>
> The requirements for the design are:
>
> (1) Support for all combinations related to the following kernel
> configuration options: CONFIG_SMP, CONFIG_FAIR_GROUP_SCHED,
> CONFIG_SCHED_AUTOGROUP, CONFIG_DEBUG_KERNEL.
>
> (2) All key=value pairs have to appear in all combinations of the
> kernel configuration options mentioned under (1).
>
> (3) Stable interface, i.e. only use keys for a key=value pairs which
> are independent from the underlying (load tracking) implementation.
>
> (4) Minimal invasive to the actual cfs scheduler code, i.e. the trace
> event should not have to be guarded by an if condition (e.g.
> entity_is_task(se) to distinguish between task and task_group)
> or kernel configuration switch.
>
> The trace events only expose one load (and one utilization) key=value
> pair besides those used to identify the cfs scheduler brick. They do
> not provide any internal implementation details of the actual (PELT)
> load-tracking mechanism.
> This limitation might be too much since for a cfs_rq we can only trace
> runnable load (cfs_rq->runnable_load_avg) or runnable/blocked load
> (cfs_rq->avg.load_avg).
> Other load metrics like instantaneous load ({cfs_rq|se}->load.weight)
> are not traced but could be easily added.
>
> The following keys are used to identify the cfs scheduler brick:
>
> (1) Cpu number the cfs scheduler brick is attached to.
>
> (2) Task_group path and (css) id.
>
> (3) Task name and pid.

Do you really need both path/name and id/pid?

The path/name looks quite intrusive, so can't we just use id/pid?

>
> In case a key does not apply due to an unset Kernel configuration option
> or the fact that a sched_entity can represent either a task or a
> task_group its value is set to an invalid default:
>
> (1) Task_group path: "(null)"
>
> (2) Task group id: -1
>
> (3) Task name: "(null)"
>
> (4) Task pid: -1
>
> Load tracking trace events are placed into kernel/sched/fair.c:
>
> (1) for cfs_rq:
>
> - In PELT core function __update_load_avg().
>
> - During sched_entity attach/detach.
>
> - During sched_entity load/util propagation down the task_group
> hierarchy.
>
> (2) for sched_entity:
>
> - In PELT core function __update_load_avg().
>
> - During sched_entity load/util propagation down the task_group
> hierarchy.
>
> (3) for task_group:
>
> - In its PELT update function.
>
> An alternative for using __update_load_avg() would be to put trace
> events into update_cfs_rq_load_avg() for cfs_rq's and into
> set_task_rq_fair(), update_load_avg(), sync_entity_load_avg() for
> sched_entities. This would also get rid of the if(cfs_rq)/else condition
> in __update_load_avg().
>
> These trace events still look a bit fragile.
> First of all, this patch-set has to use cfs specific data structures
> in the global task scheduler trace file include/trace/events/sched.h.
> And second, the accessor-functions (rq_of(), task_of(), etc.) are
> private to the cfs scheduler class. In case they would be public these
> trace events would be easier to code. That said, group_cfs_rq() is
> already made public by this patch-stack.
>
> This version bases on tip/sched/core as of yesterday (bc4278987e38). It
> has been compile tested on ~160 configurations via 0day's kbuild test
> robot.
>
> Dietmar Eggemann (5):
> sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG
> sched/events: Introduce cfs_rq load tracking trace event
> sched/fair: Export group_cfs_rq()
> sched/events: Introduce sched_entity load tracking trace event
> sched/events: Introduce task_group load tracking trace event
>
> include/linux/sched.h | 10 +++
> include/trace/events/sched.h | 143 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/autogroup.c | 2 -
> kernel/sched/autogroup.h | 2 -
> kernel/sched/fair.c | 26 ++++----
> 5 files changed, 167 insertions(+), 16 deletions(-)
>
> --
> 2.11.0
>

2017-03-28 13:31:41

by Dietmar Eggemann

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On 03/28/2017 09:56 AM, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 07:35:38AM +0100, Dietmar Eggemann wrote:

[...]

>> (1) a root task_group:
>>
>> cpu=4 path=/ id=1 load=6 util=331
>
> What's @id and why do we care?

It's an id, unique per cgroup subsystem, which every task_group (cpu controller) has:

struct task_group {
	struct cgroup_subsys_state css {
		...
		int id;
		...
	}
	...
}

The root task group path=/ has id=1 and all autogroups have id=0.

I agree, this id is redundant in case we have the task_group path.

2017-03-28 13:47:43

by Dietmar Eggemann

Subject: Re: [RFC PATCH 0/5] CFS load tracking trace events

On 03/28/2017 12:05 PM, Vincent Guittot wrote:
> On 28 March 2017 at 08:35, Dietmar Eggemann <[email protected]> wrote:

[...]

>> The following keys are used to identify the cfs scheduler brick:
>>
>> (1) Cpu number the cfs scheduler brick is attached to.
>>
>> (2) Task_group path and (css) id.
>>
>> (3) Task name and pid.
>
> Do you really need both path/name and id/pid ?
>
> The path/name looks quite intrusive so can't we just use id/pid ?

One problem is that all autogroups use id=0.

Another thing with task_groups is that dealing with path="/tg1/tg11" is
so much more intuitive than id="7".

IMHO, we do need task name and pid to be able to clearly identify a task
(same name but different pid, or during the fork phase, when the forkee
still has the name of the forker).

You're right, the implementation with path is more complicated but I
guess that's worth it. We could get rid of 'id' though.

[...]

2017-03-28 14:04:00

by Dietmar Eggemann

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On 03/28/2017 10:05 AM, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 07:35:40AM +0100, Dietmar Eggemann wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04d4f81b96ae..d1dcb19f5b55 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2940,6 +2940,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> if (cfs_rq)
>> trace_sched_load_cfs_rq(cfs_rq);
>> + else
>> + trace_sched_load_se(container_of(sa, struct sched_entity, avg));
>>
>> return decayed;
>> }
>> @@ -3162,6 +3164,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>> update_tg_cfs_load(cfs_rq, se);
>>
>> trace_sched_load_cfs_rq(cfs_rq);
>> + trace_sched_load_se(se);
>>
>> return 1;
>> }
>
> Having back-to-back tracepoints is disgusting.

Yeah, but avoiding placing them like this is hard, since the calls to
update_tg_cfs_util() and update_tg_cfs_load() inside
propagate_entity_load_avg() refresh util respectively load (and
runnable_load) for both the cfs_rq and the se.

2017-03-28 14:14:42

by Dietmar Eggemann

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On 03/28/2017 10:08 AM, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 07:35:40AM +0100, Dietmar Eggemann wrote:
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 51db8a90e45f..647cfaf528fd 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -566,14 +566,15 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
>> #ifdef CONFIG_SMP
>> #ifdef CREATE_TRACE_POINTS
>> static inline
>> -int __trace_sched_cpu(struct cfs_rq *cfs_rq)
>> +int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> {
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> - struct rq *rq = cfs_rq->rq;
>> + struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
>> #else
>> - struct rq *rq = container_of(cfs_rq, struct rq, cfs);
>> + struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
>> #endif
>> - return cpu_of(rq);
>> + return rq ? cpu_of(rq)
>> + : task_cpu((container_of(se, struct task_struct, se)));
>> }
>
> So here you duplicate lots of FAIR_GROUP internals. So why do you then
> have to expose group_cfs_rq() in the previous patch?
>

Not having group_cfs_rq() available made the trace event code too
confusing.

But as I mentioned in the second-to-last paragraph of the cover
letter, having all the necessary cfs accessor functions (rq_of(),
task_of(), etc.) available would definitely streamline the coding of
these trace events.

Do you think that making them public in include/linux/sched.h is the way
to go? What about the namespace issue with the other sched classes? Should
they be exported with the names they have right now (since cfs was there
first), or should they be renamed to cfs_task_of(), rq_of_cfs_rq(), etc.?

The RT and Deadline classes already have their own (private) accessor
functions (e.g. dl_task_of() or rq_of_dl_rq()).

2017-03-28 14:46:08

by Steven Rostedt

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, 28 Mar 2017 07:35:38 +0100
Dietmar Eggemann <[email protected]> wrote:

> /* This part must be outside protection */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03adf9fb48b1..ac19ab6ced8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2950,6 +2950,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> }
>
> + if (cfs_rq)
> + trace_sched_load_cfs_rq(cfs_rq);
> +

Please use TRACE_EVENT_CONDITION(), and test for cfs_rq not NULL.

That way it moves the if (cfs_rq) out of the scheduler code and into
the jump label protected location. That is, the if is only tested when
tracing is enabled.
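
The call site in __update_load_avg() then simply becomes (roughly):

	trace_sched_load_cfs_rq(cfs_rq);

with the NULL check done inside the (conditional) event, and only when it
is enabled.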

-- Steve


> return decayed;
> }
>
> @@ -3170,6 +3173,8 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> update_tg_cfs_util(cfs_rq, se);
> update_tg_cfs_load(cfs_rq, se);
>
> + trace_sched_load_cfs_rq(cfs_rq);
> +
> return 1;
> }
>
> @@ -3359,6 +3364,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> set_tg_cfs_propagate(cfs_rq);
>
> cfs_rq_util_change(cfs_rq);
> +
> + trace_sched_load_cfs_rq(cfs_rq);
> }
>
> /**
> @@ -3379,6 +3386,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> set_tg_cfs_propagate(cfs_rq);
>
> cfs_rq_util_change(cfs_rq);
> +
> + trace_sched_load_cfs_rq(cfs_rq);
> }
>
> /* Add the load generated by se into cfs_rq's load average */

2017-03-28 14:48:43

by Steven Rostedt

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, 28 Mar 2017 10:00:50 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Mar 28, 2017 at 07:35:38AM +0100, Dietmar Eggemann wrote:
> >
> > + if (cfs_rq)
> > + trace_sched_load_cfs_rq(cfs_rq);
>
> You can do that with DEFINE_EVENT_CONDITION and TP_CONDITION.

I just read this after I replied about using TRACE_EVENT_CONDITION(). As
there is no DEFINE_EVENT(), TRACE_EVENT_CONDITION() should be used. All
the locations expect cfs_rq to not be NULL, I assume.

-- Steve

2017-03-28 16:41:25

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On Tue, Mar 28, 2017 at 04:13:45PM +0200, Dietmar Eggemann wrote:

> Do you think that making them public in include/linux/sched.h is the way to
> go?

No; all that stuff should really stay private. tracepoints are a very
bad reason to leak this stuff.

2017-03-28 16:45:51

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, Mar 28, 2017 at 10:46:00AM -0400, Steven Rostedt wrote:
> On Tue, 28 Mar 2017 07:35:38 +0100
> Dietmar Eggemann <[email protected]> wrote:
>
> > /* This part must be outside protection */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 03adf9fb48b1..ac19ab6ced8f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2950,6 +2950,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> > }
> >
> > + if (cfs_rq)
> > + trace_sched_load_cfs_rq(cfs_rq);
> > +
>
> Please use TRACE_EVENT_CONDITION(), and test for cfs_rq not NULL.

I too suggested that; but then I looked again at that code and we can
actually do this. cfs_rq can be constant propagated and the if
determined at build time.

It's not immediately obvious from the current code; but if we do
something like the below, it should be clearer.

---
Subject: sched/fair: Explicitly generate __update_load_avg() instances
From: Peter Zijlstra <[email protected]>
Date: Tue Mar 28 11:08:20 CEST 2017

The __update_load_avg() function is an __always_inline because it's
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).

Explicitly instantiate the 3 variants.

Note that most of this is called from rather hot paths, so reducing
branches is good.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u6
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
u64 delta, scaled_delta, periods;
@@ -2953,6 +2953,26 @@ __update_load_avg(u64 now, int cpu, stru
return decayed;
}

+static int
+__update_load_avg_blocked_se(u64 now, int cpu, struct sched_avg *sa)
+{
+ return ___update_load_avg(now, cpu, sa, 0, 0, NULL);
+}
+
+static int
+__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long weight, int running)
+{
+ return ___update_load_avg(now, cpu, sa, weight, running, NULL);
+}
+
+static int
+__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long weight, int running, struct cfs_rq *cfs_rq)
+{
+ return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);
+}
+
/*
* Signed add and clamp on underflow.
*
@@ -3014,6 +3034,9 @@ static inline void update_tg_load_avg(st
void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next)
{
+ u64 p_last_update_time;
+ u64 n_last_update_time;
+
if (!sched_feat(ATTACH_AGE_LOAD))
return;

@@ -3024,11 +3047,11 @@ void set_task_rq_fair(struct sched_entit
* time. This will result in the wakee task is less decayed, but giving
* the wakee more load sounds not bad.
*/
- if (se->avg.last_update_time && prev) {
- u64 p_last_update_time;
- u64 n_last_update_time;
+ if (!(se->avg.last_update_time && prev))
+ return;

#ifndef CONFIG_64BIT
+ {
u64 p_last_update_time_copy;
u64 n_last_update_time_copy;

@@ -3043,14 +3066,15 @@ void set_task_rq_fair(struct sched_entit

} while (p_last_update_time != p_last_update_time_copy ||
n_last_update_time != n_last_update_time_copy);
+ }
#else
- p_last_update_time = prev->avg.last_update_time;
- n_last_update_time = next->avg.last_update_time;
+ p_last_update_time = prev->avg.last_update_time;
+ n_last_update_time = next->avg.last_update_time;
#endif
- __update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
- &se->avg, 0, 0, NULL);
- se->avg.last_update_time = n_last_update_time;
- }
+ __update_load_avg_blocked_se(p_last_update_time,
+ cpu_of(rq_of(prev)),
+ &se->avg);
+ se->avg.last_update_time = n_last_update_time;
}

/* Take into account change of utilization of a child task group */
@@ -3329,9 +3353,9 @@ static inline void update_load_avg(struc
* track group sched_entity load average for task_h_load calc in migration
*/
if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
- __update_load_avg(now, cpu, &se->avg,
+ __update_load_avg_se(now, cpu, &se->avg,
se->on_rq * scale_load_down(se->load.weight),
- cfs_rq->curr == se, NULL);
+ cfs_rq->curr == se);
}

decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
@@ -3437,7 +3461,7 @@ void sync_entity_load_avg(struct sched_e
u64 last_update_time;

last_update_time = cfs_rq_last_update_time(cfs_rq);
- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+ __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg);
}

/*

2017-03-28 16:58:15

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, Mar 28, 2017 at 06:44:59PM +0200, Peter Zijlstra wrote:
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u6
> * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
> */
> static __always_inline int
> -__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> +___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> u64 delta, scaled_delta, periods;
> @@ -2953,6 +2953,26 @@ __update_load_avg(u64 now, int cpu, stru
> return decayed;
> }
>
> +static int
> +__update_load_avg_blocked_se(u64 now, int cpu, struct sched_avg *sa)
> +{
> + return ___update_load_avg(now, cpu, sa, 0, 0, NULL);
> +}
> +
> +static int
> +__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
> + unsigned long weight, int running)
> +{
> + return ___update_load_avg(now, cpu, sa, weight, running, NULL);
> +}
> +
> +static int
> +__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> + unsigned long weight, int running, struct cfs_rq *cfs_rq)
> +{
> + return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);

Although ideally we'd be able to tell the compiler that cfs_rq will not
be NULL here. Hurmph.. no __builtin for that I think :/

> +}

2017-03-28 17:20:41

by Patrick Bellasi

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On 28-Mar 18:57, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 06:44:59PM +0200, Peter Zijlstra wrote:
> > ---
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u6
> > * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
> > */
> > static __always_inline int
> > -__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > +___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > {
> > u64 delta, scaled_delta, periods;
> > @@ -2953,6 +2953,26 @@ __update_load_avg(u64 now, int cpu, stru
> > return decayed;
> > }
> >
> > +static int
> > +__update_load_avg_blocked_se(u64 now, int cpu, struct sched_avg *sa)
> > +{
> > + return ___update_load_avg(now, cpu, sa, 0, 0, NULL);
> > +}
> > +
> > +static int
> > +__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
> > + unsigned long weight, int running)
> > +{
> > + return ___update_load_avg(now, cpu, sa, weight, running, NULL);
> > +}
> > +
> > +static int
> > +__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > + unsigned long weight, int running, struct cfs_rq *cfs_rq)

__attribute__((nonnull (6)));

> > +{
> > + return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);
>
> Although ideally we'd be able to tell the compiler that cfs_rq will not
> be NULL here. Hurmph.. no __builtin for that I think :/

What about the above attribute?

>
> > +}
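
I.e. on a (forward) declaration it would read roughly like this (just a
sketch):

static int
__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
		  unsigned long weight, int running,
		  struct cfs_rq *cfs_rq) __attribute__((nonnull(6)));

telling GCC that parameter 6 (cfs_rq) must not be NULL, so it can warn on
NULL arguments and assume non-NULL inside the function.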

--
#include <best/regards.h>

Patrick Bellasi

2017-03-28 17:38:18

by Steven Rostedt

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, 28 Mar 2017 13:36:26 -0400
Steven Rostedt <[email protected]> wrote:

> But why play games, and rely on the design of the code? A
> TRACE_EVENT_CONDITION() is more robust and documents that this
> tracepoint should not be called when cfs_rq is NULL.

In other words, what are you trying to save by not using
TRACE_EVENT_CONDITION()?

-- Steve

2017-03-28 17:36:34

by Steven Rostedt

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, 28 Mar 2017 18:44:59 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Mar 28, 2017 at 10:46:00AM -0400, Steven Rostedt wrote:
> > On Tue, 28 Mar 2017 07:35:38 +0100
> > Dietmar Eggemann <[email protected]> wrote:
> >
> > > /* This part must be outside protection */
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 03adf9fb48b1..ac19ab6ced8f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -2950,6 +2950,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> > > }
> > >
> > > + if (cfs_rq)
> > > + trace_sched_load_cfs_rq(cfs_rq);
> > > +
> >
> > Please use TRACE_EVENT_CONDITION(), and test for cfs_rq not NULL.
>
> I too suggested that; but then I looked again at that code and we can
> actually do this. cfs_rq can be constant propagated and the if
> determined at build time.
>
> Its not immediately obvious from the current code; but if we do
> something like the below, it should be clearer.
>

But why play games, and rely on the design of the code? A
TRACE_EVENT_CONDITION() is more robust and documents that this
tracepoint should not be called when cfs_rq is NULL.

-- Steve

2017-03-28 18:18:33

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Tue, Mar 28, 2017 at 06:20:05PM +0100, Patrick Bellasi wrote:
> On 28-Mar 18:57, Peter Zijlstra wrote:
> > On Tue, Mar 28, 2017 at 06:44:59PM +0200, Peter Zijlstra wrote:

> > > +static int
> > > +__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > + unsigned long weight, int running, struct cfs_rq *cfs_rq)
>
> __attribute__((nonnull (6)));
>
> > > +{
> > > + return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);
> >
> > Although ideally we'd be able to tell the compiler that cfs_rq will not
> > be NULL here. Hurmph.. no __builtin for that I think :/
>
> What about the above attribute?

Ooh, shiny, thanks! My bad for failing to check the function attributes.

2017-03-29 20:19:53

by Dietmar Eggemann

Subject: Re: [RFC PATCH 4/5] sched/events: Introduce sched_entity load tracking trace event

On 03/28/2017 06:41 PM, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 04:13:45PM +0200, Dietmar Eggemann wrote:
>
>> Do you think that making them public in include/linux/sched.h is the way to
>> go?
>
> No; all that stuff should really stay private. tracepoints are a very
> bad reason to leak this stuff.

Understood & makes sense to me. In hindsight, it's not too complicated
to code group_cfs_rq in include/trace/events/sched.h.
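
E.g. something local under CREATE_TRACE_POINTS would do; just a sketch,
with a made-up helper name:

#ifdef CREATE_TRACE_POINTS
static inline struct cfs_rq *__trace_group_cfs_rq(struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	return se->my_q;
#else
	return NULL;
#endif
}
#endif /* CREATE_TRACE_POINTS */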

2017-03-29 20:38:40

by Dietmar Eggemann

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On 03/28/2017 07:37 PM, Steven Rostedt wrote:
> On Tue, 28 Mar 2017 13:36:26 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> But why play games, and rely on the design of the code? A
>> TRACE_EVENT_CONDITION() is more robust and documents that this
>> tracepoint should not be called when cfs_rq is NULL.
>
> In other words, what are you trying to save for not using the
> TRACE_EVENT_CONDITION()?

IMHO, if we could avoid this

	if (cfs_rq)
		trace_sched_load_cfs_rq(cfs_rq);
	else
		trace_sched_load_se(container_of(sa, struct sched_entity, avg));

in __update_load_avg(), then we can use 'unconditional' TRACE_EVENTs in
all call-sites:

__update_load_avg{_cfs_rq}(), propagate_entity_load_avg(),
attach_entity_load_avg(), detach_entity_load_avg() for cfs_rq and

__update_load_avg_blocked_se(), __update_load_avg_se(),
propagate_entity_load_avg() for se.
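
E.g. (just a sketch, based on the split proposed above):

static int
__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
		     unsigned long weight, int running)
{
	int decayed = ___update_load_avg(now, cpu, sa, weight, running, NULL);

	trace_sched_load_se(container_of(sa, struct sched_entity, avg));

	return decayed;
}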

[...]

2017-03-29 21:04:45

by Dietmar Eggemann

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On 03/28/2017 06:44 PM, Peter Zijlstra wrote:
> On Tue, Mar 28, 2017 at 10:46:00AM -0400, Steven Rostedt wrote:
>> On Tue, 28 Mar 2017 07:35:38 +0100
>> Dietmar Eggemann <[email protected]> wrote:

[...]

> I too suggested that; but then I looked again at that code and we can
> actually do this. cfs_rq can be constant propagated and the if
> determined at build time.
>
> Its not immediately obvious from the current code; but if we do
> something like the below, it should be clearer.
>
> ---
> Subject: sched/fair: Explicitly generate __update_load_avg() instances
> From: Peter Zijlstra <[email protected]>
> Date: Tue Mar 28 11:08:20 CEST 2017
>
> The __update_load_avg() function is an __always_inline because it's
> used with constant propagation to generate different variants of the
> code without having to duplicate it (which would be prone to bugs).

Ah, so the if(cfs_rq)/else condition should stay in ___update_load_avg()
and I shouldn't move the trace events into the 3 variants?

I tried to verify that the if is determined at build time, but that's kind
of hard to do with trace events.

> Explicitly instantiate the 3 variants.
>
> Note that most of this is called from rather hot paths, so reducing
> branches is good.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u6
> * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
> */
> static __always_inline int
> -__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> +___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> u64 delta, scaled_delta, periods;
> @@ -2953,6 +2953,26 @@ __update_load_avg(u64 now, int cpu, stru
> return decayed;
> }
>
> +static int
> +__update_load_avg_blocked_se(u64 now, int cpu, struct sched_avg *sa)
> +{
> + return ___update_load_avg(now, cpu, sa, 0, 0, NULL);
> +}
> +
> +static int
> +__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
> + unsigned long weight, int running)
> +{
> + return ___update_load_avg(now, cpu, sa, weight, running, NULL);
> +}
> +
> +static int
> +__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> + unsigned long weight, int running, struct cfs_rq *cfs_rq)
> +{
> + return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);
> +}

Why not reduce the parameter list of these 3 incarnations to 'now, cpu,
object'?

static int
__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)

static int
__update_load_avg_se(u64 now, int cpu, struct sched_entity *se)

static int
__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

[...]

2017-03-30 07:04:30

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On Wed, Mar 29, 2017 at 11:03:45PM +0200, Dietmar Eggemann wrote:

> > +static int
> > +__update_load_avg_blocked_se(u64 now, int cpu, struct sched_avg *sa)
> > +{
> > + return ___update_load_avg(now, cpu, sa, 0, 0, NULL);
> > +}
> > +
> > +static int
> > +__update_load_avg_se(u64 now, int cpu, struct sched_avg *sa,
> > + unsigned long weight, int running)
> > +{
> > + return ___update_load_avg(now, cpu, sa, weight, running, NULL);
> > +}
> > +
> > +static int
> > +__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > + unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +{
> > + return ___update_load_avg(now, cpu, sa, weight, running, cfs_rq);
> > +}
>
> Why not reduce the parameter list of these 3 incarnations to 'now, cpu,
> object'?
>
> static int
> __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
>
> static int
> __update_load_avg_se(u64 now, int cpu, struct sched_entity *se)
>
> static int
> __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>
> [...]

doesn't quite work with se, but yes good idea.

And this way we don't need the nonnull attribute either, because it
should be clear from having dereferenced it that it cannot be null.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2849,7 +2849,7 @@ static u32 __compute_runnable_contrib(u6
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
static __always_inline int
-__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
+___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
u64 delta, scaled_delta, periods;
@@ -2953,6 +2953,28 @@ __update_load_avg(u64 now, int cpu, stru
return decayed;
}

+static int
+__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+ return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL);
+}
+
+static int
+__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return ___update_load_avg(now, cpu, &se->avg,
+ se->on_rq * scale_load_down(se->load.weight),
+ cfs_rq->curr == se, NULL);
+}
+
+static int
+__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+ return ___update_load_avg(now, cpu, &cfs_rq->avg,
+ scale_load_down(cfs_rq->load.weight),
+ cfs_rq->curr != NULL, cfs_rq);
+}
+
/*
* Signed add and clamp on underflow.
*
@@ -3014,6 +3036,9 @@ static inline void update_tg_load_avg(st
void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next)
{
+ u64 p_last_update_time;
+ u64 n_last_update_time;
+
if (!sched_feat(ATTACH_AGE_LOAD))
return;

@@ -3024,11 +3049,11 @@ void set_task_rq_fair(struct sched_entit
* time. This will result in the wakee task is less decayed, but giving
* the wakee more load sounds not bad.
*/
- if (se->avg.last_update_time && prev) {
- u64 p_last_update_time;
- u64 n_last_update_time;
+ if (!(se->avg.last_update_time && prev))
+ return;

#ifndef CONFIG_64BIT
+ {
u64 p_last_update_time_copy;
u64 n_last_update_time_copy;

@@ -3043,14 +3068,13 @@ void set_task_rq_fair(struct sched_entit

} while (p_last_update_time != p_last_update_time_copy ||
n_last_update_time != n_last_update_time_copy);
+ }
#else
- p_last_update_time = prev->avg.last_update_time;
- n_last_update_time = next->avg.last_update_time;
+ p_last_update_time = prev->avg.last_update_time;
+ n_last_update_time = next->avg.last_update_time;
#endif
- __update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
- &se->avg, 0, 0, NULL);
- se->avg.last_update_time = n_last_update_time;
- }
+ __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se);
+ se->avg.last_update_time = n_last_update_time;
}

/* Take into account change of utilization of a child task group */
@@ -3295,8 +3319,7 @@ update_cfs_rq_load_avg(u64 now, struct c
set_tg_cfs_propagate(cfs_rq);
}

- decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
- scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+ decayed = __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);

#ifndef CONFIG_64BIT
smp_wmb();
@@ -3328,11 +3351,8 @@ static inline void update_load_avg(struc
* Track task load average for carrying it to new CPU after migrated, and
* track group sched_entity load average for task_h_load calc in migration
*/
- if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
- __update_load_avg(now, cpu, &se->avg,
- se->on_rq * scale_load_down(se->load.weight),
- cfs_rq->curr == se, NULL);
- }
+ if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+ __update_load_avg_se(now, cpu, cfs_rq, se);

decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
decayed |= propagate_entity_load_avg(se);
@@ -3437,7 +3457,7 @@ void sync_entity_load_avg(struct sched_e
u64 last_update_time;

last_update_time = cfs_rq_last_update_time(cfs_rq);
- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+ __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se);
}

/*

2017-03-30 07:46:28

by Dietmar Eggemann

Subject: Re: [RFC PATCH 2/5] sched/events: Introduce cfs_rq load tracking trace event

On 03/30/2017 09:04 AM, Peter Zijlstra wrote:
> On Wed, Mar 29, 2017 at 11:03:45PM +0200, Dietmar Eggemann wrote:

[...]

>> Why not reduce the parameter list of these 3 incarnations to 'now, cpu,
>> object'?
>>
>> static int
>> __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
>>
>> static int
>> __update_load_avg_se(u64 now, int cpu, struct sched_entity *se)
>>
>> static int
>> __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>>
>> [...]
>
> doesn't quite work with se, but yes good idea.

Ah, OK, you don't want to use 'cfs_rq_of(se)->curr == se' in
__update_load_avg_se(). Is the reason that cfs_rq is already fetched in
update_load_avg()?

> And this way we don't need the nonnull attribute either, because it
> should be clear from having dereferenced it that it cannot be null.

Yes, this would be clearer now.

[...]