We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but currently schedstats is
only supported for the fair sched class. This patchset enables it for
the RT sched class as well.
- Structure
Before we make schedstats available for the RT sched class, we should
first make struct sched_statistics independent of the fair sched class.
struct sched_statistics holds the scheduler statistics of a task_struct
or a task_group, so we can move it into struct task_struct and
struct task_group to achieve that goal.
Below is a detailed explanation of the change to the structs.
The structs before this patch:
struct task_struct {               |--> struct sched_entity {
    ...                            |        ...
    struct sched_entity *se; ------|        struct sched_statistics statistics;
    struct sched_rt_entity *rt;             ...
    ...                                     ...
};                                      };
struct task_group {                    |--> se[0]->statistics : schedstats of CPU0
    ...                                |
#ifdef CONFIG_FAIR_GROUP_SCHED         |
    struct sched_entity **se; ---------|--> se[1]->statistics : schedstats of CPU1
#endif                                 |
                                       |--> se[N]->statistics : schedstats of CPUn

#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se; (N/A)
#endif
    ...
};
The '**se' in task_group is allocated by the fair sched class, which
makes it hard to reuse for other sched classes.
The structs after this patch:
struct task_struct {
    ...
    struct sched_statistics statistics;
    ...
    struct sched_entity *se;
    struct sched_rt_entity *rt;
    ...
};

struct task_group {                       |--> stats[0] : schedstats of CPU0
    ...                                   |
    struct sched_statistics **stats; -----|--> stats[1] : schedstats of CPU1
    ...                                   |
                                          |--> stats[N] : schedstats of CPUn

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity **se;
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
#endif
    ...
};
After the patch, it is clear that both se and rt_se can easily reach
their sched_statistics through the owning task_struct or task_group.
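As a rough illustration of the new layout, a task_group's per-CPU stats
array can now be allocated without involving any sched class. A minimal
sketch of such an allocator follows; the series' actual helper is
alloc_tg_schedstats() in kernel/sched/stats.h (visible in the context of
the patch #4 diff below), and the body shown here is an assumption, not
the patch code:

static int alloc_tg_schedstats(struct task_group *tg)
{
        int i;

        /* One sched_statistics pointer per possible CPU. */
        tg->stats = kcalloc(nr_cpu_ids, sizeof(struct sched_statistics *),
                            GFP_KERNEL);
        if (!tg->stats)
                return 0;

        for_each_possible_cpu(i) {
                /* Place each CPU's stats on its local memory node. */
                tg->stats[i] = kzalloc_node(sizeof(struct sched_statistics),
                                            GFP_KERNEL, cpu_to_node(i));
                if (!tg->stats[i])
                        return 0;
        }

        return 1;
}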
- Function Interface
The original prototype of the schedstats helpers is

update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)

The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get both the struct sched_statistics and the struct task_struct.
In order to make these helpers available to all sched classes, we can
pass the rq, sched_statistics and task_struct directly.
Then the new helpers are

update_stats_wait_*(struct rq *rq, struct task_struct *p,
                    struct sched_statistics *stats)

which are independent of the fair sched class.
To avoid vmlinux growing too large and to avoid introducing overhead
when !schedstat_enabled(), some new helpers, called only after a
schedstat_enabled() check, are also introduced, as suggested by Mel.
These helpers live in kernel/sched/stats.c:
__update_stats_wait_*(struct rq *rq, struct task_struct *p,
                      struct sched_statistics *stats)
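For example, with this split the fair-class wrapper reduces to a thin
inline guard around the out-of-line body. This mirrors the code in
patch #4 below; the comments are added here for illustration:

static inline void
update_stats_wait_start_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        struct sched_statistics *stats = NULL;
        struct task_struct *p = NULL;

        /* Cheap static-key check; this part stays inline. */
        if (!schedstat_enabled())
                return;

        if (entity_is_task(se))
                p = task_of(se);

        __schedstat_from_sched_entity(se, &stats);

        /* The bookkeeping itself lives out of line in kernel/sched/stats.c. */
        __update_stats_wait_start(rq_of(cfs_rq), p, stats);
}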
- Implementation
After we make struct sched_statistics and its helpers independent of
the fair sched class, we can easily use the schedstats facility for
the RT sched class.
The schedstat usage in the RT sched class is similar to that in the
fair sched class, for example:

                 fair                         RT
enqueue          update_stats_enqueue_fair    update_stats_enqueue_rt
dequeue          update_stats_dequeue_fair    update_stats_dequeue_rt
put_prev_task    update_stats_wait_start      update_stats_wait_start
set_next_task    update_stats_wait_end        update_stats_wait_end
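For instance, the RT enqueue-side hook mirrors its fair counterpart.
This is the shape of the code in patch #6; the comment is added here
for illustration:

static inline void
update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
                        int flags)
{
        if (!schedstat_enabled())
                return;

        /* Only a wakeup closes a sleep/block period worth accounting. */
        if (flags & ENQUEUE_WAKEUP)
                update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
}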
- Usage
The user can get the schedstats information in the same way as for the
fair sched class. For example:

              fair                   RT
task show     /proc/[pid]/sched      /proc/[pid]/sched
group show    cpu.stat in cgroup     cpu.stat in cgroup
The sched:sched_stat_{wait, sleep, iowait, blocked, runtime} tracepoints
can be used to trace RT tasks as well.
Detailed examples of the output of schedstats for RT tasks are in patch #6.
Changes since RFC:
- improvements to the schedstats helpers, per Mel.
- make struct sched_statistics independent of the fair sched class
Yafang Shao (6):
sched: don't include stats.h in sched.h
sched, fair: use __schedstat_set() in set_next_entity()
sched: make struct sched_statistics independent of fair sched class
sched: make schedstats helpers independent of fair sched class
sched, rt: support sched_stat_runtime tracepoint for RT sched class
sched, rt: support schedstats for RT sched class
 include/linux/sched.h    |   3 +-
 kernel/sched/core.c      |  25 +++--
 kernel/sched/deadline.c  |   5 +-
 kernel/sched/debug.c     |  82 +++++++--------
 kernel/sched/fair.c      | 209 +++++++++++++++------------------------
 kernel/sched/idle.c      |   1 +
 kernel/sched/rt.c        | 142 +++++++++++++++++++++++++-
 kernel/sched/sched.h     |   9 +-
 kernel/sched/stats.c     | 105 ++++++++++++++++++++
 kernel/sched/stats.h     |  80 +++++++++++++++
 kernel/sched/stop_task.c |   5 +-
 11 files changed, 476 insertions(+), 190 deletions(-)
--
2.18.4
The runtime of an RT task is already accounted in update_curr_rt(), so
we only need to place the sched_stat_runtime tracepoint there.
One difference between fair tasks and RT tasks is that RT tasks have no
vruntime. To reuse the sched_stat_runtime tracepoint, '0' is passed as
the vruntime for RT tasks.
The output of this tracepoint for an RT task is as follows:
stress-9748 [039] d.h. 113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/rt.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2d543a270dfe..da989653b0a2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1011,6 +1011,8 @@ static void update_curr_rt(struct rq *rq)
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
+ trace_sched_stat_runtime(curr, delta_exec, 0);
+
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
--
2.18.4
We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but currently schedstats is
only supported for the fair sched class. This patch enables it for the
RT sched class as well.
After we make struct sched_statistics and its helpers independent of
the fair sched class, we can easily use the schedstats facility for
the RT sched class.
The schedstat usage in the RT sched class is similar to that in the
fair sched class, for example:

                 fair                         RT
enqueue          update_stats_enqueue_fair    update_stats_enqueue_rt
dequeue          update_stats_dequeue_fair    update_stats_dequeue_rt
put_prev_task    update_stats_wait_start      update_stats_wait_start
set_next_task    update_stats_wait_end        update_stats_wait_end
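Both per-task and per-group statistics are reached through a single
resolver. For reference, this is the shape of the helper introduced in
the diff below; the comments are added here for illustration:

static inline void
__schedstat_from_sched_rt_entity(struct sched_rt_entity *rt_se,
                                 struct sched_statistics **stats)
{
        struct task_struct *p;
        struct task_group *tg;
        struct rt_rq *rt_rq;
        int cpu;

        if (rt_entity_is_task(rt_se)) {
                /* Task entity: the stats are embedded in the task_struct. */
                p = rt_task_of(rt_se);
                *stats = &p->stats;
        } else {
                /* Group entity: use the group's per-CPU stats slot. */
                rt_rq = group_rt_rq(rt_se);
                tg = rt_rq->tg;
                cpu = cpu_of(rq_of_rt_rq(rt_rq));
                *stats = tg->stats[cpu];
        }
}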
The user can get the schedstats information in the same way as for the
fair sched class. For example:

              fair                   RT
task show     /proc/[pid]/sched      /proc/[pid]/sched
group show    cpu.stat in cgroup     cpu.stat in cgroup
The output of an RT task's schedstats is as follows:
$ cat /proc/10461/sched
...
stats.sum_sleep_runtime : 37966.502936
stats.wait_start : 0.000000
stats.sleep_start : 0.000000
stats.block_start : 279182.986012
stats.sleep_max : 9.001121
stats.block_max : 9.292692
stats.exec_max : 0.090009
stats.slice_max : 0.000000
stats.wait_max : 0.005305
stats.wait_sum : 0.352352
stats.wait_count : 236173
stats.iowait_sum : 37875.625128
stats.iowait_count : 235933
stats.nr_migrations_cold : 0
stats.nr_failed_migrations_affine : 0
stats.nr_failed_migrations_running : 0
stats.nr_failed_migrations_hot : 0
stats.nr_forced_migrations : 0
stats.nr_wakeups : 236172
stats.nr_wakeups_sync : 0
stats.nr_wakeups_migrate : 2
stats.nr_wakeups_local : 235865
stats.nr_wakeups_remote : 307
stats.nr_wakeups_affine : 0
stats.nr_wakeups_affine_attempts : 0
stats.nr_wakeups_passive : 0
stats.nr_wakeups_idle : 0
...
The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well.
The output of these tracepoints for an RT task is as follows:
- blocked & iowait
kworker/48:1-442 [048] d... 539.830872: sched_stat_iowait: comm=stress pid=10461 delay=158242 [ns]
kworker/48:1-442 [048] d... 539.830872: sched_stat_blocked: comm=stress pid=10461 delay=158242 [ns]
- wait
stress-10460 [001] dN.. 813.965304: sched_stat_wait: comm=stress pid=10462 delay=99997536 [ns]
stress-10462 [001] d.h. 813.966300: sched_stat_runtime: comm=stress pid=10462 runtime=993812 [ns] vruntime=0 [ns]
[...]
stress-10462 [001] d.h. 814.065300: sched_stat_runtime: comm=stress pid=10462 runtime=994484 [ns] vruntime=0 [ns]
[ 100 sched_stat_runtime events in total for pid 10462 ]
[ The delay of pid 10462 is the sum of the runtimes above ]
stress-10462 [001] dN.. 814.065307: sched_stat_wait: comm=stress pid=10460 delay=100001089 [ns]
stress-10460 [001] d.h. 814.066299: sched_stat_runtime: comm=stress pid=10460 runtime=991934 [ns] vruntime=0 [ns]
- sleep
sleep-15582 [041] dN.. 1732.814348: sched_stat_sleep: comm=sleep.sh pid=15474 delay=1001223130 [ns]
sleep-15584 [041] dN.. 1733.815908: sched_stat_sleep: comm=sleep.sh pid=15474 delay=1001238954 [ns]
[ sleep.sh sleeps 1 sec in each iteration. ]
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/rt.c | 134 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 133 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index da989653b0a2..f764c2b9070d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1271,6 +1271,121 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
rt_se->on_list = 0;
}
+static inline void
+__schedstat_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+ struct sched_statistics **stats)
+{
+ struct task_struct *p;
+ struct task_group *tg;
+ struct rt_rq *rt_rq;
+ int cpu;
+
+ if (rt_entity_is_task(rt_se)) {
+ p = rt_task_of(rt_se);
+ *stats = &p->stats;
+ } else {
+ rt_rq = group_rt_rq(rt_se);
+ tg = rt_rq->tg;
+ cpu = cpu_of(rq_of_rt_rq(rt_rq));
+ *stats = tg->stats[cpu];
+ }
+}
+
+static inline void
+schedstat_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+ struct sched_statistics **stats)
+{
+ if (!schedstat_enabled())
+ return;
+
+ __schedstat_from_sched_rt_entity(rt_se, stats);
+}
+
+static inline void
+update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstat_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_wait_start(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstat_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_enqueue_sleeper(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+ int flags)
+{
+ if (!schedstat_enabled())
+ return;
+
+ if (flags & ENQUEUE_WAKEUP)
+ update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
+}
+
+static inline void
+update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstat_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_wait_end(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+ int flags)
+{
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ if ((flags & DEQUEUE_SLEEP) && p) {
+ if (p->state & TASK_INTERRUPTIBLE)
+ __schedstat_set(p->stats.sleep_start,
+ rq_clock(rq_of_rt_rq(rt_rq)));
+
+ if (p->state & TASK_UNINTERRUPTIBLE)
+ __schedstat_set(p->stats.block_start,
+ rq_clock(rq_of_rt_rq(rt_rq)));
+ }
+}
+
static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
@@ -1344,6 +1459,8 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
+
dequeue_rt_stack(rt_se, flags);
for_each_sched_rt_entity(rt_se)
__enqueue_rt_entity(rt_se, flags);
@@ -1354,6 +1471,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
dequeue_rt_stack(rt_se, flags);
for_each_sched_rt_entity(rt_se) {
@@ -1376,6 +1494,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;
+ check_schedstat_required();
+ update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
+
enqueue_rt_entity(rt_se, flags);
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -1574,9 +1695,14 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
#endif
}
-static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
+void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = &rq->rt;
+
p->se.exec_start = rq_clock_task(rq);
+ if (on_rt_rq(&p->rt))
+ update_stats_wait_end_rt(rt_rq, rt_se);
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
@@ -1640,6 +1766,12 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = &rq->rt;
+
+ if (on_rt_rq(&p->rt))
+ update_stats_wait_start_rt(rt_rq, rt_se);
+
update_curr_rt(rq);
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
--
2.18.4
schedstat_enabled() has already been checked, so we can use
__schedstat_set() directly.
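For reference, the two macros differ only in the guard; their
definitions in kernel/sched/stats.h are roughly as follows (shown here
for illustration, not part of this patch):

#ifdef CONFIG_SCHEDSTATS
/* Re-checks the schedstats static key on every use. */
#define   schedstat_set(var, val)  do { if (schedstat_enabled()) { var = (val); } } while (0)
/* Raw assignment for paths that have already checked schedstat_enabled(). */
#define __schedstat_set(var, val)  do { var = (val); } while (0)
#endif /* CONFIG_SCHEDSTATS */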
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ff1daa3d9bb..7e7c03cede94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4414,7 +4414,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
- schedstat_set(se->statistics.slice_max,
+ __schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
--
2.18.4
The original prototype of the schedstats helpers is

update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)

The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get both the struct sched_statistics and the struct task_struct.
In order to make these helpers available to all sched classes, we can
pass the rq, sched_statistics and task_struct directly.
Then the new helpers are

update_stats_wait_*(struct rq *rq, struct task_struct *p,
                    struct sched_statistics *stats)

which are independent of the fair sched class.
To avoid vmlinux growing too large and to avoid introducing overhead
when !schedstat_enabled(), some new helpers, called only after a
schedstat_enabled() check, are also introduced, as suggested by Mel.
These helpers live in kernel/sched/stats.c:
__update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
Cc: Mel Gorman <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/fair.c | 140 +++++++------------------------------------
kernel/sched/stats.c | 104 ++++++++++++++++++++++++++++++++
kernel/sched/stats.h | 32 ++++++++++
3 files changed, 157 insertions(+), 119 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 14d8df308d44..b869a83fac29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -917,69 +917,44 @@ static void update_curr_fair(struct rq *rq)
}
static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
- u64 wait_start, prev_wait_start;
+ struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
- __schedstat_from_sched_entity(se, &stats);
-
- wait_start = rq_clock(rq_of(cfs_rq));
- prev_wait_start = schedstat_val(stats->wait_start);
+ if (entity_is_task(se))
+ p = task_of(se);
- if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
- likely(wait_start > prev_wait_start))
- wait_start -= prev_wait_start;
+ __schedstat_from_sched_entity(se, &stats);
- __schedstat_set(stats->wait_start, wait_start);
+ __update_stats_wait_start(rq_of(cfs_rq), p, stats);
}
static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_end_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
struct task_struct *p = NULL;
- u64 delta;
if (!schedstat_enabled())
return;
- __schedstat_from_sched_entity(se, &stats);
-
- delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
- if (entity_is_task(se)) {
+ if (entity_is_task(se))
p = task_of(se);
- if (task_on_rq_migrating(p)) {
- /*
- * Preserve migrating task's wait time so wait_start
- * time stamp can be adjusted to accumulate wait time
- * prior to migration.
- */
- __schedstat_set(stats->wait_start, delta);
-
- return;
- }
-
- trace_sched_stat_wait(p, delta);
- }
+ __schedstat_from_sched_entity(se, &stats);
- __schedstat_set(stats->wait_max,
- max(schedstat_val(stats->wait_max), delta));
- __schedstat_inc(stats->wait_count);
- __schedstat_add(stats->wait_sum, delta);
- __schedstat_set(stats->wait_start, 0);
+ __update_stats_wait_end(rq_of(cfs_rq), p, stats);
}
static inline void
-update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_enqueue_sleeper_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
struct task_struct *p = NULL;
- u64 sleep_start, block_start;
if (!schedstat_enabled())
return;
@@ -989,67 +964,14 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
__schedstat_from_sched_entity(se, &stats);
- sleep_start = schedstat_val(stats->sleep_start);
- block_start = schedstat_val(stats->block_start);
-
- if (sleep_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
-
- if ((s64)delta < 0)
- delta = 0;
-
- if (unlikely(delta > schedstat_val(stats->sleep_max)))
- __schedstat_set(stats->sleep_max, delta);
-
- __schedstat_set(stats->sleep_start, 0);
- __schedstat_add(stats->sum_sleep_runtime, delta);
-
- if (p) {
- account_scheduler_latency(p, delta >> 10, 1);
- trace_sched_stat_sleep(p, delta);
- }
- }
- if (block_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - block_start;
-
- if ((s64)delta < 0)
- delta = 0;
-
- if (unlikely(delta > schedstat_val(stats->block_max)))
- __schedstat_set(stats->block_max, delta);
-
- __schedstat_set(stats->block_start, 0);
- __schedstat_add(stats->sum_sleep_runtime, delta);
-
- if (p) {
- if (p->in_iowait) {
- __schedstat_add(stats->iowait_sum, delta);
- __schedstat_inc(stats->iowait_count);
- trace_sched_stat_iowait(p, delta);
- }
-
- trace_sched_stat_blocked(p, delta);
-
- /*
- * Blocking time is in units of nanosecs, so shift by
- * 20 to get a milliseconds-range estimation of the
- * amount of time that the task spent sleeping:
- */
- if (unlikely(prof_on == SLEEP_PROFILING)) {
- profile_hits(SLEEP_PROFILING,
- (void *)get_wchan(p),
- delta >> 20);
- }
- account_scheduler_latency(p, delta >> 10, 0);
- }
- }
+ __update_stats_enqueue_sleeper(rq_of(cfs_rq), p, stats);
}
/*
* Task is being enqueued - update stats:
*/
static inline void
-update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
if (!schedstat_enabled())
return;
@@ -1059,14 +981,14 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* a dequeue/enqueue event is a NOP)
*/
if (se != cfs_rq->curr)
- update_stats_wait_start(cfs_rq, se);
+ update_stats_wait_start_fair(cfs_rq, se);
if (flags & ENQUEUE_WAKEUP)
- update_stats_enqueue_sleeper(cfs_rq, se);
+ update_stats_enqueue_sleeper_fair(cfs_rq, se);
}
static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
if (!schedstat_enabled())
@@ -1077,7 +999,7 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* waiting task:
*/
if (se != cfs_rq->curr)
- update_stats_wait_end(cfs_rq, se);
+ update_stats_wait_end_fair(cfs_rq, se);
if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@@ -4186,26 +4108,6 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
-static inline void check_schedstat_required(void)
-{
-#ifdef CONFIG_SCHEDSTATS
- if (schedstat_enabled())
- return;
-
- /* Force schedstat enabled if a dependent tracepoint is active */
- if (trace_sched_stat_wait_enabled() ||
- trace_sched_stat_sleep_enabled() ||
- trace_sched_stat_iowait_enabled() ||
- trace_sched_stat_blocked_enabled() ||
- trace_sched_stat_runtime_enabled()) {
- printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
- "stat_blocked and stat_runtime require the "
- "kernel parameter schedstats=enable or "
- "kernel.sched_schedstats=1\n");
- }
-#endif
-}
-
static inline bool cfs_bandwidth_used(void);
/*
@@ -4279,7 +4181,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
place_entity(cfs_rq, se, 0);
check_schedstat_required();
- update_stats_enqueue(cfs_rq, se, flags);
+ update_stats_enqueue_fair(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -4363,7 +4265,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_load_avg(cfs_rq, se, UPDATE_TG);
se_update_runnable(se);
- update_stats_dequeue(cfs_rq, se, flags);
+ update_stats_dequeue_fair(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
@@ -4448,7 +4350,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
- update_stats_wait_end(cfs_rq, se);
+ update_stats_wait_end_fair(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
@@ -4550,7 +4452,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
check_spread(cfs_rq, prev);
if (prev->on_rq) {
- update_stats_wait_start(cfs_rq, prev);
+ update_stats_wait_start_fair(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 844bd9dbfbf0..1a9614c69669 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -5,6 +5,110 @@
#include "sched.h"
#include "stats.h"
+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 wait_start, prev_wait_start;
+
+ wait_start = rq_clock(rq);
+ prev_wait_start = schedstat_val(stats->wait_start);
+
+ if (p && likely(wait_start > prev_wait_start))
+ wait_start -= prev_wait_start;
+
+ __schedstat_set(stats->wait_start, wait_start);
+}
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 delta;
+
+ delta = rq_clock(rq) - schedstat_val(stats->wait_start);
+
+ if (p) {
+ if (task_on_rq_migrating(p)) {
+ /*
+ * Preserve migrating task's wait time so wait_start
+ * time stamp can be adjusted to accumulate wait time
+ * prior to migration.
+ */
+ __schedstat_set(stats->wait_start, delta);
+
+ return;
+ }
+
+ trace_sched_stat_wait(p, delta);
+ }
+
+ __schedstat_set(stats->wait_max,
+ max(schedstat_val(stats->wait_max), delta));
+ __schedstat_inc(stats->wait_count);
+ __schedstat_add(stats->wait_sum, delta);
+ __schedstat_set(stats->wait_start, 0);
+}
+
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 sleep_start, block_start;
+
+ sleep_start = schedstat_val(stats->sleep_start);
+ block_start = schedstat_val(stats->block_start);
+
+ if (sleep_start) {
+ u64 delta = rq_clock(rq) - sleep_start;
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ if (unlikely(delta > schedstat_val(stats->sleep_max)))
+ __schedstat_set(stats->sleep_max, delta);
+
+ __schedstat_set(stats->sleep_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);
+
+ if (p) {
+ account_scheduler_latency(p, delta >> 10, 1);
+ trace_sched_stat_sleep(p, delta);
+ }
+ }
+ if (block_start) {
+ u64 delta = rq_clock(rq) - block_start;
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ if (unlikely(delta > schedstat_val(stats->block_max)))
+ __schedstat_set(stats->block_max, delta);
+
+ __schedstat_set(stats->block_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);
+
+ if (p) {
+ if (p->in_iowait) {
+ __schedstat_add(stats->iowait_sum, delta);
+ __schedstat_inc(stats->iowait_count);
+ trace_sched_stat_iowait(p, delta);
+ }
+
+ trace_sched_stat_blocked(p, delta);
+
+ /*
+ * Blocking time is in units of nanosecs, so shift by
+ * 20 to get a milliseconds-range estimation of the
+ * amount of time that the task spent sleeping:
+ */
+ if (unlikely(prof_on == SLEEP_PROFILING)) {
+ profile_hits(SLEEP_PROFILING,
+ (void *)get_wchan(p),
+ delta >> 20);
+ }
+ account_scheduler_latency(p, delta >> 10, 0);
+ }
+ }
+}
+
/*
* Current schedstat API version.
*
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 87242968712e..b8e3d4ee21e1 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -78,6 +78,33 @@ static inline int alloc_tg_schedstats(struct task_group *tg)
return 1;
}
+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+
+static inline void
+check_schedstat_required(void)
+{
+ if (schedstat_enabled())
+ return;
+
+ /* Force schedstat enabled if a dependent tracepoint is active */
+ if (trace_sched_stat_wait_enabled() ||
+ trace_sched_stat_sleep_enabled() ||
+ trace_sched_stat_iowait_enabled() ||
+ trace_sched_stat_blocked_enabled() ||
+ trace_sched_stat_runtime_enabled()) {
+ printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+ "stat_blocked and stat_runtime require the "
+ "kernel parameter schedstats=enable or "
+ "kernel.sched_schedstats=1\n");
+ }
+}
+
#else /* !CONFIG_SCHEDSTATS: */
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_dequeued(struct rq *rq, unsigned long long delta) { }
@@ -101,6 +128,11 @@ static inline int alloc_tg_schedstats(struct task_group *tg)
return 1;
}
+# define __update_stats_wait_start(rq, p, stats) do { } while (0)
+# define __update_stats_wait_end(rq, p, stats) do { } while (0)
+# define __update_stats_enqueue_sleeper(rq, p, stats) do { } while (0)
+# define check_schedstat_required() do { } while (0)
+
#endif /* CONFIG_SCHEDSTATS */
#ifdef CONFIG_PSI
--
2.18.4
On Tue, Dec 01, 2020 at 07:54:12PM +0800, Yafang Shao wrote:
> schedstat_enabled() has been already checked, so we can use
> __schedstat_set() directly.
>
> Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, Dec 01, 2020 at 07:54:14PM +0800, Yafang Shao wrote:
> The original prototype of the schedstats helpers are
>
> update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> The cfs_rq in these helpers is used to get the rq_clock, and the se is
> used to get the struct sched_statistics and the struct task_struct. In
> order to make these helpers available by all sched classes, we can pass
> the rq, sched_statistics and task_struct directly.
>
> Then the new helpers are
>
> update_stats_wait_*(struct rq *rq, struct task_struct *p,
> struct sched_statistics *stats)
>
> which are independent of fair sched class.
>
> To avoid vmlinux growing too large or introducing overhead when
> !schedstat_enabled(), some new helpers after schedstat_enabled() are also
> introduced, Suggested by Mel. These helpers are in sched/stats.c,
>
> __update_stats_wait_*(struct rq *rq, struct task_struct *p,
> struct sched_statistics *stats)
>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Yafang Shao <[email protected]>
Think it's ok, it's mostly code shuffling. I'd have been happier if
there was evidence showing a before/after comparison of the sched stats
for something simple like "perf bench sched pipe" and a clear statement
of no functional change as well as a comparison of the vmlinux files but
I think it's ok so;
Acked-by: Mel Gorman <[email protected]>
I didn't look at the rt.c parts
--
Mel Gorman
SUSE Labs
On Tue, Dec 1, 2020 at 8:46 PM Mel Gorman <[email protected]> wrote:
>
> On Tue, Dec 01, 2020 at 07:54:14PM +0800, Yafang Shao wrote:
> > The original prototype of the schedstats helpers are
> >
> > update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >
> > The cfs_rq in these helpers is used to get the rq_clock, and the se is
> > used to get the struct sched_statistics and the struct task_struct. In
> > order to make these helpers available by all sched classes, we can pass
> > the rq, sched_statistics and task_struct directly.
> >
> > Then the new helpers are
> >
> > update_stats_wait_*(struct rq *rq, struct task_struct *p,
> > struct sched_statistics *stats)
> >
> > which are independent of fair sched class.
> >
> > To avoid vmlinux growing too large or introducing overhead when
> > !schedstat_enabled(), some new helpers after schedstat_enabled() are also
> > introduced, Suggested by Mel. These helpers are in sched/stats.c,
> >
> > __update_stats_wait_*(struct rq *rq, struct task_struct *p,
> > struct sched_statistics *stats)
> >
> > Cc: Mel Gorman <[email protected]>
> > Signed-off-by: Yafang Shao <[email protected]>
>
> Think it's ok, it's mostly code shuffling. I'd have been happier if
> there was evidence showing a before/after comparison of the sched stats
> for something simple like "perf bench sched pipe" and a clear statement
> of no functional change as well as a comparison of the vmlinux files but
> I think it's ok so;
>
Thanks for the review, I will do the comparison as you suggested.
> Acked-by: Mel Gorman <[email protected]>
>
Thanks!
> I didn't look at the rt.c parts
>
> --
> Mel Gorman
> SUSE Labs
--
Thanks
Yafang