2021-03-27 10:14:43

by Yafang Shao

Subject: [PATCH v2 0/6] sched: support schedstats for RT sched class

We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but currently schedstats is only
supported for the fair sched class. In order to support it for other sched
classes, we should make it independent of the fair sched class. The struct
sched_statistics is the scheduler statistics of a task_struct or a
task_group, both of which are independent of the sched class. So we can
move struct sched_statistics into struct task_struct and struct task_group
to achieve the goal.

After the patchset, schedstats are organized as follows,
struct task_struct {
    ...
    struct sched_statistics statistics;
    ...
    struct sched_entity *se;
    struct sched_rt_entity *rt;
    ...
};

struct task_group {                      |---> stats[0] : of CPU0
    ...                                  |
    struct sched_statistics **stats; ----|---> stats[1] : of CPU1
    ...                                  |
                                         |---> stats[n] : of CPUn
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity **se;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
#endif
    ...
};

The sched_statistics members may be frequently modified when schedstats is
enabled. In order to avoid impacting random data which may be in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.
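
Concretely, the alignment is just an annotation on the struct definition; a
minimal sketch of what PATCH #2 does:

    struct sched_statistics {
            u64     wait_start;
            u64     wait_max;
            u64     wait_count;
            u64     wait_sum;
            /* ... the remaining schedstat counters ... */
    } ____cacheline_aligned;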

Then we can use schedstats to trace RT tasks as well, for example,
    Interface            File
    task schedstats  :   /proc/[pid]/sched
    group schedstats :   /proc/sched_debug
    tracepoints      :   sched:sched_stat_{runtime, wait, sleep, iowait, blocked}

As PATCH #2 and #3 change core structs in the scheduler, I ran
'perf bench sched pipe' to measure the sched performance before and after
the change, as suggested by Mel. Below is the data, which are all in
usecs/op.
                                  Before      After
    kernel.sched_schedstats=0     6.0~6.1     6.0~6.1
    kernel.sched_schedstats=1     6.2~6.4     6.2~6.4
There is no obvious difference after the change.

Changes since v1:
- Fix the build failure reported by kernel test robot.
- Add the performance data with 'perf bench sched pipe', suggested by
Mel.
- Make the struct sched_statistics cacheline aligned.
- Introduce task block time in schedstats

Changes since RFC:
- improvement of schedstats helpers, per Mel.
- make struct schedstats independent of fair sched class

Yafang Shao (6):
sched, fair: use __schedstat_set() in set_next_entity()
sched: make struct sched_statistics independent of fair sched class
sched: make schedstats helpers independent of fair sched class
sched: introduce task block time in schedstats
sched, rt: support sched_stat_runtime tracepoint for RT sched class
sched, rt: support schedstats for RT sched class

include/linux/sched.h | 7 +-
kernel/sched/core.c | 24 +++--
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 90 +++++++++--------
kernel/sched/fair.c | 210 ++++++++++++++++-----------------------
kernel/sched/rt.c | 143 +++++++++++++++++++++++++-
kernel/sched/sched.h | 3 +
kernel/sched/stats.c | 104 +++++++++++++++++++
kernel/sched/stats.h | 89 +++++++++++++++++
kernel/sched/stop_task.c | 4 +-
10 files changed, 489 insertions(+), 189 deletions(-)

--
2.18.2


2021-03-27 10:14:50

by Yafang Shao

Subject: [PATCH v2 1/6] sched, fair: use __schedstat_set() in set_next_entity()

schedstat_enabled() has already been checked, so we can use
__schedstat_set() directly.
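
For reference, the two macros differ only in whether they re-check the
static key. Roughly (a sketch of the CONFIG_SCHEDSTATS definitions in
kernel/sched/stats.h, not part of this patch):

    #define   schedstat_set(var, val)                               \
            do { if (schedstat_enabled()) { var = (val); } } while (0)
    #define __schedstat_set(var, val)                               \
            do { var = (val); } while (0)

Since set_next_entity() performs the schedstat_enabled() check itself,
using the plain __schedstat_set() avoids a redundant test.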

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aaa0dfa29d53..114eec730698 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4429,7 +4429,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
- schedstat_set(se->statistics.slice_max,
+ __schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
--
2.18.2

2021-03-27 10:15:12

by Yafang Shao

Subject: [PATCH v2 3/6] sched: make schedstats helpers independent of fair sched class

The original prototypes of the schedstats helpers are

update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)

The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get the struct sched_statistics and the struct task_struct. In
order to make these helpers available to all sched classes, we can pass
the rq, sched_statistics and task_struct directly.

Then the new helpers are

update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)

which are independent of fair sched class.

To avoid vmlinux growing too large or introducing overhead when
!schedstat_enabled(), some new helpers, which are called only after
schedstat_enabled() has been checked, are also introduced, as suggested by
Mel. These helpers are in kernel/sched/stats.c,

__update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
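
With that, a sched class only needs the cheap schedstat_enabled() check
before handing its task and statistics to the out-of-line helper. A
minimal sketch for a per-task entity (the group-entity case goes through
tg->stats[cpu] instead; see PATCH #6 for the real RT wrappers):

    if (schedstat_enabled())
            __update_stats_wait_start(rq, p, &p->stats);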

The size of vmlinux is as follows,
                          Before          After
    Size of vmlinux       826308552       826304640
The size is a little smaller as some functions are no longer inlined after
the change.

I also compared the sched performance with 'perf bench sched pipe',
as suggested by Mel. The result is as follows,
                                  Before      After
    kernel.sched_schedstats=0     6.0~6.1     6.0~6.1
    kernel.sched_schedstats=1     6.2~6.4     6.2~6.4
There is no obvious difference after the change.

No functional change.

[[email protected]: reported build failure in prev version]

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: kernel test robot <[email protected]>
---
kernel/sched/fair.c | 133 +++++++------------------------------------
kernel/sched/stats.c | 103 +++++++++++++++++++++++++++++++++
kernel/sched/stats.h | 34 +++++++++++
3 files changed, 156 insertions(+), 114 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f72fef1cc0a..77a615b6310d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -904,32 +904,28 @@ static void update_curr_fair(struct rq *rq)
}

static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
- u64 wait_start, prev_wait_start;
+ struct task_struct *p = NULL;

if (!schedstat_enabled())
return;

__schedstats_from_sched_entity(se, &stats);

- wait_start = rq_clock(rq_of(cfs_rq));
- prev_wait_start = schedstat_val(stats->wait_start);
+ if (entity_is_task(se))
+ p = task_of(se);

- if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
- likely(wait_start > prev_wait_start))
- wait_start -= prev_wait_start;
+ __update_stats_wait_start(rq_of(cfs_rq), p, stats);

- __schedstat_set(stats->wait_start, wait_start);
}

static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_end_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
struct task_struct *p = NULL;
- u64 delta;

if (!schedstat_enabled())
return;
@@ -945,105 +941,34 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (unlikely(!schedstat_val(stats->wait_start)))
return;

- delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
-
- if (entity_is_task(se)) {
+ if (entity_is_task(se))
p = task_of(se);
- if (task_on_rq_migrating(p)) {
- /*
- * Preserve migrating task's wait time so wait_start
- * time stamp can be adjusted to accumulate wait time
- * prior to migration.
- */
- __schedstat_set(stats->wait_start, delta);
- return;
- }
- trace_sched_stat_wait(p, delta);
- }

- __schedstat_set(stats->wait_max,
- max(schedstat_val(stats->wait_max), delta));
- __schedstat_inc(stats->wait_count);
- __schedstat_add(stats->wait_sum, delta);
- __schedstat_set(stats->wait_start, 0);
+ __update_stats_wait_end(rq_of(cfs_rq), p, stats);
}

static inline void
-update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_enqueue_sleeper_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats = NULL;
struct task_struct *tsk = NULL;
- u64 sleep_start, block_start;

if (!schedstat_enabled())
return;

__schedstats_from_sched_entity(se, &stats);

- sleep_start = schedstat_val(stats->sleep_start);
- block_start = schedstat_val(stats->block_start);
-
if (entity_is_task(se))
tsk = task_of(se);

- if (sleep_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
-
- if ((s64)delta < 0)
- delta = 0;
-
- if (unlikely(delta > schedstat_val(stats->sleep_max)))
- __schedstat_set(stats->sleep_max, delta);
-
- __schedstat_set(stats->sleep_start, 0);
- __schedstat_add(stats->sum_sleep_runtime, delta);
-
- if (tsk) {
- account_scheduler_latency(tsk, delta >> 10, 1);
- trace_sched_stat_sleep(tsk, delta);
- }
- }
- if (block_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - block_start;
-
- if ((s64)delta < 0)
- delta = 0;
-
- if (unlikely(delta > schedstat_val(stats->block_max)))
- __schedstat_set(stats->block_max, delta);
-
- __schedstat_set(stats->block_start, 0);
- __schedstat_add(stats->sum_sleep_runtime, delta);
-
- if (tsk) {
- if (tsk->in_iowait) {
- __schedstat_add(stats->iowait_sum, delta);
- __schedstat_inc(stats->iowait_count);
- trace_sched_stat_iowait(tsk, delta);
- }
-
- trace_sched_stat_blocked(tsk, delta);
-
- /*
- * Blocking time is in units of nanosecs, so shift by
- * 20 to get a milliseconds-range estimation of the
- * amount of time that the task spent sleeping:
- */
- if (unlikely(prof_on == SLEEP_PROFILING)) {
- profile_hits(SLEEP_PROFILING,
- (void *)get_wchan(tsk),
- delta >> 20);
- }
- account_scheduler_latency(tsk, delta >> 10, 0);
- }
- }
+ __update_stats_enqueue_sleeper(rq_of(cfs_rq), tsk, stats);
}

/*
* Task is being enqueued - update stats:
*/
static inline void
-update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
if (!schedstat_enabled())
return;
@@ -1053,14 +978,14 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* a dequeue/enqueue event is a NOP)
*/
if (se != cfs_rq->curr)
- update_stats_wait_start(cfs_rq, se);
+ update_stats_wait_start_fair(cfs_rq, se);

if (flags & ENQUEUE_WAKEUP)
- update_stats_enqueue_sleeper(cfs_rq, se);
+ update_stats_enqueue_sleeper_fair(cfs_rq, se);
}

static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{

if (!schedstat_enabled())
@@ -1071,7 +996,7 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* waiting task:
*/
if (se != cfs_rq->curr)
- update_stats_wait_end(cfs_rq, se);
+ update_stats_wait_end_fair(cfs_rq, se);

if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@@ -4202,26 +4127,6 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)

static void check_enqueue_throttle(struct cfs_rq *cfs_rq);

-static inline void check_schedstat_required(void)
-{
-#ifdef CONFIG_SCHEDSTATS
- if (schedstat_enabled())
- return;
-
- /* Force schedstat enabled if a dependent tracepoint is active */
- if (trace_sched_stat_wait_enabled() ||
- trace_sched_stat_sleep_enabled() ||
- trace_sched_stat_iowait_enabled() ||
- trace_sched_stat_blocked_enabled() ||
- trace_sched_stat_runtime_enabled()) {
- printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
- "stat_blocked and stat_runtime require the "
- "kernel parameter schedstats=enable or "
- "kernel.sched_schedstats=1\n");
- }
-#endif
-}
-
static inline bool cfs_bandwidth_used(void);

/*
@@ -4295,7 +4200,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
place_entity(cfs_rq, se, 0);

check_schedstat_required();
- update_stats_enqueue(cfs_rq, se, flags);
+ update_stats_enqueue_fair(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -4379,7 +4284,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_load_avg(cfs_rq, se, UPDATE_TG);
se_update_runnable(se);

- update_stats_dequeue(cfs_rq, se, flags);
+ update_stats_dequeue_fair(cfs_rq, se, flags);

clear_buddies(cfs_rq, se);

@@ -4464,7 +4369,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
- update_stats_wait_end(cfs_rq, se);
+ update_stats_wait_end_fair(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
@@ -4566,7 +4471,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
check_spread(cfs_rq, prev);

if (prev->on_rq) {
- update_stats_wait_start(cfs_rq, prev);
+ update_stats_wait_start_fair(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 3f93fc3b5648..b2542f4d3192 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -4,6 +4,109 @@
*/
#include "sched.h"

+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 wait_start, prev_wait_start;
+
+ wait_start = rq_clock(rq);
+ prev_wait_start = schedstat_val(stats->wait_start);
+
+ if (p && likely(wait_start > prev_wait_start))
+ wait_start -= prev_wait_start;
+
+ __schedstat_set(stats->wait_start, wait_start);
+}
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);
+
+ if (p) {
+ if (task_on_rq_migrating(p)) {
+ /*
+ * Preserve migrating task's wait time so wait_start
+ * time stamp can be adjusted to accumulate wait time
+ * prior to migration.
+ */
+ __schedstat_set(stats->wait_start, delta);
+
+ return;
+ }
+
+ trace_sched_stat_wait(p, delta);
+ }
+
+ __schedstat_set(stats->wait_max,
+ max(schedstat_val(stats->wait_max), delta));
+ __schedstat_inc(stats->wait_count);
+ __schedstat_add(stats->wait_sum, delta);
+ __schedstat_set(stats->wait_start, 0);
+}
+
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats)
+{
+ u64 sleep_start, block_start;
+
+ sleep_start = schedstat_val(stats->sleep_start);
+ block_start = schedstat_val(stats->block_start);
+
+ if (sleep_start) {
+ u64 delta = rq_clock(rq) - sleep_start;
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ if (unlikely(delta > schedstat_val(stats->sleep_max)))
+ __schedstat_set(stats->sleep_max, delta);
+
+ __schedstat_set(stats->sleep_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);
+
+ if (p) {
+ account_scheduler_latency(p, delta >> 10, 1);
+ trace_sched_stat_sleep(p, delta);
+ }
+ }
+
+ if (block_start) {
+ u64 delta = rq_clock(rq) - block_start;
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ if (unlikely(delta > schedstat_val(stats->block_max)))
+ __schedstat_set(stats->block_max, delta);
+
+ __schedstat_set(stats->block_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);
+
+ if (p) {
+ if (p->in_iowait) {
+ __schedstat_add(stats->iowait_sum, delta);
+ __schedstat_inc(stats->iowait_count);
+ trace_sched_stat_iowait(p, delta);
+ }
+
+ trace_sched_stat_blocked(p, delta);
+
+ /*
+ * Blocking time is in units of nanosecs, so shift by
+ * 20 to get a milliseconds-range estimation of the
+ * amount of time that the task spent sleeping:
+ */
+ if (unlikely(prof_on == SLEEP_PROFILING)) {
+ profile_hits(SLEEP_PROFILING,
+ (void *)get_wchan(p),
+ delta >> 20);
+ }
+ account_scheduler_latency(p, delta >> 10, 0);
+ }
+ }
+}
+
/*
* Current schedstat API version.
*
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 6c810388a897..3671857c70c6 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -2,6 +2,8 @@

#ifdef CONFIG_SCHEDSTATS

+extern struct static_key_false sched_schedstats;
+
/*
* Expects runqueue lock to be held for atomicity of update
*/
@@ -40,6 +42,33 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)

+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+ struct sched_statistics *stats);
+
+static inline void
+check_schedstat_required(void)
+{
+ if (schedstat_enabled())
+ return;
+
+ /* Force schedstat enabled if a dependent tracepoint is active */
+ if (trace_sched_stat_wait_enabled() ||
+ trace_sched_stat_sleep_enabled() ||
+ trace_sched_stat_iowait_enabled() ||
+ trace_sched_stat_blocked_enabled() ||
+ trace_sched_stat_runtime_enabled()) {
+ printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+ "stat_blocked and stat_runtime require the "
+ "kernel parameter schedstats=enable or "
+ "kernel.sched_schedstats=1\n");
+ }
+}
+
#else /* !CONFIG_SCHEDSTATS: */

static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
@@ -55,6 +84,11 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0

+# define __update_stats_wait_start(rq, p, stats) do { } while (0)
+# define __update_stats_wait_end(rq, p, stats) do { } while (0)
+# define __update_stats_enqueue_sleeper(rq, p, stats) do { } while (0)
+# define check_schedstat_required() do { } while (0)
+
#endif /* CONFIG_SCHEDSTATS */

#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
--
2.18.2

2021-03-27 10:15:12

by Yafang Shao

Subject: [PATCH v2 2/6] sched: make struct sched_statistics independent of fair sched class

If we want to use the schedstats facility to trace other sched classes, we
should make it independent of the fair sched class. The struct
sched_statistics is the scheduler statistics of a task_struct or a
task_group. So we can move it into struct task_struct and struct
task_group to achieve the goal.

After the patch, schedstats are organized as follows,

struct task_struct {
    ...
    struct sched_statistics statistics;
    ...
    struct sched_entity *se;
    struct sched_rt_entity *rt;
    ...
};

struct task_group {                      |---> stats[0] : of CPU0
    ...                                  |
    struct sched_statistics **stats; ----|---> stats[1] : of CPU1
    ...                                  |
                                         |---> stats[n] : of CPUn
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity **se;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
#endif
    ...
};

The sched_statistics members may be frequently modified when schedstats is
enabled. In order to avoid impacting random data which may be in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.

As this patch changes a core struct of the scheduler, I verified the
performance impact it may have on the scheduler with 'perf bench sched
pipe', as suggested by Mel. Below is the result, in which all the values
are in usecs/op.

                                  Before      After
    kernel.sched_schedstats=0     6.0~6.1     6.0~6.1
    kernel.sched_schedstats=1     6.2~6.4     6.2~6.4

There is no obvious impact on the sched performance.

No functional change.

[[email protected]: reported build failure in prev version]

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: kernel test robot <[email protected]>
---
include/linux/sched.h | 5 +-
kernel/sched/core.c | 24 ++++----
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 86 ++++++++++++++--------------
kernel/sched/fair.c | 121 ++++++++++++++++++++++++++++-----------
kernel/sched/rt.c | 4 +-
kernel/sched/sched.h | 3 +
kernel/sched/stats.h | 55 ++++++++++++++++++
kernel/sched/stop_task.c | 4 +-
9 files changed, 210 insertions(+), 96 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05572e2140ad..b687bb38897b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -447,7 +447,7 @@ struct sched_statistics {
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
#endif
-};
+} ____cacheline_aligned;

struct sched_entity {
/* For load-balancing: */
@@ -463,8 +463,6 @@ struct sched_entity {

u64 nr_migrations;

- struct sched_statistics statistics;
-
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent;
@@ -697,6 +695,7 @@ struct task_struct {
unsigned int rt_priority;

const struct sched_class *sched_class;
+ struct sched_statistics stats;
struct sched_entity se;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3384ea74cad4..d55681b4f9a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2913,11 +2913,11 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
#ifdef CONFIG_SMP
if (cpu == rq->cpu) {
__schedstat_inc(rq->ttwu_local);
- __schedstat_inc(p->se.statistics.nr_wakeups_local);
+ __schedstat_inc(p->stats.nr_wakeups_local);
} else {
struct sched_domain *sd;

- __schedstat_inc(p->se.statistics.nr_wakeups_remote);
+ __schedstat_inc(p->stats.nr_wakeups_remote);
rcu_read_lock();
for_each_domain(rq->cpu, sd) {
if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -2929,14 +2929,14 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
}

if (wake_flags & WF_MIGRATED)
- __schedstat_inc(p->se.statistics.nr_wakeups_migrate);
+ __schedstat_inc(p->stats.nr_wakeups_migrate);
#endif /* CONFIG_SMP */

__schedstat_inc(rq->ttwu_count);
- __schedstat_inc(p->se.statistics.nr_wakeups);
+ __schedstat_inc(p->stats.nr_wakeups);

if (wake_flags & WF_SYNC)
- __schedstat_inc(p->se.statistics.nr_wakeups_sync);
+ __schedstat_inc(p->stats.nr_wakeups_sync);
}

/*
@@ -3572,7 +3572,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
- memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+ memset(&p->stats, 0, sizeof(p->stats));
#endif

RB_CLEAR_NODE(&p->dl.rb_node);
@@ -8415,9 +8415,9 @@ void normalize_rt_tasks(void)
continue;

p->se.exec_start = 0;
- schedstat_set(p->se.statistics.wait_start, 0);
- schedstat_set(p->se.statistics.sleep_start, 0);
- schedstat_set(p->se.statistics.block_start, 0);
+ schedstat_set(p->stats.wait_start, 0);
+ schedstat_set(p->stats.sleep_start, 0);
+ schedstat_set(p->stats.block_start, 0);

if (!dl_task(p) && !rt_task(p)) {
/*
@@ -8507,6 +8507,7 @@ static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
free_rt_sched_group(tg);
+ free_tg_schedstats(tg);
autogroup_free(tg);
kmem_cache_free(task_group_cache, tg);
}
@@ -8526,6 +8527,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_tg_schedstats(tg))
+ goto err;
+
alloc_uclamp_sched_group(tg, parent);

return tg;
@@ -9212,7 +9216,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
int i;

for_each_possible_cpu(i)
- ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+ ws += schedstat_val(tg->stats[i]->wait_sum);

seq_printf(sf, "wait_sum %llu\n", ws);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9a2989749b8d..2d67b3ec880f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1265,8 +1265,8 @@ static void update_curr_dl(struct rq *rq)
return;
}

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4b49cc2af5c4..d1bc616936d9 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -428,12 +428,13 @@ void unregister_sched_domain_sysctl(void)
#ifdef CONFIG_FAIR_GROUP_SCHED
static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
+ struct sched_statistics statistics = *(tg->stats[cpu]);
struct sched_entity *se = tg->se[cpu];

#define P(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)F)
-#define P_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)schedstat_val(F))
+#define P_SCHEDSTAT(F) SEQ_printf(m, " .se->%-30s: %lld\n", #F, (long long)schedstat_val(F))
#define PN(F) SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
-#define PN_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))
+#define PN_SCHEDSTAT(F) SEQ_printf(m, " .se->%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))

if (!se)
return;
@@ -443,16 +444,17 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
PN(se->sum_exec_runtime);

if (schedstat_enabled()) {
- PN_SCHEDSTAT(se->statistics.wait_start);
- PN_SCHEDSTAT(se->statistics.sleep_start);
- PN_SCHEDSTAT(se->statistics.block_start);
- PN_SCHEDSTAT(se->statistics.sleep_max);
- PN_SCHEDSTAT(se->statistics.block_max);
- PN_SCHEDSTAT(se->statistics.exec_max);
- PN_SCHEDSTAT(se->statistics.slice_max);
- PN_SCHEDSTAT(se->statistics.wait_max);
- PN_SCHEDSTAT(se->statistics.wait_sum);
- P_SCHEDSTAT(se->statistics.wait_count);
+ /* Make the output backward compatible */
+ PN_SCHEDSTAT(statistics.wait_start);
+ PN_SCHEDSTAT(statistics.sleep_start);
+ PN_SCHEDSTAT(statistics.block_start);
+ PN_SCHEDSTAT(statistics.sleep_max);
+ PN_SCHEDSTAT(statistics.block_max);
+ PN_SCHEDSTAT(statistics.exec_max);
+ PN_SCHEDSTAT(statistics.slice_max);
+ PN_SCHEDSTAT(statistics.wait_max);
+ PN_SCHEDSTAT(statistics.wait_sum);
+ P_SCHEDSTAT(statistics.wait_count);
}

P(se->load.weight);
@@ -498,9 +500,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
p->prio);

SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
- SPLIT_NS(schedstat_val_or_zero(p->se.statistics.wait_sum)),
+ SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
SPLIT_NS(p->se.sum_exec_runtime),
- SPLIT_NS(schedstat_val_or_zero(p->se.statistics.sum_sleep_runtime)));
+ SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)));

#ifdef CONFIG_NUMA_BALANCING
SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@@ -938,33 +940,33 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
if (schedstat_enabled()) {
u64 avg_atom, avg_per_cpu;

- PN_SCHEDSTAT(se.statistics.sum_sleep_runtime);
- PN_SCHEDSTAT(se.statistics.wait_start);
- PN_SCHEDSTAT(se.statistics.sleep_start);
- PN_SCHEDSTAT(se.statistics.block_start);
- PN_SCHEDSTAT(se.statistics.sleep_max);
- PN_SCHEDSTAT(se.statistics.block_max);
- PN_SCHEDSTAT(se.statistics.exec_max);
- PN_SCHEDSTAT(se.statistics.slice_max);
- PN_SCHEDSTAT(se.statistics.wait_max);
- PN_SCHEDSTAT(se.statistics.wait_sum);
- P_SCHEDSTAT(se.statistics.wait_count);
- PN_SCHEDSTAT(se.statistics.iowait_sum);
- P_SCHEDSTAT(se.statistics.iowait_count);
- P_SCHEDSTAT(se.statistics.nr_migrations_cold);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_running);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot);
- P_SCHEDSTAT(se.statistics.nr_forced_migrations);
- P_SCHEDSTAT(se.statistics.nr_wakeups);
- P_SCHEDSTAT(se.statistics.nr_wakeups_sync);
- P_SCHEDSTAT(se.statistics.nr_wakeups_migrate);
- P_SCHEDSTAT(se.statistics.nr_wakeups_local);
- P_SCHEDSTAT(se.statistics.nr_wakeups_remote);
- P_SCHEDSTAT(se.statistics.nr_wakeups_affine);
- P_SCHEDSTAT(se.statistics.nr_wakeups_affine_attempts);
- P_SCHEDSTAT(se.statistics.nr_wakeups_passive);
- P_SCHEDSTAT(se.statistics.nr_wakeups_idle);
+ PN_SCHEDSTAT(stats.sum_sleep_runtime);
+ PN_SCHEDSTAT(stats.wait_start);
+ PN_SCHEDSTAT(stats.sleep_start);
+ PN_SCHEDSTAT(stats.block_start);
+ PN_SCHEDSTAT(stats.sleep_max);
+ PN_SCHEDSTAT(stats.block_max);
+ PN_SCHEDSTAT(stats.exec_max);
+ PN_SCHEDSTAT(stats.slice_max);
+ PN_SCHEDSTAT(stats.wait_max);
+ PN_SCHEDSTAT(stats.wait_sum);
+ P_SCHEDSTAT(stats.wait_count);
+ PN_SCHEDSTAT(stats.iowait_sum);
+ P_SCHEDSTAT(stats.iowait_count);
+ P_SCHEDSTAT(stats.nr_migrations_cold);
+ P_SCHEDSTAT(stats.nr_failed_migrations_affine);
+ P_SCHEDSTAT(stats.nr_failed_migrations_running);
+ P_SCHEDSTAT(stats.nr_failed_migrations_hot);
+ P_SCHEDSTAT(stats.nr_forced_migrations);
+ P_SCHEDSTAT(stats.nr_wakeups);
+ P_SCHEDSTAT(stats.nr_wakeups_sync);
+ P_SCHEDSTAT(stats.nr_wakeups_migrate);
+ P_SCHEDSTAT(stats.nr_wakeups_local);
+ P_SCHEDSTAT(stats.nr_wakeups_remote);
+ P_SCHEDSTAT(stats.nr_wakeups_affine);
+ P_SCHEDSTAT(stats.nr_wakeups_affine_attempts);
+ P_SCHEDSTAT(stats.nr_wakeups_passive);
+ P_SCHEDSTAT(stats.nr_wakeups_idle);

avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
@@ -1030,6 +1032,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
void proc_sched_set_task(struct task_struct *p)
{
#ifdef CONFIG_SCHEDSTATS
- memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+ memset(&p->stats, 0, sizeof(p->stats));
#endif
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 114eec730698..5f72fef1cc0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -821,6 +821,41 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline void
+__schedstats_from_sched_entity(struct sched_entity *se,
+ struct sched_statistics **stats)
+{
+ struct task_group *tg;
+ struct task_struct *p;
+ struct cfs_rq *cfs;
+ int cpu;
+
+ if (entity_is_task(se)) {
+ p = task_of(se);
+ *stats = &p->stats;
+ } else {
+ cfs = group_cfs_rq(se);
+ tg = cfs->tg;
+ cpu = cpu_of(rq_of(cfs));
+ *stats = tg->stats[cpu];
+ }
+}
+
+#else
+
+static inline void
+__schedstats_from_sched_entity(struct sched_entity *se,
+ struct sched_statistics **stats)
+{
+ struct task_struct *p;
+
+ p = task_of(se);
+ *stats = &p->stats;
+}
+
+#endif
+
/*
* Update the current task's runtime statistics.
*/
@@ -828,6 +863,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
+ struct sched_statistics *stats = NULL;
u64 delta_exec;

if (unlikely(!curr))
@@ -839,8 +875,11 @@ static void update_curr(struct cfs_rq *cfs_rq)

curr->exec_start = now;

- schedstat_set(curr->statistics.exec_max,
- max(delta_exec, curr->statistics.exec_max));
+ if (schedstat_enabled()) {
+ __schedstats_from_sched_entity(curr, &stats);
+ __schedstat_set(stats->exec_max,
+ max(delta_exec, stats->exec_max));
+ }

curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq->exec_clock, delta_exec);
@@ -867,40 +906,46 @@ static void update_curr_fair(struct rq *rq)
static inline void
update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct sched_statistics *stats = NULL;
u64 wait_start, prev_wait_start;

if (!schedstat_enabled())
return;

+ __schedstats_from_sched_entity(se, &stats);
+
wait_start = rq_clock(rq_of(cfs_rq));
- prev_wait_start = schedstat_val(se->statistics.wait_start);
+ prev_wait_start = schedstat_val(stats->wait_start);

if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
likely(wait_start > prev_wait_start))
wait_start -= prev_wait_start;

- __schedstat_set(se->statistics.wait_start, wait_start);
+ __schedstat_set(stats->wait_start, wait_start);
}

static inline void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- struct task_struct *p;
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
u64 delta;

if (!schedstat_enabled())
return;

+ __schedstats_from_sched_entity(se, &stats);
+
/*
* When the sched_schedstat changes from 0 to 1, some sched se
* maybe already in the runqueue, the se->statistics.wait_start
* will be 0.So it will let the delta wrong. We need to avoid this
* scenario.
*/
- if (unlikely(!schedstat_val(se->statistics.wait_start)))
+ if (unlikely(!schedstat_val(stats->wait_start)))
return;

- delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
+ delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);

if (entity_is_task(se)) {
p = task_of(se);
@@ -910,30 +955,33 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
* time stamp can be adjusted to accumulate wait time
* prior to migration.
*/
- __schedstat_set(se->statistics.wait_start, delta);
+ __schedstat_set(stats->wait_start, delta);
return;
}
trace_sched_stat_wait(p, delta);
}

- __schedstat_set(se->statistics.wait_max,
- max(schedstat_val(se->statistics.wait_max), delta));
- __schedstat_inc(se->statistics.wait_count);
- __schedstat_add(se->statistics.wait_sum, delta);
- __schedstat_set(se->statistics.wait_start, 0);
+ __schedstat_set(stats->wait_max,
+ max(schedstat_val(stats->wait_max), delta));
+ __schedstat_inc(stats->wait_count);
+ __schedstat_add(stats->wait_sum, delta);
+ __schedstat_set(stats->wait_start, 0);
}

static inline void
update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct sched_statistics *stats = NULL;
struct task_struct *tsk = NULL;
u64 sleep_start, block_start;

if (!schedstat_enabled())
return;

- sleep_start = schedstat_val(se->statistics.sleep_start);
- block_start = schedstat_val(se->statistics.block_start);
+ __schedstats_from_sched_entity(se, &stats);
+
+ sleep_start = schedstat_val(stats->sleep_start);
+ block_start = schedstat_val(stats->block_start);

if (entity_is_task(se))
tsk = task_of(se);
@@ -944,11 +992,11 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
- __schedstat_set(se->statistics.sleep_max, delta);
+ if (unlikely(delta > schedstat_val(stats->sleep_max)))
+ __schedstat_set(stats->sleep_max, delta);

- __schedstat_set(se->statistics.sleep_start, 0);
- __schedstat_add(se->statistics.sum_sleep_runtime, delta);
+ __schedstat_set(stats->sleep_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);

if (tsk) {
account_scheduler_latency(tsk, delta >> 10, 1);
@@ -961,16 +1009,16 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > schedstat_val(se->statistics.block_max)))
- __schedstat_set(se->statistics.block_max, delta);
+ if (unlikely(delta > schedstat_val(stats->block_max)))
+ __schedstat_set(stats->block_max, delta);

- __schedstat_set(se->statistics.block_start, 0);
- __schedstat_add(se->statistics.sum_sleep_runtime, delta);
+ __schedstat_set(stats->block_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);

if (tsk) {
if (tsk->in_iowait) {
- __schedstat_add(se->statistics.iowait_sum, delta);
- __schedstat_inc(se->statistics.iowait_count);
+ __schedstat_add(stats->iowait_sum, delta);
+ __schedstat_inc(stats->iowait_count);
trace_sched_stat_iowait(tsk, delta);
}

@@ -1029,10 +1077,10 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
struct task_struct *tsk = task_of(se);

if (tsk->state & TASK_INTERRUPTIBLE)
- __schedstat_set(se->statistics.sleep_start,
+ __schedstat_set(tsk->stats.sleep_start,
rq_clock(rq_of(cfs_rq)));
if (tsk->state & TASK_UNINTERRUPTIBLE)
- __schedstat_set(se->statistics.block_start,
+ __schedstat_set(tsk->stats.block_start,
rq_clock(rq_of(cfs_rq)));
}
}
@@ -4407,6 +4455,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct sched_statistics *stats = NULL;
+
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
@@ -4429,8 +4479,9 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
- __schedstat_set(se->statistics.slice_max,
- max((u64)schedstat_val(se->statistics.slice_max),
+ __schedstats_from_sched_entity(se, &stats);
+ __schedstat_set(stats->slice_max,
+ max((u64)schedstat_val(stats->slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}

@@ -5892,12 +5943,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

- schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
+ schedstat_inc(p->stats.nr_wakeups_affine_attempts);
if (target == nr_cpumask_bits)
return prev_cpu;

schedstat_inc(sd->ttwu_move_affine);
- schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ schedstat_inc(p->stats.nr_wakeups_affine);
return target;
}

@@ -7570,7 +7621,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;

- schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+ schedstat_inc(p->stats.nr_failed_migrations_affine);

env->flags |= LBF_SOME_PINNED;

@@ -7601,7 +7652,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->flags &= ~LBF_ALL_PINNED;

if (task_running(env->src_rq, p)) {
- schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+ schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}

@@ -7619,12 +7670,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (tsk_cache_hot == 1) {
schedstat_inc(env->sd->lb_hot_gained[env->idle]);
- schedstat_inc(p->se.statistics.nr_forced_migrations);
+ schedstat_inc(p->stats.nr_forced_migrations);
}
return 1;
}

- schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
+ schedstat_inc(p->stats.nr_failed_migrations_hot);
return 0;
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c286e5ba3c94..34ad07fb924e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1009,8 +1009,8 @@ static void update_curr_rt(struct rq *rq)
if (unlikely((s64)delta_exec <= 0))
return;

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cbb0b011e9e0..a0ca95d64b4c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,6 +385,9 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+ /* schedstats of this group on each CPU */
+ struct sched_statistics **stats;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index dc218e9f4558..6c810388a897 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -41,6 +41,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)

#else /* !CONFIG_SCHEDSTATS: */
+
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_dequeued(struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delta) { }
@@ -53,8 +54,62 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
+
#endif /* CONFIG_SCHEDSTATS */

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
+static inline void free_tg_schedstats(struct task_group *tg)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ if (tg->stats)
+ kfree(tg->stats[i]);
+ }
+
+ kfree(tg->stats);
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+ struct sched_statistics *stats;
+ int i;
+
+ /*
+ * This memory should be allocated whatever schedstat_enabled() or
+ * not.
+ */
+ tg->stats = kcalloc(nr_cpu_ids, sizeof(stats), GFP_KERNEL);
+ if (!tg->stats)
+ return 0;
+
+ for_each_possible_cpu(i) {
+ stats = kzalloc_node(sizeof(struct sched_statistics),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!stats)
+ return 0;
+
+ tg->stats[i] = stats;
+ }
+
+ return 1;
+}
+
+#else
+
+static inline void free_tg_schedstats(struct task_group *tg)
+{
+
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+ return 1;
+}
+
+#endif
+
+
#ifdef CONFIG_PSI
/*
* PSI tracks state that persists across sleeps, such as iowaits and
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 55f39125c0e1..970266205b76 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -69,8 +69,8 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
--
2.18.2

2021-03-27 10:15:15

by Yafang Shao

Subject: [PATCH v2 4/6] sched: introduce task block time in schedstats

Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
there's no metric to show how long a task has been in D state. Once a task
is in D state, it means the task is blocked in the kernel, for example the
task may be waiting for a mutex. The D state is more frequent than iowait,
and it is more critical than S state. So it is worth adding a metric to
measure it.
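
As a usage illustration (not part of the patch), the new value can be read
from the task schedstats interface. Below is a minimal user-space sketch,
assuming the field is printed with a name containing "sum_block_runtime"
as in the debug.c hunk of this patch, and that schedstats is enabled
(kernel.sched_schedstats=1):

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            char path[64], line[256];
            FILE *fp;

            /* /proc/<pid>/sched, or the current task if no pid is given */
            snprintf(path, sizeof(path), "/proc/%s/sched",
                     argc > 1 ? argv[1] : "self");
            fp = fopen(path, "r");
            if (!fp) {
                    perror("fopen");
                    return 1;
            }

            while (fgets(line, sizeof(line), fp)) {
                    /* total D-state time, printed as msec.usec by SPLIT_NS() */
                    if (strstr(line, "sum_block_runtime"))
                            fputs(line, stdout);
            }

            fclose(fp);
            return 0;
    }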

Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/debug.c | 6 ++++--
kernel/sched/stats.c | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b687bb38897b..2b885481b8bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -428,6 +428,8 @@ struct sched_statistics {

u64 block_start;
u64 block_max;
+ s64 sum_block_runtime;
+
u64 exec_max;
u64 slice_max;

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index d1bc616936d9..0995412dd3c0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -499,10 +499,11 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
(long long)(p->nvcsw + p->nivcsw),
p->prio);

- SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+ SEQ_printf(m, "%9lld.%06ld %9lld.%06ld %9lld.%06ld %9lld.%06ld",
SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
SPLIT_NS(p->se.sum_exec_runtime),
- SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)));
+ SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)),
+ SPLIT_NS(schedstat_val_or_zero(p->stats.sum_block_runtime)));

#ifdef CONFIG_NUMA_BALANCING
SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@@ -941,6 +942,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
u64 avg_atom, avg_per_cpu;

PN_SCHEDSTAT(stats.sum_sleep_runtime);
+ PN_SCHEDSTAT(stats.sum_block_runtime);
PN_SCHEDSTAT(stats.wait_start);
PN_SCHEDSTAT(stats.sleep_start);
PN_SCHEDSTAT(stats.block_start);
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index b2542f4d3192..21fae41c06f5 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -82,6 +82,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,

__schedstat_set(stats->block_start, 0);
__schedstat_add(stats->sum_sleep_runtime, delta);
+ __schedstat_add(stats->sum_block_runtime, delta);

if (p) {
if (p->in_iowait) {
--
2.18.2

2021-03-27 10:15:28

by Yafang Shao

Subject: [PATCH v2 5/6] sched, rt: support sched_stat_runtime tracepoint for RT sched class

The runtime of an RT task is already accounted, so we only need to
add a tracepoint.

One difference between a fair task and an RT task is that there is no
vruntime for an RT task. To reuse the sched_stat_runtime tracepoint, '0'
is passed as the vruntime for RT tasks.

The output of this tracepoint for an RT task is as follows,
stress-9748 [039] d.h. 113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
stress-9748 [039] d.h. 113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]

Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/rt.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 34ad07fb924e..ae5282484710 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1012,6 +1012,8 @@ static void update_curr_rt(struct rq *rq)
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));

+ trace_sched_stat_runtime(curr, delta_exec, 0);
+
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);

--
2.18.2

2021-03-27 10:16:40

by Yafang Shao

Subject: [PATCH v2 6/6] sched, rt: support schedstats for RT sched class

We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but currently schedstats is only
supported for the fair sched class. This patch enables it for the RT sched
class as well.

After we make the struct sched_statistics and its helpers independent
of the fair sched class, we can easily use the schedstats facility for
the RT sched class.

The schedstat usage in the RT sched class is similar to that in the fair
sched class, for example,
                    fair                         RT
    enqueue         update_stats_enqueue_fair    update_stats_enqueue_rt
    dequeue         update_stats_dequeue_fair    update_stats_dequeue_rt
    put_prev_task   update_stats_wait_start      update_stats_wait_start
    set_next_task   update_stats_wait_end        update_stats_wait_end

The user can get the schedstats information in the same way as for the
fair sched class, for example,
    Interface            File
    task schedstats  :   /proc/[pid]/sched
    group schedstats :   /proc/sched_debug

The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. The output of these tracepoints for an
RT task is as follows,

- blocked & iowait
kworker/48:1-442 [048] d... 539.830872: sched_stat_iowait: comm=stress pid=10461 delay=158242 [ns]
kworker/48:1-442 [048] d... 539.830872: sched_stat_blocked: comm=stress pid=10461 delay=158242 [ns]

- wait
stress-10460 [001] dN.. 813.965304: sched_stat_wait: comm=stress pid=10462 delay=99997536 [ns]
stress-10462 [001] d.h. 813.966300: sched_stat_runtime: comm=stress pid=10462 runtime=993812 [ns] vruntime=0 [ns]
[...]
stress-10462 [001] d.h. 814.065300: sched_stat_runtime: comm=stress pid=10462 runtime=994484 [ns] vruntime=0 [ns]
[ 100 sched_stat_runtime events for pid 10462 in total ]
[ The delay of pid 10462 is the sum of the above runtimes ]
stress-10462 [001] dN.. 814.065307: sched_stat_wait: comm=stress pid=10460 delay=100001089 [ns]
stress-10460 [001] d.h. 814.066299: sched_stat_runtime: comm=stress pid=10460 runtime=991934 [ns] vruntime=0 [ns]

- sleep
sleep-15582 [041] dN.. 1732.814348: sched_stat_sleep: comm=sleep.sh pid=15474 delay=1001223130 [ns]
sleep-15584 [041] dN.. 1733.815908: sched_stat_sleep: comm=sleep.sh pid=15474 delay=1001238954 [ns]
[ In sleep.sh, it sleeps 1 sec each time. ]

[[email protected]: reported build failure in prev version]

Signed-off-by: Yafang Shao <[email protected]>
Cc: kernel test robot <[email protected]>
---
kernel/sched/rt.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ae5282484710..e5f706ffcdbc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1273,6 +1273,125 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
rt_se->on_list = 0;
}

+#ifdef CONFIG_RT_GROUP_SCHED
+static inline void
+__schedstats_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+ struct sched_statistics **stats)
+{
+ struct task_struct *p;
+ struct task_group *tg;
+ struct rt_rq *rt_rq;
+ int cpu;
+
+ if (rt_entity_is_task(rt_se)) {
+ p = rt_task_of(rt_se);
+ *stats = &p->stats;
+ } else {
+ rt_rq = group_rt_rq(rt_se);
+ tg = rt_rq->tg;
+ cpu = cpu_of(rq_of_rt_rq(rt_rq));
+ *stats = tg->stats[cpu];
+ }
+}
+
+#else
+
+static inline void
+__schedstats_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+ struct sched_statistics **stats)
+{
+ struct task_struct *p;
+
+ p = rt_task_of(rt_se);
+ *stats = &p->stats;
+}
+#endif
+
+static inline void
+update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstats_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_wait_start(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstats_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_enqueue_sleeper(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+ int flags)
+{
+ if (!schedstat_enabled())
+ return;
+
+ if (flags & ENQUEUE_WAKEUP)
+ update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
+}
+
+static inline void
+update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ __schedstats_from_sched_rt_entity(rt_se, &stats);
+
+ __update_stats_wait_end(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+ int flags)
+{
+ struct task_struct *p = NULL;
+
+ if (!schedstat_enabled())
+ return;
+
+ if (rt_entity_is_task(rt_se))
+ p = rt_task_of(rt_se);
+
+ if ((flags & DEQUEUE_SLEEP) && p) {
+ if (p->state & TASK_INTERRUPTIBLE)
+ __schedstat_set(p->stats.sleep_start,
+ rq_clock(rq_of_rt_rq(rt_rq)));
+
+ if (p->state & TASK_UNINTERRUPTIBLE)
+ __schedstat_set(p->stats.block_start,
+ rq_clock(rq_of_rt_rq(rt_rq)));
+ }
+}
+
static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
@@ -1346,6 +1465,8 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);

+ update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
+
dequeue_rt_stack(rt_se, flags);
for_each_sched_rt_entity(rt_se)
__enqueue_rt_entity(rt_se, flags);
@@ -1356,6 +1477,8 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);

+ update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
+
dequeue_rt_stack(rt_se, flags);

for_each_sched_rt_entity(rt_se) {
@@ -1378,6 +1501,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;

+ check_schedstat_required();
+ update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
+
enqueue_rt_entity(rt_se, flags);

if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -1578,7 +1704,12 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag

static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = &rq->rt;
+
p->se.exec_start = rq_clock_task(rq);
+ if (on_rt_rq(&p->rt))
+ update_stats_wait_end_rt(rt_rq, rt_se);

/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
@@ -1642,6 +1773,12 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)

static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq = &rq->rt;
+
+ if (on_rt_rq(&p->rt))
+ update_stats_wait_start_rt(rt_rq, rt_se);
+
update_curr_rt(rq);

update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
--
2.18.2

2021-04-06 23:32:35

by Yafang Shao

Subject: Re: [PATCH v2 0/6] sched: support schedstats for RT sched class

On Sat, Mar 27, 2021 at 6:13 PM Yafang Shao <[email protected]> wrote:
> We want to measure the latency of RT tasks in our production
> environment with the schedstats facility, but currently schedstats is
> only supported for the fair sched class. In order to support it for
> other sched classes, we should make it independent of the fair sched
> class.
>
> [...]

Peter, Ingo, Mel,

Any comments on this version ?

--
Thanks
Yafang