2020-11-27 16:16:09

by Yafang Shao

[permalink] [raw]
Subject: [RFC PATCH v3 0/5] sched: support schedstats for RT sched class

We want to measure the latency of RT tasks in our production
environment with schedstats facility, but currently schedstats is only
supported for fair sched class. This patch enable it for RT sched class
as well.

- Structure

Before we make schedstats available for RT sched class, we should make
struct sched_statistics independent of fair sched class first.


The struct sched_statistics is the schedular statistics of a task_struct
or a task_group. So we can move it into struct task_struct and
struct task_group to achieve the goal.

Below is the detailed explaination of the change in the structs.

The struct before this patch:

struct task_struct { |-> struct sched_entity {
... | ...
struct sched_entity *se; ---| struct sched_statistics statistics;
struct sched_rt_entity *rt; ...
... ...
}; };

struct task_group { |--> se[0]->statistics : schedstats of CPU0
... |
#ifdef CONFIG_FAIR_GROUP_SCHED |
struct sched_entity **se; --|--> se[1]->statistics : schedstats of CPU1
|
#endif |
|--> se[N]->statistics : schedstats of CPUn

#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_rt_entity **rt_se; (N/A)
#endif
...
};

The '**se' in task_group is allocated in the fair sched class, which is
hard to be reused by other sched class.

The struct after this patch:
struct task_struct {
...
struct sched_statistics statistics;
...
struct sched_entity *se;
struct sched_rt_entity *rt;
...
};

struct task_group { |---> stats[0] : of CPU0
... |
struct sched_statistics **stats; --|---> stats[1] : of CPU1
... |
|---> stats[n] : of CPUn
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity **se;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
#endif
...
};

After the patch it is clearly that both of se or rt_se can easily get the
sched_statistics by a task_struct or a task_group.

- Function Interface

The original prototype of the schedstats helpers are

update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)

The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get the struct sched_statistics and the struct task_struct. In
order to make these helpers available by all sched classes, we can pass
the rq, sched_statistics and task_struct directly.

Then the new helpers are

update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)

which are independent of fair sched class.

To avoid vmlinux growing too large or introducing ovehead when
!schedstat_enabled(), some new helpers after schedstat_enabled() are also
introduced, Suggested by Mel. These helpers are in sched/stats.c,

__update_stats_wait_*(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)

- Implementation

After we make the struct sched_statistics and the helpers of it
independent of fair sched class, we can easily use the schedstats
facility for RT sched class.

The schedstat usage in RT sched class is similar with fair sched class,
for example,
fair RT
enqueue update_stats_enqueue_fair update_stats_enqueue_rt
dequeue update_stats_dequeue_fair update_stats_dequeue_rt
put_prev_task update_stats_wait_start update_stats_wait_start
set_next_task update_stats_wait_end update_stats_wait_end

- Usage

The user can get the schedstats information in the same way in fair sched
class. For example,
fair RT
task show /proc/[pid]/sched /proc/[pid]/sched
group show cpu.stat in cgroup cpu.stat in cgroup

The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. sched_stat_runtime can also be
supported in the future if it is helpful.

Yafang Shao (5):
sched: don't include stats.h in sched.h
sched, fair: use __schedstat_set() in set_next_entity()
sched: make struct sched_statistics independent of fair sched class
sched: make schedstats helpers independent of fair sched class
sched, rt: support schedstats for RT sched class

include/linux/sched.h | 3 +-
kernel/sched/core.c | 25 +++--
kernel/sched/deadline.c | 5 +-
kernel/sched/debug.c | 82 +++++++--------
kernel/sched/fair.c | 209 +++++++++++++++------------------------
kernel/sched/idle.c | 1 +
kernel/sched/rt.c | 178 ++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 13 ++-
kernel/sched/stats.c | 105 ++++++++++++++++++++
kernel/sched/stats.h | 80 +++++++++++++++
kernel/sched/stop_task.c | 5 +-
11 files changed, 517 insertions(+), 189 deletions(-)

--
2.18.4


2020-11-27 16:17:00

by Yafang Shao

[permalink] [raw]
Subject: [RFC PATCH v3 2/5] sched, fair: use __schedstat_set() in set_next_entity()

schedstat_enabled() has been already checked, so we can use
__schedstat_set() directly.

Signed-off-by: Yafang Shao <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ff1daa3d9bb..7e7c03cede94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4414,7 +4414,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
- schedstat_set(se->statistics.slice_max,
+ __schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
--
2.18.4

2020-11-27 16:17:10

by Yafang Shao

[permalink] [raw]
Subject: [RFC PATCH v3 3/5] sched: make struct sched_statistics independent of fair sched class

If we want to use schedstats facility, we should move out of
struct sched_statistics from the struct sched_entity or add it into other
sctructs of sched entity as well. Obviously the latter one is bad because
that requires more spaces. So we should move it into a common struct which
can be used by all sched classes.

The struct sched_statistics is the schedular statistics of a task_struct
or a task_group. So we can move it into struct task_struct and
struct task_group to achieve the goal.

Below is the detailed explaination of the change in the structs.

- Before this patch

struct task_struct { |-> struct sched_entity {
... | ...
struct sched_entity *se; ---| struct sched_statistics statistics;
struct sched_rt_entity *rt; ...
... ...
}; };

struct task_group { |--> se[0]->statistics : schedstats of CPU0
... |
#ifdef CONFIG_FAIR_GROUP_SCHED |
struct sched_entity **se; --|--> se[1]->statistics : schedstats of CPU1
|
#endif |
|--> se[N]->statistics : schedstats of CPUn

#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_rt_entity **rt_se; (N/A)
#endif
...
};

The '**se' in task_group is allocated in the fair sched class, which is
hard to be reused by other sched class.

- After this patch

struct task_struct {
...
struct sched_statistics statistics;
...
struct sched_entity *se;
struct sched_rt_entity *rt;
...
};

struct task_group { |---> stats[0] : of CPU0
... |
struct sched_statistics **stats; --|---> stats[1] : of CPU1
... |
|---> stats[n] : of CPUn
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity **se;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
#endif
...
};

After the patch it is clearly that both of se or rt_se can easily get the
sched_statistics by a task_struct or a task_group.

Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/sched.h | 3 +-
kernel/sched/core.c | 24 ++++---
kernel/sched/deadline.c | 4 +-
kernel/sched/debug.c | 81 +++++++++++-----------
kernel/sched/fair.c | 142 ++++++++++++++++++++++++++-------------
kernel/sched/rt.c | 4 +-
kernel/sched/sched.h | 3 +
kernel/sched/stats.h | 46 +++++++++++++
kernel/sched/stop_task.c | 4 +-
9 files changed, 206 insertions(+), 105 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..f8e969be4bee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -462,8 +462,6 @@ struct sched_entity {

u64 nr_migrations;

- struct sched_statistics statistics;
-
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent;
@@ -681,6 +679,7 @@ struct task_struct {
unsigned int rt_priority;

const struct sched_class *sched_class;
+ struct sched_statistics stats;
struct sched_entity se;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd76628778f7..081b4f1f2cb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2427,11 +2427,11 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
#ifdef CONFIG_SMP
if (cpu == rq->cpu) {
__schedstat_inc(rq->ttwu_local);
- __schedstat_inc(p->se.statistics.nr_wakeups_local);
+ __schedstat_inc(p->stats.nr_wakeups_local);
} else {
struct sched_domain *sd;

- __schedstat_inc(p->se.statistics.nr_wakeups_remote);
+ __schedstat_inc(p->stats.nr_wakeups_remote);
rcu_read_lock();
for_each_domain(rq->cpu, sd) {
if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -2443,14 +2443,14 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
}

if (wake_flags & WF_MIGRATED)
- __schedstat_inc(p->se.statistics.nr_wakeups_migrate);
+ __schedstat_inc(p->stats.nr_wakeups_migrate);
#endif /* CONFIG_SMP */

__schedstat_inc(rq->ttwu_count);
- __schedstat_inc(p->se.statistics.nr_wakeups);
+ __schedstat_inc(p->stats.nr_wakeups);

if (wake_flags & WF_SYNC)
- __schedstat_inc(p->se.statistics.nr_wakeups_sync);
+ __schedstat_inc(p->stats.nr_wakeups_sync);
}

/*
@@ -3075,7 +3075,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
- memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+ memset(&p->stats, 0, sizeof(p->stats));
#endif

RB_CLEAR_NODE(&p->dl.rb_node);
@@ -7347,9 +7347,9 @@ void normalize_rt_tasks(void)
continue;

p->se.exec_start = 0;
- schedstat_set(p->se.statistics.wait_start, 0);
- schedstat_set(p->se.statistics.sleep_start, 0);
- schedstat_set(p->se.statistics.block_start, 0);
+ schedstat_set(p->stats.wait_start, 0);
+ schedstat_set(p->stats.sleep_start, 0);
+ schedstat_set(p->stats.block_start, 0);

if (!dl_task(p) && !rt_task(p)) {
/*
@@ -7439,6 +7439,7 @@ static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
free_rt_sched_group(tg);
+ free_tg_schedstats(tg);
autogroup_free(tg);
kmem_cache_free(task_group_cache, tg);
}
@@ -7458,6 +7459,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_tg_schedstats(tg))
+ goto err;
+
alloc_uclamp_sched_group(tg, parent);

return tg;
@@ -8144,7 +8148,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
int i;

for_each_possible_cpu(i)
- ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+ ws += schedstat_val(tg->stats[i]->wait_sum);

seq_printf(sf, "wait_sum %llu\n", ws);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7a0124f81a4f..315d140b8f0e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1245,8 +1245,8 @@ static void update_curr_dl(struct rq *rq)
return;
}

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 9758aa1bba1e..8c646fcb89de 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -444,16 +444,16 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
PN(se->sum_exec_runtime);

if (schedstat_enabled()) {
- PN_SCHEDSTAT(se->statistics.wait_start);
- PN_SCHEDSTAT(se->statistics.sleep_start);
- PN_SCHEDSTAT(se->statistics.block_start);
- PN_SCHEDSTAT(se->statistics.sleep_max);
- PN_SCHEDSTAT(se->statistics.block_max);
- PN_SCHEDSTAT(se->statistics.exec_max);
- PN_SCHEDSTAT(se->statistics.slice_max);
- PN_SCHEDSTAT(se->statistics.wait_max);
- PN_SCHEDSTAT(se->statistics.wait_sum);
- P_SCHEDSTAT(se->statistics.wait_count);
+ PN_SCHEDSTAT(tg->stats[cpu]->wait_start);
+ PN_SCHEDSTAT(tg->stats[cpu]->sleep_start);
+ PN_SCHEDSTAT(tg->stats[cpu]->block_start);
+ PN_SCHEDSTAT(tg->stats[cpu]->sleep_max);
+ PN_SCHEDSTAT(tg->stats[cpu]->block_max);
+ PN_SCHEDSTAT(tg->stats[cpu]->exec_max);
+ PN_SCHEDSTAT(tg->stats[cpu]->slice_max);
+ PN_SCHEDSTAT(tg->stats[cpu]->wait_max);
+ PN_SCHEDSTAT(tg->stats[cpu]->wait_sum);
+ P_SCHEDSTAT(tg->stats[cpu]->wait_count);
}

P(se->load.weight);
@@ -499,9 +499,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
p->prio);

SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
- SPLIT_NS(schedstat_val_or_zero(p->se.statistics.wait_sum)),
+ SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
SPLIT_NS(p->se.sum_exec_runtime),
- SPLIT_NS(schedstat_val_or_zero(p->se.statistics.sum_sleep_runtime)));
+ SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)));

#ifdef CONFIG_NUMA_BALANCING
SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@@ -938,34 +938,33 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,

if (schedstat_enabled()) {
u64 avg_atom, avg_per_cpu;
-
- PN_SCHEDSTAT(se.statistics.sum_sleep_runtime);
- PN_SCHEDSTAT(se.statistics.wait_start);
- PN_SCHEDSTAT(se.statistics.sleep_start);
- PN_SCHEDSTAT(se.statistics.block_start);
- PN_SCHEDSTAT(se.statistics.sleep_max);
- PN_SCHEDSTAT(se.statistics.block_max);
- PN_SCHEDSTAT(se.statistics.exec_max);
- PN_SCHEDSTAT(se.statistics.slice_max);
- PN_SCHEDSTAT(se.statistics.wait_max);
- PN_SCHEDSTAT(se.statistics.wait_sum);
- P_SCHEDSTAT(se.statistics.wait_count);
- PN_SCHEDSTAT(se.statistics.iowait_sum);
- P_SCHEDSTAT(se.statistics.iowait_count);
- P_SCHEDSTAT(se.statistics.nr_migrations_cold);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_running);
- P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot);
- P_SCHEDSTAT(se.statistics.nr_forced_migrations);
- P_SCHEDSTAT(se.statistics.nr_wakeups);
- P_SCHEDSTAT(se.statistics.nr_wakeups_sync);
- P_SCHEDSTAT(se.statistics.nr_wakeups_migrate);
- P_SCHEDSTAT(se.statistics.nr_wakeups_local);
- P_SCHEDSTAT(se.statistics.nr_wakeups_remote);
- P_SCHEDSTAT(se.statistics.nr_wakeups_affine);
- P_SCHEDSTAT(se.statistics.nr_wakeups_affine_attempts);
- P_SCHEDSTAT(se.statistics.nr_wakeups_passive);
- P_SCHEDSTAT(se.statistics.nr_wakeups_idle);
+ PN_SCHEDSTAT(stats.sum_sleep_runtime);
+ PN_SCHEDSTAT(stats.wait_start);
+ PN_SCHEDSTAT(stats.sleep_start);
+ PN_SCHEDSTAT(stats.block_start);
+ PN_SCHEDSTAT(stats.sleep_max);
+ PN_SCHEDSTAT(stats.block_max);
+ PN_SCHEDSTAT(stats.exec_max);
+ PN_SCHEDSTAT(stats.slice_max);
+ PN_SCHEDSTAT(stats.wait_max);
+ PN_SCHEDSTAT(stats.wait_sum);
+ P_SCHEDSTAT(stats.wait_count);
+ PN_SCHEDSTAT(stats.iowait_sum);
+ P_SCHEDSTAT(stats.iowait_count);
+ P_SCHEDSTAT(stats.nr_migrations_cold);
+ P_SCHEDSTAT(stats.nr_failed_migrations_affine);
+ P_SCHEDSTAT(stats.nr_failed_migrations_running);
+ P_SCHEDSTAT(stats.nr_failed_migrations_hot);
+ P_SCHEDSTAT(stats.nr_forced_migrations);
+ P_SCHEDSTAT(stats.nr_wakeups);
+ P_SCHEDSTAT(stats.nr_wakeups_sync);
+ P_SCHEDSTAT(stats.nr_wakeups_migrate);
+ P_SCHEDSTAT(stats.nr_wakeups_local);
+ P_SCHEDSTAT(stats.nr_wakeups_remote);
+ P_SCHEDSTAT(stats.nr_wakeups_affine);
+ P_SCHEDSTAT(stats.nr_wakeups_affine_attempts);
+ P_SCHEDSTAT(stats.nr_wakeups_passive);
+ P_SCHEDSTAT(stats.nr_wakeups_idle);

avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
@@ -1031,6 +1030,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
void proc_sched_set_task(struct task_struct *p)
{
#ifdef CONFIG_SCHEDSTATS
- memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+ memset(&p->stats, 0, sizeof(p->stats));
#endif
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e7c03cede94..14d8df308d44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -837,12 +837,44 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

+static inline void
+__schedstat_from_sched_entity(struct sched_entity *se,
+ struct sched_statistics **stats)
+{
+ struct task_struct *p;
+ struct task_group *tg;
+ struct cfs_rq *cfs_rq;
+ int cpu;
+
+ if (entity_is_task(se)) {
+ p = task_of(se);
+ *stats = &p->stats;
+ } else {
+ cfs_rq = group_cfs_rq(se);
+ tg = cfs_rq->tg;
+ cpu = cpu_of(rq_of(cfs_rq));
+ *stats = tg->stats[cpu];
+ }
+}
+
+static inline void
+schedstat_from_sched_entity(struct sched_entity *se,
+ struct sched_statistics **stats)
+{
+ if (!schedstat_enabled())
+ return;
+
+ __schedstat_from_sched_entity(se, stats);
+}
+
/*
* Update the current task's runtime statistics.
*/
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
+ struct sched_statistics *stats = NULL;
+
u64 now = rq_clock_task(rq_of(cfs_rq));
u64 delta_exec;

@@ -855,8 +887,12 @@ static void update_curr(struct cfs_rq *cfs_rq)

curr->exec_start = now;

- schedstat_set(curr->statistics.exec_max,
- max(delta_exec, curr->statistics.exec_max));
+ if (schedstat_enabled()) {
+ __schedstat_from_sched_entity(curr, &stats);
+
+ __schedstat_set(stats->exec_max,
+ max(delta_exec, stats->exec_max));
+ }

curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq->exec_clock, delta_exec);
@@ -883,67 +919,78 @@ static void update_curr_fair(struct rq *rq)
static inline void
update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct sched_statistics *stats = NULL;
u64 wait_start, prev_wait_start;

if (!schedstat_enabled())
return;

+ __schedstat_from_sched_entity(se, &stats);
+
wait_start = rq_clock(rq_of(cfs_rq));
- prev_wait_start = schedstat_val(se->statistics.wait_start);
+ prev_wait_start = schedstat_val(stats->wait_start);

if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
likely(wait_start > prev_wait_start))
wait_start -= prev_wait_start;

- __schedstat_set(se->statistics.wait_start, wait_start);
+ __schedstat_set(stats->wait_start, wait_start);
}

static inline void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- struct task_struct *p;
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
u64 delta;

if (!schedstat_enabled())
return;

- delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
+ __schedstat_from_sched_entity(se, &stats);

+ delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
if (entity_is_task(se)) {
p = task_of(se);
+
if (task_on_rq_migrating(p)) {
/*
* Preserve migrating task's wait time so wait_start
* time stamp can be adjusted to accumulate wait time
* prior to migration.
*/
- __schedstat_set(se->statistics.wait_start, delta);
+ __schedstat_set(stats->wait_start, delta);
+
return;
}
+
trace_sched_stat_wait(p, delta);
}

- __schedstat_set(se->statistics.wait_max,
- max(schedstat_val(se->statistics.wait_max), delta));
- __schedstat_inc(se->statistics.wait_count);
- __schedstat_add(se->statistics.wait_sum, delta);
- __schedstat_set(se->statistics.wait_start, 0);
+ __schedstat_set(stats->wait_max,
+ max(schedstat_val(stats->wait_max), delta));
+ __schedstat_inc(stats->wait_count);
+ __schedstat_add(stats->wait_sum, delta);
+ __schedstat_set(stats->wait_start, 0);
}

static inline void
update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- struct task_struct *tsk = NULL;
+ struct sched_statistics *stats = NULL;
+ struct task_struct *p = NULL;
u64 sleep_start, block_start;

if (!schedstat_enabled())
return;

- sleep_start = schedstat_val(se->statistics.sleep_start);
- block_start = schedstat_val(se->statistics.block_start);
-
if (entity_is_task(se))
- tsk = task_of(se);
+ p = task_of(se);
+
+ __schedstat_from_sched_entity(se, &stats);
+
+ sleep_start = schedstat_val(stats->sleep_start);
+ block_start = schedstat_val(stats->block_start);

if (sleep_start) {
u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
@@ -951,15 +998,15 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
- __schedstat_set(se->statistics.sleep_max, delta);
+ if (unlikely(delta > schedstat_val(stats->sleep_max)))
+ __schedstat_set(stats->sleep_max, delta);

- __schedstat_set(se->statistics.sleep_start, 0);
- __schedstat_add(se->statistics.sum_sleep_runtime, delta);
+ __schedstat_set(stats->sleep_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);

- if (tsk) {
- account_scheduler_latency(tsk, delta >> 10, 1);
- trace_sched_stat_sleep(tsk, delta);
+ if (p) {
+ account_scheduler_latency(p, delta >> 10, 1);
+ trace_sched_stat_sleep(p, delta);
}
}
if (block_start) {
@@ -968,20 +1015,20 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > schedstat_val(se->statistics.block_max)))
- __schedstat_set(se->statistics.block_max, delta);
+ if (unlikely(delta > schedstat_val(stats->block_max)))
+ __schedstat_set(stats->block_max, delta);

- __schedstat_set(se->statistics.block_start, 0);
- __schedstat_add(se->statistics.sum_sleep_runtime, delta);
+ __schedstat_set(stats->block_start, 0);
+ __schedstat_add(stats->sum_sleep_runtime, delta);

- if (tsk) {
- if (tsk->in_iowait) {
- __schedstat_add(se->statistics.iowait_sum, delta);
- __schedstat_inc(se->statistics.iowait_count);
- trace_sched_stat_iowait(tsk, delta);
+ if (p) {
+ if (p->in_iowait) {
+ __schedstat_add(stats->iowait_sum, delta);
+ __schedstat_inc(stats->iowait_count);
+ trace_sched_stat_iowait(p, delta);
}

- trace_sched_stat_blocked(tsk, delta);
+ trace_sched_stat_blocked(p, delta);

/*
* Blocking time is in units of nanosecs, so shift by
@@ -990,10 +1037,10 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (unlikely(prof_on == SLEEP_PROFILING)) {
profile_hits(SLEEP_PROFILING,
- (void *)get_wchan(tsk),
+ (void *)get_wchan(p),
delta >> 20);
}
- account_scheduler_latency(tsk, delta >> 10, 0);
+ account_scheduler_latency(p, delta >> 10, 0);
}
}
}
@@ -1036,10 +1083,10 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
struct task_struct *tsk = task_of(se);

if (tsk->state & TASK_INTERRUPTIBLE)
- __schedstat_set(se->statistics.sleep_start,
+ __schedstat_set(tsk->stats.sleep_start,
rq_clock(rq_of(cfs_rq)));
if (tsk->state & TASK_UNINTERRUPTIBLE)
- __schedstat_set(se->statistics.block_start,
+ __schedstat_set(tsk->stats.block_start,
rq_clock(rq_of(cfs_rq)));
}
}
@@ -4392,6 +4439,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct sched_statistics *stats = NULL;
+
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
@@ -4414,8 +4463,9 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
- __schedstat_set(se->statistics.slice_max,
- max((u64)schedstat_val(se->statistics.slice_max),
+ __schedstat_from_sched_entity(se, &stats);
+ __schedstat_set(stats->slice_max,
+ max((u64)schedstat_val(stats->slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}

@@ -5862,12 +5912,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

- schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
+ schedstat_inc(p->stats.nr_wakeups_affine_attempts);
if (target == nr_cpumask_bits)
return prev_cpu;

schedstat_inc(sd->ttwu_move_affine);
- schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ schedstat_inc(p->stats.nr_wakeups_affine);
return target;
}

@@ -7533,7 +7583,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;

- schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+ schedstat_inc(p->stats.nr_failed_migrations_affine);

env->flags |= LBF_SOME_PINNED;

@@ -7564,7 +7614,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->flags &= ~LBF_ALL_PINNED;

if (task_running(env->src_rq, p)) {
- schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+ schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}

@@ -7582,12 +7632,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (tsk_cache_hot == 1) {
schedstat_inc(env->sd->lb_hot_gained[env->idle]);
- schedstat_inc(p->se.statistics.nr_forced_migrations);
+ schedstat_inc(p->stats.nr_forced_migrations);
}
return 1;
}

- schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
+ schedstat_inc(p->stats.nr_failed_migrations_hot);
return 0;
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index af772ac0f32d..2d543a270dfe 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1008,8 +1008,8 @@ static void update_curr_rt(struct rq *rq)
if (unlikely((s64)delta_exec <= 0))
return;

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 871544bb9a38..b1cdb942c67d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -387,6 +387,9 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+ /* schedstats of this group on each CPU */
+ struct sched_statistics **stats;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c23b653ffc53..87242968712e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -42,6 +42,42 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)

+static inline void free_tg_schedstats(struct task_group *tg)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ if (tg->stats)
+ kfree(tg->stats[i]);
+ }
+
+ kfree(tg->stats);
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+ struct sched_statistics *stats;
+ int i;
+
+ /*
+ * This memory should be allocated whatever schedstat_enabled() or
+ * not.
+ */
+ tg->stats = kcalloc(nr_cpu_ids, sizeof(stats), GFP_KERNEL);
+ if (!tg->stats)
+ return 0;
+
+ for_each_possible_cpu(i) {
+ stats = kzalloc_node(sizeof(struct sched_statistics),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!stats)
+ return 0;
+ tg->stats[i] = stats;
+ }
+
+ return 1;
+}
+
#else /* !CONFIG_SCHEDSTATS: */
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_dequeued(struct rq *rq, unsigned long long delta) { }
@@ -55,6 +91,16 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
+
+static inline void free_tg_schedstats(struct task_group *tg)
+{
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+ return 1;
+}
+
#endif /* CONFIG_SCHEDSTATS */

#ifdef CONFIG_PSI
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index a5d289049388..f35cd67a0881 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,8 +70,8 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;

- schedstat_set(curr->se.statistics.exec_max,
- max(curr->se.statistics.exec_max, delta_exec));
+ schedstat_set(curr->stats.exec_max,
+ max(curr->stats.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
--
2.18.4

2020-11-28 03:18:31

by Yafang Shao

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/5] sched: support schedstats for RT sched class

On Sat, Nov 28, 2020 at 12:12 AM Yafang Shao <[email protected]> wrote:
>
> We want to measure the latency of RT tasks in our production
> environment with schedstats facility, but currently schedstats is only
> supported for fair sched class. This patch enable it for RT sched class
> as well.
>
> - Structure
>
> Before we make schedstats available for RT sched class, we should make
> struct sched_statistics independent of fair sched class first.
>
>
> The struct sched_statistics is the schedular statistics of a task_struct
> or a task_group. So we can move it into struct task_struct and
> struct task_group to achieve the goal.
>
> Below is the detailed explaination of the change in the structs.
>
> The struct before this patch:
>
> struct task_struct { |-> struct sched_entity {
> ... | ...
> struct sched_entity *se; ---| struct sched_statistics statistics;
> struct sched_rt_entity *rt; ...
> ... ...
> }; };
>
> struct task_group { |--> se[0]->statistics : schedstats of CPU0
> ... |
> #ifdef CONFIG_FAIR_GROUP_SCHED |
> struct sched_entity **se; --|--> se[1]->statistics : schedstats of CPU1
> |
> #endif |
> |--> se[N]->statistics : schedstats of CPUn
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> struct sched_rt_entity **rt_se; (N/A)
> #endif
> ...
> };
>
> The '**se' in task_group is allocated in the fair sched class, which is
> hard to be reused by other sched class.
>
> The struct after this patch:
> struct task_struct {
> ...
> struct sched_statistics statistics;
> ...
> struct sched_entity *se;
> struct sched_rt_entity *rt;
> ...
> };
>
> struct task_group { |---> stats[0] : of CPU0
> ... |
> struct sched_statistics **stats; --|---> stats[1] : of CPU1
> ... |
> |---> stats[n] : of CPUn
> #ifdef CONFIG_FAIR_GROUP_SCHED
> struct sched_entity **se;
> #endif
> #ifdef CONFIG_RT_GROUP_SCHED
> struct sched_rt_entity **rt_se;
> #endif
> ...
> };
>
> After the patch it is clearly that both of se or rt_se can easily get the
> sched_statistics by a task_struct or a task_group.
>
> - Function Interface
>
> The original prototype of the schedstats helpers are
>
> update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> The cfs_rq in these helpers is used to get the rq_clock, and the se is
> used to get the struct sched_statistics and the struct task_struct. In
> order to make these helpers available by all sched classes, we can pass
> the rq, sched_statistics and task_struct directly.
>
> Then the new helpers are
>
> update_stats_wait_*(struct rq *rq, struct task_struct *p,
> struct sched_statistics *stats)
>
> which are independent of fair sched class.
>
> To avoid vmlinux growing too large or introducing ovehead when
> !schedstat_enabled(), some new helpers after schedstat_enabled() are also
> introduced, Suggested by Mel. These helpers are in sched/stats.c,
>
> __update_stats_wait_*(struct rq *rq, struct task_struct *p,
> struct sched_statistics *stats)
>
> - Implementation
>
> After we make the struct sched_statistics and the helpers of it
> independent of fair sched class, we can easily use the schedstats
> facility for RT sched class.
>
> The schedstat usage in RT sched class is similar with fair sched class,
> for example,
> fair RT
> enqueue update_stats_enqueue_fair update_stats_enqueue_rt
> dequeue update_stats_dequeue_fair update_stats_dequeue_rt
> put_prev_task update_stats_wait_start update_stats_wait_start
> set_next_task update_stats_wait_end update_stats_wait_end
>
> - Usage
>
> The user can get the schedstats information in the same way in fair sched
> class. For example,
> fair RT
> task show /proc/[pid]/sched /proc/[pid]/sched
> group show cpu.stat in cgroup cpu.stat in cgroup
>
> The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
> be used to trace RT tasks as well. sched_stat_runtime can also be
> supported in the future if it is helpful.
>
> Yafang Shao (5):
> sched: don't include stats.h in sched.h
> sched, fair: use __schedstat_set() in set_next_entity()
> sched: make struct sched_statistics independent of fair sched class
> sched: make schedstats helpers independent of fair sched class
> sched, rt: support schedstats for RT sched class
>
> include/linux/sched.h | 3 +-
> kernel/sched/core.c | 25 +++--
> kernel/sched/deadline.c | 5 +-
> kernel/sched/debug.c | 82 +++++++--------
> kernel/sched/fair.c | 209 +++++++++++++++------------------------
> kernel/sched/idle.c | 1 +
> kernel/sched/rt.c | 178 ++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 13 ++-
> kernel/sched/stats.c | 105 ++++++++++++++++++++
> kernel/sched/stats.h | 80 +++++++++++++++
> kernel/sched/stop_task.c | 5 +-
> 11 files changed, 517 insertions(+), 189 deletions(-)
>
> --
> 2.18.4
>

Ah, the patch #5 is missed,pls ignore this patchset. I will resend them.

--
Thanks
Yafang