Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition between
these groups for CPU resources is.
Monitoring the wait time of each process could cost too much, and
there is no good way to accurately represent the conflict with
that information; what we need is the wait time at the group level.
Thus we introduce the group's wait_sum, provided by the kernel, to
represent the conflict between task groups: whenever a group's cfs_rq
finishes waiting, its wait time is accounted to the sum.
The 'cpu.stat' file is modified to show the new statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes in wait_sum to tell how much a
task group is suffering in the fight for CPU resources.
Signed-off-by: Michael Wang <[email protected]>
---
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 1 +
3 files changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..ac27b8d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6787,6 +6787,8 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "nr_periods %d\n", cfs_b->nr_periods);
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
+ if (schedstat_enabled())
+ seq_printf(sf, "wait_sum %llu\n", tg->wait_sum);
return 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1866e64..ef82ceb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -862,6 +862,7 @@ static void update_curr_fair(struct rq *rq)
static inline void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct task_group *tg;
struct task_struct *p;
u64 delta;
@@ -882,6 +883,9 @@ static void update_curr_fair(struct rq *rq)
return;
}
trace_sched_stat_wait(p, delta);
+ } else {
+ tg = group_cfs_rq(se)->tg;
+ __schedstat_add(tg->wait_sum, delta);
}
__schedstat_set(se->statistics.wait_max,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf..bb9b4fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -358,6 +358,7 @@ struct task_group {
/* runqueue "owned" by this group on each CPU */
struct cfs_rq **cfs_rq;
unsigned long shares;
+ u64 wait_sum;
#ifdef CONFIG_SMP
/*
--
1.8.3.1
On Mon, Jul 02, 2018 at 03:29:39PM +0800, 王贇 wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1866e64..ef82ceb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -862,6 +862,7 @@ static void update_curr_fair(struct rq *rq)
> static inline void
> update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + struct task_group *tg;
> struct task_struct *p;
> u64 delta;
>
> @@ -882,6 +883,9 @@ static void update_curr_fair(struct rq *rq)
> return;
> }
> trace_sched_stat_wait(p, delta);
> + } else {
> + tg = group_cfs_rq(se)->tg;
> + __schedstat_add(tg->wait_sum, delta);
> }
You're joking right? This patch is both broken and utterly insane.
You're wanting to update an effectively global variable for every
schedule action (and it's broken because it is without any serialization
or atomics).
NAK
Hi, Peter
On 2018/7/2 8:03 PM, Peter Zijlstra wrote:
> On Mon, Jul 02, 2018 at 03:29:39PM +0800, 王贇 wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1866e64..ef82ceb 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -862,6 +862,7 @@ static void update_curr_fair(struct rq *rq)
>> static inline void
>> update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> {
>> + struct task_group *tg;
>> struct task_struct *p;
>> u64 delta;
>>
>> @@ -882,6 +883,9 @@ static void update_curr_fair(struct rq *rq)
>> return;
>> }
>> trace_sched_stat_wait(p, delta);
>> + } else {
>> + tg = group_cfs_rq(se)->tg;
>> + __schedstat_add(tg->wait_sum, delta);
>> }
>
> You're joking right? This patch is both broken and utterly insane.
>
> You're wanting to update an effectively global variable for every
> schedule action (and its broken because it is without any serialization
> or atomics).
Thanks for the reply, and sorry for the thoughtless approach. I'll
rewrite the code to use a per-cpu variable, then assemble the results
when showing the stat.
Regards,
Michael Wang
>
> NAK
>
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition between
these groups for CPU resources is.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good way to accurately represent the
conflict with that information; what we need is the wait time at
the group level.
Thus we introduce the group's wait_sum to represent the conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' file is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes in wait_sum to tell how much a
task group is suffering in the fight for CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group spent X percent of the period waiting
for the CPU.
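As an illustration of how that formula might be applied from user space,
here is a minimal monitoring sketch (not part of the patch); the cgroup v1
mount point, the group name "mygroup" and the one-second sampling period
are assumptions for illustration only:

/*
 * Hypothetical monitor: sample wait_sum twice, one second apart,
 * and apply the formula above.  Assumes the cpu controller is
 * mounted at /sys/fs/cgroup/cpu.
 */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_wait_sum(const char *path)
{
	char line[128];
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "wait_sum %llu", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	const char *path = "/sys/fs/cgroup/cpu/mygroup/cpu.stat"; /* assumed path */
	const double period_ns = 1e9;	/* one-second sampling period */
	long nr_cpu = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned long long last, now;

	last = read_wait_sum(path);
	sleep(1);
	now = read_wait_sum(path);

	printf("waited %.2f%% of the period\n",
	       (double)(now - last) * 100.0 / (nr_cpu * period_ns));
	return 0;
}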
Signed-off-by: Michael Wang <[email protected]>
---
Since RFC:
redesigned the way to acquire wait_sum
kernel/sched/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..cbff06b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6781,6 +6781,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
{
+ int i;
+ u64 wait_sum = 0;
struct task_group *tg = css_tg(seq_css(sf));
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
+ if (schedstat_enabled()) {
+ for_each_possible_cpu(i)
+ wait_sum += tg->se[i]->statistics.wait_sum;
+ seq_printf(sf, "wait_sum %llu\n", wait_sum);
+ }
+
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
--
1.8.3.1
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition between
these groups for CPU resources is.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good way to accurately represent the
conflict with that information; what we need is the wait time at
the group level.
Thus we introduce the group's wait_sum to represent the conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' file is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes in wait_sum to tell how much a
task group is suffering in the fight for CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group spent X percent of the period waiting
for the CPU.
Signed-off-by: Michael Wang <[email protected]>
---
Since v1:
Use schedstat_val to avoid compile error
Check and skip root_task_group
kernel/sched/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..80ab995 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6781,6 +6781,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
{
+ int i;
+ u64 ws = 0;
struct task_group *tg = css_tg(seq_css(sf));
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
+ if (schedstat_enabled() && tg != &root_task_group) {
+ for_each_possible_cpu(i)
+ ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+ seq_printf(sf, "wait_sum %llu\n", ws);
+ }
+
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
--
1.8.3.1
On 2018/7/4 11:27 AM, 王贇 wrote:
> Although we can rely on cpuacct to present the cpu usage of task
> group, it is hard to tell how intense the competition is between
> these groups on cpu resources.
>
> Monitoring the wait time of each process or sched_debug could cost
> too much, and there is no good way to accurately represent the
> conflict with these info, we need the wait time on group dimension.
>
> Thus we introduced group's wait_sum represent the conflict between
> task groups, which is simply sum the wait time of group's cfs_rq.
>
> The 'cpu.stat' is modified to show the statistic, like:
>
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> wait_sum 2035098795584
>
> Now we can monitor the changing on wait_sum to tell how suffering
> a task group is in the fight of cpu resources.
>
> For example:
> (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
>
> means the task group paid X percentage of period on waiting
> for the cpu.
Hi, Peter
What do you think about this proposal?
There are situations where the tasks in some groups suffer much more
than others; it would be good to have some way to easily locate them.
Regards,
Michael Wang
>
> Signed-off-by: Michael Wang <[email protected]>
> ---
>
> Since v1:
> Use schedstat_val to avoid compile error
> Check and skip root_task_group
>
> kernel/sched/core.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d8fac..80ab995 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6781,6 +6781,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
>
> static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> {
> + int i;
> + u64 ws = 0;
> struct task_group *tg = css_tg(seq_css(sf));
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> @@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
> seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>
> + if (schedstat_enabled() && tg != &root_task_group) {
> + for_each_possible_cpu(i)
> + ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> + seq_printf(sf, "wait_sum %llu\n", ws);
> + }
> +
> return 0;
> }
> #endif /* CONFIG_CFS_BANDWIDTH */
Hi, folks
On 2018/7/4 11:27 AM, 王贇 wrote:
> Although we can rely on cpuacct to present the cpu usage of task
> group, it is hard to tell how intense the competition is between
> these groups on cpu resources.
>
> Monitoring the wait time of each process or sched_debug could cost
> too much, and there is no good way to accurately represent the
> conflict with these info, we need the wait time on group dimension.
>
> Thus we introduced group's wait_sum represent the conflict between
> task groups, which is simply sum the wait time of group's cfs_rq.
>
> The 'cpu.stat' is modified to show the statistic, like:
>
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> wait_sum 2035098795584
>
> Now we can monitor the changing on wait_sum to tell how suffering
> a task group is in the fight of cpu resources.
>
> For example:
> (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
>
> means the task group paid X percentage of period on waiting
> for the cpu.
Any comments please?
Regards,
Michael Wang
>
> Signed-off-by: Michael Wang <[email protected]>
> ---
>
> Since v1:
> Use schedstat_val to avoid compile error
> Check and skip root_task_group
>
> kernel/sched/core.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d8fac..80ab995 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6781,6 +6781,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
>
> static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> {
> + int i;
> + u64 ws = 0;
> struct task_group *tg = css_tg(seq_css(sf));
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> @@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
> seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>
> + if (schedstat_enabled() && tg != &root_task_group) {
> + for_each_possible_cpu(i)
> + ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> + seq_printf(sf, "wait_sum %llu\n", ws);
> + }
> +
> return 0;
> }
> #endif /* CONFIG_CFS_BANDWIDTH */
On Wed, Jul 04, 2018 at 11:27:27AM +0800, 王贇 wrote:
> @@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
> seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>
> + if (schedstat_enabled() && tg != &root_task_group) {
I put the variables here.
> + for_each_possible_cpu(i)
> + ws += schedstat_val(tg->se[i]->statistics.wait_sum);
This doesn't quite work on 32bit archs, but I'm not sure I care enough
to be bothered about that.
> + seq_printf(sf, "wait_sum %llu\n", ws);
> + }
On 2018/7/23 5:31 PM, Peter Zijlstra wrote:
> On Wed, Jul 04, 2018 at 11:27:27AM +0800, 王贇 wrote:
>
>> @@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>> seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
>> seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>>
>> + if (schedstat_enabled() && tg != &root_task_group) {
>
> I put the variables here.
Will do that in next version :-)
>
>> + for_each_possible_cpu(i)
>> + ws += schedstat_val(tg->se[i]->statistics.wait_sum);
>
> This doesn't quite work on 32bit archs, but I'm not sure I care enough
> to be bothered about that.
Could easily overflow then... hope they won't really care
about the group conflicts.
Regards,
Michael Wang
>
>> + seq_printf(sf, "wait_sum %llu\n", ws);
>> + }
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition between
these groups for CPU resources is.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good way to accurately represent the
conflict with that information; what we need is the wait time at
the group level.
Thus we introduce the group's wait_sum to represent the conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' file is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes in wait_sum to tell how much a
task group is suffering in the fight for CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group spent X percent of the period waiting
for the CPU.
Signed-off-by: Michael Wang <[email protected]>
---
Since v2:
Declare variables inside branch (From Peter).
kernel/sched/core.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..2a7bb7c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6788,6 +6788,15 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
+ if (schedstat_enabled() && tg != &root_task_group) {
+ int i;
+ u64 ws = 0;
+
+ for_each_possible_cpu(i)
+ ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+ seq_printf(sf, "wait_sum %llu\n", ws);
+ }
+
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
--
1.8.3.1
Commit-ID: 3d6c50c27bd6418dceb51642540ecfcb8ca708c2
Gitweb: https://git.kernel.org/tip/3d6c50c27bd6418dceb51642540ecfcb8ca708c2
Author: Yun Wang <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:27:27 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 25 Jul 2018 11:41:05 +0200
sched/debug: Show the sum wait time of a task group
Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition is between
these groups for CPU resources.
Monitoring the wait time or sched_debug of each process could be
very expensive, and there is no good way to accurately represent the
conflict with that information; we need the wait time at the group level.
Thus we introduce the group's wait_sum to represent the resource conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.
The 'cpu.stat' is modified to show the statistic, like:
nr_periods 0
nr_throttled 0
throttled_time 0
wait_sum 2035098795584
Now we can monitor the changes of wait_sum to tell how much a
task group is suffering in the fight for CPU resources.
For example:
(wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
means the task group spent X percent of the period waiting
for the CPU.
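As a hypothetical worked example: on a 4-CPU machine sampled over a
one-second window (period_ns = 1000000000), a wait_sum delta of
2000000000 ns gives 2000000000 * 100 / (4 * 1000000000) = 50%, i.e. the
group's tasks collectively spent the equivalent of half the machine's
CPU time waiting to run during that window.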
Signed-off-by: Michael Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc177c06e490..2bc391a574e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6748,6 +6748,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
+ if (schedstat_enabled() && tg != &root_task_group) {
+ u64 ws = 0;
+ int i;
+
+ for_each_possible_cpu(i)
+ ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+
+ seq_printf(sf, "wait_sum %llu\n", ws);
+ }
+
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */