2022-01-11 09:56:14

by Cruz Zhao

[permalink] [raw]
Subject: [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup

There are two types of forced idle time: forced idle time from cookie'd
task and forced idle time from uncookie'd task. The forced idle time from
uncookie'd task is actually caused indirectly by the cookie'd task in the
runqueue, and it's more accurate to measure the capacity loss with the
sum of both.

This patch set accounts forced idle time for each cpu, to measure how long
the cpu is forced idle; this is displayed via /proc/schedstat. It also
accounts force idle time for each cgroup, to measure how long the cgroup
forced its SMT siblings into idle; this is displayed via
/sys/fs/cgroup/cpuacct/cpuacct.forceidle and
/sys/fs/cgroup/cpuacct/cpuacct.forceidle_percpu. It is worth noting that
the per-cpu "forced idle" time and the per-cgroup "force idle" time have
different meanings: the former is the time a cpu was forced into idle,
while the latter is the forced idle time the cgroup's tasks caused.

We can get the total system forced idle time by looking at the root cgroup,
and we can get how long each cgroup forced its SMT siblings into idle. If the
force idle time of a cgroup is high, that can be rectified by making some
changes (e.g. affinity, cpu budget, etc.) to the cgroup.

Cruz Zhao (3):
sched/core: Accounting forceidle time for all tasks except idle task
sched/core: Forced idle accounting per-cpu
sched/core: Force idle accounting per cgroup

include/linux/cgroup.h | 7 +++++
kernel/sched/core.c | 10 ++++--
kernel/sched/core_sched.c | 10 ++++--
kernel/sched/cpuacct.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 +++
kernel/sched/stats.c | 17 ++++++++--
6 files changed, 119 insertions(+), 8 deletions(-)

--
1.8.3.1



2022-01-11 09:56:16

by Cruz Zhao

[permalink] [raw]
Subject: [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task

There are two types of forced idle time: forced idle time from cookie'd
task and forced idle time from uncookie'd task. The forced idle time from
uncookie'd task is actually caused indirectly by the cookie'd task in the
runqueue, and it's more accurate to measure the capacity loss with the
sum of both.

Assuming cpu x and cpu y are a pair of SMT siblings, consider the
following scenarios:
1. There's a cookie'd task running on cpu x, and there're 4 uncookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from uncookie'd tasks); for cpu y, there will be 20% forced idle time
(from the cookie'd task).
2. There's an uncookie'd task running on cpu x, and there're 4 cookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from cookie'd tasks); for cpu y, there will be 20% forced idle time
(from the uncookie'd task).

Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced
similarly):
(cookie'd) taskset -c x stress-ng -c 1 -l 100
(uncookie'd) taskset -c y stress-ng -c 4 -l 100

In the above two scenarios, the total capacity loss is 1 cpu, but in
scenario 1 the cookie'd forced idle time tells us 20% cpu capacity loss,
and in scenario 2 it tells us 80% cpu capacity loss, neither of which is
accurate. It'll be more accurate to measure with the sum of cookie'd
forced idle time and uncookie'd forced idle time.
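
To make the arithmetic explicit (an idealized steady state, assuming the
scheduler gives each of the 5 tasks an equal ~20% share of the core):

scenario 1:
	cpu x: cookie'd task runs 20%, forced idle 80% (from uncookie'd tasks)
	cpu y: uncookie'd tasks run 80%, forced idle 20% (from the cookie'd task)
scenario 2:
	cpu x: uncookie'd task runs 20%, forced idle 80% (from cookie'd tasks)
	cpu y: cookie'd tasks run 80%, forced idle 20% (from the uncookie'd task)

In both cases the sum is 80% + 20% = 100% of one cpu, which matches the
real capacity loss, while cookie'd-only accounting reports 20% and 80%
respectively.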

Signed-off-by: Cruz Zhao <[email protected]>
---
kernel/sched/core.c | 3 +--
kernel/sched/core_sched.c | 2 +-
2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00..e8187e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5822,8 +5822,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
}

if (schedstat_enabled() && rq->core->core_forceidle_count) {
- if (cookie)
- rq->core->core_forceidle_start = rq_clock(rq->core);
+ rq->core->core_forceidle_start = rq_clock(rq->core);
rq->core->core_forceidle_occupation = occ;
}

diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb4567..c8746a9 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -277,7 +277,7 @@ void __sched_core_account_forceidle(struct rq *rq)
rq_i = cpu_rq(i);
p = rq_i->core_pick ?: rq_i->curr;

- if (!p->core_cookie)
+ if (p == rq_i->idle)
continue;

__schedstat_add(p->stats.core_forceidle_sum, delta);
--
1.8.3.1


2022-01-11 09:56:17

by Cruz Zhao

[permalink] [raw]
Subject: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu

Accounting for "forced idle" time per cpu, which is the time a cpu is
forced to idle by its SMT sibling.

As it's not accurate to measure the capacity loss only by cookie'd forced
idle time, and it's hard to trace the forced idle time caused by all the
uncookie'd tasks, we account the forced idle time from the perspective of
the cpu.

Forced idle time per cpu is displayed via /proc/schedstat; the forced
idle time of cpu x is in the 10th column of the row for cpu x, in ns.
It also requires that schedstats be enabled.
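
For illustration, the new field could be parsed from userspace with
something like the following (an untested sketch; it assumes the layout
added by this patch, with the forced idle time as the 10th value after
the "cpuN" label):

#include <stdio.h>
#include <string.h>

/* Untested sketch: return the forced idle time of @cpu in ns, i.e. the
 * 10th field after the "cpuN" label in /proc/schedstat, or -1 on error. */
static long long forceidle_ns(int cpu)
{
	char line[512], want[16];
	long long v[10];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f)
		return -1;
	snprintf(want, sizeof(want), "cpu%d ", cpu);
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, want, strlen(want)))
			continue;
		if (sscanf(line + strlen(want),
			   "%lld %lld %lld %lld %lld %lld %lld %lld %lld %lld",
			   &v[0], &v[1], &v[2], &v[3], &v[4],
			   &v[5], &v[6], &v[7], &v[8], &v[9]) != 10)
			break;
		fclose(f);
		return v[9];	/* forced idle time, ns */
	}
	fclose(f);
	return -1;
}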

Signed-off-by: Cruz Zhao <[email protected]>
---
kernel/sched/core.c | 7 ++++++-
kernel/sched/core_sched.c | 7 +++++--
kernel/sched/sched.h | 4 ++++
kernel/sched/stats.c | 17 +++++++++++++++--
4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8187e7..a224b916 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -285,8 +285,10 @@ static void __sched_core_flip(bool enabled)

sched_core_lock(cpu, &flags);

- for_each_cpu(t, smt_mask)
+ for_each_cpu(t, smt_mask) {
cpu_rq(t)->core_enabled = enabled;
+ cpu_rq(t)->in_forcedidle = false;
+ }

cpu_rq(cpu)->core->core_forceidle_start = 0;

@@ -5690,6 +5692,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
* another cpu during offline.
*/
rq->core_pick = NULL;
+ rq->in_forcedidle = false;
return __pick_next_task(rq, prev, rf);
}

@@ -5810,9 +5813,11 @@ static inline struct task_struct *pick_task(struct rq *rq)

rq_i->core_pick = p;

+ rq_i->in_forcedidle = false;
if (p == rq_i->idle) {
if (rq_i->nr_running) {
rq->core->core_forceidle_count++;
+ rq_i->in_forcedidle = true;
if (!fi_before)
rq->core->core_forceidle_seq++;
}
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index c8746a9..fe04805 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -242,7 +242,7 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
void __sched_core_account_forceidle(struct rq *rq)
{
const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
- u64 delta, now = rq_clock(rq->core);
+ u64 delta_per_idlecpu, delta, now = rq_clock(rq->core);
struct rq *rq_i;
struct task_struct *p;
int i;
@@ -254,7 +254,7 @@ void __sched_core_account_forceidle(struct rq *rq)
if (rq->core->core_forceidle_start == 0)
return;

- delta = now - rq->core->core_forceidle_start;
+ delta_per_idlecpu = delta = now - rq->core->core_forceidle_start;
if (unlikely((s64)delta <= 0))
return;

@@ -277,6 +277,9 @@ void __sched_core_account_forceidle(struct rq *rq)
rq_i = cpu_rq(i);
p = rq_i->core_pick ?: rq_i->curr;

+ if (rq_i->in_forcedidle)
+ rq_i->rq_forceidle_time += delta_per_idlecpu;
+
if (p == rq_i->idle)
continue;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be9..9377d91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1086,6 +1086,9 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+#ifdef CONFIG_SCHED_CORE
+ u64 rq_forceidle_time;
+#endif
#endif

#ifdef CONFIG_CPU_IDLE
@@ -1115,6 +1118,7 @@ struct rq {
unsigned int core_forceidle_seq;
unsigned int core_forceidle_occupation;
u64 core_forceidle_start;
+ bool in_forcedidle;
#endif
};

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde29..ea22a8c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
}
}

+#ifdef CONFIG_SCHED_CORE
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+ return rq->rq_forceidle_time;
+}
+#else
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+ return 0;
+}
+#endif
+
/*
* Current schedstat API version.
*
@@ -125,21 +135,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
seq_printf(seq, "timestamp %lu\n", jiffies);
} else {
struct rq *rq;
+ u64 rq_forceidle_time;
#ifdef CONFIG_SMP
struct sched_domain *sd;
int dcount = 0;
#endif
cpu = (unsigned long)(v - 2);
rq = cpu_rq(cpu);
+ rq_forceidle_time = get_rq_forceidle_time(rq);

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq_forceidle_time);

seq_printf(seq, "\n");

--
1.8.3.1


2022-01-11 09:56:21

by Cruz Zhao

[permalink] [raw]
Subject: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup

Accounting for "force idle" time per cgroup, which is the time the tasks
of the cgroup forced its SMT siblings into idle.

Force idle time per cgroup is displayed via
/sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle.
Force idle time per cgroup per cpu is displayed via
/sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle_percpu.
The unit is ns.
It also requires that schedstats be enabled.

We can get the total system forced idle time by looking at the root cgroup,
and we can get how long each cgroup forced its SMT siblings into idle. If the
force idle time of a cgroup is high, that can be rectified by making some
changes (e.g. affinity, cpu budget, etc.) to the cgroup.
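
For illustration, the per-cgroup total could be read with something like
this sketch (untested; a cgroup v1 cpuacct mount at /sys/fs/cgroup/cpuacct
is assumed):

#include <stdio.h>

/* Untested sketch: read a cgroup's total force idle time in ns from
 * cpuacct.forceidle; returns -1 on error. */
static long long cg_forceidle_ns(const char *cg)
{
	char path[256];
	long long ns;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/cpuacct/%s/cpuacct.forceidle", cg);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &ns) != 1)
		ns = -1;
	fclose(f);
	return ns;
}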

Signed-off-by: Cruz Zhao <[email protected]>
---
include/linux/cgroup.h | 7 +++++
kernel/sched/core_sched.c | 1 +
kernel/sched/cpuacct.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 87 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c1514..0c1b616 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,10 +774,17 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
#ifdef CONFIG_CGROUP_CPUACCT
void cpuacct_charge(struct task_struct *tsk, u64 cputime);
void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *task, u64 cputime);
+#endif
#else
static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
static inline void cpuacct_account_field(struct task_struct *tsk, int index,
u64 val) {}
+#ifdef CONFIG_SCHED_CORE
+static inline void cpuacct_account_forceidle(int cpu, struct task_struct *task,
+ u64 cputime) {}
+#endif
#endif

void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index fe04805..add8672 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -284,6 +284,7 @@ void __sched_core_account_forceidle(struct rq *rq)
continue;

__schedstat_add(p->stats.core_forceidle_sum, delta);
+ cpuacct_account_forceidle(i, p, delta);
}
}

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3d06c5e..b5c5d99 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -27,6 +27,9 @@ struct cpuacct {
/* cpuusage holds pointer to a u64-type object on every CPU */
u64 __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
+#ifdef CONFIG_SCHED_CORE
+ u64 __percpu *forceidle;
+#endif
};

static inline struct cpuacct *css_ca(struct cgroup_subsys_state *css)
@@ -46,9 +49,15 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
}

static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+#ifdef CONFIG_SCHED_CORE
+static DEFINE_PER_CPU(u64, root_cpuacct_forceidle);
+#endif
static struct cpuacct root_cpuacct = {
.cpustat = &kernel_cpustat,
.cpuusage = &root_cpuacct_cpuusage,
+#ifdef CONFIG_SCHED_CORE
+ .forceidle = &root_cpuacct_forceidle,
+#endif
};

/* Create a new CPU accounting group */
@@ -72,8 +81,18 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
if (!ca->cpustat)
goto out_free_cpuusage;

+#ifdef CONFIG_SCHED_CORE
+ ca->forceidle = alloc_percpu(u64);
+ if (!ca->forceidle)
+ goto out_free_cpustat;
+#endif
+
return &ca->css;

+#ifdef CONFIG_SCHED_CORE
+out_free_cpustat:
+ free_percpu(ca->cpustat);
+#endif
out_free_cpuusage:
free_percpu(ca->cpuusage);
out_free_ca:
@@ -290,6 +309,37 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+static u64 __forceidle_read(struct cpuacct *ca, int cpu)
+{
+ return *per_cpu_ptr(ca->forceidle, cpu);
+}
+static int cpuacct_percpu_forceidle_seq_show(struct seq_file *m, void *V)
+{
+ struct cpuacct *ca = css_ca(seq_css(m));
+ u64 percpu;
+ int i;
+
+ for_each_possible_cpu(i) {
+ percpu = __forceidle_read(ca, i);
+ seq_printf(m, "%llu ", (unsigned long long) percpu);
+ }
+ seq_printf(m, "\n");
+ return 0;
+}
+static u64 cpuacct_forceidle_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct cpuacct *ca = css_ca(css);
+ u64 totalforceidle = 0;
+ int i;
+
+ for_each_possible_cpu(i)
+ totalforceidle += __forceidle_read(ca, i);
+ return totalforceidle;
+}
+#endif
+
static struct cftype files[] = {
{
.name = "usage",
@@ -324,6 +374,16 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
.name = "stat",
.seq_show = cpuacct_stats_show,
},
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "forceidle",
+ .read_u64 = cpuacct_forceidle_read,
+ },
+ {
+ .name = "forceidle_percpu",
+ .seq_show = cpuacct_percpu_forceidle_seq_show,
+ },
+#endif
{ } /* terminate */
};

@@ -359,6 +419,25 @@ void cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
rcu_read_unlock();
}

+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
+{
+ struct cpuacct *ca;
+ u64 *fi;
+
+ rcu_read_lock();
+ /*
+ * We hold rq->core->__lock here, which protects the ca->forceidle
+ * percpu counters.
+ */
+ for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
+ fi = per_cpu_ptr(ca->forceidle, cpu);
+ *fi += cputime;
+ }
+ rcu_read_unlock();
+}
+#endif
+
struct cgroup_subsys cpuacct_cgrp_subsys = {
.css_alloc = cpuacct_css_alloc,
.css_free = cpuacct_css_free,
--
1.8.3.1


2022-01-11 23:52:18

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task

On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <[email protected]> wrote:
>
> There are two types of forced idle time: forced idle time from cookie'd
> task and forced idle time from uncookie'd task. The forced idle time from
> uncookie'd task is actually caused indirectly by the cookie'd task in the
> runqueue, and it's more accurate to measure the capacity loss with the
> sum of both.
>
> Assuming cpu x and cpu y are a pair of SMT siblings, consider the
> following scenarios:
> 1. There's a cookie'd task running on cpu x, and there're 4 uncookie'd
> tasks running on cpu y. For cpu x, there will be 80% forced idle time
> (from uncookie'd tasks); for cpu y, there will be 20% forced idle time
> (from the cookie'd task).
> 2. There's an uncookie'd task running on cpu x, and there're 4 cookie'd
> tasks running on cpu y. For cpu x, there will be 80% forced idle time
> (from cookie'd tasks); for cpu y, there will be 20% forced idle time
> (from the uncookie'd task).
>
> Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced
> similarly):
> (cookie'd) taskset -c x stress-ng -c 1 -l 100
> (uncookie'd) taskset -c y stress-ng -c 4 -l 100
>
> In the above two scenarios, the total capacity loss is 1 cpu, but in
> scenario 1 the cookie'd forced idle time tells us 20% cpu capacity loss,
> and in scenario 2 it tells us 80% cpu capacity loss, neither of which is
> accurate. It'll be more accurate to measure with the sum of cookie'd
> forced idle time and uncookie'd forced idle time.
>
> Signed-off-by: Cruz Zhao <[email protected]>
> ---

Thanks,

Reviewed-by: Josh Don <[email protected]>

2022-01-12 02:00:35

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu

On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <[email protected]> wrote:
>
> Accounting for "forced idle" time per cpu, which is the time a cpu is
> forced to idle by its SMT sibling.
>
> As it's not accurate to measure the capacity loss only by cookie'd forced
> idle time, and it's hard to trace the forced idle time caused by all the
> uncookie'd tasks, we account the forced idle time from the perspective of
> the cpu.
>
> Forced idle time per cpu is displayed via /proc/schedstat; the forced
> idle time of cpu x is in the 10th column of the row for cpu x, in ns.
> It also requires that schedstats be enabled.
>
> Signed-off-by: Cruz Zhao <[email protected]>
> ---

Two quick followup questions:

1) From your v1, I still wasn't quite sure if the per-cpu time was
useful or not for you versus a single overall sum (ie. I think other
metrics could be more useful for analyzing steal_cookie if that's what
you're specifically interested in). Do you definitely want the per-cpu
totals?

2) If your cgroup accounting patch is merged, do you still want this
patch? You can grab the global values from the root cgroup (assuming
you have cgroups enabled). The only potential gap is if you need
per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
https://lkml.kernel.org/r/%[email protected]%3E

2022-01-12 12:28:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu

On Tue, Jan 11, 2022 at 05:56:00PM +0800, Cruz Zhao wrote:

> @@ -1115,6 +1118,7 @@ struct rq {
> unsigned int core_forceidle_seq;
> unsigned int core_forceidle_occupation;
> u64 core_forceidle_start;
> + bool in_forcedidle;

naming is wrong

> #endif
> };
>
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 07dde29..ea22a8c 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> }
> }
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
> + return rq->rq_forceidle_time;
> +}
> +#else
> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
> + return 0;
> +}
> +#endif

indent is wrong, and if you put the #ifdef inside the function it'll be
smaller.
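
IOW, something like so (untested):

static inline u64 get_rq_forceidle_time(struct rq *rq)
{
#ifdef CONFIG_SCHED_CORE
	return rq->rq_forceidle_time;
#else
	return 0;
#endif
}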

2022-01-12 20:42:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup

Hello,

On Tue, Jan 11, 2022 at 05:56:01PM +0800, Cruz Zhao wrote:
> +#ifdef CONFIG_SCHED_CORE
> +void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
> +{
> + struct cpuacct *ca;
> + u64 *fi;
> +
> + rcu_read_lock();
> + /*
> + * We hold rq->core->__lock here, which protects the ca->forceidle
> + * percpu counters.
> + */
> + for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
> + fi = per_cpu_ptr(ca->forceidle, cpu);
> + *fi += cputime;
> + }

Please don't do this. Use rstat and integrate it with other stats.

Thanks.

--
tejun

2022-01-14 11:06:37

by Cruz Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu



On 2022/1/12 8:27 PM, Peter Zijlstra wrote:
> On Tue, Jan 11, 2022 at 05:56:00PM +0800, Cruz Zhao wrote:
>
>> @@ -1115,6 +1118,7 @@ struct rq {
>> unsigned int core_forceidle_seq;
>> unsigned int core_forceidle_occupation;
>> u64 core_forceidle_start;
>> + bool in_forcedidle;
>
> naming is wrong
>
>> #endif
>> };
>>
>> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
>> index 07dde29..ea22a8c 100644
>> --- a/kernel/sched/stats.c
>> +++ b/kernel/sched/stats.c
>> @@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
>> }
>> }
>>
>> +#ifdef CONFIG_SCHED_CORE
>> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
>> + return rq->rq_forceidle_time;
>> +}
>> +#else
>> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
>> + return 0;
>> +}
>> +#endif
>
> indent is wrong, and if you put the #ifdef inside the function it'll be
> smaller.

Thanks for the review and the suggestions; I'll fix these problems in the
next version.

Best,
Cruz Zhao

2022-01-14 11:13:18

by Cruz Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup



On 2022/1/13 4:42 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jan 11, 2022 at 05:56:01PM +0800, Cruz Zhao wrote:
>> +#ifdef CONFIG_SCHED_CORE
>> +void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
>> +{
>> + struct cpuacct *ca;
>> + u64 *fi;
>> +
>> + rcu_read_lock();
>> + /*
>> + * We hold rq->core->__lock here, which protects the ca->forceidle
>> + * percpu counters.
>> + */
>> + for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
>> + fi = per_cpu_ptr(ca->forceidle, cpu);
>> + *fi += cputime;
>> + }
>
> Please don't do this. Use rstat and integrate it with other stats.
>
> Thanks.
>

Thanks for the suggestion; I'll try to do this using rstat. BTW, is it ok
to integrate it with cgroup_base_stat?
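
Roughly something like the below (completely untested, the field and
index names are made up here):

/* extend the base stats with a force idle counter and flush it like
 * the existing cputime fields */
struct cgroup_base_stat {
	struct task_cputime cputime;
#ifdef CONFIG_SCHED_CORE
	u64 forceidle_sum;
#endif
};

and then account from __sched_core_account_forceidle() through the
existing cputime-field hook, with a new enum cpu_usage_stat index, e.g.:

	cgroup_account_cputime_field(p, CPUTIME_FORCEIDLE, delta);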

Best,
Cruz Zhao

2022-01-14 21:35:54

by Cruz Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu



On 2022/1/12 9:59 AM, Josh Don wrote:
> On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <[email protected]> wrote:
>>
>> Accounting for "forced idle" time per cpu, which is the time a cpu is
>> forced to idle by its SMT sibling.
>>
>> As it's not accurate to measure the capacity loss only by cookie'd forced
>> idle time, and it's hard to trace the forced idle time caused by all the
>> uncookie'd tasks, we account the forced idle time from the perspective of
>> the cpu.
>>
>> Forced idle time per cpu is displayed via /proc/schedstat; the forced
>> idle time of cpu x is in the 10th column of the row for cpu x, in ns.
>> It also requires that schedstats be enabled.
>>
>> Signed-off-by: Cruz Zhao <[email protected]>
>> ---
>
> Two quick followup questions:
>
> 1) From your v1, I still wasn't quite sure if the per-cpu time was
> useful or not for you versus a single overall sum (ie. I think other
> metrics could be more useful for analyzing steal_cookie if that's what
> you're specifically interested in). Do you definitely want the per-cpu
> totals?
>
IMO, the per-cpu forced idle time can help us see whether the forced
idle time is uniform within a core, or across all the cpus; it's a
kind of balance metric.

> 2) If your cgroup accounting patch is merged, do you still want this
> patch? You can grab the global values from the root cgroup (assuming
> you have cgroups enabled). The only potential gap is if you need
> per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
> https://lkml.kernel.org/r/%[email protected]%3E

Even if the cgroup accounting patch is merged, this patch is still needed.

Consider the following scenario:
Task p of cgroup A is running on cpu x, and it forces cpu y into idle
for t ns. The force idle time of cgroup A on cpu x increases by t ns, and
the forced idle time of cpu y increases by t ns.

That is, the force idle time of a cgroup is defined as the forced idle
time it caused, while the forced idle time of a cpu is defined as the
time the cpu is forced into idle; the two have different meanings.

And for SMT > 2, we cannot calculate the forced idle time of cpu x from
the cgroup interface.

Best,
Cruz Zhao

2022-01-14 22:51:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup

On Fri, Jan 14, 2022 at 07:13:12PM +0800, cruzzhao wrote:
> Thanks for the suggestion; I'll try to do this using rstat. BTW, is it ok
> to integrate it with cgroup_base_stat?

Yeah, I don't see a problem at least from cgroup / rstat POV.

Thanks.

--
tejun

2022-01-15 02:16:52

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu

> > 1) From your v1, I still wasn't quite sure if the per-cpu time was
> > useful or not for you versus a single overall sum (ie. I think other
> > metrics could be more useful for analyzing steal_cookie if that's what
> > you're specifically interested in). Do you definitely want the per-cpu
> > totals?
> >
> IMO, the per-cpu forced idle time can help us see whether the forced
> idle time is uniform within a core, or across all the cpus; it's a
> kind of balance metric.

Sure, I'm not opposed to it.

> > 2) If your cgroup accounting patch is merged, do you still want this
> > patch? You can grab the global values from the root cgroup (assuming
> > you have cgroups enabled). The only potential gap is if you need
> > per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
> > https://lkml.kernel.org/r/%[email protected]%3E
>
> Even if the cgroup accounting patch is merged, this patch is still needed.
>
> Consider the following scenario:
> Task p of cgroup A is running on cpu x, and it forces cpu y into idle
> for t ns. The force idle time of cgroup A on cpu x increases by t ns, and
> the forced idle time of cpu y increases by t ns.
>
> That is, the force idle time of a cgroup is defined as the forced idle
> time it caused, while the forced idle time of a cpu is defined as the
> time the cpu is forced into idle; the two have different meanings.
>
> And for SMT > 2, we cannot calculate the forced idle time of cpu x from
> the cgroup interface.

Ack. Note that the patch I linked above for per-cpu stats for
cgroup-v2 has been nack'd, so it is unlikely we'll have kernel exports
for cgroup-v2 per-cpu stats (but perhaps some export via BPF). In v2,
we could at least export the cgroup sum force idle time in cpu.stat,
if you feel it would be useful and want to add the accounting to
rstat. If you really just want the per-cpu root stats then I'm fine
with skipping cgroup for now.

Subject: [tip: sched/urgent] sched/core: Accounting forceidle time for all tasks except idle task

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: b171501f258063f5c56dd2c5fdf310802d8d7dc1
Gitweb: https://git.kernel.org/tip/b171501f258063f5c56dd2c5fdf310802d8d7dc1
Author: Cruz Zhao <[email protected]>
AuthorDate: Tue, 11 Jan 2022 17:55:59 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Jan 2022 12:09:59 +01:00

sched/core: Accounting forceidle time for all tasks except idle task

There are two types of forced idle time: forced idle time from cookie'd
task and forced idle time from uncookie'd task. The forced idle time from
uncookie'd task is actually caused indirectly by the cookie'd task in the
runqueue, and it's more accurate to measure the capacity loss with the
sum of both.

Assuming cpu x and cpu y are a pair of SMT siblings, consider the
following scenarios:
1. There's a cookie'd task running on cpu x, and there're 4 uncookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from uncookie'd tasks); for cpu y, there will be 20% forced idle time
(from the cookie'd task).
2. There's an uncookie'd task running on cpu x, and there're 4 cookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from cookie'd tasks); for cpu y, there will be 20% forced idle time
(from the uncookie'd task).

Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced
similarly):
(cookie'd) taskset -c x stress-ng -c 1 -l 100
(uncookie'd) taskset -c y stress-ng -c 4 -l 100

In the above two scenarios, the total capacity loss is 1 cpu, but in
scenario 1 the cookie'd forced idle time tells us 20% cpu capacity loss,
and in scenario 2 it tells us 80% cpu capacity loss, neither of which is
accurate. It'll be more accurate to measure with the sum of cookie'd
forced idle time and uncookie'd forced idle time.

Signed-off-by: Cruz Zhao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Josh Don <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/core.c | 3 +--
kernel/sched/core_sched.c | 2 +-
2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83872f9..0d2ab2a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5822,8 +5822,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}

if (schedstat_enabled() && rq->core->core_forceidle_count) {
- if (cookie)
- rq->core->core_forceidle_start = rq_clock(rq->core);
+ rq->core->core_forceidle_start = rq_clock(rq->core);
rq->core->core_forceidle_occupation = occ;
}

diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb4567..c8746a9 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -277,7 +277,7 @@ void __sched_core_account_forceidle(struct rq *rq)
rq_i = cpu_rq(i);
p = rq_i->core_pick ?: rq_i->curr;

- if (!p->core_cookie)
+ if (p == rq_i->idle)
continue;

__schedstat_add(p->stats.core_forceidle_sum, delta);