2023-05-12 10:25:18

by Dietmar Eggemann

Subject: [PATCH v2 0/2] sched: Consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection

This series implements the idea of factoring CPU runnable_avg into the
CPU utilization getter functions (so-called 'runnable boosting', sketched
below) as a way to take CPU contention into account for:

(a) CPU frequency
(b) EAS' max util and
(c) 'migrate_util' type load-balance busiest CPU selection.
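
The core of the boosting, as implemented in patch 2/2 (cpu_util()), is to
take the max of the CFS rq's util_avg and runnable_avg whenever the new
'boost' parameter is set. A minimal sketch of that step:

	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
	unsigned long runnable = boost ? READ_ONCE(cfs_rq->avg.runnable_avg) : 0;

	util = max(util, runnable);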

Tests:

for (a) and (b):

The testcase is Jankbench (all subtests, 10 iterations) on a Pixel6
(Android 12) with a mainline v5.18 kernel and forward-ported task
scheduler patches.

Uclamp has been deactivated so that the Android Dynamic Performance
Framework (ADPF) 'CPU performance hints' feature (userspace task
boosting via uclamp_min) does not interfere.

Max_frame_duration:
+-----------------+------------+
| kernel | value [ms] |
+-----------------+------------+
| base | 163.061513 |
| runnable | 161.991705 |
+-----------------+------------+

Mean_frame_duration:
+-----------------+------------+----------+
| kernel | value [ms] | diff [%] |
+-----------------+------------+----------+
| base | 18.0 | 0.0 |
| runnable | 12.7 | -29.43 |
+-----------------+------------+----------+

Jank percentage (Jank deadline 16ms):
+-----------------+------------+----------+
| kernel | value [%] | diff [%] |
+-----------------+------------+----------+
| base | 3.6 | 0.0 |
| runnable | 1.0 | -68.86 |
+-----------------+------------+----------+

Power usage [mW] (total - all CPUs):
+-----------------+------------+----------+
| kernel | value [mW] | diff [%] |
+-----------------+------------+----------+
| base | 129.5 | 0.0 |
| runnable | 134.3 | 3.71* |
+-----------------+------------+----------+

* Power usage went up from 129.3 mW (-0.15%) in v1 to 134.3 mW (3.71%)
whereas all the other benchmark numbers stayed roughly the same. This
is most likely because 'runnable boosting' is now also used for EAS'
max util detection, so tasks end up running on non-little CPUs more
often.

for (c):

The testcase is 'perf bench sched messaging' on an Arm64 Ampere Altra
with 160 CPUs (sched domains = {MC, DIE, NUMA}), which shows a small
improvement:

perf stat --null --repeat 10 -- perf bench sched messaging -t -g 1 -l 2000

0.4869 +- 0.0173 seconds time elapsed (+- 3.55%) ->
0.4377 +- 0.0147 seconds time elapsed (+- 3.36%)

Chen Yu tested v1** with schbench, hackbench, netperf and tbench on an
Intel Sapphire Rapids system with 2x56C/112T = 224 CPUs, which showed
no obvious difference overall and some small improvements on tbench:

https://lkml.kernel.org/r/ZFSr4Adtx1ZI8hoc@chenyu5-mobl1

** The implementation for (c) hasn't changed in v2.

v1 -> v2:

(1) Refactor CPU utilization getter functions, let cpu_util_cfs() call
cpu_util_next() (now cpu_util()).

(2) Consider CPU contention in EAS (find_energy_efficient_cpu() ->
eenv_pd_max_util()) next to schedutil (sugov_get_util()) as well so
that EAS' and schedutil's views on CPU frequency selection are in
sync.

(3) Move 'util_avg = max(util_avg, runnable_avg)' from
cpu_boosted_util_cfs() to cpu_util_next() (now cpu_util()) so that
EAS can use it too.

(4) Rework patch header.

(5) Add test results (JankbenchX on Pixel6 to test changes in schedutil
and EAS) and 'perf bench sched messaging' on Arm64 Ampere Altra for
CFS load-balance (find_busiest_queue()).


Dietmar Eggemann (2):
sched/fair: Refactor CPU utilization functions
sched/fair, cpufreq: Introduce 'runnable boosting'

kernel/sched/core.c | 2 +-
kernel/sched/cpufreq_schedutil.c | 3 +-
kernel/sched/fair.c | 72 +++++++++++++++++++++++++-------
kernel/sched/sched.h | 49 +---------------------
4 files changed, 63 insertions(+), 63 deletions(-)

--
2.25.1



2023-05-12 10:26:25

by Dietmar Eggemann

Subject: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On mobile devices, the responsiveness of the Per Entity Load Tracking
(PELT) util_avg to utilization changes during task ramp-up is still
considered too low.

In Android this manifests itself in the first frames of a UI activity
being very prone to become jank frames (frames which don't meet the
required frame rendering time, e.g. 16ms @ 60Hz) since the CPU
frequency is normally low at this point and has to ramp up quickly.

The beginning of a UI activity is also characterized by CPU contention,
especially on little CPUs. Current little CPUs can have an original CPU
capacity of only ~150, which means that the actual CPU capacity at
lower frequencies can be much smaller still.

Schedutil maps CPU util_avg into a CPU frequency request via:

util = effective_cpu_util(..., cpu_util_cfs(cpu), ...) ->
util = map_util_perf(util) -> freq = map_util_freq(util, ...)
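
For reference, map_util_perf() adds ~25% headroom and map_util_freq()
scales linearly by capacity. A minimal sketch, assuming the
include/linux/sched/cpufreq.h helpers of this kernel version:

	static inline unsigned long map_util_perf(unsigned long util)
	{
		/* add 25% frequency headroom: util -> util * 1.25 */
		return util + (util >> 2);
	}

	static inline unsigned long map_util_freq(unsigned long util,
						  unsigned long freq,
						  unsigned long cap)
	{
		/* map util linearly into the [0..freq] frequency range */
		return freq * util / cap;
	}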

CPU contention for CFS tasks can be detected by the 'CPU runnable_avg >
util_avg' condition in cpu_util_cfs(..., boost) -> cpu_util(..., boost).
Schedutil activates 'runnable boosting' by setting the new parameter
'boost = 1'.

To be in sync with schedutil's CPU frequency selection, Energy Aware
Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util
detection.

Moreover, 'runnable boosting' is also used in load balancing for
busiest CPU selection when the migration type is 'migrate_util', i.e.
only in sched domains which don't have the SD_SHARE_PKG_RESOURCES flag
set.

Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/cpufreq_schedutil.c | 3 ++-
kernel/sched/fair.c | 29 ++++++++++++++++++-----------
kernel/sched/sched.h | 4 ++--
4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 898fa3bc2765..8b776db1d24c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7441,7 +7441,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,

unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu, 0), ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e3211455b203..3b902f533214 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -155,10 +155,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,

static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
+ unsigned long util = cpu_util_cfs(sg_cpu->cpu, 1);
struct rq *rq = cpu_rq(sg_cpu->cpu);

sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
FREQUENCY_UTIL, NULL);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1220cfbee258..3a10fe5988d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1800,7 +1800,7 @@ static void update_numa_stats(struct task_numa_env *env,

ns->load += cpu_load(rq);
ns->runnable += cpu_runnable(rq);
- ns->util += cpu_util_cfs(cpu);
+ ns->util += cpu_util_cfs(cpu, 0);
ns->nr_running += rq->cfs.h_nr_running;
ns->compute_capacity += capacity_of(cpu);

@@ -6184,9 +6184,10 @@ static inline bool cpu_overutilized(int cpu)
{
unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+ unsigned long util = cpu_util_cfs(cpu, 0);

/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(util, rq_util_min, rq_util_max, cpu);
}

static inline void update_overutilized_status(struct rq *rq)
@@ -7149,10 +7150,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* Predicts what cpu_util(@cpu) would return if @p was removed from @cpu
* (@dst_cpu = -1) or migrated to @dst_cpu.
*/
-static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
+static unsigned long
+cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+ unsigned long runnable;
+
+ runnable = boost ? READ_ONCE(cfs_rq->avg.runnable_avg) : 0;
+ util = max(util, runnable);

/*
* If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
@@ -7169,6 +7175,7 @@ static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
unsigned long util_est;

util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util_est = max(util_est, runnable);

/*
* During wake-up @p isn't enqueued yet and doesn't contribute
@@ -7239,9 +7246,9 @@ static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
*
* Return: (Estimated) utilization for the specified CPU.
*/
-unsigned long cpu_util_cfs(int cpu)
+unsigned long cpu_util_cfs(int cpu, int boost)
{
- return cpu_util(cpu, NULL, -1);
+ return cpu_util(cpu, NULL, -1, boost);
}

/*
@@ -7263,7 +7270,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
p = NULL;

- return cpu_util(cpu, p, -1);
+ return cpu_util(cpu, p, -1, 0);
}

/*
@@ -7331,7 +7338,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
int cpu;

for_each_cpu(cpu, pd_cpus) {
- unsigned long util = cpu_util(cpu, p, -1);
+ unsigned long util = cpu_util(cpu, p, -1, 0);

busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
}
@@ -7355,7 +7362,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,

for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
- unsigned long util = cpu_util(cpu, p, dst_cpu);
+ unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
unsigned long cpu_util;

/*
@@ -7501,7 +7508,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;

- util = cpu_util(cpu, p, cpu);
+ util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);

/*
@@ -9443,7 +9450,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
unsigned long load = cpu_load(rq);

sgs->group_load += load;
- sgs->group_util += cpu_util_cfs(i);
+ sgs->group_util += cpu_util_cfs(i, 0);
sgs->group_runnable += cpu_runnable(rq);
sgs->sum_h_nr_running += rq->cfs.h_nr_running;

@@ -10561,7 +10568,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
break;

case migrate_util:
- util = cpu_util_cfs(i);
+ util = cpu_util_cfs(i, 1);

/*
* Don't try to pull utilization from a CPU with one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f78c0f85cc76..b4706874eec1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2947,7 +2947,7 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
}


-extern unsigned long cpu_util_cfs(int cpu);
+extern unsigned long cpu_util_cfs(int cpu, int boost);

static inline unsigned long cpu_util_rt(struct rq *rq)
{
@@ -3037,7 +3037,7 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
if (!static_branch_likely(&sched_uclamp_used))
return false;

- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
+ rq_util = cpu_util_cfs(cpu_of(rq), 0) + cpu_util_rt(rq);
max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
--
2.25.1


2023-05-12 11:31:35

by Peter Zijlstra

Subject: Re: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On Fri, May 12, 2023 at 12:10:29PM +0200, Dietmar Eggemann wrote:

> -static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
> +static unsigned long
> +cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + unsigned long runnable;
> +
> + runnable = boost ? READ_ONCE(cfs_rq->avg.runnable_avg) : 0;
> + util = max(util, runnable);
>
if (boost)
util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));


> @@ -7239,9 +7246,9 @@ static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
> *
> * Return: (Estimated) utilization for the specified CPU.
> */

Given that cpu_util() is the base function should this comment move
there?

> -unsigned long cpu_util_cfs(int cpu)
> +unsigned long cpu_util_cfs(int cpu, int boost)
> {
> - return cpu_util(cpu, NULL, -1);
> + return cpu_util(cpu, NULL, -1, boost);
> }

AFAICT the @boost argument is always a constant (0 or 1). Would it not
make more sense to simply add:

unsigned long cpu_util_cfs_boost(int cpu)
{
return cpu_util(cpu, NULL, -1, 1);
}

and use that in the few sites that actually need it?

2023-05-12 12:03:43

by Peter Zijlstra

Subject: Re: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On Fri, May 12, 2023 at 12:10:29PM +0200, Dietmar Eggemann wrote:

> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e3211455b203..3b902f533214 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -155,10 +155,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> + unsigned long util = cpu_util_cfs(sg_cpu->cpu, 1);
> struct rq *rq = cpu_rq(sg_cpu->cpu);
>
> sg_cpu->bw_dl = cpu_bw_dl(rq);
> - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
> + sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
> FREQUENCY_UTIL, NULL);
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1220cfbee258..3a10fe5988d6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

> @@ -7355,7 +7362,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>
> for_each_cpu(cpu, pd_cpus) {
> struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> - unsigned long util = cpu_util(cpu, p, dst_cpu);
> + unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> unsigned long cpu_util;
>
> /*

> @@ -10561,7 +10568,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> break;
>
> case migrate_util:
> - util = cpu_util_cfs(i);
> + util = cpu_util_cfs(i, 1);
>
> /*
> * Don't try to pull utilization from a CPU with one

When you move that comment from cpu_util_cfs() to cpu_util() please also
add a paragraph about why boost=1 and why these locations, because I'm
sure we're going to be asking ourselves that at some point.



2023-05-12 14:55:09

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On 12/05/2023 13:32, Peter Zijlstra wrote:
> On Fri, May 12, 2023 at 12:10:29PM +0200, Dietmar Eggemann wrote:

[...]

>> @@ -10561,7 +10568,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> break;
>>
>> case migrate_util:
>> - util = cpu_util_cfs(i);
>> + util = cpu_util_cfs(i, 1);
>>
>> /*
>> * Don't try to pull utilization from a CPU with one
>
> When you move that comment from cpu_util_cfs() to cpu_util() please also
> add a paragraph about why boost=1 and why these locations, because I'm
> sure we're going to be asking ourselves that at some point.

Will do.


2023-05-12 14:55:29

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On 12/05/2023 13:22, Peter Zijlstra wrote:
> On Fri, May 12, 2023 at 12:10:29PM +0200, Dietmar Eggemann wrote:
>
>> -static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
>> +static unsigned long
>> +cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>> {
>> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
>> + unsigned long runnable;
>> +
>> + runnable = boost ? READ_ONCE(cfs_rq->avg.runnable_avg) : 0;
>> + util = max(util, runnable);
>>
> if (boost)
> util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));

I need the util_est = max(util_est, runnable) further down as well. Just
want to fetch runnable only once.

util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70

max(70 + 5, 60 + 10) != max(70 + 5, 70 + 10) when dst_cpu == cpu
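
I.e. max(75, 70) = 75 if only util is boosted vs. max(75, 80) = 80 once
util_est is raised to runnable as well, hence the second max() further
down.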

>> @@ -7239,9 +7246,9 @@ static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
>> *
>> * Return: (Estimated) utilization for the specified CPU.
>> */
>
> Given that cpu_util() is the base function should this comment move
> there?

Yes, will move it.

>> -unsigned long cpu_util_cfs(int cpu)
>> +unsigned long cpu_util_cfs(int cpu, int boost)
>> {
>> - return cpu_util(cpu, NULL, -1);
>> + return cpu_util(cpu, NULL, -1, boost);
>> }
>
> AFAICT the @boost argument is always a constant (0 or 1). Would it not
> make more sense to simply add:
>
> unsigned long cpu_util_cfs_boost(int cpu)
> {
> return cpu_util(cpu, NULL, -1, 1);
> }
>
> and use that in the few sites that actually need it?

Yes, will change it.


2023-05-13 09:45:42

by Peter Zijlstra

Subject: Re: [PATCH v2 2/2] sched/fair, cpufreq: Introduce 'runnable boosting'

On Fri, May 12, 2023 at 04:39:51PM +0200, Dietmar Eggemann wrote:
> On 12/05/2023 13:22, Peter Zijlstra wrote:
> > On Fri, May 12, 2023 at 12:10:29PM +0200, Dietmar Eggemann wrote:
> >
> >> -static unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu)
> >> +static unsigned long
> >> +cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >> {
> >> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> >> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> >> + unsigned long runnable;
> >> +
> >> + runnable = boost ? READ_ONCE(cfs_rq->avg.runnable_avg) : 0;
> >> + util = max(util, runnable);
> >>
> > if (boost)
> > util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
>
> I need the util_est = max(util_est, runnable) further down as well. Just
> want to fetch runnable only once.

Ah, fair enough; but still, please use a regular if().