2024-01-08 14:07:22

by Ingo Molnar

Subject: [GIT PULL] Scheduler changes for v6.8


Linus,

Please pull the latest sched/core git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

# HEAD: cdb3033e191fd03da2d7da23b9cd448dfa180a8e Merge branch 'sched/urgent' into sched/core, to pick up pending v6.7 fixes for the v6.8 merge window
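
(Replaying the pull is plain git, roughly:

   git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git \
           sched-core-2024-01-08
   git merge --log FETCH_HEAD

nothing here is specific to this tree.)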

Scheduler changes for v6.8:

- Energy scheduling:

   - Consolidate how the max compute capacity is used in the
     scheduler, and how we calculate the frequency for a given
     level of utilization (the util -> frequency mapping is
     sketched after this list).

   - Rework the interface between the scheduler and the
     schedutil governor.

   - Simplify the util_est logic.

- Deadline scheduler:

   - Work more towards reducing SCHED_DEADLINE starvation of
     low-priority (e.g. SCHED_OTHER) tasks when higher-priority
     tasks monopolize CPU cycles, via the introduction of
     'deadline servers' (nested/2-level scheduling).
     "Fair servers" to make use of this facility are not
     introduced yet.

- EEVDF:

   - Introduce an O(1) fastpath for EEVDF task selection, enabled
     by sorting the rbtree by virtual deadline.

- NUMA balancing:

   - Tune the NUMA-balancing vma scanning logic some more, to
     better distribute the probability of a particular vma
     getting scanned.

- Plus misc fixes, cleanups and updates.
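
As background for the first item above: the util -> frequency mapping
that these patches consolidate boils down to two small helpers, roughly
as below. This is a simplified sketch modelled on the kernel's
map_util_perf()/map_util_freq() helpers; the worked numbers are
illustrative only:

   /* Add ~25% DVFS headroom so the CPU isn't run at 100% utilization. */
   static inline unsigned long map_util_perf(unsigned long util)
   {
           return util + (util >> 2);
   }

   /* Scale utilization (0..cap) linearly into the frequency range. */
   static inline unsigned long map_util_freq(unsigned long util,
                                             unsigned long freq,
                                             unsigned long cap)
   {
           return freq * util / cap;
   }

   /*
    * E.g. util = 512 out of cap = 1024 at a 3700000 kHz reference:
    * map_util_perf(512) = 640; 3700000 * 640 / 1024 = 2312500 kHz.
    */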

Thanks,

Ingo

------------------>
Abel Wu (2):
sched/eevdf: Sort the rbtree by virtual deadline
sched/eevdf: O(1) fastpath for task selection

Elliot Berman (1):
freezer,sched: Clean saved_state when restoring it during thaw

Frederic Weisbecker (2):
sched/cpuidle: Comment about timers requirements VS idle handler
sched/timers: Explain why idle task schedules out on remote timer enqueue

Paul E. McKenney (1):
sched: Use WRITE_ONCE() for p->on_rq

Peter Zijlstra (6):
sched: Unify runtime accounting across classes
sched: Remove vruntime from trace_sched_stat_runtime()
sched: Unify more update_curr*()
sched/deadline: Collect sched_dl_entity initialization
sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
sched/deadline: Introduce deadline servers

Pierre Gondois (1):
sched/fair: Use all little CPUs for CPU-bound workloads

Raghavendra K T (1):
sched/numa: Fix mm numa_scan_seq based unconditional scan

Vincent Guittot (13):
sched/pelt: Avoid underestimation of task utilization
sched/cpufreq: Rework schedutil governor performance estimation
sched/cpufreq: Rework iowait boost
sched/topology: Add a new arch_scale_freq_ref() method
cpufreq: Use the fixed and coherent frequency for scaling capacity
cpufreq/schedutil: Use a fixed reference frequency
energy_model: Use a fixed reference frequency
cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
cpufreq/cppc: Set the frequency used for computing the capacity
arm64/amu: Use capacity_ref_freq() to set AMU ratio
sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
sched/fair: Simplify util_est
sched/fair: Fix tg->load when offlining a CPU

Wang Jinchao (1):
sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()

Wenyu Huang (1):
sched/doc: Update documentation after renames and synchronize Chinese version

Yiwei Lin (1):
sched/fair: Update min_vruntime for reweight_entity() correctly


Documentation/scheduler/sched-design-CFS.rst | 8 +-
Documentation/scheduler/schedutil.rst | 7 +-
.../zh_CN/scheduler/sched-design-CFS.rst | 8 +-
.../translations/zh_CN/scheduler/schedutil.rst | 7 +-
arch/arm/include/asm/topology.h | 1 +
arch/arm64/include/asm/topology.h | 1 +
arch/arm64/kernel/topology.c | 26 +-
arch/riscv/include/asm/topology.h | 1 +
drivers/acpi/cppc_acpi.c | 104 +++++
drivers/base/arch_topology.c | 56 ++-
drivers/cpufreq/cppc_cpufreq.c | 139 +-----
drivers/cpufreq/cpufreq.c | 4 +-
include/acpi/cppc_acpi.h | 2 +
include/linux/arch_topology.h | 8 +
include/linux/cpufreq.h | 1 +
include/linux/energy_model.h | 7 +-
include/linux/mm_types.h | 3 +
include/linux/sched.h | 75 ++--
include/linux/sched/topology.h | 8 +
include/trace/events/sched.h | 15 +-
kernel/freezer.c | 1 +
kernel/sched/core.c | 140 +++---
kernel/sched/cpufreq_schedutil.c | 90 ++--
kernel/sched/deadline.c | 477 +++++++++++++--------
kernel/sched/debug.c | 18 +-
kernel/sched/fair.c | 449 ++++++++++---------
kernel/sched/features.h | 1 -
kernel/sched/idle.c | 30 ++
kernel/sched/pelt.h | 4 +-
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 135 ++----
kernel/sched/stop_task.c | 13 +-
32 files changed, 1054 insertions(+), 800 deletions(-)


2024-01-09 04:06:07

by pr-tracker-bot

Subject: Re: [GIT PULL] Scheduler changes for v6.8

The pull request you sent on Mon, 8 Jan 2024 15:07:07 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/bfe8eb3b85c571f7e94e1039f59b462505b8e0fc

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2024-01-10 22:19:44

by Linus Torvalds

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Mon, 8 Jan 2024 at 06:07, Ingo Molnar <[email protected]> wrote:
>
> Please pull the latest sched/core git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08

Just a note that I'm currently bisecting into this merge for a
horrendous performance regression.

It makes my empty kernel build go from 22 seconds to 44 seconds, and
makes a full kernel build enormously slower too.

I haven't finished the bisection, but it's now inside *just* this
pull, so I can already tell that I'm going to revert something in
here, because this has been making my merge window miserable.

You've been warned,

Linus
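
(Bisecting into a merge like this is plain git bisect; a rough sketch
with illustrative endpoints, where the "empty" build is a no-op rebuild:

   git bisect start
   git bisect bad HEAD        # tree that shows the regression
   git bisect good v6.7       # last known-good baseline
   # at each step: build, boot, then time a no-op rebuild:
   #   make -j$(nproc) && time make -j$(nproc)
   git bisect good            # or: git bisect bad, per the timing
   # ...repeat until git prints '<sha> is the first bad commit')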

2024-01-10 22:41:56

by Linus Torvalds

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Wed, 10 Jan 2024 at 14:19, Linus Torvalds
<[email protected]> wrote:
>
> Just a note that I'm currently bisecting into this merge for a
> horrendous performance regression.
>
> It makes my empty kernel build go from 22 seconds to 44 seconds, and
> makes a full kernel build enormously slower too.
>
> I haven't finished the bisection, but it's now inside *just* this
> pull, so I can already tell that I'm going to revert something in
> here, because this has been making my merge window miserable.

It's one of these two:

f12560779f9d sched/cpufreq: Rework iowait boost
9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation

one more boot to go, then I'll try to revert whichever causes my
machine to perform horribly much worse.

Linus

2024-01-10 22:57:42

by Linus Torvalds

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
<[email protected]> wrote:
>
> It's one of these two:
>
> f12560779f9d sched/cpufreq: Rework iowait boost
> 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
>
> one more boot to go, then I'll try to revert whichever causes my
> machine to perform horribly much worse.

I guess it should come as no surprise that the result is

9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit

but to revert cleanly I will have to revert all of

b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
f12560779f9d ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
performance estimation")

This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

I'll keep that revert in my private test-tree for now (so that I have
a working machine again), but I'll move it to my main branch soon
unless somebody has a quick fix for this problem.

Linus
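
(Mechanically, such a clean revert is the three commits above, listed
newest-first so each one applies without conflicts, e.g.:

   git revert b3edde44e5d4 f12560779f9d 9c0b4bb7f630

git revert processes the commits in the order given.)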

2024-01-11 08:12:02

by Vincent Guittot

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Wed, 10 Jan 2024 at 14:57:14 (-0800), Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <[email protected]> wrote:
> >
> > It's one of these two:
> >
> > f12560779f9d sched/cpufreq: Rework iowait boost
> > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
>
> I guess it should come as no surprise that the result is
>
> 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
>
> but to revert cleanly I will have to revert all of
>
> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> f12560779f9d ("sched/cpufreq: Rework iowait boost")
> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> performance estimation")
>
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

Could you confirm that the cpufreq governor is schedutil and the driver is
amd-pstate on your system?
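
(Both are visible in sysfs, e.g.:

   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

where the values in question would be 'schedutil' and 'amd-pstate'.)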

I'm also interested in the amd_pstate output, to confirm that it uses the
adjust_perf callback.

I assume you don't use the uclamp feature, and AMD doesn't use EAS; that
leaves the change to the 'min' parameter of adjust_perf, which was probably
always 0 unless you use the deadline scheduler, and which now also takes
IRQ pressure into account.

Could you try the patch below, which restores the previous min value?

---
kernel/sched/cpufreq_schedutil.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..3fe8ac6ce9cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -194,10 +194,11 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ struct rq *rq = cpu_rq(sg_cpu->cpu);

util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
- sg_cpu->bw_min = min;
+ sg_cpu->bw_min = cpu_bw_dl(rq);
sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
}

@@ -442,7 +443,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

- cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_min),
sg_cpu->util, max_cap);

sg_cpu->sg_policy->last_freq_update_time = time;
--
2.34.1


>
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.
>
> Linus

2024-01-11 09:35:02

by Ingo Molnar

Subject: Re: [GIT PULL] Scheduler changes for v6.8


* Linus Torvalds <[email protected]> wrote:

> On Mon, 8 Jan 2024 at 06:07, Ingo Molnar <[email protected]> wrote:
> >
> > Please pull the latest sched/core git tree from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2024-01-08
>
> Just a note that I'm currently bisecting into this merge for a
> horrendous performance regression.
>
> It makes my empty kernel build go from 22 seconds to 44 seconds, and
> makes a full kernel build enormously slower too.

Ouch, that's horrible ...

> I haven't finished the bisection, but it's now inside *just* this
> pull, so I can already tell that I'm going to revert something in
> here, because this has been making my merge window miserable.

Just as a quick test, does switching to the 'performance' governor work
around the regression:

for ((cpu=0; cpu < $(nproc); cpu++)); do echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done
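
(Equivalently, iterating over the sysfs nodes directly - either form
needs root:

   for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
           echo performance > "$g"
   done
)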

Thanks,

Ingo

2024-01-11 11:09:58

by Ingo Molnar

Subject: [GIT PULL] scheduler fixes


* Linus Torvalds <[email protected]> wrote:

> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <[email protected]> wrote:
> >
> > It's one of these two:
> >
> > f12560779f9d sched/cpufreq: Rework iowait boost
> > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
>
> I guess it should come as no surprise that the result is
>
> 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
>
> but to revert cleanly I will have to revert all of
>
> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> f12560779f9d ("sched/cpufreq: Rework iowait boost")
> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> performance estimation")
>
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
>
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.

Thanks a lot for bisecting this, and ack on the revert in any case, these
are relatively fresh changes that clearly didn't get enough testing - sorry!

I also made the revert in sched/urgent & added a changelog, which you can
pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-urgent-2024-01-11

# HEAD: 250ce3c1169743f3575cc5937fccd72380052795 Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits

Revert recent changes to the schedutil logic, to address a bad
performance regression that increased kernel build time on Linus's
64-CPU desktop system substantially.

Lightly build and boot tested.

Thanks,

Ingo

------------------>
Ingo Molnar (1):
Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits


include/linux/energy_model.h | 1 +
kernel/sched/core.c | 90 +++++++++++++++++++++++-----------------
kernel/sched/cpufreq_schedutil.c | 90 ++++++++++++----------------------------
kernel/sched/fair.c | 22 ++--------
kernel/sched/sched.h | 84 +++++++++++++++++++++++++++++++++----
5 files changed, 160 insertions(+), 127 deletions(-)

2024-01-11 11:20:36

by tip-bot2 for Ingo Molnar

Subject: [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 250ce3c1169743f3575cc5937fccd72380052795
Gitweb: https://git.kernel.org/tip/250ce3c1169743f3575cc5937fccd72380052795
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 11 Jan 2024 11:45:17 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 11 Jan 2024 11:51:10 +01:00

Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits

This reverts the following commits:

b3edde44e5d450 ("cpufreq/schedutil: Use a fixed reference frequency")
f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Because Linus reported a bad performance regression with the
schedutil governor that increased the time his empty kernel
build took from 22 to 44 seconds (and can be similarly measured
in full builds as well) - and bisected it back to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commits.

Reported-by: Linus Torvalds <[email protected]>
Bisected-by: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
include/linux/energy_model.h | 1 +-
kernel/sched/core.c | 90 +++++++++++++++++--------------
kernel/sched/cpufreq_schedutil.c | 90 +++++++++----------------------
kernel/sched/fair.c | 22 +-------
kernel/sched/sched.h | 84 ++++++++++++++++++++++++++---
5 files changed, 160 insertions(+), 127 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e0..c19e7ef 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
scale_cpu = arch_scale_cpu_capacity(cpu);
ref_freq = arch_scale_freq_ref(cpu);

+ max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ref_freq, scale_cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be49..2de77a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7467,13 +7467,18 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max)
+ enum cpu_util_type type,
+ struct task_struct *p)
{
- unsigned long util, irq, scale;
+ unsigned long dl_util, util, irq, max;
struct rq *rq = cpu_rq(cpu);

- scale = arch_scale_cpu_capacity(cpu);
+ max = arch_scale_cpu_capacity(cpu);
+
+ if (!uclamp_is_used() &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+ return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7481,49 +7486,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
- if (unlikely(irq >= scale)) {
- if (min)
- *min = scale;
- if (max)
- *max = scale;
- return scale;
- }
-
- if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
- /*
- * When an RT task is runnable and uclamp is not used, we must
- * ensure that the task will run at maximum compute capacity.
- */
- if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
- *min = max(*min, scale);
- }
+ if (unlikely(irq >= max))
+ return max;

/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- util += cpu_util_dl(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_rq_util_with(rq, util, p);
+
+ dl_util = cpu_util_dl(rq);

/*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
+ * For frequency selection we do not make cpu_util_dl() a permanent part
+ * of this sum because we want to use cpu_bw_dl() later on, but we need
+ * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+ * that we select f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us to not quite hit
+ * saturation when we should -- something for later.
*/
- if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ if (util + dl_util >= max)
+ return max;

- if (util >= scale)
- return scale;
+ /*
+ * OTOH, for energy computation we need the estimated running time, so
+ * include util_dl and ignore dl_bw.
+ */
+ if (type == ENERGY_UTIL)
+ util += dl_util;

/*
* There is still idle time; further improve the number by using the
@@ -7534,15 +7535,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
- util = scale_irq_capacity(util, irq, scale);
+ util = scale_irq_capacity(util, irq, max);
util += irq;

- return min(scale, util);
+ /*
+ * Bandwidth required by DEADLINE must always be granted while, for
+ * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+ * to gracefully reduce the frequency when no tasks show up for longer
+ * periods of time.
+ *
+ * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+ * bw_dl as requested freq. However, cpufreq is not yet ready for such
+ * an interface. So, we only do the latter for now.
+ */
+ if (type == FREQUENCY_UTIL)
+ util += cpu_bw_dl(rq);
+
+ return min(max, util);
}

unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c09..5888176 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
u64 last_update;

unsigned long util;
- unsigned long bw_min;
+ unsigned long bw_dl;

/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
@@ -115,28 +115,6 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
}

/**
- * get_capacity_ref_freq - get the reference frequency that has been used to
- * correlate frequency and compute capacity for a given cpufreq policy. We use
- * the CPU managing it for the arch_scale_freq_ref() call in the function.
- * @policy: the cpufreq policy of the CPU in question.
- *
- * Return: the reference CPU frequency to compute a capacity.
- */
-static __always_inline
-unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
-{
- unsigned int freq = arch_scale_freq_ref(policy->cpu);
-
- if (freq)
- return freq;
-
- if (arch_scale_freq_invariant())
- return policy->cpuinfo.max_freq;
-
- return policy->cur;
-}
-
-/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for.
* @util: Current CPU utilization.
@@ -162,9 +140,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned long util, unsigned long max)
{
struct cpufreq_policy *policy = sg_policy->policy;
- unsigned int freq;
+ unsigned int freq = arch_scale_freq_invariant() ?
+ policy->cpuinfo.max_freq : policy->cur;

- freq = get_capacity_ref_freq(policy);
+ util = map_util_perf(util);
freq = map_util_freq(util, freq, max);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -174,31 +153,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max)
-{
- /* Add dvfs headroom to actual utilization */
- actual = map_util_perf(actual);
- /* Actually we don't need to target the max performance */
- if (actual < max)
- max = actual;
-
- /*
- * Ensure at least minimum performance while providing more compute
- * capacity when possible.
- */
- return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+ struct rq *rq = cpu_rq(sg_cpu->cpu);

- util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
- util = max(util, boost);
- sg_cpu->bw_min = min;
- sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+ FREQUENCY_UTIL, NULL);
}

/**
@@ -289,16 +251,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
* This mechanism is designed to boost high frequently IO waiting tasks, while
* being more conservative on tasks which does sporadic IO operations.
*/
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
unsigned long max_cap)
{
+ unsigned long boost;
+
/* No boost currently required */
if (!sg_cpu->iowait_boost)
- return 0;
+ return;

/* Reset boost if the CPU appears to have been idle enough */
if (sugov_iowait_reset(sg_cpu, time, false))
- return 0;
+ return;

if (!sg_cpu->iowait_boost_pending) {
/*
@@ -307,7 +271,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
sg_cpu->iowait_boost >>= 1;
if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
sg_cpu->iowait_boost = 0;
- return 0;
+ return;
}
}

@@ -317,7 +281,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* sg_cpu->util is already in capacity scale; convert iowait_boost
* into the same scale so we can compare.
*/
- return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+ if (sg_cpu->util < boost)
+ sg_cpu->util = boost;
}

#ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +306,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
{
- if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+ if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
sg_cpu->sg_policy->limits_changed = true;
}

@@ -347,8 +314,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap,
unsigned int flags)
{
- unsigned long boost;
-
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

@@ -357,8 +322,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false;

- boost = sugov_iowait_apply(sg_cpu, time, max_cap);
- sugov_get_util(sg_cpu, boost);
+ sugov_get_util(sg_cpu);
+ sugov_iowait_apply(sg_cpu, time, max_cap);

return true;
}
@@ -442,8 +407,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

- cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
- sg_cpu->util, max_cap);
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+ map_util_perf(sg_cpu->util), max_cap);

sg_cpu->sg_policy->last_freq_update_time = time;
}
@@ -459,10 +424,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
- unsigned long boost;

- boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
- sugov_get_util(j_sg_cpu, boost);
+ sugov_get_util(j_sg_cpu);
+ sugov_iowait_apply(j_sg_cpu, time, max_cap);

util = max(j_sg_cpu->util, util);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030..f43021f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);

- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
}

eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
- unsigned long eff_util, min, max;
+ unsigned long eff_util;

/*
* Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, &min, &max);
-
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
- eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+ eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
max_util = max(max_util, eff_util);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54b..8a70d51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2994,14 +2994,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_SMP
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max);
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL: Utilization used to select frequency
+ * @ENERGY_UTIL: Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+ FREQUENCY_UTIL,
+ ENERGY_UTIL,
+};

+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+ enum cpu_util_type type,
+ struct task_struct *p);

/*
* Verify the fitness of task @p to run on @cpu taking into account the
@@ -3058,6 +3068,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}

+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq: The rq to clamp against. Must not be NULL.
+ * @util: The util value to clamp.
+ * @p: The task to clamp against. Can be NULL if you want to clamp
+ * against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ unsigned long min_util = 0;
+ unsigned long max_util = 0;
+
+ if (!static_branch_likely(&sched_uclamp_used))
+ return util;
+
+ if (p) {
+ min_util = uclamp_eff_value(p, UCLAMP_MIN);
+ max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+ /*
+ * Ignore last runnable task's max clamp, as this task will
+ * reset it. Similarly, no need to read the rq's min clamp.
+ */
+ if (uclamp_rq_is_idle(rq))
+ goto out;
+ }
+
+ min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+ max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3095,6 +3158,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE;
}

+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ return util;
+}
+
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

static inline bool uclamp_is_used(void)

2024-01-11 13:04:24

by Vincent Guittot

Subject: Re: [GIT PULL] scheduler fixes

On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <[email protected]> wrote:
>
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > It's one of these two:
> > >
> > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > >
> > > one more boot to go, then I'll try to revert whichever causes my
> > > machine to perform horribly much worse.
> >
> > I guess it should come as no surprise that the result is
> >
> > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> >
> > but to revert cleanly I will have to revert all of
> >
> > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > performance estimation")
> >
> > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
>
> Thanks a lot for bisecting this, and ack on the revert in any case, these
> are relatively fresh changes that clearly didn't get enough testing - sorry!

b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
linked with other patches.
I can provide a clean revert of only:
f12560779f9d ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

if the fix that I proposed doesn't work:
https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

>
> I also made the revert in sched/urgent & added a changelog, which you can
> pull from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-urgent-2024-01-11
>
> # HEAD: 250ce3c1169743f3575cc5937fccd72380052795 Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits
>
> Revert recent changes to the sched_util logic, to address a bad
> performance regression that increased kernel build time on Linus's
> 64-CPU desktop system substantially.
>
> Lightly build and boot tested.
>
> Thanks,
>
> Ingo
>
> ------------------>
> Ingo Molnar (1):
> Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commits
>
>
> include/linux/energy_model.h | 1 +
> kernel/sched/core.c | 90 +++++++++++++++++++++++-----------------
> kernel/sched/cpufreq_schedutil.c | 90 ++++++++++++----------------------------
> kernel/sched/fair.c | 22 ++--------
> kernel/sched/sched.h | 84 +++++++++++++++++++++++++++++++++----
> 5 files changed, 160 insertions(+), 127 deletions(-)

2024-01-11 17:45:45

by Linus Torvalds

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
<[email protected]> wrote:
>
> Could you confirm that cpufreq governor is schedutil and the driver is
> amd-pstate on your system ?

schedutil yes, amd-pstate no. I actually just use acpi_cpufreq

>
> Also I'm interested by the output of the amd_pstate to confirm that it uses the
> adjust_perf callback
>
> I suppose that you don't use uclamp feature and amd doesn't use EAS so that let
> the change of the min parameter of adjust_perf which was probably always 0
> unless you use deadline scheduler and which now takes into account irq pressure.
>
> Could you try the patch below which restores the previous min value ?
>
> ---
> kernel/sched/cpufreq_schedutil.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..3fe8ac6ce9cc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -194,10 +194,11 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + struct rq *rq = cpu_rq(sg_cpu->cpu);
>
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> - sg_cpu->bw_min = min;
> + sg_cpu->bw_min = cpu_bw_dl(rq);
> sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> }
>
> @@ -442,7 +443,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
> sg_cpu->util = prev_util;
>
> - cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> + cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_min),
> sg_cpu->util, max_cap);
>
> sg_cpu->sg_policy->last_freq_update_time = time;
> --
> 2.34.1
>
>
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
> >
> > Linus

2024-01-11 17:54:02

by Linus Torvalds

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> <[email protected]> wrote:
> >
> > Could you confirm that cpufreq governor is schedutil and the driver is
> > amd-pstate on your system ?
>
> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq

Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
unfinished quoting removal.

And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but

/sys/devices/system/cpu/cpufreq/policy0/scaling_driver

clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says

amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled

which is presumably the reason my machine uses acpi-pstate.

I will also test out your other questions, but I need to go back and
do more pull requests first.

Linus

2024-01-11 18:16:38

by Vincent Guittot

Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > Could you confirm that cpufreq governor is schedutil and the driver is
> > > amd-pstate on your system ?
> >
> > schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
>
> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> unfinished quoting removal.
>
> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
>
> /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
>
> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says

That seems to be the right place to look

>
> amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
>
> which is presumably the reason my machine uses acpi-pstate.
>
> I will also test out your other questions, but I need to go back and
> do more pull requests first.

ok, thanks

I'm going to continue checking what else could trigger such a regression,
keeping in mind that your system should not have been impacted by these
changes.


>
> Linus

2024-01-11 20:48:35

by Ingo Molnar

Subject: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit


* Vincent Guittot <[email protected]> wrote:

> On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > It's one of these two:
> > > >
> > > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > >
> > > > one more boot to go, then I'll try to revert whichever causes my
> > > > machine to perform horribly much worse.
> > >
> > > I guess it should come as no surprise that the result is
> > >
> > > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > >
> > > but to revert cleanly I will have to revert all of
> > >
> > > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > performance estimation")
> > >
> > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > >
> > > I'll keep that revert in my private test-tree for now (so that I have
> > > a working machine again), but I'll move it to my main branch soon
> > > unless somebody has a quick fix for this problem.
> >
> > Thanks a lot for bisecting this, and ack on the revert in any case, these
> > are relatively fresh changes that clearly didn't get enough testing - sorry!
>
> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
> linked with other patches.

Indeed.

> I can provide a clean revert of only :
> f12560779f9d ("sched/cpufreq: Rework iowait boost")
> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

I've done this too, see this new commit in sched/urgent:

60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")

Also attached below.

> if the fix that i proposed doesn't work:
> https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

Yeah - although of course Linus is free to just pull the revert as well.
I'll also try to reproduce the regression locally.

Thanks,

Ingo

===============================>
From: Ingo Molnar <[email protected]>
Date: Thu, 11 Jan 2024 11:45:17 +0100
Subject: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

This reverts the following commits:

f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Because Linus reported a bad performance regression with the
schedutil governor that increased the time his empty kernel
build took from 22 to 44 seconds (and can be similarly measured
in full builds as well) - and bisected it back to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commit.

Reported-by: Linus Torvalds <[email protected]>
Bisected-by: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
include/linux/energy_model.h | 1 +
kernel/sched/core.c | 90 +++++++++++++++++++++++-----------------
kernel/sched/cpufreq_schedutil.c | 64 +++++++++++-----------------
kernel/sched/fair.c | 22 ++--------
kernel/sched/sched.h | 84 +++++++++++++++++++++++++++++++++----
5 files changed, 158 insertions(+), 103 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e087471..c19e7effe764 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
scale_cpu = arch_scale_cpu_capacity(cpu);
ref_freq = arch_scale_freq_ref(cpu);

+ max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ref_freq, scale_cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..038eeaf76d2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7468,13 +7468,18 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max)
+ enum cpu_util_type type,
+ struct task_struct *p)
{
- unsigned long util, irq, scale;
+ unsigned long dl_util, util, irq, max;
struct rq *rq = cpu_rq(cpu);

- scale = arch_scale_cpu_capacity(cpu);
+ max = arch_scale_cpu_capacity(cpu);
+
+ if (!uclamp_is_used() &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+ return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7482,49 +7487,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
- if (unlikely(irq >= scale)) {
- if (min)
- *min = scale;
- if (max)
- *max = scale;
- return scale;
- }
-
- if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
- /*
- * When an RT task is runnable and uclamp is not used, we must
- * ensure that the task will run at maximum compute capacity.
- */
- if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
- *min = max(*min, scale);
- }
+ if (unlikely(irq >= max))
+ return max;

/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- util += cpu_util_dl(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_rq_util_with(rq, util, p);
+
+ dl_util = cpu_util_dl(rq);

/*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
+ * For frequency selection we do not make cpu_util_dl() a permanent part
+ * of this sum because we want to use cpu_bw_dl() later on, but we need
+ * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+ * that we select f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us to not quite hit
+ * saturation when we should -- something for later.
*/
- if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ if (util + dl_util >= max)
+ return max;

- if (util >= scale)
- return scale;
+ /*
+ * OTOH, for energy computation we need the estimated running time, so
+ * include util_dl and ignore dl_bw.
+ */
+ if (type == ENERGY_UTIL)
+ util += dl_util;

/*
* There is still idle time; further improve the number by using the
@@ -7535,15 +7536,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
- util = scale_irq_capacity(util, irq, scale);
+ util = scale_irq_capacity(util, irq, max);
util += irq;

- return min(scale, util);
+ /*
+ * Bandwidth required by DEADLINE must always be granted while, for
+ * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+ * to gracefully reduce the frequency when no tasks show up for longer
+ * periods of time.
+ *
+ * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+ * bw_dl as requested freq. However, cpufreq is not yet ready for such
+ * an interface. So, we only do the latter for now.
+ */
+ if (type == FREQUENCY_UTIL)
+ util += cpu_bw_dl(rq);
+
+ return min(max, util);
}

unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..5f8729c41d0a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
u64 last_update;

unsigned long util;
- unsigned long bw_min;
+ unsigned long bw_dl;

/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
@@ -164,6 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq;

+ util = map_util_perf(util);
freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);

@@ -174,31 +175,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
- /* Add dvfs headroom to actual utilization */
- actual = map_util_perf(actual);
- /* Actually we don't need to target the max performance */
- if (actual < max)
- max = actual;
+ unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+ struct rq *rq = cpu_rq(sg_cpu->cpu);

- /*
- * Ensure at least minimum performance while providing more compute
- * capacity when possible.
- */
- return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
-{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
-
- util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
- util = max(util, boost);
- sg_cpu->bw_min = min;
- sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+ FREQUENCY_UTIL, NULL);
}

/**
@@ -289,16 +273,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
* This mechanism is designed to boost high frequently IO waiting tasks, while
* being more conservative on tasks which does sporadic IO operations.
*/
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
unsigned long max_cap)
{
+ unsigned long boost;
+
/* No boost currently required */
if (!sg_cpu->iowait_boost)
- return 0;
+ return;

/* Reset boost if the CPU appears to have been idle enough */
if (sugov_iowait_reset(sg_cpu, time, false))
- return 0;
+ return;

if (!sg_cpu->iowait_boost_pending) {
/*
@@ -307,7 +293,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
sg_cpu->iowait_boost >>= 1;
if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
sg_cpu->iowait_boost = 0;
- return 0;
+ return;
}
}

@@ -317,7 +303,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* sg_cpu->util is already in capacity scale; convert iowait_boost
* into the same scale so we can compare.
*/
- return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+ if (sg_cpu->util < boost)
+ sg_cpu->util = boost;
}

#ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +328,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
{
- if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+ if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
sg_cpu->sg_policy->limits_changed = true;
}

@@ -347,8 +336,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap,
unsigned int flags)
{
- unsigned long boost;
-
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

@@ -357,8 +344,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false;

- boost = sugov_iowait_apply(sg_cpu, time, max_cap);
- sugov_get_util(sg_cpu, boost);
+ sugov_get_util(sg_cpu);
+ sugov_iowait_apply(sg_cpu, time, max_cap);

return true;
}
@@ -442,8 +429,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

- cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
- sg_cpu->util, max_cap);
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+ map_util_perf(sg_cpu->util), max_cap);

sg_cpu->sg_policy->last_freq_update_time = time;
}
@@ -459,10 +446,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
- unsigned long boost;

- boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
- sugov_get_util(j_sg_cpu, boost);
+ sugov_get_util(j_sg_cpu);
+ sugov_iowait_apply(j_sg_cpu, time, max_cap);

util = max(j_sg_cpu->util, util);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..f2bb83675e4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);

- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
}

eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
- unsigned long eff_util, min, max;
+ unsigned long eff_util;

/*
* Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, &min, &max);
-
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
- eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+ eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
max_util = max(max_util, eff_util);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..eb7e07a1abcc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3005,14 +3005,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_SMP
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL: Utilization used to select frequency
+ * @ENERGY_UTIL: Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+ FREQUENCY_UTIL,
+ ENERGY_UTIL,
+};
+
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max);
-
+ enum cpu_util_type type,
+ struct task_struct *p);

/*
* Verify the fitness of task @p to run on @cpu taking into account the
@@ -3069,6 +3079,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}

+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq: The rq to clamp against. Must not be NULL.
+ * @util: The util value to clamp.
+ * @p: The task to clamp against. Can be NULL if you want to clamp
+ * against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ unsigned long min_util = 0;
+ unsigned long max_util = 0;
+
+ if (!static_branch_likely(&sched_uclamp_used))
+ return util;
+
+ if (p) {
+ min_util = uclamp_eff_value(p, UCLAMP_MIN);
+ max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+ /*
+ * Ignore last runnable task's max clamp, as this task will
+ * reset it. Similarly, no need to read the rq's min clamp.
+ */
+ if (uclamp_rq_is_idle(rq))
+ goto out;
+ }
+
+ min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+ max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3106,6 +3169,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE;
}

+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ return util;
+}
+
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

static inline bool uclamp_is_used(void)
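
(To illustrate the clamping rule that the restored uclamp_rq_util_with()
above implements - including the min/max inversion case - here is a small
userspace toy model; a sketch with made-up numbers, not kernel code:)

#include <stdio.h>

/* Toy model of uclamp_rq_util_with(): clamp a utilization value to
 * [min_util, max_util]. An inversion (min >= max) can happen because
 * the rq-level clamps are MAX-aggregated over RUNNABLE tasks with
 * different clamps; min wins in that case. */
static unsigned long uclamp_apply(unsigned long util,
                                  unsigned long min_util,
                                  unsigned long max_util)
{
        if (min_util >= max_util)
                return min_util;
        if (util < min_util)
                return min_util;
        if (util > max_util)
                return max_util;
        return util;
}

int main(void)
{
        printf("%lu\n", uclamp_apply(300, 512, 1024)); /* boosted: 512 */
        printf("%lu\n", uclamp_apply(900,   0,  512)); /* capped:  512 */
        printf("%lu\n", uclamp_apply(700, 800,  600)); /* inverted: 800 */
        return 0;
}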

2024-01-11 20:56:05

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: sched/urgent] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 60ee1706bd11669b0530efb4e92526854b9a6364
Gitweb: https://git.kernel.org/tip/60ee1706bd11669b0530efb4e92526854b9a6364
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 11 Jan 2024 11:45:17 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 11 Jan 2024 21:42:42 +01:00

Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

This reverts the following commits:

f12560779f9d73 ("sched/cpufreq: Rework iowait boost")
9c0b4bb7f6303c ("sched/cpufreq: Rework schedutil governor performance estimation")

Linus reported a bad performance regression with the
schedutil governor that increased the time his empty
kernel build took from 22 to 44 seconds (a similar slowdown
can be measured in full builds as well), and bisected it back
to 9c0b4bb7f6303c.

Until we have a proper fix, revert the broken commit and its
dependent commit.

Reported-by: Linus Torvalds <[email protected]>
Bisected-by: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com
---
include/linux/energy_model.h | 1 +-
kernel/sched/core.c | 90 +++++++++++++++++--------------
kernel/sched/cpufreq_schedutil.c | 64 ++++++++--------------
kernel/sched/fair.c | 22 +-------
kernel/sched/sched.h | 84 ++++++++++++++++++++++++++---
5 files changed, 158 insertions(+), 103 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88d91e0..c19e7ef 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,6 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
scale_cpu = arch_scale_cpu_capacity(cpu);
ref_freq = arch_scale_freq_ref(cpu);

+ max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ref_freq, scale_cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc..038eeaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7468,13 +7468,18 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max)
+ enum cpu_util_type type,
+ struct task_struct *p)
{
- unsigned long util, irq, scale;
+ unsigned long dl_util, util, irq, max;
struct rq *rq = cpu_rq(cpu);

- scale = arch_scale_cpu_capacity(cpu);
+ max = arch_scale_cpu_capacity(cpu);
+
+ if (!uclamp_is_used() &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+ return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7482,49 +7487,45 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
- if (unlikely(irq >= scale)) {
- if (min)
- *min = scale;
- if (max)
- *max = scale;
- return scale;
- }
-
- if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
-
- /*
- * When an RT task is runnable and uclamp is not used, we must
- * ensure that the task will run at maximum compute capacity.
- */
- if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
- *min = max(*min, scale);
- }
+ if (unlikely(irq >= max))
+ return max;

/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- util += cpu_util_dl(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_rq_util_with(rq, util, p);
+
+ dl_util = cpu_util_dl(rq);

/*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
+ * For frequency selection we do not make cpu_util_dl() a permanent part
+ * of this sum because we want to use cpu_bw_dl() later on, but we need
+ * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+ * that we select f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us to not quite hit
+ * saturation when we should -- something for later.
*/
- if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ if (util + dl_util >= max)
+ return max;

- if (util >= scale)
- return scale;
+ /*
+ * OTOH, for energy computation we need the estimated running time, so
+ * include util_dl and ignore dl_bw.
+ */
+ if (type == ENERGY_UTIL)
+ util += dl_util;

/*
* There is still idle time; further improve the number by using the
@@ -7535,15 +7536,28 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
- util = scale_irq_capacity(util, irq, scale);
+ util = scale_irq_capacity(util, irq, max);
util += irq;

- return min(scale, util);
+ /*
+ * Bandwidth required by DEADLINE must always be granted while, for
+ * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+ * to gracefully reduce the frequency when no tasks show up for longer
+ * periods of time.
+ *
+ * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+ * bw_dl as requested freq. However, cpufreq is not yet ready for such
+ * an interface. So, we only do the latter for now.
+ */
+ if (type == FREQUENCY_UTIL)
+ util += cpu_bw_dl(rq);
+
+ return min(max, util);
}

unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c09..5f8729c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,7 +47,7 @@ struct sugov_cpu {
u64 last_update;

unsigned long util;
- unsigned long bw_min;
+ unsigned long bw_dl;

/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
@@ -164,6 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq;

+ util = map_util_perf(util);
freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);

@@ -174,31 +175,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
- /* Add dvfs headroom to actual utilization */
- actual = map_util_perf(actual);
- /* Actually we don't need to target the max performance */
- if (actual < max)
- max = actual;
+ unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
+ struct rq *rq = cpu_rq(sg_cpu->cpu);

- /*
- * Ensure at least minimum performance while providing more compute
- * capacity when possible.
- */
- return max(min, max);
-}
-
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
-{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
-
- util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
- util = max(util, boost);
- sg_cpu->bw_min = min;
- sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
+ FREQUENCY_UTIL, NULL);
}

/**
@@ -289,16 +273,18 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
* This mechanism is designed to boost high frequently IO waiting tasks, while
* being more conservative on tasks which does sporadic IO operations.
*/
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
+static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
unsigned long max_cap)
{
+ unsigned long boost;
+
/* No boost currently required */
if (!sg_cpu->iowait_boost)
- return 0;
+ return;

/* Reset boost if the CPU appears to have been idle enough */
if (sugov_iowait_reset(sg_cpu, time, false))
- return 0;
+ return;

if (!sg_cpu->iowait_boost_pending) {
/*
@@ -307,7 +293,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
sg_cpu->iowait_boost >>= 1;
if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
sg_cpu->iowait_boost = 0;
- return 0;
+ return;
}
}

@@ -317,7 +303,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* sg_cpu->util is already in capacity scale; convert iowait_boost
* into the same scale so we can compare.
*/
- return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
+ boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+ if (sg_cpu->util < boost)
+ sg_cpu->util = boost;
}

#ifdef CONFIG_NO_HZ_COMMON
@@ -339,7 +328,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
{
- if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
+ if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
sg_cpu->sg_policy->limits_changed = true;
}

@@ -347,8 +336,6 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap,
unsigned int flags)
{
- unsigned long boost;
-
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

@@ -357,8 +344,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false;

- boost = sugov_iowait_apply(sg_cpu, time, max_cap);
- sugov_get_util(sg_cpu, boost);
+ sugov_get_util(sg_cpu);
+ sugov_iowait_apply(sg_cpu, time, max_cap);

return true;
}
@@ -442,8 +429,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

- cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
- sg_cpu->util, max_cap);
+ cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
+ map_util_perf(sg_cpu->util), max_cap);

sg_cpu->sg_policy->last_freq_update_time = time;
}
@@ -459,10 +446,9 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
- unsigned long boost;

- boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
- sugov_get_util(j_sg_cpu, boost);
+ sugov_get_util(j_sg_cpu);
+ sugov_iowait_apply(j_sg_cpu, time, max_cap);

util = max(j_sg_cpu->util, util);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e..f2bb836 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7820,7 +7820,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);

- busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
}

eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7843,7 +7843,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
- unsigned long eff_util, min, max;
+ unsigned long eff_util;

/*
* Performance domain frequency: utilization clamping
@@ -7852,23 +7852,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, &min, &max);
-
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
- eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+ eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
max_util = max(max_util, eff_util);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe04..eb7e07a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3005,14 +3005,24 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_SMP
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long *min,
- unsigned long *max);
-
-unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
- unsigned long min,
- unsigned long max);
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL: Utilization used to select frequency
+ * @ENERGY_UTIL: Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within effective_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+ FREQUENCY_UTIL,
+ ENERGY_UTIL,
+};

+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+ enum cpu_util_type type,
+ struct task_struct *p);

/*
* Verify the fitness of task @p to run on @cpu taking into account the
@@ -3069,6 +3079,59 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}

+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq: The rq to clamp against. Must not be NULL.
+ * @util: The util value to clamp.
+ * @p: The task to clamp against. Can be NULL if you want to clamp
+ * against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
+static __always_inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ unsigned long min_util = 0;
+ unsigned long max_util = 0;
+
+ if (!static_branch_likely(&sched_uclamp_used))
+ return util;
+
+ if (p) {
+ min_util = uclamp_eff_value(p, UCLAMP_MIN);
+ max_util = uclamp_eff_value(p, UCLAMP_MAX);
+
+ /*
+ * Ignore last runnable task's max clamp, as this task will
+ * reset it. Similarly, no need to read the rq's min clamp.
+ */
+ if (uclamp_rq_is_idle(rq))
+ goto out;
+ }
+
+ min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+ max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
+out:
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3106,6 +3169,13 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE;
}

+static inline
+unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
+ struct task_struct *p)
+{
+ return util;
+}
+
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

static inline bool uclamp_is_used(void)
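
(The IRQ scaling step restored in the core.c hunk above,
U' = irq + (max - irq) / max * U, can be sanity-checked with a small
userspace sketch; again a toy model with made-up numbers, not kernel code:)

#include <stdio.h>

/* Toy model of the IRQ scaling in effective_cpu_util(): time spent in
 * IRQ is invisible to the task clock, so the CFS/RT utilization is
 * scaled by the remaining capacity before the IRQ utilization is added
 * back on top. */
static unsigned long scale_irq(unsigned long util, unsigned long irq,
                               unsigned long max)
{
        util *= (max - irq);
        util /= max;
        return util + irq;
}

int main(void)
{
        /* capacity 1024, 128 units of IRQ time, CFS+RT utilization 512 */
        printf("%lu\n", scale_irq(512, 128, 1024)); /* prints 576 */
        return 0;
}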

2024-01-11 22:23:00

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

On Thu, 11 Jan 2024 at 21:48, Ingo Molnar <[email protected]> wrote:
>
>
> * Vincent Guittot <[email protected]> wrote:
>
> > On Thu, 11 Jan 2024 at 12:09, Ingo Molnar <[email protected]> wrote:
> > >
> > >
> > > * Linus Torvalds <[email protected]> wrote:
> > >
> > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > <[email protected]> wrote:
> > > > >
> > > > > It's one of these two:
> > > > >
> > > > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > >
> > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > machine to perform horribly much worse.
> > > >
> > > > I guess it should come as no surprise that the result is
> > > >
> > > > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > >
> > > > but to revert cleanly I will have to revert all of
> > > >
> > > > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > performance estimation")
> > > >
> > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > >
> > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > a working machine again), but I'll move it to my main branch soon
> > > > unless somebody has a quick fix for this problem.
> > >
> > > Thanks a lot for bisecting this, and ack on the revert in any case, these
> > > are relatively fresh changes that clearly didn't get enough testing - sorry!
> >
> > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency") is
> > linked with other patches.
>
> Indeed.
>
> > I can provide a clean revert of only :
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> I've done this too, see this new commit in sched/urgent:
>
> 60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")

Thanks, your new commit looks good to me.

>
> Also attached below.
>
> > if the fix that i proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
>
> Yeah - although of course Linus is free to just pull the revert as well.

Yes, I know

> I'll try to reproduce the regression locally as well.

Ok, Thanks

Vincent

>
> Thanks,
>
> Ingo
>
> [attached patch snipped - identical to tip commit 60ee1706bd11, quoted in
> full in the tip-bot message above]

2024-01-12 14:25:40

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 11/01/2024 19:16, Vincent Guittot wrote:
> On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> <[email protected]> wrote:
>>
>> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
>>> <[email protected]> wrote:
>>>>
>>>> Could you confirm that cpufreq governor is schedutil and the driver is
>>>> amd-pstate on your system ?
>>>
>>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
>>
>> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
>> unfinished quoting removal.
>>
>> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
>>
>> /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
>>
>> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
>
> That seems to be the right place to look
>
>>
>> amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
>>
>> which is presumably the reason my machine uses acpi-pstate.
>>
>> I will also test out your other questions, but I need to go back and
>> do more pull requests first.
>
> ok, thanks
>
> I'm going to continue checking what else could trigger such a regression,
> keeping in mind that your system should not have been impacted by these
> changes

I can't see the regression on my

20-core (40-thread) Intel Xeon CPU E5-2690 v2

with 'schedutil' and 'acpi-cpufreq'.

f12560779f9d - sched/cpufreq: Rework iowait boost <- (w/ patches)
9c0b4bb7f630 - sched/cpufreq: Rework schedutil governor performance estimation
50181c0cff31 - sched/pelt: Avoid underestimation of task utilization <- (base)
..

# cpufreq-info -c 0 -e
..
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 10.0 us.
hardware limits: 1.20 GHz - 3.00 GHz
available frequency steps: 3.00 GHz, 3.00 GHz, 2.90 GHz, 2.70 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.20 GHz,
2.10 GHz, 2.00 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.30 GHz, 1.20 GHz
available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
current policy: frequency should be within 1.20 GHz and 3.00 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz (asserted by call to hardware).


cpufreq is still fast-switching, so no schedutil 'sugov' DL threads.

2024-01-12 16:58:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 15:23, Dietmar Eggemann <[email protected]> wrote:
>
> On 11/01/2024 19:16, Vincent Guittot wrote:
> > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> >> <[email protected]> wrote:
> >>>
> >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> >>> <[email protected]> wrote:
> >>>>
> >>>> Could you confirm that cpufreq governor is schedutil and the driver is
> >>>> amd-pstate on your system ?
> >>>
> >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> >>
> >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> >> unfinished quoting removal.
> >>
> >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> >>
> >> /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> >>
> >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> >
> > That seems to be the right place to look
> >
> >>
> >> amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> >>
> >> which is presumably the reason my machine uses acpi-pstate.
> >>
> >> I will also test out your other questions, but I need to go back and
> >> do more pull requests first.
> >
> > ok, thanks
> >
> > I'm going to continue checking what else could trigger such a regression,
> > keeping in mind that your system should not have been impacted by these
> > changes
>
> I can't see the regression on my
>
> 20-core (40-thread) Intel Xeon CPU E5-2690 v2
>
> with 'schedutil' and 'acpi-cpufreq'.

Thanks for the tests

>
> f12560779f9d - sched/cpufreq: Rework iowait boost <- (w/ patches)
> 9c0b4bb7f630 - sched/cpufreq: Rework schedutil governor performance estimation
> 50181c0cff31 - sched/pelt: Avoid underestimation of task utilization <- (base)
> ...
>
> # cpufreq-info -c 0 -e
> ...
> analyzing CPU 0:
> driver: acpi-cpufreq
> CPUs which run at the same hardware frequency: 0
> CPUs which need to have their frequency coordinated by software: 0
> maximum transition latency: 10.0 us.
> hardware limits: 1.20 GHz - 3.00 GHz
> available frequency steps: 3.00 GHz, 3.00 GHz, 2.90 GHz, 2.70 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.20 GHz,
> 2.10 GHz, 2.00 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.30 GHz, 1.20 GHz
> available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
> current policy: frequency should be within 1.20 GHz and 3.00 GHz.
> The governor "schedutil" may decide which speed to use
> within this range.
> current CPU frequency is 1.20 GHz (asserted by call to hardware).
>
>
> cpufreq is still fast-switching, so no schedutil 'sugov' DL threads.

2024-01-12 18:18:22

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/12/24 15:23, Dietmar Eggemann wrote:
> On 11/01/2024 19:16, Vincent Guittot wrote:
> > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> >> <[email protected]> wrote:
> >>>
> >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> >>> <[email protected]> wrote:
> >>>>
> >>>> Could you confirm that cpufreq governor is schedutil and the driver is
> >>>> amd-pstate on your system ?
> >>>
> >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> >>
> >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> >> unfinished quoting removal.
> >>
> >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> >>
> >> /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> >>
> >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> >
> > That seems to be the right place to look
> >
> >>
> >> amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> >>
> >> which is presumably the reason my machine uses acpi-pstate.
> >>
> >> I will also test out your other questions, but I need to go back and
> >> do more pull requests first.
> >
> > ok, thanks
> >
> > I'm going to continue checking what else could trigger such a regression,
> > keeping in mind that your system should not have been impacted by these
> > changes
>
> I can't see the regression on my
>
> 20-core (40-thread) Intel Xeon CPU E5-2690 v2
>
> with 'schedutil' and 'acpi-cpufreq'.

I tried to reproduce this on an AMD 3900X 12-core system. I don't see any
difference in defconfig compile times with and without the two patches
reverted: ~1m26s either way.

using the schedutil governor and acpi-cpufreq driver too.

I disabled CONFIG_UCLAMP_TASK and that didn't make a difference.

I would have expected the iowait boost change to be the more troublesome of
the two, being the more subtle one, tbh.

2024-01-12 18:24:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit


* Ingo Molnar <[email protected]> wrote:

> > I can provide a clean revert of only :
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> I've done this too, see this new commit in sched/urgent:
>
> 60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")
>
> Also attached below.
>
> > if the fix that i proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
>
> Yeah - although of course Linus is free to just pull the revert as well.
> I'll try to reproduce the regression locally as well.

Update & heads up: unfortunately I'm unable to reproduce the regression on
a fairly similar system with a Threadripper 3970X CPU.

Kernel build times are very close, with or without the revert, on vanilla
v6.7 or v6.7+sched/core.

Here's a few results where I tried to quantify kernel build times without
having to wait a long time.

Re-building the kernel/**.o object files in a loop:

$ perf stat --pre 'rm -f kernel/*.o kernel/*/*.o kernel/*/*/*.o' --null --sync --repeat 3 make -j64 kernel/ >/dev/null


# v6.7.0:
# bootup default schedutil governor:
24.521 +- 0.077 seconds time elapsed ( +- 0.31% )
24.644 +- 0.071 seconds time elapsed ( +- 0.29% )

# cpufreq-max:
24.452 +- 0.110 seconds time elapsed ( +- 0.45% )
24.482 +- 0.048 seconds time elapsed ( +- 0.20% )

# v6.7.0+sched/core:
# bootup default schedutil governor:
24.666 +- 0.063 seconds time elapsed ( +- 0.26% )
24.809 +- 0.118 seconds time elapsed ( +- 0.48% )

The fully-cached build numbers are very close to each other, and during the
hot phase of the kernel build all CPUs are saturated.

The 2x performance regression that Linus is seeing is either some
pathological wakeup behavior, or perhaps the cores simply don't transition
between frequencies? The difference between the lowest and highest frequency
is pretty substantial (at least on my box):

cpu MHz : 2200.000
...
cpu MHz : 4000.000


There was *one* test, when the tree was cache-cold, where I saw really bad
performance (which I didn't really expect with my nvram system), with -j32
builds:

Performance counter stats for 'make -j32 kernel/' (3 runs):

64.34 +- 39.22 seconds time elapsed ( +- 60.95% )
25.08 +- 0.142 seconds time elapsed ( +- 0.56% )
24.97 +- 0.072 seconds time elapsed ( +- 0.29% )

Unfortunately that outlier was on a vanilla v6.7 bootup.

As a next step I could try Linus's specific config, maybe there's some
detail in it that makes the difference.

The commit that Linus bisected to (9c0b4bb7f6303c) doesn't *seem*
wrong in itself, especially without uclamp [I presume Linus doesn't use
CONFIG_UCLAMP_TASK=y and the cpu.uclamp.min/uclamp.max cgroup interface
that goes with it?], but it changes how we use the schedutil metrics,
which could change scheduling patterns - which is why I've spent many
hours yesterday and today trying to find a pathological workload that
reproduces this. No luck so far.

Linus: I can send a pull request for the 2-commit revert, or maybe you
could try Vincent's guess-patch that tries to restore the previous behavior
as closely as possible.

Thanks,

Ingo

2024-01-12 18:27:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8


* Linus Torvalds <[email protected]> wrote:

> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.

Same here:

model name : AMD Ryzen Threadripper 3970X 32-Core Processor

Could you please send me your desktop system's .config privately?

If it's triggered by something in the .config, I might be able to
reproduce.

Thanks,

Ingo

2024-01-12 19:03:56

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 19:18, Qais Yousef <[email protected]> wrote:
>
> On 01/12/24 15:23, Dietmar Eggemann wrote:
> > On 11/01/2024 19:16, Vincent Guittot wrote:
> > > On Thu, 11 Jan 2024 at 18:53, Linus Torvalds
> > > <[email protected]> wrote:
> > >>
> > >> On Thu, 11 Jan 2024 at 09:45, Linus Torvalds
> > >> <[email protected]> wrote:
> > >>>
> > >>> On Thu, 11 Jan 2024 at 00:11, Vincent Guittot
> > >>> <[email protected]> wrote:
> > >>>>
> > >>>> Could you confirm that cpufreq governor is schedutil and the driver is
> > >>>> amd-pstate on your system ?
> > >>>
> > >>> schedutil yes, amd-pstate no. I actually just use acpi_cpufreq
> > >>
> > >> Bah. Hit 'send' mistakenly too soon, thus the abrupt end and
> > >> unfinished quoting removal.
> > >>
> > >> And don't ask me why it's acpi_pstate-driven. I have X86_AMD_PSTATE=y, but
> > >>
> > >> /sys/devices/system/cpu/cpufreq/policy0/scaling_driver
> > >>
> > >> clearly says 'acpi-cpufreq'. Maybe I'm looking in the wrong place. My dmesg says
> > >
> > > That seems to be the right place to look
> > >
> > >>
> > >> amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
> > >>
> > >> which is presumably the reason my machine uses acpi-pstate.
> > >>
> > >> I will also test out your other questions, but I need to go back and
> > >> do more pull requests first.
> > >
> > > ok, thanks
> > >
> > > I'm going to continue checking what else could trigger such regression
> > > having in mind that your system should not have beeb impacted by this
> > > changes
> >
> > I can't see the regression on my
> >
> > 20-core (40-thread) Intel Xeon CPU E5-2690 v2
> >
> > with 'schedutil' and 'acpi-cpufreq'.
>
> I tried to reproduce this on an AMD 3900X 12-core system. I don't see any
> difference in defconfig compile times with and without the two patches
> reverted: ~1m26s either way.

Thanks for testing.
I haven't been able to reproduce the problem either, but my systems are
quite different.

>
> using the schedutil governor and acpi-cpufreq driver too.
>
> I disabled CONFIG_UCLAMP_TASK and that didn't make a difference.
>
> I would have expected the iowait boost change to be the more troublesome of
> the two, being the more subtle one, tbh.

2024-01-12 20:30:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

Ok, so testing a bit more. On a working kernel, when I do an empty
"make" (which is the fast test I've used), it's all single-threaded
because it's just 'make' doing tons of stat calls and string
operations.

And "cat /proc/cpuinfo | grep MHz" shows a nice clear signal:

...
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 4425.339
cpu MHz : 2200.000
...

so it boosts up to the top boost frequency.

Without the revert, doing the same thing, what I see is very
different. It's all just

...
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
...

which certainly explains why it takes 45s rather than 22s to do a full
empty build.

Btw, the "full empty build" I do is literally just

timestamp sh -c "make -j128 > ../makes"

where 'timestamp' is my stupid little wrapper program that just shows
elapsed time as the command is progressing (as opposed to "time",
which just shows it at the end).
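
(A minimal sketch of what such a wrapper might look like - hypothetical,
not Linus's actual program - is to fork the command and print elapsed
seconds while waiting for it:)

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
        struct timespec start, now;
        pid_t pid;

        if (argc < 2)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        pid = fork();
        if (!pid) {
                /* child: run the wrapped command */
                execvp(argv[1], argv + 1);
                _exit(127);
        }
        /* parent: print elapsed seconds until the command exits */
        while (waitpid(pid, NULL, WNOHANG) == 0) {
                clock_gettime(CLOCK_MONOTONIC, &now);
                fprintf(stderr, "\r%lds", (long)(now.tv_sec - start.tv_sec));
                sleep(1);
        }
        fprintf(stderr, "\n");
        return 0;
}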

Side note: that 4425.339 is very much the boost frequency, 'cpupower' reports

$ cpupower frequency-info
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 2.20 GHz - 3.70 GHz
available frequency steps: 3.70 GHz, 2.80 GHz, 2.20 GHz
available cpufreq governors: conservative ondemand userspace
powersave performance schedutil
current policy: frequency should be within 2.20 GHz and 3.70 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 2.20 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: no

and for all I know the scheduler got confused by the fact that it
thinks the hardware limits are 2.2-3.7 GHz. But the 3970X has a boost
frequency of 4.5GHz, and yes, I very much want it.

I will test Vincent's test-patch next.

Linus

2024-01-12 20:50:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 12:30, Linus Torvalds
<[email protected]> wrote:
>
> I will test Vincent's test-patch next.

The patch at

https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/

makes absolutely no difference. All cores stay at 2.2GHz (ok, so
there's noise, but we're talking "within a couple of MHz of 2.2GHz").

Linus

2024-01-12 21:06:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 12:49, Linus Torvalds
<[email protected]> wrote:
>
> All cores stay at 2.2GHz (ok, so there's noise, but we're
> talking "within a couple of MHz of 2.2GHz").

Note: that is true also when every CPU is fully loaded and I do a real
full build.

So the "empty make" is just my quick test that happens to be
single-threaded and should take just 20s. All my real builds slow down
too, because all CPUs stay at the minimum frequency.

And I just verified that Ingo's revert that only reverts two commits
(commit 60ee1706bd11 in the tip tree), makes things work correctly for
me.

Not surprising, since the bisection clearly pointed at just commit
9c0b4bb7f6303c being the one that caused the issue, but I decided to
just double-check anyway.

So with that revert, for the single-threaded case I see 4GHz+ numbers
(they spread from a single CPU to multiple CPUs once you run the
benchmark a few times).

And then when I run a full parallel build (rather than the
single-threaded empty one), the frequencies drop to ~3.85GHz for the
all-cpu case.

Linus

2024-01-13 01:04:52

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/12/24 13:04, Linus Torvalds wrote:
> On Fri, 12 Jan 2024 at 12:49, Linus Torvalds
> <[email protected]> wrote:
> >
> > All cores stay at 2.2GHz (ok, so there's noise, but we're
> > talking "within a couple of MHz of 2.2GHz").
>
> Note: that is true also when every CPU is fully loaded and I do a real
> full build.

That is odd. I can't see how the patch can cause this yet; could you try
a different compiler if possible? It seems like some operation goes wrong and
we end up with util, or one of its transformations, becoming 0 or close to it.

I tried your steps and I see frequencies changing. I have gcc
11.4.0 by default on my system.

I usually use perfetto, but it should be easy to see frequency updates from
the power/cpu_frequency trace event.

echo 1 | sudo tee /sys/kernel/tracing/tracing_on
echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
sudo cat /sys/kernel/tracing/trace

or

sudo cat /sys/kernel/tracing/trace_pipe

to clear the buffer and block on read

Or easier with trace-cmd

sudo trace-cmd record -e power/cpu_frequency $COMPILE_CMD
sudo trace-cmd report

If there's no idle time, the frequency updates will stop once we reach the
highest frequency.

When I run it I see 2.2GHz and 3.8GHz values only, with and without the patches
reverted.

<idle>-0 [017] 5089.547182905: cpu_frequency: state=2200000 cpu_id=17
<...>-77147 [015] 5089.548006123: cpu_frequency: state=3800000 cpu_id=15
<idle>-0 [004] 5089.553342808: cpu_frequency: state=3800000 cpu_id=4
<idle>-0 [003] 5089.566989667: cpu_frequency: state=2200000 cpu_id=3
rm-77149 [005] 5089.569246195: cpu_frequency: state=3800000 cpu_id=5
<idle>-0 [004] 5089.570419322: cpu_frequency: state=2200000 cpu_id=4
<idle>-0 [017] 5089.570826267: cpu_frequency: state=3800000 cpu_id=17
<idle>-0 [003] 5089.575702589: cpu_frequency: state=3800000 cpu_id=3
<idle>-0 [017] 5089.576045916: cpu_frequency: state=2200000 cpu_id=17
<idle>-0 [003] 5089.584007141: cpu_frequency: state=2200000 cpu_id=3
<idle>-0 [015] 5089.593025639: cpu_frequency: state=2200000 cpu_id=15
<idle>-0 [010] 5089.593661028: cpu_frequency: state=2800000 cpu_id=10
<idle>-0 [023] 5089.595181029: cpu_frequency: state=2200000 cpu_id=23
<...>-77153 [015] 5089.595202328: cpu_frequency: state=3800000 cpu_id=15
<idle>-0 [017] 5089.596112508: cpu_frequency: state=3800000 cpu_id=17
<idle>-0 [003] 5089.601227012: cpu_frequency: state=3800000 cpu_id=3
<idle>-0 [017] 5089.601303574: cpu_frequency: state=2200000 cpu_id=17
<idle>-0 [005] 5089.611995487: cpu_frequency: state=2200000 cpu_id=5
<idle>-0 [017] 5089.612446143: cpu_frequency: state=3800000 cpu_id=17
<idle>-0 [003] 5089.612461191: cpu_frequency: state=2200000 cpu_id=3
cc1-77159 [009] 5089.616006631: cpu_frequency: state=3800000 cpu_id=9
<idle>-0 [003] 5089.618213587: cpu_frequency: state=3800000 cpu_id=3
<idle>-0 [017] 5089.618245105: cpu_frequency: state=2200000 cpu_id=17
<idle>-0 [003] 5089.624066151: cpu_frequency: state=2200000 cpu_id=3
<idle>-0 [017] 5089.627031955: cpu_frequency: state=3800000 cpu_id=17
<idle>-0 [003] 5089.632148220: cpu_frequency: state=3800000 cpu_id=3
<idle>-0 [017] 5089.633114584: cpu_frequency: state=2200000 cpu_id=17
objcopy-77166 [023] 5089.635324796: cpu_frequency: state=3800000 cpu_id=23
<idle>-0 [003] 5089.636043349: cpu_frequency: state=2200000 cpu_id=3
<idle>-0 [018] 5089.636071762: cpu_frequency: state=2200000 cpu_id=18
<idle>-0 [017] 5089.636511027: cpu_frequency: state=3800000 cpu_id=17
<idle>-0 [003] 5089.638171879: cpu_frequency: state=3800000 cpu_id=3
build-77168 [019] 5089.640011393: cpu_frequency: state=3800000 cpu_id=19
<idle>-0 [017] 5089.652020092: cpu_frequency: state=2200000 cpu_id=17
<idle>-0 [004] 5089.653340503: cpu_frequency: state=3800000 cpu_id=4
<idle>-0 [004] 5089.654532595: cpu_frequency: state=2200000 cpu_id=4
<idle>-0 [003] 5089.656013393: cpu_frequency: state=2200000 cpu_id=3
<idle>-0 [004] 5089.666815072: cpu_frequency: state=3800000 cpu_id=4
<idle>-0 [011] 5089.697117342: cpu_frequency: state=3800000 cpu_id=11
cat-77170 [010] 5089.697219972: cpu_frequency: state=3800000 cpu_id=10
<idle>-0 [004] 5089.697313957: cpu_frequency: state=2200000 cpu_id=4
<idle>-0 [017] 5089.699129526: cpu_frequency: state=3800000 cpu_id=17
ln-77172 [016] 5089.699710505: cpu_frequency: state=3800000 cpu_id=16
<idle>-0 [022] 5089.700249275: cpu_frequency: state=2200000 cpu_id=22
<idle>-0 [023] 5089.700316449: cpu_frequency: state=2200000 cpu_id=23
<idle>-0 [009] 5089.700372223: cpu_frequency: state=2200000 cpu_id=9


This is what cpupower frequency-info returns on my system

analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 2.20 GHz - 4.67 GHz
available frequency steps: 3.80 GHz, 2.80 GHz, 2.20 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: frequency should be within 2.20 GHz and 3.80 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 2.20 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: no


To my understanding schedutil doesn't know about turbo frequencies, but we can
read the current freq via counters, which I think /proc/cpuinfo uses. The trace
events will only show 3.8GHz as that's what we know about, but /proc/cpuinfo
shows above 4GHz if I run a single-threaded workload, similar to what you've
seen. Again, with and without the patches reverted.

I have to say, I am still on tip/sched/core. I probably should switch to your
tree to make sure there are no unexpected differences that could affect
reproducibility.


Cheers

--
Qais Yousef

>
> So the "empty make" is just my quick test that happens to be
> single-threaded and should take just 20s. All my real builds slow down
> too, because all CPUs stay at the minimum frequency.
>
> And I just verified that Ingo's revert that only reverts two commits
> (commit 60ee1706bd11 in the tip tree), makes things work correctly for
> me.
>
> Not surprising, since the bisection clearly pointed at just commit
> 9c0b4bb7f6303c being the one that caused the issue, but I decided to
> just double-check anyway.
>
> So with that revert, for the single-threaded case I see 4GHz+ numbers
> (they spread from a single CPU to multiple CPUs once you run the
> benchmark a few times).
>
> And then when I run a full parallel build (rather than the
> single-threaded empty one), the frequencies drop to ~3.85GHz for the
> all-cpu case.
>
> Linus

2024-01-13 01:25:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 17:04, Qais Yousef <[email protected]> wrote:
>
> That is odd. I can't see how the patch can cause this yet, could you try with
> a different compiler if possible?

I use two different compilers - I do my allmodconfig builds with gcc,
and the kernels I boot with clang, so my problems have been with a
kernel built with

clang version 17.0.6

but to check that it's not a compiler issue I just did another try
with my current public tip of tree (ie *without* any reverts for this
issue) and gcc:

gcc version 13.2.1

and the behavior is exactly the same: all cores are stuck at 2.2GHz.

So no, it's not compiler-dependent.

> I usually use perfetto but it should be easy to see frequency updates from
> power/cpu_frequency trace event.
>
> echo 1 | sudo tee /sys/kernel/tracing/tracing_on
> echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
> sudo cat /sys/kernel/tracing/trace

Shows absolutely nothing. Or rather, it shows the header with

# entries-in-buffer/entries-written: 0/0 #P:64

and that's it.

With a *working* kernel, I get events, setting the frequency to either
2.2GHz (idle) or 3.8GHz (work).

IOW, the tracing output is 100% consistent with "that commit breaks everything".

Linus

2024-01-13 01:31:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
<[email protected]> wrote:
>
> With a *working* kernel, I get events, setting the frequency to either
> 2.2GHz (idle) or 3.8GHz (work).

Just to correct that: not 3.8GHz, but in addition to 2.2 I see 2.8 or 3.7:

...
<idle>-0 [034] d.s.. 208.340412: cpu_frequency: state=2200000 cpu_id=34
cc1-101686 [034] d.h.. 208.342402: cpu_frequency: state=2800000 cpu_id=34
cc1-101686 [034] d.h.. 208.343401: cpu_frequency: state=3700000 cpu_id=34
sh-108794 [029] d.h.. 216.401014: cpu_frequency: state=2200000 cpu_id=29
sh-108794 [029] d.... 216.402670: cpu_frequency: state=2800000 cpu_id=29
genksyms-108565 [029] d.h.. 216.404005: cpu_frequency: state=3700000 cpu_id=29
...

etc.

Linus

2024-01-13 10:47:50

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Friday 12 Jan 2024 at 17:31:13 (-0800), Linus Torvalds wrote:
> On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
> <[email protected]> wrote:
> >
> > With a *working* kernel, I get events, setting the frequency to either
> > 2.2GHz (idle) or 3.8GHz (work).
>
> Just to fix that - not 3.8Ghz, but in addition to 2.2 I see 2.8 or 3.7:

IIUC, with the commit you stay at the lowest frequency all the time, which is
clearly weird. One change that could create such behavior is in
sugov_effective_cpu_perf(), where we take the min between the actual
utilization and the max allowed. If max is set to the min capacity for
whatever reason, then you stay stuck at the lowest capacity/frequency.

What is the output of
/sys/devices/system/cpu/cpu0/cpu_capacity

It should be 1024 or close

Could you try the below? It skips this step and ensures that the actual
utilization is used to select the frequency.

---
kernel/sched/cpufreq_schedutil.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..e420e2ee1a10 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -181,7 +181,6 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
/* Add dvfs headroom to actual utilization */
actual = map_util_perf(actual);
/* Actually we don't need to target the max performance */
- if (actual < max)
max = actual;

/*
--
2.34.1

>
> ...
> <idle>-0 [034] d.s.. 208.340412: cpu_frequency: state=2200000 cpu_id=34
> cc1-101686 [034] d.h.. 208.342402: cpu_frequency: state=2800000 cpu_id=34
> cc1-101686 [034] d.h.. 208.343401: cpu_frequency: state=3700000 cpu_id=34
> sh-108794 [029] d.h.. 216.401014: cpu_frequency: state=2200000 cpu_id=29
> sh-108794 [029] d.... 216.402670: cpu_frequency: state=2800000 cpu_id=29
> genksyms-108565 [029] d.h.. 216.404005: cpu_frequency: state=3700000 cpu_id=29
> ...
>
> etc.
>
> Linus

2024-01-13 18:34:09

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/13/24 11:47, Vincent Guittot wrote:
> On Friday 12 Jan 2024 at 17:31:13 (-0800), Linus Torvalds wrote:
> > On Fri, 12 Jan 2024 at 17:24, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > With a *working* kernel, I get events, setting the frequency to either
> > > 2.2GHz (idle) or 3.8GHz (work).
> >
> > Just to fix that - not 3.8Ghz, but in addition to 2.2 I see 2.8 or 3.7:
>
> IIUC, with the commit you stay at lowest frequency all time which is clearly
> weird. One change that could create such behavior is in
> sugov_effective_cpu_perf() where we take the min between the actual
> utilization and the max allowed. If max is set to min capacity for whatever
> the reason, then you stay stuck to lowest capacity/frequency

I tried on Linus' ToT (052d534373b7) with and without CONFIG_UCLAMP_TASK=y and
I still can't reproduce.

>
> What is the output of
> /sys/devices/system/cpu/cpu0/cpu_capacity

Since it's SMP, I don't see these generated.

There's an acpi_cppc node though. For CPU0

$ grep . /sys/devices/system/cpu/cpu0/acpi_cppc/*
/sys/devices/system/cpu/cpu0/acpi_cppc/feedback_ctrs:ref:500461208742 del:522909165272
/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf:216
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq:550
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf:62
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf:20
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq:3801
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf:135
/sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf:135
/sys/devices/system/cpu/cpu0/acpi_cppc/wraparound_time:18446744073709551615

highest_perf for all CPUs

$ grep . /sys/devices/system/cpu/cpu*/acpi_cppc/highest_perf
/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf:216
/sys/devices/system/cpu/cpu10/acpi_cppc/highest_perf:166
/sys/devices/system/cpu/cpu11/acpi_cppc/highest_perf:176
/sys/devices/system/cpu/cpu12/acpi_cppc/highest_perf:216
/sys/devices/system/cpu/cpu13/acpi_cppc/highest_perf:216
/sys/devices/system/cpu/cpu14/acpi_cppc/highest_perf:211
/sys/devices/system/cpu/cpu15/acpi_cppc/highest_perf:206
/sys/devices/system/cpu/cpu16/acpi_cppc/highest_perf:201
/sys/devices/system/cpu/cpu17/acpi_cppc/highest_perf:196
/sys/devices/system/cpu/cpu18/acpi_cppc/highest_perf:191
/sys/devices/system/cpu/cpu19/acpi_cppc/highest_perf:186
/sys/devices/system/cpu/cpu1/acpi_cppc/highest_perf:216
/sys/devices/system/cpu/cpu20/acpi_cppc/highest_perf:181
/sys/devices/system/cpu/cpu21/acpi_cppc/highest_perf:171
/sys/devices/system/cpu/cpu22/acpi_cppc/highest_perf:166
/sys/devices/system/cpu/cpu23/acpi_cppc/highest_perf:176
/sys/devices/system/cpu/cpu2/acpi_cppc/highest_perf:211
/sys/devices/system/cpu/cpu3/acpi_cppc/highest_perf:206
/sys/devices/system/cpu/cpu4/acpi_cppc/highest_perf:201
/sys/devices/system/cpu/cpu5/acpi_cppc/highest_perf:196
/sys/devices/system/cpu/cpu6/acpi_cppc/highest_perf:191
/sys/devices/system/cpu/cpu7/acpi_cppc/highest_perf:186
/sys/devices/system/cpu/cpu8/acpi_cppc/highest_perf:181
/sys/devices/system/cpu/cpu9/acpi_cppc/highest_perf:171

Can we somehow overshoot now and hit a bug in the frequency translation in
cpufreq, by any chance?

>
> It should be 1024 or close
>
> Could you try the below ? it skips this step and ensures to use the actual
> utilization to select the frequency
>
> ---
> kernel/sched/cpufreq_schedutil.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..e420e2ee1a10 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -181,7 +181,6 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> /* Add dvfs headroom to actual utilization */
> actual = map_util_perf(actual);
> /* Actually we don't need to target the max performance */
> - if (actual < max)
> max = actual;
>
> /*
> --
> 2.34.1

2024-01-13 18:38:09

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/12/24 17:24, Linus Torvalds wrote:
> On Fri, 12 Jan 2024 at 17:04, Qais Yousef <[email protected]> wrote:
> >
> > That is odd. I can't see how the patch can cause this yet, could you try with
> > a different compiler if possible?
>
> I use two different compilers - I do my allmodconfig builds with gcc,
> and the kernels I boot with clang, so my problems have been with a
> kernel built with
>
> clang version 17.0.6
>
> but to check that it's not a compiler issue I just did another try
> with my current public tip of tree (ie *without* any reverts for this
> issue) and gcc:
>
> gcc version 13.2.1
>
> and the behavior is exactly the same: all cores are stuck at 2.2GHz.
>
> So no, it's not compiler-dependent.
>
> > I usually use perfetto but it should be easy to see frequency updates from
> > power/cpu_frequency trace event.
> >
> > echo 1 | sudo tee /sys/kernel/tracing/tracing_on
> > echo 1 | sudo tee /sys/kernel/tracing/events/power/cpu_frequency/enable
> > sudo cat /sys/kernel/tracing/trace
>
> Shows absolutely nothing. Or rather, it shows the header with
>
> # entries-in-buffer/entries-written: 0/0 #P:64
>
> and that's it.
>
> With a *working* kernel, I get events, setting the frequency to either
> 2.2GHz (idle) or 3.8GHz (work).
>
> IOW, the tracing output is 100% consistent with "that commit breaks everything".

Thanks for all the info. It seems that, for some reason, we no longer think we
need to update the frequency. Not sure why yet. It's strange that I can't
reproduce it either, even running from the top of your tree. I have run with
these patches on M1, Pixel, Intel and now AMD, and haven't noticed anything
since they got merged.
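
For reference, the guard that can swallow updates like this is the early
return in get_next_freq() (quoting kernel/sched/cpufreq_schedutil.c as it
stands here, not a confirmed diagnosis): if the raw frequency we compute never
changes, cpufreq is never asked to switch and no cpu_frequency event gets
emitted:

        if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
                return sg_policy->next_freq;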

2024-01-14 09:12:59

by Wyes Karny

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> <[email protected]> wrote:
> >
> > It's one of these two:
> >
> > f12560779f9d sched/cpufreq: Rework iowait boost
> > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> >
> > one more boot to go, then I'll try to revert whichever causes my
> > machine to perform horribly much worse.
>
> I guess it should come as no surprise that the result is
>
> 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
>
> but to revert cleanly I will have to revert all of
>
> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> f12560779f9d ("sched/cpufreq: Rework iowait boost")
> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> performance estimation")
>
> This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
>
> I'll keep that revert in my private test-tree for now (so that I have
> a working machine again), but I'll move it to my main branch soon
> unless somebody has a quick fix for this problem.

Hi Linus,

I'm able to reproduce this issue on my AMD Ryzen 5600G system, but only if
I disable CPPC in the BIOS and boot with acpi-cpufreq + schedutil. (I believe
CPPC is disabled in your case too, since the "_CPC object is not present" log
message appeared.) With CPPC enabled in the BIOS, the issue is not seen on my
system. For AMD, acpi-cpufreq also uses the _CPC object to determine the boost
ratio. When CPPC is disabled in the BIOS, something goes wrong and the max
capacity becomes zero.

Hi Vincent, Qais,

I have collected some data with bpftrace:

sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'

with running: taskset -c 1 make

issue case:

Attaching 3 probes...
@eff_util:
[0, 50) 1263 |@ |
[50, 100) 517 | |
[100, 150) 233 | |
[150, 200) 297 | |
[200, 250) 162 | |
[250, 300) 98 | |
[300, 350) 75 | |
[350, 400) 205 | |
[400, 450) 210 | |
[450, 500) 16 | |
[500, 550) 1532 |@ |
[550, 600) 1026 | |
[600, 650) 761 | |
[650, 700) 876 | |
[700, 750) 1085 | |
[750, 800) 891 | |
[800, 850) 816 | |
[850, 900) 983 | |
[900, 950) 661 | |
[950, 1000) 759 | |
[1000, 1050) 57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_eff_util:
[0, 50) 1074 | |
[50, 100) 571 | |
[100, 150) 259 | |
[150, 200) 169 | |
[200, 250) 237 | |
[250, 300) 156 | |
[300, 350) 91 | |
[350, 400) 46 | |
[400, 450) 52 | |
[450, 500) 195 | |
[500, 550) 175 | |
[550, 600) 46 | |
[600, 650) 493 | |
[650, 700) 1424 |@ |
[700, 750) 646 | |
[750, 800) 628 | |
[800, 850) 612 | |
[850, 900) 840 | |
[900, 950) 893 | |
[950, 1000) 640 | |
[1000, 1050) 60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_freq:
[1400000, 1500000) 69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_max_cap:
[0, 2) 69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


good case:

Attaching 3 probes...
@eff_util:
[0, 50) 246 |@ |
[50, 100) 150 |@ |
[100, 150) 191 |@ |
[150, 200) 239 |@ |
[200, 250) 117 | |
[250, 300) 2101 |@@@@@@@@@@@@@@@ |
[300, 350) 2284 |@@@@@@@@@@@@@@@@ |
[350, 400) 713 |@@@@@ |
[400, 450) 151 |@ |
[450, 500) 154 |@ |
[500, 550) 1121 |@@@@@@@@ |
[550, 600) 1901 |@@@@@@@@@@@@@ |
[600, 650) 1208 |@@@@@@@@ |
[650, 700) 606 |@@@@ |
[700, 750) 557 |@@@ |
[750, 800) 872 |@@@@@@ |
[800, 850) 1092 |@@@@@@@ |
[850, 900) 1416 |@@@@@@@@@@ |
[900, 950) 1107 |@@@@@@@ |
[950, 1000) 1051 |@@@@@@@ |
[1000, 1050) 7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_eff_util:
[0, 50) 241 | |
[50, 100) 149 | |
[100, 150) 72 | |
[150, 200) 95 | |
[200, 250) 43 | |
[250, 300) 49 | |
[300, 350) 19 | |
[350, 400) 56 | |
[400, 450) 22 | |
[450, 500) 29 | |
[500, 550) 1840 |@@@@@@ |
[550, 600) 1476 |@@@@@ |
[600, 650) 1027 |@@@ |
[650, 700) 473 |@ |
[700, 750) 366 |@ |
[750, 800) 627 |@@ |
[800, 850) 930 |@@@ |
[850, 900) 1285 |@@@@ |
[900, 950) 971 |@@@ |
[950, 1000) 946 |@@@ |
[1000, 1050) 13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_freq:
[1400000, 1500000) 648 |@ |
[1500000, 1600000) 0 | |
[1600000, 1700000) 0 | |
[1700000, 1800000) 25 | |
[1800000, 1900000) 0 | |
[1900000, 2000000) 0 | |
[2000000, 2100000) 0 | |
[2100000, 2200000) 0 | |
[2200000, 2300000) 0 | |
[2300000, 2400000) 0 | |
[2400000, 2500000) 0 | |
[2500000, 2600000) 0 | |
[2600000, 2700000) 0 | |
[2700000, 2800000) 0 | |
[2800000, 2900000) 0 | |
[2900000, 3000000) 0 | |
[3000000, 3100000) 0 | |
[3100000, 3200000) 0 | |
[3200000, 3300000) 0 | |
[3300000, 3400000) 0 | |
[3400000, 3500000) 0 | |
[3500000, 3600000) 0 | |
[3600000, 3700000) 0 | |
[3700000, 3800000) 0 | |
[3800000, 3900000) 0 | |
[3900000, 4000000) 23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@sugov_max_cap:
[0, 2) 24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

In both cases max_cap is zero, but the selected freq is incorrect in the bad case.

Thanks,
Wyes


2024-01-14 11:18:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

Hi Wyes,

On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > It's one of these two:
> > >
> > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > >
> > > one more boot to go, then I'll try to revert whichever causes my
> > > machine to perform horribly much worse.
> >
> > I guess it should come as no surprise that the result is
> >
> > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> >
> > but to revert cleanly I will have to revert all of
> >
> > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > performance estimation")
> >
> > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
>
> Hi Linus,
>
> I'm able to reproduce this issue with my AMD Ryzen 5600G system. But
> only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> (I believe for your case also CPPC is diabled as log "_CPC object is not
> present" came). Enabling CPPC in BIOS issue not seen in my system. For
> AMD acpi-cpufreq also uses _CPC object to determine the boost ratio.
> When CPPC is disabled in BIOS something is going wrong and max
> capacity is becoming zero.
>
> Hi Vincent, Qais,
>
> I have collected some data with bpftracing:

Thanks for your test results

>
> sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'
>
> with running: taskset -c 1 make
>
> issue case:
>
> Attaching 3 probes...
> @eff_util:
> [0, 50) 1263 |@ |
> [50, 100) 517 | |
> [100, 150) 233 | |
> [150, 200) 297 | |
> [200, 250) 162 | |
> [250, 300) 98 | |
> [300, 350) 75 | |
> [350, 400) 205 | |
> [400, 450) 210 | |
> [450, 500) 16 | |
> [500, 550) 1532 |@ |
> [550, 600) 1026 | |
> [600, 650) 761 | |
> [650, 700) 876 | |
> [700, 750) 1085 | |
> [750, 800) 891 | |
> [800, 850) 816 | |
> [850, 900) 983 | |
> [900, 950) 661 | |
> [950, 1000) 759 | |
> [1000, 1050) 57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>

ok so the output of effective_cpu_util() seems correct, or at least reaches the
max utilization value. For it to be correct, it means that
arch_scale_cpu_capacity(cpu) is not zero, because of:

unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long *min,
unsigned long *max)
{
unsigned long util, irq, scale;
struct rq *rq = cpu_rq(cpu);

scale = arch_scale_cpu_capacity(cpu);

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
if (unlikely(irq >= scale)) {
if (min)
*min = scale;
if (max)
*max = scale;
return scale;
}
..
}

If arch_scale_cpu_capacity(cpu) returned 0, then effective_cpu_util() should
return 0 too.

Now see below

> @sugov_eff_util:
> [0, 50) 1074 | |
> [50, 100) 571 | |
> [100, 150) 259 | |
> [150, 200) 169 | |
> [200, 250) 237 | |
> [250, 300) 156 | |
> [300, 350) 91 | |
> [350, 400) 46 | |
> [400, 450) 52 | |
> [450, 500) 195 | |
> [500, 550) 175 | |
> [550, 600) 46 | |
> [600, 650) 493 | |
> [650, 700) 1424 |@ |
> [700, 750) 646 | |
> [750, 800) 628 | |
> [800, 850) 612 | |
> [850, 900) 840 | |
> [900, 950) 893 | |
> [950, 1000) 640 | |
> [1000, 1050) 60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_freq:
> [1400000, 1500000) 69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_max_cap:
> [0, 2) 69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

In get_next_freq(struct sugov_policy *sg_policy, unsigned long util, unsigned long max)

max is 0, and we come from this path:

static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
unsigned int flags)
{

..
max_cap = arch_scale_cpu_capacity(sg_cpu->cpu);

if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
return;

next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
..

so here arch_scale_cpu_capacity(sg_cpu->cpu) returns 0 ...

AFAICT, the AMD platform uses the default
static __always_inline
unsigned long arch_scale_cpu_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}

I'm missing something here

>
>
> good case:
>
> Attaching 3 probes...
> @eff_util:
> [0, 50) 246 |@ |
> [50, 100) 150 |@ |
> [100, 150) 191 |@ |
> [150, 200) 239 |@ |
> [200, 250) 117 | |
> [250, 300) 2101 |@@@@@@@@@@@@@@@ |
> [300, 350) 2284 |@@@@@@@@@@@@@@@@ |
> [350, 400) 713 |@@@@@ |
> [400, 450) 151 |@ |
> [450, 500) 154 |@ |
> [500, 550) 1121 |@@@@@@@@ |
> [550, 600) 1901 |@@@@@@@@@@@@@ |
> [600, 650) 1208 |@@@@@@@@ |
> [650, 700) 606 |@@@@ |
> [700, 750) 557 |@@@ |
> [750, 800) 872 |@@@@@@ |
> [800, 850) 1092 |@@@@@@@ |
> [850, 900) 1416 |@@@@@@@@@@ |
> [900, 950) 1107 |@@@@@@@ |
> [950, 1000) 1051 |@@@@@@@ |
> [1000, 1050) 7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_eff_util:
> [0, 50) 241 | |
> [50, 100) 149 | |
> [100, 150) 72 | |
> [150, 200) 95 | |
> [200, 250) 43 | |
> [250, 300) 49 | |
> [300, 350) 19 | |
> [350, 400) 56 | |
> [400, 450) 22 | |
> [450, 500) 29 | |
> [500, 550) 1840 |@@@@@@ |
> [550, 600) 1476 |@@@@@ |
> [600, 650) 1027 |@@@ |
> [650, 700) 473 |@ |
> [700, 750) 366 |@ |
> [750, 800) 627 |@@ |
> [800, 850) 930 |@@@ |
> [850, 900) 1285 |@@@@ |
> [900, 950) 971 |@@@ |
> [950, 1000) 946 |@@@ |
> [1000, 1050) 13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_freq:
> [1400000, 1500000) 648 |@ |
> [1500000, 1600000) 0 | |
> [1600000, 1700000) 0 | |
> [1700000, 1800000) 25 | |
> [1800000, 1900000) 0 | |
> [1900000, 2000000) 0 | |
> [2000000, 2100000) 0 | |
> [2100000, 2200000) 0 | |
> [2200000, 2300000) 0 | |
> [2300000, 2400000) 0 | |
> [2400000, 2500000) 0 | |
> [2500000, 2600000) 0 | |
> [2600000, 2700000) 0 | |
> [2700000, 2800000) 0 | |
> [2800000, 2900000) 0 | |
> [2900000, 3000000) 0 | |
> [3000000, 3100000) 0 | |
> [3100000, 3200000) 0 | |
> [3200000, 3300000) 0 | |
> [3300000, 3400000) 0 | |
> [3400000, 3500000) 0 | |
> [3500000, 3600000) 0 | |
> [3600000, 3700000) 0 | |
> [3700000, 3800000) 0 | |
> [3800000, 3900000) 0 | |
> [3900000, 4000000) 23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_max_cap:
> [0, 2) 24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> In both case max_cap is zero but selected freq is incorrect in bad case.

Also we have in get_next_freq():
freq = map_util_freq(util, freq, max);
--> util * freq / max

If max were 0, we should have seen an error?

There is something strange that I don't understand

Could you trace the value of sg_cpu->util on the return of sugov_get_util()?

Thanks for your help
Vincent

>
> Thanks,
> Wyes
>

2024-01-14 12:38:16

by Wyes Karny

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> Hi Wyes,
>
> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > It's one of these two:
> > > >
> > > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > >
> > > > one more boot to go, then I'll try to revert whichever causes my
> > > > machine to perform horribly much worse.
> > >
> > > I guess it should come as no surprise that the result is
> > >
> > > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > >
> > > but to revert cleanly I will have to revert all of
> > >
> > > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > performance estimation")
> > >
> > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > >
> > > I'll keep that revert in my private test-tree for now (so that I have
> > > a working machine again), but I'll move it to my main branch soon
> > > unless somebody has a quick fix for this problem.
> >
> > Hi Linus,
> >
> > I'm able to reproduce this issue with my AMD Ryzen 5600G system. But
> > only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> > (I believe for your case also CPPC is diabled as log "_CPC object is not
> > present" came). Enabling CPPC in BIOS issue not seen in my system. For
> > AMD acpi-cpufreq also uses _CPC object to determine the boost ratio.
> > When CPPC is disabled in BIOS something is going wrong and max
> > capacity is becoming zero.
> >
> > Hi Vincent, Qais,
> >
> > I have collected some data with bpftracing:
>
> Thanks for your tests results
>
> >
> > sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'
> >
> > with running: taskset -c 1 make
> >
> > issue case:
> >
> > Attaching 3 probes...
> > @eff_util:
> > [0, 50) 1263 |@ |
> > [50, 100) 517 | |
> > [100, 150) 233 | |
> > [150, 200) 297 | |
> > [200, 250) 162 | |
> > [250, 300) 98 | |
> > [300, 350) 75 | |
> > [350, 400) 205 | |
> > [400, 450) 210 | |
> > [450, 500) 16 | |
> > [500, 550) 1532 |@ |
> > [550, 600) 1026 | |
> > [600, 650) 761 | |
> > [650, 700) 876 | |
> > [700, 750) 1085 | |
> > [750, 800) 891 | |
> > [800, 850) 816 | |
> > [850, 900) 983 | |
> > [900, 950) 661 | |
> > [950, 1000) 759 | |
> > [1000, 1050) 57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
>
> ok so the output of effective_cpu_util() seems correct, or at least reaches the
> max utilization value. For it to be correct, it means that
> arch_scale_cpu_capacity(cpu) is not zero, because of:
>
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> unsigned long *min,
> unsigned long *max)
> {
> unsigned long util, irq, scale;
> struct rq *rq = cpu_rq(cpu);
>
> scale = arch_scale_cpu_capacity(cpu);
>
> /*
> * Early check to see if IRQ/steal time saturates the CPU, can be
> * because of inaccuracies in how we track these -- see
> * update_irq_load_avg().
> */
> irq = cpu_util_irq(rq);
> if (unlikely(irq >= scale)) {
> if (min)
> *min = scale;
> if (max)
> *max = scale;
> return scale;
> }
> ...
> }
>
> If arch_scale_cpu_capacity(cpu) returns 0 then effective_cpu_util() should returns
> 0 too.
>
> Now see below
>
> > @sugov_eff_util:
> > [0, 50) 1074 | |
> > [50, 100) 571 | |
> > [100, 150) 259 | |
> > [150, 200) 169 | |
> > [200, 250) 237 | |
> > [250, 300) 156 | |
> > [300, 350) 91 | |
> > [350, 400) 46 | |
> > [400, 450) 52 | |
> > [450, 500) 195 | |
> > [500, 550) 175 | |
> > [550, 600) 46 | |
> > [600, 650) 493 | |
> > [650, 700) 1424 |@ |
> > [700, 750) 646 | |
> > [750, 800) 628 | |
> > [800, 850) 612 | |
> > [850, 900) 840 | |
> > [900, 950) 893 | |
> > [950, 1000) 640 | |
> > [1000, 1050) 60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > @sugov_freq:
> > [1400000, 1500000) 69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > @sugov_max_cap:
> > [0, 2) 69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> In get_next_freq(struct sugov_policy *sg_policy, unsigned long util, unsigned long max)
>
> max is 0 and we comes from this path:
>
> static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> unsigned int flags)
> {
>
> ...
> max_cap = arch_scale_cpu_capacity(sg_cpu->cpu);
>
> if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
> return;
>
> next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
> ...
>
> so here arch_scale_cpu_capacity(sg_cpu->cpu) returns 0 ...
>
> AFAICT, AMD platform uses the default
> static __always_inline
> unsigned long arch_scale_cpu_capacity(int cpu)
> {
> return SCHED_CAPACITY_SCALE;
> }
>
> I'm missing something here
>
> >
> >
> > good case:
> >
> > Attaching 3 probes...
> > @eff_util:
> > [0, 50) 246 |@ |
> > [50, 100) 150 |@ |
> > [100, 150) 191 |@ |
> > [150, 200) 239 |@ |
> > [200, 250) 117 | |
> > [250, 300) 2101 |@@@@@@@@@@@@@@@ |
> > [300, 350) 2284 |@@@@@@@@@@@@@@@@ |
> > [350, 400) 713 |@@@@@ |
> > [400, 450) 151 |@ |
> > [450, 500) 154 |@ |
> > [500, 550) 1121 |@@@@@@@@ |
> > [550, 600) 1901 |@@@@@@@@@@@@@ |
> > [600, 650) 1208 |@@@@@@@@ |
> > [650, 700) 606 |@@@@ |
> > [700, 750) 557 |@@@ |
> > [750, 800) 872 |@@@@@@ |
> > [800, 850) 1092 |@@@@@@@ |
> > [850, 900) 1416 |@@@@@@@@@@ |
> > [900, 950) 1107 |@@@@@@@ |
> > [950, 1000) 1051 |@@@@@@@ |
> > [1000, 1050) 7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > @sugov_eff_util:
> > [0, 50) 241 | |
> > [50, 100) 149 | |
> > [100, 150) 72 | |
> > [150, 200) 95 | |
> > [200, 250) 43 | |
> > [250, 300) 49 | |
> > [300, 350) 19 | |
> > [350, 400) 56 | |
> > [400, 450) 22 | |
> > [450, 500) 29 | |
> > [500, 550) 1840 |@@@@@@ |
> > [550, 600) 1476 |@@@@@ |
> > [600, 650) 1027 |@@@ |
> > [650, 700) 473 |@ |
> > [700, 750) 366 |@ |
> > [750, 800) 627 |@@ |
> > [800, 850) 930 |@@@ |
> > [850, 900) 1285 |@@@@ |
> > [900, 950) 971 |@@@ |
> > [950, 1000) 946 |@@@ |
> > [1000, 1050) 13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > @sugov_freq:
> > [1400000, 1500000) 648 |@ |
> > [1500000, 1600000) 0 | |
> > [1600000, 1700000) 0 | |
> > [1700000, 1800000) 25 | |
> > [1800000, 1900000) 0 | |
> > [1900000, 2000000) 0 | |
> > [2000000, 2100000) 0 | |
> > [2100000, 2200000) 0 | |
> > [2200000, 2300000) 0 | |
> > [2300000, 2400000) 0 | |
> > [2400000, 2500000) 0 | |
> > [2500000, 2600000) 0 | |
> > [2600000, 2700000) 0 | |
> > [2700000, 2800000) 0 | |
> > [2800000, 2900000) 0 | |
> > [2900000, 3000000) 0 | |
> > [3000000, 3100000) 0 | |
> > [3100000, 3200000) 0 | |
> > [3200000, 3300000) 0 | |
> > [3300000, 3400000) 0 | |
> > [3400000, 3500000) 0 | |
> > [3500000, 3600000) 0 | |
> > [3600000, 3700000) 0 | |
> > [3700000, 3800000) 0 | |
> > [3800000, 3900000) 0 | |
> > [3900000, 4000000) 23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > @sugov_max_cap:
> > [0, 2) 24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > In both case max_cap is zero but selected freq is incorrect in bad case.
>
> Also we have in get_next_freq():
> freq = map_util_freq(util, freq, max);
> --> util * freq /max
>
> If max was 0, we should have been an error ?
>
> There is something strange that I don't understand
>
> Could you trace on the return of sugov_get_util()
> the value of sg_cpu->util ?

Yeah, correct: something was wrong in the bpftrace readings; max_cap is
not zero in the traces.

git-5511 [001] d.h1. 427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
git-5511 [001] d.h1. 427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
git-5511 [001] d.h1. 427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
git-5511 [001] d.h1. 427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
git-5511 [001] d.h1. 427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
git-5511 [001] d.h1. 427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
git-5511 [001] d.h1. 427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
git-5511 [001] d.h1. 427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
git-5511 [001] d.h1. 427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
git-5511 [001] d.s1. 427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
git-5511 [001] d.s1. 427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024

Debug patch applied:

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..5c9b3e1de7a0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,

freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);
+ trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
util = max(util, boost);
sg_cpu->bw_min = min;
sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
+ trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
}

/**


So, I guess map_util_freq is going wrong somewhere.

Thanks,
Wyes
>
> Thanks for you help
> Vincent
>
> >
> > Thanks,
> > Wyes
> >

2024-01-14 13:03:08

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 14/01/2024 13:37, Wyes Karny wrote:
> On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
>> Hi Wyes,
>>
>> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
>>> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
>>>> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
>>>> <[email protected]> wrote:

[...]

> Yeah, correct something was wrong in the bpftrace readings, max_cap is
> not zero in traces.
>
> git-5511 [001] d.h1. 427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.s1. 427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.s1. 427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>
> Debug patch applied:
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..5c9b3e1de7a0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
> freq = get_capacity_ref_freq(policy);
> freq = map_util_freq(util, freq, max);
> + trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;
> @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> util = max(util, boost);
> sg_cpu->bw_min = min;
> sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> + trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> }
>
> /**
>
>
> So, I guess map_util_freq going wrong somewhere.

sugov_update_single_freq() -> get_next_freq() -> get_capacity_ref_freq()

Is arch_scale_freq_invariant() true in both cases, so that
get_capacity_ref_freq() returns 'policy->cpuinfo.max_freq' and not just
'policy->cur'?
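
For reference, a sketch of that function as it stands in this tree (assuming
arch_scale_freq_ref() is not wired up on these x86 systems, so the first
branch is not taken):

        unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
        {
                unsigned int freq = arch_scale_freq_ref(policy->cpu);

                if (freq)
                        return freq;

                if (arch_scale_freq_invariant())
                        return policy->cpuinfo.max_freq;

                return policy->cur;
        }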


2024-01-14 13:03:38

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 13:38, Wyes Karny <[email protected]> wrote:
>
> On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> > Hi Wyes,
> >
> > On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> > > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > <[email protected]> wrote:
> > > > >
> > > > > It's one of these two:
> > > > >
> > > > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > >
> > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > machine to perform horribly much worse.
> > > >
> > > > I guess it should come as no surprise that the result is
> > > >
> > > > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > >
> > > > but to revert cleanly I will have to revert all of
> > > >
> > > > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > performance estimation")
> > > >
> > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > >
> > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > a working machine again), but I'll move it to my main branch soon
> > > > unless somebody has a quick fix for this problem.
> > >
> > > Hi Linus,
> > >
> > > I'm able to reproduce this issue with my AMD Ryzen 5600G system. But
> > > only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> > > (I believe for your case also CPPC is diabled as log "_CPC object is not
> > > present" came). Enabling CPPC in BIOS issue not seen in my system. For
> > > AMD acpi-cpufreq also uses _CPC object to determine the boost ratio.
> > > When CPPC is disabled in BIOS something is going wrong and max
> > > capacity is becoming zero.
> > >
> > > Hi Vincent, Qais,
> > >

..

> >
> > There is something strange that I don't understand
> >
> > Could you trace on the return of sugov_get_util()
> > the value of sg_cpu->util ?
>
> Yeah, correct something was wrong in the bpftrace readings, max_cap is
> not zero in traces.
>
> git-5511 [001] d.h1. 427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.h1. 427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.h1. 427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> git-5511 [001] d.s1. 427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> git-5511 [001] d.s1. 427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
>
> Debug patch applied:
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..5c9b3e1de7a0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>
> freq = get_capacity_ref_freq(policy);
> freq = map_util_freq(util, freq, max);
> + trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;
> @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> util = max(util, boost);
> sg_cpu->bw_min = min;
> sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> + trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> }
>
> /**
>
>
> So, I guess map_util_freq going wrong somewhere.

Thanks for the trace. It was really helpful and I think that I got the
root cause.

The problem comes from get_capacity_ref_freq(), which returns the current
freq when arch_scale_freq_invariant() is not enabled, combined with the fact
that we now apply map_util_perf() earlier in the path, where the result is
then capped by the max capacity.
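
Concretely (a sketch with illustrative numbers; without invariance
get_capacity_ref_freq() returns policy->cur, and map_util_freq() computes
freq * util / max):

        unsigned long util = 1024, max = 1024;  /* util already capped at max */
        unsigned long cur  = 1400000;           /* policy->cur: lowest OPP, kHz */
        unsigned long freq = cur * util / max;  /* = 1400000, never above cur */

so the requested frequency can never rise above the current one, and we stay
pinned at the lowest OPP.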

Could you try the below ?

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e420e2ee1a10..611c621543f4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
cpufreq_policy *policy)
if (arch_scale_freq_invariant())
return policy->cpuinfo.max_freq;

- return policy->cur;
+ return policy->cur + policy->cur >> 2;
}

/**



>
> Thanks,
> Wyes
> >
> > Thanks for you help
> > Vincent
> >
> > >
> > > Thanks,
> > > Wyes
> > >

2024-01-14 13:05:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 14:02, Dietmar Eggemann <[email protected]> wrote:
>
> On 14/01/2024 13:37, Wyes Karny wrote:
> > On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> >> Hi Wyes,
> >>
> >> On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> >>> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> >>>> On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> >>>> <[email protected]> wrote:
>
> [...]
>
> > Yeah, correct something was wrong in the bpftrace readings, max_cap is
> > not zero in traces.
> >
> > git-5511 [001] d.h1. 427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.s1. 427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.s1. 427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >
> > Debug patch applied:
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..5c9b3e1de7a0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> > freq = get_capacity_ref_freq(policy);
> > freq = map_util_freq(util, freq, max);
> > + trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
> >
> > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > return sg_policy->next_freq;
> > @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > util = max(util, boost);
> > sg_cpu->bw_min = min;
> > sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> > + trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> > }
> >
> > /**
> >
> >
> > So, I guess map_util_freq going wrong somewhere.
>
> sugov_update_single_freq() -> get_next_freq() -> get_capacity_ref_freq()
>
> Is arch_scale_freq_invariant() true in both cases, so that
> get_capacity_ref_freq() returns 'policy->cpuinfo.max_freq' and not just
> 'policy->cur'?

That's also my assumption, and the change that I sent should fix it

>

2024-01-14 15:13:04

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/14/24 14:03, Vincent Guittot wrote:

> Thanks for the trace. It was really helpful and I think that I got the
> root cause.
>
> The problem comes from get_capacity_ref_freq() which returns current
> freq when arch_scale_freq_invariant() is not enable, and the fact that
> we apply map_util_perf() earlier in the path now which is then capped
> by max capacity.
>
> Could you try the below ?
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e420e2ee1a10..611c621543f4 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> cpufreq_policy *policy)
> if (arch_scale_freq_invariant())
> return policy->cpuinfo.max_freq;
>
> - return policy->cur;
> + return policy->cur + policy->cur >> 2;
> }
>
> /**

Is this a test patch or a proper fix? I can't see it being the latter. It seems
the current logic fails when util is already 1024, and I think we're trying to
fix the invariance issue too late.

Is the problem that we can't read policy->cur in the scheduler to fix the util
while it's being updated, and that's why it's done here in this case?

If this is the problem, shouldn't the logic be: if util is max, then always go
to the max frequency? I don't think we have enough info to correct the
invariance here, IIUC. All we can see is that the system is saturated at this
frequency, and whether a small jump or a big jump is required is hard to tell.

Something like this

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..473d0352030b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq;

- freq = get_capacity_ref_freq(policy);
- freq = map_util_freq(util, freq, max);
+ if (util != max) {
+ freq = get_capacity_ref_freq(policy);
+ freq = map_util_freq(util, freq, max);
+ } else {
+ freq = policy->cpuinfo.max_freq;
+ }

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;

2024-01-14 15:21:20

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 16:12, Qais Yousef <[email protected]> wrote:
>
> On 01/14/24 14:03, Vincent Guittot wrote:
>
> > Thanks for the trace. It was really helpful and I think that I got the
> > root cause.
> >
> > The problem comes from get_capacity_ref_freq() which returns current
> > freq when arch_scale_freq_invariant() is not enable, and the fact that
> > we apply map_util_perf() earlier in the path now which is then capped
> > by max capacity.
> >
> > Could you try the below ?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e420e2ee1a10..611c621543f4 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > cpufreq_policy *policy)
> > if (arch_scale_freq_invariant())
> > return policy->cpuinfo.max_freq;
> >
> > - return policy->cur;
> > + return policy->cur + policy->cur >> 2;
> > }
> >
> > /**
>
> Is this a test patch or a proper fix? I can't see it being the latter. It seems

It's a proper fix. It's the same mechanism that is already used:
- Either you add margin on the utilization to go above the current freq
before it is fully used. This is what was done previously.
- Or you add margin on the freq range to select a higher freq than the
current one before it becomes fully used.
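
A quick numeric sketch of the two options (numbers mine, using the usual ~25%
headroom):

        unsigned long cur = 1400000, util = 1024, max = 1024;

        /* margin on utilization (the old scheme, util not capped at max): */
        unsigned long f_old = cur * (util + (util >> 2)) / max;  /* = 1750000 */

        /* margin on the freq range (this fix, util capped at max): */
        unsigned long f_new = (cur + (cur >> 2)) * util / max;   /* = 1750000 */

Either way there is room to request a frequency above the current one before
the CPU is fully busy.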

> the current logic fails when util is already 1024, and I think we're trying to
> fix the invariance issue too late.
>
> Is the problem that we can't read policy->cur in the scheduler to fix the util
> while it's being updated that's why it's done here in this case?
>
> If this is the problem, shouldn't the logic be if util is max then always go to
> max frequency? I don't think we have enough info to correct the invariance here
> IIUC. All we can see the system is saturated at this frequency and whether
> a small jump or a big jump is required is hard to tell.
>
> Something like this
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..473d0352030b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> struct cpufreq_policy *policy = sg_policy->policy;
> unsigned int freq;
>
> - freq = get_capacity_ref_freq(policy);
> - freq = map_util_freq(util, freq, max);
> + if (util != max) {
> + freq = get_capacity_ref_freq(policy);
> + freq = map_util_freq(util, freq, max);
> + } else {
> + freq = policy->cpuinfo.max_freq;
> + }

This is not correct, because you would have to wait until you reach full
utilization at the current OPP, possibly the lowest one, before moving
directly to the max OPP

>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;

2024-01-14 18:11:26

by Wyes Karny

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, Jan 14, 2024 at 02:03:14PM +0100, Vincent Guittot wrote:
> On Sun, 14 Jan 2024 at 13:38, Wyes Karny <[email protected]> wrote:
> >
> > On Sun, Jan 14, 2024 at 12:18:06PM +0100, Vincent Guittot wrote:
> > > Hi Wyes,
> > >
> > > On Sunday 14 Jan 2024 at 14:42:40 (+0530), Wyes Karny wrote:
> > > > On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > > > > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > It's one of these two:
> > > > > >
> > > > > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > > > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > > > > >
> > > > > > one more boot to go, then I'll try to revert whichever causes my
> > > > > > machine to perform horribly much worse.
> > > > >
> > > > > I guess it should come as no surprise that the result is
> > > > >
> > > > > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> > > > >
> > > > > but to revert cleanly I will have to revert all of
> > > > >
> > > > > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > > > > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > > > > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > > > > performance estimation")
> > > > >
> > > > > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> > > > >
> > > > > I'll keep that revert in my private test-tree for now (so that I have
> > > > > a working machine again), but I'll move it to my main branch soon
> > > > > unless somebody has a quick fix for this problem.
> > > >
> > > > Hi Linus,
> > > >
> > > > I'm able to reproduce this issue with my AMD Ryzen 5600G system. But
> > > > only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> > > > (I believe for your case also CPPC is diabled as log "_CPC object is not
> > > > present" came). Enabling CPPC in BIOS issue not seen in my system. For
> > > > AMD acpi-cpufreq also uses _CPC object to determine the boost ratio.
> > > > When CPPC is disabled in BIOS something is going wrong and max
> > > > capacity is becoming zero.
> > > >
> > > > Hi Vincent, Qais,
> > > >
>
> ...
>
> > >
> > > There is something strange that I don't understand
> > >
> > > Could you trace on the return of sugov_get_util()
> > > the value of sg_cpu->util ?
> >
> > Yeah, correct something was wrong in the bpftrace readings, max_cap is
> > not zero in traces.
> >
> > git-5511 [001] d.h1. 427.159763: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.163733: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.163735: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.167706: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.167708: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.171678: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.171679: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.h1. 427.175653: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.h1. 427.175655: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> > git-5511 [001] d.s1. 427.175665: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> > git-5511 [001] d.s1. 427.175665: get_next_freq.constprop.0: [DEBUG] : freq 1400000, util 1024, max 1024
> >
> > Debug patch applied:
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..5c9b3e1de7a0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -166,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >
> > freq = get_capacity_ref_freq(policy);
> > freq = map_util_freq(util, freq, max);
> > + trace_printk("[DEBUG] : freq %llu, util %llu, max %llu\n", freq, util, max);
> >
> > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > return sg_policy->next_freq;
> > @@ -199,6 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > util = max(util, boost);
> > sg_cpu->bw_min = min;
> > sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> > + trace_printk("[DEBUG] : util %llu, sg_cpu->util %llu\n", util, sg_cpu->util);
> > }
> >
> > /**
> >
> >
> > So, I guess map_util_freq() is going wrong somewhere.
>
> Thanks for the trace. It was really helpful and I think that I got the
> root cause.
>
> The problem comes from get_capacity_ref_freq(), which returns the current
> freq when arch_scale_freq_invariant() is not enabled, and the fact that
> we now apply map_util_perf() earlier in the path, where it is then capped
> by the max capacity.
>
> Could you try the below ?
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e420e2ee1a10..611c621543f4 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> if (arch_scale_freq_invariant())
> return policy->cpuinfo.max_freq;
>
> - return policy->cur;
> + return policy->cur + policy->cur >> 2;
> }
>
> /**

The issue seems to be fixed with this (but slightly modified by me for arithmetic precedence):

patch:

@@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
if (arch_scale_freq_invariant())
return policy->cpuinfo.max_freq;

- return policy->cur;
+ return policy->cur + (policy->cur >> 2);
}

/**

trace:
make-7912 [001] d..2. 182.070005: sugov_get_util: [DEBUG] : util 595, sg_cpu->util 743
make-7912 [001] d..2. 182.070006: get_next_freq.constprop.0: [DEBUG] : freq 3537231, util 743, max 1024
sh-7956 [001] d..2. 182.070494: sugov_get_util: [DEBUG] : util 835, sg_cpu->util 1024
sh-7956 [001] d..2. 182.070495: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
sh-7956 [001] d..2. 182.070576: sugov_get_util: [DEBUG] : util 955, sg_cpu->util 1024
sh-7956 [001] d..2. 182.070576: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
sh-7957 [001] d.h1. 182.072120: sugov_get_util: [DEBUG] : util 990, sg_cpu->util 1024
sh-7957 [001] d.h1. 182.072121: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
nm-7957 [001] dNh1. 182.076088: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
nm-7957 [001] dNh1. 182.076089: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
grep-7958 [001] d..2. 182.076833: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
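
For reference, a minimal userspace sketch of the precedence issue (not
kernel code): in C, '+' binds tighter than '>>', so the unparenthesized
version halves the frequency instead of adding the intended 25% headroom.

#include <stdio.h>

int main(void)
{
	unsigned long cur = 2200000;	/* e.g. 2.2 GHz in kHz */

	/* parses as (cur + cur) >> 2, i.e. cur / 2 */
	printf("%lu\n", cur + cur >> 2);	/* prints 1100000 */

	/* the intended cur * 1.25 */
	printf("%lu\n", cur + (cur >> 2));	/* prints 2750000 */

	return 0;
}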


Thanks,
Wyes

2024-01-14 18:19:12

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 19:11, Wyes Karny <[email protected]> wrote:
>
> [...]
>
> The issue seems to be fixed with this (but slightly modified by me for arithmetic precedence):

Ok, thanks for your test

>
> patch:
>
> @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> if (arch_scale_freq_invariant())
> return policy->cpuinfo.max_freq;
>
> - return policy->cur;
> + return policy->cur + (policy->cur >> 2);

Fair enough.
I'll prepare a patch with the fix and add your Tested-by.

Thanks for your help

> }
>
> /**
>
> trace:
> make-7912 [001] d..2. 182.070005: sugov_get_util: [DEBUG] : util 595, sg_cpu->util 743
> make-7912 [001] d..2. 182.070006: get_next_freq.constprop.0: [DEBUG] : freq 3537231, util 743, max 1024
> sh-7956 [001] d..2. 182.070494: sugov_get_util: [DEBUG] : util 835, sg_cpu->util 1024
> sh-7956 [001] d..2. 182.070495: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
> sh-7956 [001] d..2. 182.070576: sugov_get_util: [DEBUG] : util 955, sg_cpu->util 1024
> sh-7956 [001] d..2. 182.070576: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
> sh-7957 [001] d.h1. 182.072120: sugov_get_util: [DEBUG] : util 990, sg_cpu->util 1024
> sh-7957 [001] d.h1. 182.072121: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
> nm-7957 [001] dNh1. 182.076088: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
> nm-7957 [001] dNh1. 182.076089: get_next_freq.constprop.0: [DEBUG] : freq 4875000, util 1024, max 1024
> grep-7958 [001] d..2. 182.076833: sugov_get_util: [DEBUG] : util 1024, sg_cpu->util 1024
>
>
> Thanks,
> Wyes

2024-01-14 19:58:39

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/14/24 16:20, Vincent Guittot wrote:
> On Sun, 14 Jan 2024 at 16:12, Qais Yousef <[email protected]> wrote:
> >
> > On 01/14/24 14:03, Vincent Guittot wrote:
> >
> > > Thanks for the trace. It was really helpful and I think that I got the
> > > root cause.
> > >
> > > The problem comes from get_capacity_ref_freq() which returns current
> > > freq when arch_scale_freq_invariant() is not enable, and the fact that
> > > we apply map_util_perf() earlier in the path now which is then capped
> > > by max capacity.
> > >
> > > Could you try the below ?
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index e420e2ee1a10..611c621543f4 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -133,7 +133,7 @@ unsigned long get_capacity_ref_freq(struct
> > > cpufreq_policy *policy)
> > > if (arch_scale_freq_invariant())
> > > return policy->cpuinfo.max_freq;
> > >
> > > - return policy->cur;
> > > + return policy->cur + policy->cur >> 2;
> > > }
> > >
> > > /**
> >
> > Is this a test patch or a proper fix? I can't see it being the latter. It seems
>
> It's a proper fix. It's the same mechanism that is used already:
> - Either you add margin on the utilization, to go above the current freq
> before it is fully used. This is what was done previously.
> - Or you add margin on the freq range, to select a higher freq than the
> current one before it becomes fully used.

Aren't we applying the 25% headroom twice then?

>
> > the current logic fails when util is already 1024, and I think we're trying
> > to fix the invariance issue too late.
> >
> > Is the problem that we can't read policy->cur in the scheduler to fix the
> > util while it's being updated, and that's why it's done here in this case?
> >
> > If this is the problem, shouldn't the logic be: if util is max, then always
> > go to max frequency? I don't think we have enough info to correct the
> > invariance here, IIUC. All we can see is that the system is saturated at
> > this frequency, and whether a small jump or a big jump is required is hard
> > to tell.
> >
> > Something like this
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..473d0352030b 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -164,8 +164,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > struct cpufreq_policy *policy = sg_policy->policy;
> > unsigned int freq;
> >
> > - freq = get_capacity_ref_freq(policy);
> > - freq = map_util_freq(util, freq, max);
> > + if (util != max) {
> > + freq = get_capacity_ref_freq(policy);
> > + freq = map_util_freq(util, freq, max);
> > + } else {
> > + freq = policy->cpuinfo.max_freq;
> > + }
>
> This is not correct, because you will have to wait to reach full
> utilization at the current OPP, possibly the lowest OPP, before moving
> directly to the max OPP

Isn't this already the case? The ratio ((util+headroom)/max) will be less than
1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
will request a frequency smaller than/equal to policy->cur, no?

ie:

util = 600
max = 1024

freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur

((util+headroom)/max) must be greater than 1 for us to start going above
policy->cur - which seems to have been working by accident, IIUC.

So yes my proposal is incorrect, but it seems the conversion is not right to me
now.

I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system

2.2GHz, 2.8GHz and 3.8GHz

which (I believe) translates into capacities

~592, ~754, 1024

which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
translates into util = ~473.

But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
attached picture), which is what you'd expect since we apply two headrooms now,
which means the ratio ((util+headroom)/max) will be greater than 1 once we go
above this value

1024 * 0.8 * 0.8 = ~655

So I think the math makes sense logically, but we're missing some other
correction factor.

When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
away. My test is simple busyloop via

cat /dev/zero > /dev/null

I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
causing us to be slightly above 523 and that's why we start with max freq.

Or I've done the math wrong :-) But the two don't behave the same for the same
kernel with and without CPPC.
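
A quick sketch of the arithmetic above, assuming capacity scales linearly
with frequency (OPP values as on this test box; plain userspace C, not
kernel code):

#include <stdio.h>

int main(void)
{
	/* the three OPPs described above, in kHz */
	unsigned long opp[] = { 2200000, 2800000, 3800000 };
	unsigned long fmax = 3800000;

	for (int i = 0; i < 3; i++) {
		/* capacity, assuming it scales linearly with frequency */
		unsigned long cap = opp[i] * 1024 / fmax;
		/* util at which util * 1.25 first exceeds this capacity */
		unsigned long thresh = cap * 4 / 5;

		printf("%lu kHz: cap ~%lu, crossed above util ~%lu\n",
		       opp[i], cap, thresh);
	}
	return 0;
}

This prints capacities 592/754/1024 and crossing points ~473 and ~603,
matching the numbers used above.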

>
> >
> > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > return sg_policy->next_freq;


Attachments:
cppc_freq_fix.png (27.14 kB)

2024-01-14 23:37:44

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/14/24 19:58, Qais Yousef wrote:

> [...]
>
> So I think the math makes sense logically, but we're missing some other
> correction factor.
>
> When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> away. My test is simple busyloop via
>
> cat /dev/zero > /dev/null
>
> I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> causing us to be slightly above 523 and that's why we start with max freq.
>
> Or I've done the math wrong :-) But the two don't behave the same for the same
> kernel with and without CPPC.

I think the relationship should be:

freq = util * f_curr / cap_curr

(patch below)

with that I see (almost) the expected behavior (picture attached). We go to
2.8GHz when we are above 500. But the move to 3.8GHz is a bit earlier at 581
(instead of 754 * 0.8 = 603). Not sure why. With 25% headroom 581 is 726. So
it's a tad too early.


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..155f96a44fa0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
* Return: the reference CPU frequency to compute a capacity.
*/
static __always_inline
-unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
+unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
+ unsigned long *max)
{
unsigned int freq = arch_scale_freq_ref(policy->cpu);

@@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
if (arch_scale_freq_invariant())
return policy->cpuinfo.max_freq;

+ if (max)
+ *max = policy->cur * (*max) / policy->cpuinfo.max_freq;
+
return policy->cur;
}

@@ -164,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq;

- freq = get_capacity_ref_freq(policy);
+ freq = get_capacity_ref_freq(policy, &max);
freq = map_util_freq(util, freq, max);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
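
FWIW, ignoring integer truncation, the change above reduces to mapping util
onto the full frequency range: with freq = policy->cur and
max' = policy->cur * max / policy->cpuinfo.max_freq, the subsequent
map_util_freq() gives

	freq_raw = policy->cur * util / max'
	         = util * policy->cpuinfo.max_freq / max

i.e. policy->cur cancels out, which is presumably why this tracks the
frequency-invariant behavior so closely - and why the question below about
policy->cpuinfo.max_freq matters.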


Attachments:
cppc_freq_fix2.png (30.13 kB)

2024-01-15 06:25:29

by Wyes Karny

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

Hi Qais,

On Mon, Jan 15, 2024 at 5:07 AM Qais Yousef <[email protected]> wrote:
>
> [...]
>
> I think the relationship should be:
>
> freq = util * f_curr / cap_curr

I guess to know the curr_cap correctly we need to know the max_freq,
which is not available when CPPC is disabled.

>
> (patch below)
>
> with that I see (almost) the expected behavior (picture attached). We go to
> 2.8GHz when we are above 500. But the move to 3.8GHz is a bit earlier at 581
> (instead of 754 * 0.8 = 603). Not sure why. With 25% headroom 581 is 726. So
> it's a tad too early.
>
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..155f96a44fa0 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
> * Return: the reference CPU frequency to compute a capacity.
> */
> static __always_inline
> -unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> +unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
> + unsigned long *max)
> {
> unsigned int freq = arch_scale_freq_ref(policy->cpu);
>
> @@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> if (arch_scale_freq_invariant())
> return policy->cpuinfo.max_freq;
>
> + if (max)
> + *max = policy->cur * (*max) / policy->cpuinfo.max_freq;

But when freq_invariant is disabled we don't have policy->cpuinfo.max_freq.

Thanks,
Wyes
> +
> return policy->cur;
> }
>
> @@ -164,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> struct cpufreq_policy *policy = sg_policy->policy;
> unsigned int freq;
>
> - freq = get_capacity_ref_freq(policy);
> + freq = get_capacity_ref_freq(policy, &max);
> freq = map_util_freq(util, freq, max);
>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)



--
Thanks & Regards
Wyes

2024-01-15 08:23:29

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Sun, 14 Jan 2024 at 20:58, Qais Yousef <[email protected]> wrote:
>
> On 01/14/24 16:20, Vincent Guittot wrote:
> > On Sun, 14 Jan 2024 at 16:12, Qais Yousef <[email protected]> wrote:
> > >
> > > On 01/14/24 14:03, Vincent Guittot wrote:
> > >
> > > > [...]
> > >
> > > Is this a test patch or a proper fix? I can't see it being the latter. It seems
> >
> > It's a proper fix. It's the same mechanism that is used already:
> > - Either you add margin on the utilization, to go above the current freq
> > before it is fully used. This is what was done previously.
> > - Or you add margin on the freq range, to select a higher freq than the
> > current one before it becomes fully used.
>
> Aren't we applying the 25% headroom twice then?

In some cases, yes. But whereas the 1st one is generic and can be
filtered with uclamp, the 2nd one is there to enable platforms without
frequency invariance to jump to the next OPP.
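
In plain numbers: the two 25% headrooms compound to 1.25 * 1.25 = 1.5625, so
on a non-invariant platform the next OPP ends up being requested once
util > 1024 / 1.5625 = ~655 - consistent with the 1024 * 0.8 * 0.8 = ~655
crossing observed above.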

>
> >
> > > the current logic fails when util is already 1024, and I think we're trying
> > > to fix the invariance issue too late.
> > >
> > > Is the problem that we can't read policy->cur in the scheduler to fix the
> > > util while it's being updated, and that's why it's done here in this case?
> > >
> > > If this is the problem, shouldn't the logic be: if util is max, then always
> > > go to max frequency? I don't think we have enough info to correct the
> > > invariance here, IIUC. All we can see is that the system is saturated at
> > > this frequency, and whether a small jump or a big jump is required is hard
> > > to tell.
> > >
> > > Something like this
> > >
> > > [...]
> >
> > This is not correct, because you will have to wait to reach full
> > utilization at the current OPP, possibly the lowest OPP, before moving
> > directly to the max OPP
>
> Isn't this already the case? The ratio ((util+headroom)/max) will be less than
> 1 until util is 80% (with 25% headroom). And for all values <= 80% * max, we
> will request a frequency smaller than/equal to policy->cur, no?

In your proposal, we stay at the current OPP until util == 1024, because
util can't be higher than max, and then we switch directly to max_freq.

In the previous behavior and with the fix, we can go to the next OPP
because either util can be higher than max (previous behavior) or the
selected freq can be higher than the current one (fix behavior), and we
don't have to wait for util == max.
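
Roughly, as a minimal userspace sketch of the three variants at saturation
(simplified: the real code also handles uclamp, caching and policy limits):

#include <stdio.h>

/* the 25% dvfs headroom: x + x/4 */
static unsigned long headroom(unsigned long x)
{
	return x + (x >> 2);
}

int main(void)
{
	unsigned long cur = 1400000;	/* current freq in kHz */
	unsigned long max = 1024;	/* SCHED_CAPACITY_SCALE */
	unsigned long util = 1024;	/* fully utilized at the current OPP */
	unsigned long clamped = headroom(util) > max ? max : headroom(util);

	/* previous behavior: headroom on util, which may exceed max */
	printf("old:       %lu\n", cur * headroom(util) / max);

	/* v6.8 before the fix: util is clamped, so freq never exceeds cur */
	printf("regressed: %lu\n", cur * clamped / max);

	/* the fix: headroom moved onto the reference freq instead */
	printf("fixed:     %lu\n", headroom(cur) * clamped / max);

	return 0;
}

old and fixed both escape the current OPP (1750000), while regressed stays
pinned at 1400000 - the stuck-at-low-frequency behavior reported above.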

>
> ie:
>
> util = 600
> max = 1024
>
> freq = 1.25 * 600 * policy->cur / 1024 = 0.73 * policy->cur
>
> ((util+headroom)/max) must be greater than 1 for us to start going above
> policy->cur - which seems to have been working by accident, IIUC.

No this is not by accident

>
> So yes my proposal is incorrect, but it seems the conversion is not right to me
> now.
>
> I could reproduce the problem now (thanks Wyes!). I have 3 freqs on my system
>
> 2.2GHz, 2.8GHz and 3.8GHz
>
> which (I believe) translates into capacities
>
> ~592, ~754, 1024
>
> which means we should pick 2.8GHz as soon as util * 1.25 > 592; which
> translates into util = ~473.

Keep in mind that util is the utilization at the current OPP, not at full
scale, because we don't have frequency invariance.

>
> But what I see is that we go to 2.8GHz when we jump from 650 to 680 (see
> attached picture), which is what you'd expect since we apply two headrooms now,
> which means the ratio ((util+headroom)/max) will be greater than 1 once we go
> above this value
>
> 1024 * 0.8 * 0.8 = ~655
>
> So I think the math makes sense logically, but we're missing some other
> correction factor.
>
> When I re-enable CPPC I see for the same test that we go into 3.8GHz straight
> away. My test is simple busyloop via
>
> cat /dev/zero > /dev/null
>
> I see the CPU util_avg is at 523 at fork. I expected us to run to 2.8GHz here
> to be honest, but I am not sure if util_cfs_boost() and util_est() are maybe
> causing us to be slightly above 523 and that's why we start with max freq.
>
> Or I've done the math wrong :-) But the two don't behave the same for the same
> kernel with and without CPPC.

They will never behave the same because they can't
- with invariance, the utilization is the utilization at max capacity
so we can easily jump several OPP to go directly to the right one
- without invariance, the utilization is the utilization at current
OPP so we can only jump to a limited number of OPP

>
> >
> > >
> > > if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> > > return sg_policy->next_freq;

2024-01-15 08:45:24

by David Laight

[permalink] [raw]
Subject: RE: [GIT PULL] Scheduler changes for v6.8

> My test is simple busyloop via
>
> cat /dev/zero > /dev/null

I usually use the userspace loop:
while :; do :; done

Quite good for testing whether the fans speed up!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-01-15 11:59:44

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

Hi Wyes

On 01/15/24 11:55, Wyes Karny wrote:

> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..155f96a44fa0 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -123,7 +123,8 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
> > * Return: the reference CPU frequency to compute a capacity.
> > */
> > static __always_inline
> > -unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> > +unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy,
> > + unsigned long *max)
> > {
> > unsigned int freq = arch_scale_freq_ref(policy->cpu);
> >
> > @@ -133,6 +134,9 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> > if (arch_scale_freq_invariant())
> > return policy->cpuinfo.max_freq;
> >
> > + if (max)
> > + *max = policy->cur * (*max) / policy->cpuinfo.max_freq;
>
> But when freq_invariant is disabled we don't have policy->cpuinfo.max_freq.

They are reported as 3.8GHz correctly for me

$ grep . /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_max_freq
/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy10/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy11/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy12/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy13/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy14/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy15/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy16/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy17/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy18/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy19/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy20/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy21/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy22/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy23/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy2/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy3/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy4/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy5/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy6/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy7/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy8/cpuinfo_max_freq:3800000
/sys/devices/system/cpu/cpufreq/policy9/cpuinfo_max_freq:3800000

2024-01-15 12:09:27

by Qais Yousef

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 01/15/24 09:21, Vincent Guittot wrote:

> > Or I've done the math wrong :-) But the two don't behave the same for the same
> > kernel with and without CPPC.
>
> They will never behave the same because they can't
> - with invariance, the utilization is the utilization at max capacity
> so we can easily jump several OPP to go directly to the right one
> - without invariance, the utilization is the utilization at current
> OPP so we can only jump to a limited number of OPP

I am probably missing some subtlety, but the behavior looks more sensible to
me when we divide by the current capacity instead of the max one.

It seems what you're saying is that the capacity range for each OPP is 0-1024.
And it's when we know that we've saturated the current capacity level that we
decide to move on.

As I am trying to remove the hardcoded headroom values I am wary of another
one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't
worry too much about it.


Cheers

--
Qais Yousef

2024-01-15 13:26:42

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Mon, 15 Jan 2024 at 13:09, Qais Yousef <[email protected]> wrote:
>
> On 01/15/24 09:21, Vincent Guittot wrote:
>
> > > Or I've done the math wrong :-) But the two don't behave the same for the same
> > > kernel with and without CPPC.
> >
> > They will never behave the same because they can't
> > - with invariance, the utilization is the utilization at max capacity
> > so we can easily jump several OPP to go directly to the right one
> > - without invariance, the utilization is the utilization at current
> > OPP so we can only jump to a limited number of OPP
>
> I am probably missing some subtlety, but the behavior looks more sensible to
> me when we divide by the current capacity instead of the max one.
>
> It seems what you're saying is that the capacity range for each OPP is 0-1024.

Yes that's the case when you don't have frequency invariance

> And it's when we know that we've saturated the current capacity level that we
> decide to move on.

yes

>
> As I am trying to remove the hardcoded headroom values I am wary of another
> one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't
> worry too much about it.
>
>
> Cheers
>
> --
> Qais Yousef

2024-01-15 14:03:44

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 15/01/2024 14:26, Vincent Guittot wrote:
> On Mon, 15 Jan 2024 at 13:09, Qais Yousef <[email protected]> wrote:
>>
>> On 01/15/24 09:21, Vincent Guittot wrote:
>>
>>>> Or I've done the math wrong :-) But the two don't behave the same for the same
>>>> kernel with and without CPPC.
>>>
>>> They will never behave the same because they can't
>>> - with invariance, the utilization is the utilization at max capacity
>>> so we can easily jump several OPP to go directly to the right one
>>> - without invariance, the utilization is the utilization at current
>>> OPP so we can only jump to a limited number of OPP
>>
>> I am probably missing some subtlety, but the behavior looks more sensible to
>> me when we divide by the current capacity instead of the max one.
>>
>> It seems what you're saying is that the capacity range for each OPP is 0-1024.
>
> Yes that's the case when you don't have frequency invariance
>
>> And it's when we know that we've saturated the current capacity level that we
>> decide to move on.
>
> yes
>
>>
>> As I am trying to remove the hardcoded headroom values I am wary of another
>> one. But it seems this is a bandaid scenario anyway; so maybe I shouldn't
>> worry too much about it.

I still don't fully understand this fix.

We had:

sugov_update_single_freq()
  sugov_update_single_common()
  next_f = get_next_freq()
    freq = arch_scale_freq_invariant() ?
           policy->cpuinfo.max_freq : policy->cur   (**) <- (2) !freq_inv
    util = map_util_perf(util);                          <- (1) util *= 1.25
    freq = map_util_freq(util, freq, max);               <- (3)



And now there is:

sugov_update_single_freq()
  sugov_update_single_common()
    sugov_get_util()
      sg_cpu->util = sugov_effective_cpu_perf()
        /* Add dvfs headroom to actual utilization */
        actual = map_util_perf(actual)               <- (1) util *= 1.25
  next_f = get_next_freq()
    freq = get_capacity_ref_freq()
      return policy->cur                (*)          <- (2) !freq_inv
    freq = map_util_freq(util, freq, max)            <- (3)

Still not clear to me why we need this extra 'policy->cur *= 1.25' here
(*) and not here (**)



2024-01-15 15:27:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On Mon, 15 Jan 2024 at 15:03, Dietmar Eggemann <[email protected]> wrote:
>
> [...]
>
> And now there is:
>
> sugov_update_single_freq()
>   sugov_update_single_common()
>     sugov_get_util()
>       sg_cpu->util = sugov_effective_cpu_perf()
>         /* Add dvfs headroom to actual utilization */
>         actual = map_util_perf(actual)               <- (1) util *= 1.25
>   next_f = get_next_freq()
>     freq = get_capacity_ref_freq()
>       return policy->cur                (*)          <- (2) !freq_inv
>     freq = map_util_freq(util, freq, max)            <- (3)
>
> Still not clear to me why we need this extra 'policy->cur *= 1.25' here
> (*) and not here (**)

Now, util can't be higher than max, to handle clamping use cases,
whereas it could be before. The jump to the next OPP was previously
done with util being higher than max, and it's now done with freq
being higher than policy->cur.

>
>

2024-01-15 20:05:36

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [GIT PULL] Scheduler changes for v6.8

On 15/01/2024 16:26, Vincent Guittot wrote:
> On Mon, 15 Jan 2024 at 15:03, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 15/01/2024 14:26, Vincent Guittot wrote:
>>> On Mon, 15 Jan 2024 at 13:09, Qais Yousef <[email protected]> wrote:
>>>>
>>>> On 01/15/24 09:21, Vincent Guittot wrote:

[...]

> Now, util can't be higher than max to handle clamping use cases
> whereas it could be the case before. The jump to next OPP was
> previously done with util being higher than max and it's now done with
> freq being higher than policy->cur

Ah, OK:

util = map_util_perf(util) <- (1) util *= 1.25

in the old code was doing this. And you do this now via frequency for
the !freq_inv case. I just saw that you already mention this in the
patch header of:
https://lkml.kernel.org/r/[email protected]